Depending on your needs, you can use any of the 4 techniques below to randomly select columns from a Pandas DataFrame:
(1) Randomly select a single column:
df = df.sample(axis='columns')
(2) Randomly select a specified number of columns. For example, to select 3 random columns, set n=3:
df = df.sample(n=3, axis='columns')
(3) Allow a random selection of the same column more than once (by setting replace=True):
df = df.sample(n=3, axis='columns', replace=True)
(4) Randomly select a specified fraction of the total number of columns. For example, with 6 columns and frac=0.50, 50% of the columns (3 columns) will be randomly selected:
df = df.sample(frac=0.50, axis='columns')
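As a quick sanity check, the four calls above can be compared side by side by looking at how many columns each one returns. This is a minimal sketch, assuming a small made-up DataFrame: the column names A to F and the variable names one, three, repeats and half are hypothetical and are not part of the boxes example used in this guide.

```python
import pandas as pd

# Hypothetical 2-row DataFrame with 6 columns (A to F), for illustration only
df = pd.DataFrame({name: [1, 2] for name in 'ABCDEF'})

# (1) a single random column
one = df.sample(axis='columns')
# (2) 3 distinct random columns
three = df.sample(n=3, axis='columns')
# (3) 3 random columns, where the same column may be drawn more than once
repeats = df.sample(n=3, axis='columns', replace=True)
# (4) 50% of the 6 columns, i.e. 3 columns
half = df.sample(frac=0.50, axis='columns')

for label, result in [('single', one), ('n=3', three),
                      ('replace', repeats), ('frac', half)]:
    print(label, list(result.columns))
```

The results are stored in separate variables here (rather than reassigned to df) so that every call samples from the full set of 6 columns.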
In the next section, you’ll see how to apply each of the above cases in practice.
The Example
To begin with a simple example, let’s create a DataFrame with 6 columns that contain data about boxes:
import pandas as pd

boxes = {
    'Color': ['Blue', 'Blue', 'Green', 'Green', 'Green', 'Red', 'Red', 'Red'],
    'Shape': ['Square', 'Square', 'Square', 'Rectangle', 'Rectangle', 'Rectangle', 'Square', 'Rectangle'],
    'Material': ['Wood', 'Cardboard', 'Wood', 'Wood', 'Wood', 'Cardboard', 'Cardboard', 'Wood'],
    'Length': [15, 25, 25, 15, 15, 15, 20, 25],
    'Width': [8, 5, 5, 4, 8, 8, 5, 4],
    'Height': [30, 35, 35, 40, 30, 35, 40, 40],
}

df = pd.DataFrame(boxes, columns=['Color', 'Shape', 'Material', 'Length', 'Width', 'Height'])

print(df)
Run the code in Python, and you’ll get the following DataFrame:
   Color      Shape   Material  Length  Width  Height
0   Blue     Square       Wood      15      8      30
1   Blue     Square  Cardboard      25      5      35
2  Green     Square       Wood      25      5      35
3  Green  Rectangle       Wood      15      4      40
4  Green  Rectangle       Wood      15      8      30
5    Red  Rectangle  Cardboard      15      8      35
6    Red     Square  Cardboard      20      5      40
7    Red  Rectangle       Wood      25      4      40
The goal is to randomly select columns from the above DataFrame across 4 different cases.
4 Cases to Randomly Select Columns from Pandas DataFrame
Case 1: randomly select a single column
To randomly select a single column, simply add df = df.sample(axis='columns') to the code:
import pandas as pd

boxes = {
    'Color': ['Blue', 'Blue', 'Green', 'Green', 'Green', 'Red', 'Red', 'Red'],
    'Shape': ['Square', 'Square', 'Square', 'Rectangle', 'Rectangle', 'Rectangle', 'Square', 'Rectangle'],
    'Material': ['Wood', 'Cardboard', 'Wood', 'Wood', 'Wood', 'Cardboard', 'Cardboard', 'Wood'],
    'Length': [15, 25, 25, 15, 15, 15, 20, 25],
    'Width': [8, 5, 5, 4, 8, 8, 5, 4],
    'Height': [30, 35, 35, 40, 30, 35, 40, 40],
}

df = pd.DataFrame(boxes, columns=['Color', 'Shape', 'Material', 'Length', 'Width', 'Height'])

df = df.sample(axis='columns')

print(df)
Run the code, and you’ll see that a single column was randomly selected:
Case 2: randomly select a specified number of columns
Let’s suppose that you want to randomly select 3 columns from the DataFrame. In that case, you’ll need to set n=3:
import pandas as pd

boxes = {
    'Color': ['Blue', 'Blue', 'Green', 'Green', 'Green', 'Red', 'Red', 'Red'],
    'Shape': ['Square', 'Square', 'Square', 'Rectangle', 'Rectangle', 'Rectangle', 'Square', 'Rectangle'],
    'Material': ['Wood', 'Cardboard', 'Wood', 'Wood', 'Wood', 'Cardboard', 'Cardboard', 'Wood'],
    'Length': [15, 25, 25, 15, 15, 15, 20, 25],
    'Width': [8, 5, 5, 4, 8, 8, 5, 4],
    'Height': [30, 35, 35, 40, 30, 35, 40, 40],
}

df = pd.DataFrame(boxes, columns=['Color', 'Shape', 'Material', 'Length', 'Width', 'Height'])

df = df.sample(n=3, axis='columns')

print(df)
As you can see, 3 columns were randomly selected:
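Because the selection is random, the 3 columns you get will generally differ from one run to the next. If you need the same selection every time (for example, in a test), df.sample() also accepts a random_state parameter. The sketch below is only an illustration: the seed value 42 is an arbitrary choice, and the column names A to F are hypothetical.

```python
import pandas as pd

# Hypothetical 2-row DataFrame with 6 columns (A to F), for illustration only
df = pd.DataFrame({name: [1, 2] for name in 'ABCDEF'})

# random_state=42 is an arbitrary seed chosen for this sketch;
# using the same seed twice repeats the same random selection
first = df.sample(n=3, axis='columns', random_state=42)
second = df.sample(n=3, axis='columns', random_state=42)

print(list(first.columns))
print(list(first.columns) == list(second.columns))
```

Leaving random_state out restores the default behavior, where each call draws a fresh random selection.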
Case 3: allow a random selection of the same column more than once
What if you want to allow the random selection of the same column more than once?
In such a case, you’ll need to set replace=True in the code:
import pandas as pd

boxes = {
    'Color': ['Blue', 'Blue', 'Green', 'Green', 'Green', 'Red', 'Red', 'Red'],
    'Shape': ['Square', 'Square', 'Square', 'Rectangle', 'Rectangle', 'Rectangle', 'Square', 'Rectangle'],
    'Material': ['Wood', 'Cardboard', 'Wood', 'Wood', 'Wood', 'Cardboard', 'Cardboard', 'Wood'],
    'Length': [15, 25, 25, 15, 15, 15, 20, 25],
    'Width': [8, 5, 5, 4, 8, 8, 5, 4],
    'Height': [30, 35, 35, 40, 30, 35, 40, 40],
}

df = pd.DataFrame(boxes, columns=['Color', 'Shape', 'Material', 'Length', 'Width', 'Height'])

df = df.sample(n=3, axis='columns', replace=True)

print(df)
As can be observed, the ‘Length’ column was randomly selected more than once:
Note that setting replace=True doesn’t guarantee that you’ll get the random selection of the same column more than once.
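If you want to know whether a particular run actually drew the same column more than once, you can inspect the column labels of the result. This is a minimal sketch, assuming a made-up DataFrame with hypothetical column names A to F; the variable names sampled and has_repeats are also illustrative.

```python
import pandas as pd

# Hypothetical 2-row DataFrame with 6 columns (A to F), for illustration only
df = pd.DataFrame({name: [1, 2] for name in 'ABCDEF'})

sampled = df.sample(n=3, axis='columns', replace=True)

# columns.duplicated() flags each label that already appeared earlier,
# so any() is True only when at least one column was drawn again
has_repeats = sampled.columns.duplicated().any()

print(list(sampled.columns))
print('Same column drawn more than once:', has_repeats)
```

Running this several times should show both outcomes: some runs with 3 distinct columns, and some runs where a column appears twice or more.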
Case 4: randomly select a specified fraction of the total number of columns
Suppose that you want to randomly select a specified fraction of the total number of columns.
For example, if you set frac=0.50, then 50% of the total number of columns will be selected (meaning that 3 columns, out of the total of 6 columns, will be randomly selected):
import pandas as pd

boxes = {
    'Color': ['Blue', 'Blue', 'Green', 'Green', 'Green', 'Red', 'Red', 'Red'],
    'Shape': ['Square', 'Square', 'Square', 'Rectangle', 'Rectangle', 'Rectangle', 'Square', 'Rectangle'],
    'Material': ['Wood', 'Cardboard', 'Wood', 'Wood', 'Wood', 'Cardboard', 'Cardboard', 'Wood'],
    'Length': [15, 25, 25, 15, 15, 15, 20, 25],
    'Width': [8, 5, 5, 4, 8, 8, 5, 4],
    'Height': [30, 35, 35, 40, 30, 35, 40, 40],
}

df = pd.DataFrame(boxes, columns=['Color', 'Shape', 'Material', 'Length', 'Width', 'Height'])

df = df.sample(frac=0.50, axis='columns')

print(df)
As you can see, 3 columns were indeed randomly selected:
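The count of 3 follows directly from the fraction: 0.50 multiplied by 6 columns. The sketch below checks that relationship for two fractions. It assumes a made-up DataFrame with hypothetical column names A to F, and it only uses fractions that give a whole number of columns, so the counts do not depend on how pandas rounds.

```python
import pandas as pd

# Hypothetical 2-row DataFrame with 6 columns (A to F), for illustration only
df = pd.DataFrame({name: [1, 2] for name in 'ABCDEF'})
total = df.shape[1]

counts = {}
for frac in (0.50, 1.0):
    picked = df.sample(frac=frac, axis='columns')
    # shape[1] is the number of columns in the sampled DataFrame
    counts[frac] = picked.shape[1]
    print(frac, '->', counts[frac], 'of', total, 'columns')
```

With frac=1.0 every column is returned, but in a shuffled order, which can be handy when you want to randomize the column order rather than drop columns.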
You can read more about df.sample() by visiting the Pandas Documentation.
Alternatively, you can check the following guide to learn how to randomly select rows from a Pandas DataFrame.