Depending on your needs, you may use any of the 4 techniques below to randomly select columns from a Pandas DataFrame:
(1) Randomly select a single column:
df = df.sample(axis='columns')
(2) Randomly select a specified number of columns. For example, to select 3 random columns, set n=3:
df = df.sample(n=3,axis='columns')
(3) Allow a random selection of the same column more than once (by setting replace=True):
df = df.sample(n=3,axis='columns',replace=True)
(4) Randomly select a specified fraction of the total number of columns. For example, if you have 6 columns and you set frac=0.50, then 3 columns (50% of the total) will be randomly selected:
df = df.sample(frac=0.50,axis='columns')
In the next section, you’ll see how to apply each of the above cases in practice.
The Example
To begin with a simple example, let’s create a DataFrame with 6 columns that contain data about boxes:
import pandas as pd

boxes = {'Color': ['Blue','Blue','Green','Green','Green','Red','Red','Red'],
         'Shape': ['Square','Square','Square','Rectangle','Rectangle','Rectangle','Square','Rectangle'],
         'Material': ['Wood','Cardboard','Wood','Wood','Wood','Cardboard','Cardboard','Wood'],
         'Length': [15,25,25,15,15,15,20,25],
         'Width': [8,5,5,4,8,8,5,4],
         'Height': [30,35,35,40,30,35,40,40]
         }

df = pd.DataFrame(boxes, columns=['Color','Shape','Material','Length','Width','Height'])

print(df)
Run the code in Python, and you’ll get the following DataFrame:
   Color      Shape   Material  Length  Width  Height
0   Blue     Square       Wood      15      8      30
1   Blue     Square  Cardboard      25      5      35
2  Green     Square       Wood      25      5      35
3  Green  Rectangle       Wood      15      4      40
4  Green  Rectangle       Wood      15      8      30
5    Red  Rectangle  Cardboard      15      8      35
6    Red     Square  Cardboard      20      5      40
7    Red  Rectangle       Wood      25      4      40
The goal is to randomly select columns from the above DataFrame across 4 different cases.
4 Cases to Randomly Select Columns from Pandas DataFrame
Case 1: randomly select a single column
To randomly select a single column, simply add df = df.sample(axis='columns') to the code:
import pandas as pd

boxes = {'Color': ['Blue','Blue','Green','Green','Green','Red','Red','Red'],
         'Shape': ['Square','Square','Square','Rectangle','Rectangle','Rectangle','Square','Rectangle'],
         'Material': ['Wood','Cardboard','Wood','Wood','Wood','Cardboard','Cardboard','Wood'],
         'Length': [15,25,25,15,15,15,20,25],
         'Width': [8,5,5,4,8,8,5,4],
         'Height': [30,35,35,40,30,35,40,40]
         }

df = pd.DataFrame(boxes, columns=['Color','Shape','Material','Length','Width','Height'])

# Randomly select a single column (n defaults to 1)
df = df.sample(axis='columns')

print(df)
Run the code, and you’ll see that a single column was randomly selected (since the selection is random, you may get a different column than the one shown here):
   Length
0      15
1      25
2      25
3      15
4      15
5      15
6      20
7      25
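To get the same column on every run, df.sample() accepts a random_state parameter. The snippet below is a minimal sketch under assumptions that are not part of the example above: it uses a smaller DataFrame, an arbitrary seed of 42, and stores the results in new variables (first and second) instead of overwriting df. It also shows that the result is still a DataFrame with one column, not a Series.

```python
import pandas as pd

# Small illustrative DataFrame (not the full boxes data)
df = pd.DataFrame({
    'Color': ['Blue', 'Green', 'Red'],
    'Length': [15, 25, 20],
    'Width': [8, 5, 4],
})

# The same seed selects the same column on every call (42 is arbitrary)
first = df.sample(axis='columns', random_state=42)
second = df.sample(axis='columns', random_state=42)

print(list(first.columns) == list(second.columns))  # True
print(type(first).__name__)                         # DataFrame
print(first.shape)                                  # (3, 1)
```

Keeping the sample in a separate variable leaves the original df intact, which is handy if you plan to draw more than one sample from it.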
Case 2: randomly select a specified number of columns
Let’s suppose that you want to randomly select 3 columns from the DataFrame. In that case, you’ll need to set n=3:
import pandas as pd

boxes = {'Color': ['Blue','Blue','Green','Green','Green','Red','Red','Red'],
         'Shape': ['Square','Square','Square','Rectangle','Rectangle','Rectangle','Square','Rectangle'],
         'Material': ['Wood','Cardboard','Wood','Wood','Wood','Cardboard','Cardboard','Wood'],
         'Length': [15,25,25,15,15,15,20,25],
         'Width': [8,5,5,4,8,8,5,4],
         'Height': [30,35,35,40,30,35,40,40]
         }

df = pd.DataFrame(boxes, columns=['Color','Shape','Material','Length','Width','Height'])

# Randomly select 3 columns
df = df.sample(n=3, axis='columns')

print(df)
As you can see, 3 columns were randomly selected:
   Color  Width  Height
0   Blue      8      30
1   Blue      5      35
2  Green      5      35
3  Green      4      40
4  Green      8      30
5    Red      8      35
6    Red      5      40
7    Red      4      40
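Two properties of this case are worth checking for yourself: without replacement the selected columns are always distinct, and n cannot exceed the number of columns. The following is a rough sketch on a smaller, made-up DataFrame; the variable names subset and too_many_failed are illustrative only.

```python
import pandas as pd

# Small illustrative DataFrame with 4 columns
df = pd.DataFrame({
    'Color': ['Blue', 'Green', 'Red'],
    'Length': [15, 25, 20],
    'Width': [8, 5, 4],
    'Height': [30, 35, 40],
})

subset = df.sample(n=3, axis='columns')

# With the default replace=False, the 3 selected column names are all distinct
print(subset.columns.is_unique)  # True

# Asking for more columns than exist fails when replace is left at False
try:
    df.sample(n=5, axis='columns')
    too_many_failed = False
except ValueError:
    too_many_failed = True

print(too_many_failed)  # True
```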
Case 3: allow a random selection of the same column more than once
What if you want to allow the random selection of the same column more than once?
In such a case, you’ll need to set replace=True in the code:
import pandas as pd

boxes = {'Color': ['Blue','Blue','Green','Green','Green','Red','Red','Red'],
         'Shape': ['Square','Square','Square','Rectangle','Rectangle','Rectangle','Square','Rectangle'],
         'Material': ['Wood','Cardboard','Wood','Wood','Wood','Cardboard','Cardboard','Wood'],
         'Length': [15,25,25,15,15,15,20,25],
         'Width': [8,5,5,4,8,8,5,4],
         'Height': [30,35,35,40,30,35,40,40]
         }

df = pd.DataFrame(boxes, columns=['Color','Shape','Material','Length','Width','Height'])

# Randomly select 3 columns, allowing the same column to be picked more than once
df = df.sample(n=3, axis='columns', replace=True)

print(df)
As can be observed, the ‘Length’ column was randomly selected more than once:
   Color  Length  Length
0   Blue      15      15
1   Blue      25      25
2  Green      25      25
3  Green      15      15
4  Green      15      15
5    Red      15      15
6    Red      20      20
7    Red      25      25
Note that setting replace=True doesn’t guarantee that the same column will be selected more than once. It only makes a repeated selection possible.
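If you need to detect or remove repeated columns after sampling with replacement, one possible approach is sketched below. It makes a few assumptions for illustration: a smaller DataFrame with only 3 columns, and n=4 so that at least one repeat is certain (4 draws from 3 columns). The names sampled and deduped are hypothetical.

```python
import pandas as pd

# Small illustrative DataFrame with 3 columns
df = pd.DataFrame({
    'Color': ['Blue', 'Green', 'Red'],
    'Length': [15, 25, 20],
    'Width': [8, 5, 4],
})

# 4 draws from only 3 columns: allowed with replace=True,
# and at least one column name must repeat
sampled = df.sample(n=4, axis='columns', replace=True)

print(sampled.shape)                       # (3, 4)
print(sampled.columns.duplicated().any())  # True

# Keep only the first occurrence of each column name
deduped = sampled.loc[:, ~sampled.columns.duplicated()]

print(deduped.columns.is_unique)           # True
```

Duplicate column names can be awkward to work with later (for instance, selecting a repeated name returns more than one column), so dropping them may be useful when repeats are not actually wanted.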
Case 4: randomly select a specified fraction of the total number of columns
Suppose that you want to randomly select a specified fraction of the total number of columns.
For example, if you set frac=0.50, then 50% of the total number of columns will be selected (meaning that 3 columns, out of the total of 6 columns, will be randomly selected):
import pandas as pd

boxes = {'Color': ['Blue','Blue','Green','Green','Green','Red','Red','Red'],
         'Shape': ['Square','Square','Square','Rectangle','Rectangle','Rectangle','Square','Rectangle'],
         'Material': ['Wood','Cardboard','Wood','Wood','Wood','Cardboard','Cardboard','Wood'],
         'Length': [15,25,25,15,15,15,20,25],
         'Width': [8,5,5,4,8,8,5,4],
         'Height': [30,35,35,40,30,35,40,40]
         }

df = pd.DataFrame(boxes, columns=['Color','Shape','Material','Length','Width','Height'])

# Randomly select 50% of the columns
df = df.sample(frac=0.50, axis='columns')

print(df)
As you can see, 3 columns were indeed randomly selected:
    Material  Height      Shape
0       Wood      30     Square
1  Cardboard      35     Square
2       Wood      35     Square
3       Wood      40  Rectangle
4       Wood      30  Rectangle
5  Cardboard      35  Rectangle
6  Cardboard      40     Square
7       Wood      40  Rectangle
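A few related behaviors of frac can be confirmed with a short sketch. It assumes a reduced, 2-row version of the boxes data, and the variable names half, shuffled and both_failed are illustrative. It checks that frac=0.50 on 6 columns yields 3 columns, that frac=1 returns every column in a random order, and that n and frac cannot be passed together.

```python
import pandas as pd

# Reduced version of the boxes data: 6 columns, 2 rows
df = pd.DataFrame({
    'Color': ['Blue', 'Green'],
    'Shape': ['Square', 'Rectangle'],
    'Material': ['Wood', 'Cardboard'],
    'Length': [15, 25],
    'Width': [8, 5],
    'Height': [30, 35],
})

# 50% of 6 columns is 3 columns
half = df.sample(frac=0.50, axis='columns')
print(half.shape[1])  # 3

# frac=1 keeps all columns but returns them in a random order
shuffled = df.sample(frac=1, axis='columns')
print(sorted(shuffled.columns) == sorted(df.columns))  # True

# n and frac are mutually exclusive: passing both raises a ValueError
try:
    df.sample(n=2, frac=0.50, axis='columns')
    both_failed = False
except ValueError:
    both_failed = True

print(both_failed)  # True
```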
You can read more about df.sample() by visiting the Pandas Documentation.
Alternatively, you can check the following guide to learn how to randomly select rows from Pandas DataFrame.