Here are 4 ways to randomly select columns in a Pandas DataFrame:
(1) Randomly select a single column:
df = df.sample(axis="columns")
(2) Randomly select a specified number of columns. For example, to select 3 random columns, set n=3:
df = df.sample(n=3, axis="columns")
(3) Allow a random selection of the same column more than once (by setting replace=True):
df = df.sample(n=3, axis="columns", replace=True)
(4) Randomly select a specified fraction of the total number of columns. For example, setting frac=0.50 returns a random selection of 50% of the columns:
df = df.sample(frac=0.50, axis="columns")
The Example
To begin with a simple example, create a DataFrame with 6 columns that contain data about boxes:
import pandas as pd

# Data about boxes: 3 descriptive columns and 3 numeric dimensions
data = {
    "Color": ["Blue", "Blue", "Green", "Green", "Green", "Red", "Red", "Red"],
    "Shape": ["Square", "Square", "Square", "Rectangle", "Rectangle", "Rectangle", "Square", "Rectangle"],
    "Material": ["Wood", "Cardboard", "Wood", "Wood", "Wood", "Cardboard", "Cardboard", "Wood"],
    "Length": [15, 25, 25, 15, 15, 15, 20, 25],
    "Width": [8, 5, 5, 4, 8, 8, 5, 4],
    "Height": [30, 35, 35, 40, 30, 35, 40, 40],
}

df = pd.DataFrame(data)
print(df)
Run the code in Python, and you’ll get the following DataFrame:
   Color      Shape   Material  Length  Width  Height
0   Blue     Square       Wood      15      8      30
1   Blue     Square  Cardboard      25      5      35
2  Green     Square       Wood      25      5      35
3  Green  Rectangle       Wood      15      4      40
4  Green  Rectangle       Wood      15      8      30
5    Red  Rectangle  Cardboard      15      8      35
6    Red     Square  Cardboard      20      5      40
7    Red  Rectangle       Wood      25      4      40
The goal is to randomly select columns from the above DataFrame across 4 different cases.
4 Cases to Randomly Select Columns in a Pandas DataFrame
Case 1: randomly select a single column
To randomly select a single column, add df = df.sample(axis="columns") to the code:
import pandas as pd

data = {
    "Color": ["Blue", "Blue", "Green", "Green", "Green", "Red", "Red", "Red"],
    "Shape": ["Square", "Square", "Square", "Rectangle", "Rectangle", "Rectangle", "Square", "Rectangle"],
    "Material": ["Wood", "Cardboard", "Wood", "Wood", "Wood", "Cardboard", "Cardboard", "Wood"],
    "Length": [15, 25, 25, 15, 15, 15, 20, 25],
    "Width": [8, 5, 5, 4, 8, 8, 5, 4],
    "Height": [30, 35, 35, 40, 30, 35, 40, 40],
}

df = pd.DataFrame(data)

# Randomly select a single column (n defaults to 1)
df = df.sample(axis="columns")
print(df)
Run the code, and you’ll see that a single column was randomly selected (the column you get may differ, since the selection is random):
   Length
0      15
1      25
2      25
3      15
4      15
5      15
6      20
7      25
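To get the same selection on every run, one option is to pass a seed through the random_state parameter of df.sample(). The following is a minimal sketch rather than part of the original example: the seed value 42, the shortened boxes data, and the variable names first and second are arbitrary choices for illustration.

```python
import pandas as pd

# Shortened version of the boxes data (illustrative only)
data = {
    "Color": ["Blue", "Green", "Red"],
    "Shape": ["Square", "Rectangle", "Square"],
    "Length": [15, 25, 20],
    "Width": [8, 5, 4],
}
df = pd.DataFrame(data)

# The same seed (42 is an arbitrary choice) yields the same column on every call
first = df.sample(axis="columns", random_state=42)
second = df.sample(axis="columns", random_state=42)

print(first.columns.tolist())
print(second.columns.tolist())
```

Both print statements show the same column name. Which column that is depends on the seed you choose.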
Case 2: randomly select a specified number of columns
To randomly select 3 columns in the DataFrame, set n=3:
import pandas as pd

data = {
    "Color": ["Blue", "Blue", "Green", "Green", "Green", "Red", "Red", "Red"],
    "Shape": ["Square", "Square", "Square", "Rectangle", "Rectangle", "Rectangle", "Square", "Rectangle"],
    "Material": ["Wood", "Cardboard", "Wood", "Wood", "Wood", "Cardboard", "Cardboard", "Wood"],
    "Length": [15, 25, 25, 15, 15, 15, 20, 25],
    "Width": [8, 5, 5, 4, 8, 8, 5, 4],
    "Height": [30, 35, 35, 40, 30, 35, 40, 40],
}

df = pd.DataFrame(data)

# Randomly select 3 distinct columns
df = df.sample(n=3, axis="columns")
print(df)
As you can see, 3 columns were randomly selected:
   Color  Width  Height
0   Blue      8      30
1   Blue      5      35
2  Green      5      35
3  Green      4      40
4  Green      8      30
5    Red      8      35
6    Red      5      40
7    Red      4      40
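The sampled columns are returned in the order they were drawn, which may differ from their order in the original DataFrame. If you’d rather keep the original order, one possible approach (an addition here, not part of the original example) is to filter the original columns with df.columns.isin(). A minimal sketch, using a shortened version of the boxes data and the illustrative variable names sampled and ordered:

```python
import pandas as pd

# Shortened version of the boxes data (illustrative only)
data = {
    "Color": ["Blue", "Green", "Red"],
    "Shape": ["Square", "Rectangle", "Square"],
    "Material": ["Wood", "Cardboard", "Wood"],
    "Length": [15, 25, 20],
    "Width": [8, 5, 4],
    "Height": [30, 35, 40],
}
df = pd.DataFrame(data)

# Randomly pick 3 columns; they may come back in shuffled order
sampled = df.sample(n=3, axis="columns")

# Keep only the sampled columns, arranged as in the original DataFrame
ordered = df.loc[:, df.columns.isin(sampled.columns)]

print(sampled.columns.tolist())
print(ordered.columns.tolist())
```

This works here because the columns are sampled without replacement, so every sampled name appears exactly once.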
Case 3: allow a random selection of the same column more than once
What if you want to allow a random selection of the same column more than once?
In that case, set replace=True:
import pandas as pd

data = {
    "Color": ["Blue", "Blue", "Green", "Green", "Green", "Red", "Red", "Red"],
    "Shape": ["Square", "Square", "Square", "Rectangle", "Rectangle", "Rectangle", "Square", "Rectangle"],
    "Material": ["Wood", "Cardboard", "Wood", "Wood", "Wood", "Cardboard", "Cardboard", "Wood"],
    "Length": [15, 25, 25, 15, 15, 15, 20, 25],
    "Width": [8, 5, 5, 4, 8, 8, 5, 4],
    "Height": [30, 35, 35, 40, 30, 35, 40, 40],
}

df = pd.DataFrame(data)

# Randomly select 3 columns, allowing the same column to be picked again
df = df.sample(n=3, axis="columns", replace=True)
print(df)
In this run, the ‘Length’ column was randomly selected twice:
   Color  Length  Length
0   Blue      15      15
1   Blue      25      25
2  Green      25      25
3  Green      15      15
4  Green      15      15
5    Red      15      15
6    Red      20      20
7    Red      25      25
Note that setting replace=True only permits repeats; it doesn’t guarantee that any column will be selected more than once.
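One situation where a repeat is certain is when n is larger than the number of columns, which is only allowed with replace=True. The sketch below illustrates this with a small 3-column DataFrame (a cut-down, illustrative version of the boxes data) and checks for repeated names using columns.duplicated(). The variable names sampled and has_repeats are made up for this example.

```python
import pandas as pd

# Small illustrative DataFrame with 3 columns
data = {"Length": [15, 25], "Width": [8, 5], "Height": [30, 35]}
df = pd.DataFrame(data)

# With replace=True, n may exceed the number of columns (3 here),
# so at least one column must appear more than once
sampled = df.sample(n=5, axis="columns", replace=True)

# True when any column name appears more than once
has_repeats = sampled.columns.duplicated().any()

print(sampled.columns.tolist())
print(has_repeats)
```

Without replace=True, asking for more columns than the DataFrame has raises a ValueError.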
Case 4: randomly select a specified fraction of the total number of columns
For example, set frac=0.50 to randomly select 50% of the columns:
import pandas as pd

data = {
    "Color": ["Blue", "Blue", "Green", "Green", "Green", "Red", "Red", "Red"],
    "Shape": ["Square", "Square", "Square", "Rectangle", "Rectangle", "Rectangle", "Square", "Rectangle"],
    "Material": ["Wood", "Cardboard", "Wood", "Wood", "Wood", "Cardboard", "Cardboard", "Wood"],
    "Length": [15, 25, 25, 15, 15, 15, 20, 25],
    "Width": [8, 5, 5, 4, 8, 8, 5, 4],
    "Height": [30, 35, 35, 40, 30, 35, 40, 40],
}

df = pd.DataFrame(data)

# Randomly select 50% of the columns (3 out of 6)
df = df.sample(frac=0.50, axis="columns")
print(df)
As you can see, 3 columns (out of 6) were indeed randomly selected:
    Material  Height      Shape
0       Wood      30     Square
1  Cardboard      35     Square
2       Wood      35     Square
3       Wood      40  Rectangle
4       Wood      30  Rectangle
5  Cardboard      35  Rectangle
6  Cardboard      40     Square
7       Wood      40  Rectangle
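To confirm how many columns a given fraction returns, you can compare the column counts before and after sampling. This is a minimal sketch on a shortened version of the boxes data. The variable name sampled is illustrative, and only the count is checked, since the columns chosen vary from run to run.

```python
import pandas as pd

# Shortened version of the boxes data (illustrative only), still with 6 columns
data = {
    "Color": ["Blue", "Green"],
    "Shape": ["Square", "Rectangle"],
    "Material": ["Wood", "Cardboard"],
    "Length": [15, 25],
    "Width": [8, 5],
    "Height": [30, 35],
}
df = pd.DataFrame(data)

# Half of 6 columns is 3, so 3 columns are selected
sampled = df.sample(frac=0.50, axis="columns")

print(len(df.columns), len(sampled.columns))  # prints: 6 3
```

Use either n or frac in a single call, not both. Passing both raises a ValueError.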
You can read more about df.sample() by visiting the Pandas Documentation.
Alternatively, you can check the following guide to learn how to randomly select rows from a Pandas DataFrame.