Randomly Select Columns in Pandas DataFrame

Here are 4 ways to randomly select columns in a Pandas DataFrame:

(1) Randomly select a single column:

df = df.sample(axis="columns")

(2) Randomly select a specified number of columns. For example, to select 3 random columns, set n=3:

df = df.sample(n=3, axis="columns")

(3) Allow a random selection of the same column more than once (by setting replace=True):

df = df.sample(n=3, axis="columns", replace=True)

(4) Randomly select a specified fraction of the total number of columns (for example, if you set frac=0.50, then you’ll get a random selection of 50% of the total columns):

df = df.sample(frac=0.50, axis="columns")

The Example

To begin with a simple example, create a DataFrame with 6 columns that contain data about boxes:

import pandas as pd

data = {
    "Color": ["Blue", "Blue", "Green", "Green", "Green", "Red", "Red", "Red"],
    "Shape": ["Square", "Square", "Square", "Rectangle", "Rectangle", "Rectangle", "Square", "Rectangle"],
    "Material": ["Wood", "Cardboard", "Wood", "Wood", "Wood", "Cardboard", "Cardboard", "Wood"],
    "Length": [15, 25, 25, 15, 15, 15, 20, 25],
    "Width": [8, 5, 5, 4, 8, 8, 5, 4],
    "Height": [30, 35, 35, 40, 30, 35, 40, 40],
}

df = pd.DataFrame(data)

print(df)

Run the code in Python, and you’ll get the following DataFrame:

   Color      Shape   Material  Length  Width  Height
0   Blue     Square       Wood      15      8      30
1   Blue     Square  Cardboard      25      5      35
2  Green     Square       Wood      25      5      35
3  Green  Rectangle       Wood      15      4      40
4  Green  Rectangle       Wood      15      8      30
5    Red  Rectangle  Cardboard      15      8      35
6    Red     Square  Cardboard      20      5      40
7    Red  Rectangle       Wood      25      4      40

The goal is to randomly select columns from the above DataFrame across 4 different cases.

4 Cases to Randomly Select Columns in Pandas DataFrame

Case 1: randomly select a single column

To randomly select a single column, add df = df.sample(axis="columns") to the code:

import pandas as pd

data = {
    "Color": ["Blue", "Blue", "Green", "Green", "Green", "Red", "Red", "Red"],
    "Shape": ["Square", "Square", "Square", "Rectangle", "Rectangle", "Rectangle", "Square", "Rectangle"],
    "Material": ["Wood", "Cardboard", "Wood", "Wood", "Wood", "Cardboard", "Cardboard", "Wood"],
    "Length": [15, 25, 25, 15, 15, 15, 20, 25],
    "Width": [8, 5, 5, 4, 8, 8, 5, 4],
    "Height": [30, 35, 35, 40, 30, 35, 40, 40],
}

df = pd.DataFrame(data)

df = df.sample(axis="columns")

print(df)

Run the code, and you’ll see that a single column was randomly selected (since the selection is random, you may get a different column than the one shown here):

   Length
0      15
1      25
2      25
3      15
4      15
5      15
6      20
7      25
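
To get the same column on every run, one option is the random_state parameter of df.sample. The sketch below is only an illustration: it assumes a shortened version of the boxes DataFrame, and the seed value 42 is an arbitrary choice rather than anything special.

```python
import pandas as pd

data = {
    "Color": ["Blue", "Green", "Red"],
    "Shape": ["Square", "Rectangle", "Square"],
    "Length": [15, 25, 20],
    "Width": [8, 5, 4],
}
df = pd.DataFrame(data)

# The seed 42 is arbitrary; any fixed integer makes the draw repeatable
first = df.sample(axis="columns", random_state=42)
second = df.sample(axis="columns", random_state=42)

# Both draws pick the same single column
print(first.columns.tolist())
print(second.columns.tolist())
```

Leaving random_state out restores the default behavior, where each run may return a different column.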

Case 2: randomly select a specified number of columns

To randomly select 3 columns in the DataFrame, set n=3:

import pandas as pd

data = {
    "Color": ["Blue", "Blue", "Green", "Green", "Green", "Red", "Red", "Red"],
    "Shape": ["Square", "Square", "Square", "Rectangle", "Rectangle", "Rectangle", "Square", "Rectangle"],
    "Material": ["Wood", "Cardboard", "Wood", "Wood", "Wood", "Cardboard", "Cardboard", "Wood"],
    "Length": [15, 25, 25, 15, 15, 15, 20, 25],
    "Width": [8, 5, 5, 4, 8, 8, 5, 4],
    "Height": [30, 35, 35, 40, 30, 35, 40, 40],
}

df = pd.DataFrame(data)

df = df.sample(n=3, axis="columns")

print(df)

As you can see, 3 columns were randomly selected:

   Color  Width  Height
0   Blue      8      30
1   Blue      5      35
2  Green      5      35
3  Green      4      40
4  Green      8      30
5    Red      8      35
6    Red      5      40
7    Red      4      40
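
Two details are worth keeping in mind when setting n. Assigning the result back to df overwrites the original DataFrame, and without replace=True the value of n cannot exceed the number of columns. A minimal sketch, assuming a small 3-column DataFrame made up for illustration and a hypothetical variable name (subset) for the sampled result:

```python
import pandas as pd

df = pd.DataFrame({"Length": [15, 25], "Width": [8, 5], "Height": [30, 35]})

# Assign to a new name so the original DataFrame stays intact
subset = df.sample(n=2, axis="columns")
print(subset.shape)  # 2 rows, 2 columns
print(df.shape)      # the original still has 3 columns

# Asking for more columns than exist (without replace=True) raises ValueError
try:
    df.sample(n=4, axis="columns")
    too_many_failed = False
except ValueError as err:
    too_many_failed = True
    print(err)
```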

Case 3: allow a random selection of the same column more than once

What if you want to allow a random selection of the same column more than once?

In that case, set replace=True:

import pandas as pd

data = {
    "Color": ["Blue", "Blue", "Green", "Green", "Green", "Red", "Red", "Red"],
    "Shape": ["Square", "Square", "Square", "Rectangle", "Rectangle", "Rectangle", "Square", "Rectangle"],
    "Material": ["Wood", "Cardboard", "Wood", "Wood", "Wood", "Cardboard", "Cardboard", "Wood"],
    "Length": [15, 25, 25, 15, 15, 15, 20, 25],
    "Width": [8, 5, 5, 4, 8, 8, 5, 4],
    "Height": [30, 35, 35, 40, 30, 35, 40, 40],
}

df = pd.DataFrame(data)

df = df.sample(n=3, axis="columns", replace=True)

print(df)

As can be observed, the ‘Length’ column was randomly selected more than once:

   Color  Length  Length
0   Blue      15      15
1   Blue      25      25
2  Green      25      25
3  Green      15      15
4  Green      15      15
5    Red      15      15
6    Red      20      20
7    Red      25      25

Note that setting replace=True doesn’t guarantee that the same column will be selected more than once. It only makes repeats possible, so some runs may return 3 different columns.
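
One way to check whether a column was actually drawn more than once is to inspect the column labels of the result. The sketch below assumes a small 3-column DataFrame made up for illustration. It also requests more columns than the DataFrame has, which is allowed with replace=True and makes at least one repeat certain:

```python
import pandas as pd

df = pd.DataFrame({"Length": [15, 25], "Width": [8, 5], "Height": [30, 35]})

# With replacement, n can exceed the number of columns (3 here)
sampled = df.sample(n=5, axis="columns", replace=True)

# columns.duplicated() flags every repeat of an earlier label
has_repeats = sampled.columns.duplicated().any()
print(sampled.columns.tolist())
print(has_repeats)  # True, since 5 draws from 3 columns must repeat
```

Keep in mind that once a label is duplicated, selecting it by name (for example sampled["Length"]) returns a DataFrame with all matching columns rather than a single Series.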

Case 4: randomly select a specified fraction of the total number of columns

For example, set frac=0.50 (for 50% random selection):

import pandas as pd

data = {
    "Color": ["Blue", "Blue", "Green", "Green", "Green", "Red", "Red", "Red"],
    "Shape": ["Square", "Square", "Square", "Rectangle", "Rectangle", "Rectangle", "Square", "Rectangle"],
    "Material": ["Wood", "Cardboard", "Wood", "Wood", "Wood", "Cardboard", "Cardboard", "Wood"],
    "Length": [15, 25, 25, 15, 15, 15, 20, 25],
    "Width": [8, 5, 5, 4, 8, 8, 5, 4],
    "Height": [30, 35, 35, 40, 30, 35, 40, 40],
}

df = pd.DataFrame(data)

df = df.sample(frac=0.50, axis="columns")

print(df)

As you can see, 3 columns (out of 6) were indeed randomly selected:

    Material  Height      Shape
0       Wood      30     Square
1  Cardboard      35     Square
2       Wood      35     Square
3       Wood      40  Rectangle
4       Wood      30  Rectangle
5  Cardboard      35  Rectangle
6  Cardboard      40     Square
7       Wood      40  Rectangle
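
The number of columns returned follows from frac multiplied by the total number of columns (0.50 × 6 = 3 in this case). A short sketch, assuming the same six column names with a single row of values for brevity; it also shows that frac=1.0 returns every column in shuffled order:

```python
import pandas as pd

df = pd.DataFrame({
    "Color": ["Blue"], "Shape": ["Square"], "Material": ["Wood"],
    "Length": [15], "Width": [8], "Height": [30],
})

# Half of 6 columns gives 3 columns
half = df.sample(frac=0.50, axis="columns")
print(half.shape[1])  # 3

# frac=1.0 keeps all columns, in random order
shuffled = df.sample(frac=1.0, axis="columns")
print(sorted(shuffled.columns) == sorted(df.columns))  # True
```

When frac times the number of columns is not a whole number, the count is rounded, so it is worth checking the shape of the result if the exact number of columns matters.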

You can read more about df.sample() by visiting the Pandas Documentation.

Alternatively, see the guide on how to randomly select rows from a Pandas DataFrame.
