How to Randomly Select Columns from Pandas DataFrame

Depending on your needs, you may use any of the 4 techniques below to randomly select columns from a Pandas DataFrame:

(1) Randomly select a single column:

df = df.sample(axis='columns')

(2) Randomly select a specified number of columns. For example, to select 3 random columns, set n=3:

df = df.sample(n=3,axis='columns')

(3) Allow a random selection of the same column more than once (by setting replace=True):

df = df.sample(n=3,axis='columns',replace=True)

(4) Randomly select a specified fraction of the total number of columns. For example, if you have 6 columns and you set frac=0.50, then 50% of the columns (that is, 3 columns) will be randomly selected:

df = df.sample(frac=0.50,axis='columns')

In the next section, you’ll see how to apply each of the above cases in practice.

The Example

To begin with a simple example, let’s create a DataFrame with 6 columns that contain data about boxes:

import pandas as pd

boxes = {'Color': ['Blue','Blue','Green','Green','Green','Red','Red','Red'],
         'Shape': ['Square','Square','Square','Rectangle','Rectangle','Rectangle','Square','Rectangle'],
      'Material': ['Wood','Cardboard','Wood','Wood','Wood','Cardboard','Cardboard','Wood'],
        'Length': [15,25,25,15,15,15,20,25],
         'Width': [8,5,5,4,8,8,5,4],
        'Height': [30,35,35,40,30,35,40,40]
        }

df = pd.DataFrame(boxes, columns = ['Color','Shape','Material','Length','Width','Height'])

print(df)

Run the code in Python, and you’ll get the following DataFrame:

   Color      Shape   Material  Length  Width  Height
0   Blue     Square       Wood      15      8      30
1   Blue     Square  Cardboard      25      5      35
2  Green     Square       Wood      25      5      35
3  Green  Rectangle       Wood      15      4      40
4  Green  Rectangle       Wood      15      8      30
5    Red  Rectangle  Cardboard      15      8      35
6    Red     Square  Cardboard      20      5      40
7    Red  Rectangle       Wood      25      4      40

The goal is to randomly select columns from the above DataFrame across 4 different cases.

4 Cases to Randomly Select Columns from Pandas DataFrame

Case 1: randomly select a single column

To randomly select a single column, simply add df = df.sample(axis='columns') to the code:

import pandas as pd

boxes = {'Color': ['Blue','Blue','Green','Green','Green','Red','Red','Red'],
         'Shape': ['Square','Square','Square','Rectangle','Rectangle','Rectangle','Square','Rectangle'],
      'Material': ['Wood','Cardboard','Wood','Wood','Wood','Cardboard','Cardboard','Wood'],
        'Length': [15,25,25,15,15,15,20,25],
         'Width': [8,5,5,4,8,8,5,4],
        'Height': [30,35,35,40,30,35,40,40]
        }

df = pd.DataFrame(boxes, columns = ['Color','Shape','Material','Length','Width','Height'])

df = df.sample(axis='columns')

print(df)

Run the code, and you’ll see that a single column was randomly selected:

   Length
0      15
1      25
2      25
3      15
4      15
5      15
6      20
7      25
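
Because the selection is random, you’ll likely get a different column each time you run the code. If you need a repeatable result, one option is the random_state parameter of df.sample(). The sketch below is an illustration rather than part of the original example; the seed value 42 and the names first and second are arbitrary choices:

```python
import pandas as pd

boxes = {'Color': ['Blue','Blue','Green','Green','Green','Red','Red','Red'],
         'Shape': ['Square','Square','Square','Rectangle','Rectangle','Rectangle','Square','Rectangle'],
         'Material': ['Wood','Cardboard','Wood','Wood','Wood','Cardboard','Cardboard','Wood'],
         'Length': [15,25,25,15,15,15,20,25],
         'Width': [8,5,5,4,8,8,5,4],
         'Height': [30,35,35,40,30,35,40,40]
         }

df = pd.DataFrame(boxes)

# Passing the same seed (42 is an arbitrary choice) makes the selection repeatable
first = df.sample(axis='columns', random_state=42)
second = df.sample(axis='columns', random_state=42)

# Both calls pick the same single column
print(list(first.columns))
print(list(first.columns) == list(second.columns))
```

Without random_state, each call draws a fresh random selection.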

Case 2: randomly select a specified number of columns

Let’s suppose that you want to randomly select 3 columns from the DataFrame. In that case, you’ll need to set n=3:

import pandas as pd

boxes = {'Color': ['Blue','Blue','Green','Green','Green','Red','Red','Red'],
         'Shape': ['Square','Square','Square','Rectangle','Rectangle','Rectangle','Square','Rectangle'],
      'Material': ['Wood','Cardboard','Wood','Wood','Wood','Cardboard','Cardboard','Wood'],
        'Length': [15,25,25,15,15,15,20,25],
         'Width': [8,5,5,4,8,8,5,4],
        'Height': [30,35,35,40,30,35,40,40]
        }

df = pd.DataFrame(boxes, columns = ['Color','Shape','Material','Length','Width','Height'])

df = df.sample(n=3,axis='columns')

print(df)

As you can see, 3 columns were randomly selected:

   Color  Width  Height
0   Blue      8      30
1   Blue      5      35
2  Green      5      35
3  Green      4      40
4  Green      8      30
5    Red      8      35
6    Red      5      40
7    Red      4      40
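
Note that assigning the result back to df replaces the original DataFrame. If you’d rather keep the original, a minimal sketch (the variable name sampled is just an illustration) is to store the selection under a different name and then inspect which columns were chosen:

```python
import pandas as pd

boxes = {'Color': ['Blue','Blue','Green','Green','Green','Red','Red','Red'],
         'Shape': ['Square','Square','Square','Rectangle','Rectangle','Rectangle','Square','Rectangle'],
         'Material': ['Wood','Cardboard','Wood','Wood','Wood','Cardboard','Cardboard','Wood'],
         'Length': [15,25,25,15,15,15,20,25],
         'Width': [8,5,5,4,8,8,5,4],
         'Height': [30,35,35,40,30,35,40,40]
         }

df = pd.DataFrame(boxes)

# Store the random selection under a new name so that df stays intact
sampled = df.sample(n=3, axis='columns')

# Names of the 3 randomly chosen columns (these vary from run to run)
print(list(sampled.columns))

# No column is repeated, since replace defaults to False
print(sampled.columns.is_unique)

# The original DataFrame still has all 6 columns
print(df.shape)
```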

Case 3: allow a random selection of the same column more than once

What if you want to allow the random selection of the same column more than once?

In such a case, you’ll need to set replace=True in the code:

import pandas as pd

boxes = {'Color': ['Blue','Blue','Green','Green','Green','Red','Red','Red'],
         'Shape': ['Square','Square','Square','Rectangle','Rectangle','Rectangle','Square','Rectangle'],
      'Material': ['Wood','Cardboard','Wood','Wood','Wood','Cardboard','Cardboard','Wood'],
        'Length': [15,25,25,15,15,15,20,25],
         'Width': [8,5,5,4,8,8,5,4],
        'Height': [30,35,35,40,30,35,40,40]
        }

df = pd.DataFrame(boxes, columns = ['Color','Shape','Material','Length','Width','Height'])

df = df.sample(n=3,axis='columns',replace=True)

print(df)

In this run, the ‘Length’ column was randomly selected twice:

   Color  Length  Length
0   Blue      15      15
1   Blue      25      25
2  Green      25      25
3  Green      15      15
4  Green      15      15
5    Red      15      15
6    Red      20      20
7    Red      25      25

Note that setting replace=True doesn’t guarantee that the same column will be selected more than once; it only makes repeated selections possible.
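
To get a feel for how often repeats occur, the sketch below counts how many draws of 3 columns contain a repeated column. It is an illustration, not part of the original example: the 200 trials and the seeds 0 to 199 are arbitrary choices. It also shows that, with replace=True, you can request more columns than the DataFrame has, whereas the same request raises a ValueError when replace is left at its default of False:

```python
import pandas as pd

boxes = {'Color': ['Blue','Blue','Green','Green','Green','Red','Red','Red'],
         'Shape': ['Square','Square','Square','Rectangle','Rectangle','Rectangle','Square','Rectangle'],
         'Material': ['Wood','Cardboard','Wood','Wood','Wood','Cardboard','Cardboard','Wood'],
         'Length': [15,25,25,15,15,15,20,25],
         'Width': [8,5,5,4,8,8,5,4],
         'Height': [30,35,35,40,30,35,40,40]
         }

df = pd.DataFrame(boxes)

# Count how many of the draws contain at least one repeated column
trials = 200
with_repeats = 0
for seed in range(trials):
    picked = df.sample(n=3, axis='columns', replace=True, random_state=seed)
    if not picked.columns.is_unique:
        with_repeats += 1

print(with_repeats, 'of', trials, 'draws contained a repeated column')

# With replacement, n may exceed the number of columns (6 here)
wide = df.sample(n=8, axis='columns', replace=True)
print(wide.shape)

# Without replacement, asking for more columns than exist raises an error
raised = False
try:
    df.sample(n=8, axis='columns')
except ValueError as err:
    raised = True
    print('ValueError:', err)
```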

Case 4: randomly select a specified fraction of the total number of columns

Suppose that you want to randomly select a specified fraction of the total number of columns.

For example, if you set frac=0.50, then 50% of the columns will be selected. Since the DataFrame has 6 columns, 3 of them will be randomly selected:

import pandas as pd

boxes = {'Color': ['Blue','Blue','Green','Green','Green','Red','Red','Red'],
         'Shape': ['Square','Square','Square','Rectangle','Rectangle','Rectangle','Square','Rectangle'],
      'Material': ['Wood','Cardboard','Wood','Wood','Wood','Cardboard','Cardboard','Wood'],
        'Length': [15,25,25,15,15,15,20,25],
         'Width': [8,5,5,4,8,8,5,4],
        'Height': [30,35,35,40,30,35,40,40]
        }

df = pd.DataFrame(boxes, columns = ['Color','Shape','Material','Length','Width','Height'])

df = df.sample(frac=0.50,axis='columns')

print(df)

As you can see, 3 columns were indeed randomly selected:

    Material  Height      Shape
0       Wood      30     Square
1  Cardboard      35     Square
2       Wood      35     Square
3       Wood      40  Rectangle
4       Wood      30  Rectangle
5  Cardboard      35  Rectangle
6  Cardboard      40     Square
7       Wood      40  Rectangle
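
The number of columns returned depends on frac multiplied by the total number of columns. The short sketch below (for illustration only; the variable names half and shuffled are hypothetical) checks the count for the 6-column DataFrame, and shows that frac=1.0 returns all of the columns in a random order:

```python
import pandas as pd

boxes = {'Color': ['Blue','Blue','Green','Green','Green','Red','Red','Red'],
         'Shape': ['Square','Square','Square','Rectangle','Rectangle','Rectangle','Square','Rectangle'],
         'Material': ['Wood','Cardboard','Wood','Wood','Wood','Cardboard','Cardboard','Wood'],
         'Length': [15,25,25,15,15,15,20,25],
         'Width': [8,5,5,4,8,8,5,4],
         'Height': [30,35,35,40,30,35,40,40]
         }

df = pd.DataFrame(boxes)

# 50% of 6 columns gives 3 columns
half = df.sample(frac=0.50, axis='columns')
print(half.shape[1])

# frac=1.0 keeps every column, but in a shuffled order
shuffled = df.sample(frac=1.0, axis='columns')
print(list(shuffled.columns))
print(sorted(shuffled.columns) == sorted(df.columns))
```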

You can read more about df.sample() by visiting the Pandas Documentation.

Alternatively, you can check the following guide to learn how to randomly select rows from Pandas DataFrame.