In this article, we’ll look into the fastest way to convert integers to strings in a Pandas DataFrame.
The approaches that will be measured are:
(1) map(str)
df['DataFrame Column'] = df['DataFrame Column'].map(str)
(2) apply(str)
df['DataFrame Column'] = df['DataFrame Column'].apply(str)
(3) astype(str)
df['DataFrame Column'] = df['DataFrame Column'].astype(str)
(4) values.astype(str)
df['DataFrame Column'] = df['DataFrame Column'].values.astype(str)
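Before timing anything, it can help to confirm that the four approaches produce the same strings. The snippet below is a minimal sketch on a tiny DataFrame; the column name 'Numbers' and the sample values are made up for illustration. Note that values.astype(str) returns a NumPy array rather than a Series, so it is wrapped in pd.Series here only to make the comparison uniform.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; 'Numbers' is a hypothetical column name.
df = pd.DataFrame({'Numbers': [10, 25, 99]})

# Apply each of the four approaches to the same integer column.
results = {
    'map(str)': df['Numbers'].map(str),
    'apply(str)': df['Numbers'].apply(str),
    'astype(str)': df['Numbers'].astype(str),
    # .values.astype(str) yields a NumPy array, wrapped here for comparison.
    'values.astype(str)': pd.Series(df['Numbers'].values.astype(str)),
}

# Print the converted values produced by each approach.
for name, converted in results.items():
    print(name, converted.tolist())
```

If every line prints the same list of strings, the approaches are interchangeable for this data and only their speed differs.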
The Experiment
For the experiment, we’ll use Numpy, where:
- 5 million random integers will be created
- Each integer will fall within the range of 10 to 99 (to keep each integer to two digits)
We will then use the time package to measure which approach is the fastest way to convert the integers to strings in Pandas DataFrame.
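The setup described above can be sketched as follows. This is an illustration under a few assumptions rather than the exact code used for the measurements: the helper name time_conversion is hypothetical, the sample is smaller than 5 million so it runs quickly, and time.perf_counter() is swapped in for time.time() because it is designed for measuring short durations. Keep in mind that np.random.randint excludes the upper bound, so an upper bound of 100 is needed for 99 to be a possible value.

```python
import time

import numpy as np
import pandas as pd


def time_conversion(convert, series):
    """Run convert(series) once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = convert(series)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Two-digit integers only: the upper bound (100) is exclusive.
numbers = pd.Series(np.random.randint(10, 100, size=100_000))

converted, seconds = time_conversion(lambda s: s.map(str), numbers)
print('Execution time in seconds using map(str): ' + str(seconds))
```

The same helper can be called with any of the other three conversions in place of the lambda.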
Note that the results may vary depending on the versions of Python, Pandas, and Numpy that you’re using, as well as your computer. For this experiment, we’ll use:
- Python version: 3.7.2
- Pandas version: 0.24.1
- Numpy version: 1.16.2
You may apply the following code in order to check the versions on your computer:
import pandas as pd
import numpy as np
import sys

print('Python Version: ' + sys.version)
print('Pandas Version: ' + pd.__version__)
print('Numpy Version: ' + np.version.version)
In my case, the output matched the versions listed above.
Which Approach is the Fastest Way to Convert Integers to Strings in Pandas DataFrame?
So which approach is really the fastest?
Let’s find out by running the code below:
import pandas as pd
import numpy as np
import time

# One DataFrame per approach, each with 5 million random two-digit integers.
# np.random.randint excludes the upper bound, so 100 is needed to include 99.
df_1 = pd.DataFrame(np.random.randint(10, 100, size=(5000000, 1)), columns=['Random numbers'])
df_2 = pd.DataFrame(np.random.randint(10, 100, size=(5000000, 1)), columns=['Random numbers'])
df_3 = pd.DataFrame(np.random.randint(10, 100, size=(5000000, 1)), columns=['Random numbers'])
df_4 = pd.DataFrame(np.random.randint(10, 100, size=(5000000, 1)), columns=['Random numbers'])

# (1) map(str)
start_time_1 = time.time()
df_1['Random numbers'] = df_1['Random numbers'].map(str)
execution_time_1 = (time.time() - start_time_1)
print('Execution time in seconds using map(str): ' + str(execution_time_1))

# (2) apply(str)
start_time_2 = time.time()
df_2['Random numbers'] = df_2['Random numbers'].apply(str)
execution_time_2 = (time.time() - start_time_2)
print('Execution time in seconds using apply(str): ' + str(execution_time_2))

# (3) astype(str)
start_time_3 = time.time()
df_3['Random numbers'] = df_3['Random numbers'].astype(str)
execution_time_3 = (time.time() - start_time_3)
print('Execution time in seconds using astype(str): ' + str(execution_time_3))

# (4) values.astype(str)
start_time_4 = time.time()
df_4['Random numbers'] = df_4['Random numbers'].values.astype(str)
execution_time_4 = (time.time() - start_time_4)
print('Execution time in seconds using values.astype(str): ' + str(execution_time_4))
Based on our experiment (and considering the versions used), the fastest way to convert integers to strings in a Pandas DataFrame is apply(str), while map(str) is a close second.
I then ran the code using more recent versions of Python, Pandas and Numpy and got similar results.
To take things further, I ran the code below in Anaconda Spyder (where the versions are different):
import pandas as pd
import numpy as np
import sys
import time

print('Python Version: ' + sys.version)
print('Pandas Version: ' + pd.__version__)
print('Numpy Version: ' + np.version.version)

# One DataFrame per approach, each with 5 million random two-digit integers.
# np.random.randint excludes the upper bound, so 100 is needed to include 99.
df_1 = pd.DataFrame(np.random.randint(10, 100, size=(5000000, 1)), columns=['Random numbers'])
df_2 = pd.DataFrame(np.random.randint(10, 100, size=(5000000, 1)), columns=['Random numbers'])
df_3 = pd.DataFrame(np.random.randint(10, 100, size=(5000000, 1)), columns=['Random numbers'])
df_4 = pd.DataFrame(np.random.randint(10, 100, size=(5000000, 1)), columns=['Random numbers'])

# (1) map(str)
start_time_1 = time.time()
df_1['Random numbers'] = df_1['Random numbers'].map(str)
execution_time_1 = (time.time() - start_time_1)
print('Execution time in seconds using map(str): ' + str(execution_time_1))

# (2) apply(str)
start_time_2 = time.time()
df_2['Random numbers'] = df_2['Random numbers'].apply(str)
execution_time_2 = (time.time() - start_time_2)
print('Execution time in seconds using apply(str): ' + str(execution_time_2))

# (3) astype(str)
start_time_3 = time.time()
df_3['Random numbers'] = df_3['Random numbers'].astype(str)
execution_time_3 = (time.time() - start_time_3)
print('Execution time in seconds using astype(str): ' + str(execution_time_3))

# (4) values.astype(str)
start_time_4 = time.time()
df_4['Random numbers'] = df_4['Random numbers'].values.astype(str)
execution_time_4 = (time.time() - start_time_4)
print('Execution time in seconds using values.astype(str): ' + str(execution_time_4))
The results in Anaconda are consistent: apply(str) is again slightly faster than map(str).
Conclusion
So which approach should you apply?
If speed is what you need, then you may consider either apply(str) or map(str).
Keep in mind that the timings also depend on additional factors, such as the installed versions of Python, Pandas, and Numpy, as well as the computer used.
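Since the ranking can shift between environments, one option is to rerun a compact benchmark on your own machine. The sketch below is an assumption-laden illustration, not the code behind the measurements in this article: it uses the standard library's timeit.repeat instead of time.time(), takes the best of three runs to reduce noise, and uses a smaller sample (200,000 integers) so it finishes quickly. The names approaches and best_times are made up for this example.

```python
import timeit

import numpy as np
import pandas as pd

# Two-digit integers; the upper bound (100) is exclusive.
numbers = pd.Series(np.random.randint(10, 100, size=200_000))

# The four conversions compared in this article.
approaches = {
    'map(str)': lambda s: s.map(str),
    'apply(str)': lambda s: s.apply(str),
    'astype(str)': lambda s: s.astype(str),
    'values.astype(str)': lambda s: s.values.astype(str),
}

# Time each approach three times and keep the fastest run.
best_times = {}
for name, convert in approaches.items():
    runs = timeit.repeat(lambda: convert(numbers), number=1, repeat=3)
    best_times[name] = min(runs)

# Print the approaches from fastest to slowest on this machine.
for name, seconds in sorted(best_times.items(), key=lambda item: item[1]):
    print(f'{name}: {seconds:.4f} seconds')
```

Increasing the sample size toward 5 million brings the test closer to the experiment above, at the cost of a longer run.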
You may also want to check the following guide for the complete steps to convert integers to strings in your DataFrame.