Descriptive statistics of pandas DataFrames
In this lesson, you will learn how to get helpful statistical information about pandas datasets, such as the mean, maximum, and standard deviation. This will help you better understand your data and facilitate further data analysis.
So far, we haven't learned much about the data in our companies DataFrame.
However, the most important aspect of data analysis is understanding what the data is actually telling us. What can we learn from all these numbers and strings?
One important step is to get some descriptive statistics about the columns in your DataFrame.
A very important tool to get a quick overview is the DataFrame.describe()
method.
Let's apply DataFrame.describe()
to the 'Expenses (USD)' column of our DataFrame and see what happens:
import pandas as pd
df = pd.read_csv('/data/companies.csv')
print(df['Expenses (USD)'].describe())
We obtain various summary statistics about the column.
However, the numbers themselves are not very readable.
So, before we proceed, I'd like to show you a trick to control how floating-point numbers of pandas data frames are displayed.
Whenever I work with pandas, I include the following line of code immediately after importing the library:
pd.options.display.float_format = '{:,.2f}'.format
This makes all numerical outputs from pandas more readable.
Let's include this line and rerun our example from above:
import pandas as pd
pd.options.display.float_format = '{:,.2f}'.format
df = pd.read_csv('/data/companies.csv')
print(df['Expenses (USD)'].describe())
This output is much more helpful and tells us a lot about the expenses of the companies in our DataFrame.
Let's have a look at the information that we obtain from the DataFrame.describe()
method.
count
is the number of non-NA/null observations.
That means we have the expenses from 49 companies in our dataset. Since the entire dataset has 49 companies, there's no missing information.
The mean value indicates that the companies spent an average of 3,844,465.98
USD.
std
stands for standard deviation and shows how spread out the values are around the mean.
With a standard deviation of 1,199,195.19
, which is almost a third of the mean, there's quite a bit of variation among the different companies.
The minimum expenses a company in our data incurred are indicated by min
, the maximum by max
.
The 25%
, 50%
, and 75%
values are also known as the 1st, 2nd, and 3rd quartiles.
The 1st quartile, 25%
, tells us that 25% of all data points are less than or equal to this value.
The 2nd quartile, 50%
, indicates that half of the data points are below this value, and half are above it. This is also called the median.
The 3rd quartile, 75%
, indicates that 75% of the data points are less than or equal to this value.
While this information is very helpful, sometimes you might want to calculate only one of the summary statistics.
Luckily, there's a method for each of these values:
# get number of non-NA/null observations
df['Expenses (USD)'].count()
# get average value
df['Expenses (USD)'].mean()
# get standard deviation
df['Expenses (USD)'].std()
# get minimum value
df['Expenses (USD)'].min()
# get maximum value
df['Expenses (USD)'].max()
# get 25th percentile
df['Expenses (USD)'].quantile(0.25)
# get median
df['Expenses (USD)'].median()
# get 75th percentile
df['Expenses (USD)'].quantile(0.75)
Another very important method in the context of pandas DataFrames is value_counts()
.
value_counts()
returns the counts of unique values or rows in a specific column or the entire DataFrame.
For instance, here we obtain the counts of unique values in the 'Industry' column:
import pandas as pd
df = pd.read_csv('/data/companies.csv')
print(df['Industry'].value_counts())
To obtain all unique values of a column, you can use unique()
:
import pandas as pd
df = pd.read_csv('/data/companies.csv')
print(df['Industry'].unique())