Marathon Data

The Inconsistency Problem

With duplicates gone, our marathon results have 50 rows. But the category column still has problems: 14 unique values when there should be 8. 'M30-39', 'm30-39', and 'M 30-39' are the same group entered three different ways. The .str accessor fixes that.

Remember the first few rows of our marathon results? The dnf column had 'N', 'No', and 'n' for what should be the same value. The category column has similar problems.

Let's see how bad it is. nunique() counts the distinct values, and unique() shows them all:

import pandas as pd
 
df = pd.read_csv('/data/marathon_results.csv')
print(df['category'].nunique())
print(df['category'].unique())
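
The CSV itself is not reproduced here, so here is a small sketch on a made-up Series. The values in `sample` are assumptions chosen to mimic the problem, not rows from the real file. It also adds value_counts(), which shows how often each variant occurs:

```python
import pandas as pd

# Hypothetical stand-in for df['category']; not the real marathon data.
sample = pd.Series(
    ['M30-39', 'm30-39', 'M 30-39', 'M30-39', 'F18-29']
)

n_distinct = sample.nunique()        # how many distinct values
distinct = sample.unique().tolist()  # the distinct values, in order of first appearance
counts = sample.value_counts()       # how often each variant occurs

print(n_distinct)
print(distinct)
print(counts)
```

Three of the four distinct values here describe the same age group, which is exactly the kind of mess the rest of this lesson cleans up.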

The .str accessor exposes string methods for an entire column at once. Methods like .str.lower() apply to every value in the Series.

str.lower() converts every value to lowercase, so 'M30-39' and 'm30-39' become the same string in one call:

import pandas as pd
 
df = pd.DataFrame({
    'cat': ['M30-39', 'm30-39', 'M 30-39']
})
df['cat'] = df['cat'].str.lower()
print(df['cat'].tolist())
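
For intuition, .str.lower() behaves much like calling Python's own str.lower() on each element, except that missing values pass through as missing instead of raising an error. A minimal sketch, using a made-up Series with one missing entry:

```python
import pandas as pd

# Hypothetical values; the None stands in for a missing category.
s = pd.Series(['M30-39', None, 'F18-29'])

# Vectorized: one call handles the whole column.
vectorized = s.str.lower()

# Rough hand-written equivalent: lowercase strings, leave anything else alone.
manual = pd.Series(
    [v.lower() if isinstance(v, str) else v for v in s]
)

print(vectorized.tolist())
print(manual.tolist())
```

Calling .lower() directly on the missing entry would fail, which is one reason the accessor is the safer choice for real data.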

What will be the output?

import pandas as pd
 
s = pd.Series([
    'OSLO', 'Paris', 'berlin'
])
print(s.str.lower().tolist())

str.strip() removes leading and trailing whitespace from each value:

import pandas as pd
 
s = pd.Series([
    '  Alice  ', 'Bob', ' Carol'
])
print(s.str.strip().tolist())

str.strip() only handles edges. Internal spaces like in 'M 30-39' need str.replace().
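
To see that limitation directly, here is a short sketch on two made-up values. After stripping, the two strings still differ because the inner space survives:

```python
import pandas as pd

# Hypothetical values: the first has padding and an internal space.
s = pd.Series(['  M 30-39 ', 'M30-39'])

stripped = s.str.strip()

print(stripped.tolist())   # edges are clean, the inner space is still there
print(stripped.nunique())  # so the two values still count as different
```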

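str.replace(old, new) replaces every occurrence of a substring in each value of the column. Passing regex=False treats old as literal text rather than a regular expression: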

import pandas as pd
 
s = pd.Series([
    'm30-39', 'm 30-39', 'm30-39'
])
print(s.str.replace(' ', '', regex=False)
       .tolist())
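
The default for regex has changed between pandas versions (it became False in pandas 2.0), so passing it explicitly keeps the behavior predictable. The sketch below contrasts a literal replacement with a regular-expression one on made-up values. The tab and the double space are assumptions about what messy input might contain, not something seen in the marathon file:

```python
import pandas as pd

# Hypothetical values with different kinds of internal whitespace.
s = pd.Series(['m  30-39', 'm\t30-39', 'm 30-39'])

# Literal: removes space characters only, so the tab is left in place.
literal = s.str.replace(' ', '', regex=False)

# Regex: \s+ matches any run of whitespace, including tabs.
pattern = s.str.replace(r'\s+', '', regex=True)

print(literal.tolist())
print(pattern.tolist())
```

For the marathon categories a literal space is enough, which is why the examples in this lesson stick with regex=False.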

What will be the output?

import pandas as pd
 
s = pd.Series([
    'f 18-29', 'f18-29', 'f 40-49'
])
result = s.str.replace(' ', '', regex=False)
print(result.tolist())

Chain str.lower() and str.replace() on the marathon categories, and the 14 variants collapse to 7:

import pandas as pd
 
df = pd.read_csv('/data/marathon_results.csv')
df['category'] = (
    df['category']
    .str.lower()
    .str.replace(' ', '', regex=False)
)
print(df['category'].nunique())
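
If you want to reuse the same cleanup on several columns, one option is to wrap the chain in a small function. The sketch below does that on made-up data. The name normalize_category and the sample values are hypothetical, and the extra str.strip() step is an addition that also trims tabs or newlines at the edges:

```python
import pandas as pd

def normalize_category(col: pd.Series) -> pd.Series:
    """Lowercase, trim the edges, and drop internal spaces."""
    return (
        col.str.lower()
           .str.strip()
           .str.replace(' ', '', regex=False)
    )

# Hypothetical stand-in for df['category']; not the real marathon data.
raw = pd.Series(
    ['M30-39', 'm30-39', 'M 30-39', ' F18-29', 'f18-29']
)
clean = normalize_category(raw)

print(raw.nunique(), '->', clean.nunique())
print(clean.value_counts().to_dict())
```

Comparing nunique() before and after is a quick check that the cleanup merged the variants you expected.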

The dnf column in our marathon data has the same kind of mess. Six different values where two would do:

import pandas as pd
 
df = pd.read_csv('/data/marathon_results.csv')
df['dnf'] = df['dnf'].str.lower()
df['dnf'] = df['dnf'].str.replace(
    'yes', 'y', regex=False)
df['dnf'] = df['dnf'].str.replace(
    'no', 'n', regex=False)
print(sorted(df['dnf'].unique()))
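
One thing to keep in mind: str.replace() works on substrings, so it also rewrites a match inside a longer value. The sketch below shows the difference on made-up data ('unknown' is an assumed extra value, not one from the marathon file) and compares it with Series.replace(), which only swaps values that match in full:

```python
import pandas as pd

# Hypothetical dnf values; 'unknown' is an assumption for illustration.
s = pd.Series(['Y', 'No', 'unknown', 'yes', 'n'])
lowered = s.str.lower()

# Substring replacement also hits the 'no' inside 'unknown'.
by_substring = (
    lowered.str.replace('yes', 'y', regex=False)
           .str.replace('no', 'n', regex=False)
)

# Series.replace with a dict only changes whole-value matches.
by_value = lowered.replace({'yes': 'y', 'no': 'n'})

print(by_substring.tolist())
print(by_value.tolist())
```

With only yes/no style values in the column, both approaches give the same result. The whole-value version is the safer habit when other values might show up.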

What will be the output?

import pandas as pd
 
s = pd.Series([
    'Y', 'n', 'Yes', 'No', 'y'
])
s = s.str.lower()
s = s.str.replace('yes', 'y', regex=False)
s = s.str.replace('no', 'n', regex=False)
print(sorted(s.unique()))

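Once the text is clean, you can use str.contains() to filter. It returns True for every value that contains the given substring, and that boolean Series selects the matching rows. Let's find all 50+ runners in our marathon data: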

import pandas as pd
 
df = pd.read_csv('/data/marathon_results.csv')
df['category'] = df['category'].str.lower()
df['category'] = df['category'].str.replace(
    ' ', '', regex=False)
seniors = df[df['category'].str.contains('50')]
print(len(seniors))
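
By default str.contains() treats its pattern as a regular expression, where a character like + has a special meaning, and it returns a missing value for missing input. A minimal sketch on made-up values, assuming you want to search for the literal text '50+' and count missing categories as non-matches:

```python
import pandas as pd

# Hypothetical categories, one of them missing.
s = pd.Series(['m18-29', 'm50+', None, 'f50+'])

# regex=False: '+' is an ordinary character.
# na=False: missing values become False, so the mask is safe for filtering.
mask = s.str.contains('50+', regex=False, na=False)

print(mask.tolist())
print(s[mask].tolist())
```

Searching for plain '50', as in the example above, avoids the special character entirely, which works as long as no other category contains those digits.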

What will be the output?

import pandas as pd
 
s = pd.Series([
    'm18-29', 'm50+', 'f30-39', 'f50+'
])
print(s.str.contains('50').tolist())

What will be the output?

import pandas as pd
 
s = pd.Series([
    'alpha', 'beta', 'alphabet'
])
print(s.str.contains('alpha').tolist())
