Marathon Data

The Inconsistency Problem

With duplicates gone, our marathon results have 50 rows. But the category column still has problems: 14 unique values when there should be 8. M30-39, m30-39, and M 30-39 are the same group entered three different ways. The .str accessor fixes that.


Remember the first few rows of our marathon results? The dnf column had 'N', 'No', and 'n' for what should be the same value. The category column has similar problems.

Let's see how bad it is. nunique() counts the distinct values, and unique() shows them all:
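A minimal sketch of that check. The five values are made up for illustration, not taken from the real marathon file, and the variable names are hypothetical:

```python
import pandas as pd

# Hypothetical sample: five spellings of what should be two groups.
categories = pd.Series(
    ["M30-39", "m30-39", "M 30-39", "F40-49", " F40-49"], name="category"
)

n_distinct = categories.nunique()      # how many distinct values there are
distinct_values = categories.unique()  # the values, in order of first appearance

print(n_distinct)             # 5
print(list(distinct_values))
```

Every spelling counts as its own value, which is why the count is higher than the number of real groups.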

The .str accessor exposes string methods for an entire column at once. Methods like .str.lower() apply to every value in the Series.

str.lower() converts every value to lowercase. Half the variation is gone in one call:
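As a rough illustration on made-up values (not the real marathon data), lowercasing merges every pair that differed only by case:

```python
import pandas as pd

# Hypothetical sample: case differences plus one internal space.
categories = pd.Series(["M30-39", "m30-39", "M 30-39", "F40-49", "f40-49"])

lowered = categories.str.lower()  # applied to every value in the Series

print(categories.nunique())  # 5
print(lowered.nunique())     # 3
```

The value with the internal space survives as its own variant, because lowercasing does not touch whitespace.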

str.strip() removes leading and trailing whitespace from each value:
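A small sketch with invented values shows the effect, including the one case it leaves alone:

```python
import pandas as pd

# Hypothetical sample: padding on the left, right, and both sides,
# plus one value with a space in the middle.
padded = pd.Series([" m30-39", "m30-39 ", "  m30-39  ", "m 30-39"])

stripped = padded.str.strip()  # trims whitespace at the edges only

print(stripped.tolist())  # ['m30-39', 'm30-39', 'm30-39', 'm 30-39']
```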

str.strip() only handles the edges. Internal spaces, like the one in 'M 30-39', need str.replace().

str.replace(old, new) replaces substrings across the whole column:
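One way this might look on made-up values. Passing regex=False is a choice made here so the space is treated as literal text regardless of pandas version:

```python
import pandas as pd

# Hypothetical sample: two values carry an internal space.
spaced = pd.Series(["m 30-39", "m30-39", "f 40-49"])

# Replace every space with the empty string, treating " " as plain text.
compact = spaced.str.replace(" ", "", regex=False)

print(compact.tolist())  # ['m30-39', 'm30-39', 'f40-49']
```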

Chain the methods together on the marathon categories, and the 14 variants collapse to 7:
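A sketch of the chained cleanup, assuming the variants differ only by case and whitespace. The six sample values are hypothetical, so the counts here will not match the real column:

```python
import pandas as pd

# Hypothetical sample: six spellings of two groups.
raw = pd.Series(["M30-39", "m30-39", "M 30-39", " F40-49", "f40-49 ", "F 40-49"])

clean = (
    raw.str.lower()                        # unify case
       .str.strip()                        # trim whitespace at the edges
       .str.replace(" ", "", regex=False)  # drop internal spaces
)

print(raw.nunique())    # 6
print(clean.nunique())  # 2
```

Each .str call returns a new Series, so the calls can be chained and the original column stays untouched until the result is assigned back.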

The dnf column in our marathon data has the same kind of mess. Six different values where two would do:
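One possible cleanup, sketched on invented values. Only 'N', 'No', and 'n' are known from the data; the 'yes' spellings are assumed for illustration. The approach also assumes every entry starts with y or n once lowercased:

```python
import pandas as pd

# Hypothetical sample: the three "yes" spellings are an assumption.
dnf = pd.Series(["N", "No", "n", "Y", "Yes", "y"])

# Trim, lowercase, then keep only the first character of each value.
dnf_clean = dnf.str.strip().str.lower().str[0]

print(dnf.nunique())         # 6
print(dnf_clean.tolist())    # ['n', 'n', 'n', 'y', 'y', 'y']
```

Keeping the first character is a shortcut that only works while the assumption holds. A value such as 'unknown' would need separate handling.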

Once the text is clean, you can use str.contains() to filter. Let's find all 50+ runners in our marathon data:
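A sketch of that filter on a made-up table. The runner names and the 'm50+' / 'f50+' labels are assumptions about how the cleaned categories look. Note that str.contains() treats its pattern as a regular expression by default, where + is a special character, so this example passes regex=False to match the literal text '50+':

```python
import pandas as pd

# Hypothetical cleaned results, not the real marathon data.
results = pd.DataFrame(
    {
        "runner": ["Ana", "Ben", "Caro", "Dev"],
        "category": ["m50+", "f30-39", "f50+", "m40-49"],
    }
)

# Boolean mask: True where the category contains the literal text "50+".
is_over_50 = results["category"].str.contains("50+", regex=False)
over_50 = results[is_over_50]

print(over_50["runner"].tolist())  # ['Ana', 'Caro']
```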
