Marathon Data

Getting the Data Ready

In the real world, datasets rarely arrive clean and ready to use. In this series, we work with data from a city marathon. There are two CSV files: one with race results (finish times, categories) and one with runner registrations (age, country, experience). The results file has columns like full_name and dnf that need renaming before anything else, and rename() and drop() get us started. First, though, let's explore both files before changing anything.

Start with the results file. How big is it?
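
The actual CSV isn't reproduced here, so here is a minimal sketch using a tiny made-up stand-in read from an in-memory string. Only runner_id, full_name, city, and dnf are column names taken from this lesson. The names category and finish_time, and all the row values, are assumptions for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for the results CSV (the real file has 55 rows).
# category and finish_time are assumed column names; the rows are invented.
results_csv = io.StringIO(
    "runner_id,full_name,category,finish_time,city,dnf\n"
    "101,Ana Ruiz,F30,03:41:12,Lisbon,N\n"
    "102,TOM BAKER,M40,04:02:55,Leeds,No\n"
    "103,Li Wei,M30,,Osaka,Y\n"
)
results = pd.read_csv(results_csv)

# shape is an attribute, not a method: (number_of_rows, number_of_columns)
print(results.shape)  # (3, 6)
```

With the real file, you would pass its path to pd.read_csv() instead of the in-memory string.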

55 rows and 6 columns. Let's look at the first few rows to see what's actually in there:
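
As a rough sketch of that step, head() can be tried on a small made-up frame. The rows below are invented to mimic the kind of inconsistencies described in this lesson, and they are not the real data.

```python
import pandas as pd

# Invented sample rows; only the column names come from the lesson.
results = pd.DataFrame({
    "runner_id": [101, 102, 103, 104],
    "full_name": ["Ana Ruiz", "TOM BAKER", "Li Wei", "Sara Holm"],
    "dnf": ["N", "No", "n", "Y"],
})

# head(n) returns the first n rows (5 if n is omitted)
first_rows = results.head(3)
print(first_rows)
```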

Already some things stand out. The dnf column has 'N' and 'No' and 'n' for what should be the same value. One name is in ALL CAPS. We'll deal with those later. For now, let's check the other file.

Load the registrations file and check its size:
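
The same checks work on the registrations data. This sketch uses an in-memory stand-in with invented rows. It has only the five columns this lesson names or describes (the real file has six), and the exact spellings of age, country, and experience as column names are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the registrations file (the real one has 52 rows).
registrations = pd.DataFrame({
    "runner_id": [101, 102],
    "name": ["Ana Ruiz", "Tom Baker"],
    "age": [34, 41],
    "country": ["Portugal", "UK"],
    "experience": ["first-timer", "veteran"],
})

print(registrations.shape)  # (2, 5)
print(registrations.head())
```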

52 rows and 6 columns: the same number of columns as the results file, but three fewer rows. Let's peek at the data:

Both files share a runner_id column. That's how we'll connect them later. But notice: the results file calls the name column full_name, while the registrations file uses name.

Compare the column names side by side:
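
One possible way to do the comparison, sketched on empty frames that carry only a subset of the column names mentioned in this lesson. The intersection() and difference() calls are an extra beyond simply printing both lists.

```python
import pandas as pd

# Empty frames with a subset of the lesson's column names (no data needed here).
results = pd.DataFrame(columns=["runner_id", "full_name", "city", "dnf"])
registrations = pd.DataFrame(columns=["runner_id", "name", "age", "country"])

print(list(results.columns))
print(list(registrations.columns))

# Columns present in both files, and columns found only in the results file
shared = results.columns.intersection(registrations.columns)
only_in_results = results.columns.difference(registrations.columns)
print(list(shared))
print(list(only_in_results))
```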

The results file has full_name, the registrations file has name, and both hold the runner's name. Before we can combine these files, the column names need to match. That's what rename() does.

rename() takes a dict mapping old names to new ones:
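
A minimal sketch of that call shape, using placeholder column names (old_name, new_name, and other are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"old_name": [1, 2], "other": [3, 4]})

# columns= takes a dict of {old_name: new_name}.
# rename() returns a new DataFrame and leaves df untouched.
renamed = df.rename(columns={"old_name": "new_name"})

print(list(renamed.columns))  # ['new_name', 'other']
print(list(df.columns))       # ['old_name', 'other']
```

Because the original is left unchanged, the result usually gets assigned back to a variable.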

Rename full_name to name in the results file so it matches the registrations:
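
Applied to a small stand-in for the results data (the rows are invented), the step might look like this:

```python
import pandas as pd

# Invented rows; full_name and runner_id are the column names from the lesson.
results = pd.DataFrame({
    "runner_id": [101, 102],
    "full_name": ["Ana Ruiz", "TOM BAKER"],
})

# Assign the result back so the renamed column sticks
results = results.rename(columns={"full_name": "name"})

print(list(results.columns))  # ['runner_id', 'name']
```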

You can rename multiple columns in a single call:
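
For instance, the dict can carry several old-to-new pairs at once. The new name did_not_finish is a hypothetical choice for illustration, and the rows are invented.

```python
import pandas as pd

results = pd.DataFrame({
    "runner_id": [101, 102],
    "full_name": ["Ana Ruiz", "TOM BAKER"],
    "dnf": ["N", "No"],
})

# One call, two renames; columns not listed in the dict keep their names.
results = results.rename(columns={
    "full_name": "name",
    "dnf": "did_not_finish",  # hypothetical target name
})

print(list(results.columns))  # ['runner_id', 'name', 'did_not_finish']
```

By default, dict keys that match no column are silently ignored, so a typo in an old name raises no error and simply leaves that column as it was.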

The results file also has a city column, but the registrations file already has the runner's location under country. We can remove columns we don't need with drop().

Syntax for drop():
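
A minimal sketch of the call shape, with placeholder column names (keep_me and drop_me are made up). Two equivalent spellings are shown:

```python
import pandas as pd

df = pd.DataFrame({"keep_me": [1, 2], "drop_me": [3, 4]})

# Form 1: name the columns to remove with columns=
dropped_a = df.drop(columns=["drop_me"])

# Form 2: pass the label(s) and say they refer to columns with axis=1
dropped_b = df.drop("drop_me", axis=1)

print(list(dropped_a.columns))  # ['keep_me']
print(list(dropped_b.columns))  # ['keep_me']
```

Like rename(), drop() returns a new DataFrame rather than modifying the original.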

Drop the city column from the results:
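
On a small stand-in for the results data (the rows are invented), that step might look like this:

```python
import pandas as pd

results = pd.DataFrame({
    "runner_id": [101, 102],
    "full_name": ["Ana Ruiz", "TOM BAKER"],
    "city": ["Lisbon", "Leeds"],
})

# Assign back so the results variable no longer carries city
results = results.drop(columns=["city"])

print(list(results.columns))  # ['runner_id', 'full_name']
```

Unlike rename(), drop() raises a KeyError by default if the named column doesn't exist, so running the same drop twice on the reassigned frame will fail the second time.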
