Getting the Data Ready
In reality, datasets rarely arrive with clean column names. In this series, we work with marathon race data: a results file with finish times and a separate registration file with runner details. The results file has columns like full_name and dnf that need renaming before anything else. rename() and drop() get us started.
The data comes from a city marathon and is split across two CSV files: one with race results (finish times, categories) and one with runner registrations (age, country, experience). Let's explore both before changing anything.
Start with the results file. How big is it?
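A minimal sketch of this check. The real file name is not given here, so the example reads a three-row stand-in from an in-memory string. The category and finish_time column names and every value are made up for illustration, which means the shape printed here is smaller than that of the actual file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the results CSV; the real file is larger
# and its exact column names may differ.
results_csv = io.StringIO(
    "runner_id,full_name,category,finish_time,dnf,city\n"
    "1,Ana Lima,F30,03:41:12,N,Porto\n"
    "2,BEN OKORO,M40,04:02:55,No,Lagos\n"
    "3,Chloe Martin,F20,,Y,Lyon\n"
)
results = pd.read_csv(results_csv)

# shape is a (rows, columns) tuple
print(results.shape)
```

With a file on disk, the same call would take a path instead of the StringIO object.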
55 rows and 6 columns. Let's look at the first few rows to see what's actually in there:
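A sketch of that peek with head(), using six made-up rows in place of the real file. The values are invented, but they mimic the kinds of inconsistency a first look tends to reveal.

```python
import pandas as pd

# Made-up rows standing in for the results file (illustration only)
results = pd.DataFrame({
    "runner_id": [1, 2, 3, 4, 5, 6],
    "full_name": ["Ana Lima", "BEN OKORO", "Chloe Martin",
                  "Dev Patel", "Eva Novak", "Finn Berg"],
    "dnf": ["N", "No", "n", "Y", "N", "n"],
    "city": ["Porto", "Lagos", "Lyon", "Pune", "Brno", "Oslo"],
})

# head() returns the first 5 rows by default; head(3) would return 3
preview = results.head()
print(preview)
```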
Already some things stand out. The dnf column uses 'N', 'No', and 'n' for what should be a single value, and one name is in ALL CAPS. We'll deal with those later. For now, let's check the other file.
Load the registrations file and check its size:
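The same pattern works for the second file. This sketch again reads a small in-memory stand-in; the shirt_size column and all values are hypothetical, so the printed shape does not match the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the registrations CSV
registrations_csv = io.StringIO(
    "runner_id,name,age,country,experience,shirt_size\n"
    "1,Ana Lima,34,Portugal,intermediate,M\n"
    "2,Ben Okoro,41,Nigeria,beginner,L\n"
    "3,Chloe Martin,27,France,advanced,S\n"
)
registrations = pd.read_csv(registrations_csv)

# (rows, columns)
print(registrations.shape)
```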
52 rows and, again, 6 columns. That is three fewer rows than the results file. Let's peek at the data:
Both files share a runner_id column. That's how we'll connect them later. But notice: the results file calls the name column full_name, while the registrations file uses name.
Compare the column names side by side:
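One way to do this comparison, sketched on empty frames that carry only column names. The column lists are abbreviated to the names mentioned in the text and are not the files' full headers.

```python
import pandas as pd

# Empty frames with illustrative column lists (not the full real headers)
results = pd.DataFrame(columns=["runner_id", "full_name", "dnf", "city"])
registrations = pd.DataFrame(columns=["runner_id", "name", "age", "country"])

print(list(results.columns))
print(list(registrations.columns))

# Columns present in both files, and columns found only in the results
shared = results.columns.intersection(registrations.columns)
only_in_results = results.columns.difference(registrations.columns)
print(list(shared))
print(list(only_in_results))
```

The intersection shows which columns could act as a join key; the difference lists candidates for renaming or dropping.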
The results file has full_name; the registrations file has name. Both refer to the same thing. Before we can combine these files, the column names need to match. That's what rename() does.
rename() takes a dict mapping old names to new ones:
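A minimal sketch of the general form, on a throwaway frame with placeholder column names. Note that rename() returns a new DataFrame by default and leaves the original untouched.

```python
import pandas as pd

# Placeholder frame; the column names are for illustration only
df = pd.DataFrame({"old_name": [1, 2], "other": [3, 4]})

# Keys are current column names, values are the new names
renamed = df.rename(columns={"old_name": "new_name"})

print(list(renamed.columns))  # ['new_name', 'other']
print(list(df.columns))       # ['old_name', 'other'] (original unchanged)
```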
Rename full_name to name in the results file so it matches the registrations:
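Sketched on a two-row stand-in for the results file (the rows are made up), the rename could look like this. Assigning the result back to results is what makes the change stick.

```python
import pandas as pd

# Made-up rows standing in for the results file
results = pd.DataFrame({
    "runner_id": [1, 2],
    "full_name": ["Ana Lima", "BEN OKORO"],
    "dnf": ["N", "No"],
})

# Map the old column name to the new one and keep the returned frame
results = results.rename(columns={"full_name": "name"})

print(list(results.columns))  # ['runner_id', 'name', 'dnf']
```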
You can rename multiple columns in a single call:
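For instance, the dict can carry one entry per column. In this sketch the new name did_not_finish is a hypothetical choice, not one taken from the lesson, and the rows are made up.

```python
import pandas as pd

# Made-up rows standing in for the results file
results = pd.DataFrame({
    "runner_id": [1, 2],
    "full_name": ["Ana Lima", "BEN OKORO"],
    "dnf": ["N", "No"],
})

# One dict entry per column to rename; unlisted columns keep their names
results = results.rename(columns={
    "full_name": "name",
    "dnf": "did_not_finish",  # hypothetical new name
})

print(list(results.columns))  # ['runner_id', 'name', 'did_not_finish']
```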
The results file also has a city column, but the registrations file already has the runner's location under country. We can remove columns we don't need with drop().
Syntax for drop():
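A minimal sketch of the general form, on a throwaway frame with placeholder column names. Like rename(), drop() returns a new DataFrame by default.

```python
import pandas as pd

# Placeholder frame; the column names are for illustration only
df = pd.DataFrame({"keep_me": [1, 2], "unwanted": [3, 4], "also_unwanted": [5, 6]})

# Pass a single name or a list of names via the columns keyword
trimmed = df.drop(columns=["unwanted", "also_unwanted"])

print(list(trimmed.columns))  # ['keep_me']
print(list(df.columns))       # original still has all three columns
```

The equivalent older spelling is df.drop(["unwanted", "also_unwanted"], axis=1).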
Drop the city column from the results:
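Applied to a two-row stand-in for the results file (the rows are made up), that step could be sketched as:

```python
import pandas as pd

# Made-up rows standing in for the results file
results = pd.DataFrame({
    "runner_id": [1, 2],
    "name": ["Ana Lima", "BEN OKORO"],
    "city": ["Porto", "Lagos"],
})

# Remove the city column and keep the returned frame
results = results.drop(columns=["city"])

print(list(results.columns))  # ['runner_id', 'name']
```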