Marathon Data

One Runner, One Row

Our marathon results file has 55 rows, but only 50 runners competed. Some rows appear more than once, which is common when data passes through multiple systems. An average finish time means nothing if some runners are counted twice. duplicated() finds them; drop_duplicates() removes them.


In the previous lesson, we explored both marathon files and noticed the results file has 55 rows. But only 50 runners actually competed. That means some rows appear more than once. Before any analysis, we need to find and remove these duplicates.

duplicated() returns a boolean Series. True marks each row that is an exact copy of a previous row:

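As a minimal sketch, here is duplicated() on a small three-row DataFrame (the runner_id and name values are invented for illustration):

```python
import pandas as pd

# Small illustrative DataFrame: row 2 is an exact copy of row 0
df = pd.DataFrame({
    "runner_id": [101, 102, 101],
    "name": ["Ana", "Ben", "Ana"],
})

# duplicated() returns a boolean Series aligned with the rows:
# rows 0 and 1 come back False, row 2 comes back True
print(df.duplicated())
```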

duplicated() marks the second (and later) occurrence of identical rows as True. The first appearance stays False.

Since Python treats True as 1 and False as 0, calling .sum() on the result counts how many duplicates there are.

Let's try it on our small example. Two of the three rows are unique, one is a duplicate:

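A sketch with a hypothetical three-row frame, two unique rows and one exact copy:

```python
import pandas as pd

# Two unique rows plus one exact copy (invented values)
df = pd.DataFrame({
    "runner_id": [101, 102, 101],
    "name": ["Ana", "Ben", "Ana"],
})

# True counts as 1 and False as 0, so summing the boolean
# Series counts the duplicated rows
print(df.duplicated().sum())  # 1
```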


Let's check our marathon results. How many duplicates are hiding in there?

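The actual results file isn't included here, so this sketch builds a stand-in with the same shape: 50 unique runner rows plus 5 exact repeats, 55 rows in total. The column names are assumptions, not the real file's schema.

```python
import pandas as pd

# Stand-in for the marathon results file (hypothetical columns):
# 50 unique runner rows...
unique_rows = pd.DataFrame({
    "runner_id": range(1, 51),
    "finish_time_min": range(150, 200),
})
# ...plus 5 repeated rows, for 55 in total
results = pd.concat([unique_rows, unique_rows.head(5)], ignore_index=True)

print(len(results))                # 55
print(results.duplicated().sum())  # 5
```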

drop_duplicates() removes every row that duplicated() marks as True, keeping the first occurrence of each, and returns a new DataFrame:

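A sketch with the same kind of hypothetical three-row frame:

```python
import pandas as pd

# Two unique rows plus one exact copy (invented values)
df = pd.DataFrame({
    "runner_id": [101, 102, 101],
    "name": ["Ana", "Ben", "Ana"],
})

# drop_duplicates() keeps the first occurrence of each row and
# returns a new DataFrame; df itself is unchanged
deduped = df.drop_duplicates()
print(len(deduped))  # 2
print(len(df))       # 3
```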


Let's clean up the marathon results. After dropping duplicates, we should have 50 unique rows:

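The real file isn't available here, so this sketch rebuilds a stand-in frame (50 unique rows plus 5 exact repeats; column names are assumptions) and drops the duplicates:

```python
import pandas as pd

# Stand-in results frame: 50 unique rows plus 5 exact repeats
unique_rows = pd.DataFrame({
    "runner_id": range(1, 51),
    "finish_time_min": range(150, 200),
})
results = pd.concat([unique_rows, unique_rows.head(5)], ignore_index=True)

# Dropping the exact repeats leaves one row per runner
clean = results.drop_duplicates()
print(len(clean))  # 50
```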

The subset= parameter limits duplicate checking to specific columns:

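For instance (hypothetical data; the checkpoint column is invented), restricting the check to one column changes what counts as a duplicate:

```python
import pandas as pd

# No row is a full duplicate, but runner 101 appears twice
df = pd.DataFrame({
    "runner_id": [101, 102, 101],
    "checkpoint": ["10k", "10k", "20k"],
})

# Full-row check: every row differs somewhere, so nothing is flagged
print(df.duplicated().sum())                    # 0
# Checking runner_id only: the second 101 row is flagged
print(df.duplicated(subset="runner_id").sum())  # 1
```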

Use subset= to deduplicate by runner_id only, ignoring other column differences:

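A sketch of that, again pairing runner_id with an invented checkpoint column:

```python
import pandas as pd

# Runner 101 appears twice with different checkpoint values
df = pd.DataFrame({
    "runner_id": [101, 102, 101],
    "checkpoint": ["10k", "10k", "20k"],
})

# subset= compares only runner_id; the second 101 row is dropped
# even though its checkpoint value differs
by_runner = df.drop_duplicates(subset="runner_id")
print(by_runner["runner_id"].tolist())  # [101, 102]
```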
