One Runner, One Row
In the previous lesson, we explored both marathon files and noticed the results file has 55 rows, even though only 50 runners competed. That means some rows appear more than once, which is common when data passes through multiple systems. An average finish time means nothing if some runners are counted twice, so before any analysis we need to find and remove the duplicates. duplicated() finds them; drop_duplicates() removes them.
duplicated() returns a boolean Series: the first appearance of a row stays False, and every later exact copy of it is marked True:
Since Python treats True as 1 and False as 0, calling .sum() on the result counts how many duplicates there are:
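A minimal sketch of both ideas together, using made-up runner data (the column names runner_id and finish_time are assumptions, not the real file's schema):

```python
import pandas as pd

# Three rows; the third is an exact copy of the second.
df = pd.DataFrame({
    "runner_id": [101, 102, 102],
    "finish_time": ["3:41:05", "4:02:17", "4:02:17"],
})

# Boolean Series: only the repeated row is flagged True.
print(df.duplicated())

# True counts as 1, so summing the Series counts the duplicates.
print(df.duplicated().sum())  # 1
```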
Let's try it on our small example: two of the three rows are unique, and one is a duplicate.
What will be the output?
Let's check our marathon results. How many duplicates are hiding in there?
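Since the real file isn't loaded here, a synthetic stand-in with the same shape shows what to expect: 55 rows covering 50 runners means duplicated().sum() should report 5. The column names below are assumptions for illustration:

```python
import pandas as pd

# Hypothetical stand-in: 50 unique runners, with 5 rows
# accidentally repeated once each (55 rows total).
unique = pd.DataFrame({
    "runner_id": range(1, 51),
    "finish_minutes": [180 + i for i in range(50)],
})
results = pd.concat([unique, unique.head(5)], ignore_index=True)

print(len(results))                # 55
print(results.duplicated().sum())  # 5
```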
drop_duplicates() removes all rows marked as duplicates and returns a new DataFrame:
What will be the output?
Let's clean up the marathon results. After dropping duplicates, we should have 50 unique rows:
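Using the same kind of synthetic stand-in (the real file and its column names are assumptions here), the cleanup pattern looks like this; re-running duplicated().sum() afterwards confirms nothing is left:

```python
import pandas as pd

# Hypothetical stand-in: 50 unique runners plus 5 repeated rows.
unique = pd.DataFrame({
    "runner_id": range(1, 51),
    "finish_minutes": [180 + i for i in range(50)],
})
results = pd.concat([unique, unique.head(5)], ignore_index=True)

# drop_duplicates() returns a new DataFrame; results is unchanged.
clean = results.drop_duplicates()
print(len(results), len(clean))   # 55 50
print(clean.duplicated().sum())   # 0 -- the cleaned data has no repeats
```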
The subset= parameter limits duplicate checking to specific columns:
Use subset= to deduplicate by runner_id only, ignoring differences in the other columns:
What will be the output?
What will be the output?
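To close out, a small hedged sketch of subset= in action (column names assumed): two rows share a runner_id but differ in finish_time, so only subset-based deduplication catches the repeat:

```python
import pandas as pd

# The same runner recorded twice with slightly different times.
df = pd.DataFrame({
    "runner_id": [101, 101, 102],
    "finish_time": ["3:41:05", "3:41:06", "4:02:17"],
})

# No row is an exact copy, so full-row checking finds nothing.
print(df.duplicated().sum())  # 0

# Checking only runner_id keeps the first row per runner.
by_runner = df.drop_duplicates(subset="runner_id")
print(len(by_runner))  # 2
```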