One Runner, One Row
In the previous lesson, we explored both marathon files and noticed the results file has 55 rows, even though only 50 runners competed. That means some rows appear more than once, which is common when data passes through multiple systems. An average finish time means nothing if some runners are counted twice, so before any analysis we need to find and remove the duplicates. duplicated() finds them; drop_duplicates() removes them.
duplicated() returns a boolean Series: the first appearance of a row stays False, and every later exact copy of it is marked True:
Since Python treats True as 1 and False as 0, calling .sum() on the result counts how many duplicates there are:
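A minimal sketch of both ideas together, using made-up runner data (the column names runner_id and finish_time are assumptions, not the real file's schema):

```python
import pandas as pd

# Three rows; the third is an exact copy of the second.
df = pd.DataFrame({
    "runner_id": [101, 102, 102],
    "finish_time": ["3:41:05", "4:02:17", "4:02:17"],
})

# Boolean Series: only the repeated row is flagged True.
print(df.duplicated())

# True counts as 1, so summing the Series counts the duplicates.
print(df.duplicated().sum())  # 1
```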
Let's try it on our small example: two of the three rows are unique, and one is a duplicate.
What will be the output?
Let's check our marathon results. How many duplicates are hiding in there?
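Since the real file isn't loaded here, a synthetic stand-in with the same shape shows what to expect: 55 rows covering 50 runners means duplicated().sum() should report 5. The column names below are assumptions for illustration:

```python
import pandas as pd

# Hypothetical stand-in: 50 unique runners, with 5 rows
# accidentally repeated once each (55 rows total).
unique = pd.DataFrame({
    "runner_id": range(1, 51),
    "finish_minutes": [180 + i for i in range(50)],
})
results = pd.concat([unique, unique.head(5)], ignore_index=True)

print(len(results))                # 55
print(results.duplicated().sum())  # 5
```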
drop_duplicates() removes all rows marked as duplicates and returns a new DataFrame:
What will be the output?
Let's clean up the marathon results. After dropping duplicates, we should have 50 unique rows:
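Using the same kind of synthetic stand-in (the real file and its column names are assumptions here), the cleanup pattern looks like this; re-running duplicated().sum() afterwards confirms nothing is left:

```python
import pandas as pd

# Hypothetical stand-in: 50 unique runners plus 5 repeated rows.
unique = pd.DataFrame({
    "runner_id": range(1, 51),
    "finish_minutes": [180 + i for i in range(50)],
})
results = pd.concat([unique, unique.head(5)], ignore_index=True)

# drop_duplicates() returns a new DataFrame; results is unchanged.
clean = results.drop_duplicates()
print(len(results), len(clean))   # 55 50
print(clean.duplicated().sum())   # 0 -- the cleaned data has no repeats
```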
The subset= parameter limits duplicate checking to specific columns:
Use subset= to deduplicate by runner_id only, ignoring differences in the other columns:
What will be the output?
What will be the output?
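To close out, a small hedged sketch of subset= in action (column names assumed): two rows share a runner_id but differ in finish_time, so only subset-based deduplication catches the repeat:

```python
import pandas as pd

# The same runner recorded twice with slightly different times.
df = pd.DataFrame({
    "runner_id": [101, 101, 102],
    "finish_time": ["3:41:05", "3:41:06", "4:02:17"],
})

# No row is an exact copy, so full-row checking finds nothing.
print(df.duplicated().sum())  # 0

# Checking only runner_id keeps the first row per runner.
by_runner = df.drop_duplicates(subset="runner_id")
print(len(by_runner))  # 2
```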