Marathon Data

Two Files, One Picture

Over the last four lessons, we got both marathon files into shape: renaming columns, removing duplicates, fixing text, and converting types. Both DataFrames share a runner_id column, and now we can use it to combine them into one table. There are two tools for this: pd.concat() stacks DataFrames on top of each other, and pd.merge() joins them side by side on a shared key.

pd.concat() stacks DataFrames with the same columns on top of each other:

import pandas as pd
 
top = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'time': ['3:15', '2:48']
})
bottom = pd.DataFrame({
    'name': ['Carol', 'Dan'],
    'time': ['3:42', '4:01']
})
combined = pd.concat([top, bottom])
print(combined.shape)

Output

(4, 2)
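One detail the shape does not show is what happens to the row labels. As a minimal sketch reusing the two small DataFrames above, the following compares the default behaviour with the ignore_index=True option of pd.concat(), which is not used elsewhere in this lesson:

```python
import pandas as pd

top = pd.DataFrame({'name': ['Alice', 'Bob'], 'time': ['3:15', '2:48']})
bottom = pd.DataFrame({'name': ['Carol', 'Dan'], 'time': ['3:42', '4:01']})

# Default: each piece keeps its own row labels, so 0 and 1 appear twice
stacked = pd.concat([top, bottom])
print(stacked.index.tolist())     # [0, 1, 0, 1]

# ignore_index=True discards the old labels and renumbers the rows from 0
renumbered = pd.concat([top, bottom], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Repeated labels are harmless for counting rows, but renumbering tends to be the safer choice if you plan to select rows by label afterwards.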

What will be the output?

import pandas as pd
 
a = pd.DataFrame({
    'x': [1, 2, 3]
})
b = pd.DataFrame({
    'x': [4, 5]
})
result = pd.concat([a, b])
print(result.shape)

Imagine the marathon results got split across two files during export. Here we simulate that split by slicing the file into two pieces with iloc, then use pd.concat() to put them back together:

import pandas as pd
 
df = pd.read_csv('/data/marathon_results.csv')
half1 = df.iloc[:27]
half2 = df.iloc[27:]
combined = pd.concat([half1, half2])
print(combined.shape)

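A quick way to convince yourself that nothing was lost in the round trip is to compare the reassembled table with the original. The sketch below uses a small in-memory DataFrame as a stand-in for the results file; the finish_time column and the runner_id values are made up for illustration:

```python
import pandas as pd

# Hypothetical stand-in for the marathon results file
df = pd.DataFrame({
    'runner_id': [101, 102, 103, 104, 105],
    'finish_time': ['3:15', '2:48', '3:42', '4:01', '3:30'],
})

half1 = df.iloc[:3]   # first three rows
half2 = df.iloc[3:]   # remaining rows
combined = pd.concat([half1, half2])

# No rows lost or duplicated, and the contents match the original
print(len(half1) + len(half2) == len(combined))  # True
print(combined.equals(df))                       # True
```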

pd.concat() stacks rows. merge() joins two DataFrames side by side using a shared column as the key.

pd.merge() takes the two DataFrames, an on= parameter naming the shared column, and a how= parameter choosing the type of join:

merged = pd.merge(
    left_df,
    right_df,
    on='shared_column',
    how='inner'
)

An inner merge keeps only rows where the key appears in both DataFrames:

import pandas as pd
 
results = pd.DataFrame({
    'id': [1, 2, 3],
    'time': ['3:15', '2:48', '4:01']
})
reg = pd.DataFrame({
    'id': [1, 2],
    'age': [34, 28]
})
merged = pd.merge(results, reg, on='id')
print(merged.shape)

Output

(2, 3)
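Because an inner merge drops unmatched rows silently, it can help to look at which keys have no partner. This is one possible check, using isin() on the same two small DataFrames; it is a sketch, not the only way to do it:

```python
import pandas as pd

results = pd.DataFrame({'id': [1, 2, 3], 'time': ['3:15', '2:48', '4:01']})
reg = pd.DataFrame({'id': [1, 2], 'age': [34, 28]})

# Rows of results whose id never appears in reg; an inner merge drops these
unmatched = results[~results['id'].isin(reg['id'])]
print(unmatched['id'].tolist())  # [3]
```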

What will be the output?

import pandas as pd
 
left = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'val': ['a', 'b', 'c', 'd']
})
right = pd.DataFrame({
    'id': [2, 3, 5],
    'score': [10, 20, 30]
})
merged = pd.merge(left, right, on='id')
print(merged.shape)

Let's bring our two marathon files together. After deduplicating the results, merge them with the registrations on runner_id:

import pandas as pd
 
df = pd.read_csv('/data/marathon_results.csv')
df = df.drop_duplicates()
reg = pd.read_csv(
    '/data/marathon_registrations.csv')
merged = pd.merge(df, reg, on='runner_id')
print(merged.shape)

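Deduplicating first matters because a repeated key on either side multiplies rows in the merged result. As a hedged sketch, the validate= parameter of pd.merge() can be used to have pandas check that each key appears only once. The miniature DataFrames below are hypothetical stand-ins for the two marathon files, and finish_time and age are assumed column names:

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical miniature versions of the two marathon files
results = pd.DataFrame({
    'runner_id': [101, 102, 102],            # runner 102 was exported twice
    'finish_time': ['3:15', '2:48', '2:48'],
})
reg = pd.DataFrame({
    'runner_id': [101, 102],
    'age': [34, 28],
})

# validate='one_to_one' raises MergeError if a key repeats on either side
try:
    pd.merge(results, reg, on='runner_id', validate='one_to_one')
    duplicate_keys_found = False
except MergeError:
    duplicate_keys_found = True
print(duplicate_keys_found)  # True

# After dropping the duplicate row, the same merge passes the check
clean = results.drop_duplicates()
merged = pd.merge(clean, reg, on='runner_id', validate='one_to_one')
print(len(merged))  # 2
```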

Passing how='left' keeps every row from the left DataFrame, even when its key has no match in the right one. For those rows, the columns that come from the right DataFrame are filled with NaN.

A left merge keeps every result row, even for runners with no matching registration:

import pandas as pd
 
results = pd.DataFrame({
    'id': [1, 2, 3],
    'time': ['3:15', '2:48', '4:01']
})
reg = pd.DataFrame({
    'id': [1, 2],
    'age': [34, 28]
})
merged = pd.merge(
    results, reg, on='id', how='left'
)
print(merged.shape)

Output

(3, 3)
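The shape alone does not show where the gaps are. A minimal sketch, reusing the same two DataFrames, that finds the rows left without a registration by testing the age column with isna():

```python
import pandas as pd

results = pd.DataFrame({'id': [1, 2, 3], 'time': ['3:15', '2:48', '4:01']})
reg = pd.DataFrame({'id': [1, 2], 'age': [34, 28]})

merged = pd.merge(results, reg, on='id', how='left')

# Rows with no registration have NaN in the columns that came from reg
missing = merged[merged['age'].isna()]
print(missing['id'].tolist())  # [3]

# NaN is a float, so the integer age column is converted to float64
print(merged['age'].dtype)     # float64
```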

What will be the output?

import pandas as pd
 
left = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'val': ['a', 'b', 'c', 'd']
})
right = pd.DataFrame({
    'id': [2, 3],
    'score': [10, 20]
})
merged = pd.merge(
    left, right, on='id', how='left'
)
print(merged.shape)

What will be the output?

import pandas as pd
 
a = pd.DataFrame({
    'id': [1, 2],
    'name': ['Alice', 'Bob']
})
b = pd.DataFrame({
    'id': [1, 2],
    'score': [90, 85]
})
merged = pd.merge(a, b, on='id')
print(list(merged.columns))

What will be the output?

import pandas as pd
 
a = pd.DataFrame({
    'id': [1, 2, 3],
    'x': [10, 20, 30]
})
b = pd.DataFrame({
    'id': [1, 2, 3],
    'y': [100, 200, 300]
})
merged = pd.merge(a, b, on='id')
print(merged.shape)