Exploring a Dataset
The Pandas series taught how to load and filter DataFrames. This lesson covers what analysts do first when facing an unknown dataset: checking its size, column types, and summary statistics with .shape, .dtypes, and .describe().
When facing an unknown dataset, where does analysis actually start? With a few quick checks that reveal what the data contains.
The first question: how much data is there? .shape answers instantly:
49 rows, 6 columns. That single line tells you the size. The result is a tuple: rows first, columns second.
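As a minimal sketch of the size check, assume a small stand-in DataFrame. The column names and values here are hypothetical, not the lesson's 49-row dataset:

```python
import pandas as pd

# Hypothetical stand-in data; the lesson's real dataset has 49 rows and 6 columns.
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "units": [10, 20, 30],
    "revenue": [3959975.0, 1250000.5, 87.25],
})

# .shape is an attribute, not a method, so there are no parentheses.
n_rows, n_cols = df.shape  # tuple: (rows, columns)
print(df.shape)
print(f"{n_rows} rows, {n_cols} columns")
```

Unpacking the tuple into two names is optional, but it keeps later code readable when only the row count matters.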
Next: what information does each column hold? .columns lists every column name:
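A short sketch of listing column names, again on a hypothetical stand-in DataFrame rather than the lesson's data:

```python
import pandas as pd

# Hypothetical stand-in data for illustration only.
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "units": [10, 20, 30],
    "revenue": [3959975.0, 1250000.5, 87.25],
})

# .columns returns an Index object; .tolist() turns it into a plain Python list.
column_names = df.columns.tolist()
print(column_names)
```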
Column names alone don't reveal whether values are text or numbers. .dtypes answers that:
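The type check might look like the sketch below. The DataFrame is a hypothetical stand-in, and the exact dtype reported for text columns depends on the pandas version:

```python
import pandas as pd

# Hypothetical stand-in data for illustration only.
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "units": [10, 20, 30],
    "revenue": [3959975.0, 1250000.5, 87.25],
})

# .dtypes returns a Series: column names as the index, data types as the values.
# Text columns usually appear as object or a string dtype, depending on the pandas version.
print(df.dtypes)

# The helpers in pd.api.types test a column's type without comparing dtype names.
units_is_integer = pd.api.types.is_integer_dtype(df["units"])
revenue_is_float = pd.api.types.is_float_dtype(df["revenue"])
print(units_is_integer, revenue_is_float)
```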
.describe() gives you count, mean, standard deviation, min, quartiles (25%, 50%, 75%), and max for every numeric column. All in one call, no loops.
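A minimal sketch of the summary call, assuming a hypothetical DataFrame with one text column and two numeric ones:

```python
import pandas as pd

# Hypothetical stand-in data for illustration only.
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "units": [10, 20, 30],
    "revenue": [3959975.0, 1250000.5, 87.25],
})

# By default, describe() summarizes only the numeric columns,
# so the text column "name" is left out of the result.
summary = df.describe()
print(summary)
```

The result is itself a DataFrame: one column per numeric column, one row per statistic.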
Pull out just the averages by accessing the 'mean' row of the describe() result:
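Because the describe() result is a DataFrame indexed by statistic name, the 'mean' row can be selected with .loc. A sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical stand-in data for illustration only.
df = pd.DataFrame({
    "units": [10, 20, 30],
    "price": [1.5, 2.5, 3.5],
})

# Select the row labelled 'mean'; the result is a Series with one value per column.
means = df.describe().loc["mean"]
print(means)
```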
Notice the formatting: large numbers display with commas (e.g., 3,959,975.00) instead of scientific notation (e.g., 3.959975e+06). This is because the script includes the line pd.options.display.float_format = '{:,.2f}'.format.
This is worth adding to every analysis script. It configures how pandas displays floating-point numbers for the rest of your Python session. The stored values are unchanged; only their printed form differs. Without it, large numbers become hard to read.
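A sketch of the formatter in action, assuming a small hypothetical Series of float values:

```python
import pandas as pd

# Applies to every float that pandas displays from here on in this session.
pd.options.display.float_format = "{:,.2f}".format

# Hypothetical values for illustration only.
revenue = pd.Series([3959975.0, 1250000.5, 87.25], name="revenue")

formatted = revenue.to_string()
print(formatted)

# To return to the pandas default later:
# pd.reset_option("display.float_format")
```

The setting takes any callable that turns a float into a string, so '{:,.2f}'.format is only one possible choice.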
Without the formatter, the same output looks like this:
Scientific notation (3.959975e+06) is compact but hard to interpret at a glance. Formatted numbers with commas are immediately readable: 3,959,975 is clearly about 4 million.
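The two renderings of the same number can be reproduced with plain Python format strings, independent of pandas. This sketch uses the value quoted above:

```python
value = 3959975.0

scientific = "{:e}".format(value)     # exponent notation, six decimals by default
with_commas = "{:,.2f}".format(value) # thousands separators, two decimals

print(scientific)   # 3.959975e+06
print(with_commas)  # 3,959,975.00
```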
Best practice: Add pd.options.display.float_format = '{:,.2f}'.format to the top of your analysis script, right after importing pandas. It makes every float that pandas prints human-readable for the rest of your session.
Glancing at actual records helps confirm the data loaded correctly. .tail() shows the last few rows:
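A minimal sketch of .tail() on a hypothetical ten-row DataFrame. With no argument it returns the last five rows, and an integer argument changes that count:

```python
import pandas as pd

# Hypothetical ten-row frame so the effect of tail() is easy to see.
df = pd.DataFrame({"row_id": range(1, 11)})

last_rows = df.tail()    # default: last 5 rows
last_three = df.tail(3)  # last 3 rows

print(last_rows)
print(last_three)
```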