pandas

Introduction to pandas

In this lesson, we will talk about another important data science library: pandas. You will learn what this library is used for and make your first steps by importing and displaying data frames.


Let's talk about pandas, another crucial Python library for data science.

In data science, data usually comes in large datasets with hundreds or thousands of rows and several columns.

Think of a really big Excel spreadsheet.

pandas is the perfect library to work with these datasets.

You might be wondering: Didn't we just learn how to work with multiple columns and rows using NumPy?

That's right, but you can think of pandas as a layer built on top of NumPy.

While NumPy is really good for efficient numerical computations on single or multi-dimensional arrays, pandas is the ideal tool for preparing, manipulating, and analyzing large, structured datasets.

In fact, pandas utilizes many of NumPy's computational methods.
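One way to see this relationship is to move data between the two libraries. The following is a minimal sketch with made-up data (the column names Employees and Revenue are hypothetical and not taken from the lesson's dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({"Employees": [120, 45, 300], "Revenue": [1.5, 0.4, 7.2]})

# Converting a column gives back a plain NumPy array
revenue = df["Revenue"].to_numpy()
print(type(revenue))

# NumPy functions can be applied directly to a pandas column
total = np.sum(df["Employees"])
print(total)
```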

Enough of the talking. Let's start exploring pandas.

First, ensure pandas is installed in your Python environment, then import it. pandas is commonly imported using the alias pd.

    # import pandas under alias 'pd'
    import pandas as pd
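If you are unsure whether pandas is available in your environment, a quick check could look like the sketch below. The variable name version is only illustrative.

```python
import pandas as pd

# If pandas is not installed, the import above raises ImportError.
# Otherwise, this prints the installed version string.
version = pd.__version__
print(version)
```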

To start using pandas, we need a dataset to work with.

In this tutorial, we'll use a dataset with information about some fictional companies.

It's stored in a CSV file at the path /data/companies.csv.

We'll load this dataset using pandas' pd.read_csv() function:

    import pandas as pd

    df = pd.read_csv('/data/companies.csv')
    print(df)

Here, we loaded the dataset and assigned it to a variable named df. You can choose any variable name you prefer. df is a common choice because pd.read_csv() returns a pandas data structure called a DataFrame.
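If you don't have the CSV file at hand, you can try the same call on a small piece of in-memory text. This is only a sketch: the CSV content below is invented for illustration and is not the lesson's dataset.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file
csv_text = """Company,CEO,Industry
Acme,Jane Doe,Retail
Globex,John Roe,Energy
"""

# read_csv accepts a file path or a file-like object such as StringIO
df = pd.read_csv(io.StringIO(csv_text))

print(type(df))
print(df)
```

The first line of the CSV text becomes the column labels, and each following line becomes one row of the data frame.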

When printing the dataset, we see a lot of data and the information that the dataset has 49 rows and 6 columns.

However, printing the entire dataset isn't really practical for obtaining an overview.

To get a quick first look, you can use the DataFrame.head() method:

    import pandas as pd

    df = pd.read_csv('/data/companies.csv')

    # show first 5 rows
    print(df.head())

This will print the first 5 rows of the data frame.

If you want to see a different number of rows, pass the desired number to the method: DataFrame.head(n), where n is the number of rows to show.

    import pandas as pd

    df = pd.read_csv('/data/companies.csv')

    # show first 5 rows
    print(df.head())

    # show first 13 rows
    print(df.head(13))
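To check how the row count behaves, here is a small sketch on a made-up data frame with 20 rows (the Company column and its values are hypothetical):

```python
import pandas as pd

# Hypothetical frame with 20 rows, for illustration only
df = pd.DataFrame({"Company": [f"Company {i}" for i in range(20)]})

first_five = df.head()        # default: first 5 rows
first_thirteen = df.head(13)  # first 13 rows
everything = df.head(100)     # asking for more rows than exist returns all rows

print(len(first_five), len(first_thirteen), len(everything))
```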

Let's get some basic information about the data frame:

    import pandas as pd

    df = pd.read_csv('/data/companies.csv')

    print(df.shape)
    print(df.columns)

From NumPy we are already familiar with the shape attribute. Here it shows that the data frame has 49 rows and 6 columns.

What's new is the columns attribute.

Unlike the columns of a NumPy array, the columns of a pandas data frame have labels (names).
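Both attributes can be used directly in your code. The sketch below uses a small made-up data frame; the company names and values are hypothetical.

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({
    "Company": ["Acme", "Globex", "Initech"],
    "CEO": ["Jane Doe", "John Roe", "Sam Poe"],
    "Industry": ["Retail", "Energy", "Software"],
})

# shape is a tuple: (number of rows, number of columns)
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# columns holds the labels; list() turns them into a plain Python list
column_names = list(df.columns)
print(column_names)
```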

You can use the labels to access individual columns using square brackets. For instance, let's extract the column with label 'CEO':

    import pandas as pd

    df = pd.read_csv('/data/companies.csv')

    # get column 'CEO'
    print(df['CEO'])

To extract multiple columns at once, you can pass a list of labels:

    import pandas as pd

    df = pd.read_csv('/data/companies.csv')

    # get columns 'CEO' and 'Industry'
    print(df[['CEO', 'Industry']])
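The two forms of column selection return different types: a single label gives a one-dimensional Series, while a list of labels gives a smaller DataFrame. A minimal sketch on made-up data (all values are hypothetical):

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({
    "Company": ["Acme", "Globex", "Initech"],
    "CEO": ["Jane Doe", "John Roe", "Sam Poe"],
    "Industry": ["Retail", "Energy", "Software"],
})

# A single label returns a Series
ceo = df["CEO"]
print(type(ceo))

# A list of labels returns a DataFrame, even if the list has one entry
subset = df[["CEO", "Industry"]]
print(type(subset))
print(subset.shape)
```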

To extract a specific row from the data frame, you can use DataFrame.loc[]. For example, to extract the second row, we do this:

    import pandas as pd

    df = pd.read_csv('/data/companies.csv')

    # get second row
    print(df.loc[1])

DataFrame.loc[] selects rows by their index label. Because pd.read_csv() assigns a default integer index that starts at 0, the label 1 corresponds to the second row.
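To make it visible that loc looks up index labels rather than counting positions, here is a sketch with a deliberately reordered index. The data and the index values are made up for illustration.

```python
import pandas as pd

# Hypothetical data with a custom index order: labels 2, 0, 1
df = pd.DataFrame(
    {"CEO": ["Jane Doe", "John Roe", "Sam Poe"]},
    index=[2, 0, 1],
)

# loc[1] returns the row whose index label is 1,
# which is the third row in this frame, not the second
row = df.loc[1]
print(row["CEO"])
```

With the default index produced by pd.read_csv(), labels and positions coincide, which is why label 1 gives the second row there.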

With DataFrame.loc[], you can also extract multiple rows:

    import pandas as pd

    df = pd.read_csv('/data/companies.csv')

    # get rows with index labels 10 to 15 (inclusive)
    print(df.loc[10:15])

Side note: Slicing with loc is inclusive, meaning that DataFrame.loc[10:15] includes the row labeled 15. This differs from NumPy arrays and regular Python lists, where a slice such as array[10:15] stops before index 15.
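The difference is easy to verify by counting the results. A minimal sketch, assuming a made-up data frame with the default index 0 to 19:

```python
import pandas as pd

values = list(range(20))
df = pd.DataFrame({"value": values})  # default index: 0, 1, ..., 19

list_slice = values[10:15]  # indices 10 to 14, end is excluded
loc_slice = df.loc[10:15]   # labels 10 to 15, end is included

print(len(list_slice), len(loc_slice))
```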

So far, we have only passed one argument to DataFrame.loc[]. But you can also pass a second argument to specify the column or columns that you want to extract:

    DataFrame.loc[row_index, col_label]

Let's extract the first 3 rows of the column 'CEO':

    import pandas as pd

    df = pd.read_csv('/data/companies.csv')

    # get first 3 CEOs
    print(df.loc[:2, 'CEO'])
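The second argument accepts a single label or a list of labels, and the first argument can be a single label as well. The sketch below shows these combinations on made-up data (all names are hypothetical):

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({
    "Company": ["Acme", "Globex", "Initech", "Umbrella"],
    "CEO": ["Jane Doe", "John Roe", "Sam Poe", "Ann Lee"],
    "Industry": ["Retail", "Energy", "Software", "Pharma"],
})

# Rows labeled 0 to 2 of one column: a Series with 3 entries
first_three_ceos = df.loc[:2, "CEO"]
print(first_three_ceos)

# Rows labeled 0 to 2 of two columns: a DataFrame with 3 rows and 2 columns
two_columns = df.loc[:2, ["CEO", "Industry"]]
print(two_columns)

# One row label and one column label: a single value
second_ceo = df.loc[1, "CEO"]
print(second_ceo)
```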

That concludes your first pandas lesson. It's time to practice!