pandas

Introduction to pandas

In this lesson, we will talk about another important data science library: pandas. You will learn what this library is used for and take your first steps by importing and displaying data frames.


Let's talk about pandas, another crucial Python library for data science.

In data science, data usually comes in large datasets with hundreds or thousands of rows and many columns.

Think of a really big Excel spreadsheet.

pandas is the perfect library to work with these datasets.

You might be wondering: Didn't we just learn how to work with multiple columns and rows using NumPy?

That's right, but you can think of pandas as a library built on top of NumPy.

While NumPy is really good for efficient numerical computations on single or multi-dimensional arrays, pandas is the ideal tool for preparing, manipulating, and analyzing large, structured datasets.

In fact, pandas utilizes many of NumPy's computational methods.

Enough talking. Let's start exploring pandas.

First, ensure pandas is installed in your Python environment, then import it. pandas is commonly imported using the alias pd.

To start using pandas, we need a dataset to work with.

In this tutorial, we'll use a dataset with information about some fictional companies.

It's stored in a CSV file under the path /data/companies.csv.

We'll load this dataset using pandas' pd.read_csv() function:
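
Here is a minimal sketch of the import and the loading step. Because /data/companies.csv is not available outside the lesson environment, the example reads a small in-memory stand-in through io.StringIO. The column names other than 'CEO' and all values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for /data/companies.csv (made-up columns and values).
csv_text = """Company,CEO,Employees
Acme,Alice,120
Globex,Bob,45
Initech,Carol,300
"""

# pd.read_csv accepts a file path or a file-like object.
# With the real file this would be: df = pd.read_csv("/data/companies.csv")
df = pd.read_csv(io.StringIO(csv_text))

print(df)
```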

Here, we loaded the dataset and assigned it to a variable df. You can choose any variable name you prefer. df is a common choice because it abbreviates DataFrame, the data structure pandas uses for tabular data.

When printing the dataset, we see a lot of data and the information that the dataset has 49 rows and 6 columns.

However, printing the entire dataset isn't really practical for obtaining an overview.

To get a quick first look, you can use the DataFrame.head() method:
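
A short sketch of that call on a made-up data frame with 8 rows (the lesson's dataset is not reproduced here, so the names and values are hypothetical):

```python
import pandas as pd

# Hypothetical data: 8 made-up companies, not the lesson's dataset.
df = pd.DataFrame({
    "Company": [f"Company {i}" for i in range(8)],
    "CEO": [f"CEO {i}" for i in range(8)],
})

# Without an argument, head() returns the first 5 rows.
preview = df.head()
print(preview)
```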

This will print the first 5 rows of the data frame.

If you want to see more than 5 rows, you can pass the desired number of rows as an argument: DataFrame.head(n).
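
For example, asking for 7 rows could look like this. The data frame is again a made-up one with 8 rows, used only for illustration.

```python
import pandas as pd

# Hypothetical data: 8 made-up companies, not the lesson's dataset.
df = pd.DataFrame({
    "Company": [f"Company {i}" for i in range(8)],
    "CEO": [f"CEO {i}" for i in range(8)],
})

# Pass the number of rows you want to see.
first_seven = df.head(7)
print(first_seven)
```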

Let's get some basic information about the data frame:
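
A minimal sketch of such an inspection, using the shape and columns attributes on a small made-up data frame (3 rows and 3 columns, so the numbers differ from the lesson's dataset):

```python
import pandas as pd

# Hypothetical data frame with made-up columns and values.
df = pd.DataFrame({
    "Company": ["Acme", "Globex", "Initech"],
    "CEO": ["Alice", "Bob", "Carol"],
    "Employees": [120, 45, 300],
})

print(df.shape)    # tuple: (number of rows, number of columns)
print(df.columns)  # Index object holding the column labels
```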

We already know the shape attribute from NumPy. Here it tells us that the data frame has 49 rows and 6 columns.

What's new is the columns attribute.

Unlike the columns of a NumPy array, the columns of a pandas data frame have labels (names).

You can use the labels to access individual columns using square brackets. For instance, let's extract the column with label 'CEO':
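
As a sketch, selecting 'CEO' from a small made-up data frame looks like this. Only the label 'CEO' comes from the lesson; the other column and all values are hypothetical.

```python
import pandas as pd

# Hypothetical data frame with made-up values.
df = pd.DataFrame({
    "Company": ["Acme", "Globex", "Initech"],
    "CEO": ["Alice", "Bob", "Carol"],
})

# A single label in square brackets returns one column as a pandas Series.
ceo = df["CEO"]
print(ceo)
```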

To extract multiple columns at once, you can pass a list of labels:
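
A short sketch with a made-up data frame; note the double square brackets, where the inner pair is the Python list of labels. The column names besides 'CEO' are hypothetical.

```python
import pandas as pd

# Hypothetical data frame with made-up values.
df = pd.DataFrame({
    "Company": ["Acme", "Globex", "Initech"],
    "CEO": ["Alice", "Bob", "Carol"],
    "Employees": [120, 45, 300],
})

# A list of labels returns a new DataFrame with just those columns.
subset = df[["Company", "CEO"]]
print(subset)
```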

To extract a specific row from the data frame, you can use DataFrame.loc[]. For example, to extract the second row, we do this:
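
A minimal sketch on a made-up data frame that uses the default index (0, 1, 2, ...); the values are hypothetical:

```python
import pandas as pd

# Hypothetical data frame with the default index labels 0, 1, 2.
df = pd.DataFrame({
    "Company": ["Acme", "Globex", "Initech"],
    "CEO": ["Alice", "Bob", "Carol"],
})

# The row labelled 1 is the second row; it comes back as a Series
# whose index holds the column labels.
second_row = df.loc[1]
print(second_row)
```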

DataFrame.loc[] selects rows by their index label. Because the default index starts counting at 0, the second row has the label 1.

With DataFrame.loc[], you can also extract multiple rows:
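
Here is a sketch of two ways to do this on a made-up data frame: a slice of labels and, as an additional option not covered above, a list of individual labels.

```python
import pandas as pd

# Hypothetical data frame with the default index labels 0..4.
df = pd.DataFrame({
    "Company": ["Acme", "Globex", "Initech", "Umbrella", "Hooli"],
    "CEO": ["Alice", "Bob", "Carol", "Dan", "Eve"],
})

rows_1_to_3 = df.loc[1:3]      # consecutive rows labelled 1, 2 and 3
rows_0_and_4 = df.loc[[0, 4]]  # individual rows picked by a list of labels

print(rows_1_to_3)
print(rows_0_and_4)
```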

Side note: Slicing with loc is inclusive. DataFrame.loc[10:15] includes the row labeled 15, whereas NumPy or regular Python list slicing such as array[10:15] stops before index 15.
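
The difference can be checked with a small experiment. The sketch below uses a made-up column of six names and compares a loc slice with a plain Python list slice over the same bounds.

```python
import pandas as pd

# Hypothetical data frame with the default index labels 0..5.
df = pd.DataFrame({"CEO": ["Alice", "Bob", "Carol", "Dan", "Eve", "Frank"]})

loc_slice = df.loc[1:3]            # labels 1, 2 and 3: the end label is included
list_slice = list(df["CEO"])[1:3]  # positions 1 and 2: the end position is excluded

print(len(loc_slice))   # 3
print(len(list_slice))  # 2
```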

So far, we have only passed one argument to DataFrame.loc[]. But you can also pass a second argument to specify the column or columns that you want to extract, in the form DataFrame.loc[rows, columns].

Let's extract the first 3 rows of the column 'CEO':
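
A minimal sketch on a made-up data frame. Since loc slices are inclusive, the first 3 rows are the labels 0 through 2. The second call, with a list of columns, is an extra illustration; the 'Company' column is hypothetical.

```python
import pandas as pd

# Hypothetical data frame with the default index labels 0..4.
df = pd.DataFrame({
    "Company": ["Acme", "Globex", "Initech", "Umbrella", "Hooli"],
    "CEO": ["Alice", "Bob", "Carol", "Dan", "Eve"],
})

# Rows 0 to 2 (inclusive) of a single column: the result is a Series.
first_three_ceos = df.loc[0:2, "CEO"]

# Rows 0 to 2 of several columns: the result is a DataFrame.
first_three_rows = df.loc[0:2, ["Company", "CEO"]]

print(first_three_ceos)
print(first_three_rows)
```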

That concludes your first pandas lesson. It's time to practice!