What is a Pandas DataFrame? Getting Started with Pandas Data Structures

What is a Pandas DataFrame?

A Pandas DataFrame is a 2-dimensional, labeled data structure consisting of rows and columns. It lets you store and query tabular data much like you would in Excel or SQL.

Each DataFrame column represents a variable/feature/predictor, such as age, gender, square footage, and so on. Similarly, each row represents one record/observation, such as housing information, or employee data. Take a look at the following figure for a visual explanation:

Image 1 - Excel Spreadsheet vs Pandas DataFrame (Image by author)

Pandas allows you to store pretty much any data type inside a DataFrame, including numerical, categorical, and textual. You can even store more complex data types, such as JSON, but that’s a story for another time. Today is all about the basics.
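To make the idea of mixed column types concrete, here is a minimal sketch. The housing columns and values are made up for illustration and are not part of the dataset used later in this article. It prints the data type of each column and then filters rows, similar to a WHERE clause in SQL:

```python
import pandas as pd

# Hypothetical housing records; column names and values are illustrative only.
homes = pd.DataFrame({
    "city": ["Austin", "Denver", "Boston"],     # textual
    "square_footage": [1450, 2100, 980],        # numerical (integers)
    "price": [350000.0, 520000.0, 610000.0],    # numerical (floats)
    "has_garage": [True, True, False],          # boolean
})

# Each column carries its own data type.
print(homes.dtypes)

# Keep only the rows where square_footage is above 1000.
large_homes = homes[homes["square_footage"] > 1000]
print(large_homes)
```

Each column keeps its own type, which is why a single DataFrame can hold numbers, text, and flags side by side.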


How to Create a Pandas DataFrame (pd DataFrame)?

Creating DataFrames in Python is straightforward with the Pandas library. This section walks you through the basics of building Pandas DataFrames with the pd.DataFrame constructor from several different Python data types.

Here are a couple of possible arguments you can pass in when creating a Pandas DataFrame:

Argument | Description
data | The data itself; can be a NumPy array, an iterable (such as a list), a dictionary, a Series, another DataFrame, and more
index | Row labels to use for the DataFrame; a range index by default (integers from 0 to n - 1, where n is the number of rows)
columns | Column labels to assign to the DataFrame; a range index by default when the data carries no column labels
dtype | A single data type to force on the entire DataFrame; if not supplied, Pandas infers data types automatically
copy | Boolean, whether or not Pandas should copy data from inputs
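As a quick illustration of the arguments listed above, the sketch below passes data, index, columns, and dtype in a single call. The values and labels are hypothetical and chosen only to make the effect of each argument visible:

```python
import pandas as pd

# Hypothetical values; only the constructor arguments matter here.
scores = pd.DataFrame(
    data=[[1, 2], [3, 4], [5, 6]],
    index=["a", "b", "c"],      # custom row labels instead of 0, 1, 2
    columns=["x", "y"],         # custom column labels instead of 0, 1
    dtype="float64",            # force one data type on every column
)

print(scores)
print(scores.dtypes)
```

Leaving out index and columns would fall back to the default range labels, and leaving out dtype would let Pandas infer integers here.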

Let’s now go over some hands-on use cases for creating Pandas DataFrames. These are the library imports you’ll need at the beginning of your notebook or script:

import numpy as np
import pandas as pd
from datetime import datetime

Convert List to Pandas DataFrame

A common data type that often gets converted to a DataFrame is a Python list. You can represent each DataFrame row as a single list, which is convenient if you have a small number of features, but gets messy as the dataset grows in width.

Take a look at the following code - it creates four lists, one for each employee. Then, it passes a list of employees (list of lists) to the data argument of the pd.DataFrame() function:

emp1 = ["Bob", "Doe", "[email protected]", datetime(2023, 2, 15)]
emp2 = ["Mark", "Doe", "[email protected]", datetime(2023, 3, 10)]
emp3 = ["Jane", "Doe", "[email protected]", datetime(2023, 3, 12)]
emp4 = ["Patrick", "Doe", "[email protected]", datetime(2023, 3, 18)]

data = pd.DataFrame(data=[emp1, emp2, emp3, emp4])
data

This is what the resulting Pandas DataFrame looks like:

Image 2 - Creating DataFrame from List (1) (Image by author)

As you can see, the column names are missing. You can supply yours by passing them to the columns argument of the pd.DataFrame() function:

data = pd.DataFrame(
 data=[emp1, emp2, emp3, emp4],
 columns=["First Name", "Last Name", "Email", "Created At"]
)
data

Our DataFrame has column names now:

Image 3 - Creating DataFrame from List (2) (Image by author)

Let’s see how we can do the same with dictionaries.

Want to learn more about converting a List to Pandas DataFrame? Read our comprehensive guide.

Convert Dictionary to Pandas DataFrame

Python dictionaries are a powerful data structure, especially when working with Pandas. You can easily convert a dict to Pandas DataFrame by passing in a list of dictionaries, each one representing a single row of data.

Because dictionaries are key-value pairs, we essentially specify the column names and respective values at once. Take a look at the following snippet:

employees = [
 {"First Name": "Bob", "Last Name": "Doe", "Email": "[email protected]", "Created At": datetime(2023, 2, 15)},
 {"First Name": "Mark", "Last Name": "Doe", "Email": "[email protected]", "Created At": datetime(2023, 3, 10)},
 {"First Name": "Jane", "Last Name": "Doe", "Email": "[email protected]", "Created At": datetime(2023, 3, 12)},
 {"First Name": "Patrick", "Last Name": "Doe", "Email": "[email protected]", "Created At": datetime(2023, 3, 18)}
]

data = pd.DataFrame(employees)
data

We get the same DataFrame without specifying the columns explicitly:

Image 4 - Creating DataFrame from Dictionary (Image by author)
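A list of dictionaries is row-oriented. As an alternative that is not shown in the original walkthrough, pd.DataFrame also accepts a single dictionary of lists, where each key becomes a column. A minimal sketch with placeholder values:

```python
from datetime import datetime

import pandas as pd

# Column-oriented input: each key is a column name, each list holds that column's values.
# The names and dates below are placeholder values.
employees_by_column = {
    "First Name": ["Bob", "Mark"],
    "Last Name": ["Doe", "Doe"],
    "Created At": [datetime(2023, 2, 15), datetime(2023, 3, 10)],
}

data_from_columns = pd.DataFrame(employees_by_column)
print(data_from_columns)
```

All lists must have the same length, since each position across the lists forms one row.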

Up next, let’s see how to do the same with NumPy arrays.

Convert NumPy Array to Pandas DataFrame

NumPy and Pandas go hand in hand. Both libraries are used together in most data science projects, so it makes sense to build Pandas DataFrames from NumPy arrays. The following code snippet shows you how to convert NumPy arrays to a Pandas DataFrame. It works the same way as with Python lists:

emp1 = np.array(["Bob", "Doe", "[email protected]", datetime(2023, 2, 15)])
emp2 = np.array(["Mark", "Doe", "[email protected]", datetime(2023, 3, 10)])
emp3 = np.array(["Jane", "Doe", "[email protected]", datetime(2023, 3, 12)])
emp4 = np.array(["Patrick", "Doe", "[email protected]", datetime(2023, 3, 18)])

data = pd.DataFrame(
 data=[emp1, emp2, emp3, emp4],
 columns=["First Name", "Last Name", "Email", "Created At"]
)
data

The resulting DataFrame looks familiar:

Image 5 - Creating DataFrame from NumPy Array (Image by author)
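The snippet above builds one array per row. A single 2-dimensional NumPy array can also be passed directly, in which case its rows and columns map onto the DataFrame's rows and columns. This is a minimal sketch with a made-up numeric array and hypothetical column names:

```python
import numpy as np
import pandas as pd

# A 3x2 numeric array; array rows become DataFrame rows,
# array columns become DataFrame columns.
measurements = np.array([[1.5, 2.0], [3.5, 4.0], [5.5, 6.0]])

df = pd.DataFrame(measurements, columns=["width", "height"])
print(df)
print(df.dtypes)  # both columns keep the array's float64 type
```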

And that’s how you can construct Pandas DataFrames from scratch, but how can you expand them? In other words, how can you add rows and columns? That’s what we’ll answer next.


How to Add Rows and Columns to a Pandas DataFrame

This section will explain some basic ways to add rows and columns to Pandas DataFrames. There will be dedicated articles that cover the same in much more depth, so make sure to stay tuned to Practical Pandas.

Let’s start with columns.

Add Column to Pandas DataFrame

Data science and data analytics often require you to make derived columns. Put simply, these columns represent data in a new way that is probably easier to understand, or easier to plug into a machine learning model.

We’ll keep things simple today, and only show you how to append a column to a DataFrame. For example, imagine we wanted to add a Date of Birth attribute to our dataset. This is one way to do it:

dobs = [datetime(1985, 1, 15), datetime(1990, 5, 14), datetime(1997, 7, 9), datetime(1960, 5, 5)]

data["Date of Birth"] = dobs
data

The new attribute gets appended to the end of the DataFrame:

Image 6 - Adding a column to Pandas DataFrame (1) (Image by author)
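The derived columns mentioned at the start of this section are computed from existing columns rather than supplied as a new list. Here is a minimal sketch on a small stand-in DataFrame with placeholder values. The Full Name and Birth Year columns are hypothetical and not part of the article's dataset:

```python
from datetime import datetime

import pandas as pd

# A small stand-in for the employee DataFrame (placeholder values).
people = pd.DataFrame({
    "First Name": ["Bob", "Jane"],
    "Last Name": ["Doe", "Doe"],
    "Date of Birth": [datetime(1985, 1, 15), datetime(1997, 7, 9)],
})

# Derived columns computed from existing ones.
people["Full Name"] = people["First Name"] + " " + people["Last Name"]
people["Birth Year"] = people["Date of Birth"].dt.year

print(people)
```

Both assignments use the same bracket syntax as above; the only difference is that the right-hand side is an expression over existing columns.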

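But what if you want it at a certain location, perhaps after the Last Name column? You can use the insert() method to specify the column position (keep in mind that positions in Python and Pandas start at 0). Note that insert() raises a ValueError if a column with the same name already exists, so the snippet first drops the Date of Birth column added in the previous step:

dobs = [datetime(1985, 1, 15), datetime(1990, 5, 14), datetime(1997, 7, 9), datetime(1960, 5, 5)]

# Remove the previously appended column, if present, so insert() doesn't fail on a duplicate name
data = data.drop(columns="Date of Birth", errors="ignore")
data.insert(loc=2, column="Date of Birth", value=dobs)
data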

This approach allows for more control, as you can see from the image below:

Image 7 - Adding a column to Pandas DataFrame (2) (Image by author)
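Hard-coding loc=2 works as long as the column order is known. A slightly more robust variant, sketched here as one possible approach, looks up the position of an existing column with columns.get_loc() and inserts right after it. The DataFrame, the Department column, and the email addresses are placeholders:

```python
import pandas as pd

# Placeholder data mirroring the column layout used above.
staff = pd.DataFrame({
    "First Name": ["Bob", "Jane"],
    "Last Name": ["Doe", "Doe"],
    "Email": ["bob@example.com", "jane@example.com"],
})

# Find the position of "Last Name" and insert the new column right after it.
position = staff.columns.get_loc("Last Name") + 1
staff.insert(loc=position, column="Department", value=["Sales", "IT"])

print(staff.columns.tolist())
```

Unlike most DataFrame methods, insert() modifies the DataFrame in place and returns None, so there is no need to reassign the result.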

Next, let’s dive into adding rows to the DataFrame.

Add Row to Pandas DataFrame

We’ll now add a couple more employees to our DataFrame. Older tutorials add rows with the append() method, but it was deprecated in Pandas 1.4 and removed in Pandas 2.0. Instead, you should opt for the pd.concat() function.

It expects two or more Pandas DataFrames, so each new row has to be converted first. Luckily, you already know how to do that!

Here’s an example of adding one row:

data = pd.concat([
 data,
 pd.DataFrame(data=[
 {"First Name": "John", "Last Name": "Doe", "Email": "[email protected]", "Created At": datetime(2023, 3, 21)}
 ])
], ignore_index=True)

data

The dataset is now a bit longer:

Image 8 - Adding a row to Pandas DataFrame (Image by author)

But what if you want to add more rows? Do you have to call pd.concat() twice? Absolutely not. We’re already constructing the second DataFrame from a list of dictionaries, so simply add another dictionary for the second row:

data = pd.concat([
 data,
 pd.DataFrame(data=[
 {"First Name": "Linda", "Last Name": "Doe", "Email": "[email protected]", "Created At": datetime(2023, 3, 23)},
 {"First Name": "Kelly", "Last Name": "Doe", "Email": "[email protected]", "Created At": datetime(2023, 3, 25)}
 ])
], ignore_index=True)

data

Here’s the resulting DataFrame:

Image 9 - Adding rows to Pandas DataFrame (Image by author)
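When rows arrive one at a time, for example inside a loop, a common pattern is to collect them in a plain Python list and call pd.concat() once at the end instead of on every iteration. The sketch below uses a small placeholder DataFrame and made-up names to show the idea:

```python
import pandas as pd

# Placeholder starting data.
team = pd.DataFrame([{"First Name": "Bob", "Last Name": "Doe"}])

# Collect incoming rows as dictionaries first...
new_rows = []
for first_name in ["Linda", "Kelly", "John"]:
    new_rows.append({"First Name": first_name, "Last Name": "Doe"})

# ...then concatenate once; ignore_index=True renumbers the rows from 0.
team = pd.concat([team, pd.DataFrame(new_rows)], ignore_index=True)

print(team)
```

Without ignore_index=True, the new rows would keep their own labels starting at 0, which would leave duplicate index values in the result.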

And that pretty much concludes the introduction to Pandas DataFrames. We’ll dive much, much deeper shortly, but this is enough for now.


Summing up

Pandas DataFrames are the core data structure for working with tabular data in Python. You saw how easy it is to create DataFrames from plain Python objects, such as lists and dictionaries, but also from NumPy arrays. You’ve also learned how to add rows and columns, which are the basic data manipulation techniques you’ll use daily.

Up next, we’ll dive much deeper into each of the subtopics discussed today, so stay tuned to Practical Pandas for the next piece.