
Pandas DataFrame – create one in 2 minutes!

The DataFrame (short: DF) in the Python library Pandas can most easily be thought of as a table-like object consisting of data stored in rows and columns. Pandas offers, simply speaking, the same possibilities as structured arrays in NumPy with the difference that the rows and columns can be addressed by name instead of having to call them by number index. This makes working with large data sets and many columns easier and the code more understandable. 

What is Pandas?

Pandas is an open-source data manipulation library for Python. It provides an easy-to-use data structure called the DataFrame that is particularly useful for data analysis tasks. The easiest way to think of Pandas is as the “Excel of Python,” because many operations known from Microsoft Excel can also be performed with Pandas. Beyond that, Pandas offers many more functions and typically handles large data sets considerably faster, because operations are applied to whole columns at once.

What is a Pandas DataFrame?

A Pandas DataFrame is a two-dimensional, labeled data structure that resembles a spreadsheet or SQL table. It consists of rows and columns, where each column can have a different data type. Rows are labeled by an index and columns by a column name, which makes it easy to select, filter, and manipulate data.

Representation of a DF in Python | Source: Author
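As a rough illustration of this definition, the following sketch builds a small DF in which every column has its own data type. The names and values are hypothetical sample data and do not come from the article.

```python
import pandas as pd

# Hypothetical sample data: each column holds a different data type.
df = pd.DataFrame(
    {
        "name": ["Anna", "Ben", "Cara"],   # text
        "age": [28, 35, 41],               # integers
        "height_m": [1.68, 1.82, 1.75],    # floating-point numbers
    },
    index=["a", "b", "c"],                 # row labels
)

print(df.dtypes)            # one data type per column
print(df.loc["b", "age"])   # select a single value by row and column label
```

The call to `df.loc` addresses the value by its row label and column name rather than by position, which is the main convenience described above.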

How can DataFrames be understood?

Before we can start creating DFs, we need to understand what objects they can be composed of. This knowledge is necessary because there are different ways to create DFs and they depend on which data object you use as a base. This is the only way to understand the possibilities of DFs and how they differ from pure tables, with which they are often compared.

To understand the following sections, a basic knowledge of data structures in Python and Pandas is a prerequisite. If you don’t have this or want to refresh your knowledge, feel free to use our articles on Pandas Series and Python Dictionary.

Example of a dictionary in Python | Source: Author

DataFrame as a collection of Series objects

The Series object in Pandas is a one-dimensional array with a labeled index that is used to access individual entries. In Python you can create such an object with the following commands:

import pandas as pd

area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)

This Series uses five US states as its index and stores the area of each state in km² as the value. A second Series with the same index, i.e. the same five states, contains the number of inhabitants per state.

population_dict = {'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135}
population = pd.Series(population_dict)

Since both Series objects have the same index, we can combine them into one DataFrame object, with the index values (the five states) as rows and the categories (area and population) as columns:

states_df = pd.DataFrame({'population': population, 'area': area})

Just like the Series objects before it, the DF still has an index that can be used to target the rows:

states_df.index

In addition, the columns of the table-like DataFrame can also be accessed by their names:

states_df.columns
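Put together, the steps of this section can be sketched as one self-contained script. It uses the same area and population figures as above; the `.tolist()` calls are only added here to print the labels as plain Python lists.

```python
import pandas as pd

# Two Series that share the same index (five US states).
area = pd.Series({"California": 423967, "Texas": 695662, "New York": 141297,
                  "Florida": 170312, "Illinois": 149995})
population = pd.Series({"California": 38332521, "Texas": 26448193,
                        "New York": 19651127, "Florida": 19552860,
                        "Illinois": 12882135})

# Each Series becomes one column; the shared index becomes the row labels.
states_df = pd.DataFrame({"population": population, "area": area})

print(states_df)
print(states_df.index.tolist())    # row labels (the states)
print(states_df.columns.tolist())  # column names
```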

DataFrame as a specialized dictionary

Another way to interpret a DF object is to think of it as a specialized dictionary, where the DF maps a column name to the Series object holding that column's data, just as a dictionary maps a key to a value. We can also query it in the same way as a dictionary, but get the whole column rather than just a specific value:

states_df['area']
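The dictionary analogy can be checked with a short sketch. To keep the example small, it assumes a reduced version of the states data with only three states; the variable name `area_column` is chosen for illustration.

```python
import pandas as pd

# Reduced version of the states data, only to keep the example short.
area = pd.Series({"California": 423967, "Texas": 695662, "New York": 141297})
population = pd.Series({"California": 38332521, "Texas": 26448193,
                        "New York": 19651127})
states_df = pd.DataFrame({"population": population, "area": area})

# Dictionary-style lookup with a column name returns the whole column.
area_column = states_df["area"]
print(type(area_column))       # a pandas Series, not a single value
print(area_column["Texas"])    # the Series can then be queried by row label

# As with dictionary keys, membership tests refer to the column names.
print("area" in states_df)
print("Texas" in states_df)
```

Note that the membership test looks at column names only, so a row label such as a state name is not found this way.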

How to create a Pandas DataFrame?

In general, there are four different ways to create a DF, all of which can be useful depending on the use case:

1. From a single Series object. The DF is a collection of multiple Series objects. However, it can also be created from a single Series and then has only one column:

pd.DataFrame(population, columns=['Population'])

2. From a list of dictionaries. Each dictionary becomes one row. If not all dictionaries have the same keys, the missing values are filled with NaN ('not a number'). The number of columns is therefore the number of unique keys and the number of rows is the number of dictionaries:

pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

3. From a dictionary of Series objects. This way has already been described in detail in the previous sections:

pd.DataFrame({'population': population, 'area': area})

4. From a two-dimensional NumPy array. A two-dimensional NumPy array can be converted into a DF directly. If no column or index labels are passed, consecutive integers are used as labels instead:

import numpy as np
pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=['col1', 'col2'], index=['row1', 'row2'])
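Two details from the list above are easy to verify in a short sketch: the NaN filling for a list of dictionaries with different keys, and the default integer labels for a NumPy array without explicit labels. The variable names are chosen for illustration only.

```python
import numpy as np
import pandas as pd

# List of dictionaries with partly different keys: one row per dictionary,
# one column per unique key, missing entries filled with NaN.
from_dicts = pd.DataFrame([{"a": 1, "b": 2}, {"b": 3, "c": 4}])
print(from_dicts)
print(from_dicts.isna().sum().sum())   # number of missing values in total

# Two-dimensional NumPy array without labels: integers are used as labels.
arr = np.array([[1, 2], [3, 4]])
from_array = pd.DataFrame(arr)
print(from_array.columns.tolist(), from_array.index.tolist())

# The same array with explicit column and row labels.
labeled = pd.DataFrame(arr, columns=["col1", "col2"], index=["row1", "row2"])
print(labeled.loc["row2", "col1"])
```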

How can an Excel or CSV file be read in?

If you don’t want to create a new object and instead want to use an already existing file, this can also be implemented with Pandas. Using the functions “read_csv()” or “read_excel()” the corresponding files can be read and are directly converted into a DataFrame.

import pandas as pd

df = pd.read_csv("name_of_csv.csv")
df2 = pd.read_excel("name_of_excel.xlsx")

The file name must include the correct extension and, if the file is not in the current working directory, the path to it. Otherwise, Pandas cannot find the file and raises a FileNotFoundError.
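To try out `read_csv()` without having a file on disk, the following sketch reads CSV text from an in-memory buffer. The `io.StringIO` object is only a stand-in for a real file here, and the CSV content is a small excerpt of the states data used earlier. The optional `index_col` parameter turns one column into the row index.

```python
import io

import pandas as pd

# In-memory stand-in for a CSV file (assumption: a real file would be passed
# by its path instead, e.g. "name_of_csv.csv").
csv_text = """state,population,area
California,38332521,423967
Texas,26448193,695662
"""

# Read the CSV text and use the "state" column as the row index.
df = pd.read_csv(io.StringIO(csv_text), index_col="state")

print(df)
print(df.loc["Texas", "area"])
```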

Can you create a DataFrame in a for loop?

In other programming languages, such as R, it is common to create an empty object and then fill it step by step in a loop, as you can do with lists in Python, for example. This approach is possible with Pandas DataFrames, but should be avoided where possible, because growing a DF row by row typically creates a new copy of the data each time. Instead, collect the data in lists or dictionaries first and then convert it into a DF in one step. This saves a lot of time and memory (see also Stack Overflow).
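A minimal sketch of the recommended pattern: the loop only appends plain dictionaries to a list, and the DF is created once at the end. The column names `n` and `square` are hypothetical and serve only as an example.

```python
import pandas as pd

# Collect the rows as plain dictionaries inside the loop ...
rows = []
for i in range(5):
    rows.append({"n": i, "square": i ** 2})

# ... and build the DataFrame in a single step afterwards.
df = pd.DataFrame(rows)

print(df)
```

Appending to a Python list is cheap, so the cost of building the DF is paid only once instead of in every iteration.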

This is what you should take with you

  • The Pandas DataFrame is a very important element in data preparation for artificial intelligence.
  • It can be understood as a SQL table with rows and columns, but it offers even more functionalities.
  • The DataFrame can be understood as a collection of Series objects or as a specialized dictionary.
  • It can be created either from a single Series object, from a list of dictionaries, from a two-dimensional NumPy array or from a dictionary of Series objects.
  • In addition, it is possible to read the DataFrame directly from a file, such as a CSV or Excel file.

Other Articles on the Topic of Pandas DataFrame

  • You can find the official documentation of Pandas here.