Pandas DataFrame – explained in 2 minutes!

The DataFrame in the Python library Pandas is most easily thought of as a table-like object that stores data in rows and columns. Put simply, Pandas offers much the same capabilities as structured arrays in NumPy, with the difference that rows and columns can be addressed by name rather than only by numeric index. This makes working with large data sets and many columns easier and the code more readable.

DataFrame as a collection of Series objects

The Series object in Pandas is a one-dimensional array with an explicit index whose labels are used to access individual entries. In Python you can create such an object with the following commands:

import pandas as pd

area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)

The index of this Series consists of five American states, and the values are the corresponding areas in km². A second Series with the same index, i.e. the same five states, contains the number of inhabitants per state:

population_dict = {'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135}
population = pd.Series(population_dict)
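
To make the role of the index concrete, here is a minimal sketch of how entries of such a Series can be addressed. It uses a shortened version of the area data from above; the variable names `texas_area`, `first_area` and `subset` are only illustrative.

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297})

# Access by label, using the index
texas_area = area['Texas']

# Access by position, as with a plain array
first_area = area.iloc[0]

# Slicing by label includes the end label
subset = area['California':'Texas']
```

Note that a label slice, unlike a positional slice, includes its end point.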

Since both Series objects have the same index, we can combine them into one DataFrame object, with the index values (the five states) as rows and the categories (area and population) as columns:

states = pd.DataFrame({'population': population, 'area': area})
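
The combination works through the index labels, not through the position of the entries. The following sketch, with shortened data and the two Series deliberately written in different orders, illustrates that the values still end up in the correct row:

```python
import pandas as pd

# Same labels, but listed in a different order in each Series
area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297})
population = pd.Series({'New York': 19651127, 'California': 38332521, 'Texas': 26448193})

states = pd.DataFrame({'population': population, 'area': area})

# Each row holds the values that belong to its label
new_york = states.loc['New York']
```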

Just like the Series objects before it, the DataFrame still has an index that can be used to target the rows:

states.index

In addition, the columns of the table-like DataFrame can also be accessed by their names:

states.columns
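
As a small sketch of how these two attributes are used in practice (shortened data, illustrative variable names), the row labels go into `.loc` together with an optional column name:

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193})
area = pd.Series({'California': 423967, 'Texas': 695662})
states = pd.DataFrame({'population': population, 'area': area})

row_labels = list(states.index)       # the state names
column_labels = list(states.columns)  # 'population' and 'area'

# Target a whole row, or a single cell, by label
texas_row = states.loc['Texas']
texas_area = states.loc['Texas', 'area']
```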

DataFrame as a specialized dictionary

Another way to interpret a DataFrame is to think of it as a specialized dictionary: the DataFrame maps each column name to the Series holding that column's data, just as a dictionary maps a key to a value. We can also query it in the same way as a dictionary, but get the whole column back rather than a single value:

states['area']
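
The dictionary analogy goes a bit further than reading a column. The sketch below uses shortened data; the `density` column is an addition for illustration and does not appear elsewhere in this article.

```python
import pandas as pd

states = pd.DataFrame({
    'population': pd.Series({'California': 38332521, 'Texas': 26448193}),
    'area': pd.Series({'California': 423967, 'Texas': 695662}),
})

# Reading a "key" returns the whole column as a Series
area_column = states['area']

# Membership tests check the column names, not the row labels
has_area = 'area' in states
has_texas = 'Texas' in states

# Assigning to a new "key" adds a column (hypothetical example column)
states['density'] = states['population'] / states['area']
```

As with a dictionary, the `in` operator only looks at the keys, which here are the column names.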

Create a DataFrame

In general, there are four different ways to create a DataFrame, all of which can be useful depending on the use case:

1. From a single Series object. The DataFrame is a collection of multiple Series objects. However, it can also be created from a single Series and then has only one column:

pd.DataFrame(population, columns=['Population'])

2. From a list of dictionaries. The dictionaries do not need to share the same keys; missing values are filled with NaN (‘not a number’). The number of columns is therefore the number of unique keys, and the number of rows is the number of dictionaries:

pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

3. From a dictionary of Series objects. This way has already been described in detail in the previous sections:

pd.DataFrame({'population': population, 'area': area})

4. From a two-dimensional NumPy array. Column and row labels can be passed via the columns and index arguments. If no labels are given, integers are used as the column and row index:

import numpy as np
pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=['col1', 'col2'], index=['row1', 'row2'])
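
Two of the statements in the list above are easy to check directly: the NaN filling for a list of dictionaries, and the integer labels for an unlabeled array. A minimal sketch, with illustrative variable names:

```python
import numpy as np
import pandas as pd

# List of dictionaries with partly different keys
from_dicts = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
missing = from_dicts.isna()  # True wherever a key was absent

# Two-dimensional array without and with labels
data = np.array([[1, 2], [3, 4]])
unlabeled = pd.DataFrame(data)
labeled = pd.DataFrame(data, columns=['col1', 'col2'], index=['row1', 'row2'])
```

Because NaN is a floating-point value, the columns `a` and `c` are stored as floats, while `b` stays an integer column.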

In other programming languages, such as R, it is common to create an empty object and then fill it step by step in a loop, as you can do with lists in Python. This approach is possible with Pandas DataFrames, but should be avoided where possible. Instead, collect the data in lists or dictionaries first and then convert it to a DataFrame in a single step. This saves a lot of time and memory (see also Stack Overflow).
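
A minimal sketch of this pattern could look as follows. The loop and the column names `n` and `square` are made up for illustration; in practice the rows would come from a file, an API, or a computation.

```python
import pandas as pd

# Collect the rows in a plain Python list first ...
rows = []
for n in range(5):
    rows.append({'n': n, 'square': n * n})

# ... and build the DataFrame once at the end
df = pd.DataFrame(rows)
```

Appending to a list is cheap, whereas growing a DataFrame row by row typically creates a new copy of the data at every step.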

This is what you should take with you

  • The Pandas DataFrame is a very important element in data preparation for machine learning.
  • The DataFrame can be understood as a collection of Series objects or as a specialized dictionary.

Other Articles on the Topic of Pandas DataFrame

  • You can find the official documentation of Pandas here.