
Pandas DataFrame – create one in 2 minutes!

The DataFrame (short: DF) in the Python library Pandas can most easily be thought of as a table-like object consisting of data stored in rows and columns. Pandas offers, simply speaking, the same possibilities as structured arrays in NumPy with the difference that the rows and columns can be addressed by name instead of having to call them by number index. This makes working with large data sets and many columns easier and the code more understandable. 

What is Pandas?

Pandas is a powerful open-source data manipulation library for Python. It provides an easy-to-use data structure called DataFrame that is particularly useful for data analysis tasks. The easiest way to think of Pandas is as the “Excel of Python,” because many functionalities from Microsoft Excel can also be performed with Pandas. In addition, Pandas offers many more functions and usually handles large data sets considerably faster.

What is a Pandas DataFrame?

A Pandas DataFrame is a two-dimensional, labeled data structure that resembles a spreadsheet or SQL table. It consists of rows and columns, where each column can have a different data type. Rows are labeled by an index and columns by column names, which makes it easy to select, filter, and manipulate data.

Representation of a DF in Python | Source: Author

How can DataFrames be understood?

Before we can start creating DFs, we need to understand what objects they can be composed of. This knowledge is necessary because there are different ways to create DFs and they depend on which data object you use as a base. This is the only way to understand the possibilities of DFs and how they differ from pure tables, with which they are often compared.

To understand the following sections, a basic knowledge of data structures in Python and Pandas is a prerequisite. If you don’t have this or want to refresh your knowledge, feel free to use our articles on Pandas Series and Python Dictionary.

Python Dictionary
Example of a dictionary in Python | Source: Author

DataFrame as a collection of Series objects

The Series object in Pandas is a one-dimensional array with an explicitly defined index of labels for accessing individual entries. In Python you can create such an object with the following command:

import pandas as pd

area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)

The Series uses five US states as its index and stores the area of each state in km² as the value. A second Series with the same index, i.e. the same five US states, contains the number of inhabitants per state.

population_dict = {'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135}
population = pd.Series(population_dict)

Since both Series objects have the same index, we can combine them into one DataFrame object, with the index values (the five states) as rows and the categories (area and population) as columns:

states_df = pd.DataFrame({'population': population, 'area': area})

Just like the Series objects before it, the DF still has an index that can be used to target the rows:
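
As a minimal sketch, the index can be inspected and used for row selection as follows. The data is rebuilt here so the snippet runs on its own; the name texas_row is only illustrative.

```python
import pandas as pd

# Same data as above, rebuilt so the snippet is self-contained
area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297,
                  'Florida': 170312, 'Illinois': 149995})
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127, 'Florida': 19552860,
                        'Illinois': 12882135})
states_df = pd.DataFrame({'population': population, 'area': area})

# The row labels of the DF (the five states)
print(states_df.index)

# Select a single row by its index label (illustrative variable name)
texas_row = states_df.loc['Texas']
print(texas_row)
```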


In addition, the columns of the table-like DataFrame can also be accessed by their names:
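
A short sketch of what column access by name might look like, using a two-state subset of the data. The names column_names and the added density column are assumptions for illustration only.

```python
import pandas as pd

# Two-state subset of the data above, rebuilt so the snippet is self-contained
states_df = pd.DataFrame({
    'population': pd.Series({'California': 38332521, 'Texas': 26448193}),
    'area': pd.Series({'California': 423967, 'Texas': 695662}),
})

# The column labels of the DF
column_names = list(states_df.columns)
print(column_names)

# Columns addressed by name can be combined directly,
# here into a hypothetical population density column
states_df['density'] = states_df['population'] / states_df['area']
print(states_df)
```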


DataFrame as a specialized dictionary

Another approach to interpreting DF objects is to think of them as specialized dictionaries, where the DF maps a column name to the Series object holding that column's data, just as a dictionary maps a key to a value. We can also query a DF in the same way as a dictionary, but get the whole column rather than just a specific value:
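
The dictionary analogy can be sketched as follows, again with a two-state subset of the data. The variable name area_column is illustrative.

```python
import pandas as pd

# Two-state subset of the data above, rebuilt so the snippet is self-contained
states_df = pd.DataFrame({
    'population': pd.Series({'California': 38332521, 'Texas': 26448193}),
    'area': pd.Series({'California': 423967, 'Texas': 695662}),
})

# Dictionary-style lookup returns the whole column as a Series
area_column = states_df['area']
print(type(area_column).__name__)

# Like a dictionary, membership tests and keys() refer to the column names
print('area' in states_df)
print(list(states_df.keys()))
```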


How to create a Pandas DataFrame?

In general, there are four different ways to create a DF, all of which can be useful depending on the use case:

1. From a single Series object. The DF is a collection of multiple Series objects. However, it can also be created from a single Series and then have only one column:

pd.DataFrame(population, columns=['Population'])

2. From a list of dictionaries. Even if not all dictionaries have the same keys, the missing values are filled with NaN (‘not a number’). The number of columns is therefore the number of unique keys and the number of rows is the number of dictionaries:

pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

3. From a dictionary of Series objects. This way has already been described in detail in the previous sections:

pd.DataFrame({'population': population, 'area': area})

4. From a two-dimensional NumPy array. The rows and columns of a two-dimensional NumPy array can be turned directly into a DF. If no row or column labels are passed, integer positions are used as the index and as column names:

import numpy as np
pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=['col1', 'col2'], index=['row1', 'row2'])
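
Two details from the list above can be checked with a short sketch: keys missing from a dictionary show up as NaN, and an unlabeled array receives integer labels. The names records_df and array_df are illustrative.

```python
import numpy as np
import pandas as pd

# From a list of dictionaries: one row per dictionary, one column per unique key.
# Keys that a dictionary does not contain become NaN in that row.
records_df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
print(records_df)

# From a two-dimensional NumPy array without labels:
# pandas falls back to integer positions for rows and columns.
array_df = pd.DataFrame(np.array([[1, 2], [3, 4]]))
print(array_df.columns.tolist())
print(array_df.index.tolist())
```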

How can an Excel or CSV file be read in?

If you don’t want to create a new object and instead want to use an already existing file, this can also be implemented with Pandas. Using the functions “read_csv()” or “read_excel()” the corresponding files can be read and are directly converted into a DataFrame.

import pandas as pd

df = pd.read_csv("name_of_csv.csv")
df2 = pd.read_excel("name_of_excel.xlsx")

The file name must include the correct path and file extension. If the file cannot be found under the given name, Pandas raises a FileNotFoundError.
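
To try read_csv() without a file on disk, the following sketch reads CSV text from an in-memory buffer. The CSV content and the variable names are made up for illustration; with a real file you would pass its path instead of the buffer.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = "state,area\nCalifornia,423967\nTexas,695662\n"

# read_csv accepts a path or a file-like object
df = pd.read_csv(io.StringIO(csv_text))
print(df)

# index_col turns one of the columns into the row index while reading
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col='state')
print(df_indexed.loc['Texas', 'area'])
```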

Can you create a DataFrame in a for loop?

In other programming languages, such as R, it is normal to create an empty object and then fill it step by step with a loop, as you can do with lists in Python, for example. This approach is possible with Pandas DataFrames, but should be avoided where possible, because growing a DF row by row copies the data at every step. Data should be collected in lists or dictionaries first and then converted into a DF in one step. This approach saves a lot of time and memory (see also Stack Overflow).
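
A minimal sketch of the recommended pattern: collect the rows in a plain list inside the loop and build the DF once at the end. The column names n and square are made up for the example.

```python
import pandas as pd

# Collect one dictionary per row in an ordinary Python list
rows = []
for i in range(5):
    rows.append({'n': i, 'square': i ** 2})

# Build the DataFrame in a single step after the loop
squares_df = pd.DataFrame(rows)
print(squares_df)
```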

What are the best practices when working with a Pandas DataFrame?

Pandas DataFrames are incredibly versatile and powerful, but to make the most of them and avoid common pitfalls, it’s essential to follow some best practices and tips. Whether you’re a beginner or an experienced data analyst, these guidelines will help you work more efficiently and effectively with DataFrames.

1. Import Pandas Correctly

When you start a Python script or Jupyter Notebook, always import pandas correctly:

import pandas as pd

This convention makes it clear that you’re using pandas and simplifies your code.

2. Use read_csv and read_excel for Data Input

When reading data from external sources like CSV files or Excel spreadsheets, use pd.read_csv() and pd.read_excel() functions. These functions handle various file formats and can automatically detect data types.

df = pd.read_csv('data.csv')

3. Set the Index Wisely

Select an appropriate column as the DataFrame index, especially when dealing with time series data. A meaningful index makes label-based lookups and time-based operations easier and can speed up data retrieval.

df.set_index('date', inplace=True)
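
As a small sketch with made-up data, a date column can be turned into the index and then used for lookups by label. This version reassigns the result instead of using inplace=True; the column name value is an assumption.

```python
import pandas as pd

# Hypothetical time series with a 'date' column
df = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03']),
    'value': [10, 20, 30],
})

# Use the dates as row labels
df = df.set_index('date')

# Rows can now be addressed by their timestamp
value_on_second = df.loc[pd.Timestamp('2024-01-02'), 'value']
print(value_on_second)
```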

4. Avoid Iterating Over Rows

Pandas is designed for vectorized operations. Avoid iterating over DataFrame rows using loops whenever possible; instead, use built-in functions for operations on entire columns.

# Bad practice (slow):
for index, row in df.iterrows():
    df.loc[index, 'new_column'] = row['old_column'] * 2

# Better practice (faster):
df['new_column'] = df['old_column'] * 2

5. Use .loc and .iloc for Selection

When selecting rows or columns, use .loc and .iloc for label-based and integer-based indexing, respectively. This makes the intent of a selection explicit and avoids the ambiguity of plain bracket notation.

# Label-based indexing (.loc also accepts a boolean mask)
df.loc[df['column'] > 5]

# Integer-based indexing
df.iloc[2:5, 1:3]
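
The difference between the two accessors can be seen in a small sketch with made-up data. Note that label slices with .loc include the end label, while position slices with .iloc exclude the end position.

```python
import pandas as pd

# Hypothetical data with string row labels
df = pd.DataFrame(
    {'column': [3, 8, 6, 1], 'other': [10, 20, 30, 40]},
    index=['a', 'b', 'c', 'd'],
)

by_label = df.loc['b', 'other']       # row label 'b', column label 'other'
by_position = df.iloc[1, 1]           # second row, second column

filtered = df.loc[df['column'] > 5]   # boolean mask: rows where the condition holds

label_slice = df.loc['a':'c']         # end label 'c' is included
position_slice = df.iloc[0:2]         # end position 2 is excluded

print(by_label, by_position)
print(filtered)
```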

6. Avoid Chained Indexing

Chained indexing, such as df['column']['row'], can lead to unpredictable behavior and should be avoided. Use .loc or .iloc for explicit and unambiguous indexing.

# Bad practice (chained indexing)
df['column']['row']

# Better practice (explicit indexing)
df.loc['row', 'column']
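
The problem mainly shows up when assigning values. In the following sketch with made-up data, the chained variant is left commented out because its effect depends on the pandas version, while the .loc assignment reliably writes to the DF.

```python
import pandas as pd

# Hypothetical data with string row labels
df = pd.DataFrame({'column': [1, 2, 3]}, index=['row', 'row2', 'row3'])

# Chained assignment may write to a temporary object instead of df:
# df['column']['row'] = 100   # avoid

# A single .loc call addresses row and column in one step
df.loc['row', 'column'] = 100
print(df)
```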

7. Handle Missing Data Appropriately

Use methods like .isna(), .fillna(), or .dropna() to handle missing data. Deciding whether to impute, remove, or leave missing values depends on your analysis and dataset.

# Replace missing values with the mean
df['column'] = df['column'].fillna(df['column'].mean())
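
The three methods mentioned above can be compared in a short sketch with made-up data. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
df = pd.DataFrame({'column': [1.0, np.nan, 3.0]})

# Count the missing values
missing_count = df['column'].isna().sum()

# Impute: replace missing values with the column mean (NaN is ignored by mean())
filled = df['column'].fillna(df['column'].mean())

# Remove: drop every row that contains a missing value
dropped = df.dropna()

print(missing_count)
print(filled.tolist())
print(len(dropped))
```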

8. Avoid In-Place Modifications

While in-place modifications can be useful, they can also lead to unexpected changes. Be cautious when using methods like .drop() or .fillna() in place. Consider creating a new DataFrame instead.

# In-place modification (use with caution)
df.drop('column', axis=1, inplace=True)

# Safer approach (creates a new DataFrame)
new_df = df.drop('column', axis=1)

9. Optimize Memory Usage

DataFrames can consume a lot of memory, especially with large datasets. To optimize memory usage:

  • Choose appropriate data types (e.g., use a smaller integer or float type such as int32 or float32 when the value range and precision allow it).
  • Use the category dtype for columns with a limited number of unique values.
  • Replace numerical columns that hold the same value in every row with a single stored value.

df['category_column'] = df['category_column'].astype('category')
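
To see whether a conversion pays off, the memory footprint can be measured before and after. The sketch below uses made-up data with only three distinct values; the exact byte counts depend on the pandas version and are therefore only printed, not stated.

```python
import pandas as pd

# Hypothetical column with many rows but only three unique values
df = pd.DataFrame({'category_column': ['red', 'green', 'blue'] * 1000})

# deep=True also counts the memory of the stored strings
before = df['category_column'].memory_usage(deep=True)

df['category_column'] = df['category_column'].astype('category')
after = df['category_column'].memory_usage(deep=True)

print(before, after)
```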

10. Document Your Code and Workflow

Data analysis can become complex. Document your code, provide clear explanations in comments, and maintain a record of your data preprocessing and analysis steps to make your work reproducible.

# Clean and preprocess the data (clean_data stands for your own preprocessing function)
df_cleaned = clean_data(df)

# Save the cleaned data to a new file
df_cleaned.to_csv('cleaned_data.csv', index=False)

By following these best practices and tips, you can make your data analysis projects more efficient, maintainable, and less error-prone when working with pandas DataFrames. Remember that pandas offers a rich set of functionality, so exploring the official documentation and learning new techniques is a valuable part of your journey as a data analyst or scientist.

This is what you should take with you

  • The Pandas DataFrame is a very important element in data preparation for artificial intelligence.
  • It can be understood as a SQL table with rows and columns, but it offers even more functionalities.
  • The DataFrame can be understood as a collection of Series objects or as a specialized dictionary.
  • It can be created either from a single Series object, from a list of dictionaries, from a two-dimensional NumPy array or from a dictionary of Series objects.
  • In addition, it is possible to read the DataFrame directly from a file, such as a CSV or Excel file.

Other Articles on the Topic of Pandas DataFrame

  • You can find the official documentation of Pandas here.