
Introduction to Pandas

Pandas and NumPy are among the most fundamental libraries for data manipulation in Python. This post covers the most important basics of Pandas and can also serve as a cheat sheet for the most common commands.

Pandas provides tools for most data manipulation and analysis tasks. Its special data structures make it possible to store and process table-like objects and time series data. In many cases Pandas builds on NumPy, so the two libraries complement each other rather than compete, as is sometimes claimed.

import pandas as pd
import numpy as np

Pandas Object Creation

Pandas uses different data structures to store and process information. Unfortunately, we cannot go into detail about the different structures in this article and therefore refer to our other articles, e.g. about Pandas DataFrames.

Series

The Pandas Series object is similar to a one-dimensional NumPy array and can hold various data types, such as integers, floats, or strings.

s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

Out: 
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

In the output, we see two columns, although we have only defined values for the Series. The left column is the index, which by default numbers the values. We can access the index and the values with the following commands.

print(s.index)
print(s.values)

Out:
RangeIndex(start=0, stop=6, step=1)
[ 1.  3.  5. nan  6.  8.]
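
As a small sketch of how the index and the values fit together, the following snippet recreates the Series from above and inspects both parts. The variable names index_labels, values and n_missing are chosen only for this illustration.

```python
import numpy as np
import pandas as pd

# Recreate the Series from above; the np.nan entry makes the dtype float64
s = pd.Series([1, 3, 5, np.nan, 6, 8])

index_labels = list(s.index)      # default labels 0 to 5
values = s.to_numpy()             # the values as a NumPy array
n_missing = int(s.isna().sum())   # number of missing (NaN) entries

print(index_labels)
print(values.dtype)
print(n_missing)
```

The Series stores floats even though most inputs were integers, because NaN is a floating-point value.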

Of course, we can also define the index freely, which makes accessing elements by index easier and more understandable. The code also becomes easier to read when descriptive labels are used for access instead of numbers.

fruits = ['apples', 'oranges', 'cherries', 'pears']
quantities = [20, 33, 52, 10]
S = pd.Series(quantities, index=fruits)
print(S)

Out:
apples      20
oranges     33
cherries    52
pears       10
dtype: int64
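
To illustrate why a custom index helps, here is a minimal sketch of label-based access on the same Series. The names apples, subset and first are only used for this example.

```python
import pandas as pd

fruits = ['apples', 'oranges', 'cherries', 'pears']
quantities = [20, 33, 52, 10]
S = pd.Series(quantities, index=fruits)

apples = S['apples']               # a single value, selected by its label
subset = S[['oranges', 'pears']]   # a list of labels returns a smaller Series
first = S.iloc[0]                  # position-based access is still available

print(apples, first)
print(subset)
```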

DataFrame

For a detailed explanation of DataFrames and many code examples, feel free to read our separate post on Pandas DataFrames. Here we will show only the most basic commands for the sake of completeness.

We can create a DataFrame by passing a NumPy array and defining the index and the column names. Similar to the Series, the individual rows of the table can then be accessed via the index.

dates = pd.date_range("20220101", periods=6)
dates

Out:
DatetimeIndex(['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
               '2022-01-05', '2022-01-06'],
              dtype='datetime64[ns]', freq='D')

np.random.seed(0)  # fixed seed so that the random values are reproducible
df = pd.DataFrame(np.random.randn(6, 2), index=dates, columns=['column 1', 'column 2'])
df

Out:
            column 1  column 2
2022-01-01  1.764052  0.400157
2022-01-02  0.978738  2.240893
2022-01-03  1.867558 -0.977278
2022-01-04  0.950088 -0.151357
2022-01-05 -0.103219  0.410599
2022-01-06  0.144044  1.454274 
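
As a sketch of how a row is accessed via the index, the following example builds a smaller DataFrame with hand-written values (an assumption made for readability, not the random data from above) and selects one row by its date label. The name df_small is only illustrative.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20220101", periods=3)

# Hand-written values instead of random numbers, purely for illustration
data = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
df_small = pd.DataFrame(data, index=dates, columns=['column 1', 'column 2'])

# .loc selects by index label; a single label returns the row as a Series
row = df_small.loc[pd.Timestamp('2022-01-02')]
print(row)
```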

View data

We can view the first and last rows of a DataFrame with the following commands. The number in parentheses specifies how many rows to display. The default value is five.

df.head()

Out:
            column 1  column 2
2022-01-01  1.764052  0.400157
2022-01-02  0.978738  2.240893
2022-01-03  1.867558 -0.977278
2022-01-04  0.950088 -0.151357
2022-01-05 -0.103219  0.410599

df.tail(3)

Out:
            column 1  column 2
2022-01-04  0.950088 -0.151357
2022-01-05 -0.103219  0.410599
2022-01-06  0.144044  1.454274 

If we want to get a brief statistical overview of the data in each column, we can do that with df.describe():

df.describe()

Out:
       column 1  column 2
count  6.000000  6.000000
mean   0.933544  0.562881
std    0.807791  1.143871
min   -0.103219 -0.977278
25%    0.345555 -0.013479
50%    0.964413  0.405378
75%    1.567724  1.193355
max    1.867558  2.240893
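
The result of describe() is itself a DataFrame, so individual statistics can be read from it. The following sketch uses a small made-up column (df_demo is a hypothetical name) and compares the summary with the direct aggregation methods.

```python
import pandas as pd

# Small made-up data set so the statistics are easy to verify by hand
df_demo = pd.DataFrame({'column 1': [1.0, 2.0, 3.0, 4.0]})

summary = df_demo.describe()

# Single statistics can be read from the summary via row and column labels
mean_from_describe = summary.loc['mean', 'column 1']

# The same numbers are available through the direct aggregation methods
mean_direct = df_demo['column 1'].mean()
median = df_demo['column 1'].median()   # corresponds to the 50% row

print(mean_from_describe, mean_direct, median)
```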

In addition, we can display the data sorted by the values of a column by passing the column name to sort_values.

df.sort_values(by='column 2')

Out:
            column 1  column 2
2022-01-03  1.867558 -0.977278
2022-01-04  0.950088 -0.151357
2022-01-01  1.764052  0.400157
2022-01-05 -0.103219  0.410599
2022-01-06  0.144044  1.454274
2022-01-02  0.978738  2.240893
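
A short sketch of two common variations, using made-up data (df_demo is a hypothetical name): sorting in descending order with the ascending parameter and restoring the original order with sort_index. Both methods return a new DataFrame and leave the original unchanged.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'column 1': [3, 1, 2], 'column 2': [0.5, -1.0, 2.0]},
    index=['a', 'b', 'c'],
)

# Largest values of 'column 2' first
desc = df_demo.sort_values(by='column 2', ascending=False)

# Sorting by the index labels brings back the original row order
restored = desc.sort_index()

print(desc)
print(restored)
```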

Select Data

The explanations in this section also apply to Pandas Series objects, with a few exceptions, so we will skip separate examples for Series. We can select a single column of the DataFrame by using its name in square brackets.

df['column 1']

Out:
2022-01-01    1.764052
2022-01-02    0.978738
2022-01-03    1.867558
2022-01-04    0.950088
2022-01-05   -0.103219
2022-01-06    0.144044
Freq: D, Name: column 1, dtype: float64

We select individual rows of the DataFrame by slicing, either with their positions or with the index labels we have assigned to them.

df[0:3]

Out: 
            column 1  column 2
2022-01-01  1.764052  0.400157
2022-01-02  0.978738  2.240893
2022-01-03  1.867558 -0.977278

df["20220102":"20220104"]

Out:
            column 1  column 2
2022-01-02  0.978738  2.240893
2022-01-03  1.867558 -0.977278
2022-01-04  0.950088 -0.151357
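
The two slicing styles above can also be written explicitly with iloc (positions) and loc (labels). The sketch below uses simple integer values as an assumption so that the result is easy to check. Note that position slices exclude the end, while label slices include both endpoints.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20220101", periods=6)

# Integer values 0 to 11 instead of random numbers, purely for illustration
df_demo = pd.DataFrame(
    np.arange(12).reshape(6, 2),
    index=dates,
    columns=['column 1', 'column 2'],
)

by_position = df_demo.iloc[0:3]                  # rows at positions 0, 1, 2
by_label = df_demo.loc["20220102":"20220104"]    # start and end label included

print(by_position)
print(by_label)
```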

To keep only the rows that meet a certain condition, we state the column and the condition its values must satisfy. Keep in mind that in Python a comparison for equality requires a double equal sign (==), because a single equal sign is an assignment.

df[df['column 1'] > 0.9]

Out:
            column 1  column 2
2022-01-01  1.764052  0.400157
2022-01-02  0.978738  2.240893
2022-01-03  1.867558 -0.977278
2022-01-04  0.950088 -0.151357
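
The following sketch shows filtering with the double equal sign and with two conditions combined, using made-up integer data (df_demo is a hypothetical name). Each condition needs its own parentheses when conditions are combined with & (and) or | (or).

```python
import pandas as pd

df_demo = pd.DataFrame({
    'column 1': [1, 2, 3, 2],
    'column 2': [10, 20, 30, 40],
})

# The comparison yields a Boolean Series with one entry per row
mask = df_demo['column 1'] == 2
equal_two = df_demo[mask]

# Several conditions are combined element-wise with & and parentheses
combined = df_demo[(df_demo['column 1'] == 2) & (df_demo['column 2'] > 25)]

print(equal_two)
print(combined)
```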

That concludes this short introduction to the most basic commands in Pandas. The second part will follow in a few days.

This is what you should take with you

  • Pandas is a basic library for data manipulation and analysis.
  • It complements NumPy and builds on NumPy arrays, among other things.

Other Articles on the Topic of Pandas

  • The official documentation of Pandas can be found here.
  • This post is mainly based on the tutorial from Pandas. You can find it here.