Pandas and NumPy are among the most fundamental libraries for data manipulation in Python. This post covers the most important basics of Pandas and can also be used as a cheat sheet for the most common commands.
Pandas provides tools for most common data manipulation and analysis tasks. Its special data structures store and process table-like objects and time series data. Pandas builds on NumPy in many places, so the two libraries complement each other rather than compete, contrary to what is sometimes claimed.
import pandas as pd
import numpy as np
Pandas Object Creation
Pandas uses different data structures to store and process information. This article only introduces them briefly; for more detail, see our other articles, e.g. the one about Pandas DataFrames.
Series
The Pandas Series object is similar to a one-dimensional NumPy array and can hold various data types, such as integers, floats, or strings.
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
Out:
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
In the output, we see two columns, although we have only defined values for the Series. The left column is the index, which by default numbers the values. We can access the index and the values with the following commands.
print(s.index)
print(s.values)
Out:
RangeIndex(start=0, stop=6, step=1)
[ 1. 3. 5. nan 6. 8.]
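The output above shows dtype float64 even though most of the values were entered as integers. The following minimal sketch illustrates why, and how the missing value can be inspected. The variable names are chosen for illustration only.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

# np.nan is a float, so the whole Series is stored as float64
dtype_name = str(s.dtype)

# Count the missing values
n_missing = int(s.isna().sum())

# Aggregations such as mean() skip NaN by default
mean_without_nan = s.mean()

# Index and values can be converted to a plain list / NumPy array
index_as_list = s.index.tolist()
values_as_array = s.to_numpy()

print(dtype_name, n_missing, mean_without_nan)
```

Here the mean is computed over the five non-missing values only.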
Of course, we can also define the index ourselves, which makes accessing the elements by index easier to understand. The code also becomes easier to read when descriptive labels are used for access instead of numbers.
fruits = ['apples', 'oranges', 'cherries', 'pears']
quantities = [20, 33, 52, 10]
S = pd.Series(quantities, index=fruits)
print(S)
Out:
apples 20
oranges 33
cherries 52
pears 10
dtype: int64
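With a custom index, elements can be accessed by their label. The sketch below reuses the fruit example; the result variable names are illustrative.

```python
import pandas as pd

fruits = ['apples', 'oranges', 'cherries', 'pears']
quantities = [20, 33, 52, 10]
S = pd.Series(quantities, index=fruits)

# Access a single element by its label
cherries = S['cherries']

# Access several elements at once by passing a list of labels
subset = S[['apples', 'pears']]

# .loc is the explicit label-based accessor, .iloc the position-based one
first_by_label = S.loc['apples']
first_by_position = S.iloc[0]

print(cherries)
print(subset)
```

Using .loc and .iloc makes it explicit whether a label or a position is meant, which helps when the labels themselves are numbers.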
DataFrame
For a detailed explanation of DataFrames and many code examples, feel free to read our separate post on Pandas DataFrames. Here we will show only the most basic commands for the sake of completeness.
We can create a DataFrame by passing a NumPy array and defining the column names. If we also pass an index, here a range of dates, we can select the individual rows of the table via the index, similar to the Series.
dates = pd.date_range("20220101", periods=6)
dates
Out:
DatetimeIndex(['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
'2022-01-05', '2022-01-06'],
dtype='datetime64[ns]', freq='D')
np.random.seed(0)  # fixed seed so that the random values are reproducible
df = pd.DataFrame(np.random.randn(6, 2), index=dates, columns=['column 1', 'column 2'])
df
Out:
column 1 column 2
2022-01-01 1.764052 0.400157
2022-01-02 0.978738 2.240893
2022-01-03 1.867558 -0.977278
2022-01-04 0.950088 -0.151357
2022-01-05 -0.103219 0.410599
2022-01-06 0.144044 1.454274
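As an alternative sketch, a DataFrame can also be built from a dictionary that maps column names to lists of values. The numbers below are copied (rounded to six decimals) from the output above so that the example does not depend on random data.

```python
import pandas as pd

dates = pd.date_range("20220101", periods=6)

# Values copied from the article's output, rounded to six decimals
df = pd.DataFrame(
    {
        'column 1': [1.764052, 0.978738, 1.867558, 0.950088, -0.103219, 0.144044],
        'column 2': [0.400157, 2.240893, -0.977278, -0.151357, 0.410599, 1.454274],
    },
    index=dates,
)

# Basic attributes for a first look at the structure
n_rows, n_cols = df.shape
column_names = df.columns.tolist()
first_date = df.index[0]
column_dtypes = df.dtypes

print(n_rows, n_cols)
print(column_names)
```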
View data
We can view the first and last rows of a DataFrame with the following commands. In parentheses, we specify how many rows to display. The default value is five.
df.head()
Out:
column 1 column 2
2022-01-01 1.764052 0.400157
2022-01-02 0.978738 2.240893
2022-01-03 1.867558 -0.977278
2022-01-04 0.950088 -0.151357
2022-01-05 -0.103219 0.410599
df.tail(3)
Out:
column 1 column 2
2022-01-04 0.950088 -0.151357
2022-01-05 -0.103219 0.410599
2022-01-06 0.144044 1.454274
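A small sketch of how the row count affects head() and tail(), using the rounded values from the output above as stand-in data:

```python
import pandas as pd

dates = pd.date_range("20220101", periods=6)

# Values copied from the article's output, rounded to six decimals
df = pd.DataFrame(
    {
        'column 1': [1.764052, 0.978738, 1.867558, 0.950088, -0.103219, 0.144044],
        'column 2': [0.400157, 2.240893, -0.977278, -0.151357, 0.410599, 1.454274],
    },
    index=dates,
)

default_head = df.head()   # no argument: first five rows
first_two = df.head(2)     # first two rows
last_two = df.tail(2)      # last two rows

print(first_two)
print(last_two)
```

Both methods return a new DataFrame and leave df itself unchanged.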
If we want to get a brief statistical overview of the data in each column, we can do that with df.describe():
df.describe()
Out:
column 1 column 2
count 6.000000 6.000000
mean 0.933544 0.562881
std 0.807791 1.143871
min -0.103219 -0.977278
25% 0.345555 -0.013479
50% 0.964413 0.405378
75% 1.567724 1.193355
max 1.867558 2.240893
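The rows of the describe() result correspond to ordinary column statistics. The following sketch, again based on the rounded values from the article's output, reads individual entries from the summary and compares them with the direct method calls.

```python
import pandas as pd

dates = pd.date_range("20220101", periods=6)

# Values copied from the article's output, rounded to six decimals
df = pd.DataFrame(
    {
        'column 1': [1.764052, 0.978738, 1.867558, 0.950088, -0.103219, 0.144044],
        'column 2': [0.400157, 2.240893, -0.977278, -0.151357, 0.410599, 1.454274],
    },
    index=dates,
)

summary = df.describe()

# Individual statistics can be read with .loc[row label, column label]
mean_col1 = summary.loc['mean', 'column 1']
median_col2 = summary.loc['50%', 'column 2']

# The same numbers are available through the direct methods
same_mean = df['column 1'].mean()
same_median = df['column 2'].median()

print(mean_col1, same_mean)
print(median_col2, same_median)
```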
In addition, we can display the data sorted by the values of a column by specifying that column's name.
df.sort_values(by='column 2')
Out:
column 1 column 2
2022-01-03 1.867558 -0.977278
2022-01-04 0.950088 -0.151357
2022-01-01 1.764052 0.400157
2022-01-05 -0.103219 0.410599
2022-01-06 0.144044 1.454274
2022-01-02 0.978738 2.240893
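Sorting can also be done in descending order or by the index instead of a column. A minimal sketch, using the rounded values from the article's output as stand-in data:

```python
import pandas as pd

dates = pd.date_range("20220101", periods=6)

# Values copied from the article's output, rounded to six decimals
df = pd.DataFrame(
    {
        'column 1': [1.764052, 0.978738, 1.867558, 0.950088, -0.103219, 0.144044],
        'column 2': [0.400157, 2.240893, -0.977278, -0.151357, 0.410599, 1.454274],
    },
    index=dates,
)

# Largest values of 'column 2' first
by_value_desc = df.sort_values(by='column 2', ascending=False)

# Sort by the index (here: the dates), newest first
by_date_desc = df.sort_index(ascending=False)

print(by_value_desc)
print(by_date_desc)
```

Both calls return a sorted copy; the original df keeps its row order.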
Select Data
With a few exceptions, the explanations in this section also apply to Pandas Series objects, so we will skip separate examples for Series. We can select a column of the DataFrame by calling its name directly.
df['column 1']
Out:
2022-01-01 1.764052
2022-01-02 0.978738
2022-01-03 1.867558
2022-01-04 0.950088
2022-01-05 -0.103219
2022-01-06 0.144044
Freq: D, Name: column 1, dtype: float64
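Selecting a single column returns a Series, as the output above shows. Passing a list of column names returns a DataFrame instead. A short sketch with the rounded values from the article's output:

```python
import pandas as pd

dates = pd.date_range("20220101", periods=6)

# Values copied from the article's output, rounded to six decimals
df = pd.DataFrame(
    {
        'column 1': [1.764052, 0.978738, 1.867558, 0.950088, -0.103219, 0.144044],
        'column 2': [0.400157, 2.240893, -0.977278, -0.151357, 0.410599, 1.454274],
    },
    index=dates,
)

single = df['column 1']                    # a Series
single_as_frame = df[['column 1']]         # a DataFrame with one column
both = df[['column 2', 'column 1']]        # a DataFrame, columns in the given order

print(type(single).__name__)
print(type(single_as_frame).__name__)
print(both.columns.tolist())
```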
We select individual rows of the DataFrame either by their position or by the index labels we have assigned to them.
df[0:3]
Out:
column 1 column 2
2022-01-01 1.764052 0.400157
2022-01-02 0.978738 2.240893
2022-01-03 1.867558 -0.977278
df["20220102":"20220104"]
Out:
column 1 column 2
2022-01-02 0.978738 2.240893
2022-01-03 1.867558 -0.977278
2022-01-04 0.950088 -0.151357
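The two slices above behave differently at their upper end: the positional slice excludes the end position, while the label slice includes the end label. The explicit accessors .iloc (position) and .loc (label) make this distinction visible in the code. A sketch using the rounded values from the article's output:

```python
import pandas as pd

dates = pd.date_range("20220101", periods=6)

# Values copied from the article's output, rounded to six decimals
df = pd.DataFrame(
    {
        'column 1': [1.764052, 0.978738, 1.867558, 0.950088, -0.103219, 0.144044],
        'column 2': [0.400157, 2.240893, -0.977278, -0.151357, 0.410599, 1.454274],
    },
    index=dates,
)

# Position-based: rows 0, 1 and 2 (the end position 3 is excluded)
by_position = df.iloc[0:3]

# Label-based: both the start and the end label are included
by_label = df.loc["2022-01-02":"2022-01-04"]

# A single cell: row label and column label together
cell = df.loc[pd.Timestamp("2022-01-03"), 'column 2']

print(by_position)
print(by_label)
print(cell)
```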
If we want to keep only the rows that meet a certain condition, we specify the column and the condition its values must satisfy. Keep in mind that an equality test in Python requires a double equal sign (==), because a single equal sign is an assignment.
df[df['column 1'] > 0.9]
Out:
column 1 column 2
2022-01-01 1.764052 0.400157
2022-01-02 0.978738 2.240893
2022-01-03 1.867558 -0.977278
2022-01-04 0.950088 -0.151357
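A condition on a column produces a Series of True/False values, which is then used to pick the matching rows. Several conditions can be combined with & (and) or | (or), each wrapped in parentheses. The sketch below uses the rounded values from the article's output and illustrative thresholds.

```python
import pandas as pd

dates = pd.date_range("20220101", periods=6)

# Values copied from the article's output, rounded to six decimals
df = pd.DataFrame(
    {
        'column 1': [1.764052, 0.978738, 1.867558, 0.950088, -0.103219, 0.144044],
        'column 2': [0.400157, 2.240893, -0.977278, -0.151357, 0.410599, 1.454274],
    },
    index=dates,
)

# The condition alone yields a boolean Series (one True/False per row)
mask = df['column 1'] > 0.9
filtered = df[mask]

# Combine two conditions; the parentheses around each one are required
both_conditions = df[(df['column 1'] > 0.9) & (df['column 2'] > 0)]

# An equality test uses the double equal sign
exact_match = df[df['column 1'] == 0.950088]

print(filtered)
print(both_conditions)
print(exact_match)
```

Exact equality on floats only matches here because the value was entered with the same literal; for computed values, a comparison with a tolerance is usually safer.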
That concludes this short introduction to the most basic commands in Pandas. The second part will follow in a few days.
This is what you should take with you
- Pandas is a fundamental library for data manipulation and analysis.
- It complements NumPy and builds on NumPy arrays, among other things.