In the first part of this series, we already looked at basic panda objects and how to query them. In this chapter we will also look at how to handle empty fields and how to merge multiple objects. As a starting example, we will use the same DataFrame as in the first part.
import pandas as pd
dates = pd.date_range("20220101", periods=6)
df = pd.DataFrame(np.random.randn(6, 2), index=dates, columns=['column 1', 'column 2'])
df
Out:
column 1 column 2
2022-01-01 1.764052 0.400157
2022-01-02 0.978738 2.240893
2022-01-03 1.867558 -0.977278
2022-01-04 0.950088 -0.151357
2022-01-05 -0.103219 0.410599
2022-01-06 0.144044 1.454274
Pandas Missing Values
When we pass empty values to a DataFrame or other Pandas object, they are automatically replaced by Numpy NaNs (Not a Number). For calculations, such as averaging, these fields are not included and ignored. We can simply extend the existing DataFrame with an empty third column containing a value only for the index 01/01/2022. The remaining values are then automatically set as NaN.
df.loc['20220101', 'column 3'] = 1.0
df
Out:
column 1 column 2 column 3
2022-01-01 1.764052 0.400157 1.0
2022-01-02 0.978738 2.240893 NaN
2022-01-03 1.867558 -0.977278 NaN
2022-01-04 0.950088 -0.151357 NaN
2022-01-05 -0.103219 0.410599 NaN
2022-01-06 0.144044 1.454274 NaN
If we want to delete all rows that have an empty value in at least one of the columns, we can do that using the following command.
df.dropna(how="any")
Out:
column 1 column 2 column 3
2022-01-01 1.764052 0.400157 1.0
If we want to delete columns with missing values instead, we use the same command and additionally set ‘axis = 1’. Otherwise we can also fill the empty fields with predefined values, for example with the value 0.
df.fillna(value=0)
Out:
column 1 column 2 column 3
2022-01-01 1.764052 0.400157 1.0
2022-01-02 0.978738 2.240893 0.0
2022-01-03 1.867558 -0.977278 0.0
2022-01-04 0.950088 -0.151357 0.0
2022-01-05 -0.103219 0.410599 0.0
2022-01-06 0.144044 1.454274 0.0
In some cases it can also be useful to display the missing values as Boolean values (True/False). In most cases, however, the DataFrame objects are too large and this is not a useful representation.
pd.isna(df)
Out:
column 1 column 2 column 3
2022-01-01 False False False
2022-01-02 False False True
2022-01-03 False False True
2022-01-04 False False True
2022-01-05 False False True
2022-01-06 False False True
Merge Pandas Objects
Pandas provides several ways to concatenate Series or DataFrame objects. The concat command extends the first named object by the second named object if they are of the same type. The command can of course be executed with more than two data structures.
s1 = pd.Series(['a', 'b'])
s2 = pd.Series(['c', 'd'])
pd.concat([s1, s2])
Out:
0 a
1 b
0 c
1 d
dtype: object
With DataFrames the code line looks the same. The addition ‘ignore_index’ is used to assign a new continuous index and not the index from the original object.
df1 = pd.DataFrame([['a', 1], ['b', 2]],
columns=['letter', 'number'])
df1
Out:
letter number
0 a 1
1 b 2
df2 = pd.DataFrame([['c', 3], ['d', 4]],
columns=['letter', 'number'])
df2
Out:
letter number
0 c 3
1 d 4
pd.concat([df1, df2], ignore_index=True)
Out:
letter number
0 a 1
1 b 2
2 c 3
3 d 4
Pandas also allows join possibilities with ‘Merge’, which most people are probably familiar with from SQL.
left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
left
Out:
key lval
0 foo 1
1 foo 2
right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})
right
Out:
key rval
0 foo 4
1 foo 5
pd.merge(left, right, on='key')
Out:
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
If we want to perform an inner join instead of left or right joins, we again use the Concat command with the addition ‘join = “inner”‘.
pd.concat([left, right], join="inner")
Out:
key
0 foo
1 foo
0 foo
1 foo
This is what you should take with you
- Pandas offers many possibilities to deal with missing values. You can either delete the columns/rows in question or replace the fields with a value.
- With Pandas we have the same join possibilities as with SQL.