Skip to content

Basic Commands for Pandas DataFrame

In the first part of this series, we already looked at basic panda objects and how to query them. In this chapter we will also look at how to handle empty fields and how to merge multiple objects. As a starting example, we will use the same DataFrame as in the first part.

import pandas as pd

dates = pd.date_range("20220101", periods=6)


df = pd.DataFrame(np.random.randn(6, 2), index=dates, columns=['column 1', 'column 2'])
df

Out: 
            column 1  column 2
2022-01-01  1.764052  0.400157
2022-01-02  0.978738  2.240893
2022-01-03  1.867558 -0.977278
2022-01-04  0.950088 -0.151357
2022-01-05 -0.103219  0.410599
2022-01-06  0.144044  1.454274

How to delete Empty Fields from a Pandas DataFrame?

When we pass empty values to a DataFrame or other Pandas object, they are automatically replaced by Numpy NaNs (Not a Number). For calculations, such as averaging, these fields are not included and are ignored. We can simply extend the existing DataFrame with an empty third column containing a value only for the index 01/01/2022. The remaining values are then automatically set as NaN.

df.loc['20220101', 'column 3'] = 1.0
df

Out:
            column 1  column 2  column 3
2022-01-01  1.764052  0.400157       1.0
2022-01-02  0.978738  2.240893       NaN
2022-01-03  1.867558 -0.977278       NaN
2022-01-04  0.950088 -0.151357       NaN
2022-01-05 -0.103219  0.410599       NaN
2022-01-06  0.144044  1.454274       NaN

If we want to delete all rows that have an empty value in at least one of the columns, we can do that using the following command.

df.dropna(how="any")

Out:
            column 1  column 2  column 3
2022-01-01  1.764052  0.400157       1.0

If we want to delete columns with missing values instead, we use the same command and additionally set ‘axis = 1’. Otherwise, we can also fill the empty fields with predefined values, for example with the value 0.

df.fillna(value=0)

Out:
            column 1  column 2  column 3
2022-01-01  1.764052  0.400157       1.0
2022-01-02  0.978738  2.240893       0.0
2022-01-03  1.867558 -0.977278       0.0
2022-01-04  0.950088 -0.151357       0.0
2022-01-05 -0.103219  0.410599       0.0
2022-01-06  0.144044  1.454274       0.0

In some cases, it can also be useful to display the missing values as Boolean values (True/False). In most cases, however, the DataFrame objects are too large and this is not a useful representation.

pd.isna(df)

Out:
            column 1  column 2  column 3
2022-01-01     False     False     False
2022-01-02     False     False      True
2022-01-03     False     False      True
2022-01-04     False     False      True
2022-01-05     False     False      True
2022-01-06     False     False      True

How can I delete specific Rows from a DataFrame?

If we don’t want to just delete empty values from our DataFrame, there are two ways we can do that. First, we can delete the rows from the DataFrame by using the index of the row we want to delete. In our case, this is a concrete date, such as 01.01.2022:

df.drop(pd.to_datetime('2022-01-01'), axis=0)

Out: 

                column 1	column 2
2022-01-02	-0.497788	1.244450
2022-01-03	-0.429310	0.484361
2022-01-04	-0.929285	-0.222924
2022-01-05	0.656892	0.282631
2022-01-06	0.971478	-0.894619

By doing this, we have deleted the first line in this object. In most cases, however, we will not yet know the specific row that we want to delete. Then we can also filter the DataFrame to the rows we want to delete and then output the indexes of the corresponding rows.

df.drop(df[df['column 1'] > 0.1].index)

Out: 
	        column 1	column 2
2022-01-01	-1.111707	-0.028735
2022-01-03	-1.791537	-0.279329
2022-01-05	-0.706618	-0.020599
2022-01-06	-1.212609	-2.467449

In this case, we delete all lines for which a value greater than 0.1 is detected in “column 1”. This leaves a total of four lines in the “df” object.

How to add a Column?

Again, there are several ways to add new columns to the existing DataFrame. By simply defining the new column with square brackets it will be added as a new column to the DataFrame from the right.

df['column 3'] = np.random.randn(6, 1)
df

Out: 
	        column 1	column 2	column 3
2022-01-01	0.478728	-0.740822	-0.871291
2022-01-02	-0.497788	1.244450	-0.299211
2022-01-03	-0.429310	0.484361	2.026658
2022-01-04	-0.929285	-0.222924	1.030583
2022-01-05	0.656892	0.282631	-0.158718
2022-01-06	0.971478	-0.894619	1.078394

If instead we want to insert the new column at a specific index, we can use “df.insert()” for that:

df.insert(0,"new column 1", np.random.randn(6, 1), allow_duplicates = True)
df

Out: 
	        new column 1	column 1	column 2
2022-01-01	0.566667	1.343228	-1.722417
2022-01-02	-0.450743	-1.390555	-1.013505
2022-01-03	-0.063856	1.269601	0.162296
2022-01-04	0.374292	-0.420009	0.130430
2022-01-05	0.723524	0.205791	-0.336845
2022-01-06	-0.262750	0.232997	-0.226452

The first value passed to this function is the index of the new column to be inserted, then the name of the column, and thirdly the object to be inserted as the column. The last parameter specifies whether duplicates of this column are allowed. So if the column with the name and the same values already exists and “allow_duplicates” is set to “False”, then you will get an error message.

How to delete a Column?

As with any good pandas command, there are several options for dropping columns. The two easiest are either using the function “df.drop()” and the name of the column, and “axis=1” for a column selection. Or you can use the standard Python function “del” and define the corresponding column:

df.drop('new column 1', axis=1)
del df['new column 1']

Merge Pandas Objects

Pandas provide several ways to concatenate Series or DataFrame objects. The concat command extends the first-named object by the second named object if they are of the same type. The command can of course be executed with more than two data structures.

s1 = pd.Series(['a', 'b'])
s2 = pd.Series(['c', 'd'])
pd.concat([s1, s2])

Out:
0    a
1    b
0    c
1    d
dtype: object

With DataFrames the code line looks the same. The addition ‘ignore_index’ is used to assign a new continuous index and not the index from the original object.

df1 = pd.DataFrame([['a', 1], ['b', 2]],
                   columns=['letter', 'number'])
df1

Out:
  letter  number
0      a       1
1      b       2

df2 = pd.DataFrame([['c', 3], ['d', 4]],
                   columns=['letter', 'number'])
df2


Out:
  letter  number
0      c       3
1      d       4

pd.concat([df1, df2], ignore_index=True)

Out:
  letter  number
0      a       1
1      b       2
2      c       3
3      d       4

Pandas also allow joining possibilities with ‘Merge’, which most people are probably familiar with from SQL.

left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})

left

Out:
   key  lval
0  foo     1
1  foo     2

right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})
right

Out:
   key  rval
0  foo     4
1  foo     5

pd.merge(left, right, on='key')

Out:
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5

If we want to perform an inner join instead of left or right joins, we again use the Concat command with the addition ‘join = “inner”‘.

pd.concat([left, right], join="inner")

Out:
   key
0  foo
1  foo
0  foo
1  foo

This is what you should take with you

  • Pandas offers many possibilities to deal with missing values. You can either delete the columns/rows in question or replace the fields with a value.
  • With Pandas, we have the same join possibilities as with SQL.

Other Articles on the Topic of Pandas

  • The official documentation of Pandas can be found here.
  • This post is mainly based on the tutorial from Pandas. You can find it here.
close
Das Logo zeigt einen weißen Hintergrund den Namen "Data Basecamp" mit blauer Schrift. Im rechten unteren Eck wird eine Bergsilhouette in Blau gezeigt.

Don't miss new articles!

We do not send spam! Read everything in our Privacy Policy.

Cookie Consent with Real Cookie Banner