Pandas and NumPy are the two fundamental libraries for data manipulation in Python. This post covers the most important basics of Pandas and can also serve as a cheat sheet for its most common commands.
Pandas provides a broad set of tools for data manipulation and analysis. Its dedicated data structures make it possible to store and process table-like objects and time series data. In many cases Pandas builds on NumPy, so the two libraries complement each other rather than compete, as is sometimes claimed.
import pandas as pd
import numpy as np
Pandas Object Creation
Pandas uses different data structures to store and process information. Unfortunately, we cannot go into detail about the different structures in this article and therefore refer to our other articles, e.g. about Pandas DataFrames.
The Pandas Series object is similar to a one-dimensional NumPy array and can hold various data types, such as integers, floats, or strings.
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

Out:
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
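The dtype line in the output above depends on the values passed in. The following is a minimal sketch of how Pandas infers it; the variable names are chosen only for illustration and do not come from the original post.

```python
import numpy as np
import pandas as pd

# Pandas infers the dtype of a Series from the values it receives.
ints = pd.Series([1, 3, 5])
floats = pd.Series([1.5, 3.0, 5.25])
with_nan = pd.Series([1, 3, np.nan])   # NaN is a float, so the integers are upcast
texts = pd.Series(['a', 'b', 'c'])     # object or a string dtype, depending on the pandas version

for name, series in [('ints', ints), ('floats', floats),
                     ('with_nan', with_nan), ('texts', texts)]:
    print(name, series.dtype)
```

This is why the Series `s` is shown as float64 even though most of its values were written as integers.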
In the output, we see two columns, although we have only defined values for the Series. The left column is the index, which by default numbers the values. We can access the index and the values with the following commands.
print(s.index)
print(s.values)

Out:
RangeIndex(start=0, stop=6, step=1)
[ 1.  3.  5. nan  6.  8.]
Of course, we can also define the index ourselves. Accessing elements through descriptive labels instead of numbers makes the code easier to read and understand.
fruits = ['apples', 'oranges', 'cherries', 'pears']
quantities = [20, 33, 52, 10]
S = pd.Series(quantities, index=fruits)
print(S)

Out:
apples      20
oranges     33
cherries    52
pears       10
dtype: int64
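With a custom index in place, elements can be looked up by label. A short sketch using the same fruit Series; the names `single` and `subset` are only illustrative.

```python
import pandas as pd

fruits = ['apples', 'oranges', 'cherries', 'pears']
quantities = [20, 33, 52, 10]
S = pd.Series(quantities, index=fruits)

single = S['cherries']            # one label returns a single value
subset = S[['apples', 'pears']]   # a list of labels returns a smaller Series

print(single)
print(subset)
```

Passing a list of labels keeps the order of that list, so the result can also be used to reorder entries.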
For a detailed explanation of DataFrames and many code examples, feel free to read our separate post on Pandas DataFrames. Here we will show only the most basic commands for the sake of completeness.
We can create a DataFrame by passing a NumPy array together with an index and the column names. Similar to the Series, the individual rows of the table can then be addressed via the index. Here we use a range of dates as the index.
dates = pd.date_range("20220101", periods=6)
dates

Out:
DatetimeIndex(['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
               '2022-01-05', '2022-01-06'],
              dtype='datetime64[ns]', freq='D')

np.random.seed(0)  # fixed seed so that the random values shown below are reproducible
df = pd.DataFrame(np.random.randn(6, 2), index=dates, columns=['column 1', 'column 2'])
df

Out:
            column 1  column 2
2022-01-01  1.764052  0.400157
2022-01-02  0.978738  2.240893
2022-01-03  1.867558 -0.977278
2022-01-04  0.950088 -0.151357
2022-01-05 -0.103219  0.410599
2022-01-06  0.144044  1.454274
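To check that the array, the index, and the column names ended up where intended, the attributes `shape`, `columns`, and `index` can be inspected. The sketch below assumes the legacy NumPy seed 0, which reproduces the values shown in this post; the seed itself is an assumption.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20220101", periods=6)

np.random.seed(0)  # assumption: seed 0 reproduces the numbers shown in this post
df = pd.DataFrame(np.random.randn(6, 2), index=dates,
                  columns=['column 1', 'column 2'])

print(df.shape)             # (number of rows, number of columns)
print(df.columns.tolist())  # column names as a plain list
print(df.index[0])          # first index label, a Timestamp
```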
We can view the first and last rows of a DataFrame with the following commands. The number in parentheses specifies how many rows are returned. If it is omitted, the default is five.
df.head()

Out:
            column 1  column 2
2022-01-01  1.764052  0.400157
2022-01-02  0.978738  2.240893
2022-01-03  1.867558 -0.977278
2022-01-04  0.950088 -0.151357
2022-01-05 -0.103219  0.410599

df.tail(3)

Out:
            column 1  column 2
2022-01-04  0.950088 -0.151357
2022-01-05 -0.103219  0.410599
2022-01-06  0.144044  1.454274
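The effect of the row count can be verified on a small table. This is a sketch with made-up integer data rather than the random values used above.

```python
import numpy as np
import pandas as pd

# Illustrative 6x2 table with the values 0..11
df = pd.DataFrame(np.arange(12).reshape(6, 2),
                  columns=['column 1', 'column 2'])

first_default = df.head()   # no argument: the first five rows
first_two = df.head(2)      # the first two rows
last_three = df.tail(3)     # the last three rows

print(len(first_default), len(first_two), len(last_three))
```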
If we want to get a brief statistical overview of the data in each column, we can do that with df.describe():
df.describe()

Out:
       column 1  column 2
count  6.000000  6.000000
mean   0.933544  0.562881
std    0.807791  1.143871
min   -0.103219 -0.977278
25%    0.345555 -0.013479
50%    0.964413  0.405378
75%    1.567724  1.193355
max    1.867558  2.240893
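Each row of the summary corresponds to a method that can also be called on a single column. A minimal sketch with small, made-up numbers so the results are easy to check by hand:

```python
import pandas as pd

df = pd.DataFrame({'column 1': [1.0, 2.0, 3.0, 4.0],
                   'column 2': [10.0, 20.0, 30.0, 40.0]})

summary = df.describe()

# The summary is itself a DataFrame, so single statistics can be read with .loc
mean_1 = summary.loc['mean', 'column 1']
median_1 = summary.loc['50%', 'column 1']   # the 50% quantile is the median

print(mean_1 == df['column 1'].mean())
print(median_1 == df['column 1'].median())
```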
We can also display the data in sorted order by specifying the name of the column whose values should determine the order.
df.sort_values(by='column 2')

Out:
            column 1  column 2
2022-01-03  1.867558 -0.977278
2022-01-04  0.950088 -0.151357
2022-01-01  1.764052  0.400157
2022-01-05 -0.103219  0.410599
2022-01-06  0.144044  1.454274
2022-01-02  0.978738  2.240893
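Sorting returns a new DataFrame and leaves the original untouched. The following sketch, using a small made-up table, also shows descending order and how to restore the original order by sorting on the index.

```python
import pandas as pd

df = pd.DataFrame({'column 1': [3, 1, 2],
                   'column 2': [0.5, -1.0, 2.0]},
                  index=['a', 'b', 'c'])

descending = df.sort_values(by='column 2', ascending=False)  # largest value first
restored = descending.sort_index()                           # back to index order

print(descending)
print(restored)
```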
With a few exceptions, the explanations in this section also apply to Pandas Series objects, so we will skip separate examples for Series. We can select a column of the DataFrame by addressing it directly by name.
df['column 1']

Out:
2022-01-01    1.764052
2022-01-02    0.978738
2022-01-03    1.867558
2022-01-04    0.950088
2022-01-05   -0.103219
2022-01-06    0.144044
Freq: D, Name: column 1, dtype: float64
We select individual rows of the DataFrame either by their position or by the index labels we have assigned to them. Note that a slice by position excludes the end position, whereas a slice by label includes the end label.
df[0:3]

Out:
            column 1  column 2
2022-01-01  1.764052  0.400157
2022-01-02  0.978738  2.240893
2022-01-03  1.867558 -0.977278

df["20220102":"20220104"]

Out:
            column 1  column 2
2022-01-02  0.978738  2.240893
2022-01-03  1.867558 -0.977278
2022-01-04  0.950088 -0.151357
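The same two selections can be written more explicitly with `iloc` (by position) and `loc` (by label), which makes the intent clearer. This is a sketch with made-up integer data on the same date index.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20220101", periods=6)
df = pd.DataFrame(np.arange(12).reshape(6, 2), index=dates,
                  columns=['column 1', 'column 2'])

by_position = df.iloc[0:3]                  # positions 0, 1, 2 (end excluded)
by_label = df.loc["20220102":"20220104"]    # both end labels included

# A single cell: row label first, then column name
cell = df.loc[pd.Timestamp("2022-01-02"), 'column 1']

print(by_position)
print(by_label)
print(cell)
```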
If we want to keep only the rows that meet a certain condition, we name the column and the condition its values must satisfy. Any comparison operator can be used here (>, <, >=, <=, !=). Keep in mind that a test for equality needs the double equal sign ==, because a single equal sign is an assignment in Python.
df[df['column 1'] > 0.5]

Out:
            column 1  column 2
2022-01-01  1.764052  0.400157
2022-01-02  0.978738  2.240893
2022-01-03  1.867558 -0.977278
2022-01-04  0.950088 -0.151357
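Behind the scenes, the condition produces a Series of True/False values that is then used to pick rows. The sketch below uses made-up values to show such a mask, an equality test with `==`, and two conditions combined with `&`. Each condition needs its own parentheses when combined.

```python
import pandas as pd

df = pd.DataFrame({'column 1': [1.5, 0.2, -0.7, 1.5],
                   'column 2': [0.4, 2.2, -1.0, -0.3]})

mask = df['column 1'] > 0.5            # boolean Series, one entry per row
equal = df[df['column 1'] == 1.5]      # equality uses the double equal sign
both = df[(df['column 1'] > 0) & (df['column 2'] > 0)]   # both conditions must hold

print(mask.tolist())
print(equal)
print(both)
```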
That concludes this short introduction to the most basic commands in Pandas. The second part will follow in a few days.
This is what you should take with you
- Pandas is a fundamental library for data manipulation and analysis.
- It complements NumPy and builds on NumPy arrays, among other things.
Other Articles on the Topic of Pandas
- The official documentation of Pandas can be found here.
- This post is mainly based on the tutorial from Pandas. You can find it here.