Introduction

This tutorial introduces Pandas, an essential Python library for data science and machine learning.

Table of Contents:

Pandas in Python

We start by importing the Pandas library and introducing its data types. First, we explain the “Pandas Series”, which is very similar to a Numpy array but differs in some of its logic. We will show how to define a Pandas Series and how to use its properties. Secondly, we introduce the “Pandas DataFrame”, which is the most used data structure when we import datasets into Python environments. We will show the functions of DataFrames in a simple example of exploratory data analysis. In the last section, we summarize the tutorial.

Introduction to Pandas library

Most data science projects start with exploring and cleaning the data, and these steps are very time-consuming, so we need libraries that make the work easier. As we discussed in our earlier Numpy in Python tutorial, Numpy is very important when we work with numerical data. Pandas is just as important when we work with large datasets and mixed data types during data preprocessing.

We import the Pandas library in the following line. “pd” is the most used abbreviation accepted by the community. We also import the Numpy library to compare Pandas with Numpy.

import pandas as pd
import numpy as np

Series and DataFrames are the fundamental data structures when we use Pandas for data analysis.

Pandas Series

Series are very similar to one-dimensional Numpy arrays. We create a Pandas Series like the following.

pandas_series = pd.Series(data, index, dtype)

The data parameter in this definition can be:

  • A constant value
  • A list
  • A Numpy array
  • A dictionary

The index parameter supplies the labels of the elements in the Series. If we do not pass it, the index defaults to the integers from 0 to n-1 for data of length n. The dtype parameter sets the data type of the Series; if it is omitted, Pandas infers the type from the data.
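As a minimal sketch of the four kinds of input listed above, each of the following calls builds a Series. The variable names are only illustrative. Note that a constant value is repeated once for every label in the index we give.

```python
import numpy as np
import pandas as pd

# A constant value is repeated for every label of the given index.
from_constant = pd.Series(5, index=['a', 'b', 'c'])

# A list and a Numpy array both receive the default index 0..n-1.
from_list = pd.Series([10, 20, 30])
from_array = pd.Series(np.array([1.5, 2.5, 3.5]))

# A dictionary supplies both the labels (keys) and the data (values).
from_dict = pd.Series({'x': 1, 'y': 2})

print(from_constant)
print(from_list.index.tolist())   # [0, 1, 2]
print(from_dict.index.tolist())   # ['x', 'y']
```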

First, we create a Numpy array and a Pandas Series and then compare them. We define a list containing the numbers 0 to 9 and store it in both data structures. After that, we print both and see the difference: unlike the Numpy array, the Series also has an index column.

numbers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
numpy_array = np.array(numbers)
print(numpy_array)

——————————–

[0 1 2 3 4 5 6 7 8 9]

 

pandas_series = pd.Series(data=numbers)
print(pandas_series)

——————————–

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

In the code below, we use our own “my_index” list instead of the default index. The most important point here is that the length of the index list must equal the length of the data. We also set the data type of the Series to float, and the output confirms that its dtype is float64.

my_index = ['a', 'b', 'c', 'd', 'e', 'f', 'g','h', 'i','j']
pandas_series = pd.Series(data=numbers, index=my_index, dtype=float)
print(pandas_series)

——————————–

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
f    5.0
g    6.0
h    7.0
i    8.0
j    9.0
dtype: float64
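The length requirement mentioned above can be checked with a short sketch. The lists here are made up for illustration and the index is deliberately one label too short. In this case Pandas raises a ValueError instead of building the Series.

```python
import pandas as pd

values = [1, 2, 3]
short_index = ['a', 'b']  # deliberately one label too few

try:
    pd.Series(data=values, index=short_index)
    mismatch_detected = False
except ValueError as error:
    # Pandas refuses to pair 3 values with 2 labels.
    mismatch_detected = True
    print("ValueError:", error)
```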

When we pass a dictionary as the data parameter of the Series, the keys of the dictionary are used as the index labels.

dict1 = dict(a=1, b=2, c=3, d=4)
print(pd.Series(dict1))

——————————–

a    1
b    2
c    3
d    4
dtype: int64
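A dictionary can also be combined with an explicit index. In the sketch below (the label 'z' is a made-up key that is not in the dictionary), Pandas looks up each label among the keys and fills the missing one with NaN, which also makes the Series float64.

```python
import pandas as pd

dict1 = dict(a=1, b=2, c=3, d=4)

# 'a' and 'b' exist in the dictionary; 'z' does not.
selected = pd.Series(dict1, index=['a', 'b', 'z'])
print(selected)
```

Only the requested labels appear in the result, so this is also a simple way to pick a subset of the keys.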

Now we will explain some attributes of the Pandas Series, which are similar to those of Numpy arrays. ‘ndim’ gives the number of dimensions of the data, ‘dtype’ gives the type of the data, and ‘shape’ holds the size of the data along each dimension.

print(pandas_series.ndim)

——————————–

1

print(pandas_series.dtype)

——————————–

float64

 

print(pandas_series.shape)

——————————–

(10,)
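To see how closely these attributes match Numpy, the following sketch prints them side by side for an array and a Series built from the same list. The ‘size’ attribute (the total number of elements) is not discussed above and is included only as an extra example.

```python
import numpy as np
import pandas as pd

numbers = list(range(10))
array = np.array(numbers, dtype=float)
series = pd.Series(numbers, dtype=float)

# Both objects expose the same attribute names.
for name in ('ndim', 'dtype', 'shape', 'size'):
    print(name, getattr(array, name), getattr(series, name))
```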

Pandas Series also shares many functions with Numpy arrays. ‘max()’ returns the largest element in the data and ‘min()’ returns the smallest. ‘sum()’ returns the sum of the elements; for string data it returns the concatenation of the elements. ‘mean()’ returns the average of the data and works only for numeric data.

print(pandas_series.max())

——————————–

9.0

 

print(pandas_series.min())

———————————

0.0

 

print(pandas_series.sum())

———————————

45.0

 

print(pandas_series.mean())

———————————

4.5
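The remark above about string data can be tried with a small sketch. The two strings are made up, and dtype=object is set explicitly here as an assumption, so that the values are stored as plain Python strings regardless of the Pandas version.

```python
import pandas as pd

# dtype=object keeps the values as plain Python strings.
words = pd.Series(['pan', 'das'], dtype=object)

joined = words.sum()    # concatenation of the elements
largest = words.max()   # last element in alphabetical order

print(joined)
print(largest)
```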

 

Now we turn to some mathematical operations. We can add two Pandas Series with the addition operator. The operation adds the elements that share the same index label, and the index is kept in the output. For string data, addition behaves like concatenation. Subtraction, multiplication, and division are also available for numeric data.

print(pandas_series+pandas_series)

———————————-

a     0.0
b     2.0
c     4.0
d     6.0
e     8.0
f    10.0
g    12.0
h    14.0
i    16.0
j    18.0
dtype: float64
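The example above adds a Series to itself, so every label has a partner. The following sketch uses two made-up Series whose indexes only partly overlap, to show that the values are matched by label and not by position. Labels found in only one of the two Series give NaN. The second part shows the concatenation behaviour for string data, again with dtype=object set explicitly as an assumption.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# 'b' and 'c' exist on both sides; 'a' and 'd' have no partner.
total = left + right
print(total)

# For string data, + concatenates element by element.
first = pd.Series(['a', 'b'], dtype=object)
second = pd.Series(['x', 'y'], dtype=object)
joined = first + second
print(joined)
```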

Pandas DataFrames

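We can think of a DataFrame as a table of rows and columns in which each column can have its own data type, so that we can process the data more easily. We can define a DataFrame as shown in the following line. The arguments are passed by keyword because the third positional parameter of pd.DataFrame is ‘columns’, not ‘dtype’.

pandas_dataframe = pd.DataFrame(data=data, index=index, dtype=dtype)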

The data parameter can be any of the following:

  • A dictionary of dictionaries, series, or lists
  • A one-dimensional or multidimensional Numpy array
  • Another DataFrame

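For example, we define two dictionaries, combine them, and put them into a Pandas DataFrame below. dict1 has no key ‘e’, so the corresponding entry in the ‘first’ column is NaN and that column is stored as floats.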

dict1 = dict(a=1, b=2, c=3, d=4)
dict2 = dict(a=5, b=6, c=7, d=8, e=9)
data_comb = dict(first=dict1, second=dict2)
pandas_dataframe = pd.DataFrame(data_comb)
print(pandas_dataframe)

———————————-

   first  second
a    1.0       5
b    2.0       6
c    3.0       7
d    4.0       8
e    NaN       9
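The other kinds of input can be sketched in the same way. The example below builds one DataFrame from a two-dimensional Numpy array and one from a dictionary of lists. The row labels and column names are made up for illustration, and the ‘columns’ parameter is used to name the columns of the array.

```python
import numpy as np
import pandas as pd

# A two-dimensional Numpy array: 3 rows, 2 columns.
array_2d = np.array([[1, 2], [3, 4], [5, 6]])
from_array = pd.DataFrame(array_2d,
                          index=['r1', 'r2', 'r3'],
                          columns=['first', 'second'])

# A dictionary of lists: each key becomes a column.
from_lists = pd.DataFrame({'first': [1, 3, 5], 'second': [2, 4, 6]})

print(from_array)
print(from_lists.shape)   # (3, 2)
```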

DataFrames are mostly used for importing data into Python environments. Now we will give a simple example of exploratory data analysis. First, we download the well-known Wine Quality data (winequality-red.csv) from the UCI website (https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/) and move it to our current working folder. The values in this file are separated by semicolons, so we pass “;” as the ‘sep’ parameter to read the data correctly. After reading the data into a Pandas DataFrame, we can get information about it with the ‘info()’ function.

data=pd.read_csv("winequality-red.csv", sep=";")
data.info()

———————————-

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
#   Column                Non-Null Count  Dtype
---  ------                --------------  -----
0   fixed acidity         1599 non-null   float64
1   volatile acidity      1599 non-null   float64
2   citric acid           1599 non-null   float64
3   residual sugar        1599 non-null   float64
4   chlorides             1599 non-null   float64
5   free sulfur dioxide   1599 non-null   float64
6   total sulfur dioxide  1599 non-null   float64
7   density               1599 non-null   float64
8   pH                    1599 non-null   float64
9   sulphates             1599 non-null   float64
10  alcohol               1599 non-null   float64
11  quality               1599 non-null   int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
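To see why the ‘sep’ parameter matters, the following sketch reads a tiny made-up text in the same semicolon-separated layout, once with the default separator and once with sep=";". It uses io.StringIO in place of a file so that it runs without downloading anything.

```python
import io
import pandas as pd

# A made-up sample in the same semicolon-separated layout.
sample_text = 'pH;alcohol;quality\n3.51;9.4;5\n3.20;9.8;5\n'

# With the default separator "," every line becomes a single column.
wrong = pd.read_csv(io.StringIO(sample_text))
# With sep=";" the three columns are split as intended.
right = pd.read_csv(io.StringIO(sample_text), sep=';')

print(wrong.shape)   # (2, 1)
print(right.shape)   # (2, 3)
```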

Here we can see that our data has 1599 rows and 12 columns. The first 11 columns are the features of the wine, and the last column is the quality score, which depends on these features. To investigate the data further, we can look at some of the rows. The ‘head()’ function displays the first rows of the data. It takes an optional argument giving the number of rows we want to see; if we pass nothing, it returns the first 5 rows by default.

data.head()

———————————-

 

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5
1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0   0.9968  3.20       0.68      9.8        5
2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0   0.9970  3.26       0.65      9.8        5
3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0   0.9980  3.16       0.58      9.8        6
4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5
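The optional argument of ‘head()’ can be tried on a small stand-in DataFrame. The values below are made up and only loosely resemble the wine data. The ‘tail()’ function, which is not used elsewhere in this tutorial, is shown as the counterpart that returns the last rows.

```python
import pandas as pd

# Hypothetical stand-in for the wine data: 8 rows, 2 columns.
sample = pd.DataFrame({
    'pH': [3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39],
    'quality': [5, 5, 5, 6, 5, 5, 5, 7],
})

print(len(sample.head()))   # 5 rows by default
print(sample.head(3))       # first 3 rows
print(sample.tail(2))       # last 2 rows
```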

We can select rows by their integer position. ‘iloc[]’ takes the zero-based position of the row that we want to see.

 

data.iloc[2]

———————————-

fixed acidity            7.800
volatile acidity         0.760
citric acid              0.040
residual sugar           2.300
chlorides                0.092
free sulfur dioxide     15.000
total sulfur dioxide    54.000
density                  0.997
pH                       3.260
sulphates                0.650
alcohol                  9.800
quality                  5.000
Name: 2, dtype: float64
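A single row is returned as a Series, and because that row mixes float and integer columns, it is converted to float64, which is why quality appears as 5.000 in the output above. The sketch below, on a made-up DataFrame with letter labels, shows a few more forms of ‘iloc[]’ and compares it with the label-based ‘loc[]’, which is not covered elsewhere in this tutorial.

```python
import pandas as pd

sample = pd.DataFrame(
    {'pH': [3.51, 3.20, 3.26, 3.16], 'quality': [5, 5, 5, 6]},
    index=['w', 'x', 'y', 'z'],
)

row = sample.iloc[2]         # third row, returned as a Series
block = sample.iloc[1:3]     # rows at positions 1 and 2 (end is excluded)
cell = sample.iloc[3, 1]     # row position 3, column position 1
by_label = sample.loc['y']   # the same row as iloc[2], selected by label

print(row)
print(block)
print(cell)
```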

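We can get a whole column by its name with data['column_name'] or data.column_name. The attribute form works only when the column name is a valid Python identifier, so a name such as ‘fixed acidity’ needs the bracket form. For example, we display the ‘pH’ column of the data in both ways below.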

data['pH']

———————————-

0       3.51
1       3.20
2       3.26
3       3.16
4       3.51
...
1594    3.45
1595    3.52
1596    3.42
1597    3.57
1598    3.39
Name: pH, Length: 1599, dtype: float64

data.pH

———————————-

0       3.51
1       3.20
2       3.26
3       3.16
4       3.51
...
1594    3.45
1595    3.52
1596    3.42
1597    3.57
1598    3.39
Name: pH, Length: 1599, dtype: float64
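The two access styles can be compared on a small made-up DataFrame. The sketch also shows a column whose name contains a space, which can only be reached with brackets, and a list of names, which returns a DataFrame instead of a Series.

```python
import pandas as pd

sample = pd.DataFrame({'pH': [3.51, 3.20], 'fixed acidity': [7.4, 7.8]})

# Both styles return the same Series.
same = sample['pH'].equals(sample.pH)

# The name contains a space, so only the bracket form works.
acidity = sample['fixed acidity']

# A list of names selects several columns and returns a DataFrame.
subset = sample[['pH', 'fixed acidity']]

print(same)
print(acidity)
print(subset)
```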

 

We can select the rows that satisfy a condition on a given column by using a comparison inside the square brackets. The line below returns the rows whose pH value is greater than 4.

data[data['pH']>4]

———————————-

      fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
1316            5.4              0.74          0.0             1.2      0.041                 16.0                  46.0  0.99258  4.01       0.59     12.5        6
1321            5.0              0.74          0.0             1.2      0.041                 16.0                  46.0  0.99258  4.01       0.59     12.5        6
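The comparison inside the brackets produces a Series of True and False values, and only the rows marked True are kept. The following sketch, on made-up values, shows this intermediate result and how two conditions can be combined with & (and) and | (or). Each condition needs its own parentheses.

```python
import pandas as pd

sample = pd.DataFrame({'pH': [3.51, 4.01, 3.16, 4.01],
                       'quality': [5, 6, 6, 4]})

# The comparison alone gives one True/False value per row.
mask = sample['pH'] > 4
print(mask.tolist())   # [False, True, False, True]

high_ph = sample[mask]
both = sample[(sample['pH'] > 4) & (sample['quality'] >= 6)]
either = sample[(sample['pH'] > 4) | (sample['quality'] >= 6)]

print(both)
print(either)
```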

We can add a new column by assigning to its name in square brackets on the Pandas DataFrame. In the following line, we add a column called ‘ones’ and assign 1 to it for all rows.

data['ones']=1
print(data)

———————————-

      fixed acidity  volatile acidity  citric acid  ...  alcohol  quality  ones
0               7.4             0.700         0.00  ...      9.4        5     1
1               7.8             0.880         0.00  ...      9.8        5     1
2               7.8             0.760         0.04  ...      9.8        5     1
3              11.2             0.280         0.56  ...      9.8        6     1
4               7.4             0.700         0.00  ...      9.4        5     1
...             ...               ...          ...  ...      ...      ...   ...
1594            6.2             0.600         0.08  ...     10.5        5     1
1595            5.9             0.550         0.10  ...     11.2        6     1
1596            6.3             0.510         0.13  ...     11.0        6     1
1597            5.9             0.645         0.12  ...     10.2        5     1
1598            6.0             0.310         0.47  ...     11.0        6     1
[1599 rows x 13 columns]
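A new column does not have to be a constant. As a sketch on made-up values, the example below also computes a column from two existing ones. The name ‘bound sulfur dioxide’ is a hypothetical column chosen only for illustration and is not part of the dataset.

```python
import pandas as pd

sample = pd.DataFrame({'free sulfur dioxide': [11.0, 25.0],
                       'total sulfur dioxide': [34.0, 67.0]})

# A constant is repeated for every row.
sample['ones'] = 1

# A column computed row by row from two existing columns.
sample['bound sulfur dioxide'] = (
    sample['total sulfur dioxide'] - sample['free sulfur dioxide']
)

print(sample)
```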

We can drop columns with the ‘drop(columns=column_name)’ function. It returns a new DataFrame, so we should assign the result to a variable to keep the change.

data=data.drop(columns='ones')
print(data)

———————————-

      fixed acidity  volatile acidity  citric acid  ...  sulphates  alcohol  quality
0               7.4             0.700         0.00  ...       0.56      9.4        5
1               7.8             0.880         0.00  ...       0.68      9.8        5
2               7.8             0.760         0.04  ...       0.65      9.8        5
3              11.2             0.280         0.56  ...       0.58      9.8        6
4               7.4             0.700         0.00  ...       0.56      9.4        5
...             ...               ...          ...  ...        ...      ...      ...
1594            6.2             0.600         0.08  ...       0.58     10.5        5
1595            5.9             0.550         0.10  ...       0.76     11.2        6
1596            6.3             0.510         0.13  ...       0.75     11.0        6
1597            5.9             0.645         0.12  ...       0.71     10.2        5
1598            6.0             0.310         0.47  ...       0.66     11.0        6
[1599 rows x 12 columns]
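The need to assign the result can be seen in a short sketch on a made-up DataFrame. The original keeps all its columns after ‘drop()’ is called, and a list of names drops several columns at once.

```python
import pandas as pd

sample = pd.DataFrame({'pH': [3.51, 3.20], 'ones': [1, 1], 'twos': [2, 2]})

# drop() returns a new DataFrame; sample itself is not changed.
dropped = sample.drop(columns='ones')
print(sample.columns.tolist())    # ['pH', 'ones', 'twos']
print(dropped.columns.tolist())   # ['pH', 'twos']

# A list of names drops several columns in one call.
trimmed = sample.drop(columns=['ones', 'twos'])
print(trimmed.columns.tolist())   # ['pH']
```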

Summary

Pandas is a very useful library for data analysis and preprocessing, and it is used in many data science and machine learning projects. In this tutorial, we introduced the Pandas library and its data structures. First, we talked about the Pandas Series, which is very similar to a Numpy array. Secondly, we explained the Pandas DataFrame, the most used of the two. We showed its features and functions with examples in our exploratory data analysis.