This tutorial is about Pandas, an essential Python library in the fields of Data Science and Machine Learning.
Table of Contents:
We start by importing the Pandas library and introducing its data types. First, we explain the “Pandas Series”, which is very similar to a Numpy array but differs in some of its logic. We will talk about how to define a Pandas Series and use its properties. Secondly, we introduce “Pandas DataFrames”, the most used data structure when we import datasets into Python environments. We will show the functions of DataFrames in a simple example of exploratory data analysis. In the last section, we summarize the tutorial.
Introduction to Pandas library
Most data science projects start with discovering and cleaning the data, and these processes are very time-consuming. Thus, we need libraries that facilitate the work. As we discussed in one of our previous tutorials, Numpy is very important when we work with numerical data. Pandas is just as important when we work with large datasets and mixed data types during data preprocessing.
We import the Pandas library in the code below. “pd” is the abbreviation most widely accepted by the community. We also import the Numpy library to compare Pandas with Numpy.
import pandas as pd
import numpy as np
Series and DataFrames are the fundamental data structures when we use Pandas for data analysis.
Pandas Series
Series are very similar to one-dimensional Numpy arrays. We create a Pandas Series as follows.
pandas_series = pd.Series(data, index, dtype)
The data parameter in this definition can be:
- A constant value
- A list
- A Numpy array
- A dictionary
The index parameter labels the elements of the Series. If we do not pass it, the default index is the integers 0 to (n-1) for data of length n. The dtype parameter sets the data type of the Series.
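As a brief sketch of the first case in the list above (the variable names here are illustrative): when the data is a constant value, the index determines the length of the Series and the value is repeated for every label.

```python
import pandas as pd

# A constant value as data: it is repeated for every index label
scalar_series = pd.Series(data=5, index=['a', 'b', 'c'])
print(scalar_series)
```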
First, we create a Numpy array and a Pandas Series, then explain their differences. We define a list containing the numbers 0 to 9 and store it in both data structures. After that, we print them and see the differences. A Series also has an index column, unlike a Numpy array.
numbers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
numpy_array = np.array(numbers)
print(numpy_array)
——————————–
[0 1 2 3 4 5 6 7 8 9]
pandas_series = pd.Series(data=numbers)
print(pandas_series)
——————————–
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64
In the code below, we use the specified “my_index” list instead of the default index. The most important point here is that the length of the index list must equal the length of the data. We also set the data type of the Series to float, and we can see that its dtype is float64 in the output.
my_index = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
pandas_series = pd.Series(data=numbers, index=my_index, dtype=float)
print(pandas_series)
——————————–
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
f    5.0
g    6.0
h    7.0
i    8.0
j    9.0
dtype: float64
When we pass a dictionary as the data parameter of the Series, the keys of the dictionary are used as the index.
dict1 = dict(a=1, b=2, c=3, d=4)
print(pd.Series(dict1))
——————————–
a    1
b    2
c    3
d    4
dtype: int64
Now we will explain some attributes of the Pandas Series. These attributes are similar to those of Numpy arrays. We can get the number of dimensions with ‘ndim’, the data type with ‘dtype’, and the size of the data with ‘shape’.
print(pandas_series.ndim)
——————————–
1
print(pandas_series.dtype)
——————————–
float64
print(pandas_series.shape)
——————————–
(10,)
A Pandas Series also has functions similar to those of Numpy arrays. ‘max()’ returns the largest element and ‘min()’ the smallest. We can get the sum of the elements with ‘sum()’; for string data, this function returns the concatenation of the elements. ‘mean()’ gives us the average of the data and works only on numeric data.
print(pandas_series.max())
——————————–
9.0
print(pandas_series.min())
———————————
0.0
print(pandas_series.sum())
———————————
45.0
print(pandas_series.mean())
———————————
4.5
Now we talk about some mathematical operations. We can add two Pandas Series with the addition operator. This operation adds the elements at matching index labels; the index column remains the same in the output. For string data, addition behaves like concatenation. Subtraction, multiplication, and division are also available for numeric data.
print(pandas_series+pandas_series)
———————————-
a     0.0
b     2.0
c     4.0
d     6.0
e     8.0
f    10.0
g    12.0
h    14.0
i    16.0
j    18.0
dtype: float64
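To illustrate the string behavior mentioned above, a minimal sketch (the example Series here is our own):

```python
import pandas as pd

# For string data, addition concatenates the elements at matching labels
words = pd.Series(['data', 'frame'])
combined = words + words
print(combined)  # 'datadata' and 'frameframe'
```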
Pandas DataFrames
We can think of a DataFrame as a table with columns and rows of different types, so that we can process the data more easily. We can define a DataFrame as in the following code.
pandas_dataframe = pd.DataFrame(data, index, columns, dtype)
The data parameter can be any of the following:
- A dictionary of dictionaries, series, or lists
- A one-dimensional or multidimensional Numpy array
- Another DataFrame
For example, we define two dictionaries, combine them, and put them into a Pandas DataFrame below.
dict1 = dict(a=1, b=2, c=3, d=4)
dict2 = dict(a=5, b=6, c=7, d=8, e=9)
data_comb = dict(first=dict1, second=dict2)
pandas_dataframe = pd.DataFrame(data_comb)
print(pandas_dataframe)
———————————-
   first  second
a    1.0       5
b    2.0       6
c    3.0       7
d    4.0       8
e    NaN       9
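The Numpy-array case from the list above can be sketched as follows; the column labels here are illustrative and are passed with the ‘columns’ parameter.

```python
import numpy as np
import pandas as pd

# A 2-D Numpy array as data; column labels are optional
array_2d = np.arange(6).reshape(3, 2)  # [[0, 1], [2, 3], [4, 5]]
frame = pd.DataFrame(array_2d, columns=['first', 'second'])
print(frame)
```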
DataFrames are mostly used for importing data into Python environments. Now we will give a simple example of an exploratory data analysis. First, we download the famous Wine-Quality data (winequality-red.csv) from the UCI website (https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/) and move it to the folder we are currently working in. This data is separated by semicolons, so we pass the ‘sep’ parameter as “;” to read the DataFrame correctly. After we read the data as a Pandas DataFrame, we can get information about it with the ‘info()’ function.
data = pd.read_csv("winequality-red.csv", sep=";")
data.info()
———————————-
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
Here we can see that our data has 1599 rows and 12 columns. The first 11 columns are the features of the wine, and the last column is the quality result that depends on these features. If we want to investigate the data further, we can look at some of its rows. We can display the first rows of the data with the ‘head()’ function. ‘head()’ can take an argument for the number of rows we want to see. If we do not pass any argument, it returns the first 5 rows by default.
data.head()
———————————-
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5
1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0   0.9968  3.20       0.68      9.8        5
2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0   0.9970  3.26       0.65      9.8        5
3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0   0.9980  3.16       0.58      9.8        6
4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5
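As a self-contained sketch of ‘head()’ with an argument (we build a small made-up DataFrame here instead of relying on the downloaded file):

```python
import pandas as pd

sample = pd.DataFrame({'alcohol': [9.4, 9.8, 9.8, 9.8, 9.4, 10.5],
                       'quality': [5, 5, 5, 6, 5, 5]})

# head(n) returns the first n rows; tail(n) returns the last n
print(sample.head(3))
print(sample.tail(2))
```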
We can access rows by their integer positions. ‘iloc[]’ takes the position of the row that we want to see.
data.iloc[2]
———————————-
fixed acidity            7.800
volatile acidity         0.760
citric acid              0.040
residual sugar           2.300
chlorides                0.092
free sulfur dioxide     15.000
total sulfur dioxide    54.000
density                  0.997
pH                       3.260
sulphates                0.650
alcohol                  9.800
quality                  5.000
Name: 2, dtype: float64
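‘iloc[]’ also accepts slices and lists of positions. A minimal sketch with a small, made-up DataFrame:

```python
import pandas as pd

frame = pd.DataFrame({'pH': [3.51, 3.20, 3.26, 3.16],
                      'quality': [5, 5, 5, 6]})

# A slice of positions returns those rows as a DataFrame
print(frame.iloc[0:2])

# A list of positions selects specific rows
print(frame.iloc[[0, 3]])
```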
We can get a whole column by its name with data['column_name'] or data.column_name. For example, we display the ‘pH’ column of the data in the code lines below.
data['pH']
———————————-
0       3.51
1       3.20
2       3.26
3       3.16
4       3.51
        ... 
1594    3.45
1595    3.52
1596    3.42
1597    3.57
1598    3.39
Name: pH, Length: 1599, dtype: float64
data.pH
———————————-
0       3.51
1       3.20
2       3.26
3       3.16
4       3.51
        ... 
1594    3.45
1595    3.52
1596    3.42
1597    3.57
1598    3.39
Name: pH, Length: 1599, dtype: float64
We can select the rows whose values satisfy a condition on a specified attribute with a comparison operator. The line below returns the rows whose pH is greater than 4.
data[data['pH']>4]
———————————-
      fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
1316            5.4              0.74          0.0             1.2      0.041                 16.0                  46.0  0.99258  4.01       0.59     12.5        6
1321            5.0              0.74          0.0             1.2      0.041                 16.0                  46.0  0.99258  4.01       0.59     12.5        6
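Conditions can also be combined; here is a sketch with made-up values, using & for “and” with each condition in parentheses:

```python
import pandas as pd

frame = pd.DataFrame({'pH': [3.51, 4.01, 3.26, 4.10],
                      'quality': [5, 6, 5, 7]})

# Each condition must be parenthesized; & means 'and', | means 'or'
selected = frame[(frame['pH'] > 4) & (frame['quality'] > 6)]
print(selected)
```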
We can add a new column by assigning to its name as an index of the Pandas DataFrame. In the following lines, we add a column called ‘ones’ and assign the value 1 to every row.
data['ones'] = 1
print(data)
———————————-
      fixed acidity  volatile acidity  citric acid  ...  alcohol  quality  ones
0               7.4             0.700         0.00  ...      9.4        5     1
1               7.8             0.880         0.00  ...      9.8        5     1
2               7.8             0.760         0.04  ...      9.8        5     1
3              11.2             0.280         0.56  ...      9.8        6     1
4               7.4             0.700         0.00  ...      9.4        5     1
...             ...               ...          ...  ...      ...      ...   ...
1594            6.2             0.600         0.08  ...     10.5        5     1
1595            5.9             0.550         0.10  ...     11.2        6     1
1596            6.3             0.510         0.13  ...     11.0        6     1
1597            5.9             0.645         0.12  ...     10.2        5     1
1598            6.0             0.310         0.47  ...     11.0        6     1
[1599 rows x 13 columns]
We can drop columns with the ‘drop()’ function. We should assign the result to a variable to keep the changed DataFrame.
data = data.drop(columns='ones')
print(data)
———————————-
      fixed acidity  volatile acidity  citric acid  ...  sulphates  alcohol  quality
0               7.4             0.700         0.00  ...       0.56      9.4        5
1               7.8             0.880         0.00  ...       0.68      9.8        5
2               7.8             0.760         0.04  ...       0.65      9.8        5
3              11.2             0.280         0.56  ...       0.58      9.8        6
4               7.4             0.700         0.00  ...       0.56      9.4        5
...             ...               ...          ...  ...        ...      ...      ...
1594            6.2             0.600         0.08  ...       0.58     10.5        5
1595            5.9             0.550         0.10  ...       0.76     11.2        6
1596            6.3             0.510         0.13  ...       0.75     11.0        6
1597            5.9             0.645         0.12  ...       0.71     10.2        5
1598            6.0             0.310         0.47  ...       0.66     11.0        6
[1599 rows x 12 columns]
Below are some significant applications of the Pandas library.
Applications of Pandas
Economics
Because Pandas provides a full range of capabilities, such as DataFrames and file handling, it is in high demand for economic data analysis. Economists must analyze data to establish patterns and recognize trends in how the economy grows across various sectors. As a result, many economists have begun to use Python and Pandas to analyze large datasets.
Recommendation System
The suggestions that Spotify or Netflix supply after predicting user preferences are pretty astonishing. These systems are Deep Learning miracles, and using such models to provide suggestions is one of the most important applications of Pandas. The models are written in Python, with Pandas serving as the primary library for handling their data. Pandas is well known for its ability to manage massive volumes of data, and a recommendation system is only possible by learning from and processing massive amounts of data. Functions like groupby and mapping play a significant role in making these systems a reality.
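The groupby mechanism mentioned above can be sketched on a small, hypothetical listening log (the data here is invented purely for illustration):

```python
import pandas as pd

# Hypothetical play log: count how many tracks each user played
plays = pd.DataFrame({'user': ['ann', 'bob', 'ann', 'bob', 'ann'],
                      'track': ['t1', 't1', 't2', 't3', 't1']})
plays_per_user = plays.groupby('user').size()
print(plays_per_user)
```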
Stock Prediction
The stock market is very unpredictable, so developers create models that forecast its performance using Pandas and other libraries such as Numpy and Matplotlib. This is feasible because a large amount of historical data shows how stocks behave, and by learning from this data a model can forecast the next move with some accuracy. Such prediction models can even be used to automate the buying and selling of stocks.
Advertising
Advertising has advanced dramatically in the twenty-first century. Nowadays it has become highly individualized, allowing businesses to gain more and more clients. This, once again, has been made feasible by technologies such as Machine Learning and Deep Learning. Models that analyze client data learn to understand what the customer wants, providing businesses with effective advertisement ideas. Pandas is used to process this customer data, and many of its functions are helpful here.
Big Data
Pandas can also operate on Big Data. Python has a strong relationship with Hadoop and Spark, which allows Pandas to access Big Data, and Pandas can also write data to Spark or Hadoop quickly.
NLP (Natural Language Processing)
NLP, or Natural Language Processing, has taken the world by storm and is causing quite a stir. The fundamental idea is to interpret human language and its various nuances. It is quite challenging, but with the help of Pandas and Scikit-learn it is possible to develop an NLP model.
Summary
Pandas is a very useful library for data analysis and preprocessing, and it is used in many Data Science and Machine Learning projects. In this tutorial, we introduced the Pandas library and its data structures. First, we talked about the Pandas Series, which is very similar to a Numpy array. Secondly, we explained the widely used Pandas DataFrame, showing its features and functions with examples in a simple exploratory data analysis.