In this tutorial, we will learn about Matplotlib, a library commonly used to visualize data and results in Python Data Science and Machine Learning projects.
Table of Contents:
- Data visualization
- Introducing Matplotlib library
- plot() function in Matplotlib
- Some other charts with Matplotlib
- Summary
We start with data visualization in general and then introduce the Matplotlib library. We draw line charts with the plot() function and look at its most common options. Next, we show some other chart types that appear frequently in Data Science projects. Finally, we summarize what we learned.
Data visualization
In Data Science, the data we work with is usually complex and scattered, so we need to organize it before we can understand it. This is an essential first step of data processing. Data visualization is the task of making abstract data concrete by presenting it visually, so that we can obtain preliminary information from it.
We should consider the following points while performing data visualization.
- Choosing the right chart type: It is necessary to know the data and the chart types. We should visualize our data with the most appropriate chart type.
- Understandable graphics: Charts should be plain and simple, without any excesses.
- Visual requirements: Correct placement of numbers, the harmony of colors, etc.
Introducing Matplotlib library
Matplotlib is an essential Python library for data visualization. It allows us to make 2- and 3-dimensional charts. It is generally used for 2D charts because there are more advanced libraries for 3D charts.
First, we will load the Matplotlib library into the Python environment and define the data for the line chart as NumPy arrays. You can get detailed information about NumPy arrays in our “Numpy in Python” tutorial.
We import the libraries as follows. “plt” is the commonly used abbreviation for the pyplot module of the Matplotlib library, which contains the functions for drawing charts.
import matplotlib.pyplot as plt
import numpy as np
Let’s assume that our data lists people with their ages and happiness rates, as follows.
Person | Age | Happiness Rate
person1 | 5 | 95
person2 | 45 | 60
person3 | 35 | 50
person4 | 80 | 30
person5 | 25 | 30
person6 | 15 | 70
person7 | 10 | 85
We define each column as a separate NumPy array. So that the line runs from left to right without doubling back, we list the data in order of age. We will show this data in a line chart in the next section.
age=np.array([5,10,15,25,35,45,80])
happy=np.array([95,85,70,30,50,60,30])
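If the data arrives in table order rather than already sorted, one way to obtain the arrays above is to sort both columns by age. The following is a minimal sketch using np.argsort; the names age_raw, happy_raw, and order are introduced here only for illustration.

```python
import numpy as np

# Columns in the same order as the table (person1 .. person7).
# age_raw and happy_raw are illustrative names, not part of the tutorial code.
age_raw = np.array([5, 45, 35, 80, 25, 15, 10])
happy_raw = np.array([95, 60, 50, 30, 30, 70, 85])

# argsort returns the positions that would sort age_raw in ascending order.
order = np.argsort(age_raw)

# Apply the same ordering to both arrays so each age keeps its happiness rate.
age = age_raw[order]
happy = happy_raw[order]

print(age)
print(happy)
```

Indexing both arrays with the same order keeps every person's age and happiness rate paired after sorting.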
plot() function in Matplotlib
plt.plot() is the main command for drawing a basic line chart. We pass the x-axis values first, then the y-axis values. To change the appearance of the chart, such as the color, thickness, or style of the line, we pass additional parameters to the same function.
plt.plot(age, happy)
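For reference, the pieces so far fit together as the short script below. It is a sketch rather than the only way to structure the code. It also shows that plt.plot() returns a list of line objects, which can be useful for checking what was drawn.

```python
import matplotlib.pyplot as plt
import numpy as np

age = np.array([5, 10, 15, 25, 35, 45, 80])
happy = np.array([95, 85, 70, 30, 50, 60, 30])

# plt.plot() returns a list with one Line2D object per plotted line.
lines = plt.plot(age, happy)
line = lines[0]

# The line object keeps the x and y values that were passed in.
print(line.get_xdata())
print(line.get_ydata())
```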
The “color” parameter determines the color of the line. For example, the following command draws a red line. We can also set the style of the line with the “linestyle” parameter, which is dashed here.
plt.plot(age, happy, color='red', linestyle='dashed')
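As a quick comparison, the sketch below draws the same data once per named line style. The four names used here ('solid', 'dashed', 'dotted', 'dashdot') are values that Matplotlib accepts for “linestyle”. The vertical offset is an addition made here only so that the lines do not overlap.

```python
import matplotlib.pyplot as plt
import numpy as np

age = np.array([5, 10, 15, 25, 35, 45, 80])
happy = np.array([95, 85, 70, 30, 50, 60, 30])

# From top to bottom in the chart: solid, dashed, dotted, dashdot.
styles = ['solid', 'dashed', 'dotted', 'dashdot']

drawn = []
for i, style in enumerate(styles):
    # Shift each copy down by 10 * i so the four lines stay distinguishable.
    (line,) = plt.plot(age, happy - 10 * i, color='red', linestyle=style)
    drawn.append(line)
```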
We set the line thickness with the “linewidth” parameter. Larger values give thicker lines.
plt.plot(age, happy, color='red', linewidth=4.0)
The “marker” parameter determines how each data point is marked on the chart. For example, we mark the values with ‘x’ in the next line. We can also use other markers such as ‘o’, ‘+’, and ‘.’. The “markersize” parameter sets the size of the marker.
plt.plot(age, happy, color='red', marker='x', markersize=10.0)
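These parameters can be combined in a single call. The sketch below, which is only one possible combination, sets the color, line width, marker, and marker size together and reads the values back from the returned line object.

```python
import matplotlib.pyplot as plt
import numpy as np

age = np.array([5, 10, 15, 25, 35, 45, 80])
happy = np.array([95, 85, 70, 30, 50, 60, 30])

# One call that sets several appearance parameters at once.
(line,) = plt.plot(age, happy, color='red', linewidth=2.0,
                   marker='o', markersize=8.0)

# Read the settings back from the Line2D object.
print(line.get_linewidth())   # 2.0
print(line.get_marker())      # o
print(line.get_markersize())  # 8.0
```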
If we only want to show the points, without a connecting line, we can pass a short format string as the third parameter of the plot function. In the following line, ‘r’ selects the color red and ‘x’ selects the marker. Because no line style is given, only the markers are drawn.
plt.plot(age, happy, 'rx')
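The format string can also include a line style. As a sketch of how the pieces are read, the example below compares 'rx' (markers only) with 'rx--' (markers joined by a dashed line). The second call and the vertical offset are added here for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

age = np.array([5, 10, 15, 25, 35, 45, 80])
happy = np.array([95, 85, 70, 30, 50, 60, 30])

# 'rx': red color, 'x' marker, no line style, so no connecting line is drawn.
(points_only,) = plt.plot(age, happy, 'rx')

# 'rx--': the same color and marker, plus a dashed connecting line.
# The data is shifted down by 20 so the two series do not overlap.
(with_line,) = plt.plot(age, happy - 20, 'rx--')

print(points_only.get_linestyle())
print(with_line.get_linestyle())
```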
We name a data series with the “label” parameter. We call ours “happiness by age” in the next line. To display the label on the chart, we call the plt.legend() function.
plt.plot(age, happy, 'rx', label='happiness by age')
plt.legend()
Now we add another data series to our chart. Let’s say we add the income rate of the same people. First, we define the income as a NumPy array and then plot it in the same figure. We should call legend() after all the plot() calls so that it lists every series.
income=np.array([0, 2, 5, 30, 50, 70, 30])
plt.plot(age, income, 'bo', label='income by age')
plt.legend()
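Put together as one self-contained script, the two series and the legend might look like the sketch below. The call to plt.figure() and the read-back of the legend texts are additions made here to keep the example isolated and checkable.

```python
import matplotlib.pyplot as plt
import numpy as np

age = np.array([5, 10, 15, 25, 35, 45, 80])
happy = np.array([95, 85, 70, 30, 50, 60, 30])
income = np.array([0, 2, 5, 30, 50, 70, 30])

# Start from a fresh figure so earlier plots do not leak into this one.
plt.figure()

plt.plot(age, happy, 'rx', label='happiness by age')
plt.plot(age, income, 'bo', label='income by age')

# Call legend() once, after every labeled series has been plotted.
legend = plt.legend()

# The legend entries follow the order of the plot() calls.
labels = [text.get_text() for text in legend.get_texts()]
print(labels)
```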
We set the labels of the axes with the “xlabel()” and “ylabel()” functions. We also add a title to the figure with the “title()” function.
plt.xlabel('age')
plt.ylabel('rate')  # both series are rates
plt.title('rates of happiness and income')
We can draw multiple plots in the same figure with the subplot() function. Its first parameter is the number of rows in the grid of plots, the second is the number of columns, and the third is the index of the plot we are drawing now, counted from 1. In the following lines, subplot(2,1,1) selects the upper of two stacked plots and subplot(2,1,2) selects the lower one.
# upper plot
plt.subplot(2,1,1)
plt.plot(age,happy, 'rx', label='happiness by age')
plt.legend()
plt.title('rates of happiness and income')
# lower plot
plt.subplot(2,1,2)
plt.plot(age, income, 'bo', label='income by age')
plt.legend()
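To see how the third parameter is counted in a larger grid, the sketch below fills a 2-by-2 grid and writes each plot's index as its title. The grid size, the titles, and the call to tight_layout() are chosen here only for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

age = np.array([5, 10, 15, 25, 35, 45, 80])
happy = np.array([95, 85, 70, 30, 50, 60, 30])

fig = plt.figure()

# subplot(rows, columns, index): the index starts at 1 and runs
# left to right, then top to bottom.
for index in range(1, 5):
    ax = plt.subplot(2, 2, index)
    ax.plot(age, happy, 'rx')
    ax.set_title('index ' + str(index))

# Adjust the spacing so titles and tick labels of neighbouring plots
# are less likely to overlap.
plt.tight_layout()
```

In the resulting figure, index 1 is the top-left plot, index 2 the top-right, index 3 the bottom-left, and index 4 the bottom-right.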
Some other charts with Matplotlib
In this section, we will visualize a real dataset with different chart types of the Matplotlib library. We first download the Wine-Quality dataset (winequality-red.csv) from the UCI website (https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/) and move it to the current folder. Then we read it as a Pandas DataFrame. The file separates its columns with semicolons, so we pass sep=";" to read_csv(). For more information, please see our “Pandas in Python” tutorial.
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("winequality-red.csv", sep=";")
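The sep argument matters because pandas assumes commas by default. The sketch below illustrates this with a two-row sample in a similar semicolon-separated layout, read from a string instead of a file. The values are invented for the example and are not taken from the dataset, and df_sample is an illustrative name.

```python
import io
import pandas as pd

# Made-up sample in a semicolon-separated layout; not real dataset rows.
sample = io.StringIO(
    '"fixed acidity";"pH";"quality"\n'
    '7.0;3.4;5\n'
    '8.1;3.2;6\n'
)

# With sep=";" each field becomes its own column.
df_sample = pd.read_csv(sample, sep=";")

print(df_sample.columns.tolist())
print(df_sample.shape)
```

Without sep=";", each row would be read as a single column, so checking the column names after loading is a quick sanity test.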
We draw a histogram of the “quality” column, which is the class of each sample, in the following lines.
plt.hist(df['quality'])
plt.title('numbers of the wines by quality')
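Because quality is a whole-number score, the default bins of plt.hist() may not line up with the individual scores. One option, sketched below on a small made-up array of scores (not the real data), is to center one bin on each integer. The bins and rwidth parameters are existing options of plt.hist(); the variable names are illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up quality scores, used only to keep the example self-contained.
quality = np.array([5, 5, 6, 5, 7, 6, 5, 4, 6, 6])

# Bin edges at 3.5, 4.5, ..., 7.5 give each integer score its own bin.
edges = np.arange(quality.min(), quality.max() + 2) - 0.5

plt.figure()
# rwidth=0.8 leaves a small gap between neighbouring bars.
counts, bin_edges, patches = plt.hist(quality, bins=edges, rwidth=0.8)
plt.title('numbers of the wines by quality')

print(counts)
```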
We sometimes want to understand the relationship between two features of the data. A scatter plot shows how the samples are distributed over two chosen columns.
plt.scatter(df['fixed acidity'], df['pH'])
plt.title('relationship between fixed acidity and pH')
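Finally, we will draw a pie chart to see which class makes up the largest share of the samples. We take the labels from the index of the value_counts() result so that each label matches its slice.
counts = df['quality'].value_counts()
plt.pie(counts, labels=counts.index)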
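When labeling a pie chart from value_counts(), the labels need to be in the same order as the counts. value_counts() sorts by frequency, while unique() keeps the order of first appearance, so the two can disagree. The sketch below shows this on a small made-up Series (not the real data) and takes the labels from the index of the counts.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up quality scores, used only to keep the example self-contained.
quality = pd.Series([6, 5, 5, 7, 5, 6])

counts = quality.value_counts()

# The two orderings differ for this sample.
print(counts.index.tolist())      # ordered by frequency
print(quality.unique().tolist())  # ordered by first appearance

plt.figure()
# Labels taken from counts.index always match the slice sizes.
wedges, texts = plt.pie(counts, labels=counts.index)
```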
An important note: if you cannot see the figure even though you have typed the commands correctly, call the plt.show() function to display it on the screen.
Summary
Visualizing data is necessary to understand and process it correctly. In this tutorial, we introduced the Matplotlib library, which is commonly used to present data graphically in Data Science and Machine Learning. First, we used the plot() function to draw 2D line and point charts. Then, with a real dataset, we drew a histogram, a scatter plot, and a pie chart, and saw how the samples are distributed by class.