In today’s world, data is power. It is often said that whoever controls the data controls the world around it. Data is an essential raw material of the modern tech industry: it is the major differentiator in research and in devising plans and strategies, and it is the most essential component of statistics. In recent years, plenty of controversy and speculation has surrounded the accumulation of private data. In short, data plays a major role in the computing world.
Now, where and how do we accumulate a large amount of data?
Of course, we could enter it manually, but that would take a lot of time and effort. The simpler solution is to write a web scraping script.
Web scraping is the process of extracting data from websites automatically. The data is extracted by a programmed script (like a robot) in the most efficient and quickest way possible. With web scraping, we can extract valuable data from any website in large quantities and store it in any required format.
This scraped data can then be used to build datasets for machine learning and AI, or for data mining and statistical analysis.
In this tutorial, we are going to learn how to scrape data from a website using Node.js with Puppeteer. Puppeteer is a popular Node library that lets our script talk to the Chrome DevTools Protocol through high-level APIs. With it, we can load a webpage inside the script and perform various operations such as evaluating the page, creating PDFs, taking screenshots, etc. The idea here is to evaluate the webpage and then scrape data from it using query selectors. For testing purposes, we are going to scrape movie data from the IMDB website.
So, let’s get started!
Create a Node project
First, make a directory in your desired location on your local system, and run the following command inside it:
npm init
Fill in the required package information
You will get a package.json file in your project directory. Open the project with your favorite code editor:
Then, create an index.js file in the same project directory.
Install the Puppeteer library
Our next step is to install the Puppeteer library. This library offers high-level APIs on top of the Chrome DevTools Protocol and lets us automate most of what we can do manually in the Chrome browser. Among its notable features are generating PDF files of webpages and taking screenshots of them. It also lets us crawl webpages and produce pre-rendered content, which is vital for SSR and SEO. Using the modules from this package, we can automate processes such as form submission, UI testing, keyboard input, etc. We can even test browser extensions with it. In short, it is a very powerful library.
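As a quick illustration of those features, here is a minimal sketch that saves both a screenshot and a PDF of a page using Puppeteer's screenshot() and pdf() methods. The URL and output file names are placeholders of my own choosing, and the sketch assumes puppeteer is installed:

```javascript
// Sketch: capture a screenshot and a PDF of a page with Puppeteer.
// The URL and output paths are placeholders -- adjust them as needed.
async function captureSnapshots(url) {
  // required lazily so this file loads even before puppeteer is installed
  const puppeteer = require('puppeteer');

  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  await page.screenshot({ path: 'page.png' });        // PNG of the current viewport
  await page.pdf({ path: 'page.pdf', format: 'A4' }); // PDF rendering of the page
  await browser.close();
}

// Usage: captureSnapshots('https://example.com');
```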
Here, we are going to use it to crawl and evaluate a webpage, and in turn scrape some data from it.
To install the library, execute the following command in the project terminal:
npm install puppeteer
Implementing the Scraper
From here on, we are going to start coding our automated web scraper. For testing purposes, we will use the IMDB website to scrape movie titles. This should give beginners a basic, easy introduction to scraping with Puppeteer in the Node.js environment.
First, we need to import the Puppeteer package in our index.js file, as shown in the code snippet below:
const puppeteer = require('puppeteer');
(async () => {
})();
Now, we can use any movie page from IMDB. Just navigate to the site, copy any movie URL, and assign it to a variable as shown in the code snippet below:
(async () => {
  let movieURL = "https://www.imdb.com/title/tt0468569/?ref_=nv_sr_srsg_0";
})();
Now, we are going to launch the browser inside the script; that is, connect our script to the Chrome browser. For that, Puppeteer provides a launch method. We can assign the browser connection to a browser constant, as shown in the code snippet below:
(async () => {
  let movieURL = "https://www.imdb.com/title/tt0468569/?ref_=nv_sr_srsg_0";
  const browser = await puppeteer.launch();
})();
The main function is asynchronous, so we need to await the browser launch before executing any further steps.
After the browser is launched, we need to open a new page inside the script itself. It is like opening a new tab in the Chrome browser. We can do that using the newPage() method provided by the browser instance, as shown in the code snippet below:
(async () => {
  let movieURL = "https://www.imdb.com/title/tt0468569/?ref_=nv_sr_srsg_0";
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
})();
Here, the page instance will handle everything within that page. Since we have already set the movie URL, we can navigate to it using the goto() method provided by the page instance, as shown in the code snippet below:
(async () => {
  let movieURL = "https://www.imdb.com/title/tt0468569/?ref_=nv_sr_srsg_0";
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(movieURL, {waitUntil: 'networkidle2'});
})();
Here, the waitUntil: 'networkidle2' option inside the goto method tells Puppeteer to consider navigation finished once there have been no more than two open network connections for at least half a second.
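For reference, page.goto accepts four waitUntil values in Puppeteer. The following sketch lists them and shows how one might pick between the two "network idle" variants; the chooseWaitUntil helper is a hypothetical name of my own, not part of Puppeteer's API:

```javascript
// The four waitUntil values Puppeteer's page.goto() accepts:
//   'load'             -> the window load event has fired
//   'domcontentloaded' -> the DOMContentLoaded event has fired
//   'networkidle0'     -> no network connections for at least 500 ms
//   'networkidle2'     -> no more than 2 network connections for at least 500 ms
const WAIT_UNTIL_OPTIONS = ['load', 'domcontentloaded', 'networkidle0', 'networkidle2'];

// Hypothetical helper: pages with long-polling or analytics connections may
// never reach zero open connections, so 'networkidle0' would hang on them.
function chooseWaitUntil(pageKeepsConnectionsOpen) {
  return pageKeepsConnectionsOpen ? 'networkidle2' : 'networkidle0';
}
```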
Now, we need to write a query selector string so that we can scrape the required information, such as the movie title, rating, and rating count, from the page.
To work out the query string, we need to identify the element (and its class) inside which the data to be scraped resides. For that, open the page in the browser, open the developer tools, and inspect the title of the movie as directed in the screenshot below:
After formulating the query string, we can use it in our script. For that, we are going to use the evaluate method provided by the page instance, which runs the given callback in the context of the loaded page. Inside the evaluate callback, we use the querySelector method with the query string to pull out the required data. The implementation is provided in the code snippet below:
let scrapedResult = await page.evaluate(() => {
  let movieTitle = document.querySelector('div[class="title_wrapper"] > h1').innerText;
});
Here, we have used querySelector with the class attribute and HTML tag, along with the innerText property, to select the title of the movie. Remember that we identified the class and tag by inspecting them in the browser, as shown in the browser screenshot above.
Similarly, we can write querySelector calls for the movie rating and rating count, as shown in the code snippet below:
let scrapedResult = await page.evaluate(() => {
  let movieTitle = document.querySelector('div[class="title_wrapper"] > h1').innerText;
  let movieRating = document.querySelector('span[itemprop="ratingValue"]').innerText;
  let movieRatingCount = document.querySelector('span[itemprop="ratingCount"]').innerText;
});
This stores the required data in the variables shown in the code snippet above.
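One caveat: if IMDB changes its markup, any of these querySelector calls can return null, and the .innerText access will then throw inside the page. A defensive sketch is shown below; safeText is a hypothetical helper of my own naming, and in a real script it would have to be defined inside the evaluate callback, since functions from Node scope are not available in the browser context:

```javascript
// Hypothetical guard: return null instead of throwing when a selector misses.
// In a real script, define this inside the page.evaluate() callback.
function safeText(root, selector) {
  const el = root.querySelector(selector);
  return el ? el.innerText.trim() : null;
}

// Example against a stubbed DOM-like object (placeholder data, not real IMDB output):
const fakeRoot = {
  querySelector: (sel) => (sel === 'h1' ? { innerText: ' Example Movie ' } : null),
};
```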
Now that we have the scraped data in variables, we need to return them from the callback, as shown in the code snippet below:
let scrapedResult = await page.evaluate(() => {
  let movieTitle = document.querySelector('div[class="title_wrapper"] > h1').innerText;
  let movieRating = document.querySelector('span[itemprop="ratingValue"]').innerText;
  let movieRatingCount = document.querySelector('span[itemprop="ratingCount"]').innerText;
  return {
    movieTitle,
    movieRating,
    movieRatingCount
  };
});
console.log("Scraped Result: " + JSON.stringify(scrapedResult));
Here, we have also logged the scraped data.
Finally, we need to close the browser connection using the close() method provided by the browser instance:
await browser.close();
This marks the final step of creating the Node.js script that scrapes movie data from a website.
The full and final function is provided in the code snippet below:
(async () => {
  let movieURL = "https://www.imdb.com/title/tt0468569/?ref_=nv_sr_srsg_0";
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(movieURL, {waitUntil: 'networkidle2'});
  let scrapedResult = await page.evaluate(() => {
    let movieTitle = document.querySelector('div[class="title_wrapper"] > h1').innerText;
    let movieRating = document.querySelector('span[itemprop="ratingValue"]').innerText;
    let movieRatingCount = document.querySelector('span[itemprop="ratingCount"]').innerText;
    return {
      movieTitle,
      movieRating,
      movieRatingCount
    };
  });
  console.log("Scraped Result: " + JSON.stringify(scrapedResult));
  await browser.close();
})();
Now, we need to run the script by executing the following command in the project terminal:
node index.js
Running it, we get the result shown in the screenshot below:
We can see that we got the movie title, rating, and rating count in the result object. Likewise, we can test other elements of the webpage as well.
Finally, we have successfully created a Node.js script that crawls a webpage and scrapes data out of it using the Puppeteer module.
Conclusion
Well, scraping data with Puppeteer in the Node.js environment turned out to be simple enough. With just a few lines of code, we scraped the movie title, rating, and rating count from a movie page on the IMDB website. Imagine how valuable it would be to scrape such data from websites with multiple entries and lists. Take eCommerce sites, for example, where we could easily scrape product details. Other valuable information, such as banking investments, stock market figures, etc., can also be scraped with just a few lines of code, and Node.js with the Puppeteer library makes it easier still.
Now, your challenge is to scrape the list of titles from the IMDB top movies chart. Or, navigate to an eCommerce site and scrape the product titles from one of its pages. Then, export them to CSV or PDF format.