Data is one of the most valuable resources in today's technology-driven world. It is often said that whoever controls the data controls the world around it. Data is the major differentiator in research and in devising plans and strategies, and it is the essential raw material of statistics, machine learning, and AI. The many controversies and speculations around the collection of private data in recent years only underline how central it has become to the computing world.

Now, where and how do we accumulate a large amount of data?


Of course, we could enter it manually, but that takes a lot of time and effort. A simpler solution is to write a web scraping script.

Web scraping is the process of automating data extraction. A programmed script (like a robot) extracts the data in the most efficient and quickest way possible. With web scraping, we can extract valuable data from a website in large quantities and store it in any required format.

The scraped data can then be used to build datasets for machine learning and AI, or for data mining and statistical analysis.

In this tutorial, we are going to learn how to scrape data from a website using Node.js with Puppeteer. Puppeteer is a popular Node library that lets our script talk to the Chrome DevTools Protocol through high-level APIs. With it, we can load a webpage from the script and perform various operations such as evaluating the page, generating PDFs, taking screenshots, etc. The idea here is to evaluate the webpage and then scrape data from it using query selectors. For demonstration purposes, we are going to scrape movie data from the IMDB website.

So, let’s get started!

Create a Node project

First, make a directory anywhere on your local system, then run the following command inside it:

npm init

Fill in the required package information


You will now see a package.json file in your project directory. Open it with your favorite code editor.
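For reference, a freshly generated package.json might look something like the following sketch; the exact fields depend on the answers you gave to npm init, and the name and description here are just placeholders:

```json
{
  "name": "web-scraper",
  "version": "1.0.0",
  "description": "Scrape movie data from IMDB with Puppeteer",
  "main": "index.js",
  "scripts": {
    "start": "node index.js"
  },
  "license": "ISC"
}
```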


Create an index.js file in the same project directory.

Install the Puppeteer library

Our next step is to install the Puppeteer library. Puppeteer offers high-level APIs for controlling Chrome over the DevTools Protocol, and it lets us automate most of the things we can do manually in the Chrome browser. Among its notable features, it can generate PDF files of webpages and take screenshots of them. It lets us crawl webpages and produce pre-rendered content, which is vital for SSR and SEO. Using the modules from this package, we can automate processes such as form submission, UI testing, keyboard input, etc. We can even test browser extensions with it. In short, it is a very powerful library.

Here, we are going to use it to crawl and evaluate a webpage and, in turn, scrape some data from it.

To install the library, execute the following command in the project terminal:

npm install puppeteer

Implementing the Scraper

From here on, we are going to code our automated web scraper script. For demonstration purposes, we will use the IMDB website and scrape movie details. This gives beginners a basic, easy introduction to scraping with Puppeteer in the Node.js environment.

First, we need to import the puppeteer package in our index.js file as shown in the code snippet below:

const puppeteer = require('puppeteer');
 
(async () => {
  
})();

Now, we can use any movie page from IMDB. Navigate to the site, copy a movie URL, and assign it to a variable as shown in the code snippet below:

(async () => {
  let movieURL = "https://www.imdb.com/title/tt0468569/?ref_=nv_sr_srsg_0";
})();

Now, we are going to launch the browser from the script. Puppeteer provides a launch method that starts a bundled Chromium instance and connects our script to it. We store the connection in a browser constant as shown in the code snippet below:

(async () => {
  let movieURL = "https://www.imdb.com/title/tt0468569/?ref_=nv_sr_srsg_0";

  const browser = await puppeteer.launch();
})();

The main function is asynchronous, so we need to await the browser launch before executing any further steps.

After the browser is launched, we open a new page inside the script itself. It is like opening a new tab in the Chrome browser. We can do that using the newPage() method on the browser instance as shown in the code snippet below:

(async () => {
  let movieURL = "https://www.imdb.com/title/tt0468569/?ref_=nv_sr_srsg_0";

  const browser = await puppeteer.launch();
  const page = await browser.newPage();
})();

Here, the page instance will handle everything within that page. Since we have already set the movie URL, we navigate to it using the goto() method on the page instance as shown in the code snippet below:

(async () => {
  let movieURL = "https://www.imdb.com/title/tt0468569/?ref_=nv_sr_srsg_0";

  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(movieURL, {waitUntil: 'networkidle2'});
})();

Here, the waitUntil: 'networkidle2' option passed to the goto method tells Puppeteer to consider navigation finished once there have been no more than two network connections for at least 500 ms.

Now, we need to write the query selector strings so that we can scrape the required information, such as the movie title, rating, and rating count, from the page.

To work out the query string, we need to find the element (and its class) that contains the data to be scraped. For that, open the developer tools in the browser and inspect the title of the movie as shown in the screenshot below:

[Screenshot: inspecting the movie title element in Chrome DevTools]

After formulating the query string, we can use it in our script. We use the evaluate method on the page instance to run code in the context of the loaded page. Inside the evaluate callback, we use the querySelector method with the query string to pull out the required data. The implementation is provided in the code snippet below:

let scrapedResult = await page.evaluate(()=>{
    let movieTitle = document.querySelector('div[class="title_wrapper"] > h1').innerText;
});

Here, we used querySelector with an attribute selector on the class, combined with the HTML tag, and then read the element's innerText to get the title of the movie.

Remember that we identified the class and tags by inspecting them in the browser, as shown in the screenshot above. Note that websites change their markup over time, so these selectors may need updating in the future.

Similarly, we can write querySelector calls for the movie rating and rating count as shown in the code snippet below:

let scrapedResult = await page.evaluate(()=>{
    let movieTitle = document.querySelector('div[class="title_wrapper"] > h1').innerText;
    let movieRating = document.querySelector('span[itemprop="ratingValue"]').innerText;
    let movieRatingCount = document.querySelector('span[itemprop="ratingCount"]').innerText;
  });

This will allow us to store the required data in the variables mentioned in the code snippet above.
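As a side note, querySelector returns null when an element is missing, and reading .innerText on null throws an error. Below is a defensive version of the same extraction logic, sketched as a standalone helper of our own (the selectors are the ones from this tutorial and may break if IMDB changes its markup); its body can be inlined into the evaluate callback:

```javascript
// Defensive extraction: returns null for any field whose element is missing,
// instead of throwing. Inside page.evaluate(), call it as
// extractMovieData(document), where `document` is the page's own document.
function extractMovieData(doc) {
  const text = (selector) => {
    const el = doc.querySelector(selector);
    return el ? el.innerText : null;
  };
  return {
    movieTitle: text('div[class="title_wrapper"] > h1'),
    movieRating: text('span[itemprop="ratingValue"]'),
    movieRatingCount: text('span[itemprop="ratingCount"]'),
  };
}
```

Because the function only depends on the document object it receives, it can even be unit-tested with a stubbed document before being dropped into the page context.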

Now that we have the scraped data in the variables, we need to return it from the callback as shown in the code snippet below:

let scrapedResult = await page.evaluate(()=>{
    let movieTitle = document.querySelector('div[class="title_wrapper"] > h1').innerText;
    let movieRating = document.querySelector('span[itemprop="ratingValue"]').innerText;
    let movieRatingCount = document.querySelector('span[itemprop="ratingCount"]').innerText;

    return {
      movieTitle,
      movieRating,
      movieRatingCount
    }
  });

  console.log("Scraped Result: " + JSON.stringify(scrapedResult));

Here, we have also logged the scraped data.
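Note that the scraped values arrive as plain strings. If the numbers are needed for analysis, a small post-processing step can convert them; the helper names and sample values below are illustrative, not actual IMDB output:

```javascript
// Convert scraped strings into numbers for further analysis.
function parseRating(ratingText) {
  // e.g. "9.0" -> 9
  return parseFloat(ratingText);
}

function parseCount(countText) {
  // Strip thousands separators before parsing, e.g. "2,345,678" -> 2345678
  return parseInt(countText.replace(/,/g, ''), 10);
}
```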

Finally, we need to close the browser connection using the close method provided by the browser instance:

await browser.close();

This marks the final step of creating the NodeJS script that scrapes the movie data from a website.

The full and final function is provided in the code snippet below:

const puppeteer = require('puppeteer');

(async () => {
  let movieURL = "https://www.imdb.com/title/tt0468569/?ref_=nv_sr_srsg_0";

  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(movieURL, {waitUntil: 'networkidle2'});

  let scrapedResult = await page.evaluate(()=>{
    let movieTitle = document.querySelector('div[class="title_wrapper"] > h1').innerText;
    let movieRating = document.querySelector('span[itemprop="ratingValue"]').innerText;
    let movieRatingCount = document.querySelector('span[itemprop="ratingCount"]').innerText;

    return {
      movieTitle,
      movieRating,
      movieRatingCount
    }
  });

  console.log("Scraped Result: " + JSON.stringify(scrapedResult));
  await browser.close();

})();

Now, we need to run the script by executing the following command in the project terminal:

node index.js

We will get a result like the one shown in the screenshot below:

[Screenshot: terminal output showing the scraped movie title, rating, and rating count]

We can see that the result object contains the movie title, rating, and rating count. Likewise, we can test other elements on the webpage as well.

Finally, we have successfully created a Node.js script that crawls a webpage and scrapes data out of it using Puppeteer.

Conclusion

Scraping data with Puppeteer in the Node.js environment turned out to be simple enough. With just a few lines of code, we scraped the movie title, rating, and rating count from a movie page on the IMDB website. Imagine how useful this becomes on websites with many entries and lists. Take eCommerce sites, for example, where product details can be scraped easily; other valuable information, such as banking investments or stock market data, can be collected in the same way. Node.js with the Puppeteer library makes it all straightforward.

Now, your challenge is to scrape the list of titles from the IMDB top movies list. Or, navigate to an eCommerce site and scrape the product titles from one of its pages. Then, export them in CSV or PDF format.