How to Scrape Web Pages into PDF using PDF.co API in JavaScript (Node.js)

This tutorial and its sample source code show how to generate a PDF file from a web URL using the PDF.co web API in JavaScript (Node.js). The users can use this web API to save important information from web pages as PDF files on their local storage or in the cloud.

How to Scrape Web Pages with PDF from URL Endpoint

The users can use the PDF from URL endpoint of the PDF.co web API to generate PDF files from a web URL or a link to an HTML page. This endpoint takes the URL and other optional parameters and returns a link to the output PDF file. Moreover, the users can set the header and footer of the PDF files to their liking. Finally, the users can download the file with a file-handling module and keep it on local storage.

Endpoint Parameters

The PDF from URL endpoint accepts the following parameters:

1. url

It is a required parameter, a string containing the web URL or the link to the HTML file that the user wants to convert. The users can also provide links to files hosted on online storage systems such as Google Drive, Dropbox, and PDF.co cloud storage.

2. async

It is an optional parameter that the users can set to “true” to run the conversion asynchronously. The users may encounter the error “405” if they try to process large documents synchronously. To convert such documents or web pages, they should either set this parameter to “true” or specify a page range.

3. name

It is an optional parameter, a string containing the name of the output file. It is set to “result.pdf” by default.

4. expiration

It is an optional parameter defining how long the output file’s link stays valid, in minutes. It is set to 60 minutes by default, and the users can set it to different periods depending on their subscription plan. The files are automatically deleted from the cloud after this period. However, the users can store them permanently using the PDF.co built-in file storage system.

5. margins

It is an optional parameter containing the margins of the output PDF file. The users can set the margins as they do in CSS styling. For example, “2px, 2px, 2px, 2px” sets the top, right, bottom, and left margins, in that order.

6. paperSize

It is an optional parameter to set the paper size of the output file. It is set to “Letter” by default. The users can set it to “Letter”, “Legal”, “Tabloid”, “Ledger”, “A0”, “A1”, “A2”, “A3”, “A4”, “A5”, “A6”, or any custom size. Custom sizes can be given in px (pixels), mm (millimeters), or in (inches). For instance, “100px, 200px” sets the width and height, respectively.

7. orientation

It is an optional parameter to define the orientation of the output file’s pages. The users can set it to “Portrait” or “Landscape”. By default, it is set to “Portrait.”

8. printBackground

It is an optional parameter to disable or enable background printing. It is set to “true” by default.

9. DoNotWaitFullLoad

It is an optional parameter that tells the API not to wait for the page to load fully (for example, for large images and videos), which helps to manage the total conversion time. It is set to “false” by default.

10. profiles

It is an optional parameter, a string that allows the users to set custom configurations.

11. header

It is an optional parameter to set the header of the output PDF file (every page of the file). The users can use HTML elements to design the header.

12. footer

It is an optional parameter to set the footer of the output PDF file (every page of the file). This parameter accepts HTML to apply at the end of the pages.

Note

The top and bottom margins are important when setting the header and footer of the page, because the header and footer may otherwise overlap with the page content and make it unreadable.
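As an illustration of how the parameters above fit together, here is a minimal sketch that assembles the form fields for the endpoint. The helper names buildFormData and headerOrFooterNeedsMargins are hypothetical and not part of PDF.co, and the margin values and header HTML are arbitrary examples. The helper only checks that “url” is present, leaves out unset parameters so that the defaults described above apply, and converts values to strings because form fields are sent as text.

```javascript
// Hypothetical helper (not part of PDF.co): collects the endpoint parameters
// into an object suitable for the formData option of the request package.
function buildFormData(params) {
  if (!params || typeof params.url !== 'string' || params.url === '') {
    throw new Error('The "url" parameter is required.');
  }
  const form = {};
  for (const [key, value] of Object.entries(params)) {
    // Skip unset values so the API defaults apply.
    if (value === undefined || value === null) continue;
    // Form fields are sent as text, so booleans become 'true' / 'false'.
    form[key] = String(value);
  }
  return form;
}

// Hypothetical check for the note above: a header or footer without explicit
// margins may overlap the page content.
function headerOrFooterNeedsMargins(form) {
  return Boolean(form.header || form.footer) && !form.margins;
}

const form = buildFormData({
  url: 'https://en.wikipedia.org/wiki/Main_Page',
  paperSize: 'A4',
  orientation: 'Landscape',
  printBackground: false,
  // Illustrative values only: top, right, bottom, left.
  margins: '40px, 20px, 40px, 20px',
  header: "<div style='width:100%;font-size:10px'>Sample header</div>"
});

console.log(form);
console.log('Margins missing:', headerOrFooterNeedsMargins(form));
```

The resulting object can be passed as the formData option in the request examples later in this tutorial.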

How to Inject Printing Values into Header and Footer

The users can use the following classes to inject the printing values into the header and footer:

1. date

This class prints the formatted date.

2. title

This class prints the document title.

3. url

This class prints the document location.

4. pageNumber

This class prints the current page number of the document.

5. totalPages

This class prints the total pages in the document.

Note

The users can read more about the classes and see sample examples here.
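To make the classes above concrete, here is a minimal sketch that builds header and footer HTML strings with them. The helper name printValue and the inline styles are assumptions chosen for illustration; only the five class names come from the list above.

```javascript
// Class names taken from the list above.
const PRINT_CLASSES = ['date', 'title', 'url', 'pageNumber', 'totalPages'];

// Hypothetical helper: returns an empty span carrying one of the classes.
// The printing value is injected into this span when the PDF is generated.
function printValue(className) {
  if (!PRINT_CLASSES.includes(className)) {
    throw new Error('Unknown printing class: ' + className);
  }
  return "<span class='" + className + "'></span>";
}

// Header with the document title and the date, centered.
const header =
  "<div style='width:100%;font-size:10px;text-align:center'>" +
  printValue('title') + ' (' + printValue('date') + ')' +
  '</div>';

// Footer with "Page N of NN", aligned to the right.
const footer =
  "<div style='width:100%;font-size:10px;text-align:right'>" +
  'Page ' + printValue('pageNumber') + ' of ' + printValue('totalPages') +
  '</div>';

console.log(header);
console.log(footer);
```

These strings can then be sent as the “header” and “footer” parameters of the endpoint.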

Scrape Web Pages and Generate PDF using JavaScript

The following source code shows how to generate PDF files from a web URL using the PDF.co web API. This code takes the URL of Wikipedia’s main page and converts it to a PDF file. Moreover, the parameter “DoNotWaitFullLoad” is set to “true” to reduce the time the API needs to convert the whole web page to a PDF file. Finally, the users can see the output URL in their terminal and use that link to view the resulting file, or use the Node.js file system module to download the file contents and store them on local storage as a PDF file.

Source Code for Web Scraping

Below is the sample code to generate PDF from a URL:

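// Requires the "request" package: npm install request
const request = require('request');

// Replace with your PDF.co API key.
const API_KEY = '******************************';

const options = {
  method: 'POST',
  url: 'https://api.pdf.co/v1/pdf/convert/from/url',
  headers: {
    'x-api-key': API_KEY
  },
  formData: {
    // Web page to convert.
    url: 'https://en.wikipedia.org/wiki/Main_Page',
    // Name of the output file.
    name: 'result.pdf',
    // Skip waiting for the full page load to shorten the conversion time.
    DoNotWaitFullLoad: 'true'
  }
};

request(options, function (error, response) {
  if (error) throw error;
  // The response body contains the link to the generated PDF file.
  console.log(response.body);
});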
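The code above prints the raw response body. Here is a minimal sketch of reading the output link from it, assuming the body is a JSON string with “error”, “message”, and “url” fields. The helper name extractPdfUrl and the sample body are made up for illustration, so compare the field names with the response you actually receive.

```javascript
// Hypothetical helper: reads the output link from the API response body.
// Assumption: the body is JSON with "error", "message", and "url" fields.
function extractPdfUrl(body) {
  const result = typeof body === 'string' ? JSON.parse(body) : body;
  if (result.error) {
    throw new Error(result.message || 'The API reported an error.');
  }
  if (!result.url) {
    throw new Error('The response does not contain an output URL.');
  }
  return result.url;
}

// Made-up sample body, used only to show the helper in action.
const sampleBody = '{"error": false, "url": "https://example.com/result.pdf"}';
console.log(extractPdfUrl(sampleBody));
```

Inside the request callback, the same helper could be called as extractPdfUrl(response.body).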

PDF File Output

Below are the screenshots of the code output and the output file obtained in the API response:

Code to Scrape Web Pages into PDFs
PDF Output for Web Page Scraping

Sample Header and Footer

Below is the source code containing the header and footer parameters in addition to the parameters used above. The header HTML writes “LEFT SUBHEADER” and “RIGHT SUBHEADER” at the top left and top right of every page, respectively. Similarly, the footer HTML uses the classes “pageNumber” and “totalPages” to write “Page N of NN” at the bottom right of each page. Moreover, the users can see from the coding example and the screenshot that they can style these spans and divs as they do when writing regular HTML.

Code Snippet

// Requires the "request" package: npm install request
const request = require('request');

// Replace with your PDF.co API key.
const API_KEY = '******************';

const options = {
  method: 'POST',
  url: 'https://api.pdf.co/v1/pdf/convert/from/url',
  headers: {
    'x-api-key': API_KEY
  },
  formData: {
    url: 'https://en.wikipedia.org/wiki/Main_Page',
    name: 'result.pdf',
    DoNotWaitFullLoad: 'true',
    // Header HTML: one label on the left and one on the right of every page.
    header: "<div style='width:100%'><span style='font-size:10px;margin-left:20px;width:50%;float:left'>LEFT SUBHEADER</span><span style='font-size:8px;width:30%;float:right'>RIGHT SUBHEADER</span></div>",
    // Footer HTML: "Page N of NN" at the bottom right of every page.
    footer: "<div style='width:100%;text-align:right'><span style='font-size:10px;margin-right:20px'>Page <span class='pageNumber'></span> of <span class='totalPages'></span>.</span></div>"
  }
};

request(options, function (error, response) {
  if (error) throw error;
  // The response body contains the link to the generated PDF file.
  console.log(response.body);
});

Final PDF Output

Below are the screenshots of the code output and the output file obtained in the API response:

PDF from URL - Output
Generate PDF from URL in JavaScript Code

You can find an advanced example with the file downloading functions here.
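As a smaller illustration of the download step, here is a minimal sketch that saves a file from an HTTPS link to local storage using only built-in Node.js modules. The function name downloadFile and the file names in the usage comment are hypothetical. It assumes the output link is a plain HTTPS URL that can be fetched without extra headers and does not redirect.

```javascript
const fs = require('fs');
const https = require('https');
const { pipeline } = require('stream');

// Hypothetical helper: streams the file at fileUrl into destinationPath and
// resolves with the path once the file has been written completely.
function downloadFile(fileUrl, destinationPath) {
  return new Promise((resolve, reject) => {
    https
      .get(fileUrl, (res) => {
        if (res.statusCode !== 200) {
          res.resume(); // discard the body so the socket is released
          reject(new Error('Download failed with status ' + res.statusCode));
          return;
        }
        // pipeline closes both streams and reports errors from either side.
        pipeline(res, fs.createWriteStream(destinationPath), (err) => {
          if (err) reject(err);
          else resolve(destinationPath);
        });
      })
      .on('error', reject);
  });
}

// Usage (not executed here), with pdfUrl taken from the API response:
// downloadFile(pdfUrl, './result.pdf')
//   .then((savedPath) => console.log('Saved to', savedPath))
//   .catch((err) => console.error(err.message));
```

Streaming the response into the file avoids holding the whole PDF in memory, which matters for large pages.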
