How to Convert PDF to JSON from a URL in JavaScript (Node.js) with the PDF.co Web API

JavaScript Object Notation (JSON) is a very popular way to transfer data between systems as well as to store it. JSON is widely treated as a standard format for sharing or storing data across different systems, irrespective of operating system, programming language, or database format. Almost all programming languages and databases support JSON, either natively or through external packages.

PDF (Portable Document Format), on the other hand, is a standard for documents. A large share of reports and other documents is generated in PDF format.

Large amounts of data are stored in PDF format, and extracting that information and converting it to JSON makes further processing much easier. PDF.co provides the API endpoint /v1/pdf/convert/to/json for this purpose. This endpoint converts PDF to JSON; the source PDF can be either a URL pointing to a PDF file or a PDF file supplied directly, and both are supported.

In this article we’ll demonstrate a simple scenario: converting the data in an invoice PDF file to JSON format. You can also visit our GitHub repository to get this source code along with source code in other programming languages.

PDF.co Web API is a REST API that provides a set of data extraction functions and tools for document manipulation, splitting, and merging of PDF files. It includes built-in OCR and image recognition, and can generate and read barcodes from images, scans, and PDFs.


app.js

      
var https = require("https");
var path = require("path");
var fs = require("fs");

// The authentication key (API Key).
// Get your own by registering at https://app.pdf.co/documentation/api
const API_KEY = "***********************************";

// Direct URL of source PDF file.
const SourceFileUrl = "https://bytescout-com.s3.amazonaws.com/files/demo-files/cloud-api/pdf-to-json/sample.pdf";
// Comma-separated list of page indices (or ranges) to process. Leave empty for all pages. Example: '0,2-5,7-'.
const Pages = "";
// PDF document password. Leave empty for unprotected documents.
const Password = "";
// Destination JSON file name
const DestinationFile = "./result.json";

// Prepare request to `PDF To JSON` API endpoint
var queryPath = `/v1/pdf/convert/to/json`;

// JSON payload for api request
var jsonPayload = JSON.stringify({
    name: path.basename(DestinationFile), password: Password, pages: Pages, url: SourceFileUrl
});

var reqOptions = {
    host: "api.pdf.co",
    method: "POST",
    path: queryPath,
    headers: {
        "x-api-key": API_KEY,
        "Content-Type": "application/json",
        "Content-Length": Buffer.byteLength(jsonPayload, 'utf8')
    }
};

// Send request
var postRequest = https.request(reqOptions, (response) => {
    response.on("data", (d) => {
        // Parse JSON response
        var data = JSON.parse(d);
        if (data.error == false) {
            // Download JSON file
            var file = fs.createWriteStream(DestinationFile);
            https.get(data.url, (response2) => {
                response2.pipe(file)
                .on("close", () => {
                    console.log(`Generated JSON file saved as "${DestinationFile}" file.`);
                });
            });
        }
        else {
            // Service reported error
            console.log(data.message);
        }
    });
}).on("error", (e) => {
    // Request error
    console.log(e);
});

// Write request data
postRequest.write(jsonPayload);
postRequest.end();

package.json

      
{
    "name": "test",
    "version": "1.0.0",
    "description": "PDF.co",
    "main": "app.js",
    "scripts": { },
    "keywords": [
        "pdf.co",
        "web",
        "api",
        "bytescout",
        "api"
    ],
    "author": "ByteScout & PDF.co",
    "license": "ISC",
    "dependencies": {
        "request": "^2.88.2"
    }
}

Output

Now that we’ve seen the source code and its output, let’s walk through the code step by step.

  1. Variables to Create PDF.co Request
  2. Create the Request Options
  3. Execute the Request
  4. Customize Data with Additional Parameters
  5. Video Guide

1. Variables to Create PDF.co Request

First, we import the modules used in this article: https, path, and fs. Next, we create the variables needed to build the PDF.co request, as described below.

API_KEY: Stores the PDF.co API key, which authenticates the request at the PDF.co server. It is passed in the request header under the key x-api-key.
SourceFileUrl: Stores the direct URL of the source PDF file, which is the invoice file in this case.
Pages: Holds the pages whose data should be extracted to JSON, as a comma-separated list of page indices or ranges such as “0,2-5,7-”. Leave this field empty to extract data from all pages.
Password: For password-protected files, the document password goes here. Leave it empty for unprotected documents.
DestinationFile: Holds the path to the destination JSON file. After the PDF is converted, the output JSON data is stored at this location.

The PDF.co API endpoint /v1/pdf/convert/to/json is used to convert PDF to JSON format. We’re creating the PDF.co request with the following JSON payload.

// JSON payload for api request
var jsonPayload = JSON.stringify({
    name: path.basename(DestinationFile), password: Password, pages: Pages, url: SourceFileUrl
});
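
To make the shape of the request body concrete, here is a minimal sketch that wraps the same four fields in a function and prints the serialized result. The helper name buildPayload and the example.com URL are illustrative assumptions, not part of PDF.co.

```javascript
const path = require("path");

// Builds the request body for /v1/pdf/convert/to/json.
// buildPayload is a hypothetical helper name used only for illustration.
function buildPayload(sourceFileUrl, destinationFile, pages = "", password = "") {
    return JSON.stringify({
        name: path.basename(destinationFile), // "./out/result.json" -> "result.json"
        password: password,
        pages: pages,
        url: sourceFileUrl
    });
}

const body = buildPayload("https://example.com/invoice.pdf", "./out/result.json");
console.log(body);
// {"name":"result.json","password":"","pages":"","url":"https://example.com/invoice.pdf"}
```

Note that only the file name, not the full destination path, is sent in the name field.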

2. Create the Request Options

We’re creating the request options and providing the API endpoint name as well as the header information such as the PDF.co API key and request content information.

var reqOptions = {
    host: "api.pdf.co",
    method: "POST",
    path: queryPath,
    headers: {
        "x-api-key": API_KEY,
        "Content-Type": "application/json",
        "Content-Length": Buffer.byteLength(jsonPayload, 'utf8')
    }
};
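
The Content-Length header is computed with Buffer.byteLength rather than the string's length property because the header counts bytes, not characters. The short sketch below illustrates the difference; the file names in it are made-up examples.

```javascript
// For plain ASCII the character count and the UTF-8 byte count are the same.
const asciiBody = JSON.stringify({ name: "result.json" });
console.log(asciiBody.length, Buffer.byteLength(asciiBody, "utf8")); // 22 22

// Each "\u00e9" (e with acute accent) is one character but two bytes in UTF-8,
// so using .length here would under-report the body size by two bytes.
const accentedBody = JSON.stringify({ name: "r\u00e9sum\u00e9.json" });
console.log(accentedBody.length, Buffer.byteLength(accentedBody, "utf8")); // 22 24
```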

3. Execute the Request

With this information we execute the request against the PDF.co server. Once the response is received, we check that it reports no error, then download the generated JSON file from the URL returned in the response (data.url) and save it to the destination location (provided in DestinationFile).

// Send request
var postRequest = https.request(reqOptions, (response) => {
    response.on("data", (d) => {
        // Parse JSON response
        var data = JSON.parse(d);        
        if (data.error == false) {
            // Download JSON file
            var file = fs.createWriteStream(DestinationFile);
            https.get(data.url, (response2) => {
                response2.pipe(file)
                .on("close", () => {
                    console.log(`Generated JSON file saved as "${DestinationFile}" file.`);
                });
            });
        }
        else {
            // Service reported error
            console.log(data.message);
        }
    });
}).on("error", (e) => {
    // Request error
    console.log(e);
});

// Write request data
postRequest.write(jsonPayload);
postRequest.end();
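
One caveat with the listing above: the "data" event can fire more than once for a single response, so parsing inside that handler may see only part of the JSON. A more defensive variant, sketched here as an optional alternative rather than as the article's method, collects all chunks and parses once on "end". The helper name readJsonBody is hypothetical.

```javascript
// Collects every chunk of a response stream and parses the body once it is complete.
// readJsonBody is a hypothetical helper name used only for illustration.
function readJsonBody(stream, callback) {
    const chunks = [];
    stream.on("data", (chunk) => chunks.push(Buffer.from(chunk)));
    stream.on("end", () => {
        let parsed;
        try {
            parsed = JSON.parse(Buffer.concat(chunks).toString("utf8"));
        } catch (err) {
            callback(err); // Body was not valid JSON
            return;
        }
        callback(null, parsed);
    });
    stream.on("error", (err) => callback(err));
}

// Possible use inside the request from the listing above:
// var postRequest = https.request(reqOptions, (response) => {
//     readJsonBody(response, (err, data) => {
//         if (err) { console.log(err); return; }
//         if (data.error == false) { /* download data.url as before */ }
//         else { console.log(data.message); }
//     });
// });
```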

The following is a fragment of the generated JSON.

"text": {
    "@fontName": "Arial",
    "@fontSize": "24.0",
    "@fontStyle": "Bold",
    "@color": "#538DD3",
    "@x": "36.00",
    "@y": "34.44",
    "@width": "242.81",
    "@height": "24.00",
    "#text": "Your Company Name"
}

As we can see, the generated JSON contains the extracted text in the #text node. The remaining attributes carry the following additional data.

@fontName, @fontSize, @fontStyle: Font information such as font name, font size, and font style.
@color: Text color information.
@x, @y, @width, @height: Text coordinate information: the X and Y coordinates and the width and height of the text block.
#text: The extracted text.
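
Once result.json is saved, the text can be pulled out by walking the parsed object and collecting every #text value. The sketch below does not rely on a particular nesting; the sample object is a made-up stand-in modeled on the fragment above, and the real file's structure around each text node may differ. The helper name collectText is hypothetical.

```javascript
// Recursively gathers every "#text" string found anywhere in a parsed JSON value.
// collectText is a hypothetical helper name used only for illustration.
function collectText(node, found = []) {
    if (Array.isArray(node)) {
        node.forEach((item) => collectText(item, found));
    } else if (node !== null && typeof node === "object") {
        for (const [key, value] of Object.entries(node)) {
            if (key === "#text" && typeof value === "string") {
                found.push(value);
            } else {
                collectText(value, found);
            }
        }
    }
    return found;
}

// Made-up sample; only the inner "text" object mirrors the fragment shown above.
const sample = {
    items: [
        { text: { "@fontName": "Arial", "@fontSize": "24.0", "#text": "Your Company Name" } },
        { text: { "@fontName": "Arial", "@fontSize": "11.0", "#text": "Invoice" } }
    ]
};
console.log(collectText(sample)); // [ 'Your Company Name', 'Invoice' ]

// With the real output: collectText(JSON.parse(fs.readFileSync("./result.json", "utf8")))
```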

4. Customize Data with Additional Parameters

In this sample we’re using very basic request information. However, depending on our requirements, we can customize the request with additional parameters such as the ones listed below. Please refer to the PDF.co documentation for a full list of available customizations.

rect: Defines the coordinates of the region for JSON data extraction, for example 51.8, 114.8, 235.5, 204.0. We can use the free PDF.co online tool PDF Viewer to easily get the coordinates.
async: If we’re processing a large input PDF, the conversion might take a while and there is a chance that the request times out. In these scenarios we can opt for asynchronous mode by specifying this request parameter. When a request is processed in async mode, we receive a job in the response. By checking this job (/job/check), we can get the status of the JSON conversion process.
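
The async flow boils down to a polling loop, sketched below under stated assumptions. The function waitForJob and its options are hypothetical, the actual call to the job check endpoint is left to a caller-supplied checkJob function, and the status values "working" and "success" are assumptions that should be verified against the PDF.co documentation.

```javascript
// Polls a job until it finishes. checkJob(jobId) must be supplied by the caller and
// should call the job check endpoint, resolving to the parsed JSON response.
// Assumption: the response has a "status" field that is "working" while the job runs
// and "success" when it is done. Verify these values in the PDF.co documentation.
async function waitForJob(jobId, checkJob, { intervalMs = 3000, maxAttempts = 20 } = {}) {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        const job = await checkJob(jobId);
        if (job.status === "success") {
            return job; // Finished; the caller can now download the result
        }
        if (job.status !== "working") {
            throw new Error(`Job ${jobId} ended with status "${job.status}"`);
        }
        // Still running: wait before asking again
        await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
    throw new Error(`Job ${jobId} did not finish after ${maxAttempts} checks`);
}

// Possible use, with checkJob implemented via https.request as in the main listing:
// waitForJob(jobId, checkJob).then((job) => console.log(job)).catch(console.log);
```

Passing checkJob in as a parameter keeps the loop independent of how the HTTP call is made, which also makes it easy to try out with a stub.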

That’s all! This is how easy and efficient it is to convert PDF to JSON with the PDF.co Web API. Try running this code on your machine to get more familiar with PDF.co.

Thank you for reading!


Video Guide
