How to Convert PDF to JSON from URL in JavaScript (Node.js) Using PDF.co Web API

JavaScript Object Notation (JSON) is a very popular way to transfer and store data. JSON is effectively a standard format for sharing or storing data across systems, irrespective of operating system, programming language, or database. Almost all programming languages and databases support JSON, either natively or through external packages.

PDF (Portable Document Format), on the other hand, is a standard for documents. A large share of reports and other documents is generated in PDF format.

A great deal of data is stored in PDF format, and extracting that information and converting it to JSON is very helpful for further processing. PDF.co provides the API endpoint /v1/pdf/convert/to/json for this purpose. This endpoint converts PDF to JSON. The source PDF can be either a URL pointing to a PDF file or the PDF file itself; both are supported.

In this article, we’ll demonstrate a simple scenario: converting the data in an invoice PDF file to JSON. You can also visit our GitHub repository to get this source code, along with source code in other programming languages.

PDF.co Web API is a REST API that provides a set of data extraction functions and tools for document manipulation, splitting, and merging of PDF files. It includes built-in OCR and image recognition, and can generate and read barcodes from images, scans, and PDFs.

app.js

      
var https = require("https");
var path = require("path");
var fs = require("fs");

// The authentication key (API Key).
// Get your own by registering at https://app.pdf.co/documentation/api
const API_KEY = "***********************************";

// Direct URL of source PDF file.
const SourceFileUrl = "https://bytescout-com.s3.amazonaws.com/files/demo-files/cloud-api/pdf-to-json/sample.pdf";
// Comma-separated list of page indices (or ranges) to process. Leave empty for all pages. Example: '0,2-5,7-'.
const Pages = "";
// PDF document password. Leave empty for unprotected documents.
const Password = "";
// Destination JSON file name
const DestinationFile = "./result.json";

// Prepare request to `PDF To JSON` API endpoint
var queryPath = `/v1/pdf/convert/to/json`;

// JSON payload for api request
var jsonPayload = JSON.stringify({
    name: path.basename(DestinationFile), password: Password, pages: Pages, url: SourceFileUrl
});

var reqOptions = {
    host: "api.pdf.co",
    method: "POST",
    path: queryPath,
    headers: {
        "x-api-key": API_KEY,
        "Content-Type": "application/json",
        "Content-Length": Buffer.byteLength(jsonPayload, 'utf8')
    }
};

// Send request
var postRequest = https.request(reqOptions, (response) => {
    response.on("data", (d) => {
        // Parse JSON response
        var data = JSON.parse(d);
        if (data.error == false) {
            // Download JSON file
            var file = fs.createWriteStream(DestinationFile);
            https.get(data.url, (response2) => {
                response2.pipe(file)
                .on("close", () => {
                    console.log(`Generated JSON file saved as "${DestinationFile}" file.`);
                });
            });
        }
        else {
            // Service reported error
            console.log(data.message);
        }
    });
}).on("error", (e) => {
    // Request error
    console.log(e);
});

// Write request data
postRequest.write(jsonPayload);
postRequest.end();

package.json

      
{
    "name": "test",
    "version": "1.0.0",
    "description": "PDF.co",
    "main": "app.js",
    "scripts": {
    },
    "keywords": [
        "pdf.co",
        "web",
        "api",
        "bytescout",
        "api"
    ],
    "author": "ByteScout & PDF.co",
    "license": "ISC",
    "dependencies": {
        "request": "^2.88.2"
    }
}

Output

Now that we’ve seen the source code, let’s walk through it.

First, we reference the modules used in this article: https, path, and fs. Next, we create the variables needed to build the PDF.co request, as described below.

API_KEY: Stores the PDF.co API key, which authenticates the request at the PDF.co server. It is passed in the request header under the key x-api-key.
SourceFileUrl: The direct URL of the source PDF file, which is an invoice in this case.
Pages: A comma-separated list of page indices (or ranges) whose data should be extracted into JSON, such as "0,2-5,7-". Leave this field empty to extract data from all pages.
Password: The document password, for password-protected files. Leave it empty for unprotected documents.
DestinationFile: The path to the destination JSON file. After the PDF is converted, the output JSON is stored at this location.
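
The Pages value follows the format shown in the app.js comment ('0,2-5,7-'). As a rough sketch, a small check like the one below could catch malformed values before a request is sent. The helper name isValidPagesSpec is hypothetical, and the accepted grammar (a single index, a closed range such as 2-5, or an open range such as 7-) is inferred from that one example rather than taken from the PDF.co documentation.

```javascript
// Hypothetical helper: checks a Pages string such as "0,2-5,7-".
// Grammar assumed from the example in app.js: index | start-end | start-
function isValidPagesSpec(spec) {
    if (spec === "") return true; // empty means "all pages"
    const part = /^\d+(-\d*)?$/;
    return spec.split(",").every((p) => part.test(p.trim()));
}

console.log(isValidPagesSpec("0,2-5,7-")); // true
console.log(isValidPagesSpec("2-5-7"));    // false
```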

The PDF.co API endpoint /v1/pdf/convert/to/json converts PDF to JSON format. We create the PDF.co request with the following JSON payload.

// JSON payload for api request
var jsonPayload = JSON.stringify({
    name: path.basename(DestinationFile), password: Password, pages: Pages, url: SourceFileUrl
});

Next, we create the request options, providing the API endpoint path as well as header information such as the PDF.co API key and the request content type and length.

var reqOptions = {
    host: "api.pdf.co",
    method: "POST",
    path: queryPath,
    headers: {
        "x-api-key": API_KEY,
        "Content-Type": "application/json",
        "Content-Length": Buffer.byteLength(jsonPayload, 'utf8')
    }
};
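
The Content-Length header above is computed with Buffer.byteLength rather than the string's length property. A short sketch of why that matters: the header counts bytes, and any non-ASCII character in the payload takes more than one byte in UTF-8. The file name used here is only an illustrative value.

```javascript
// Content-Length counts bytes, not characters, so non-ASCII text
// in the payload makes the two values differ.
const samplePayload = JSON.stringify({ name: "r\u00e9sum\u00e9.json" }); // "résumé.json"

console.log(samplePayload.length);                     // number of characters
console.log(Buffer.byteLength(samplePayload, "utf8")); // number of bytes (larger here)
```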

With this information, we execute the request against the PDF.co server. Once the response is received, we check that it reports no error and then download the generated JSON from the URL given in the response (data.url), saving it to the destination location (provided in DestinationFile).

// Send request
var postRequest = https.request(reqOptions, (response) => {
    response.on("data", (d) => {
        // Parse JSON response
        var data = JSON.parse(d);        
        if (data.error == false) {
            // Download JSON file
            var file = fs.createWriteStream(DestinationFile);
            https.get(data.url, (response2) => {
                response2.pipe(file)
                .on("close", () => {
                    console.log(`Generated JSON file saved as "${DestinationFile}" file.`);
                });
            });
        }
        else {
            // Service reported error
            console.log(data.message);
        }
    });
}).on("error", (e) => {
    // Request error
    console.log(e);
});

// Write request data
postRequest.write(jsonPayload);
postRequest.end();
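
One caveat with this listing: Node.js may deliver a response body across several "data" events, so parsing each chunk on its own can fail on larger responses. A minimal sketch of collecting the whole body before parsing is shown below. readJsonBody is a hypothetical helper name and is not part of the PDF.co sample.

```javascript
// Hypothetical helper: buffers a readable stream (such as the `response`
// object passed to the https.request callback) and parses it once at the end.
function readJsonBody(stream) {
    return new Promise((resolve, reject) => {
        const chunks = [];
        stream.on("data", (chunk) => chunks.push(Buffer.from(chunk)));
        stream.on("error", reject);
        stream.on("end", () => {
            try {
                resolve(JSON.parse(Buffer.concat(chunks).toString("utf8")));
            } catch (err) {
                reject(err); // body was not valid JSON
            }
        });
    });
}
```

Inside the request callback, a call such as readJsonBody(response).then((data) => { ... }) could take the place of the response.on("data", ...) handler, with the rest of the logic unchanged.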

The following is a fragment of the generated JSON.

"text": {
    "@fontName": "Arial",
    "@fontSize": "24.0",
    "@fontStyle": "Bold",
    "@color": "#538DD3",
    "@x": "36.00",
    "@y": "34.44",
    "@width": "242.81",
    "@height": "24.00",
    "#text": "Your Company Name"
}

As we can see, the generated JSON contains the extracted text in the #text node. The other nodes hold the following additional data.

@fontName, @fontSize, @fontStyle: Font information such as font name, font size, and font style.
@color: Text color information.
@x, @y, @width, @height: Text coordinate information, namely the X and Y coordinates and the width and height of the text block.
#text: The extracted text.
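
To pull just the text out of a result file, one possible approach is to walk the parsed JSON and gather every #text value. The sketch below does that. Only a fragment of the output is shown in this article, so the walker deliberately assumes nothing about the surrounding structure; collectText is a hypothetical helper name.

```javascript
// Recursively collects every "#text" value from a parsed result,
// without assuming anything about the nesting around it.
function collectText(node, out = []) {
    if (Array.isArray(node)) {
        node.forEach((item) => collectText(item, out));
    } else if (node !== null && typeof node === "object") {
        for (const [key, value] of Object.entries(node)) {
            if (key === "#text" && typeof value === "string") out.push(value);
            else collectText(value, out);
        }
    }
    return out;
}

// A shortened copy of the fragment above, wrapped in braces to make it a valid object.
const sample = { text: { "@fontName": "Arial", "@x": "36.00", "#text": "Your Company Name" } };
console.log(collectText(sample)); // [ 'Your Company Name' ]
```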

In this sample, we’re using very basic request information. However, depending on our requirements, we can customize the request with additional parameters such as those listed below. Please refer to the PDF.co documentation for the full list of available options.

rect: Defines the coordinates of the region to extract JSON data from, for example 51.8, 114.8, 235.5, 204.0. We can use the free PDF.co online tool PDF Viewer to get the coordinates easily.
async: Processing a big input PDF might take time, and there is a chance that the request times out. In such scenarios, we can opt for asynchronous mode by specifying this request parameter. When a request is processed in async mode, we receive a job in the response. By checking this job (/job/check), we can get the status of the JSON conversion process.
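
As an illustration of these two parameters, here is a hedged sketch of a payload that uses them, together with a generic polling loop. Several details are assumptions not confirmed in this article: that async is sent as a boolean, that rect is sent as a string in the form shown in the example, and that the job check response carries a status field whose value is "working" while the job is still running. waitForJob and checkJob are hypothetical names; checkJob stands for whatever function you write to call /job/check.

```javascript
// Payload with the two optional parameters described above.
// Assumption: `async` is a boolean and `rect` is a string as in the example.
const asyncPayload = JSON.stringify({
    url: "https://bytescout-com.s3.amazonaws.com/files/demo-files/cloud-api/pdf-to-json/sample.pdf",
    rect: "51.8, 114.8, 235.5, 204.0",
    async: true
});

// Hypothetical polling loop. `checkJob` is supplied by the caller and is
// expected to call /job/check and resolve with the parsed response.
// The `status` field and the "working" value are assumptions.
async function waitForJob(checkJob, delayMs = 3000, maxAttempts = 20) {
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
        const job = await checkJob();
        if (job.status !== "working") return job; // finished or failed
        await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
    throw new Error("Job did not finish in time");
}
```

Passing the check function in as an argument keeps the loop independent of how the HTTP call is made, which also makes it easy to try out with a stub before wiring it to the real endpoint.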

That’s all, guys! That’s how easy and efficient it is to convert PDF to JSON with the PDF.co API. Try running this code on your machine to get more familiar with PDF.co.

Thank you for reading!
