How to Use PDF to JSON Meta in JavaScript

This tutorial and its sample code explain how to convert a PDF document to JSON meta with the PDF to JSON functionality of the PDF.co Web API, using the JavaScript programming language. With JavaScript, the users can convert any PDF document into JSON meta while preserving the fonts, text, images, vectors, and formatting.

Features of PDF to JSON API Endpoint

The PDF.co Web API provides tools and functionalities to convert any PDF document or scanned image to JSON. The conversion preserves the formatting, the text, the vectors, and the images. The API can also auto-classify an input document before converting it to JSON: the users can utilize the document classifier endpoint to automatically detect the incoming document’s class based on keyword-based rules.

The users can define rules to identify the vendor who provided the document and to choose which template to apply according to this information. The PDF to JSON Meta functionality uses Artificial Intelligence (AI) to detect the meta styles of text objects. For example, the paragraph style includes h1 to h7, p, and small; the meta type includes text, integer, decimal, and currency; and the meta subType includes personName, companyName, and other AI-based meta types. Because it runs on AI, the meta endpoint consumes more credits and processing is slower. Therefore, the async mode is more suitable for this endpoint.

The PDF.co Web API also provides a secure platform for the users. Security is critical because the users may submit sensitive information to the platform, so the API transmits the data via an encrypted connection. The users can review the PDF.co security protocols for details.

Endpoint Parameters for PDF to JSON Meta

Following are the parameters of the PDF to JSON Meta endpoint:

  1. url: It is a required parameter. It provides the URL of the source file. The PDF.co platform supports URLs from Dropbox, Google Drive, and the built-in file storage of PDF.co.

  2. httpusername: It is an optional parameter. It provides an HTTP auth user name to access the source URL if required.

  3. httppassword: It is an optional parameter. It provides an HTTP auth password to access the source URL if required.

  4. pages: It is an optional parameter and must be a string. It provides a comma-separated list of the pages to process. The users can specify a range with a hyphen, for example, 2, 5-7, and 2- (an open-ended range that runs to the last page). Leaving the parameter empty selects all the pages.

  5. unwrap: It is an optional parameter that unwraps the lines within table cells into a single line. It takes effect when lineGrouping is enabled.

  6. rect: It is an optional parameter and must be a string. It provides coordinates for extraction. Moreover, the users can use PDF.co PDF viewer to copy coordinates easily.

  7. lang: It is an optional parameter that sets the language for OCR to use for scanned JPG, PNG, and PDF document inputs to extract text from them. The languages include eng, spa, jpn, deu, and many others.

  8. inline: It is an optional parameter. Set it to true to return the data inline in the response, or to false to return a link to an output file.

  9. lineGrouping: It is an optional parameter and must be a string. It enables grouping within the table cells.

  10. async: It is an optional parameter. Set it to true to run the process asynchronously. The response then includes a JobId that can be used to check the state of the background job. The possible states are success, failed, aborted, and working.

  11. name: It is an optional parameter and must be a string. It provides the generated output file name.

  12. expiration: It is an optional parameter that provides the expiration time for the output link.

  13. profiles: It is an optional parameter and must be a string. This parameter helps in setting additional configurations and extra options.
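
The parameters above can be assembled into a request payload before the call is made. The following is a minimal sketch, not part of the PDF.co samples: the helper names buildJsonMetaPayload and isValidPages are hypothetical, the example.com URL is a placeholder, and the pattern for the pages string is an assumption based on the examples given in the list (2, 5-7, and 2-).

```javascript
// Parameter names come from the list above; the helper names are hypothetical.
const OPTIONAL_PARAMS = [
    "httpusername", "httppassword", "unwrap", "rect", "lang", "inline",
    "lineGrouping", "async", "name", "expiration", "profiles"
];

// Assumed format: comma-separated page numbers or ranges, e.g. "2,5-7,2-".
const PAGES_PATTERN = /^\s*\d+(\s*-\s*\d*)?(\s*,\s*\d+(\s*-\s*\d*)?)*\s*$/;

function isValidPages(pages) {
    // An empty string selects all pages.
    return pages === "" || PAGES_PATTERN.test(pages);
}

function buildJsonMetaPayload(options) {
    if (!options.url) {
        throw new Error("The url parameter is required");
    }
    const pages = options.pages === undefined ? "" : String(options.pages);
    if (!isValidPages(pages)) {
        throw new Error(`Invalid pages value: "${pages}"`);
    }
    const payload = { url: options.url, pages: pages };
    // Copy only the optional parameters that the caller actually set.
    for (const name of OPTIONAL_PARAMS) {
        if (options[name] !== undefined) {
            payload[name] = options[name];
        }
    }
    return payload;
}

// Usage with a placeholder URL
const examplePayload = buildJsonMetaPayload({
    url: "https://example.com/sample.pdf",
    pages: "2, 5-7",
    lang: "eng",
    async: true
});
console.log(JSON.stringify(examplePayload));
```

Validating the pages string locally catches a malformed value before the request is sent; the API itself remains the authority on which values it accepts.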

PDF to JSON Meta using JavaScript

The following JavaScript sample code shows how to convert a PDF document or a scanned image using the PDF to JSON Meta API. The code takes a sample PDF file and uses Artificial Intelligence (AI) to detect its meta styles, such as paragraph styles. The API writes the detected meta styles to JSON and returns the result, which the code downloads and stores on local storage.

PDF to JSON Sample Code

Following is the sample code in JavaScript that uses the PDF to JSON Meta endpoint:

var https = require("https");
var path = require("path");
var fs = require("fs");

const API_KEY = "***************************************";

// Direct URL of source PDF file.
// You can also upload your own file into PDF.co and use it as url. Check "Upload File" samples for code snippets: https://github.com/bytescout/pdf-co-api-samples/tree/master/File%20Upload/   
const SourceFileUrl = "https://bytescout-com.s3.amazonaws.com/files/demo-files/cloud-api/pdf-to-json/sample.pdf";

// Comma-separated list of page indices (or ranges) to process. Leave empty for all pages. Example: '0,2-5,7-'.
const Pages = "";

// PDF document password. Leave empty for unprotected documents.
const Password = "";

// Destination JSON file name
const DestinationFile = "./result.json";


//To run the process asynchronously
const async = "true";

// Prepare request to `PDF To JSON` API endpoint
var queryPath = `/v1/pdf/convert/to/json-meta`;

// JSON payload for api request
var jsonPayload = JSON.stringify({
    name: path.basename(DestinationFile), password: Password, pages: Pages, url: SourceFileUrl, async: async
});

var reqOptions = {
    host: "api.pdf.co",
    method: "POST",
    path: queryPath,
    headers: {
        "x-api-key": API_KEY,
        "Content-Type": "application/json",
        "Content-Length": Buffer.byteLength(jsonPayload, 'utf8')
    }
};

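// Send request
var postRequest = https.request(reqOptions, (response) => {
    // The response body can arrive in several chunks, so collect them all before parsing
    var body = "";
    response.setEncoding("utf8");
    response.on("data", (chunk) => {
        body += chunk;
    });
    response.on("end", () => {
        // Parse JSON response
        var data;
        try {
            data = JSON.parse(body);
        } catch (e) {
            console.log(`Could not parse the API response: ${e.message}`);
            return;
        }
        if (data.error === false) {
            // Download JSON file.
            // Note: in async mode the output at data.url may not be ready until the background job succeeds.
            var file = fs.createWriteStream(DestinationFile);
            https.get(data.url, (response2) => {
                response2.pipe(file)
                .on("close", () => {
                    console.log(`Generated JSON file saved as "${DestinationFile}" file.`);
                });
            }).on("error", (e) => {
                // Download error
                console.log(e);
            });
        }
        else {
            // Service reported error
            console.log(data.message);
        }
    });
}).on("error", (e) => {
    // Request error
    console.log(e);
});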

// Write request data
postRequest.write(jsonPayload);
postRequest.end();
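
An HTTPS response in Node.js is delivered as a stream of chunks, so the JSON is safest to parse once the whole body has arrived. The following is a minimal sketch of a reusable helper for that step; the name readJsonBody is hypothetical and is not part of the PDF.co samples.

```javascript
const { Readable } = require("stream");

// Collect every chunk of a readable stream, then parse the full body as JSON.
async function readJsonBody(stream) {
    const chunks = [];
    for await (const chunk of stream) {
        // Chunks may be strings or Buffers; normalize them to Buffers.
        chunks.push(Buffer.from(chunk));
    }
    return JSON.parse(Buffer.concat(chunks).toString("utf8"));
}

// Usage with an in-memory stream whose JSON is split across two chunks
readJsonBody(Readable.from(['{"error":fal', 'se,"url":"x"}']))
    .then((data) => console.log(data.error, data.url));
```

The same helper could be called with the response object inside the https.request callback, since that object is also a readable stream.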

JSON File Output

[Screenshot: a small part of the resulting JSON file]

Step-by-Step Guide to Convert PDF to JSON in JS

  1. The code imports the required packages to make the API request, handle file paths, and write the result to local storage. In this scenario, the required packages are “https”, “path”, and “fs”.

  2. It then declares and initializes API_KEY, which the users can get by signing up for or logging into a PDF.co account. As the PDF.co APIs are secured, the users need this API key to make requests to them.

  3. After this, it declares and initializes the values for the request payload, which in this sample code are the source file URL, the pages to process, the file password, and the destination file name. The pages and password variables are left empty so that the users can set them to process specific pages or a password-protected file. The sample uses a PDF.co sample file as the source, which the users can view and download; it is not password-protected.

  4. It declares and initializes the “async” variable with “true” so that the request runs asynchronously. As this API runs on AI, processing can take a long time, especially for large files. Therefore, it is recommended to use async mode while calling it.

  5. It then assembles the variables into the JSON payload and sends the API request. A successful request returns JSON containing the URL of the output, which the code downloads and writes through a file stream to local storage as the result.json file, i.e., the destination file.
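
When a request runs in async mode, the background job may still be in the working state when the first response arrives. The following is a minimal sketch of a polling loop that waits for one of the final states named in the parameter list (success, failed, or aborted). The function name waitForJob, its options, and the caller-supplied checkStatus function are all hypothetical; how the job state is actually queried with the JobId should be taken from the PDF.co documentation.

```javascript
// Poll until the job leaves the "working" state or the attempts run out.
// checkStatus is a caller-supplied async function that resolves to the
// current job state: "working", "success", "failed", or "aborted".
async function waitForJob(checkStatus, { intervalMs = 3000, maxAttempts = 40 } = {}) {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        const status = await checkStatus();
        if (status !== "working") {
            return status;
        }
        // Wait before asking again so the API is not flooded with checks.
        await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
    throw new Error(`Job is still working after ${maxAttempts} checks`);
}

// Usage with a stand-in status source that succeeds on the third check
const fakeStates = ["working", "working", "success"];
waitForJob(async () => fakeStates.shift(), { intervalMs: 10 })
    .then((status) => console.log(`Final job state: ${status}`));
```

Keeping the status check as an injected function separates the retry logic from the HTTP call, so the loop can be exercised without network access and the output is downloaded only after the job reports success.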