How to Convert PDF to Text from a URL in JavaScript (Node.js) Using the PDF.co Web API

Huge amounts of text are buried in PDF documents, whether they are reports or other plain documents. Extracting that text is often the first step in further data processing. The use cases are endless: filling out a simple form, storing data in a database, or feeding a next-generation AI-based solution. Being able to extract text from PDF is extremely useful when building data solutions.

We might wonder what is so difficult about extracting text from a PDF. The text is right there in front of our eyes! Well, PDF was originally designed for printing, with the aim that a document looks the same no matter which operating system it is printed from. To make that happen, PDF stores data in a very complex format. It uses embedded fonts, custom spacing, and more to preserve formatting. Sometimes what looks like simple text in a PDF turns out to rely on an embedded font, and challenges like that make it difficult to extract text from PDF.

We have created a set of videos for anyone who wants a brief introduction to the internals of PDF. Please check out these videos!

PDF.co provides a cloud-based solution for extracting text from PDF. With the PDF.co endpoint /v1/pdf/convert/to/text, we can easily extract text from a PDF document.

In this article, we’ll review a sample program written in Node.js that extracts text from a PDF. For demonstration purposes, we’re using a sample invoice PDF URL as input.

You can also get this program from our GitHub repository at this location. The same implementation in other languages is available at this location.

PDF.co Web API: a REST API that provides a set of data extraction functions and tools for document manipulation, splitting, and merging of PDF files. It includes built-in OCR and image recognition, and it can generate and read barcodes from images, scans, and PDFs.

Let’s review the source code and its output first, then we’ll analyze the code a bit.

app.js

      
var https = require("https");
var path = require("path");
var fs = require("fs");

// The authentication key (API Key).
// Get your own by registering at https://app.pdf.co/documentation/api
const API_KEY = "***********************************";

// Direct URL of source PDF file.
const SourceFileUrl = "https://bytescout-com.s3.amazonaws.com/files/demo-files/cloud-api/pdf-to-text/sample.pdf";
// Comma-separated list of page indices (or ranges) to process. Leave empty for all pages. Example: '0,2-5,7-'.
const Pages = "";
// PDF document password. Leave empty for unprotected documents.
const Password = "";
// Destination TXT file name
const DestinationFile = "./result.txt";

// Prepare request to `PDF To Text` API endpoint
var queryPath = `/v1/pdf/convert/to/text`;

// JSON payload for api request
var jsonPayload = JSON.stringify({
    name: path.basename(DestinationFile), password: Password, pages: Pages, url: SourceFileUrl
});

var reqOptions = {
    host: "api.pdf.co",
    method: "POST",
    path: queryPath,
    headers: {
        "x-api-key": API_KEY,
        "Content-Type": "application/json",
        "Content-Length": Buffer.byteLength(jsonPayload, 'utf8')
    }
};

// Send request
var postRequest = https.request(reqOptions, (response) => {
    response.on("data", (d) => {
        // Parse JSON response
        var data = JSON.parse(d);
        if (data.error == false) {
            // Download TXT file
            var file = fs.createWriteStream(DestinationFile);
            https.get(data.url, (response2) => {
                response2.pipe(file)
                .on("close", () => {
                    console.log(`Generated TXT file saved as "${DestinationFile}" file.`);
                });
            });
        }
        else {
            // Service reported error
            console.log(data.message);
        }
    });
}).on("error", (e) => {
    // Request error
    console.log(e);
});

// Write request data
postRequest.write(jsonPayload);
postRequest.end();

package.json

      
{
    "name": "test",
    "version": "1.0.0",
    "description": "PDF.co",
    "main": "app.js",
    "scripts": {
    },
    "keywords": [
        "pdf.co",
        "web",
        "api",
        "bytescout",
        "api"
    ],
    "author": "ByteScout & PDF.co",
    "license": "ISC",
    "dependencies": {
        "request": "^2.88.2"
    }
}

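Output

When the request succeeds, the program saves the extracted text to ./result.txt and prints the following message:

Generated TXT file saved as "./result.txt" file.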

First, we load the modules this Node.js program needs: https, path, and fs.

var https = require("https");
var path = require("path");
var fs = require("fs");

Next, we declare and assign the variables used in this program. The following table summarizes them.

Variable | Description
API_KEY | The PDF.co API key. It authenticates requests to the PDF.co server.
SourceFileUrl | The direct URL of the source PDF file.
Pages | The pages to convert to text, given as a comma-separated list of zero-based page indices or ranges, for example "0,2-5,7-". Leave it empty to convert the whole PDF.
Password | The document password, needed only if the input PDF is password-protected. Leave it empty for unprotected documents.
DestinationFile | The destination path of the output file. The resulting text document is saved at this location.
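
To make the Pages syntax concrete, here is a minimal sketch that expands a selection string into explicit zero-based page indices. The expandPages helper is hypothetical and not part of the PDF.co sample, since the service interprets the string itself. The sketch assumes that an open-ended range such as "7-" runs through the last page.

```javascript
// Hypothetical helper (not part of the PDF.co sample): expands a page
// selection string such as "0,2-5,7-" into explicit zero-based indices.
// Assumption: an open-ended range like "7-" runs through the last page.
function expandPages(pages, pageCount) {
    // An empty string means "all pages".
    if (pages.trim() === "") {
        return Array.from({ length: pageCount }, (_, i) => i);
    }
    const result = [];
    for (const part of pages.split(",")) {
        const [startText, endText] = part.trim().split("-");
        const start = parseInt(startText, 10);
        let end;
        if (endText === undefined) {
            end = start;             // single page, e.g. "0"
        } else if (endText === "") {
            end = pageCount - 1;     // open-ended range, e.g. "7-"
        } else {
            end = parseInt(endText, 10);
        }
        // Never go past the last page of the document.
        for (let i = start; i <= end && i < pageCount; i++) {
            result.push(i);
        }
    }
    return result;
}

// For a 10-page document:
console.log(expandPages("0,2-5,7-", 10).join(","));   // prints "0,2,3,4,5,7,8,9"
```

Since the indices are zero-based, "0" refers to the first page of the document.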

The PDF.co endpoint /v1/pdf/convert/to/text performs the PDF-to-text conversion. The input data is sent in JSON format. We also build the request headers, which carry the API key (x-api-key) along with the content-related headers Content-Type and Content-Length.

// JSON payload for api request
var jsonPayload = JSON.stringify({
    name: path.basename(DestinationFile), password: Password, pages: Pages, url: SourceFileUrl
});

var reqOptions = {
    host: "api.pdf.co",
    method: "POST",
    path: queryPath,
    headers: {
        "x-api-key": API_KEY,
        "Content-Type": "application/json",
        "Content-Length": Buffer.byteLength(jsonPayload, 'utf8')
    }
};
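
One detail worth noting is why the sample uses Buffer.byteLength rather than the string length: Content-Length counts bytes, not characters. The sketch below illustrates the difference with a non-ASCII file name. The buildRequestOptions helper, the PDFCO_API_KEY environment variable, and the example.com URL are assumptions made for illustration and do not come from the sample.

```javascript
// Hypothetical helper that mirrors the reqOptions object from the sample.
function buildRequestOptions(apiKey, jsonPayload) {
    return {
        host: "api.pdf.co",
        method: "POST",
        path: "/v1/pdf/convert/to/text",
        headers: {
            "x-api-key": apiKey,
            "Content-Type": "application/json",
            // Content-Length must be the size of the body in bytes.
            "Content-Length": Buffer.byteLength(jsonPayload, "utf8")
        }
    };
}

// Each "é" is one character but two bytes in UTF-8.
const payload = JSON.stringify({
    name: "résumé.txt",
    url: "https://example.com/sample.pdf"   // placeholder URL
});

// Assumption: the key is read from an environment variable instead of being
// hard-coded in the source file.
const options = buildRequestOptions(process.env.PDFCO_API_KEY || "your-api-key", payload);

console.log(payload.length);                      // number of characters
console.log(options.headers["Content-Length"]);   // number of bytes, 2 more than above
```

Reading the key from the environment is one possible way to keep it out of source control; the original sample simply stores it in the API_KEY constant.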

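With all the request input ready, we can execute the request and consume the output, as the following code snippet does. The API responds with JSON; when its error field is false, its url field points to the generated text file, which we download and save to DestinationFile.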

// Send request
var postRequest = https.request(reqOptions, (response) => {
    response.on("data", (d) => {
        // Parse JSON response
        var data = JSON.parse(d);        
        if (data.error == false) {
            // Download TXT file
            var file = fs.createWriteStream(DestinationFile);
            https.get(data.url, (response2) => {
                response2.pipe(file)
                .on("close", () => {
                    console.log(`Generated TXT file saved as "${DestinationFile}" file.`);
                });
            });
        }
        else {
            // Service reported error
            console.log(data.message);
        }
    });
}).on("error", (e) => {
    // Request error
    console.log(e);
});

// Write request data
postRequest.write(jsonPayload);
postRequest.end();
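
The sample parses the JSON inside the "data" handler, which works when the whole response arrives in a single chunk. A larger response may be split across several chunks, so a more defensive variant collects all chunks and parses them once the stream ends. The following is a minimal sketch of that idea; readJsonBody is a hypothetical helper, and a PassThrough stream stands in for the real https response so that the example runs without network access.

```javascript
const { PassThrough } = require("stream");

// Hypothetical helper: collects every chunk of a response stream and parses
// the JSON only after the stream has ended.
function readJsonBody(stream) {
    return new Promise((resolve, reject) => {
        const chunks = [];
        stream.on("data", (chunk) => chunks.push(Buffer.from(chunk)));
        stream.on("end", () => {
            try {
                resolve(JSON.parse(Buffer.concat(chunks).toString("utf8")));
            } catch (err) {
                // The body was not valid JSON.
                reject(err);
            }
        });
        stream.on("error", reject);
    });
}

// Simulate a response body that arrives in two pieces.
const fakeResponse = new PassThrough();
const bodyPromise = readJsonBody(fakeResponse);
fakeResponse.write('{"error": false, "url": "https://exam');
fakeResponse.end('ple.com/result.txt"}');

bodyPromise.then((data) => {
    console.log(data.error, data.url);   // prints "false https://example.com/result.txt"
});
```

In the sample program, the response object passed to the https.request callback could be handed to such a helper in place of the "data" handler.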

This sample demonstrates a very basic example. PDF.co has built-in support for OCR, so text is also extracted from scanned PDFs without any extra effort. Cool 😎, isn’t it?

We can further customize the output with the help of additional input parameters. Some of them are shown in the table below. Please visit the PDF.co API documentation for more information.

Parameter | Description
rect | Extracts text from a specific region of the PDF. The coordinates are passed as a string, for example "51.8, 114.8, 235.5, 204.0". The free PDF.co tool “PDF.co PDF Viewer” is useful for marking a region and getting its coordinates.
lang | Sets the OCR language for scanned PDF or image documents. The default language is eng, but we can provide other languages such as spa, deu, fra, jpn, chi_sim, chi_tra, and kor. If a scanned document contains multiple languages, we can combine them, for example eng+deu.
async | A large PDF may take long enough to process that the request times out. To avoid this, we can enable asynchronous processing by setting this field to true. The response then describes a job, and we can track the job status by calling the /job/check endpoint. Please refer to this sample for more information.
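
The parameters listed above can be added to the same JSON payload the sample already sends. The sketch below shows one way to build such a payload and to wait for an asynchronous job. Both buildPayload and waitForJob are hypothetical helpers. The status values "working" and "success" are assumptions about what the job check returns, so please confirm them against the PDF.co API documentation. A stand-in check function replaces the real /job/check call so that the example runs without network access.

```javascript
// Hypothetical helper: builds a request payload and leaves out unset options.
function buildPayload(url, options = {}) {
    const payload = { url };
    for (const key of ["rect", "lang", "async"]) {
        if (options[key] !== undefined && options[key] !== "") {
            payload[key] = options[key];
        }
    }
    return payload;
}

// Hypothetical polling loop. checkJob is any async function that resolves to
// the job status; in a real program it would call the /job/check endpoint.
// Assumption: the status is "working" while the job is still running.
async function waitForJob(checkJob, intervalMs = 3000, maxAttempts = 20) {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        const status = await checkJob();
        if (status !== "working") {
            return { status, attempts: attempt };
        }
        // Wait before asking again.
        await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
    return { status: "timeout", attempts: maxAttempts };
}

// A payload for a scanned English and German document, processed asynchronously.
const asyncPayload = buildPayload("https://example.com/scan.pdf", { lang: "eng+deu", async: true });
console.log(JSON.stringify(asyncPayload));
// prints {"url":"https://example.com/scan.pdf","lang":"eng+deu","async":true}

// Stand-in for the real job check: reports "working" twice, then "success".
const statuses = ["working", "working", "success"];
const fakeCheck = async () => statuses.shift();

const jobPromise = waitForJob(fakeCheck, 10);
jobPromise.then((result) => {
    console.log(result.status, result.attempts);   // prints "success 3"
});
```

Capping the number of attempts keeps the loop from running forever if a job never leaves the working state.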

That’s all, guys! With PDF.co, it’s that easy to extract textual information from PDF. Please try this sample on your machine to get more out of this article.

Thank you for reading!
