How to Convert PDF to Text from a URL in Node.js (JavaScript) Using the PDF.co Web API

Huge volumes of text are locked inside PDF documents, whether they are reports or other plain documents. Extracting that text is often the first step in further data processing. The use cases are endless: filling out a simple form, storing data in a database, or feeding a next-generation AI-based solution. Extracting text from PDF makes the content far more usable when building data solutions.

We might ask what is so difficult about extracting text from a PDF, since the text is right there in front of our eyes. Well, PDF was originally built for printing, with the aim that a document prints identically from any operating system. To make that happen, PDF stores data in a complex format: it uses embedded fonts, custom spacing, and more to preserve formatting. Sometimes what we see as simple text in a PDF is actually drawn with an embedded font, and challenges like that make it difficult to extract text from PDF.

We have created a set of videos for anyone who wants a brief look at the internals of PDF. Please check out these videos!

PDF.co provides a cloud-based solution for extracting text from PDF. With the PDF.co endpoint /v1/pdf/convert/to/text, we can extract text from a PDF document with very little effort.

In this article, we’ll review a sample program written in Node.js that extracts text from a PDF. For demonstration purposes, we’re using a sample invoice PDF URL as input.

You can also get this program at this location, and you can find the same implementation in other languages at this location.

PDF.co Web API is a REST API that provides data extraction functions and tools for document manipulation, including splitting and merging PDF files. It includes built-in OCR and image recognition, and it can generate and read barcodes from images, scans, and PDFs.

Let’s review the source code and its output first, then we’ll analyze the code a bit.

app.js

      
var https = require("https");
var path = require("path");
var fs = require("fs");

// The authentication key (API Key).
// Get your own by registering at https://app.pdf.co/documentation/api
const API_KEY = "***********************************";

// Direct URL of source PDF file.
const SourceFileUrl = "https://bytescout-com.s3.amazonaws.com/files/demo-files/cloud-api/pdf-to-text/sample.pdf";
// Comma-separated list of page indices (or ranges) to process. Leave empty for all pages. Example: '0,2-5,7-'.
const Pages = "";
// PDF document password. Leave empty for unprotected documents.
const Password = "";
// Destination TXT file name
const DestinationFile = "./result.txt";

// Prepare request to `PDF To Text` API endpoint
var queryPath = `/v1/pdf/convert/to/text`;

// JSON payload for api request
var jsonPayload = JSON.stringify({
    name: path.basename(DestinationFile), password: Password, pages: Pages, url: SourceFileUrl
});

var reqOptions = {
    host: "api.pdf.co",
    method: "POST",
    path: queryPath,
    headers: {
        "x-api-key": API_KEY,
        "Content-Type": "application/json",
        "Content-Length": Buffer.byteLength(jsonPayload, 'utf8')
    }
};

// Send request
var postRequest = https.request(reqOptions, (response) => {
    response.on("data", (d) => {
        // Parse JSON response
        var data = JSON.parse(d);
        if (data.error == false) {
            // Download TXT file
            var file = fs.createWriteStream(DestinationFile);
            https.get(data.url, (response2) => {
                response2.pipe(file)
                .on("close", () => {
                    console.log(`Generated TXT file saved as "${DestinationFile}" file.`);
                });
            });
        }
        else {
            // Service reported error
            console.log(data.message);
        }
    });
}).on("error", (e) => {
    // Request error
    console.log(e);
});

// Write request data
postRequest.write(jsonPayload);
postRequest.end();

package.json

      
{
  "name": "test",
  "version": "1.0.0",
  "description": "PDF.co",
  "main": "app.js",
  "scripts": {
  },
  "keywords": [
    "pdf.co",
    "web",
    "api",
    "bytescout",
    "api"
  ],
  "author": "ByteScout & PDF.co",
  "license": "ISC",
  "dependencies": {
    "request": "^2.88.2"
  }
}

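Output

When the request succeeds, the program saves the extracted text to result.txt and prints the following message to the console:

Generated TXT file saved as "./result.txt" file.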

First, we’re loading the modules this Node.js program needs: https, path, and fs.

var https = require("https");
var path = require("path");
var fs = require("fs");

Next, we’re declaring and assigning the variables used in this program. The following table summarizes them.

| Variable | Description |
| --- | --- |
| API_KEY | PDF.co API key. This key authenticates requests at the PDF.co server. |
| SourceFileUrl | The direct URL of the source PDF file. |
| Pages | The pages whose text should be extracted. Pages are given as a comma-separated list of zero-based indices or ranges, for example "0,2-5,7-". Leave this input empty to convert the whole PDF to text. |
| Password | The document password. Needed only if the input PDF is password protected; leave it empty otherwise. |
| DestinationFile | The destination file path for the output. The resulting text document is stored at this location. |
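
To make the Pages format more concrete, here is a small sketch that expands a specification such as "0,2-5,7-" into explicit zero-based page indices. The helper expandPages and its pageCount argument are hypothetical and exist only for illustration; PDF.co interprets the pages string on the server, so the sample program never needs to do this itself.

```javascript
// Hypothetical helper (not part of PDF.co): expands a pages string such as
// "0,2-5,7-" into zero-based page indices for a document with pageCount pages.
function expandPages(spec, pageCount) {
    // An empty specification means "all pages".
    if (spec.trim() === "") {
        return Array.from({ length: pageCount }, (_, i) => i);
    }
    const indices = [];
    for (const part of spec.split(",")) {
        const [startText, endText] = part.trim().split("-");
        const start = parseInt(startText, 10);
        let end;
        if (endText === undefined) {
            end = start;                   // single page, e.g. "0"
        } else if (endText === "") {
            end = pageCount - 1;           // open range, e.g. "7-"
        } else {
            end = parseInt(endText, 10);   // closed range, e.g. "2-5"
        }
        // Clamp the range to the last page of the document.
        for (let i = start; i <= Math.min(end, pageCount - 1); i++) {
            indices.push(i);
        }
    }
    return indices;
}

// For a 10-page document, "7-" runs through the last page (index 9).
console.log(expandPages("0,2-5,7-", 10).join(", ")); // prints "0, 2, 3, 4, 5, 7, 8, 9"
```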

The PDF.co endpoint /v1/pdf/convert/to/text performs the PDF to text conversion. The input data is sent in JSON format. We’re also building the request headers with the API key (x-api-key) and content-related headers such as Content-Type and Content-Length.

// Prepare request to `PDF To Text` API endpoint
var queryPath = `/v1/pdf/convert/to/text`;

// JSON payload for api request
var jsonPayload = JSON.stringify({
    name: path.basename(DestinationFile), password: Password, pages: Pages, url: SourceFileUrl
});

var reqOptions = {
    host: "api.pdf.co",
    method: "POST",
    path: queryPath,
    headers: {
        "x-api-key": API_KEY,
        "Content-Type": "application/json",
        "Content-Length": Buffer.byteLength(jsonPayload, 'utf8')
    }
};
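
One detail worth a closer look is the Content-Length header: it counts bytes, not characters, which is presumably why the sample uses Buffer.byteLength rather than jsonPayload.length. The sketch below shows the difference. The file name in it is a made-up value, chosen only because it contains non-ASCII characters.

```javascript
// Illustration only: the file name is a made-up value with two non-ASCII
// characters (\u00e9 is "é"), each of which takes two bytes in UTF-8.
const samplePayload = JSON.stringify({ name: "r\u00e9sum\u00e9.txt" });

const charCount = samplePayload.length;                     // UTF-16 code units
const byteCount = Buffer.byteLength(samplePayload, "utf8"); // bytes sent on the wire

console.log(charCount, byteCount); // prints: 21 23
```

With a payload that is pure ASCII, such as the one in this sample, both numbers are the same, so the difference only shows up once file names or passwords contain other characters.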

With the request data prepared, we’re ready to execute the request and consume the output, which is what the following code snippet does.

// Send request
var postRequest = https.request(reqOptions, (response) => {
    response.on("data", (d) => {
        // Parse JSON response
        var data = JSON.parse(d);        
        if (data.error == false) {
            // Download TXT file
            var file = fs.createWriteStream(DestinationFile);
            https.get(data.url, (response2) => {
                response2.pipe(file)
                .on("close", () => {
                    console.log(`Generated TXT file saved as "${DestinationFile}" file.`);
                });
            });
        }
        else {
            // Service reported error
            console.log(data.message);
        }
    });
}).on("error", (e) => {
    // Request error
    console.log(e);
});

// Write request data
postRequest.write(jsonPayload);
postRequest.end();
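
One caveat about the snippet above: the "data" event can fire more than once for a single response, each time with only a fragment of the body, so parsing inside the handler relies on the JSON arriving in one chunk. A more defensive variant collects the chunks and parses them on the "end" event. The following is a minimal sketch of that idea, not the official sample; readJsonBody is a hypothetical helper name, and a PassThrough stream stands in for the HTTPS response so the example runs without a network call.

```javascript
const { PassThrough } = require("stream");

// Hypothetical helper: gathers every chunk of a readable stream and
// parses the complete body as JSON once the stream ends.
function readJsonBody(stream) {
    return new Promise((resolve, reject) => {
        const chunks = [];
        stream.on("data", (chunk) => chunks.push(Buffer.from(chunk)));
        stream.on("end", () => {
            try {
                resolve(JSON.parse(Buffer.concat(chunks).toString("utf8")));
            } catch (err) {
                reject(err); // the body was not valid JSON
            }
        });
        stream.on("error", reject);
    });
}

// Demo: the PassThrough stream plays the role of the HTTPS response and
// delivers a made-up JSON body in two fragments.
const fakeResponse = new PassThrough();
const parsedBody = readJsonBody(fakeResponse);
fakeResponse.write('{"error": false, "url": "https://exa');
fakeResponse.end('mple.com/result.txt"}');

parsedBody.then((data) => console.log(data.error, data.url));
```

In the sample program, the response object passed to the https.request callback is also a readable stream, so it could be handed to the same helper in place of fakeResponse.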

In this sample, we’ve demonstrated a very basic example. PDF.co has built-in support for OCR, so even if the input PDF is a scanned document, we’ll have the text extracted without any extra effort. Cool 😎, isn’t it?

We can further customize the output with additional input parameters. Some of them are shown in the table below. Please visit the PDF.co API documentation for more information.

| Parameter | Description |
| --- | --- |
| rect | Extracts text from a specific region of the PDF. The coordinates are passed as a string, for example "51.8, 114.8, 235.5, 204.0". The free PDF.co tool “PDF.co PDF Viewer” is useful for marking a region and getting its coordinates. |
| lang | Sets the OCR language for scanned PDF or image documents. The default language is eng, but we can provide other languages such as spa, deu, fra, jpn, chi_sim, chi_tra, and kor. If a scanned document contains multiple languages, we can pass a combined value such as eng+deu. |
| async | A large PDF may cause the request to time out while processing. To avoid this, we can enable asynchronous processing by setting this field to true. The output is then a job, and we can track the job status by calling the /job/check endpoint. Please refer to this sample for more information. |
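
As a rough sketch of how these parameters could be added to the request, the helper below builds a JSON payload that includes rect, lang, and async only when they are provided. The parameter names come from the table above; the buildPayload helper and the example URL are hypothetical, and the values are placeholders rather than recommended settings.

```javascript
// Hypothetical helper: builds the JSON payload for /v1/pdf/convert/to/text,
// adding the optional parameters from the table only when they are provided.
function buildPayload(url, options = {}) {
    const payload = { url: url };
    for (const key of ["rect", "lang", "async"]) {
        if (options[key] !== undefined) {
            payload[key] = options[key];
        }
    }
    return JSON.stringify(payload);
}

// Placeholder values: a made-up URL, the rect example from the table,
// a combined OCR language, and asynchronous processing switched on.
const requestBody = buildPayload("https://example.com/scanned.pdf", {
    rect: "51.8, 114.8, 235.5, 204.0",
    lang: "eng+deu",
    async: true
});

console.log(requestBody);
```

The resulting string can take the place of jsonPayload in the sample program. Leaving optional keys out entirely, instead of sending empty values, keeps the request close to the basic example when no customization is needed.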

That’s all, guys! With PDF.co, it’s that easy to extract textual information from a PDF. Please try this sample on your machine to get more out of this article.

Thank you for reading!
