How to convert PDF to HTML from a file in JavaScript (Node.js) with the PDF.co Web API

PDF.co Web API is a Web API with a set of tools for document manipulation, data conversion, data extraction, and splitting and merging of documents. It includes image recognition, built-in OCR, barcode generation, and barcode decoders that read barcodes from scans, pictures, and PDF files.

HTML is a widely used format, and we often need to build a quick web page straight from a PDF. Internally, a PDF has a very complicated structure, because the format is designed primarily for printing. When PDF to HTML conversion is needed, PDF.co’s PDF to HTML feature can be very helpful. In this article, we’ll review how to convert PDF to HTML with Node.js. The code snippet in this article is also available on our GitHub repository.

On-demand (REST Web API) version:
 Web API (on-demand version)

On-premise offline SDK for Windows:
 60 Day Free Trial (on-premise)

app.js

      
/*jshint esversion: 6 */

var https = require("https");
var path = require("path");
var fs = require("fs");

// `request` module is required for file upload.
// Use "npm install request" command to install.
var request = require("request");

// The authentication key (API Key).
// Get your own by registering at https://app.pdf.co/documentation/api
const API_KEY = "***********************************";

// Source PDF file
const SourceFile = "./sample.pdf";
// Comma-separated list of page indices (or ranges) to process. Leave empty for all pages. Example: '0,2-5,7-'.
const Pages = "";
// PDF document password. Leave empty for unprotected documents.
const Password = "";
// Destination HTML file name
const DestinationFile = "./result.html";
// Set to `true` to get simplified HTML without CSS. Default is the rich HTML keeping the document design.
const PlainHtml = 'False';
// Set to `true` if your document has the column layout like a newspaper.
const ColumnLayout = 'False';

// Prepare URL for `PDF To HTML` API endpoint
var query = `https://api.pdf.co/v1/pdf/convert/to/html`;
let reqOptions = {
    uri: query,
    headers: { "x-api-key": API_KEY },
    formData: {
        name: path.basename(DestinationFile),
        password: Password,
        pages: Pages,
        simple: PlainHtml,
        columns: ColumnLayout,
        file: fs.createReadStream(SourceFile)
    }
};

// Send request
request.post(reqOptions, function (error, response, body) {
    if (error) {
        return console.error("Error: ", error);
    }

    // Parse JSON response
    let data = JSON.parse(body);
    if (data.error == false) {
        // Download HTML file
        var file = fs.createWriteStream(DestinationFile);
        https.get(data.url, (response2) => {
            response2.pipe(file)
                .on("close", () => {
                    console.log(`Generated HTML file saved as "${DestinationFile}" file.`);
                });
        });
    }
    else {
        // Service reported error
        console.log("Error: " + data.message);
    }
});

package.json

      
{
    "name": "test",
    "version": "1.0.0",
    "description": "PDF.co",
    "main": "app.js",
    "scripts": {
    },
    "keywords": [
        "pdf.co",
        "web",
        "api",
        "bytescout",
        "api"
    ],
    "author": "ByteScout & PDF.co",
    "license": "ISC",
    "dependencies": {
        "request": "^2.88.2"
    }
}

Output

At the start of the program, we load the built-in modules https, path, and fs, as well as the external package dependency (request).

The next logical step is to prepare all the request data needed by the PDF.co endpoint. We provide placeholders for the input source file (SourceFile), the pages we want to convert to HTML (Pages), the destination file location (DestinationFile), and so on. We also specify additional parameters such as PlainHtml, which is useful when we want the output HTML without any stylesheet (CSS), and ColumnLayout, for when the input PDF has a column structure.
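Since Pages is a free-form string, it can help to check it before sending the request. The helper below is a minimal sketch, not part of the PDF.co sample: the name isValidPageList is hypothetical, and the accepted format is assumed from the comment in app.js (comma-separated indices, closed ranges such as 2-5, and open-ended ranges such as 7-). The rule that a range must not run backwards is our own assumption; the service may be more or less strict.

```javascript
// Hypothetical helper (not part of the PDF.co sample): checks a page list
// such as "0,2-5,7-" before it is sent in the `pages` form field.
// Assumed format, taken from the comment in app.js: comma-separated
// indices, closed ranges ("2-5"), or open-ended ranges ("7-").
function isValidPageList(pages) {
  if (pages === "") {
    return true; // an empty string means "all pages" in the sample
  }
  const item = /^\d+(-\d*)?$/;
  return pages.split(",").every((part) => {
    const trimmed = part.trim();
    if (!item.test(trimmed)) {
      return false;
    }
    const [start, end] = trimmed.split("-");
    // Assumption: a closed range must not end before it starts.
    return end === undefined || end === "" || Number(start) <= Number(end);
  });
}

console.log(isValidPageList("0,2-5,7-")); // true
console.log(isValidPageList("5-2"));      // false
console.log(isValidPageList("a,b"));      // false
```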

Now that all the data is prepared, we’re ready to send the request to the /v1/pdf/convert/to/html endpoint. All input data is passed as a form data collection. To pass the source PDF, we create a read stream and assign it to the file parameter. The PDF.co API key is passed in the x-api-key header to authenticate the request.
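The request package used in app.js has been deprecated by its maintainers, so the same call can also be sketched with the fetch, FormData, and Blob globals. This is a sketch under assumptions, not the official PDF.co sample: it assumes Node.js 18 or later, the function name convertPdfToHtml is hypothetical, and the endpoint, header, and field names are simply the ones used in app.js above.

```javascript
const fs = require("fs");
const path = require("path");

// Sketch only: same endpoint, header, and form field names as app.js,
// sent with the built-in fetch/FormData/Blob globals (assumes Node.js 18+).
// `convertPdfToHtml` is a hypothetical name, not a PDF.co function.
async function convertPdfToHtml(apiKey, sourceFile, fields = {}) {
  const form = new FormData();

  // Plain text fields such as name, password, pages, simple, columns.
  for (const [key, value] of Object.entries(fields)) {
    form.append(key, String(value));
  }

  // Read the PDF into memory and attach it as the `file` field.
  const pdfBytes = await fs.promises.readFile(sourceFile);
  form.append("file", new Blob([pdfBytes]), path.basename(sourceFile));

  const response = await fetch("https://api.pdf.co/v1/pdf/convert/to/html", {
    method: "POST",
    headers: { "x-api-key": apiKey },
    body: form,
  });

  // The sample reads `error`, `url`, and `message` from this JSON.
  return response.json();
}
```

A call would look like convertPdfToHtml(API_KEY, "./sample.pdf", { name: "result.html", pages: "" }), with the returned object handled the same way as data in app.js. Note that this version reads the whole file into memory instead of streaming it.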

In the response, a parameter named url contains the link to download the generated HTML document, as shown in the output image above.
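The sample calls JSON.parse directly, which throws if the body is not valid JSON. A more defensive reading of the response might look like the sketch below. The helper name parseConversionResponse and the shape of its return value are hypothetical; the fields error, url, and message are the ones app.js reads, and the example URL is a placeholder rather than a real PDF.co link.

```javascript
// Hypothetical helper: turns the raw response body into a small result object.
// The fields `error`, `url`, and `message` are the ones app.js reads.
function parseConversionResponse(body) {
  let data;
  try {
    data = JSON.parse(body);
  } catch (err) {
    return { ok: false, message: "Response is not valid JSON: " + err.message };
  }

  if (data === null || typeof data !== "object") {
    return { ok: false, message: "Unexpected response format" };
  }

  if (data.error === false && typeof data.url === "string") {
    return { ok: true, url: data.url };
  }

  return { ok: false, message: data.message || "Unknown error" };
}

// Placeholder body for illustration only.
const sample = '{"error": false, "url": "https://example.com/result.html"}';
console.log(parseConversionResponse(sample));
```

The caller can then download result.url only when result.ok is true, and log result.message otherwise.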

The PDF.co endpoint for HTML conversion accepts many parameters that let us tailor the response to our requirements. Some of the request parameters for additional settings are listed below.

rect This parameter is useful when we want to convert only a certain region of the PDF to HTML. We can specify the region coordinates in the input, such as “51.8, 114.8, 235.5, 204.0”. The PDF.co PDF Viewer is useful for easily selecting and copying coordinates.
lang This parameter sets the OCR language to be used for scanned PDF, PNG, or JPG documents. By default, eng is the language of choice. Other supported language values are spa, deu, fra, jpn, chi_sim, chi_tra, and kor.
async When this parameter is enabled, the HTML conversion runs in asynchronous mode. The response returns a job number, and its status can be checked by calling the /job/check endpoint.
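These optional parameters can be added to the same form data that app.js builds. The sketch below shows one way to merge them in. The helper name withOptionalFields is hypothetical, and sending async as the string "true" or "false" is an assumption; only the field names rect, lang, and async come from the list above.

```javascript
// Hypothetical helper: copies the base form fields and adds the optional
// `rect`, `lang`, and `async` fields only when they are provided.
function withOptionalFields(baseFields, options = {}) {
  const fields = { ...baseFields };

  if (options.rect) {
    fields.rect = options.rect; // e.g. "51.8, 114.8, 235.5, 204.0"
  }
  if (options.lang) {
    fields.lang = options.lang; // e.g. "deu"
  }
  if (options.async !== undefined) {
    // Assumption: the flag is sent as the string "true" or "false".
    fields.async = String(options.async);
  }

  return fields;
}

const fields = withOptionalFields(
  { name: "result.html", pages: "" },
  { rect: "51.8, 114.8, 235.5, 204.0", lang: "deu", async: true }
);
console.log(fields);
```

The returned object can be used as the formData value in reqOptions, with the file stream added as in app.js.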

Please refer to the PDF.co API documentation for more details.

Please try this code on your machine to get the most out of this sample. Thank you for reading!
