This tutorial and its sample source code explain how to convert a PDF document to HTML using the PDF to HTML endpoint of the PDF.co Web API. The sample code is written in Java. Users can also convert scanned images to HTML while preserving the text, vectors, pictures, fonts, and formatting.
- Features of PDF to HTML API Endpoint
- PDF to HTML Endpoint Parameters
- How to Convert PDF to HTML in Java
- Step-by-Step Guide to Convert PDF to HTML
Features of PDF to HTML API Endpoint
The PDF.co Web API provides tools to convert any PDF document or scanned image to HTML. The conversion preserves the formatting, text, vectors, and images of the source. The API also offers automatic document classification: users can call the document classifier endpoint to detect an incoming document's class from keyword-based rules, and can use the same rules to identify a required vendor or template.
The PDF.co Web API also provides a secure platform. Because users may have to submit sensitive information, the API transmits data over an encrypted connection. The security protocols are described in detail on the PDF.co website.
PDF to HTML Endpoint Parameters
The PDF to HTML endpoint accepts the following parameters:
- url: Required. The URL of the source file. PDF.co supports URLs from Dropbox, Google Drive, and the built-in PDF.co file storage. The API loads the source document from this URL.
- httpusername: Optional. The HTTP auth user name for accessing the source URL, if required.
- httppassword: Optional. The HTTP auth password for accessing the source URL, if required.
- pages: Optional string. A comma-separated list of the pages to process. A page range is written with "-", for example "0,2-5,7-". Leave the parameter empty to select all pages.
- unwrap: Optional. Unwraps the lines in a table cell into a single line. It takes effect when lineGrouping is enabled.
- rect: Optional string. The coordinates of the region to extract.
- lang: Optional. The language that OCR uses to extract text from scanned JPG, PNG, and PDF inputs.
- inline: Optional. Set to true to return the result inline, or false to return a link to an output file.
- lineGrouping: Optional string. Enables line grouping within table cells.
- async: Optional. Runs the process asynchronously and returns a JobId for checking the state of the background job.
- name: Optional string. The name of the generated output file.
- expiration: Optional. The expiration time of the output link.
- profiles: Optional string. Sets additional configuration and extra options.
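As a rough illustration of how these parameters might be assembled into a request body, the sketch below builds a JSON object from a map using only the JDK. The class name PayloadSketch, its helper methods, and the example.com URL are hypothetical names introduced here. Sending inline and async as JSON booleans is an assumption, since the sample in this tutorial sends its flags as quoted strings.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of the PDF.co samples.
class PayloadSketch {

    // Escape the characters that would break a JSON string literal.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    // Render the map as a flat JSON object; null values are skipped,
    // which models leaving an optional parameter out of the request.
    static String toJson(Map<String, Object> params) {
        StringJoiner joiner = new StringJoiner(", ", "{", "}");
        for (Map.Entry<String, Object> e : params.entrySet()) {
            Object v = e.getValue();
            if (v == null) continue;
            String rendered = (v instanceof Boolean || v instanceof Number)
                    ? v.toString()                          // assumption: sent unquoted
                    : "\"" + escape(v.toString()) + "\"";
            joiner.add("\"" + escape(e.getKey()) + "\": " + rendered);
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> params = new LinkedHashMap<>();
        params.put("url", "https://example.com/sample.pdf"); // placeholder URL
        params.put("pages", "0,2-5");                         // first page plus a range
        params.put("inline", false);                          // ask for a link to the output file
        params.put("async", false);
        params.put("name", "result.html");
        System.out.println(toJson(params));
    }
}
```

Escaping the values matters mostly for name and password, which may contain quotes or backslashes that a plain String.format call would pass through unescaped.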
How to Convert PDF to HTML in Java
The following source code shows how to convert a PDF document to HTML using the PDF to HTML API endpoint. The Java sample submits the sample PDF file for classification, and the API uses Artificial Intelligence (AI) to detect data such as fonts, text, and formatting. The API then writes this data as HTML and returns it to the user as an HTML file.
Source Code in Java
The following Java sample shows how to use the PDF to HTML endpoint:
//*******************************************************************************************//
//                                                                                           //
// Download Free Evaluation Version From: https://bytescout.com/download/web-installer       //
//                                                                                           //
// Also available as Web API! Get Your Free API Key: https://app.pdf.co/signup               //
//                                                                                           //
// Copyright © 2017-2020 ByteScout, Inc. All rights reserved.                                //
// https://www.bytescout.com                                                                 //
// https://pdf.co                                                                            //
//                                                                                           //
//*******************************************************************************************//

package com.company;

import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import okhttp3.*;

import java.io.*;
import java.net.*;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Main
{
    // Get your own API Key by registering at https://app.pdf.co
    final static String API_KEY = "***************";

    // Direct URL of source PDF file.
    // You can also upload your own file into PDF.co and use it as url. Check "Upload File" samples for code snippets: https://github.com/bytescout/pdf-co-api-samples/tree/master/File%20Upload/
    final static String SourceFileUrl = "https://bytescout-com.s3.amazonaws.com/files/demo-files/cloud-api/pdf-to-html/sample.pdf";
    // Comma-separated list of page indices (or ranges) to process. Leave empty for all pages. Example: '0,2-5,7-'.
    final static String Pages = "";
    // PDF document password. Leave empty for unprotected documents.
    final static String Password = "";
    // Destination HTML file name
    final static Path DestinationFile = Paths.get(".\\result.html");
    // Set to `true` to get simplified HTML without CSS. Default is the rich HTML keeping the document design.
    final static boolean PlainHtml = false;
    // Set to `true` if your document has the column layout like a newspaper.
    final static boolean ColumnLayout = false;

    public static void main(String[] args) throws IOException
    {
        // Create HTTP client instance
        OkHttpClient webClient = new OkHttpClient();

        // Prepare URL for `PDF To HTML` API call
        String query = "https://api.pdf.co/v1/pdf/convert/to/html";

        // Make correctly escaped (encoded) URL
        URL url = null;
        try
        {
            url = new URI(null, query, null).toURL();
        }
        catch (URISyntaxException e)
        {
            e.printStackTrace();
        }

        // Create JSON payload
        String jsonPayload = String.format("{\"name\": \"%s\", \"password\": \"%s\", \"pages\": \"%s\", \"simple\": \"%s\", \"columns\": \"%s\", \"url\": \"%s\"}",
                DestinationFile.getFileName(),
                Password,
                Pages,
                PlainHtml,
                ColumnLayout,
                SourceFileUrl);

        // Prepare request body
        RequestBody body = RequestBody.create(MediaType.parse("application/json"), jsonPayload);

        // Prepare request
        Request request = new Request.Builder()
                .url(url)
                .addHeader("x-api-key", API_KEY) // (!) Set API Key
                .addHeader("Content-Type", "application/json")
                .post(body)
                .build();

        // Execute request
        Response response = webClient.newCall(request).execute();

        if (response.code() == 200)
        {
            // Parse JSON response
            JsonObject json = new JsonParser().parse(response.body().string()).getAsJsonObject();

            boolean error = json.get("error").getAsBoolean();
            if (!error)
            {
                // Get URL of generated HTML file
                String resultFileUrl = json.get("url").getAsString();

                // Download HTML file
                downloadFile(webClient, resultFileUrl, DestinationFile.toFile());

                System.out.printf("Generated HTML file saved as \"%s\" file.", DestinationFile.toString());
            }
            else
            {
                // Display service reported error
                System.out.println(json.get("message").getAsString());
            }
        }
        else
        {
            // Display request error
            System.out.println(response.code() + " " + response.message());
        }
    }

    public static void downloadFile(OkHttpClient webClient, String url, File destinationFile) throws IOException
    {
        // Prepare request
        Request request = new Request.Builder()
                .url(url)
                .build();
        // Execute request
        Response response = webClient.newCall(request).execute();

        byte[] fileBytes = response.body().bytes();

        // Save downloaded bytes to file
        OutputStream output = new FileOutputStream(destinationFile);
        output.write(fileBytes);
        output.flush();
        output.close();

        response.close();
    }
}
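The downloadFile method above reads the whole response into a byte array before writing it out. For larger results, a streaming variant may be preferable. The following is a minimal sketch of that idea using only the JDK. The class StreamingSaveSketch and the method saveTo are hypothetical names, and the in-memory stream in main stands in for a real download (with OkHttp, the stream would come from response.body().byteStream()).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of the PDF.co samples.
class StreamingSaveSketch {

    // Copy the stream to the destination file and return the number of bytes written.
    // try-with-resources closes the stream even if the copy fails part-way.
    static long saveTo(InputStream source, Path destination) throws IOException {
        try (InputStream in = source) {
            return Files.copy(in, destination, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the downloaded HTML; no network call is made here.
        byte[] fakeDownload = "<html></html>".getBytes(StandardCharsets.UTF_8);
        Path target = Files.createTempFile("result", ".html");
        long written = saveTo(new ByteArrayInputStream(fakeDownload), target);
        System.out.println(written + " bytes written to " + target);
    }
}
```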
Source PDF File
The source PDF file is the sample invoice referenced by SourceFileUrl in the code above.
Output File
The following is the generated output file in HTML:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8"/>
<title></title>
<style>
.page { background-color:white; position:relative; z-index:0; }
.vector { position:absolute; z-index:1; }
.image { position:absolute; z-index:2; }
.text { position:absolute; z-index:3; opacity:inherit; white-space:nowrap; }
.annotation { position:absolute; z-index:5; }
.control { position:absolute; z-index:10; }
.annotation2 { position:absolute; z-index:7; }
.dummyimg { vertical-align: top; border: none; }
</style>
</head>
<body style="background-color:#999999;color:#000000;">
<div id="canvas" align="center">
<!-- page 1 begin -->
<div class="page" style="width:1024.0px;height:1448.2px;">
<span style="color:#538DD3;font-size:41px;font-family:'Arial';font-weight:bold;">
<span class="text" style="left:61.9px;top:59.2px;">Your Company Name </span>
</span>
<span style="font-size:19px;font-family:'Arial';">
<span class="text" style="left:61.9px;top:132.3px;">Your Address </span>
<span class="text" style="left:61.9px;top:157.3px;">City, State Zip </span>
</span>
<span style="font-size:19px;font-family:'Arial';font-weight:bold;">
<span class="text" style="left:793.0px;top:199.4px;">Invoice No. 123456 </span>
<span class="text" style="left:750.9px;top:224.4px;">Invoice Date 01/01/2016 </span>
<span class="text" style="left:61.9px;top:266.5px;">Client Name</span>
</span>
<span style="font-size:19px;font-family:'Arial';">
<span class="text" style="left:61.9px;top:291.9px;">Address </span>
<span class="text" style="left:61.9px;top:316.9px;">City, State Zip </span>
<span class="text" style="left:61.9px;top:401.3px;">Notes </span>
</span>
<span style="font-size:19px;font-family:'Arial';font-weight:bold;">
<span class="text" style="left:61.9px;top:544.0px;">Item </span>
<span class="text" style="left:425.9px;top:544.0px;">Quantity </span>
<span class="text" style="left:686.2px;top:544.0px;">Price </span>
<span class="text" style="left:917.0px;top:544.0px;">Total </span>
</span>
<span style="font-size:19px;font-family:'Arial';">
<span class="text" style="left:61.9px;top:587.1px;">Item 1 </span>
<span class="text" style="left:492.2px;top:587.1px;">1 </span>
<span class="text" style="left:685.2px;top:587.1px;">40.00 </span>
<span class="text" style="left:915.0px;top:587.1px;">40.00 </span>
<span class="text" style="left:61.9px;top:623.4px;">Item 2 </span>
<span class="text" style="left:492.2px;top:623.4px;">2 </span>
<span class="text" style="left:685.2px;top:623.4px;">30.00 </span>
<span class="text" style="left:915.0px;top:623.4px;">60.00 </span>
<span class="text" style="left:61.9px;top:659.8px;">Item 3 </span>
<span class="text" style="left:492.2px;top:659.8px;">3 </span>
<span class="text" style="left:685.2px;top:659.8px;">20.00 </span>
<span class="text" style="left:915.0px;top:659.8px;">60.00 </span>
<span class="text" style="left:61.9px;top:696.5px;">Item 4 </span>
<span class="text" style="left:492.2px;top:696.5px;">4 </span>
<span class="text" style="left:685.2px;top:696.5px;">10.00 </span>
<span class="text" style="left:915.0px;top:696.5px;">40.00 </span>
</span>
<span style="font-size:19px;font-family:'Arial';font-weight:bold;">
<span class="text" style="left:669.3px;top:732.5px;">TOTAL </span>
<span class="text" style="left:904.5px;top:732.5px;">200.00 </span>
</span>
<div class="vector" style="left:52.0px;top:472.0px;"><img width="921" height="292" src=""/></div>
</div>
<!-- page 1 end -->
<p></p>
</div>
</body>
</html>
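In the output above, every text fragment is an absolutely positioned span with class "text" and explicit left and top offsets. As a rough sketch of how that layout could be read back, the code below groups the fragments into rows by their top offset. The class TextSpanSketch and the method rowsByTop are hypothetical, and the regular expression assumes the exact attribute order seen in this sample output; a real HTML parser would be more robust.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the PDF.co samples.
class TextSpanSketch {

    // Matches spans shaped like the ones in the sample output:
    // <span class="text" style="left:61.9px;top:587.1px;">Item 1 </span>
    private static final Pattern TEXT_SPAN = Pattern.compile(
            "<span class=\"text\" style=\"left:([0-9.]+)px;top:([0-9.]+)px;\">([^<]*)</span>");

    // Group fragments by top offset; the TreeMap keeps rows in top-to-bottom order,
    // and fragments stay in document order within each row.
    static Map<Double, List<String>> rowsByTop(String html) {
        Map<Double, List<String>> rows = new TreeMap<>();
        Matcher m = TEXT_SPAN.matcher(html);
        while (m.find()) {
            double top = Double.parseDouble(m.group(2));
            rows.computeIfAbsent(top, k -> new ArrayList<>()).add(m.group(3).trim());
        }
        return rows;
    }

    public static void main(String[] args) {
        // Two rows copied from the sample output above.
        String html =
                "<span class=\"text\" style=\"left:61.9px;top:587.1px;\">Item 1 </span>"
              + "<span class=\"text\" style=\"left:492.2px;top:587.1px;\">1 </span>"
              + "<span class=\"text\" style=\"left:61.9px;top:623.4px;\">Item 2 </span>"
              + "<span class=\"text\" style=\"left:492.2px;top:623.4px;\">2 </span>";
        rowsByTop(html).forEach((top, cells) -> System.out.println(top + " -> " + cells));
    }
}
```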
Extract PDF to HTML in Java – Demo
Below is a demonstration of the working code.
Step-by-Step Guide to Convert PDF to HTML
The following step-by-step guide explains how the source code above works:
- The code imports the packages and libraries required to make the API request and to read the file from the URL.
- It then declares and initializes API_KEY, which users obtain by signing up for or logging in to a PDF.co account. The key is required for requests to the API endpoints: it acts as an access token and must be sent in the x-api-key request header for authentication.
- Next, the user provides the API's body payload. In this sample, it consists of the destination file name, the source file URL, the file password, the pages to convert, the column-layout flag, and the plain-HTML flag. Users can substitute their own values to customize the code. In this scenario, the code uses the PDF.co sample source file.
- The sample code assembles these variables into a JSON payload and sends the API POST request.
- The response reports the error flag, status, file name, and other fields. A status of 200 indicates that the request succeeded, and the response then contains the URL of the resulting file. The code downloads the HTML from that URL and writes it through a file stream to local storage as result.html, the destination file.
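The request-building steps above can also be sketched with the java.net.http classes that ship with the JDK (Java 11 and later) instead of OkHttp. This is an alternative shown for illustration, not the client the sample uses. The class RequestSketch, the method buildRequest, the 60-second timeout, and the example.com URL are assumptions introduced here. The endpoint URL and the header names are taken from the sample code.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

// Hypothetical helper, not part of the PDF.co samples.
class RequestSketch {

    // Endpoint URL as used in the sample code above.
    static final String ENDPOINT = "https://api.pdf.co/v1/pdf/convert/to/html";

    // Build the POST request: the API key goes in the x-api-key header,
    // and the JSON payload becomes the request body.
    static HttpRequest buildRequest(String apiKey, String jsonPayload) {
        return HttpRequest.newBuilder(URI.create(ENDPOINT))
                .timeout(Duration.ofSeconds(60)) // assumed value, adjust as needed
                .header("x-api-key", apiKey)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonPayload))
                .build();
    }

    public static void main(String[] args) {
        // Building the request does not contact the server.
        HttpRequest request = buildRequest(
                "YOUR_API_KEY",
                "{\"url\": \"https://example.com/sample.pdf\"}"); // placeholder payload
        System.out.println(request.method() + " " + request.uri());
        // To send it:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

Keeping request construction separate from sending makes the headers and body easy to inspect before any call is made to the API.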