The Power of Conversion: Transform PDF to HTML with Java!

This tutorial and its sample source code show how to convert a PDF document to HTML with the PDF to HTML endpoint of the PDF.co Web API, using the Java programming language. The same endpoint can also convert scanned images to HTML while preserving the text, vectors, pictures, fonts, and formatting.

Features of PDF to HTML API Endpoint

The PDF.co Web API provides tools to convert a PDF document or scanned image to HTML while preserving the formatting, text, vectors, and images. The API also offers automatic document classification: the document classifier endpoint detects the class of an incoming document from keyword-based rules. The same rules can be used to find the vendor or template a document belongs to.

The PDF.co Web API is also built with security in mind, because users may need to submit sensitive information. The API transmits data over an encrypted connection.

PDF to HTML Endpoint Parameters

The PDF to HTML endpoint accepts the following parameters:

  1. url: Required. The URL of the source file. The PDF.co platform supports URLs from Dropbox, Google Drive, and the built-in file storage of PDF.co. The API loads the source document from this URL each time a request is made.
  2. httpusername: Optional. The HTTP auth user name for accessing the source URL, if required.
  3. httppassword: Optional. The HTTP auth password for accessing the source URL, if required.
  4. pages: Optional string. A comma-separated list of the pages to process. Use a hyphen to set a page range, as in "0,2-5,7-", or leave the parameter empty to select all pages.
  5. unwrap: Optional. Unwraps multi-line text in table cells into a single line. It applies when lineGrouping is enabled.
  6. rect: Optional string. The coordinates of the region to extract.
  7. lang: Optional. The language OCR uses to extract text from scanned JPG, PNG, and PDF inputs.
  8. inline: Optional. Set to true to return the result inline in the response, or false to return a link to an output file.
  9. lineGrouping: Optional string. Enables grouping of lines within table cells.
  10. async: Optional. Runs the conversion asynchronously and returns a JobId that can be used to check the state of the background job.
  11. name: Optional string. The name of the generated output file.
  12. expiration: Optional. The expiration time of the output link.
  13. profiles: Optional string. Sets additional configuration and extra options.
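
The format of the pages parameter described in the list above can be checked locally before a request is sent. The following is a minimal sketch under stated assumptions: the PagesSpec class and its isValid method are hypothetical helpers, not part of the PDF.co API, and the accepted syntax is inferred only from the example "0,2-5,7-" used in this tutorial. The service itself may accept or reject other forms.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of the PDF.co API): checks the syntax of a
// "pages" value locally before it is sent to the endpoint.
final class PagesSpec {
    // One item is a single index ("3"), a closed range ("2-5"),
    // or an open-ended range ("7-"). Digits are capped at 9 per number.
    private static final Pattern ITEM = Pattern.compile("\\d{1,9}(-\\d{0,9})?");

    static boolean isValid(String pages) {
        if (pages == null || pages.trim().isEmpty()) {
            return true; // an empty value means "all pages"
        }
        // The -1 limit keeps trailing empty items, so "1," is rejected.
        for (String item : pages.split(",", -1)) {
            if (!ITEM.matcher(item.trim()).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValid("0,2-5,7-")); // true
        System.out.println(isValid("1,,2"));     // false
    }
}
```

A check like this only catches typing mistakes early; the API remains the authority on which page lists it accepts.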

How to Convert PDF to HTML in Java

The following source code shows how to convert a PDF document to HTML using the PDF to HTML API endpoint. The Java code below submits the sample PDF file to the API, which uses artificial intelligence (AI) to detect data such as fonts, text, and formatting. The service then writes this data as HTML and returns it to the user as an HTML file.

Source Code in Java

//*******************************************************************************************//
//                                                                                           //
// Download Free Evaluation Version From: https://bytescout.com/download/web-installer       //
//                                                                                           //
// Also available as Web API! Get Your Free API Key: https://app.pdf.co/signup               //
//                                                                                           //
// Copyright © 2017-2020 ByteScout, Inc. All rights reserved.                                //
// https://www.bytescout.com                                                                 //
// https://pdf.co                                                                            //
//                                                                                           //
//*******************************************************************************************//

package com.company;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import okhttp3.*;
import java.io.*;
import java.net.*;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Main
{
   // Get your own API Key by registering at https://app.pdf.co
   final static String API_KEY = "***************";
   // Direct URL of source PDF file.
   // You can also upload your own file into PDF.co and use it as url. Check "Upload File" samples for code snippets: https://github.com/bytescout/pdf-co-api-samples/tree/master/File%20Upload/   
   final static String SourceFileUrl = "https://bytescout-com.s3.amazonaws.com/files/demo-files/cloud-api/pdf-to-html/sample.pdf";
   // Comma-separated list of page indices (or ranges) to process. Leave empty for all pages. Example: '0,2-5,7-'.
   final static String Pages = "";
   // PDF document password. Leave empty for unprotected documents.
   final static String Password = "";
   // Destination HTML file name (a plain relative path, so it works on any operating system)
   final static Path DestinationFile = Paths.get("result.html");
   // Set to `true` to get simplified HTML without CSS. Default is the rich HTML keeping the document design.
   final static boolean PlainHtml = false;
   // Set to `true` if your document has the column layout like a newspaper.
   final static boolean ColumnLayout = false;
   public static void main(String[] args) throws IOException
   {
       // Create HTTP client instance
       OkHttpClient webClient = new OkHttpClient();
       // URL of the `PDF To HTML` API endpoint. It contains no characters that need
       // escaping, so it can be passed to OkHttp as a plain string.
       String url = "https://api.pdf.co/v1/pdf/convert/to/html";
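       // Create JSON payload. Gson escapes the values, so quotes or backslashes
       // in a file name or password cannot break the JSON.
       JsonObject payload = new JsonObject();
       payload.addProperty("name", DestinationFile.getFileName().toString());
       payload.addProperty("password", Password);
       payload.addProperty("pages", Pages);
       payload.addProperty("simple", PlainHtml);
       payload.addProperty("columns", ColumnLayout);
       payload.addProperty("url", SourceFileUrl);
       String jsonPayload = payload.toString();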

       // Prepare request body
       RequestBody body = RequestBody.create(MediaType.parse("application/json"), jsonPayload);

       // Prepare request
       Request request = new Request.Builder()
           .url(url)
           .addHeader("x-api-key", API_KEY) // (!) Set API Key
           .addHeader("Content-Type", "application/json")
           .post(body)
           .build();
      
       // Execute request. The try-with-resources block closes the response when done.
       try (Response response = webClient.newCall(request).execute())
       {
           if (response.code() == 200)
           {
               // Parse JSON response
               JsonObject json = new JsonParser().parse(response.body().string()).getAsJsonObject();
               boolean error = json.get("error").getAsBoolean();
               if (!error)
               {
                   // Get URL of generated HTML file
                   String resultFileUrl = json.get("url").getAsString();
                   // Download HTML file
                   downloadFile(webClient, resultFileUrl, DestinationFile.toFile());
                   System.out.printf("Generated HTML file saved as \"%s\" file.%n", DestinationFile.toString());
               }
               else
               {
                   // Display service reported error
                   System.out.println(json.get("message").getAsString());
               }
           }
           else
           {
               // Display request error
               System.out.println(response.code() + " " + response.message());
           }
       }
   }
   public static void downloadFile(OkHttpClient webClient, String url, File destinationFile) throws IOException
   {
       // Prepare request
       Request request = new Request.Builder()
               .url(url)
               .build();
       // Execute request. The response is closed automatically, even if an error occurs.
       try (Response response = webClient.newCall(request).execute())
       {
           // Stop early instead of saving an error page as the result
           if (!response.isSuccessful())
           {
               throw new IOException("Download failed: " + response.code() + " " + response.message());
           }
           byte[] fileBytes = response.body().bytes();
           // Save downloaded bytes to file
           try (OutputStream output = new FileOutputStream(destinationFile))
           {
               output.write(fileBytes);
           }
       }
   }
}
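
The async parameter listed earlier returns a JobId that is used to check the state of the background job. The sample above runs synchronously, so the following is only a rough sketch of how such a check could be repeated until the job finishes. Everything in it is an assumption made for illustration: the JobPoller class, the waitForJob method, and the status strings "working" and "success" are hypothetical, and the status supplier stands in for a real call to the job status endpoint, whose address and response format should be taken from the PDF.co documentation.

```java
import java.util.function.Supplier;

// Hypothetical polling helper; it is not part of the PDF.co API.
final class JobPoller {
    // Asks statusSource for the job status up to maxAttempts times.
    // Returns true once the status is "success" (assumed value) and false
    // on any other final status, on interruption, or when attempts run out.
    static boolean waitForJob(Supplier<String> statusSource, int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            String status = statusSource.get();
            if ("success".equals(status)) {
                return true;
            }
            if (!"working".equals(status)) {
                return false; // any status other than "working" is treated as a failure
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis); // wait before the next check
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // keep the interrupt flag set
                    return false;
                }
            }
        }
        return false; // still "working" after maxAttempts checks
    }

    public static void main(String[] args) {
        // Stand-in supplier: a real one would query the job status by JobId.
        String[] statuses = {"working", "working", "success"};
        int[] index = {0};
        boolean done = waitForJob(() -> statuses[index[0]++], 5, 10L);
        System.out.println(done); // true
    }
}
```

Passing the status lookup in as a Supplier keeps the loop independent of any HTTP client, so the waiting logic can be tried out without network access.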

Source PDF File

The following source PDF file is used for this tutorial: Sample PDF (the file at the SourceFileUrl address in the code above).

Output File

The following is the generated output file in HTML:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8"/>
<title></title>
<style>
.page { background-color:white; position:relative; z-index:0; }
.vector { position:absolute; z-index:1; }
.image { position:absolute; z-index:2; }
.text { position:absolute; z-index:3; opacity:inherit; white-space:nowrap; }
.annotation { position:absolute; z-index:5; }
.control { position:absolute; z-index:10; }
.annotation2 { position:absolute; z-index:7; }
.dummyimg { vertical-align: top; border: none; }
</style>
</head>
<body style="background-color:#999999;color:#000000;">
<div id="canvas" align="center">
<!-- page 1 begin -->
<div class="page" style="width:1024.0px;height:1448.2px;">
<span style="color:#538DD3;font-size:41px;font-family:'Arial';font-weight:bold;">
<span class="text" style="left:61.9px;top:59.2px;">Your Company Name&nbsp;</span>
</span>
<span style="font-size:19px;font-family:'Arial';">
<span class="text" style="left:61.9px;top:132.3px;">Your Address&nbsp;</span>
<span class="text" style="left:61.9px;top:157.3px;">City, State Zip&nbsp;</span>
</span>
<span style="font-size:19px;font-family:'Arial';font-weight:bold;">
<span class="text" style="left:793.0px;top:199.4px;">Invoice No. 123456&nbsp;</span>
<span class="text" style="left:750.9px;top:224.4px;">Invoice Date 01/01/2016&nbsp;</span>
<span class="text" style="left:61.9px;top:266.5px;">Client Name</span>
</span>
<span style="font-size:19px;font-family:'Arial';">
<span class="text" style="left:61.9px;top:291.9px;">Address&nbsp;</span>
<span class="text" style="left:61.9px;top:316.9px;">City, State Zip&nbsp;</span>
<span class="text" style="left:61.9px;top:401.3px;">Notes&nbsp;</span>
</span>
<span style="font-size:19px;font-family:'Arial';font-weight:bold;">
<span class="text" style="left:61.9px;top:544.0px;">Item&nbsp;</span>
<span class="text" style="left:425.9px;top:544.0px;">Quantity&nbsp;</span>
<span class="text" style="left:686.2px;top:544.0px;">Price&nbsp;</span>
<span class="text" style="left:917.0px;top:544.0px;">Total&nbsp;</span>
</span>
<span style="font-size:19px;font-family:'Arial';">
<span class="text" style="left:61.9px;top:587.1px;">Item 1&nbsp;</span>
<span class="text" style="left:492.2px;top:587.1px;">1&nbsp;</span>
<span class="text" style="left:685.2px;top:587.1px;">40.00&nbsp;</span>
<span class="text" style="left:915.0px;top:587.1px;">40.00&nbsp;</span>
<span class="text" style="left:61.9px;top:623.4px;">Item 2&nbsp;</span>
<span class="text" style="left:492.2px;top:623.4px;">2&nbsp;</span>
<span class="text" style="left:685.2px;top:623.4px;">30.00&nbsp;</span>
<span class="text" style="left:915.0px;top:623.4px;">60.00&nbsp;</span>
<span class="text" style="left:61.9px;top:659.8px;">Item 3&nbsp;</span>
<span class="text" style="left:492.2px;top:659.8px;">3&nbsp;</span>
<span class="text" style="left:685.2px;top:659.8px;">20.00&nbsp;</span>
<span class="text" style="left:915.0px;top:659.8px;">60.00&nbsp;</span>
<span class="text" style="left:61.9px;top:696.5px;">Item 4&nbsp;</span>
<span class="text" style="left:492.2px;top:696.5px;">4&nbsp;</span>
<span class="text" style="left:685.2px;top:696.5px;">10.00&nbsp;</span>
<span class="text" style="left:915.0px;top:696.5px;">40.00&nbsp;</span>
</span>
<span style="font-size:19px;font-family:'Arial';font-weight:bold;">
<span class="text" style="left:669.3px;top:732.5px;">TOTAL&nbsp;</span>
<span class="text" style="left:904.5px;top:732.5px;">200.00&nbsp;</span>
</span>
<div class="vector" style="left:52.0px;top:472.0px;"><img width="921" height="292" src=""/></div>
</div>
<!-- page 1 end -->
<p></p>
</div>
</body>
</html>
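
In the output above, every text run sits in its own absolutely positioned span with the class "text". If only the text is needed, it can be pulled out of the generated HTML. The sketch below assumes the exact markup shown in this sample output (one span per line, with a trailing &nbsp; entity). The TextSpanExtractor class is a hypothetical helper, and a regular expression is used only to keep the example free of extra libraries; an HTML parser would be a more robust choice for arbitrary documents.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects the text of <span class="text" ...> elements.
final class TextSpanExtractor {
    // Matches spans such as
    // <span class="text" style="left:61.9px;top:59.2px;">Item 1&nbsp;</span>
    private static final Pattern TEXT_SPAN =
            Pattern.compile("<span class=\"text\"[^>]*>(.*?)</span>");

    static List<String> extract(String html) {
        List<String> result = new ArrayList<>();
        Matcher matcher = TEXT_SPAN.matcher(html);
        while (matcher.find()) {
            // Only the &nbsp; entity is decoded here; other entities stay as-is.
            result.add(matcher.group(1).replace("&nbsp;", " ").trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String html =
                "<span style=\"font-size:19px;font-family:'Arial';\">\n"
              + "<span class=\"text\" style=\"left:61.9px;top:587.1px;\">Item 1&nbsp;</span>\n"
              + "<span class=\"text\" style=\"left:685.2px;top:587.1px;\">40.00&nbsp;</span>\n"
              + "</span>";
        System.out.println(extract(html)); // prints [Item 1, 40.00]
    }
}
```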

In this tutorial, we showed how to use Java to call the PDF.co endpoint that converts a PDF document to raw HTML.