Asynchronous PDF to HTML Conversion: Convert Uploaded PDFs to HTML with Java

Dec 19, 2024·14 Minutes Read

This tutorial and accompanying code walk you through converting a PDF document to HTML asynchronously using PDF.co's Web API and Java. Asynchronous processing is particularly valuable for larger files because the conversion runs in the background instead of blocking the initial request. The HTML output retains the formatting, text, vectors, images, fonts, and overall structure of the source document.

Features of PDF to HTML API Endpoint

The PDF.co Web API provides powerful tools for converting PDF documents or scanned images to HTML with high fidelity. Key features include:

  • Preservation of Formatting and Structure: The API ensures that text, images, vectors, and other elements are retained, delivering a near-perfect representation of the PDF in HTML.
  • Automatic Document Classification: With automatic classification, you can define keyword-based rules to detect the document’s type, ensuring that it’s converted with the right format and layout.
  • High Security: PDF.co prioritizes data security, using encrypted connections to protect any sensitive information transmitted through the API.

This tutorial will show you how to configure and customize asynchronous PDF-to-HTML conversion using Java, enabling efficient and secure processing of even large documents.

Endpoint Parameters

The PDF to HTML endpoint accepts the following parameters:

  1. url: It is a required parameter that provides the URL to the source file. The PDF.co platform supports any publicly accessible URL, including those from Dropbox, Google Drive, and the built-in file storage of PDF.co.
  2. httpusername: It is an optional parameter that provides an HTTP auth user name to access the source URL if required.
  3. httppassword: It is an optional parameter that provides an HTTP auth password to access the source URL if needed.
  4. pages: It is an optional parameter and must be a string. It provides a comma-separated list of the pages to convert. Users can specify a page range with a hyphen ("-") or leave the parameter empty to select all pages.
  5. unwrap: It is an optional parameter that unwraps multi-line text in table cells into a single line. It takes effect when lineGrouping is enabled.
  6. rect: It is an optional parameter and must be a string. It provides the coordinates of the region to extract.
  7. lang: It is an optional parameter that sets the language OCR uses to extract text from scanned JPG, PNG, and PDF inputs.
  8. inline: It is an optional parameter. Set it to true to return the data inline, or to false to return a link to an output file.
  9. lineGrouping: It is an optional parameter and must be a string. It enables grouping within the table cells.
  10. name: It is an optional parameter and must be a string. It provides the name of the generated output file after successful code execution.
  11. expiration: It is an optional parameter that sets the expiration time for the output link.
  12. async: Setting "async":true enables background processing, allowing large or complex conversions to complete without blocking the initial request.
  13. profiles: It is an optional parameter and must be a string. This parameter sets additional configurations and extra options.

How to Convert PDF to HTML Asynchronously in Java

This tutorial explains how to asynchronously convert a PDF document to HTML using the PDF.co API's PDF-to-HTML endpoint. The Java code provided demonstrates how to handle PDF conversion efficiently, with options to customize the process and manage the API response. Once complete, the resulting HTML file preserves the text, images, fonts, and layout of the original document, making integration into Java applications seamless.

Sample Code: PDF to HTML from Uploaded File (Async Mode)

Source PDF File

The following source PDF file is used for this tutorial - Sample PDF.

Output File

The following is the generated output file in HTML:

<!DOCTYPE html PUBLIC " -//W3C//DTD XHTML 1.0 Transitional//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="https://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8"/>
<title></title>
<style>
.page { background-color:white; position:relative; z-index:0; }
.vector { position:absolute; z-index:1; }
.image { position:absolute; z-index:2; }
.text { position:absolute; z-index:3; opacity:inherit; white-space:nowrap; }
.annotation { position:absolute; z-index:5; }
.control { position:absolute; z-index:10; }
.annotation2 { position:absolute; z-index:7; }
.dummyimg { vertical-align: top; border: none; }
</style>
</head>
<body style="background-color:#999999;color:#000000;">
<div id="canvas" align="center">
<!-- page 1 begin -->
<div class="page" style="width:1024.0px;height:1448.2px;">
<span style="color:#538DD3;font-size:41px;font-family:'Arial';font-weight:bold;">
<span class="text" style="left:61.9px;top:59.2px;">Your Company Name&nbsp;</span>
</span>
<span style="font-size:19px;font-family:'Arial';">
<span class="text" style="left:61.9px;top:132.3px;">Your Address&nbsp;</span>
<span class="text" style="left:61.9px;top:157.3px;">City, State Zip&nbsp;</span>
</span>
<span style="font-size:19px;font-family:'Arial';font-weight:bold;">
<span class="text" style="left:793.0px;top:199.4px;">Invoice No. 123456&nbsp;</span>
<span class="text" style="left:750.9px;top:224.4px;">Invoice Date 01/01/2016&nbsp;</span>
<span class="text" style="left:61.9px;top:266.5px;">Client Name</span>
</span>
<span style="font-size:19px;font-family:'Arial';">
<span class="text" style="left:61.9px;top:291.9px;">Address&nbsp;</span>
<span class="text" style="left:61.9px;top:316.9px;">City, State Zip&nbsp;</span>
<span class="text" style="left:61.9px;top:401.3px;">Notes&nbsp;</span>
</span>
<span style="font-size:19px;font-family:'Arial';font-weight:bold;">
<span class="text" style="left:61.9px;top:544.0px;">Item&nbsp;</span>
<span class="text" style="left:425.9px;top:544.0px;">Quantity&nbsp;</span>
<span class="text" style="left:686.2px;top:544.0px;">Price&nbsp;</span>
<span class="text" style="left:917.0px;top:544.0px;">Total&nbsp;</span>
</span>
<span style="font-size:19px;font-family:'Arial';">
<span class="text" style="left:61.9px;top:587.1px;">Item 1&nbsp;</span>
<span class="text" style="left:492.2px;top:587.1px;">1&nbsp;</span>
<span class="text" style="left:685.2px;top:587.1px;">40.00&nbsp;</span>
<span class="text" style="left:915.0px;top:587.1px;">40.00&nbsp;</span>
<span class="text" style="left:61.9px;top:623.4px;">Item 2&nbsp;</span>
<span class="text" style="left:492.2px;top:623.4px;">2&nbsp;</span>
<span class="text" style="left:685.2px;top:623.4px;">30.00&nbsp;</span>
<span class="text" style="left:915.0px;top:623.4px;">60.00&nbsp;</span>
<span class="text" style="left:61.9px;top:659.8px;">Item 3&nbsp;</span>
<span class="text" style="left:492.2px;top:659.8px;">3&nbsp;</span>
<span class="text" style="left:685.2px;top:659.8px;">20.00&nbsp;</span>
<span class="text" style="left:915.0px;top:659.8px;">60.00&nbsp;</span>
<span class="text" style="left:61.9px;top:696.5px;">Item 4&nbsp;</span>
<span class="text" style="left:492.2px;top:696.5px;">4&nbsp;</span>
<span class="text" style="left:685.2px;top:696.5px;">10.00&nbsp;</span>
<span class="text" style="left:915.0px;top:696.5px;">40.00&nbsp;</span>
</span>
<span style="font-size:19px;font-family:'Arial';font-weight:bold;">
<span class="text" style="left:669.3px;top:732.5px;">TOTAL&nbsp;</span>
<span class="text" style="left:904.5px;top:732.5px;">200.00&nbsp;</span>
</span>
<div class="vector" style="left:52.0px;top:472.0px;"><img width="921" height="292" src=""/></div>
</div>
<!-- page 1 end -->
<p></p>
</div>
</body>
</html>


Step-by-Step Guide

Here’s a step-by-step guide to convert a PDF to HTML asynchronously in Java using the provided code:

Importing Necessary Libraries and Setting Up the Environment

You can use the IDE of your choice, whether it's VS Code, IntelliJ, or another environment that you're comfortable with.

To interact with the PDF.co API and handle the HTTP requests, you need several libraries:

  • OkHttpClient: To send HTTP requests to the API endpoints.
  • JsonObject and JsonParser: To parse the JSON responses from the API.
  • Standard Java libraries: For file handling, URL creation, and exception handling.

These libraries allow your Java application to send requests to the PDF.co API, process the responses, and handle files like the resulting HTML.

Defining Constants and Initializing Variables

Next, define the key variables that will be used throughout the program:

  • API_KEY: This is your unique PDF.co API Key, which is required to authenticate API calls. You can get your API key by logging into your PDF.co account and copying it from your dashboard. Once you have the key, paste it into the code where it says API_KEY.
  • SourceFile: The local file path to the PDF document that needs to be converted. Here, you specify the location of the PDF file on your computer that you wish to convert to HTML. For example, if the file is in the same directory as the program, you can provide a relative path like .\\sample.pdf.
  • Pages: Specify the pages you want to convert (leave empty for all pages).
  • Password: If the PDF is password-protected, provide the password.
  • DestinationFile: The local path where the converted HTML file will be saved.
  • PlainHtml and ColumnLayout: These flags control the output format (simplified HTML and column layout options).
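
As a rough illustration, the variables listed above could be declared as constants in one place. This is only a sketch: the class name ConversionSettings, the file names, and the default flag values are assumptions made for the example, and the API key shown is a placeholder.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical settings holder; the names mirror the variables described above.
class ConversionSettings {
    // Placeholder only: replace with the key copied from your PDF.co dashboard.
    static final String API_KEY = "YOUR_PDFCO_API_KEY";
    // Local PDF to convert, relative to the working directory (assumed name).
    static final Path SOURCE_FILE = Paths.get("sample.pdf");
    // An empty string means "convert all pages".
    static final String PAGES = "";
    // An empty string means the PDF is not password-protected.
    static final String PASSWORD = "";
    // Where the converted HTML will be saved (assumed name).
    static final Path DESTINATION_FILE = Paths.get("result.html");
    // Output-format flags; false for both is an assumed default.
    static final boolean PLAIN_HTML = false;
    static final boolean COLUMN_LAYOUT = false;

    public static void main(String[] args) {
        System.out.println("Converting " + SOURCE_FILE + " to " + DESTINATION_FILE);
    }
}
```

Keeping the key in a constant is convenient for a tutorial; in a real application you would more likely read it from an environment variable or a configuration file.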

Creating the JSON Payload for the Request

The code then constructs the JSON payload with parameters required by the API, which include:

  • Destination file name.
  • Password and Pages.
  • PlainHtml and ColumnLayout.
  • SourceFile URL: The pre-signed URL of the uploaded PDF file.
  • Async: Set to true so the conversion runs as a background job.

This payload is used to send a POST request to the PDF.co API endpoint, initiating the conversion process.
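
A minimal sketch of building that payload by hand, without a JSON library, might look like the following. The class PayloadSketch and its methods are hypothetical. The keys name, password, pages, url, and async follow the parameter list earlier in this tutorial; the keys "simple" and "columns" used here for the PlainHtml and ColumnLayout flags are assumptions that you should check against the PDF.co API reference.

```java
// Hypothetical helper that assembles the request body as a JSON string.
class PayloadSketch {
    // Escapes characters that must not appear raw inside a JSON string value.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    // "simple" and "columns" are assumed key names for PlainHtml and ColumnLayout.
    static String buildPayload(String name, String password, String pages,
                               boolean plainHtml, boolean columnLayout, String sourceUrl) {
        return "{"
            + "\"name\":\"" + escape(name) + "\","
            + "\"password\":\"" + escape(password) + "\","
            + "\"pages\":\"" + escape(pages) + "\","
            + "\"simple\":" + plainHtml + ","
            + "\"columns\":" + columnLayout + ","
            + "\"url\":\"" + escape(sourceUrl) + "\","
            + "\"async\":true"   // request background processing
            + "}";
    }

    public static void main(String[] args) {
        // The URL below is a stand-in for the pre-signed URL of the uploaded file.
        System.out.println(buildPayload("result.html", "", "", false, false,
                "https://example.com/uploaded/sample.pdf"));
    }
}
```

Escaping matters mainly for the password and file name, which may contain quotes or backslashes. With a JSON library such as Gson, which the tutorial's code already uses for parsing, the escaping is handled for you.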

Sending the API Request to Convert PDF to HTML

The OkHttpClient is used to send an HTTP POST request to the PDF.co API to convert the PDF file to HTML.
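
The tutorial's code uses OkHttpClient. To keep the example free of external dependencies, the sketch below swaps in the JDK's built-in java.net.http.HttpClient (Java 11 or later), so it is not the original code. The endpoint URL and the x-api-key header name are assumptions that should be confirmed against the PDF.co API reference, and the class ConvertRequestSketch is hypothetical.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical wrapper around the conversion request.
class ConvertRequestSketch {
    // Assumed endpoint for PDF-to-HTML conversion; verify before use.
    static final String ENDPOINT = "https://api.pdf.co/v1/pdf/convert/to/html";

    // Builds the POST request without sending it, which keeps it easy to inspect.
    static HttpRequest buildRequest(String apiKey, String jsonPayload) {
        return HttpRequest.newBuilder(URI.create(ENDPOINT))
                .header("x-api-key", apiKey)                  // assumed auth header name
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonPayload))
                .build();
    }

    // Sends the request and returns the raw JSON response body.
    static String send(HttpRequest request) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException("Unexpected HTTP status: " + response.statusCode());
        }
        return response.body();
    }
}
```

Separating request construction from sending is a design choice for this sketch: the request can be checked in isolation, and only send touches the network.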

Polling the Job Status Asynchronously

Once the conversion request is sent, the API responds with a jobId that can be used to track the conversion status. The code enters a loop to check the status of the job every 3 seconds:

  1. It sends a status check request to the job status endpoint (/v1/job/check).
  2. It checks if the job is complete by comparing the returned status to "success".
  3. If not completed, it waits for 3 seconds before re-checking.
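
The loop above can be sketched as follows. To keep the example runnable without network access, the status check is passed in as a Supplier; in real use it would call the job status endpoint with the jobId. The class JobPollerSketch, the attempt limit, and the regex-based field reader are assumptions for illustration, as is the "working" status for a job still in progress (the tutorial itself only names "success").

```java
import java.util.function.Supplier;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical polling helper.
class JobPollerSketch {
    // Very simple reader for a top-level string field; a JSON library is more robust.
    static String readStringField(String json, String field) {
        Matcher m = Pattern.compile("\"" + Pattern.quote(field) + "\"\\s*:\\s*\"([^\"]*)\"")
                .matcher(json);
        return m.find() ? m.group(1) : null;
    }

    // Calls statusCheck until it reports "success", pausing between attempts.
    static String waitForJob(Supplier<String> statusCheck, long intervalMillis, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            String status = statusCheck.get();
            if ("success".equals(status)) {
                return status;
            }
            // "working" is assumed to mean "still running"; anything else is treated as failure.
            if (!"working".equals(status)) {
                throw new IllegalStateException("Job ended with status: " + status);
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for the job", e);
            }
        }
        throw new IllegalStateException("Job did not finish after " + maxAttempts + " checks");
    }
}
```

With the 3-second interval described above, a call would look like waitForJob(check, 3000, 100), where check is your own function that queries the status endpoint and extracts the status field. The attempt limit keeps a stuck job from blocking the program indefinitely.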

Downloading the Resulting HTML File

Once the job is marked as complete, the resulting HTML file’s URL is returned by the API. The code proceeds to download this file and save it to the local file system.
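
One way to sketch the download step with only the JDK is shown below. The class DownloadSketch and its method names are hypothetical, and the result URL is whatever the API returned for the finished job.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for saving the converted HTML locally.
class DownloadSketch {
    // Copies the stream to the destination, overwriting any existing file.
    // Returns the number of bytes written and closes the stream afterwards.
    static long save(InputStream in, Path destination) {
        try (InputStream body = in) {
            return Files.copy(body, destination, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Opens the result URL and streams its content straight to disk.
    static long download(String resultUrl, Path destination) {
        try {
            return save(URI.create(resultUrl).toURL().openStream(), destination);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Streaming with Files.copy avoids loading the whole HTML file into memory, which helps when the converted document is large.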

Conclusion

This approach converts PDFs to HTML asynchronously by:

  1. Initiating the conversion request.
  2. Polling the job status periodically.
  3. Downloading the result once the conversion is complete.

By using asynchronous processing, the application avoids timeouts and can continue other operations while waiting for the conversion to finish, which is especially useful for large PDFs or lengthy conversions. For more details on asynchronous processing, refer to the PDF.co documentation.
