Extracting Multipage PDF Table to CSV in Java Using Asynchronous Processing

Dec 19, 2024·8 Minutes Read

In this tutorial, you'll learn how to extract multi-page PDF tables and convert them to CSV in Java using asynchronous processing with the PDF.co Web API. Asynchronous mode helps avoid request timeouts and lets your application do other work while the extraction runs, which makes it well suited to large PDFs or lengthy conversions.

The guide demonstrates using the Document Parser API to extract data from borderless tables efficiently. The provided Java code shows how to implement this, and the same approach can be used to parse tables, barcodes, and fields from PDFs or image files like JPG and PNG, including invoices and orders. This tutorial helps automate data extraction from complex PDFs with optimal performance.

IN THIS TUTORIAL

Features of Document Parser API Endpoint

Benefits of Asynchronous Processing for PDF to CSV Conversion

Endpoint Parameters to Extract PDF Table to CSV in Java

How to Use Java for Asynchronous PDF to CSV Conversion

Step-by-Step Guide to Extract PDF Form Data to CSV Using Document Parser in Asynchronous Mode

Features of Document Parser API Endpoint

The PDF.co Web API offers tools to convert PDF documents or scanned images into formats such as CSV, XML, and JSON. The API uses automatic classification techniques to handle incoming documents effectively. Users can take advantage of the document classifier endpoint to automatically detect the document type based on custom rules or AI-based classification.

Key features include:

Custom Templates: Use predefined or custom templates to extract data fields such as invoice IDs, dates, taxes, and totals from invoices in English.
Secure Processing: PDF.co ensures secure handling of sensitive user data through encrypted connections. For more details on security protocols, visit the PDF.co documentation.
Flexible Formats: Extract data into structured formats like CSV, which are ideal for further processing and analysis.

Benefits of Asynchronous Processing for PDF to CSV Conversion

By using asynchronous processing, applications can:

Avoid Timeouts: For large or complex PDFs, asynchronous processing ensures the operation completes without hitting server or client time limits.
Optimize Performance: While waiting for the conversion, the application can continue handling other tasks, improving overall efficiency.
Improve Reliability: Asynchronous processing minimizes the risks of interruptions during lengthy conversions, resulting in a smoother workflow.

This setup ensures an efficient and seamless PDF-to-CSV conversion process. For more details on asynchronous processing, refer to the PDF.co documentation.

Endpoint Parameters to Extract PDF Table to CSV in Java

The Document Parser API endpoint accepts the following parameters for converting a PDF to CSV:

url: It is a required parameter that provides the URL to the source file. The PDF.co platform supports any publicly accessible URL, including those from Dropbox, Google Drive, and the built-in file storage of PDF.co.
httpusername: It is an optional parameter and provides an HTTP auth user name to access the source URL if required.
httppassword: It is an optional parameter and provides an HTTP auth password to access the source URL if needed.
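templateId: It sets the ID of the document parser template to use. It is required unless the template code is passed directly through the template parameter, as the sample code in this tutorial does.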
template: It is an optional parameter. The users can provide the document parser template code using this parameter directly.
inline: It is an optional parameter. The users can set it to true to return data as inline or false to return the link to an output file.
outputFormat: It is an optional parameter. The user can set this parameter to generate the output in any required format, including CSV, XML, or JSON.
password: It is an optional parameter that must be a string. It provides the password for the PDF file if required.
async: Setting "async": true enables background processing, allowing large or complex conversions to complete without blocking the initial request.
name: It is an optional parameter and must be a string. It provides the name of the generated output file after successful code execution.
expiration: It is an optional parameter that offers the expiration time for the output link.
profiles: It is an optional parameter and must be a string. This parameter helps in setting additional configurations and extra options.
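As a rough illustration of how these parameters fit together, the sketch below assembles a request body from url, template, outputFormat, async, and an optional password. The class and method names (ParserPayload, build, toJson, quote) are hypothetical, and the hand-written JSON serializer is only there to keep the example free of dependencies; in a real project a JSON library such as Gson is the safer choice.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds a Document Parser request body using only the JDK.
class ParserPayload {
    // Escape a string so it can be embedded as a JSON string value.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    // Serialize a flat map of String, Boolean, or Number values as a JSON object.
    static String toJson(Map<String, Object> fields) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, Object> e : fields.entrySet()) {
            if (!first) sb.append(',');
            first = false;
            sb.append(quote(e.getKey())).append(':');
            Object v = e.getValue();
            sb.append(v instanceof String ? quote((String) v) : String.valueOf(v));
        }
        return sb.append('}').toString();
    }

    // Assemble the parameters described above; the password is added only when set.
    static String build(String url, String template, String outputFormat, String password) {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("url", url);
        fields.put("template", template);
        fields.put("outputFormat", outputFormat);
        fields.put("async", true);
        if (password != null && !password.isEmpty()) {
            fields.put("password", password);
        }
        return toJson(fields);
    }

    public static void main(String[] args) {
        // The URL and template text here are placeholders, not real resources.
        System.out.println(build("https://example.com/sample.pdf", "{\"templateName\":\"T\"}", "CSV", ""));
    }
}
```

Note that the template is sent as a JSON string, so its own quotes must be escaped inside the request body.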

How to Use Java for Asynchronous PDF to CSV Conversion

The following source code demonstrates how to extract multipage tables from a PDF and convert them into a CSV file in Java using asynchronous processing with the Document Parser API. A custom template is applied to parse a sample PDF, and the application retrieves the CSV file once the asynchronous conversion completes.

Template JSON Code

Following is the template JSON code for PDF parsing:

{
  "templateName": "Multipage Table Test",
  "templateVersion": 4,
  "templatePriority": 0,
  "detectionRules": {
    "keywords": [
      "Sample document with multi-page table"
    ]
  },
  "objects": [
    {
      "name": "total",
      "objectType": "field",
      "fieldProperties": {
        "fieldType": "macros",
        "expression": "TOTAL{{Spaces}}({{Number}})",
        "regex": true,
        "dataType": "decimal"
      }
    },
    {
      "name": "table1",
      "objectType": "table",
      "tableProperties": {
        "start": {
          "expression": "Item{{Spaces}}Description{{Spaces}}Price",
          "regex": true
        },
        "end": {
          "expression": "TOTAL{{Spaces}}{{Number}}",
          "regex": true
        },
        "row": {
          "expression": "{{LineStart}}{{Spaces}}(?<itemNo>{{Digits}}){{Spaces}}(?<description>{{SentenceWithSingleSpaces}}){{Spaces}}(?<price>{{Number}}){{Spaces}}(?<qty>{{Digits}}){{Spaces}}(?<extPrice>{{Number}})",
          "regex": true
        },
        "columns": [
          {
            "name": "itemNo",
            "dataType": "integer"
          },
          {
            "name": "description",
            "dataType": "string"
          },
          {
            "name": "price",
            "dataType": "decimal"
          },
          {
            "name": "qty",
            "dataType": "integer"
          },
          {
            "name": "extPrice",
            "dataType": "decimal"
          }
        ],
        "multipage": true
      }
    }
  ]
}
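To get a feel for what the row expression in this template is meant to capture, here is a plain Java regex that approximates it. The macro expansions are assumptions made for illustration only: {{Digits}} is approximated by \d+, {{Number}} by \d+(?:\.\d+)?, {{Spaces}} by whitespace, and {{SentenceWithSingleSpaces}} by words separated by single spaces. The actual definitions used by the Document Parser may differ, and the class name RowPatternSketch is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical approximation of the template's "row" expression with assumed macro expansions.
class RowPatternSketch {
    static final Pattern ROW = Pattern.compile(
            "^\\s*(?<itemNo>\\d+)\\s+(?<description>\\S+(?: \\S+)*)\\s+"
          + "(?<price>\\d+(?:\\.\\d+)?)\\s+(?<qty>\\d+)\\s+(?<extPrice>\\d+(?:\\.\\d+)?)\\s*$");

    // Returns the five column values in template order, or null if the line is not a table row.
    static String[] parseRow(String line) {
        Matcher m = ROW.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] {
            m.group("itemNo"), m.group("description"), m.group("price"),
            m.group("qty"), m.group("extPrice")
        };
    }

    public static void main(String[] args) {
        // A made-up text line shaped like a row of the sample table.
        String[] row = parseRow("  1    Item 1        10.00    1    10.00");
        System.out.println(String.join(",", row));
        // A line such as the table's TOTAL line does not match the row pattern.
        System.out.println(parseRow("TOTAL   450.00") == null);
    }
}
```

The named groups correspond to the entries in the template's columns array, which is how each captured value gets its column name and data type.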

Sample Code in Java

Following is the sample code to parse the PDF file using the above template:

import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import com.google.gson.JsonPrimitive;
import okhttp3.*;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;


public class Main {
    // Get your own API Key by registering at https://app.pdf.co
    final static String API_KEY = "************************";


    public static void main(String[] args) throws IOException, InterruptedException {
        // Source PDF file
        final String SourceFileUrl = "https://bytescout-com.s3.amazonaws.com/files/demo-files/cloud-api/document-parser/MultiPageTable.pdf";
        // PDF document password. Leave empty for unprotected documents.
        final String Password = "";
        // Destination CSV file name
        final Path DestinationFile = Paths.get("result.csv");
        final String outputFormat = "CSV";
        // Template text
        String templateText = new String(Files.readAllBytes(Paths.get("MultiPageTable-template1.json")), StandardCharsets.UTF_8);


        // Create HTTP client instance
        OkHttpClient webClient = new OkHttpClient();


        // Parse the uploaded PDF document
        ParseDocumentAsync(webClient, API_KEY, DestinationFile, Password, SourceFileUrl, templateText, outputFormat);
    }


    public static void ParseDocumentAsync(OkHttpClient webClient, String apiKey, Path destinationFile,
                                          String password, String uploadedFileUrl, String templateText, String outputFormat) throws IOException, InterruptedException {


        // Prepare POST request body in JSON format
        JsonObject jsonBody = new JsonObject();
        jsonBody.add("url", new JsonPrimitive(uploadedFileUrl));
        jsonBody.add("template", new JsonPrimitive(templateText));
        jsonBody.add("outputFormat", new JsonPrimitive(outputFormat));
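        jsonBody.add("async", new JsonPrimitive(true)); // Enable asynchronous processing
        // Pass the PDF password only when one is set
        if (password != null && !password.isEmpty()) {
            jsonBody.add("password", new JsonPrimitive(password));
        }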


        RequestBody body = RequestBody.create(MediaType.parse("application/json"), jsonBody.toString());


        // Prepare request to `Document Parser` API
        Request request = new Request.Builder()
                .url("https://api.pdf.co/v1/pdf/documentparser")
                .addHeader("x-api-key", apiKey)
                .addHeader("Content-Type", "application/json")
                .post(body)
                .build();


        // Execute request
        Response response = webClient.newCall(request).execute();
        if (response.code() == 200) {
            JsonObject json = new JsonParser().parse(response.body().string()).getAsJsonObject();
            boolean error = json.get("error").getAsBoolean();
            if (!error) {
                String jobId = json.get("jobId").getAsString();
                System.out.println("Job#" + jobId + ": has been created.");


                // Check job status in a loop
                while (true) {
                    String status = CheckJobStatus(webClient, apiKey, jobId);
                    DateTimeFormatter dtf = DateTimeFormatter.ofPattern("MM/dd/yyyy HH:mm:ss");
                    System.out.println("Job#" + jobId + ": " + status + " - " + dtf.format(LocalDateTime.now()));


                    if (status.equalsIgnoreCase("success")) {
                        String resultFileUrl = json.get("url").getAsString();
                        // Download the file
                        downloadFile(webClient, resultFileUrl, destinationFile.toFile());
                        System.out.printf("Generated file saved to \"%s\" file.%n", destinationFile.toString());
                        break;
                    } else if (status.equalsIgnoreCase("working")) {
                        Thread.sleep(3000); // Pause for a few seconds before retrying
                    } else {
                        System.out.println("Job failed with status: " + status);
                        break;
                    }
                }
            } else {
                System.out.println(json.get("message").getAsString());
            }
        } else {
            System.out.println(response.code() + " " + response.message());
        }
    }


    // Check Job Status
    private static String CheckJobStatus(OkHttpClient webClient, String apiKey, String jobId) throws IOException {
        String url = "https://api.pdf.co/v1/job/check?jobid=" + jobId;


        Request request = new Request.Builder()
                .url(url)
                .addHeader("x-api-key", apiKey)
                .build();


        Response response = webClient.newCall(request).execute();
        if (response.code() == 200) {
            JsonObject json = new JsonParser().parse(response.body().string()).getAsJsonObject();
            return json.get("status").getAsString();
        } else {
            System.out.println(response.code() + " " + response.message());
        }
        return "Failed";
    }


    public static void downloadFile(OkHttpClient webClient, String url, File destinationFile) throws IOException {
        Request request = new Request.Builder()
                .url(url)
                .build();


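        // Close the response automatically and fail fast on HTTP errors
        try (Response response = webClient.newCall(request).execute()) {
            if (!response.isSuccessful()) {
                throw new IOException("Download failed: " + response.code() + " " + response.message());
            }
            byte[] fileBytes = response.body().bytes();
            // Write the downloaded bytes to the destination file
            try (OutputStream output = new FileOutputStream(destinationFile)) {
                output.write(fileBytes);
            }
        }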
    }
}

Sample PDF with Multi-Page Table for PDF to CSV Parsing

The source PDF used for parsing is the MultiPageTable.pdf sample referenced by SourceFileUrl in the code above.

Output CSV File

Below is the CSV output of the above code and the source file:

total,tableNames,tables
450.00,table1,"itemNo,description,price,qty,extPrice
1,Item 1,10.00,1,10.00
2,Item 2,10.00,1,10.00
3,Item 3,10.00,1,10.00
4,Item 4,10.00,1,10.00
5,Item 5,10.00,1,10.00
6,Item 6,10.00,1,10.00
7,Item 7,10.00,1,10.00
8,Item 8,10.00,1,10.00
9,Item 9,10.00,1,10.00
10,Item 10,10.00,1,10.00
11,Item 11,10.00,1,10.00
12,Item 12,10.00,1,10.00
13,Item 13,10.00,1,10.00
14,Item 14,10.00,1,10.00
15,Item 15,10.00,1,10.00
16,Item 16,10.00,1,10.00
17,Item 17,10.00,1,10.00
18,Item 18,10.00,1,10.00
19,Item 19,10.00,1,10.00
20,Item 20,10.00,1,10.00
21,Item 21,10.00,1,10.00
22,Item 22,10.00,1,10.00
23,Item 23,10.00,1,10.00
24,Item 24,10.00,1,10.00
25,Item 25,10.00,1,10.00
26,Item 26,10.00,1,10.00
27,Item 27,10.00,1,10.00
28,Item 28,10.00,1,10.00
29,Item 29,10.00,1,10.00
30,Item 30,10.00,1,10.00
31,Item 31,10.00,1,10.00
32,Item 32,10.00,1,10.00
33,Item 33,10.00,1,10.00
34,Item 34,10.00,1,10.00
35,Item 35,10.00,1,10.00
36,Item 36,10.00,1,10.00
37,Item 37,10.00,1,10.00
38,Item 38,10.00,1,10.00
39,Item 39,10.00,1,10.00
40,Item 40,10.00,1,10.00
41,Item 41,10.00,1,10.00
42,Item 42,10.00,1,10.00
43,Item 43,10.00,1,10.00
44,Item 44,10.00,1,10.00
45,Item 45,10.00,1,10.00
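The output above nests the extracted table inside the tables column, so a line-by-line split of the file would break it apart. The sketch below shows one way to read it, assuming the tables value is a double-quoted field with embedded line breaks, as the output suggests. The class name NestedCsvSketch and the abbreviated sample data are made up for illustration; a dedicated CSV library is preferable for production use.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader for CSV whose fields may be quoted and span several lines.
class NestedCsvSketch {
    // Split CSV text into records, honoring quoted fields with commas, line breaks, and "" escapes.
    static List<List<String>> parse(String text) {
        List<List<String>> records = new ArrayList<>();
        List<String> record = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < text.length() && text.charAt(i + 1) == '"') {
                        field.append('"'); // escaped quote inside a quoted field
                        i++;
                    } else {
                        inQuotes = false;  // closing quote
                    }
                } else {
                    field.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                record.add(field.toString());
                field.setLength(0);
            } else if (c == '\n') {
                record.add(field.toString());
                field.setLength(0);
                records.add(record);
                record = new ArrayList<>();
            } else if (c != '\r') {
                field.append(c);
            }
        }
        // Flush the last record when the text does not end with a line break.
        if (field.length() > 0 || !record.isEmpty()) {
            record.add(field.toString());
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        // Abbreviated, made-up sample in the same shape as the output above.
        String csv = "total,tableNames,tables\n"
                   + "20.00,table1,\"itemNo,description,price,qty,extPrice\n"
                   + "1,Item 1,10.00,1,10.00\n"
                   + "2,Item 2,10.00,1,10.00\"\n";
        List<List<String>> outer = parse(csv);
        // The third column of the data record holds the nested table as CSV text.
        List<List<String>> rows = parse(outer.get(1).get(2));
        for (List<String> row : rows) {
            System.out.println(row);
        }
    }
}
```

Parsing happens in two passes: the first recovers the total, tableNames, and tables fields, and the second treats the tables field as a CSV document of its own.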

Step-by-Step Guide to Extract PDF Form Data to CSV Using Document Parser in Asynchronous Mode

This guide walks you through converting a complete PDF form into a CSV file using the PDF.co Document Parser API with asynchronous processing.

Steps Overview

Import Required Libraries and Set Up

You can use the IDE of your choice, such as VS Code or IntelliJ. Start by importing the packages and libraries needed to handle the API request. In the sample above, these are OkHttp for HTTP requests, Gson for JSON parsing, and the standard java.io and java.nio classes for file I/O. Ensure the source file is accessible via a URL, or upload it first if needed.

Set Up the API Key

Obtain your API key by signing up or logging into your PDF.co account. This key authenticates your requests to the API. Initialize the API key in your code.

Prepare the API Request Payload

Customize the payload with the following:

Source File URL: The location of your PDF form.
File Password (if applicable): Add if the PDF is password-protected.
Template: A JSON document that describes what to extract, including detection keywords, objects, and expressions. You can use the built-in templates provided by the API or customize them to get the required output.
Output Format: Set this to "CSV" for the desired output.

Enable Asynchronous Processing

Set the "async" parameter in your request body to true. The API then processes the document in the background and responds to the initial request right away, so you can poll for the result instead of holding the connection open until processing finishes.

Send the API Request

Use an HTTP POST request to send the payload to the Document Parser endpoint. Ensure the payload is JSON-formatted with all necessary parameters.
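For readers who prefer not to add OkHttp, the same POST request can be built with the java.net.http client that ships with JDK 11 and later. This is a substitute for the client used in the sample code, not the tutorial's own approach. The endpoint URL and headers are taken from the sample above, while the class name ParserRequestSketch and the placeholder key and payload are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper: builds the Document Parser POST request with the JDK HTTP client API.
class ParserRequestSketch {
    static HttpRequest build(String apiKey, String jsonPayload) {
        return HttpRequest.newBuilder()
                .uri(URI.create("https://api.pdf.co/v1/pdf/documentparser"))
                .header("x-api-key", apiKey)                 // authenticates the request
                .header("Content-Type", "application/json")  // payload is JSON
                .POST(HttpRequest.BodyPublishers.ofString(jsonPayload))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder key and payload; nothing is sent over the network here.
        HttpRequest request = build("YOUR_API_KEY", "{\"async\":true}");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

To actually send it, pass the request to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()) and read the JSON from the response body.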

Poll for Job Status

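The API will return a jobId and a statusUrl to track the processing status. Use these details to periodically check whether the job is complete. The sample code above sends the jobId to the /v1/job/check endpoint every three seconds until the status is no longer "working".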
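The polling step can be sketched as a small helper that is independent of any HTTP client. It assumes, as the sample code does, that the status stays "working" while the job is in progress. The class name JobPollerSketch, the attempt limit, and the "timeout" and "interrupted" return values are hypothetical additions that keep the loop from running forever.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical polling helper with a bounded number of attempts.
class JobPollerSketch {
    // Calls statusSource until the status is no longer "working" or maxAttempts is reached.
    static String pollUntilDone(Supplier<String> statusSource, long intervalMillis, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            String status = statusSource.get();
            if (!"working".equalsIgnoreCase(status)) {
                return status; // finished: "success" or some failure status
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(intervalMillis); // wait before the next check
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    return "interrupted";
                }
            }
        }
        return "timeout"; // still "working" after the last attempt
    }

    public static void main(String[] args) {
        // Simulated status sequence standing in for calls to the job check endpoint.
        Iterator<String> fake = List.of("working", "working", "success").iterator();
        System.out.println(pollUntilDone(fake::next, 10, 10));
    }
}
```

In the tutorial's program, the supplier would wrap the call that checks the job status, and the interval would be a few seconds rather than the short delay used in this simulation.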

Retrieve and Save the Results

Once the job status is "success", download the resulting CSV file from the result URL provided in the response (the sample code reads it from the url field). Save the file locally, for example, as result.csv.