Extract Data from PDF According to the Column using Java

This tutorial and its sample source code show how to extract data from specific columns of a PDF document using the PDF to CSV functionality of the PDF.co Web API. The sample is written in Java, and the column coordinates come from the PDF.co editor, which reports the coordinates of any region in a PDF document. Scanned images can also be converted to CSV according to their layouts, tables, rows, and columns.

Features of PDF to CSV API Endpoint

The PDF.co Web API provides tools to convert any PDF document or scanned image to CSV. The conversion preserves the formatting, text, vectors, and images of the source. The API also supports automatic classification of incoming documents: users can call the document classifier endpoint to sort documents and detect each document's class from keyword-based rules.

Another reason to use the PDF.co Web API for sensitive information is that it provides a highly secure platform for its users. The platform transmits user data over encrypted connections. PDF.co documents its security protocols in detail on its website.

Endpoint Parameters

Following are the PDF to CSV endpoint parameters:

  1. url: It is a required parameter that provides the URL to the source file. The PDF.co platform supports URLs from Dropbox, Google Drive, and built-in file storage of PDF.co.
  2. httpusername: It is an optional parameter that provides an HTTP auth user name to access the source URL if required.
  3. httppassword: It is an optional parameter that provides an HTTP auth password to access the source URL if needed.
  4. pages: It is an optional parameter and must be a string. The parameter provides a comma-separated list of the pages required. The users can set a page range by using a hyphen (-), for example 2-5. Leaving the parameter empty selects all the pages.
  5. unwrap: It is an optional parameter that unwraps the lines in table cells and joins them into a single line. It takes effect when lineGrouping is enabled.
  6. rect: It is an optional parameter and must be a string. It provides the specific data coordinates for extraction.
  7. lang: It is an optional parameter that helps set the language for OCR to use for scanned JPG, PNG, and PDF document inputs to extract text from them.
  8. inline: It is an optional parameter. The users can set it to true to return data as inline or false to return the link to an output file.
  9. lineGrouping: It is an optional parameter and must be a string. It enables grouping within the table cells.
  10. async: It is an optional parameter that helps run the processes asynchronously. It returns the JobId to check the state of the background job.
  11. name: It is an optional parameter and must be a string. It provides the name of the generated output file after successful code execution.
  12. expiration: It is an optional parameter that offers the expiration time for the output link.
  13. profiles: It is an optional parameter and must be a string. This parameter helps in setting additional configurations and extra options.
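
To make the pages syntax concrete, here is a minimal local sketch of how a string such as "0,2-5,7-" could be expanded into page indices. The class PageRangeSketch and its expand method are hypothetical helpers, not part of the PDF.co API. The sketch assumes zero-based indices and treats an open range such as "7-" as running to the last page, which follows the comment in the sample code below; the service's own handling may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the PDF.co API: expands a pages string locally.
final class PageRangeSketch {
    // Expands a string such as "0,2-5,7-" into zero-based page indices (assumed convention).
    // An empty string selects every page; an open range such as "7-" runs to the last page.
    static List<Integer> expand(String pages, int pageCount) {
        List<Integer> result = new ArrayList<>();
        String spec = pages.trim();
        if (spec.isEmpty()) {
            for (int i = 0; i < pageCount; i++) {
                result.add(i);
            }
            return result;
        }
        for (String part : spec.split(",")) {
            String token = part.trim();
            int dash = token.indexOf('-');
            int first;
            int last;
            if (dash < 0) {
                // Single page, e.g. "0"
                first = last = Integer.parseInt(token);
            } else {
                // Range, e.g. "2-5", or open range, e.g. "7-"
                first = Integer.parseInt(token.substring(0, dash).trim());
                String end = token.substring(dash + 1).trim();
                last = end.isEmpty() ? pageCount - 1 : Integer.parseInt(end);
            }
            // Clamp the upper bound to the pages that actually exist
            for (int i = first; i <= Math.min(last, pageCount - 1); i++) {
                result.add(i);
            }
        }
        return result;
    }
}
```

For a ten-page document, PageRangeSketch.expand("0,2-5,7-", 10) yields the indices 0, 2, 3, 4, 5, 7, 8, and 9.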

Extract Data from PDF Column – Example in Java

The following source code shows how to extract a specific column from a PDF document and save it as CSV using the PDF to CSV API endpoint. The code takes a sample PDF file and the coordinates of the required column, which were found with the PDF.co editor. It passes those coordinates to the API, which extracts the entire column into CSV. The resulting CSV file is then downloaded and saved locally.

Java Source Code

The following Java sample calls the PDF to CSV endpoint:

import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import okhttp3.*;
import java.io.*;
import java.net.*;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Main
{
    // Get your own API Key by registering at https://app.pdf.co
    final static String API_KEY = "************";
    // Direct URL of source PDF file.
    // You can also upload your own file into PDF.co and use it as url. Check "Upload File" samples for code snippets: https://github.com/bytescout/pdf-co-api-samples/tree/master/File%20Upload/    
    final static String SourceFileUrl = "https://pdfco-test-files.s3.us-west-2.amazonaws.com/document-parser/sample-invoice.pdf";
    // Comma-separated list of page indices (or ranges) to process. Leave empty for all pages. Example: '0,2-5,7-'.
    final static String Pages = "";
    // PDF document password. Leave empty for unprotected documents.
    final static String Password = "";
    // Destination CSV file name
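    // A relative path without the Windows-only ".\\" prefix works on every platform
    final static Path DestinationFile = Paths.get("result.csv");
    // Coordinates of the region (column) to extract, as reported by the PDF.co editor
    final static String rect = "54,240,180,380";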

    public static void main(String[] args) throws IOException
    {
        // Create HTTP client instance
        OkHttpClient webClient = new OkHttpClient();
        // Prepare URL for `PDF To CSV` API call
        String query = "https://api.pdf.co/v1/pdf/convert/to/csv";
        // Make correctly escaped (encoded) URL
        URL url;
        try
        {
            url = new URI(null, query, null).toURL();
        }
        catch (URISyntaxException e)
        {
            // Fail fast instead of continuing with a null URL
            throw new IOException("Invalid API URL: " + query, e);
        }

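        // Create JSON payload (JsonObject escapes special characters in the values)
        JsonObject payload = new JsonObject();
        payload.addProperty("name", DestinationFile.getFileName().toString());
        payload.addProperty("password", Password);
        payload.addProperty("pages", Pages);
        payload.addProperty("url", SourceFileUrl);
        payload.addProperty("rect", rect);
        String jsonPayload = payload.toString();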

        // Prepare request body
        RequestBody body = RequestBody.create(MediaType.parse("application/json"), jsonPayload);
        // Prepare request
        Request request = new Request.Builder()
            .url(url)
            .addHeader("x-api-key", API_KEY) // (!) Set API Key
            .addHeader("Content-Type", "application/json")
            .post(body)
            .build();
        
        // Execute request; try-with-resources closes the response when done
        try (Response response = webClient.newCall(request).execute())
        {
            if (response.code() == 200)
            {
                // Parse JSON response
                JsonObject json = new JsonParser().parse(response.body().string()).getAsJsonObject();
                String status = json.get("status").getAsString();
                if (!status.equals("error"))
                {
                    // Get URL of generated CSV file
                    String resultFileUrl = json.get("url").getAsString();
                    // Download CSV file
                    downloadFile(webClient, resultFileUrl, DestinationFile.toFile());
                    System.out.printf("Generated CSV file saved as \"%s\" file.%n", DestinationFile.toString());
                }
                else
                {
                    // Display service reported error
                    System.out.println(json.get("message").getAsString());
                }
            }
            else
            {
                // Display request error
                System.out.println(response.code() + " " + response.message());
            }
        }
    }

    public static void downloadFile(OkHttpClient webClient, String url, File destinationFile) throws IOException
    {
        // Prepare request
        Request request = new Request.Builder()
                .url(url)
                .build();
        // Execute request; try-with-resources closes the response and the stream even on failure
        try (Response response = webClient.newCall(request).execute();
             OutputStream output = new FileOutputStream(destinationFile))
        {
            // Save downloaded bytes to file
            output.write(response.body().bytes());
        }
    }
}

Source File for PDF Data Extraction in Java

The source PDF file is the sample invoice referenced by SourceFileUrl in the code above.

Output File in CSV Format

Following is the generated output file in CSV:

"QUANTITY","DESCRIPTION","",
"2","Item 1","",
"5","Item 2","",
"1","Item 3","",
"1","Item 4","",
"10","Item 5","",
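
Once the CSV file is saved, a single column can be read back with a small parser. The following is a minimal sketch under stated assumptions: the class CsvColumnSketch and its methods are hypothetical names introduced for illustration, the file is assumed to have a header row, and quoted fields that span several lines are not handled. For production use, a dedicated CSV library is usually a better choice.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for reading one column out of CSV lines like the output above.
final class CsvColumnSketch {
    // Splits one CSV line into fields, honoring double quotes and "" escapes.
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"' && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    // A doubled quote inside a quoted field is a literal quote
                    current.append('"');
                    i++;
                } else if (c == '"') {
                    inQuotes = false;
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    // Returns the values under the given header name, skipping the header row.
    static List<String> column(List<String> lines, String header) {
        int index = parseLine(lines.get(0)).indexOf(header);
        if (index < 0) {
            throw new IllegalArgumentException("No such column: " + header);
        }
        List<String> values = new ArrayList<>();
        for (String line : lines.subList(1, lines.size())) {
            List<String> fields = parseLine(line);
            if (index < fields.size()) {
                values.add(fields.get(index));
            }
        }
        return values;
    }
}
```

Applied to the lines of the output above, CsvColumnSketch.column(lines, "QUANTITY") returns the values 2, 5, 1, 1, and 10.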

Step-by-Step Guide To Extract Data from PDF to CSV

The following steps explain how the source code above works:

  1. The code imports the packages and libraries it needs to make the API request and to download the resulting file.
  2. It then declares and initializes API_KEY, which users can get by signing up for or logging into a PDF.co account. This API key is required to make requests to the API endpoints.
  3. Next come the values for the request body: the destination file name, the source file URL, the file password, the pages, and the rect parameter. Users can substitute their own values here to customize the code. The sample uses a PDF.co test file as the source.
  4. The rect parameter contains the coordinates of the data to extract from the PDF document. Users can get these coordinates from the PDF.co editor, which reports the coordinates of any region in the document.
  5. The sample code then assembles the variables into a JSON payload and sends the API POST request. A successful request returns the URL of the generated CSV file, which the code downloads and stores locally as result.csv, i.e., the destination file.
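
The payload assembly in step 5 can also be sketched with the standard library alone, which makes it easy to add optional parameters such as inline or async. The class CsvRequestPayload below is a hypothetical helper, not part of any PDF.co SDK. It assumes that inline and async are sent as JSON booleans and the other parameters as strings, and it escapes only backslashes and double quotes, so it is an illustration and not a general JSON writer.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: assembles a JSON request body for the PDF to CSV endpoint.
final class CsvRequestPayload {
    private final Map<String, Object> fields = new LinkedHashMap<>();

    // "url" is the only required parameter, so the constructor demands it
    CsvRequestPayload(String url) {
        fields.put("url", url);
    }

    // Adds an optional parameter; Boolean values are written unquoted (assumed format)
    CsvRequestPayload with(String name, Object value) {
        fields.put(name, value);
        return this;
    }

    // Renders the fields in insertion order as a JSON object
    String toJson() {
        StringJoiner json = new StringJoiner(", ", "{", "}");
        for (Map.Entry<String, Object> entry : fields.entrySet()) {
            Object value = entry.getValue();
            String rendered = (value instanceof Boolean)
                    ? value.toString()
                    : "\"" + escape(value.toString()) + "\"";
            json.add("\"" + entry.getKey() + "\": " + rendered);
        }
        return json.toString();
    }

    // Minimal escaping: backslashes and double quotes only
    private static String escape(String text) {
        return text.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

For example, new CsvRequestPayload(sourceUrl).with("pages", "0-1").with("rect", "54,240,180,380").with("inline", true).toJson() produces a body that keeps url first and appends the optional parameters in the order they were added.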

In conclusion, this tutorial has shown how to extract data from specific columns of a PDF file in Java using the PDF.co PDF to CSV endpoint, with sample code that can be adapted to other documents and coordinates.