How to Extract Text from Scanned PDF Using PDF.co Web API in Java
Working with PDF files without third-party tools or libraries can be a challenge, especially when the job is not trivial, such as extracting text from a scanned PDF.
Here, I will show you how to extract the text contained in a scanned page of a PDF file using Java with OkHttp, Gson, and the PDF.co RESTful Web API.
I will use a sample scanned PDF from the Fujitsu samples page, which I saved locally as ‘./ScannedPDF.pdf’.
Follow these steps to extract text from a scanned PDF file.
Step 1: Add Maven Dependencies
We will use the OkHttp HTTP client and the Gson library to serialize and deserialize JSON, so the following Maven dependencies must be added to the pom.xml file:
<dependency>
<groupId>com.squareup.okhttp3</groupId>
<artifactId>okhttp</artifactId>
<version>4.9.1</version>
</dependency>
<dependency>
<groupId>com.google.code.gson</groupId>
<artifactId>gson</artifactId>
<version>2.8.5</version>
</dependency>
These are the only external dependencies needed for our purposes.
Step 2: Set the API Key
To start working with the Web API, you need an API key, which is available in the ‘Your API Key’ tab after you sign in on the main page. The API key must be sent with every API request, either as a URL parameter or as an HTTP header (the header is preferred):
private static final String API_KEY = "__YOUR_API_KEY__";
…
new Request.Builder().addHeader("x-api-key", API_KEY)
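Hard-coding the key is fine for a quick test, but it is easy to commit it to version control by accident. Here is a minimal sketch of one alternative, assuming the key is kept in an environment variable. The variable name PDFCO_API_KEY and the ApiKeyResolver helper are hypothetical and are not something the Web API requires:

```java
import java.util.Map;

class ApiKeyResolver {
    // Hypothetical variable name; use whatever fits your environment.
    static final String ENV_VAR = "PDFCO_API_KEY";

    // Returns the trimmed key from the given environment map,
    // or fails fast with a clear message when it is missing or blank.
    static String resolve(Map<String, String> env) {
        String key = env.get(ENV_VAR);
        if (key == null || key.isBlank()) {
            throw new IllegalStateException("Set the " + ENV_VAR + " environment variable");
        }
        return key.trim();
    }
}
```

It could be called as ApiKeyResolver.resolve(System.getenv()) and the result passed to addHeader("x-api-key", ...) in place of the API_KEY constant.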
Step 3: Prepare and Get the Pre-signed URL for the File Upload
Next, we have to upload the source PDF file to the Web API engine using the pre-signed URL API. A ready-to-compile code snippet for pre-signed URL retrieval is available on the company’s GitHub page here.
To retrieve the pre-signed URL, make a web request to the API; if there is no error, the response will contain the pre-signed URL in the presignedUrl property.
private static WebApiResponse getPresignedUrlResponse() throws IOException {
var client = new OkHttpClient().newBuilder()
.build();
var request = new Request.Builder()
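// Send only the file name as a URL-encoded query parameter, not the raw local path
.url(HttpUrl.get("https://api.pdf.co/v1/file/upload/get-presigned-url").newBuilder()
.addQueryParameter("name", Path.of(SCANNED_PDF_LOCAL_PATH).getFileName().toString())
.addQueryParameter("encrypt", "true")
.build())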
.method("GET", null)
.addHeader("x-api-key", API_KEY)
.build();
var response = client.newCall(request).execute();
var json = response.body().string();
return new Gson().fromJson(json, WebApiResponse.class);
}
Here, WebApiResponse is defined as follows:
class WebApiResponse {
public String presignedUrl;
public String url;
public boolean error;
public String status;
public int errorCode;
public String message;
public String jobId;
}
The fields have the following meaning:
presignedUrl – the URL the local PDF file will be uploaded to;
url – the URL used to access the uploaded file;
status – a status description;
error – a boolean flag indicating whether the response contains an error;
errorCode – an integer error code (for example, 401 – unauthorized);
message – a text description of the error, if any;
jobId – not used yet; we will need it later when placing the actual asynchronous text extraction job.
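The getPresignedUrlResponse method returns whatever the API sent back, so it is worth checking the error flag before using presignedUrl. Below is a minimal sketch of such a check. The ApiResult class is a trimmed, hypothetical stand-in for WebApiResponse so that the snippet compiles on its own, and the exception type is my own choice:

```java
class ApiResult {
    String presignedUrl;
    boolean error;
    int errorCode;
    String message;

    // Returns the pre-signed URL, or throws if the API reported an error
    // or the URL is missing from the response.
    String requirePresignedUrl() {
        if (error) {
            throw new IllegalStateException("API error " + errorCode + ": " + message);
        }
        if (presignedUrl == null || presignedUrl.isEmpty()) {
            throw new IllegalStateException("Response contains no presignedUrl");
        }
        return presignedUrl;
    }
}
```

The same kind of check could be added directly to WebApiResponse, so that a rejected request fails with a readable message instead of a null URL being passed to the upload step.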
Step 4: Upload Source PDF into the Cloud
As soon as we have the pre-signed URL, we can upload the local file to the Web API cloud:
private static boolean uploadFile(String url, Path sourceFile) throws IOException {
var client = new OkHttpClient().newBuilder()
.build();
// Prepare request body
var body = RequestBody.create(sourceFile.toFile(), MediaType.parse("application/octet-stream"));
// Prepare request
var request = new Request.Builder()
.url(url)
.addHeader("x-api-key", API_KEY) // (!) Set API Key
.addHeader("content-type", "application/octet-stream")
.put(body)
.build();
// Execute request
var response = client.newCall(request).execute();
return (response.code() == 200);
}
A ready-to-compile code snippet for uploading a file can be found here.
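The uploadFile method assumes that the local file exists and really is a PDF. A quick local check before the upload gives a clearer error than a failed request or a failed conversion later on. Here is a sketch using only java.nio. The UploadPreflight name is hypothetical, and the check relies on the fact that PDF files start with the bytes "%PDF-":

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class UploadPreflight {
    // Returns true if the path is a readable regular file
    // whose first five bytes are the PDF signature "%PDF-".
    static boolean looksLikePdf(Path file) throws IOException {
        if (!Files.isRegularFile(file) || !Files.isReadable(file)) {
            return false;
        }
        try (InputStream in = Files.newInputStream(file)) {
            byte[] header = in.readNBytes(5);
            return new String(header, StandardCharsets.US_ASCII).equals("%PDF-");
        }
    }
}
```

Calling UploadPreflight.looksLikePdf(Path.of(SCANNED_PDF_LOCAL_PATH)) before requesting the pre-signed URL would avoid spending a request on a missing or mistyped path.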
Step 5: Place Async Job to Convert PDF into Text
Having uploaded the file to the Web API cloud, we can use the uploaded file URL to place a job that converts the scanned PDF to text.
There are a few points to note before we use the PDF-to-text conversion API.
We will have to use OCR, and that might take some time. Without asynchronous processing, we can easily end up with timeout errors. In fact, you must make an asynchronous call whenever the processing time exceeds 25 seconds; otherwise a timeout error is returned and the job will not finish. You can start any Web API process asynchronously by setting the additional parameter ‘async’ to ‘true’ (see here).
If you know that the job will take less than 25 seconds, you can omit the ‘async’ parameter or set it to ‘false’.
We also have to set an OCRMode profile in the request to explicitly tell the engine to use OCR. For this example, I will use the ‘TextFromImagesAndVectorsAndRepairedFonts’ OCRMode. A full list of available profiles can be found in the docs here.
private static WebApiResponse placeJobToConvertPdfToText(String uploadedFileUrl) throws IOException {
var client = new OkHttpClient().newBuilder()
.build();
var url = API_URL_BASE + "/v1/pdf/convert/to/text";
var parameters = new HashMap<String, Object>();
parameters.put("name", "Extracted.txt");
parameters.put("url", uploadedFileUrl);
parameters.put("async", true);
var profiles = "{ 'profiles':[ { 'profile1':{ 'OCRMode':'TextFromImagesAndVectorsAndRepairedFonts' } } ] }";
parameters.put("profiles", profiles);
var payload = new Gson().toJson(parameters);
// Prepare request body
var body = RequestBody.create(payload, MediaType.parse("application/json"));
// Prepare request
var request = new Request.Builder()
.url(url)
.addHeader("x-api-key", API_KEY) // (!) Set API Key
.addHeader("Content-Type", "application/json")
.post(body)
.build();
// Execute request
var response = client.newCall(request).execute();
var json = response.body().string();
return new Gson().fromJson(json, WebApiResponse.class);
}
A ready-to-compile code snippet for using the conversion API can be found here.
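The profiles string in placeJobToConvertPdfToText hard-codes the OCR mode. As a small sketch, a hypothetical helper could build the same string from a mode name, which makes it easier to try another profile from the docs. The OcrProfiles name and the assumption that mode names contain only letters and digits are mine:

```java
class OcrProfiles {
    // Builds the profiles string in the same shape as the hard-coded one,
    // with the OCR mode passed in as a parameter.
    static String forMode(String ocrMode) {
        // Assumption: mode names are plain alphanumeric identifiers,
        // so anything else is rejected instead of being escaped.
        if (ocrMode == null || !ocrMode.matches("[A-Za-z0-9]+")) {
            throw new IllegalArgumentException("Unexpected OCRMode: " + ocrMode);
        }
        return "{ 'profiles':[ { 'profile1':{ 'OCRMode':'" + ocrMode + "' } } ] }";
    }
}
```

With this helper, the line that fills the parameters map could become parameters.put("profiles", OcrProfiles.forMode("TextFromImagesAndVectorsAndRepairedFonts")).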
Step 6: Check Job Status and Retrieve the Result
After the job is placed and the Web API has returned a job ID, you can poll periodically to check the job status:
private static String checkJobStatus(String jobId) throws IOException {
var client = new OkHttpClient().newBuilder()
.build();
var body = new MultipartBody.Builder().setType(MultipartBody.FORM)
.addFormDataPart("jobid", jobId)
.build();
var request = new Request.Builder()
.url("https://api.pdf.co/v1/job/check")
.method("POST", body)
.addHeader("x-api-key", API_KEY)
.build();
var response = client.newCall(request).execute();
var json = response.body().string();
return new Gson().fromJson(json, WebApiResponse.class).status;
}
A ready-to-compile code snippet for checking the job status can be found here.
Waiting can be implemented as a method that takes the job ID to check and an action to invoke once the job succeeds:
private static void waitTillJobIsDone(String jobId, Runnable onDone) throws IOException, InterruptedException {
while (true) {
var status = checkJobStatus(jobId);
if ("success".equals(status)) { // null-safe: status may be missing in an error response
onDone.run();
break;
}
if ("working".equals(status)) {
// Pause for a few seconds
Thread.sleep(10000);
} else {
System.out.println(status);
break;
}
}
}
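The loop above keeps polling for as long as the job reports "working", so a stuck job would block the program forever. Below is a minimal sketch of a bounded variant. The JobPoller class and its parameters are hypothetical, and the status is read through a Supplier so that the snippet runs without network access:

```java
import java.util.function.Supplier;

class JobPoller {
    // Polls statusSource until it reports "success", reports any status
    // other than "working", or maxAttempts polls have been made.
    // Returns true only when the job finished with "success".
    static boolean waitForSuccess(Supplier<String> statusSource, int maxAttempts, long delayMillis)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            String status = statusSource.get();
            if ("success".equals(status)) {
                return true;
            }
            if (!"working".equals(status)) {
                // Any other status (or null) is treated as a failure.
                return false;
            }
            if (attempt < maxAttempts) {
                Thread.sleep(delayMillis);
            }
        }
        return false; // still "working" after maxAttempts polls
    }
}
```

With the code from this article, statusSource could be a lambda that calls checkJobStatus(jobId) and rethrows IOException as UncheckedIOException. For example, 30 attempts with a 10000 ms delay would give up after roughly five minutes.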
Step 7: Main Function
Finally, here is our main function, which shows the full workflow:
private static final String API_KEY = "__YOUR_API_KEY__";
private static final String API_URL_BASE = "https://api.pdf.co";
// the source document to extract text from
private static final String SCANNED_PDF_LOCAL_PATH = ".\\ScannedPDF.pdf";
public static void main(String[] args) {
try {
var presignedUrlResponse = getPresignedUrlResponse();
uploadFile(presignedUrlResponse.presignedUrl, Path.of(SCANNED_PDF_LOCAL_PATH));
var convertPdfResponse = placeJobToConvertPdfToText(presignedUrlResponse.url);
var resultFileUrl = convertPdfResponse.url;
waitTillJobIsDone(convertPdfResponse.jobId, () -> {
var client = new OkHttpClient().newBuilder()
.build();
var request = new Request.Builder()
.url(resultFileUrl)
.method("GET", null)
.addHeader("x-api-key", API_KEY)
.build();
// try-with-resources closes the response even if reading the body fails
try (var response = client.newCall(request).execute()) {
var convertedText = response.body().string();
System.out.println(convertedText);
} catch (IOException e) {
e.printStackTrace();
}
});
} catch (Exception e) {
e.printStackTrace();
}
}
The API logs are available here and contain detailed information about each request and response. There you can also see the credits consumed by each call as well as the estimated cost.
NB: The full code listing is given below for reference:
package org.example;
import com.google.gson.Gson;
import okhttp3.*;
import java.io.IOException;
import java.nio.file.Path;
import java.util.HashMap;
public class App {
private static final String API_KEY = "__YOUR_API_KEY__";
private static final String API_URL_BASE = "https://api.pdf.co";
// the source document to extract text from
private static final String SCANNED_PDF_LOCAL_PATH = "C:\\Temp\\ScannedPDF.pdf";
public static void main(String[] args) {
try {
var presignedUrlResponse = getPresignedUrlResponse();
uploadFile(presignedUrlResponse.presignedUrl, Path.of(SCANNED_PDF_LOCAL_PATH));
var convertPdfResponse = placeJobToConvertPdfToText(presignedUrlResponse.url);
var resultFileUrl = convertPdfResponse.url;
waitTillJobIsDone(convertPdfResponse.jobId, () -> {
var client = new OkHttpClient().newBuilder()
.build();
var request = new Request.Builder()
.url(resultFileUrl)
.method("GET", null)
.addHeader("x-api-key", API_KEY)
.build();
// try-with-resources closes the response even if reading the body fails
try (var response = client.newCall(request).execute()) {
var convertedText = response.body().string();
System.out.println(convertedText);
} catch (IOException e) {
e.printStackTrace();
}
});
} catch (Exception e) {
e.printStackTrace();
}
}
private static WebApiResponse getPresignedUrlResponse() throws IOException {
var client = new OkHttpClient().newBuilder()
.build();
var request = new Request.Builder()
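// Send only the file name as a URL-encoded query parameter, not the raw local path
.url(HttpUrl.get("https://api.pdf.co/v1/file/upload/get-presigned-url").newBuilder()
.addQueryParameter("name", Path.of(SCANNED_PDF_LOCAL_PATH).getFileName().toString())
.addQueryParameter("encrypt", "true")
.build())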
.method("GET", null)
.addHeader("x-api-key", API_KEY)
.build();
var response = client.newCall(request).execute();
var json = response.body().string();
return new Gson().fromJson(json, WebApiResponse.class);
}
private static boolean uploadFile(String url, Path sourceFile) throws IOException {
var client = new OkHttpClient().newBuilder()
.build();
// Prepare request body
var body = RequestBody.create(sourceFile.toFile(), MediaType.parse("application/octet-stream"));
// Prepare request
var request = new Request.Builder()
.url(url)
.addHeader("x-api-key", API_KEY) // (!) Set API Key
.addHeader("content-type", "application/octet-stream")
.put(body)
.build();
// Execute request
var response = client.newCall(request).execute();
return (response.code() == 200);
}
private static WebApiResponse placeJobToConvertPdfToText(String uploadedFileUrl) throws IOException {
var client = new OkHttpClient().newBuilder()
.build();
var url = API_URL_BASE + "/v1/pdf/convert/to/text";
var parameters = new HashMap<String, Object>();
parameters.put("name", "Extracted.txt");
parameters.put("url", uploadedFileUrl);
parameters.put("async", true);
var profiles = "{ 'profiles':[ { 'profile1':{ 'OCRMode':'TextFromImagesAndVectorsAndRepairedFonts' } } ] }";
parameters.put("profiles", profiles);
var payload = new Gson().toJson(parameters);
// Prepare request body
var body = RequestBody.create(payload, MediaType.parse("application/json"));
// Prepare request
var request = new Request.Builder()
.url(url)
.addHeader("x-api-key", API_KEY) // (!) Set API Key
.addHeader("Content-Type", "application/json")
.post(body)
.build();
// Execute request
var response = client.newCall(request).execute();
var json = response.body().string();
return new Gson().fromJson(json, WebApiResponse.class);
}
private static String checkJobStatus(String jobId) throws IOException {
var client = new OkHttpClient().newBuilder()
.build();
var body = new MultipartBody.Builder().setType(MultipartBody.FORM)
.addFormDataPart("jobid", jobId)
.build();
var request = new Request.Builder()
.url("https://api.pdf.co/v1/job/check")
.method("POST", body)
.addHeader("x-api-key", API_KEY)
.build();
var response = client.newCall(request).execute();
var json = response.body().string();
return new Gson().fromJson(json, WebApiResponse.class).status;
}
private static void waitTillJobIsDone(String jobId, Runnable onDone) throws IOException, InterruptedException {
while (true) {
var status = checkJobStatus(jobId);
if ("success".equals(status)) { // null-safe: status may be missing in an error response
onDone.run();
break;
}
if ("working".equals(status)) {
// Pause for a few seconds
Thread.sleep(10000);
} else {
System.out.println(status);
// Stop polling on any other status; otherwise the loop never ends
break;
}
}
}
}
class WebApiResponse {
public String presignedUrl;
public String url;
public boolean error;
public String status;
public int errorCode;
public String message;
public String jobId;
}