How to Extract Text from Scanned PDF Using PDF.co Web API in C#
Working with PDF files is often difficult without third-party tools or libraries, and tasks such as extracting text from scanned PDF documents are harder still.
In this guide, I will demonstrate how to extract text from a scanned page within a PDF file using only the standard .NET library, Newtonsoft.Json, and the PDF.co RESTful Web API.
For this demonstration, we'll use a sample scanned PDF from the Fujitsu sample pages: https://www.fujitsu.com/global/Images/sv600_c_automatic.pdf. The file has been downloaded and saved at the local path './ScannedPDF.pdf'.
Step 1: Set Your API Key
To start working with the PDF.co Web API, you need your API key, which appears on your dashboard after you create an account and sign in at https://app.pdf.co. The API key must be sent with every API request, either as a URL parameter or as an HTTP header (the header is preferred):
private static void SetApiKey()
{
HttpClient.BaseAddress = new Uri("https://api.pdf.co");
HttpClient.DefaultRequestHeaders.Add("x-api-key", API_KEY);
}
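If you would rather not hardcode the key in source, one option is to read it from an environment variable. The following is only a sketch under assumptions: the variable name PDFCO_API_KEY and the helper class ApiKeySetup are hypothetical names chosen for illustration and are not part of the PDF.co API.

```csharp
using System;
using System.Net.Http;

public static class ApiKeySetup
{
    // Builds an HttpClient whose requests all carry the x-api-key header.
    public static HttpClient CreateClient(string apiKey)
    {
        if (string.IsNullOrWhiteSpace(apiKey))
            throw new ArgumentException("API key is missing.", nameof(apiKey));

        var client = new HttpClient { BaseAddress = new Uri("https://api.pdf.co") };
        client.DefaultRequestHeaders.Add("x-api-key", apiKey);
        return client;
    }

    // "PDFCO_API_KEY" is a hypothetical variable name used only in this sketch.
    public static HttpClient CreateClientFromEnvironment()
    {
        var apiKey = Environment.GetEnvironmentVariable("PDFCO_API_KEY");
        return CreateClient(apiKey);
    }
}
```

Failing early on a missing key makes the problem visible before the first request is sent rather than as an authorization error from the server.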
Step 2: Prepare and Get the Presigned URL for the File Upload
Next, we need a pre-signed URL to which the source PDF file can be uploaded. We request it from the pre-signed URL API: /file/upload/get-presigned-url.
private static async Task<WebApiResponse> GetPresignedUrlResponse()
{
var fileName = Path.GetFileName(ScannedPdfLocalPath);
// Escape only the file name; Uri.EscapeUriString is obsolete in recent .NET versions
var presignedUrl = $"/v1/file/upload/get-presigned-url?contenttype=application/octet-stream&name={Uri.EscapeDataString(fileName)}";
var response = await HttpClient.GetAsync(presignedUrl);
var json = await response.Content.ReadAsStringAsync();
return JsonConvert.DeserializeObject<WebApiResponse>(json);
}
where WebApiResponse is defined as follows:
public class WebApiResponse
{
[JsonProperty("presignedUrl")]
public string PresignedUrl { get; set; }
[JsonProperty("url")]
public string Url { get; set; }
[JsonProperty("jobId")]
public string JobId { get; set; }
}
and PresignedUrl is the URL where the local PDF file will be uploaded to, Url is the URL link to access the uploaded file, and JobId is not used just yet, but we will use it later on when performing an actual text extraction job.
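To make the mapping concrete, here is a small sketch of how the JsonProperty attributes bind JSON fields to the class. The JSON string is a made-up placeholder for illustration only, not a real API response, and ResponseDemo is a hypothetical helper name.

```csharp
using Newtonsoft.Json;

public class WebApiResponse
{
    [JsonProperty("presignedUrl")]
    public string PresignedUrl { get; set; }
    [JsonProperty("url")]
    public string Url { get; set; }
    [JsonProperty("jobId")]
    public string JobId { get; set; }
}

public static class ResponseDemo
{
    // Placeholder JSON with invented values; real responses carry real URLs.
    public const string SampleJson =
        "{\"presignedUrl\":\"https://example.com/upload\",\"url\":\"https://example.com/file\"}";

    // Fields missing from the JSON (here: jobId) are left as null.
    public static WebApiResponse Parse(string json) =>
        JsonConvert.DeserializeObject<WebApiResponse>(json);
}
```

Because one class serves several endpoints, properties that a given response does not include simply remain null, which is why JobId is empty at this stage.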
Step 3: Upload Source PDF into the Cloud
As soon as we have the pre-signed URL we can upload the local file into the Web API cloud:
private static async Task UploadFile(string presignedUrl)
{
var uploadData = File.ReadAllBytes(ScannedPdfLocalPath);
var requestContent = new ByteArrayContent(uploadData);
requestContent.Headers.ContentType = new MediaTypeHeaderValue("application/octet-stream");
await HttpClient.PutAsync(presignedUrl, requestContent);
}
Step 4: Place Async Job to Convert PDF into Text
After uploading the file to the Web API cloud, the next step involves using the file's URL to initiate a job for converting the scanned PDF to text. However, there are important considerations to take into account before leveraging the PDF-to-text conversion API.
- OCR Processing Time: The conversion process involves Optical Character Recognition (OCR), which can be time-consuming. It's crucial to manage this aspect to avoid potential timeout errors.
- Asynchronous Calls: To prevent timeouts, especially for processes exceeding 25 seconds, it's advisable to make asynchronous API calls. This can be achieved by including an additional parameter, async, set to true in your API request. For detailed guidance on running background jobs, refer to the PDF.co documentation on asynchronous processing.
- Handling Shorter Tasks: If you're confident the task will complete in less than 25 seconds, you might opt to not use the async parameter or set it to false.
- Setting the OCRMode: Explicitly specifying an OCRMode profile in your request instructs the engine to utilize OCR for the conversion. In this instance, I'll employ the TextFromImagesAndVectorsAndRepairedFonts OCRMode. For a comprehensive list of available OCRMode profiles, consult the PDF.co documentation on OCR profiles.
private static async Task<WebApiResponse> PlaceJobToConvertPdfToText(string uploadedFileUrl)
{
var convertToTextUrl = "/v1/pdf/convert/to/text";
var extractedText = @".\Extracted.txt";
var parameters = new Dictionary<string, object>();
parameters.Add("name", Path.GetFileName(extractedText));
parameters.Add("url", uploadedFileUrl);
parameters.Add("async", true);
var profiles =
@"{
'profiles':[
{
'profile1':{
'OCRMode':'TextFromImagesAndVectorsAndRepairedFonts'
}
}
]
}";
parameters.Add("profiles", profiles);
var payload = JsonConvert.SerializeObject(parameters);
var response = await HttpClient.PostAsync(convertToTextUrl, new StringContent(payload, Encoding.UTF8, "application/json"));
var textResult = await response.Content.ReadAsStringAsync();
return JsonConvert.DeserializeObject<WebApiResponse>(textResult);
}
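Since the async parameter is the only difference between a background job and a direct call, it can help to build the request body in one place and pass the flag in. This is a minimal sketch, assuming the same parameter names as the request above; ConvertRequest and BuildPayload are hypothetical names introduced for illustration.

```csharp
using System.Collections.Generic;
using Newtonsoft.Json;

public static class ConvertRequest
{
    // Builds the JSON body for /v1/pdf/convert/to/text.
    // runAsync = true asks the API to run the conversion as a background job.
    public static string BuildPayload(string fileUrl, string outputName, bool runAsync)
    {
        // Same profile content as in the request above, written on one line.
        var profiles =
            "{'profiles':[{'profile1':{'OCRMode':'TextFromImagesAndVectorsAndRepairedFonts'}}]}";

        var parameters = new Dictionary<string, object>
        {
            ["name"] = outputName,
            ["url"] = fileUrl,
            ["async"] = runAsync,
            ["profiles"] = profiles
        };
        return JsonConvert.SerializeObject(parameters);
    }
}
```

Calling BuildPayload(uploadedFileUrl, "Extracted.txt", true) produces the body for a background job, while passing false gives the direct variant for short tasks.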
Step 5: Check Job Status and Retrieve the Result
After the job is placed and the Web API has returned a job ID, you can poll periodically to check the job status: https://developer.pdf.co/api/background-job-check/index.html.
private static async Task<string> CheckJobStatus(string jobId)
{
string url = "/v1/job/check?jobid=" + jobId;
var response = await HttpClient.GetStringAsync(url);
var json = JObject.Parse(response);
return Convert.ToString(json["status"]);
}
Waiting can be implemented as a method that takes the job ID to poll and an action to invoke once the job succeeds.
private static async Task WaitTillJobIsDone(string jobId, Action onDone)
{
while (true)
{
var status = await CheckJobStatus(jobId);
if (status == "success")
{
onDone();
break;
}
if (status == "working")
{
// Pause for 10 seconds without blocking the thread
await Task.Delay(10000);
}
else
{
Console.WriteLine(status);
break;
}
}
}
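A polling loop that only stops on a final status can run indefinitely if a job never leaves the "working" state. As a hedged sketch, the loop can be capped at a maximum number of attempts. JobPoller and PollAsync are hypothetical names, and the status check is passed in as a delegate so the helper does not depend on any particular HTTP call.

```csharp
using System;
using System.Threading.Tasks;

public static class JobPoller
{
    // Polls checkStatus until it returns something other than "working",
    // or until maxAttempts polls have been made. Returns the last status seen.
    public static async Task<string> PollAsync(
        Func<Task<string>> checkStatus, TimeSpan delay, int maxAttempts)
    {
        var status = "working";
        for (var attempt = 0; attempt < maxAttempts; attempt++)
        {
            status = await checkStatus();
            if (status != "working")
                return status;           // "success" or some failure status
            await Task.Delay(delay);     // non-blocking pause between polls
        }
        return status;                   // still "working": treat as a timeout
    }
}
```

With the methods from this article, a call could look like await JobPoller.PollAsync(() => CheckJobStatus(jobId), TimeSpan.FromSeconds(10), 30), after which the caller decides what to do based on the returned status.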
Finally, here is our main function which shows a full workflow:
private static readonly string API_KEY = "__YOUR_API_KEY__";
private static readonly HttpClient HttpClient = new HttpClient();
// the source document to extract text from
private static readonly string ScannedPdfLocalPath = "ScannedPDF.pdf";
static void Main(string[] args)
{
SetApiKey();
var presignedUrlResponse = GetPresignedUrlResponse().Result;
UploadFile(presignedUrlResponse.PresignedUrl).Wait();
var convertPdfResponse = PlaceJobToConvertPdfToText(presignedUrlResponse.Url).Result;
var resultFileUrl = convertPdfResponse.Url;
WaitTillJobIsDone(convertPdfResponse.JobId, () => Console.Write(HttpClient.GetStringAsync(resultFileUrl).Result)).Wait();
}
In this example, we used just a small part of the PDF.co Web API's functionality. Extensive documentation is available at https://developer.pdf.co, where you can find all the details of the API calls we used in this article.
Step 6: Full Code (if needed)
using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;
namespace ExtractTextFromPDF
{
class Program
{
private static readonly string API_KEY = "__YOUR_API_KEY__";
private static readonly HttpClient HttpClient = new HttpClient();
// the source document to extract text from
private static readonly string ScannedPdfLocalPath = "ScannedPDF.pdf";
static void Main(string[] args)
{
SetApiKey();
var presignedUrlResponse = GetPresignedUrlResponse().Result;
UploadFile(presignedUrlResponse.PresignedUrl).Wait();
var convertPdfResponse = PlaceJobToConvertPdfToText(presignedUrlResponse.Url).Result;
var resultFileUrl = convertPdfResponse.Url;
WaitTillJobIsDone(convertPdfResponse.JobId, () => Console.Write(HttpClient.GetStringAsync(resultFileUrl).Result)).Wait();
}
private static void SetApiKey()
{
HttpClient.BaseAddress = new Uri("https://api.pdf.co");
HttpClient.DefaultRequestHeaders.Add("x-api-key", API_KEY);
}
private static async Task<WebApiResponse> GetPresignedUrlResponse()
{
var fileName = Path.GetFileName(ScannedPdfLocalPath);
// Escape only the file name; Uri.EscapeUriString is obsolete in recent .NET versions
var presignedUrl = $"/v1/file/upload/get-presigned-url?contenttype=application/octet-stream&name={Uri.EscapeDataString(fileName)}";
var response = await HttpClient.GetAsync(presignedUrl);
var json = await response.Content.ReadAsStringAsync();
return JsonConvert.DeserializeObject<WebApiResponse>(json);
}
private static async Task UploadFile(string presignedUrl)
{
var uploadData = File.ReadAllBytes(ScannedPdfLocalPath);
var requestContent = new ByteArrayContent(uploadData);
requestContent.Headers.ContentType = new MediaTypeHeaderValue("application/octet-stream");
await HttpClient.PutAsync(presignedUrl, requestContent);
}
private static async Task<WebApiResponse> PlaceJobToConvertPdfToText(string uploadedFileUrl)
{
var convertToTextUrl = "/v1/pdf/convert/to/text";
var extractedText = @".\Extracted.txt";
var parameters = new Dictionary<string, object>();
parameters.Add("name", Path.GetFileName(extractedText));
parameters.Add("url", uploadedFileUrl);
parameters.Add("async", true);
var profiles =
@"{
'profiles':[
{
'profile1':{
'OCRMode':'TextFromImagesAndVectorsAndRepairedFonts'
}
}
]
}";
parameters.Add("profiles", profiles);
var payload = JsonConvert.SerializeObject(parameters);
var response = await HttpClient.PostAsync(convertToTextUrl, new StringContent(payload, Encoding.UTF8, "application/json"));
var textResult = await response.Content.ReadAsStringAsync();
return JsonConvert.DeserializeObject<WebApiResponse>(textResult);
}
private static async Task WaitTillJobIsDone(string jobId, Action onDone)
{
while (true)
{
var status = await CheckJobStatus(jobId);
if (status == "success")
{
onDone();
break;
}
if (status == "working")
{
// Pause for 10 seconds without blocking the thread
await Task.Delay(10000);
}
else
{
Console.WriteLine(status);
break;
}
}
}
private static async Task<string> CheckJobStatus(string jobId)
{
string url = "/v1/job/check?jobid=" + jobId;
var response = await HttpClient.GetStringAsync(url);
var json = JObject.Parse(response);
return Convert.ToString(json["status"]);
}
}
public class WebApiResponse
{
[JsonProperty("presignedUrl")]
public string PresignedUrl { get; set; }
[JsonProperty("url")]
public string Url { get; set; }
[JsonProperty("jobId")]
public string JobId { get; set; }
}
}