How to Extract Text from Scanned PDF Using PDF.co Web API in C#

Working with PDF files without any third-party tools or libraries can be a challenge, especially when the job is not trivial, such as extracting text from a scanned PDF.

In this article I will show you how to extract the text contained in a scanned page of a PDF file using just the standard C# library (plus Newtonsoft.Json) and a RESTful Web API.

I will use a sample scanned PDF from Fujitsu’s samples page, which I saved locally as ‘./ScannedPDF.pdf’: https://www.fujitsu.com/global/Images/sv600_c_automatic.pdf

1. Set Your API Key

To start working with the Web API, you need an API key. It is available in the ‘Your API Key’ tab, which appears after you sign in on the main page https://pdf.co/rest-web-api. The API key must be sent with every API request, either as a URL parameter or as an HTTP header (the header is preferred):

 

private static void SetApiKey()
{
    HttpClient.BaseAddress = new Uri("https://api.pdf.co");
    HttpClient.DefaultRequestHeaders.Add("x-api-key", API_KEY);
}
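
Hard-coding the key is fine for a quick test, but a key kept in source code is easy to leak. Below is a minimal sketch of one alternative, assuming the key is stored in an environment variable. The variable name PDFCO_API_KEY and the ApiKeySetup helper are made up for this illustration and are not part of the PDF.co API:

```csharp
using System;
using System.Net.Http;

public static class ApiKeySetup
{
    // Hypothetical variable name chosen for this sketch.
    public const string ApiKeyVariable = "PDFCO_API_KEY";

    // Reads the key from the environment and fails early if it is missing.
    public static string ReadApiKey()
    {
        var key = Environment.GetEnvironmentVariable(ApiKeyVariable);
        if (string.IsNullOrWhiteSpace(key))
        {
            throw new InvalidOperationException(
                "Set the " + ApiKeyVariable + " environment variable first.");
        }
        return key.Trim();
    }

    // Creates a client that sends the key as the x-api-key header on every request.
    public static HttpClient CreateClient(string apiKey)
    {
        var client = new HttpClient { BaseAddress = new Uri("https://api.pdf.co") };
        client.DefaultRequestHeaders.Add("x-api-key", apiKey);
        return client;
    }
}
```

With this in place, the client could be created as ApiKeySetup.CreateClient(ApiKeySetup.ReadApiKey()) instead of relying on a constant in the source file.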

 

2. Prepare and Get the Presigned URL for the File Upload

Next, we have to upload the source PDF file to the Web API engine. To do that, we first request a pre-signed URL from the generate-secure-url-for-upload API:

private static async Task<WebApiResponse> GetPresignedUrlResponse()
{
    var fileName = Path.GetFileName(ScannedPdfLocalPath);
    // Escape only the file name: it is the one value that may contain spaces or reserved characters
    var presignedUrl = $"/v1/file/upload/get-presigned-url?contenttype=application/octet-stream&name={Uri.EscapeDataString(fileName)}";
    var response = await HttpClient.GetAsync(presignedUrl);
    var json = await response.Content.ReadAsStringAsync();
    return JsonConvert.DeserializeObject<WebApiResponse>(json);
}

where WebApiResponse is defined as follows:

public class WebApiResponse
{
    [JsonProperty("presignedUrl")]
    public string PresignedUrl { get; set; }

    [JsonProperty("url")]
    public string Url { get; set; }

    [JsonProperty("jobId")]
    public string JobId { get; set; }
}

Here PresignedUrl is the URL the local PDF file will be uploaded to, and Url is the link for accessing the uploaded file. JobId is not used yet; we will need it later, when placing the actual text extraction job.
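
One detail of the request is worth a closer look: the file name ends up in a query string, so characters such as spaces, ‘#’ or ‘&’ have to be escaped. The sketch below illustrates this with Uri.EscapeDataString. The PresignedUrlRequest helper is a hypothetical name used only for this example:

```csharp
using System;

public static class PresignedUrlRequest
{
    // Builds the relative request URL. Only the file name is escaped;
    // the rest of the query string matches the request shown earlier.
    public static string Build(string fileName)
    {
        return "/v1/file/upload/get-presigned-url"
             + "?contenttype=application/octet-stream"
             + "&name=" + Uri.EscapeDataString(fileName);
    }
}
```

For a file called ‘Scanned PDF #1.pdf’, the name parameter becomes Scanned%20PDF%20%231.pdf, so the ‘#’ is no longer interpreted as the start of a URL fragment.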

 

3. Upload Source PDF into the Cloud

As soon as we have the pre-signed URL, we can upload the local file into the Web API cloud:

private static async Task UploadFile(string presignedUrl)
{
    // Read the whole file into memory and send it as a raw binary body
    var uploadData = File.ReadAllBytes(ScannedPdfLocalPath);
    var requestContent = new ByteArrayContent(uploadData);
    requestContent.Headers.ContentType = new MediaTypeHeaderValue("application/octet-stream");
    await HttpClient.PutAsync(presignedUrl, requestContent);
}
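
The upload method above ignores the result of the PUT request, so a failed upload would only surface later, when the conversion job cannot find the file. A possible variation, sketched here under the assumption that any non-success HTTP status should stop the workflow, streams the file from disk and checks the response. UploadHelper is a hypothetical name, not something provided by the Web API:

```csharp
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

public static class UploadHelper
{
    // Streams the local file to the pre-signed URL and throws
    // HttpRequestException if the server answers with a non-success status.
    public static async Task UploadFileAsync(HttpClient client, string presignedUrl, string localPath)
    {
        using (var stream = File.OpenRead(localPath))
        using (var content = new StreamContent(stream))
        {
            content.Headers.ContentType = new MediaTypeHeaderValue("application/octet-stream");
            using (var response = await client.PutAsync(presignedUrl, content))
            {
                response.EnsureSuccessStatusCode();
            }
        }
    }
}
```

Streaming avoids loading the whole PDF into memory, which matters mostly for large scans; for a small sample file either approach works.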

4. Place Async Job to Convert PDF into Text

Having uploaded the file to the Web API cloud, we can use the uploaded file URL to place a job that converts the scanned PDF to text.

There are some points to note before we use the PDF-to-text conversion API.

We will have to use OCR, and that might take some time. If we don’t use asynchronous processing, we can easily end up with timeout errors. In fact, you must make an asynchronous call whenever the processing time is greater than 25 seconds; otherwise a timeout error will be returned and the job will not finish. You can start any Web API process asynchronously by adding the ‘async’ parameter set to ‘true’ (see https://apidocs.pdf.co/#how-to-run-a-background-job).

If you know that the job will take less than 25 seconds, you can omit the ‘async’ parameter or set it to ‘false’.

We also have to set an OCRMode profile in the request to explicitly tell the engine to use OCR. For this example, I will use the ‘TextFromImagesAndVectorsAndRepairedFonts’ OCRMode. You can find the full list of available profiles in the docs: https://apidocs.pdf.co/profiles

private static async Task<WebApiResponse> PlaceJobToConvertPdfToText(string uploadedFileUrl)
{
    var convertToTextUrl = "/v1/pdf/convert/to/text";
    var extractedText = @".\Extracted.txt";

    var parameters = new Dictionary<string, object>();
    parameters.Add("name", Path.GetFileName(extractedText));
    parameters.Add("url", uploadedFileUrl);
    parameters.Add("async", true);
    var profiles =
    @"{
        'profiles':[
            {
                'profile1':{
                    'OCRMode':'TextFromImagesAndVectorsAndRepairedFonts'
                }
            }
        ]
    }";

    parameters.Add("profiles", profiles);

    var payload = JsonConvert.SerializeObject(parameters);
    var response = await HttpClient.PostAsync(convertToTextUrl, new StringContent(payload, Encoding.UTF8, "application/json"));
    var textResult = await response.Content.ReadAsStringAsync();
    return JsonConvert.DeserializeObject<WebApiResponse>(textResult);
}
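
The profiles value in the method above is a hand-written JSON string, which is easy to break with a missing quote or brace. As an alternative sketch, the same request body can be assembled with the JObject types from Newtonsoft.Json, so the library takes care of quoting. ConvertRequest is a hypothetical helper, and the sketch assumes the API treats the resulting double-quoted JSON the same way as the single-quoted string used above:

```csharp
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public static class ConvertRequest
{
    // Builds the JSON body for the PDF-to-text request.
    // The profiles are serialized first and passed as a string value,
    // mirroring the structure of the hand-written version.
    public static string BuildPayload(string uploadedFileUrl, string outputName, string ocrMode)
    {
        var profiles = new JObject(
            new JProperty("profiles", new JArray(
                new JObject(
                    new JProperty("profile1", new JObject(
                        new JProperty("OCRMode", ocrMode)))))));

        var request = new JObject
        {
            ["name"] = outputName,
            ["url"] = uploadedFileUrl,
            ["async"] = true,
            ["profiles"] = profiles.ToString(Formatting.None)
        };

        return request.ToString(Formatting.None);
    }
}
```

The returned string can be passed to StringContent exactly like the payload variable in the original method.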

5. Check Job Status and Retrieve the Result

After the job is placed and the Web API has returned a job id, you can poll periodically to check the job status (https://apidocs.pdf.co/42-background-jobs-check):

private static async Task<string> CheckJobStatus(string jobId)
{
    string url = "/v1/job/check?jobid=" + jobId;
    var response = await HttpClient.GetStringAsync(url);
    var json = JObject.Parse(response);
    return Convert.ToString(json["status"]);
}

 

Waiting can be implemented as a method that takes the job id to check and an action to invoke when the job succeeds:

private static async Task WaitTillJobIsDone(string jobId, Action onDone)
{
    while (true)
    {
        var status = await CheckJobStatus(jobId);

        if (status == "success")
        {
            onDone();
            break;
        }
        if (status == "working")
        {
            // Wait 10 seconds before polling again, without blocking the thread
            await Task.Delay(10000);
        }
        else
        {
            // Any other status is treated as final: report it and stop
            Console.WriteLine(status);
            break;
        }
    }
}
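
The loop above keeps polling for as long as the job reports ‘working’. If you prefer to give up after a fixed number of checks, a bounded variant could look like the sketch below. The JobPolling helper, the maxAttempts parameter and the ‘timeout’ return value are assumptions made for this example; ‘timeout’ is a local marker, not a status returned by the Web API:

```csharp
using System;
using System.Threading.Tasks;

public static class JobPolling
{
    // Calls checkStatus until it returns something other than "working"
    // or until maxAttempts checks have been made.
    public static async Task<string> WaitForJobAsync(
        Func<Task<string>> checkStatus, TimeSpan interval, int maxAttempts)
    {
        for (var attempt = 1; attempt <= maxAttempts; attempt++)
        {
            var status = await checkStatus();
            if (status != "working")
            {
                return status; // "success" or any other final status
            }
            if (attempt < maxAttempts)
            {
                await Task.Delay(interval); // non-blocking pause between checks
            }
        }
        return "timeout"; // local marker: the job was still working when we stopped
    }
}
```

Passing the status check as a delegate, for example () => CheckJobStatus(jobId), keeps the waiting logic independent of HttpClient, so it can be tried out with a fake status source.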

 

Finally, here is our main function, which shows the full workflow:

private static readonly string API_KEY = "__YOUR_API_KEY__";
private static readonly HttpClient HttpClient = new HttpClient();
// the source document to extract text from
private static readonly string ScannedPdfLocalPath = "ScannedPDF.pdf";

static void Main(string[] args)
{
    SetApiKey();
    var presignedUrlResponse = GetPresignedUrlResponse().Result;
    UploadFile(presignedUrlResponse.PresignedUrl).Wait();
    var convertPdfResponse = PlaceJobToConvertPdfToText(presignedUrlResponse.Url).Result;
    var resultFileUrl = convertPdfResponse.Url;
    // Once the job succeeds, download the extracted text and print it
    WaitTillJobIsDone(convertPdfResponse.JobId, () => Console.Write(HttpClient.GetStringAsync(resultFileUrl).Result)).Wait();
}

In this example, we used just a small part of what the Web API https://pdf.co/rest-web-api offers. Its documentation is available at https://apidocs.pdf.co/, where you can find all the details of the API calls used in this article.

Full Code (if needed):

using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

namespace ExtractTextFromPDF
{
    class Program
    {
        private static readonly string API_KEY = "__YOUR_API_KEY__";
        private static readonly HttpClient HttpClient = new HttpClient();
        // the source document to extract text from
        private static readonly string ScannedPdfLocalPath = "ScannedPDF.pdf";

        static void Main(string[] args)
        {
            SetApiKey();
            var presignedUrlResponse = GetPresignedUrlResponse().Result;
            UploadFile(presignedUrlResponse.PresignedUrl).Wait();
            var convertPdfResponse = PlaceJobToConvertPdfToText(presignedUrlResponse.Url).Result;
            var resultFileUrl = convertPdfResponse.Url;
            // Once the job succeeds, download the extracted text and print it
            WaitTillJobIsDone(convertPdfResponse.JobId, () => Console.Write(HttpClient.GetStringAsync(resultFileUrl).Result)).Wait();
        }

        private static void SetApiKey()
        {
            HttpClient.BaseAddress = new Uri("https://api.pdf.co");
            HttpClient.DefaultRequestHeaders.Add("x-api-key", API_KEY);
        }

        private static async Task<WebApiResponse> GetPresignedUrlResponse()
        {
            var fileName = Path.GetFileName(ScannedPdfLocalPath);
            // Escape only the file name: it is the one value that may contain spaces or reserved characters
            var presignedUrl = $"/v1/file/upload/get-presigned-url?contenttype=application/octet-stream&name={Uri.EscapeDataString(fileName)}";
            var response = await HttpClient.GetAsync(presignedUrl);
            var json = await response.Content.ReadAsStringAsync();
            return JsonConvert.DeserializeObject<WebApiResponse>(json);
        }

        private static async Task UploadFile(string presignedUrl)
        {
            var uploadData = File.ReadAllBytes(ScannedPdfLocalPath);
            var requestContent = new ByteArrayContent(uploadData);
            requestContent.Headers.ContentType = new MediaTypeHeaderValue("application/octet-stream");
            await HttpClient.PutAsync(presignedUrl, requestContent);
        }

        private static async Task<WebApiResponse> PlaceJobToConvertPdfToText(string uploadedFileUrl)
        {
            var convertToTextUrl = "/v1/pdf/convert/to/text";
            var extractedText = @".\Extracted.txt";
            var parameters = new Dictionary<string, object>();
            parameters.Add("name", Path.GetFileName(extractedText));
            parameters.Add("url", uploadedFileUrl);
            parameters.Add("async", true);
            var profiles =
            @"{
                'profiles':[
                    {
                        'profile1':{
                            'OCRMode':'TextFromImagesAndVectorsAndRepairedFonts'
                        }
                    }
                ]
            }";
            parameters.Add("profiles", profiles);
            var payload = JsonConvert.SerializeObject(parameters);
            var response = await HttpClient.PostAsync(convertToTextUrl, new StringContent(payload, Encoding.UTF8, "application/json"));
            var textResult = await response.Content.ReadAsStringAsync();
            return JsonConvert.DeserializeObject<WebApiResponse>(textResult);
        }

        private static async Task WaitTillJobIsDone(string jobId, Action onDone)
        {
            while (true)
            {
                var status = await CheckJobStatus(jobId);
                if (status == "success")
                {
                    onDone();
                    break;
                }
                if (status == "working")
                {
                    // Wait 10 seconds before polling again, without blocking the thread
                    await Task.Delay(10000);
                }
                else
                {
                    // Any other status is treated as final: report it and stop
                    Console.WriteLine(status);
                    break;
                }
            }
        }

        private static async Task<string> CheckJobStatus(string jobId)
        {
            string url = "/v1/job/check?jobid=" + jobId;
            var response = await HttpClient.GetStringAsync(url);
            var json = JObject.Parse(response);
            return Convert.ToString(json["status"]);
        }
    }

    public class WebApiResponse
    {
        [JsonProperty("presignedUrl")]
        public string PresignedUrl { get; set; }

        [JsonProperty("url")]
        public string Url { get; set; }

        [JsonProperty("jobId")]
        public string JobId { get; set; }
    }
}