COVID-19 has significantly increased our reliance on technology. From working from home to schooling from home, technology is at a tipping point. Healthcare providers have been affected the most, in a positively disruptive way. Technology changes in the medical industry are especially apparent, and providers are working hard to deliver robust data storage and, more importantly, effective data analysis.

As a developer for one of the leading healthcare providers, I have first-hand experience of the struggle and pain behind building these analytical systems. However, if you know the right set of tools and technologies and are open to new experiments, development is fun!

In this article, we'll walk through one of the basic and typical requirements when developing a healthcare analytics system. We'll take a sample medical blood report in PDF format and extract its data in a specific format. In the end, we'll have JSON ready as per our requirement. JSON is the de facto standard for storing and transferring data, and almost every technology framework and database supports it.

The solution we're going to demonstrate here also works seamlessly for scanned PDFs, because PDF.co has built-in support for OCR. Following are the tools/technologies we're going to use:

  • ByteScout Document Parser Template Editor
  • PDF.co (Use Document Parser SDK for an on-premises solution.)

Analyze Document

Let's analyze the input PDF, a sample blood report. What we aim to achieve here is to extract specific information, such as the patient name, report date, and report table, and export it to JSON format.

We emphasize extracting specific information in a predefined format so that we can build a generic parsing mechanism and store the data in a structured, useful manner. Machine analysis requires structured data: analysis performed on raw, unstructured data is unreliable even with trained AI models.
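
To make "structured" more concrete, here is a minimal illustrative sketch in C# of the record each parsed report could become; the class and property names are our own and are not part of any SDK:

using System;
using System.Collections.Generic;

// Illustrative target shape for one parsed blood report (names are ours, not an SDK type).
public class BloodReport
{
    public string PatientName { get; set; }
    public DateTime ReportDate { get; set; }
    public List<TestResultRow> TestResults { get; set; }
}

// One row of the report table; the exact columns depend on the report layout.
public class TestResultRow
{
    public string TestName { get; set; }
    public string Result { get; set; }
    public string ReferenceRange { get; set; }
}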

ByteScout Template Editor

PDF.co also provides API endpoints for converting whole PDF documents to JSON (see the PDF.co API documentation), but that is not our requirement here. We'll first build a data extraction template with the help of the ByteScout Document Parser Template Editor.

PDF.co Document Parser Template Editor

PDF.co Document Parser Template Editor is a cloud-based template generator and viewer. It is fast, secure, and features a rich online editor to generate data parsing templates.

The editor is simple and well designed: users can directly load a sample PDF and start mapping fields. The commands in the top navigation bar help map different fields by simply drawing a rectangle over them. Users can further refine each field by specifying a data type or a regex.

One of the cool features of the Template Editor is that users can see sample output inside the editor itself. This is really useful, as developers know in advance what the output will look like. It is also handy for quick manual data extraction without any coding.

This tool is in beta as of now; it will soon be rolled out on PDF.co.

ByteScout Document Parser Template Editor

The ByteScout Document Parser Template Editor is installed along with the ByteScout Document Parser SDK. You can download the ByteScout community edition to get started with it.

The Template Editor is very good at mapping fields. It provides built-in support for “Detect Fields”, which maps common fields such as phone numbers, social security numbers, etc. For fine-grained control, we can also map individual fields using the “Add rectangle field” option.

Compared to the PDF.co Template Editor, this is an offline version provided by ByteScout. It also offers some advanced features, such as automatic field detection.

Building a Template

Following is the sample template we've mapped for this demonstration. We've mapped three fields: PatientName, ReportName, and TestResults.

  • PatientName: Maps to the patient name field in the sample PDF. The data type for this field is a string.
  • ReportName: Maps to the report date/time in the sample PDF, using the SmartDate data type.
  • TestResults: This is a very interesting field. We've selected the entire region where the report data is displayed and marked it as a Rectangle Table type.

One cool thing about the Template Editor is that it provides built-in support for exporting the templated data to various formats such as JSON, CSV, XML, YAML, or a simple GRID format. We can quickly check the generated output directly inside the Template Editor!

Blood Report YAML Template

Once our template is ready and all field mappings are complete, we can export the template to YAML format. For this sample, the exported template looks like the following:

templateName: BloodTestTemplate
templateVersion: 4
templatePriority: 0
detectionRules:
  keywords: []
objects:
- name: PatientName
  objectType: field
  fieldProperties:
    fieldType: rectangle
    rectangle:
    - 177.75
    - 123.75
    - 62.25
    - 12.75
    pageIndex: 0
- name: ReportName
  objectType: field
  fieldProperties:
    fieldType: rectangle
    expression: '{{SmartDate}}'
    dataType: date
    rectangle:
    - 335.25
    - 94.5
    - 65.25
    - 12
    pageIndex: 0
- name: TestResults
  objectType: table
  tableProperties:
    start:
      y: 261.75
      pageIndex: 0
    end:
      y: 712.5
      pageIndex: 0
    left: 41.25
    right: 573.75
    rowMergingRule: byBorders

Program to Automate

So far we've identified the fields required for data export and reviewed a preview of the JSON output; now we're ready to automate the process. We're going to write a program that extracts JSON data from any blood report with the same format.

Following is the C# code snippet. Let's review it first; we'll analyze it afterward.

// Parse with PDF.Co API
static void ParseWithPDFCoApi(string SourceFile, string sampleTemplate)
{
    // The authentication key (API Key).
    // Get your own by registering at https://app.pdf.co/documentation/api
    const String API_KEY = "***********************************";

    // PDF document password. Leave empty for unprotected documents.
    const string Password = "";

    // Destination JSON file name
    const string DestinationFile = @".\result.json";

    // (!) Make asynchronous job
    const bool Async = true;

    // Template text. Use Document Parser SDK (https://bytescout.com/products/developer/documentparsersdk/index.html)
    // to create templates.
    // Read template from file:
    String templateText = File.ReadAllText(sampleTemplate);

    // Create standard .NET web client instance
    WebClient webClient = new WebClient();

    // Set API Key
    webClient.Headers.Add("x-api-key", API_KEY);

    // 1. RETRIEVE THE PRESIGNED URL TO UPLOAD THE FILE.
    // * If you already have a direct file URL, skip to the step 3.

    // Prepare URL for `Get Presigned URL` API call
    string query = Uri.EscapeUriString(string.Format(
        "https://api.pdf.co/v1/file/upload/get-presigned-url?contenttype=application/octet-stream&name={0}",
        Path.GetFileName(SourceFile)));

    try
    {
        // Execute request
        string response = webClient.DownloadString(query);

        // Parse JSON response
        JObject json = JObject.Parse(response);

        if (json["error"].ToObject() == false)
        {
            // Get URL to use for the file upload
            string uploadUrl = json["presignedUrl"].ToString();
            string uploadedFileUrl = json["url"].ToString();

            // 2. UPLOAD THE FILE TO CLOUD.

            webClient.Headers.Add("content-type", "application/octet-stream");
            webClient.UploadFile(uploadUrl, "PUT", SourceFile); // You can use UploadData() instead if your file is byte[] or Stream
            webClient.Headers.Remove("content-type");

            // 3. PARSE UPLOADED PDF DOCUMENT

            // URL for `Document Parser` API call
            query = Uri.EscapeUriString(string.Format(
                "https://api.pdf.co/v1/pdf/documentparser?url={0}&async={1}",
                uploadedFileUrl,
                Async));

            Dictionary<string, string> requestBody = new Dictionary<string, string>();
            requestBody.Add("template", templateText);

            // Execute request
            response = webClient.UploadString(query, "POST", JsonConvert.SerializeObject(requestBody));

            // Parse JSON response
            json = JObject.Parse(response);

            if (json["error"].ToObject() == false)
            {
                // Asynchronous job ID
                string jobId = json["jobId"].ToString();
                // Get URL of generated JSON file
                string resultFileUrl = json["url"].ToString();

                // Check the job status in a loop. 
                // If you don't want to pause the main thread you can rework the code 
                // to use a separate thread for the status checking and completion.
                do
                {
                    string status = CheckJobStatus(webClient, jobId); // Possible statuses: "working", "failed", "aborted", "success".

                    // Display timestamp and status (for demo purposes)
                    Console.WriteLine(DateTime.Now.ToLongTimeString() + ": " + status);

                    if (status == "success")
                    {
                        // Download JSON file
                        webClient.DownloadFile(resultFileUrl, DestinationFile);

                        Console.WriteLine("Generated JSON file saved as \"{0}\" file.", DestinationFile);
                        break;
                    }
                    else if (status == "working")
                    {
                        // Pause for a few seconds
                        Thread.Sleep(3000);
                    }
                    else
                    {
                        Console.WriteLine(status);
                        break;
                    }
                }
                while (true);
            }
            else
            {
                Console.WriteLine(json["message"].ToString());
            }
        }
        else
        {
            Console.WriteLine(json["message"].ToString());
        }
    }
    catch (WebException e)
    {
        Console.WriteLine(e.ToString());
    }

    webClient.Dispose();
}

// Check PDF.co job status
static string CheckJobStatus(WebClient webClient, string jobId)
{
    string url = "https://api.pdf.co/v1/job/check?jobid=" + jobId;
    string response = webClient.DownloadString(url);
    JObject json = JObject.Parse(response);
    return Convert.ToString(json["status"]);
}

Source Code Analysis

The ParseWithPDFCoApi function takes two parameters: SourceFile, which contains the input PDF file path, and sampleTemplate, which contains the path to the YAML template file (its contents are read with File.ReadAllText).
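
As a quick orientation, a call to this function might look like the following minimal sketch; the file names are hypothetical placeholders, not part of the original sample:

// Minimal, hypothetical entry point invoking the sample function.
// Both file paths are placeholders for illustration only.
static void Main(string[] args)
{
    ParseWithPDFCoApi(@".\SampleBloodReport.pdf", @".\BloodTestTemplate.yml");
}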

The program uses three PDF.co endpoints: one to upload the source file to the PDF.co cloud, one to extract JSON data with the Document Parser, and one to check the status of the asynchronous job. At the start of the program, we prepare the parameters needed for the PDF.co requests. The following list summarizes them.

  • API_KEY: The PDF.co API key used to authenticate PDF.co requests. Every PDF.co request must include an “x-api-key” header carrying the API key (see the excerpt after this list).
  • Password: Useful for password-protected documents; the document password should be provided here. Leave it empty for unprotected documents.
  • DestinationFile: Specifies the destination file location for the generated JSON output.
  • Async: PDF.co can run jobs either in async mode or in traditional synchronous mode. Extracting data from a PDF may take some time, so we invoke this request in asynchronous mode by passing true here.
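
In the listing above, the API key is attached to the shared WebClient as a request header before any call is made (excerpt from the code):

// Create standard .NET web client instance
WebClient webClient = new WebClient();

// Every PDF.co request must carry the API key in the "x-api-key" header
webClient.Headers.Add("x-api-key", API_KEY);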

Uploading file to PDF.co Cloud

We upload the input source file to the PDF.co cloud, and for that we need a pre-signed URL. The following PDF.co endpoint is called with the input file name to get one:

https://api.pdf.co/v1/file/upload/get-presigned-url?contenttype=application/octet-stream&name=
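
In the listing above, this call and its response handling look like the following excerpt:

// 1. RETRIEVE THE PRESIGNED URL TO UPLOAD THE FILE.
// Prepare URL for `Get Presigned URL` API call
string query = Uri.EscapeUriString(string.Format(
    "https://api.pdf.co/v1/file/upload/get-presigned-url?contenttype=application/octet-stream&name={0}",
    Path.GetFileName(SourceFile)));

// Execute request
string response = webClient.DownloadString(query);

// Parse JSON response
JObject json = JObject.Parse(response);

// Get the URL to use for the file upload and the public URL of the uploaded file
string uploadUrl = json["presignedUrl"].ToString();
string uploadedFileUrl = json["url"].ToString();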

This request returns JSON containing a pre-signed URL (json["presignedUrl"]) as well as a public URL (json["url"]). The pre-signed URL is then used to upload the input file with a PUT request.

// 2. UPLOAD THE FILE TO CLOUD.
webClient.Headers.Add("content-type", "application/octet-stream");
webClient.UploadFile(uploadUrl, "PUT", SourceFile); // You can use UploadData() instead if your file is byte[] or Stream
webClient.Headers.Remove("content-type");

Making a request

The PDF.co endpoint /v1/pdf/documentparser is used to parse the input document and get the desired data. This POST request carries the source file URL, the async flag, and the template:

// URL for `Document Parser` API call
query = Uri.EscapeUriString(string.Format("https://api.pdf.co/v1/pdf/documentparser?url={0}&async={1}", uploadedFileUrl,Async));

Dictionary<string, string> requestBody = new Dictionary<string, string>();
requestBody.Add("template", templateText);

// Execute request
response = webClient.UploadString(query, "POST", JsonConvert.SerializeObject(requestBody));

The response contains the output URL (json["url"]) along with a job ID (json["jobId"]). Because we opted for async mode, instead of getting the result directly we get a URL where the result will appear. To check whether the result is ready, we call https://api.pdf.co/v1/job/check?jobid=. Possible job statuses are “working”, “failed”, “aborted”, and “success”.
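
The status check is performed by the small CheckJobStatus helper from the full listing, repeated here for reference:

// Check PDF.co job status
static string CheckJobStatus(WebClient webClient, string jobId)
{
    string url = "https://api.pdf.co/v1/job/check?jobid=" + jobId;
    string response = webClient.DownloadString(url);
    JObject json = JObject.Parse(response);
    return Convert.ToString(json["status"]);
}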

Once the job status turns to “success”, we download the result and write it to the destination file location.

if (status == "success")
{
    // Download JSON file
    webClient.DownloadFile(resultFileUrl, DestinationFile);

    Console.WriteLine("Generated JSON file saved as \"{0}\" file.", DestinationFile);
    break;
}

Output

The generated result.json for this sample blood report contains the three mapped fields: PatientName, ReportName (the parsed report date), and TestResults (the rows of the report table).

Also, the full source code for this program is available in our GitHub repository.

Summary

That's it! With a few lines of code, we've programmed our PDF parsing model. It can be used to parse all future requests for PDFs in the same format. If we ever need to parse a differently formatted PDF, all we have to do is create a new parsing template and use that as input, as the short sketch below shows. The code is generic!
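
As an illustration of that reuse, here is a minimal hypothetical call with a different template; the file names are placeholders, not files from this sample:

// Hypothetical reuse: same parsing code, different template (paths are placeholders).
ParseWithPDFCoApi(@".\SampleLipidReport.pdf", @".\LipidPanelTemplate.yml");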

In this article we've focused on using PDF.co to parse blood reports. ByteScout (the parent company of PDF.co) also provides an on-premises solution for the same task.

Please try this sample on your machine to get the most out of this article. Thank you for reading!