About Spark

Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. It achieves high performance for both batch and streaming data by combining a DAG scheduler, a query optimizer, and a physical execution engine. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

Read more about Spark at https://spark.apache.org/
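
To make "implicit data parallelism" more concrete, here is a minimal sketch of the partition, map, and reduce pattern that frameworks like Spark automate across a cluster. It deliberately uses only the Python standard library and is not Spark's API: the function names (split_into_partitions, count_words, merge_counts, parallel_word_count) are hypothetical, threads stand in for cluster workers, and fault tolerance is not modeled at all.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_partitions(records, num_partitions):
    """Deal records round-robin into num_partitions lists."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words(partition):
    """Map step: count the words within a single partition."""
    counts = {}
    for line in partition:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial word counts into a new dict."""
    merged = dict(left)
    for word, n in right.items():
        merged[word] = merged.get(word, 0) + n
    return merged


def parallel_word_count(lines, num_partitions=4):
    """Count words by processing each partition on its own worker thread."""
    partitions = split_into_partitions(lines, num_partitions)
    # Threads are a stand-in for cluster executors; this shows the shape
    # of the computation, not real distributed execution.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(pool.map(count_words, partitions))
    return reduce(merge_counts, partial_counts, {})
```

Calling parallel_word_count(["a b a", "b c", "a"], num_partitions=2) splits the three lines into two partitions, counts each one independently, and merges the partial results. In Spark, the partitioning, scheduling, and recovery from failed workers are handled by the framework rather than written by hand.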

What is PDF.co?

PDF.co is a secure and scalable data extraction API service with a full set of PDF tools included.

Benefits:

  • Decrease data entry costs using AI-powered unstructured data extraction from PDF and scanned documents, invoices, reports, receipts, and agreements;
  • Read from complex documents using the customizable Document Parser, which supports automatic reading from tables, PDF forms, and mixed-content documents;
  • Save time on preparing documents with PDF filler functionality that can add text, images, and fields to PDF forms and PDF documents;
  • Leverage the power of a full set of built-in PDF tools: split PDF, merge PDF, delete pages, and advanced HTML to PDF generation;
  • Get detailed API logs, available for Enterprise users with audit log requirements;
  • Use on-premises and offline versions, available for Enterprise users.
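
As a rough illustration of how one of these tools (merge PDF) might be called over HTTP, the sketch below builds a request without sending it. Everything specific here is an assumption to verify against the PDF.co API documentation before use: the base URL https://api.pdf.co/v1, the /pdf/merge path, the x-api-key header, and the url and name JSON fields. The helper name build_merge_request is hypothetical.

```python
import json
import urllib.request

# Assumed base URL; confirm against the PDF.co API documentation.
PDFCO_BASE_URL = "https://api.pdf.co/v1"


def build_merge_request(api_key, source_urls, output_name="merged.pdf"):
    """Build (but do not send) a POST request asking to merge several PDFs.

    The endpoint path, header name, and payload fields are assumptions.
    """
    payload = {
        "url": ",".join(source_urls),  # assumed: comma-separated source URLs
        "name": output_name,           # assumed: name of the output file
    }
    return urllib.request.Request(
        url=f"{PDFCO_BASE_URL}/pdf/merge",
        data=json.dumps(payload).encode("utf-8"),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
```

Keeping request construction separate from sending makes the sketch easy to inspect and test offline; passing the returned object to urllib.request.urlopen would perform the actual call, given a valid API key.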

Security

  • All documents and files processed by PDF.co are encrypted at rest using AES 256-bit encryption;
  • PDF.co relies on TLS and SSL to transmit data and files (the same security protocols that banks use);
  • PDF.co runs on award-winning, secure, certified Amazon Web Services (AWS) infrastructure: https://pdf.co/security

Spark and PDF.co Integration

To start, please use the button below:

Set up Spark + PDF.co