Amazon Textract uses machine learning to extract text and data, including from tables and forms, in virtually any document

Amazon Web Services Inc. (AWS), an Amazon.com company, announced Wednesday the general availability of Amazon Textract, a fully managed service that uses machine learning to automatically extract text and data, including from tables and forms, from virtually any document without the need for manual review, custom code, or machine learning experience.

Amazon Textract goes beyond simple optical character recognition (OCR) to identify the contents of fields in forms, information stored in tables, and the context in which the information is presented, such as a name or social security number from a tax form or the product SKU or quantity in a warehouse from an inventory report. The extracted text and data can be easily used to build smart searches on large archives of documents, or can be loaded into a database for use by applications, such as accounting, auditing, and compliance software.

Amazon Textract’s API supports multiple input types, including scans, PDFs, and photos. Customers can use it with database and analytics services like Amazon Elasticsearch Service, Amazon DynamoDB, and Amazon Athena, and with other machine learning services like Amazon Comprehend, Amazon Comprehend Medical, Amazon Translate, and Amazon SageMaker, to derive deeper meaning from the extracted text and data.

Amazon Textract takes scanned files stored in an Amazon S3 bucket, reads them, and returns data in the form of JSON text annotated with the page number, section, form labels, and data types. This data can then be used for a range of applications, such as generating smart search indexes, redacting text in a massive collection of forms, creating automated loan approval workflows, supporting regulatory compliance, and flagging fraud risk in insurance claims.
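To make the JSON output described above more concrete, here is a minimal Python sketch of how form labels and their values might be read out of a response. The sample data is hand-written and heavily abbreviated for illustration, not real service output; it assumes the block layout of the Textract AnalyzeDocument response, in which KEY_VALUE_SET blocks are linked to WORD blocks through relationship lists. The function names are hypothetical.

```python
from typing import Dict, List

# Hand-written, abbreviated sample shaped like an AnalyzeDocument response
# requested with the FORMS feature. Real responses carry more fields
# (geometry, confidence scores) and many more blocks.
FORM_SAMPLE = {
    "Blocks": [
        {"Id": "w1", "BlockType": "WORD", "Text": "Name:", "Page": 1},
        {"Id": "w2", "BlockType": "WORD", "Text": "Jane", "Page": 1},
        {"Id": "w3", "BlockType": "WORD", "Text": "Doe", "Page": 1},
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"], "Page": 1,
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                           {"Type": "CHILD", "Ids": ["w1"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"], "Page": 1,
         "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
    ]
}


def related_ids(block: dict, rel_type: str) -> List[str]:
    """Return the ids a block points to through relationships of one type."""
    return [
        block_id
        for rel in block.get("Relationships", [])
        if rel["Type"] == rel_type
        for block_id in rel["Ids"]
    ]


def child_text(block: dict, by_id: Dict[str, dict]) -> str:
    """Join the text of the WORD blocks that are children of a block."""
    children = (by_id[i] for i in related_ids(block, "CHILD"))
    return " ".join(c["Text"] for c in children if c["BlockType"] == "WORD")


def extract_form_fields(response: dict) -> Dict[str, str]:
    """Map each form label (KEY block) to the text of its linked VALUE blocks."""
    by_id = {b["Id"]: b for b in response["Blocks"]}
    fields = {}
    for block in response["Blocks"]:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if is_key:
            label = child_text(block, by_id)
            value = " ".join(
                child_text(by_id[v], by_id) for v in related_ids(block, "VALUE")
            )
            fields[label] = value
    return fields


if __name__ == "__main__":
    print(extract_form_fields(FORM_SAMPLE))
```

In practice the response would come from the service rather than a literal, for example from the boto3 Textract client's analyze_document call with a Document pointing at an S3Object and FeatureTypes set to ["FORMS"]. The bucket and file names would be your own.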

Many companies extract text and data from files such as contracts, expense reports, mortgage guarantees, fund prospectuses, tax documents, hospital claims, and patient forms through manual data entry or simple OCR software. This is a time-consuming and often inaccurate process that produces an output requiring extensive post-processing before it can be put in a format that is usable by other applications. That’s because existing OCR technologies are unable to recognize common layouts like forms and tables, and only generate a lengthy and often inaccurate text dump.

What organizations want instead is the ability to accurately identify and extract text and data from forms and tables in documents of any format and from a variety of file types and templates.

Amazon Textract analyzes virtually any type of document, automatically generating highly accurate text, form, and table data. Amazon Textract identifies text and data from tables and forms in documents – such as line items and totals from a photographed receipt, tax information from a W-2 form, or values from a table in a scanned inventory report – and recognizes a range of document formats, including those specific to financial services, insurance, and healthcare, without requiring any customization or human intervention.
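As an illustration of the table data mentioned above, the following Python sketch rebuilds a table as rows and columns. The sample is hand-written and abbreviated, not real service output; it assumes the AnalyzeDocument response layout in which a TABLE block lists CELL blocks as children and each CELL carries a RowIndex and ColumnIndex starting at 1. The function name extract_tables is hypothetical.

```python
from typing import Dict, List

# Hand-written, abbreviated sample shaped like an AnalyzeDocument response
# requested with the TABLES feature: a 2 x 2 table from an inventory report.
TABLE_SAMPLE = {
    "Blocks": [
        {"Id": "t1", "BlockType": "TABLE",
         "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
        {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
         "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
        {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
         "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
        {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
         "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
        {"Id": "c4", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2,
         "Relationships": [{"Type": "CHILD", "Ids": ["w4"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "SKU"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Qty"},
        {"Id": "w3", "BlockType": "WORD", "Text": "A-100"},
        {"Id": "w4", "BlockType": "WORD", "Text": "12"},
    ]
}


def _child_ids(block: dict) -> List[str]:
    """Return the ids listed in a block's CHILD relationships."""
    return [
        block_id
        for rel in block.get("Relationships", [])
        if rel["Type"] == "CHILD"
        for block_id in rel["Ids"]
    ]


def extract_tables(response: dict) -> List[List[List[str]]]:
    """Return each table as a list of rows, each row a list of cell strings."""
    by_id: Dict[str, dict] = {b["Id"]: b for b in response["Blocks"]}
    tables = []
    for block in response["Blocks"]:
        if block["BlockType"] != "TABLE":
            continue
        # Collect cell text keyed by its 1-based (row, column) position.
        cells = {}
        for cell_id in _child_ids(block):
            cell = by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = (by_id[i] for i in _child_ids(cell))
            text = " ".join(w["Text"] for w in words if w["BlockType"] == "WORD")
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = text
        if not cells:
            continue
        n_rows = max(row for row, _ in cells)
        n_cols = max(col for _, col in cells)
        # Missing positions become empty strings so every row has equal length.
        tables.append([
            [cells.get((r, c), "") for c in range(1, n_cols + 1)]
            for r in range(1, n_rows + 1)
        ])
    return tables


if __name__ == "__main__":
    for row in extract_tables(TABLE_SAMPLE)[0]:
        print(row)
```

Rows in this shape can be written out as CSV or inserted into a database table without further restructuring.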

Amazon Textract can provide the inputs required to automatically process forms without human intervention. For example, banks can automate loan applications using Amazon Textract. The information contained in the document could be used to initiate all of the necessary background and credit checks to approve the loan so that customers can get instant results of their application rather than having to wait several days for manual review and validation.

Amazon Textract makes it easy for customers to accurately process millions of document pages in just a few hours, significantly lowering document processing costs, and allowing customers to focus on deriving business value from their text and data instead of wasting time and effort on post-processing. Results are delivered via an API that can be easily accessed and used without requiring any machine learning experience.

“The power of Amazon Textract is that it accurately extracts text and structured data from virtually any document with no machine learning experience required. Subsequently, developers can analyze and query the extracted text and data using our database and analytics services like Amazon Elasticsearch Service, Amazon DynamoDB, and Amazon Athena and integrate with other machine learning services like Amazon Comprehend, Amazon Comprehend Medical, Amazon Translate, and Amazon SageMaker to help customers derive deeper meaning from the extracted text and data,” said Swami Sivasubramanian, Vice President, Amazon Machine Learning. “In addition to the integration with other AWS services, the rich partner community developing around Amazon Textract makes it possible for customers to gain real meaning from their file collections, operate more efficiently, improve security compliance, automate data entry, and facilitate faster business decisions.”

PwC helps organizations and individuals create value by delivering quality in assurance, tax, and advisory services.

“At PwC, we work to provide our customers with intelligent automation tools that help transform previously manual processes. We’ve integrated Amazon Textract into our solution for the pharmaceutical industry to automate document processing for various FDA forms like MedWatch and CIOMS,” said Siddhartha Bhattacharya of PwC. “Previously, people would manually review, edit, and process these forms, each one taking hours. Amazon Textract has proven to be the most efficient and accurate OCR solution available for these forms, extracting all of the relevant information for review and processing, and reducing time spent from hours down to minutes.”

TeraDact allows customers to transform stored images and paper documents into privacy-compliant, usable digital formats at scale.

“Amazon Textract’s smart docs platform feeds TeraDact’s patented redaction services to automatically remove and secure sensitive data. TeraDact customers can permanently remove this data so that it can never be recovered or opt to replace sensitive data with patented tokens which can be recovered by individuals with the appropriate permissions. This is particularly useful in complying with government mandates surrounding individual data privacy such as GDPR,” said Tom Trobridge, COO, TeraDact.

Ripcord’s mission is to digitize and extract knowledge from paper documents using vision-guided robotics, machine learning, and advanced AI. This knowledge automates business processes and workflows.

“We’ve had tremendous success utilizing Amazon Textract to augment our advanced entity extraction to benefit many industries and uncover $4 billion in new pay. We look forward to expanding our use of Amazon Textract across financial and government services, healthcare and legal,” said Alex Fielding, CEO of Ripcord.

Blue Prism develops Robotic Process Automation software to provide businesses and organizations with a more agile virtual workforce.

“Blue Prism’s connected-RPA can automate and perform mission-critical processes, allowing customers the freedom to focus on more creative, meaningful work. By using Amazon Textract, we’ve given our digital workforce another powerful tool for automation. Amazon Textract accurately analyzes data from various document types using machine learning, which enhances the digital transformation journey for our customers. Using additional AWS AI services like Amazon Comprehend and Amazon Rekognition, we can tackle challenges from added secure customer authentication processes to fraud detection capabilities. The intelligence and flexibility of Amazon Textract’s form data extraction can elevate OCR to new levels in industries like financial services, retail, manufacturing and transportation to name a few,” said Dave Moss, CTO and co-founder of Blue Prism.

Customers can load the data into business software, such as spreadsheets, databases, and payroll systems, or they can analyze and query the data using Amazon Elasticsearch Service, Amazon DynamoDB, Amazon Redshift, or Amazon Athena. Amazon Textract is available today in US East (Ohio), US East (N. Virginia), US West (Oregon), and EU (Ireland), and will expand to additional regions in the coming year.

