Amazon Textract Aims to Extract Text From 'Virtually' Any Document

AWS has launched a new service that lets users automatically extract text, tables and other data from documents in formats such as JPEG, PNG and PDF, with prices starting at $1.50 per 1,000 pages for the first million pages.

Amazon Textract, which is powered by machine learning, became generally available late Wednesday. AWS name-checks The Globe and Mail, the Met Office, PwC, UiPath and others as early adopters of the service.

Organisations regularly have to extract text and data from an array of files like contracts or tax documents. Administrative employees in legal and healthcare offices have to process insurance, patient and employee forms; often this is done manually.

The service is built on machine learning and optical character recognition (OCR) software. Typically, OCR software has struggled to read data and text submitted in varying formats because it lacks contextual training. Textract’s API can extract text from multiple input formats such as PDFs, photos of documents and scanned material.

Files are stored in an Amazon S3 bucket, from which Textract reads and extracts the data before returning it in JSON format, annotated with the page number, section, data types and form labels. All extracted data is returned with bounding box coordinates that demarcate each piece of data from the others, allowing the user to quickly scan and identify individual words or numbers.
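To illustrate how the bounding box coordinates might be used, here is a minimal Python sketch that converts one box into pixel positions. It assumes, based on AWS's published response format rather than anything stated in this article, that each box is expressed as ratios of the page width and height under the keys Left, Top, Width and Height. The sample block and the to_pixel_box helper are hypothetical and exist only for illustration.

```python
from typing import Dict, Tuple

# Hypothetical block shaped like one element of a Textract JSON response.
# The text and coordinates are invented for this example.
sample_block = {
    "BlockType": "WORD",
    "Text": "Invoice",
    "Geometry": {
        "BoundingBox": {"Left": 0.125, "Top": 0.25, "Width": 0.25, "Height": 0.0625}
    },
}


def to_pixel_box(block: Dict, page_width: int, page_height: int) -> Tuple[int, int, int, int]:
    """Convert a ratio-based bounding box to (left, top, right, bottom) in pixels."""
    box = block["Geometry"]["BoundingBox"]
    left = round(box["Left"] * page_width)
    top = round(box["Top"] * page_height)
    right = round((box["Left"] + box["Width"]) * page_width)
    bottom = round((box["Top"] + box["Height"]) * page_height)
    return left, top, right, bottom


# For a page image of 1000 x 2000 pixels:
print(to_pixel_box(sample_block, 1000, 2000))  # (125, 500, 375, 625)
```

Keeping the coordinates as ratios until the last moment means the same response can be overlaid on a thumbnail or a full-resolution scan without re-running the extraction.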

Amazon engineers have trained the Textract machine learning tool on millions of documents so that it can recognise ‘virtually’ any type of document submitted for processing and extract its text and data.

APIs Analyse and Extract Text With Context

Amazon Textract uses APIs to detect and extract text from each submitted document; the APIs can also extract structured data in the form of tables.

The document analysis API can detect text, fields, values and tables, along with their contextual relationships. The operation returns three categories of extracted content: text, forms and tables. Form data is returned as key-value pairs. If the API encounters text such as ‘Name: Jane Doe’, it will identify a key, ‘Name’, and a value, ‘Jane Doe’.
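The key-value pairing described above can be sketched in Python against a hand-written sample. This is an illustration under assumptions, not AWS's own code: it assumes the block layout in AWS's published response format, where a KEY block points to its VALUE block through a relationship of type VALUE, both point to their words through relationships of type CHILD, and the key/value distinction sits in a list field named EntityTypes. The sample data and helper names are hypothetical.

```python
from typing import Dict, List

# Hypothetical blocks for a form containing the text "Name: Jane Doe".
sample_blocks = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Name:"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Jane"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Doe"},
]


def related_ids(block: Dict, rel_type: str) -> List[str]:
    """Return the Ids a block links to through relationships of one type."""
    return [i for rel in block.get("Relationships", [])
            if rel["Type"] == rel_type for i in rel["Ids"]]


def child_text(block: Dict, by_id: Dict[str, Dict]) -> str:
    """Join the text of the WORD blocks that are children of a block."""
    children = (by_id[i] for i in related_ids(block, "CHILD"))
    return " ".join(c["Text"] for c in children if c["BlockType"] == "WORD")


def key_value_pairs(blocks: List[Dict]) -> Dict[str, str]:
    """Map each detected form key to the text of its linked value."""
    by_id = {b["Id"]: b for b in blocks}
    pairs = {}
    for block in blocks:
        if block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", []):
            values = (by_id[i] for i in related_ids(block, "VALUE"))
            pairs[child_text(block, by_id)] = " ".join(child_text(v, by_id) for v in values)
    return pairs


print(key_value_pairs(sample_blocks))  # {'Name:': 'Jane Doe'}
```

Note that the key keeps whatever punctuation was detected on the page, so a caller wanting ‘Name’ rather than ‘Name:’ would need to strip the trailing colon.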

Data Operations Analysis

The text and data extraction operations return the following types of Block object:

PAGE – Contains a list of child Block objects that are detected on a document page.
KEY_VALUE_SET – Stores the KEY and VALUE Block objects for linked text that’s detected on a document page. Use the EntityType field to determine if a KEY_VALUE_SET object is a KEY Block object or a VALUE Block object.
WORD – A word that’s detected on a document page. A word is one or more ISO basic Latin script characters that aren’t separated by spaces.
LINE – A string of tab-delimited, contiguous words that are detected on a document page.
TABLE – A table that’s detected on a document page. A table is grid-based information with two or more rows or columns, with a cell span of one row and one column each.
CELL – A cell within a detected table. The cell is the parent of the block that contains the text in the cell.
SELECTION_ELEMENT – A selection element such as an option button (radio button) or a check box that’s detected on a document page. Use the value of SelectionStatus to determine the status of the selection element.
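As a rough illustration of how these block types fit together, the Python sketch below rebuilds a table as a list of rows from TABLE, CELL and WORD blocks. It assumes, following AWS's published response format rather than this article, that blocks reference their children through relationships of type CHILD and that each CELL carries 1-based RowIndex and ColumnIndex fields. The sample data and the table_rows helper are hypothetical.

```python
from typing import Dict, List

# Hypothetical blocks for a 2 x 2 table whose last cell is empty.
sample_blocks = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}]},
    {"Id": "c4", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2},
    {"Id": "w1", "BlockType": "WORD", "Text": "Item"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Price"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Blue"},
    {"Id": "w4", "BlockType": "WORD", "Text": "pen"},
]


def children(block: Dict, by_id: Dict[str, Dict]) -> List[Dict]:
    """Return the blocks a block links to through CHILD relationships."""
    return [by_id[i] for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD" for i in rel["Ids"]]


def table_rows(table: Dict, by_id: Dict[str, Dict]) -> List[List[str]]:
    """Rebuild a detected table as a list of rows of cell text."""
    cells = [c for c in children(table, by_id) if c["BlockType"] == "CELL"]
    if not cells:
        return []
    n_rows = max(c["RowIndex"] for c in cells)
    n_cols = max(c["ColumnIndex"] for c in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        words = [w["Text"] for w in children(cell, by_id) if w["BlockType"] == "WORD"]
        # RowIndex and ColumnIndex are assumed to start at 1.
        grid[cell["RowIndex"] - 1][cell["ColumnIndex"] - 1] = " ".join(words)
    return grid


by_id = {b["Id"]: b for b in sample_blocks}
print(table_rows(by_id["t1"], by_id))  # [['Item', 'Price'], ['Blue pen', '']]
```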

Once a scan is complete, the tool returns a confidence score for each element it has extracted, indicating how confident it is that the data has been read correctly. The scoring can be used to flag any results that fall below a set confidence percentage; when reading an official payslip, for instance, the tool can be instructed to alert the user to any confidence score below 90 percent.
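A threshold check of this kind takes only a few lines of Python. The sketch below assumes each block carries a Confidence field on a 0 to 100 scale, as in AWS's published response format; the sample payslip values and the flag_low_confidence helper are hypothetical.

```python
from typing import Dict, List

# Hypothetical words read from a payslip, with invented confidence scores.
sample_blocks = [
    {"BlockType": "WORD", "Text": "Gross", "Confidence": 99.2},
    {"BlockType": "WORD", "Text": "2,450.00", "Confidence": 87.5},
    {"BlockType": "WORD", "Text": "Net", "Confidence": 98.7},
    {"BlockType": "WORD", "Text": "1,980.40", "Confidence": 90.0},
]


def flag_low_confidence(blocks: List[Dict], threshold: float = 90.0) -> List[Dict]:
    """Return the blocks whose confidence score is below the threshold.

    Blocks without a Confidence field are skipped rather than flagged.
    """
    return [b for b in blocks if "Confidence" in b and b["Confidence"] < threshold]


for block in flag_low_confidence(sample_blocks):
    print(f"Review needed: {block['Text']!r} ({block['Confidence']}%)")
```

With the sample data only ‘2,450.00’ is flagged; a score of exactly 90 passes because the comparison is strictly below the threshold.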

In terms of text and language, the service is restricted to Latin-script characters from the English alphabet, as well as ASCII (the American Standard Code for Information Interchange) characters, such as SOH for ‘start of heading’ or STX for ‘start of text’.

Pricing for the service begins at $0.0015 per page (a typical page can contain up to 3,000 words), so 1,000 pages cost $1.50. For volumes above a million pages, the rate drops to $0.0006 per page, or $0.60 per 1,000 pages.
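The arithmetic can be checked with a short Python sketch. It uses the two rates quoted above and assumes the lower rate applies only to pages beyond the first million, which is how the launch pricing is described at the top of this article; the estimate_cost function is a hypothetical helper, not part of any AWS tool.

```python
from decimal import Decimal

FIRST_TIER_PAGES = 1_000_000
FIRST_TIER_RATE = Decimal("0.0015")   # dollars per page, first million pages
SECOND_TIER_RATE = Decimal("0.0006")  # dollars per page beyond one million


def estimate_cost(pages: int) -> Decimal:
    """Estimate the cost in dollars, assuming marginal (tiered) pricing."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    first_tier = min(pages, FIRST_TIER_PAGES)
    second_tier = max(pages - FIRST_TIER_PAGES, 0)
    return first_tier * FIRST_TIER_RATE + second_tier * SECOND_TIER_RATE


print(estimate_cost(1_000))      # 1.5000
print(estimate_cost(1_500_000))  # 1800.0000
```

Decimal is used instead of float so that fractions of a cent add up exactly. Under the tiered assumption, 1.5 million pages cost $1,500 for the first million plus $300 for the remaining 500,000.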

Philip Brohan, climate scientist at the UK’s Met Office, said: “We hope to use Amazon Textract to digitize millions of historical weather observations from document archives. Making these observations available to science will improve our understanding of climate variability and change.”
