AWS has launched a new service that lets users automatically extract text, tables and other data from documents in formats including JPEG, PNG and PDF, with prices starting at $1.50 per 1,000 pages for the first million pages.
Amazon Textract, which is powered by machine learning, became generally available late Wednesday. AWS name-checks The Globe and Mail, the Met Office, PwC, UiPath, and more as early adopters of the service.
Organisations regularly have to extract text and data from an array of files like contracts or tax documents. Administrative employees in legal and healthcare offices have to process insurance, patient and employee forms; often this is done manually.
The service is built on machine learning and optical character recognition (OCR) software. Typically, OCR software has struggled to read data and text submitted in varying formats due to a lack of contextual training. Textract’s API is able to extract text from multiple kinds of input, such as PDFs, photos of documents and scanned material.
Files are stored in an Amazon S3 bucket, from which Textract reads and extracts the data before returning it in JSON format, annotated with the page number, section, data types and form labels. All extracted data is returned with bounding box coordinates, which demarcate each piece of data from the others, allowing the user to quickly locate individual words or numbers on the page.
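To make the bounding box idea concrete, here is a minimal Python sketch. It assumes the coordinates arrive as ratios of the page's width and height under a Geometry.BoundingBox entry with Left, Top, Width and Height keys; the sample block, the helper name bbox_to_pixels and the page size are invented for illustration and are not taken from AWS's announcement.

```python
def bbox_to_pixels(bbox, page_width, page_height):
    """Convert a ratio-based bounding box into pixel corners (left, top, right, bottom)."""
    left = round(bbox["Left"] * page_width)
    top = round(bbox["Top"] * page_height)
    right = round((bbox["Left"] + bbox["Width"]) * page_width)
    bottom = round((bbox["Top"] + bbox["Height"]) * page_height)
    return left, top, right, bottom


# Hypothetical block, shaped like one entry of the returned JSON (assumption).
sample_block = {
    "BlockType": "WORD",
    "Text": "Invoice",
    "Geometry": {
        "BoundingBox": {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.125}
    },
}

# Assume the source image is 1000 x 800 pixels.
box = bbox_to_pixels(sample_block["Geometry"]["BoundingBox"], 1000, 800)
print(sample_block["Text"], box)
```

The pixel corners could then be used to draw a rectangle over the original scan or to crop out a single field for review.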
Amazon engineers have trained the Textract machine learning tool on millions of documents so that it is capable of recognising ‘virtually’ any type of document submitted for processing, allowing it to easily extract text and data.
Swami Sivasubramanian, VP of Amazon’s Machine Learning division, commented in a release: “No machine learning experience [is] required… Subsequently, developers can analyze and query the extracted text and data using our database and analytics services like Amazon Elasticsearch Service, Amazon DynamoDB, and Amazon Athena and integrate with other machine learning services…to help customers derive deeper meaning from the extracted text and data.”
San Francisco-based Informed, Inc. automates how financial institutions originate loans and open bank accounts. CEO Justin Wickett said: “We have already used Amazon Textract to analyze tens of thousands of loan documents on behalf of financial institutions, and our own software-as-a-service offering has been enhanced by the service, enabling us to identify 95 percent of the defects in loan application packages and help banks reduce their manual data entry.”
APIs Analyse and Extract Text With Context
Amazon Textract uses APIs to detect and extract text from each of the submitted documents; the APIs can also extract structured data in the form of tables.
The document analysis API can detect text, fields, values and tables, along with their contextual relationships. The operation returns three categories of extracted data: text, forms and tables. Form data is returned as key-value pairs. If the API encounters text such as ‘Name: Jane Doe’ it will identify a key, ‘Name’, and a value, ‘Jane Doe’.
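As a rough illustration of how those key-value pairs might be stitched back together, the Python sketch below walks a hand-written list of blocks. It assumes each block carries an Id, a BlockType, and a Relationships list whose entries have a Type of VALUE or CHILD, and that KEY and VALUE blocks are distinguished by a list-valued EntityTypes field. The sample data and the helper names child_text and key_value_pairs are hypothetical.

```python
def child_text(block, by_id):
    """Join the Text of every WORD block listed as a CHILD of `block`."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = by_id[child_id]
            if child["BlockType"] == "WORD":
                words.append(child["Text"])
    return " ".join(words)


def key_value_pairs(blocks):
    """Map each KEY block's text to the text of the VALUE block(s) it links to."""
    by_id = {b["Id"]: b for b in blocks}
    pairs = {}
    for block in blocks:
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue
        value_parts = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                value_parts.extend(child_text(by_id[i], by_id) for i in rel["Ids"])
        pairs[child_text(block, by_id)] = " ".join(value_parts)
    return pairs


# Hand-written sample resembling the 'Name: Jane Doe' case (assumed shape).
sample_blocks = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Name:"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Jane"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Doe"},
]

pairs = key_value_pairs(sample_blocks)
print(pairs)
```

Note that the key keeps whatever punctuation was on the page, so a caller would likely want to strip trailing colons before using it as a field name.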
Data Operations Analysis
The text detection and document analysis operations return the following types of Block object:
- PAGE – Contains a list of child Block objects that are detected on a document page.
- KEY_VALUE_SET – Stores the KEY and VALUE Block objects for linked text that’s detected on a document page. Use the EntityType field to determine if a KEY_VALUE_SET object is a KEY Block object or a VALUE Block object.
- WORD – A word that’s detected on a document page. A word is one or more ISO basic Latin script characters that aren’t separated by spaces.
- LINE – A string of tab-delimited, contiguous words that are detected on a document page.
- TABLE – A table that’s detected on a document page. A table is grid-based information with two or more rows or columns, with a cell span of one row and one column each.
- CELL – A cell within a detected table. The cell is the parent of the block that contains the text in the cell.
- SELECTION_ELEMENT – A selection element such as an option button (radio button) or a check box that’s detected on a document page. Use the value of SelectionStatus to determine the status of the selection element.
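The TABLE and CELL types in the list above lend themselves to a short example. The Python sketch below rebuilds a grid of strings from a hand-written set of blocks; it assumes that a TABLE block lists its CELL blocks as CHILD relationships, that each CELL carries 1-based RowIndex and ColumnIndex fields, and that the words inside a cell are its CHILD blocks. The sample data and function names are invented for illustration.

```python
def cell_text(cell, by_id):
    """Join the WORD children of a CELL block into one string."""
    words = []
    for rel in cell.get("Relationships", []):
        if rel["Type"] == "CHILD":
            words.extend(
                by_id[i]["Text"] for i in rel["Ids"]
                if by_id[i]["BlockType"] == "WORD"
            )
    return " ".join(words)


def table_to_rows(table, by_id):
    """Rebuild a TABLE block as a list of rows, each a list of cell strings."""
    cells = [
        by_id[i]
        for rel in table.get("Relationships", [])
        if rel["Type"] == "CHILD"
        for i in rel["Ids"]
        if by_id[i]["BlockType"] == "CELL"
    ]
    if not cells:
        return []
    n_rows = max(c["RowIndex"] for c in cells)
    n_cols = max(c["ColumnIndex"] for c in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        # RowIndex and ColumnIndex are assumed to start at 1.
        grid[c["RowIndex"] - 1][c["ColumnIndex"] - 1] = cell_text(c, by_id)
    return grid


# Hand-written 2 x 2 table (assumed shape, not real service output).
sample_blocks = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "c4", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w4"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Item"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Price"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Pen"},
    {"Id": "w4", "BlockType": "WORD", "Text": "1.20"},
]

by_id = {b["Id"]: b for b in sample_blocks}
rows = table_to_rows(by_id["t1"], by_id)
print(rows)
```

Cells with no detected words are left as empty strings, which keeps every row the same length when the grid is written out to CSV or a spreadsheet.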
Once a scan is completed, the tool returns a confidence score for each element, indicating how confident it is that it has read the data correctly. The scoring system can be used to flag any results that fall below a set confidence percentage; for instance, when reading an official payslip, the tool can be instructed to alert the user to any confidence scores below 90 percent.
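A threshold check of this kind takes only a few lines. The Python sketch below assumes each returned block carries a Confidence value on a 0 to 100 scale; the payslip words, the scores and the function name flag_low_confidence are made up for illustration.

```python
def flag_low_confidence(blocks, threshold=90.0):
    """Return (type, text, score) for every block scoring below the threshold."""
    flagged = []
    for block in blocks:
        score = block.get("Confidence")
        if score is not None and score < threshold:
            flagged.append((block["BlockType"], block.get("Text", ""), score))
    return flagged


# Hypothetical words read from a payslip; the last one is a likely misread.
sample_blocks = [
    {"BlockType": "WORD", "Text": "Gross", "Confidence": 99.2},
    {"BlockType": "WORD", "Text": "Pay", "Confidence": 98.7},
    {"BlockType": "WORD", "Text": "1,2S0.00", "Confidence": 71.4},
]

flagged = flag_low_confidence(sample_blocks, threshold=90.0)
for block_type, text, score in flagged:
    print(f"Review {block_type} {text!r}: confidence {score:.1f}")
```

In practice the flagged items would be routed to a person for manual checking, while everything above the threshold passes straight through.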
In terms of text and language, it is restricted to Latin-script characters from the English alphabet, as well as ASCII, the American Standard Code for Information Interchange, which also defines control codes such as SOH for ‘start of heading’ and STX for ‘start of text’.
Pricing for the service begins at $0.0015 per page (a typical page can contain up to 3,000 words), so 1,000 pages will cost $1.50. If you are submitting more than a million pages, the rate drops to $0.0006 per page, or $0.60 per 1,000 pages.
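The arithmetic can be checked with a small Python sketch. It assumes the pricing is tiered, so that the first million pages are billed at the higher rate and only pages beyond that at the lower one; if the lower rate instead applied to the whole volume, the totals would differ. The function name estimate_cost is hypothetical, and the figures are the ones quoted above.

```python
from decimal import Decimal

FIRST_TIER_PAGES = 1_000_000          # pages billed at the starting rate
FIRST_TIER_RATE = Decimal("0.0015")   # dollars per page, first million pages
OVER_TIER_RATE = Decimal("0.0006")    # dollars per page beyond one million


def estimate_cost(pages):
    """Estimate the cost in dollars, assuming tiered (not flat) pricing."""
    first = min(pages, FIRST_TIER_PAGES)
    rest = max(pages - FIRST_TIER_PAGES, 0)
    return first * FIRST_TIER_RATE + rest * OVER_TIER_RATE


for volume in (1_000, 1_000_000, 1_500_000):
    print(f"{volume:>9,} pages: ${estimate_cost(volume):,.2f}")
```

Decimal is used rather than floating point so that fractions of a cent do not accumulate rounding error across millions of pages.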
Philip Brohan, Climate Scientist at the UK’s Met Office stated that: “We hope to use Amazon Textract to digitize millions of historical weather observations from document archives. Making these observations available to science will improve our understanding of climate variability and change.”