The Business Solutions Series is a compilation of solutions to various business challenges that I have encountered throughout my professional journey.
Context
We were engaged by a client that specializes in digitizing documents. The client sought a solution to enable their mobile users to capture physical documents using their phone cameras and fill the documents electronically through their mobile application.
Objective
The objective was to develop a model that could accurately detect all inputs on a document, including the precise location, size, and type of each input (e.g., name boxes, date boxes, signature boxes).
Solution
To achieve this objective, we used several of Google Cloud Platform's (GCP) managed machine-learning services, including AutoML, and built a scalable Python API (itself hosted on a GCP managed service) to coordinate the calls to all of them. The mobile application could then call this custom API every time a user captured a document with their phone camera.
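To make the shape of that API concrete, here is a minimal sketch of what such an endpoint could look like. The framework (Flask), the route, and the pipeline module are illustrative assumptions, not the project's actual code; the orchestration behind detect_inputs is sketched further below.

```python
# Hypothetical shape of the custom API (framework, route, and hosting are
# assumptions for illustration, not details from the actual project).
from flask import Flask, jsonify, request

from pipeline import detect_inputs  # hypothetical module; orchestration sketched below

app = Flask(__name__)


@app.route("/documents/analyze", methods=["POST"])
def analyze_document():
    # The mobile app posts the captured photo; the response lists every detected
    # input box with its location, size, and type.
    image_bytes = request.get_data()
    return jsonify({"inputs": detect_inputs(image_bytes)})
```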
The custom Python API received the document image as input and:
Used the GCP Vision API, pretrained by Google, to extract all text and text metadata from the document image (the metadata includes the precise coordinates of each word in the image).
In parallel, called a custom object detection model that we had trained and deployed as a service with the GCP AutoML Image managed service. This model detected all input boxes on the document image and returned their bounding boxes (coordinates).
Inserted a custom keyword into the document text for each detected input box, using the coordinates from both the Vision API output and the object detection output to place each keyword in the correct position.
Passed the modified text to a final custom AutoML Text model, trained to perform entity extraction, which classified each input box into its type based on the surrounding text. A rough sketch of how these calls fit together follows.
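This is a minimal sketch only, assuming the public google-cloud-vision and google-cloud-automl Python clients (whose exact surfaces vary across library versions). The model resource name and the merge_input_keywords and extract_entities helpers are placeholders rather than the project's actual code.

```python
# Rough sketch of the orchestration, not the production code.
from concurrent.futures import ThreadPoolExecutor

from google.cloud import automl, vision

# Placeholder: full resource name of the deployed AutoML object detection model,
# e.g. "projects/<project>/locations/us-central1/models/<model-id>".
OBJECT_DETECTION_MODEL = "projects/<project>/locations/us-central1/models/<model-id>"


def detect_text(image_bytes):
    """Run the Vision API OCR and return (word, bounding box) pairs."""
    client = vision.ImageAnnotatorClient()
    response = client.document_text_detection(image=vision.Image(content=image_bytes))
    # text_annotations[0] is the full document text; the remaining entries are
    # individual words, each with a bounding polygon in image coordinates.
    return [(w.description, w.bounding_poly.vertices) for w in response.text_annotations[1:]]


def detect_boxes(image_bytes):
    """Call the deployed AutoML object detection model for input-box coordinates."""
    client = automl.PredictionServiceClient()
    payload = automl.ExamplePayload(image=automl.Image(image_bytes=image_bytes))
    response = client.predict(name=OBJECT_DETECTION_MODEL, payload=payload)
    return [item.image_object_detection.bounding_box for item in response.payload]


def detect_inputs(image_bytes):
    # Fan out the two independent calls so OCR and box detection run in parallel.
    with ThreadPoolExecutor(max_workers=2) as pool:
        words_future = pool.submit(detect_text, image_bytes)
        boxes_future = pool.submit(detect_boxes, image_bytes)
        words, boxes = words_future.result(), boxes_future.result()

    # Splice an INPUT_BOX keyword into the text at each detected box
    # (a simplified merge_input_keywords is sketched after the worked example below).
    modified_text = merge_input_keywords(words, boxes)

    # extract_entities (not shown) would send the modified text to the deployed
    # AutoML Text entity extraction model and map its predictions back onto the boxes.
    return extract_entities(modified_text, boxes)
```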
For example:
Input document image:
Vision API output:
Text: “Sample Document Name Date Signature End Of Sample Document”
Metadata word coordinates: Sample <0, 0>, Document <9, 0>, Name <0, 4> …
Object detection model output:
Coordinates: BOX_1 <6, 4>, BOX_2 <21, 4>, BOX_3 <0, 7>
Modified text using coordinates from both outputs:
“Sample Document Name INPUT_BOX Date INPUT_BOX Signature INPUT_BOX End Of Sample Document”
Entity Extraction model output when providing the modified text:
INPUT_BOX[0] = name, INPUT_BOX[1] = date, INPUT_BOX[2] = signature
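The keyword-insertion step is essentially a merge of two coordinate-sorted streams. Here is a minimal sketch, assuming the simplified <x, y> coordinates used in the example above; the word coordinates the example elides are made up here just so the demo runs, and a production version would work with the full bounding polygons returned by the services and cluster words into lines with a pixel tolerance rather than an exact match.

```python
INPUT_KEYWORD = "INPUT_BOX"


def merge_input_keywords(words, boxes):
    """Splice an INPUT_BOX keyword into the OCR text at each detected box.

    words: (text, (x, y)) pairs from the Vision API output
    boxes: (x, y) coordinates of the detected input boxes
    Both streams are merged in reading order (top-to-bottom, left-to-right).
    """
    tokens = [(y, x, text) for text, (x, y) in words]
    tokens += [(y, x, INPUT_KEYWORD) for x, y in boxes]
    tokens.sort(key=lambda t: (t[0], t[1]))
    return " ".join(text for _, _, text in tokens)


# The example from the post; word coordinates not spelled out above are invented
# here so the demo runs end to end.
words = [
    ("Sample", (0, 0)), ("Document", (9, 0)),
    ("Name", (0, 4)), ("Date", (15, 4)), ("Signature", (30, 4)),
    ("End", (0, 9)), ("Of", (4, 9)), ("Sample", (7, 9)), ("Document", (16, 9)),
]
boxes = [(6, 4), (21, 4), (0, 7)]

print(merge_input_keywords(words, boxes))
# Sample Document Name INPUT_BOX Date INPUT_BOX Signature INPUT_BOX End Of Sample Document
```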
To train the object detection and entity extraction models, we first needed high-quality labeled data. We partnered with a data annotation company whose human annotators labeled thousands of document images, drawing a bounding box around each input box and assigning it an input type. This gave our models the examples they needed to learn from. Once the labeled data was ready, it was straightforward to load into the managed AutoML Image service and train an object detection model; the service's default settings and robust capabilities were enough to produce an effective model for this task.
It is worth noting that our initial attempt, before we brought in professional help, ran into problems: the model underperformed, and we traced the issue to inconsistent annotations in our own training data. Enlisting a professional data annotation vendor improved the quality of the training data and, in turn, the performance of the model the AutoML service produced.
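For reference, the AutoML Image service imports object-detection labels from a CSV of Cloud Storage image paths and normalized bounding-box coordinates. Below is a hedged sketch of the kind of conversion we would run on the annotator output; the annotation JSON layout is invented for illustration, and the exact CSV column order should be verified against the current AutoML documentation.

```python
# Convert human-annotated bounding boxes into an AutoML object-detection import CSV.
# The input JSON layout is a made-up stand-in for whatever the annotation vendor
# delivers; the CSV columns follow the documented "two opposite vertices" shorthand
# (coordinates normalized to [0, 1]), but double-check against the current docs.
import csv
import json


def annotations_to_csv(annotations_path, csv_path):
    with open(annotations_path) as f:
        records = json.load(f)

    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        for record in records:
            for box in record["boxes"]:
                writer.writerow([
                    "UNASSIGNED",                 # let AutoML split train/validation/test
                    record["gcs_uri"],            # e.g. gs://my-bucket/docs/form_001.jpg
                    box["label"],                 # name / date / signature / ...
                    box["x_min"], box["y_min"],   # top-left vertex
                    "", "",
                    box["x_max"], box["y_max"],   # bottom-right vertex
                    "", "",
                ])
```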
Similarly, we used the input type annotations to train the entity extraction model with the AutoML Text service, and the managed service again produced a model that handled this task effectively.
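For completeness, here is a hedged sketch of how a single training example for the entity extraction model could be assembled from the keyword-annotated text, using the labels from the worked example above. The JSONL field names follow the AutoML Text entity-extraction import format as I recall it; field names and offset semantics should be checked against the current documentation.

```python
# Build one entity-extraction training example (one JSONL line) from a
# keyword-annotated document text. Offsets here are character offsets over
# ASCII text; verify how your AutoML version counts offsets for non-ASCII input.
import json

text = ("Sample Document Name INPUT_BOX Date INPUT_BOX "
        "Signature INPUT_BOX End Of Sample Document")
labels = ["name", "date", "signature"]  # one label per INPUT_BOX, in order

annotations = []
start = 0
for label in labels:
    offset = text.index("INPUT_BOX", start)
    annotations.append({
        "display_name": label,
        "text_extraction": {
            "text_segment": {
                "start_offset": offset,
                "end_offset": offset + len("INPUT_BOX"),
            }
        },
    })
    start = offset + len("INPUT_BOX")

print(json.dumps({"annotations": annotations, "text_snippet": {"content": text}}))
```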
While there are multiple companies that offer labeling services for machine learning projects, I highly recommend Sama for their commitment to hiring individuals living below the poverty line in developing countries, primarily in Africa, and helping them improve their livelihoods. I have first-hand experience with Sama, as my spouse worked there for several years; I had the opportunity to visit their centers, meet their agents, and read the academic study about their impact.
Impact
As a result of this solution, the mobile application was successfully deployed and is now used by hundreds of thousands of users worldwide to digitize and fill in documents efficiently. The API has proven to be a valuable tool for streamlining the document digitization process.