
Automate Annotations For Vertex AI Text Datasets With Cloud Vision API And BigQuery

  • aster.cloud
  • June 28, 2022
  • 3 minute read

One of the main challenges machine learning practitioners face is the scarcity of annotated training datasets. In many cases, however, practitioners have access to existing datasets whose fields were extracted manually, and they can use these to accelerate model training.

In this post, we demonstrate how Google Cloud AI/ML products can be used to train a text entity extraction model for patent application PDFs. We use BigQuery, Vision API, and Jupyter Notebook to automatically annotate an existing dataset used for model training. Although we won’t go into the details of each step, you can check the complete version in this Jupyter Notebook, which is released as part of the Vertex AI Samples GitHub repository.



The sample dataset

The dataset used in this example is the Patent PDF Samples with Extracted Structured Data from the BigQuery public datasets. It contains links to PDFs of the first page of a subset of US and EU patents, stored in Google Cloud Storage. The dataset also contains labels for multiple patent entities, including the application number, patent inventor, and publication date. Because every document already comes with its extracted entity values, the dataset is a good fit for our next step.

Preprocessing PDF documents using Cloud Vision API

At the time of writing, Vertex AI AutoML entity extraction supports only text data for training datasets, so our first step is to convert the PDF files to text. Cloud Vision API offers a text detection feature that uses Optical Character Recognition (OCR) to detect and extract text from PDF and TIFF files. It also offers a batch operation mode that allows us to process multiple files at once.
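As a rough illustration of the batch mode, the sketch below builds a request body in the shape used by the Vision API `files:asyncBatchAnnotate` REST method, with one entry per PDF. It only constructs the JSON and does not call the API. The bucket names, file names, and the helper functions `build_ocr_request` and `build_batch_body` are hypothetical and are not taken from the notebook.

```python
import json


def build_ocr_request(pdf_uri, output_prefix, batch_size=1):
    """Build one request entry that asks Vision API to OCR a single PDF."""
    return {
        "inputConfig": {
            "gcsSource": {"uri": pdf_uri},
            "mimeType": "application/pdf",
        },
        # DOCUMENT_TEXT_DETECTION is the OCR feature intended for dense text.
        "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
        "outputConfig": {
            # Vision writes its JSON results under this Cloud Storage prefix.
            "gcsDestination": {"uri": output_prefix},
            # Number of page responses written to each output JSON file.
            "batchSize": batch_size,
        },
    }


def build_batch_body(pdf_uris, output_root):
    """Build the full request body, using one output folder per input file."""
    root = output_root.rstrip("/")
    requests = []
    for uri in pdf_uris:
        # "gs://bucket/path/us_001.pdf" -> "us_001"
        name = uri.rsplit("/", 1)[-1].rsplit(".", 1)[0]
        requests.append(build_ocr_request(uri, f"{root}/{name}/"))
    return {"requests": requests}


# Hypothetical input files and output location, for illustration only.
example_body = build_batch_body(
    ["gs://example-bucket/patents/us_001.pdf",
     "gs://example-bucket/patents/ep_002.pdf"],
    "gs://example-bucket/ocr/",
)
print(json.dumps(example_body, indent=2))
```

Writing each file's results to its own output prefix keeps it simple to match the OCR text back to the source PDF later on.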

Preparing the training dataset

Vertex AI offers multiple ways to upload our training dataset. The most convenient choice in our case is to include annotations as part of the import process using an import file. The import file follows a specific format that specifies the content and the list of annotations for each label we want to train.
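To make the import file more concrete, here is a minimal sketch of how one line of it could be produced. It assumes the JSON Lines layout for text entity extraction, where each record holds the document text in `textContent` and a list of `textSegmentAnnotations` with `startOffset`, `endOffset`, and `displayName`. The end offset is treated as exclusive here, which is an assumption worth checking against the Vertex AI documentation. The helper `make_import_line` and the sample text are hypothetical.

```python
import json


def make_import_line(text, annotations):
    """Serialize one document and its annotations as a single JSON line.

    `annotations` is an iterable of (start, end, label) tuples, with `end`
    treated as exclusive (an assumption, see the note above).
    """
    segments = []
    for start, end, label in annotations:
        # Reject spans that fall outside the text or are empty.
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"invalid span ({start}, {end}) for label {label!r}")
        segments.append(
            {"startOffset": start, "endOffset": end, "displayName": label}
        )
    record = {"textContent": text, "textSegmentAnnotations": segments}
    return json.dumps(record, ensure_ascii=False)


# Hypothetical document text with a single labeled entity.
sample_text = "Inventor: Jane Doe"
sample_line = make_import_line(sample_text, [(10, 18, "inventor")])
print(sample_line)
```

Each document becomes one line in the file, so the complete import file is simply these lines joined with newline characters.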

To generate the annotations, we are going to query the existing data stored in BigQuery and find the location of extracted entities in each file. If an entity has multiple occurrences in the text, all of the occurrences are included in the annotations. We will then export the annotations in JSON Lines format to a file in Google Cloud Storage and use that file in our model training. We can also review the annotated dataset in the Google Cloud Console to ensure the accuracy of the annotations.
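The occurrence search described above can be sketched with plain string matching. The example assumes that the entity values stored in BigQuery appear verbatim in the OCR text, which may not hold for every document (for example, when OCR introduces spacing or character errors). The function names, label names, and sample text are hypothetical.

```python
def find_occurrences(text, value):
    """Return (start, end) offsets of every non-overlapping match of value."""
    spans = []
    if not value:
        # Skip labels with no extracted value (None or empty string).
        return spans
    start = text.find(value)
    while start != -1:
        end = start + len(value)
        spans.append((start, end))
        # Continue searching after the current match.
        start = text.find(value, end)
    return spans


def annotate(text, labels):
    """Return (start, end, label) tuples for all labeled values, in text order.

    `labels` maps a label name to the value already extracted for it.
    """
    annotations = []
    for label, value in labels.items():
        for start, end in find_occurrences(text, value):
            annotations.append((start, end, label))
    return sorted(annotations)


# Hypothetical OCR text in which the inventor name appears twice.
ocr_text = "Application No. 12/345,678\nInventor: Jane Doe\nJane Doe, Springfield"
found = annotate(
    ocr_text,
    {"application_number": "12/345,678", "inventor": "Jane Doe", "publication_date": None},
)
for start, end, label in found:
    print(label, start, end, repr(ocr_text[start:end]))
```

Because every occurrence is kept, a value that appears in both the header and the body of a page yields two annotations with the same label.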

Training the model

Once the import file is ready, we can create a new text dataset in Vertex AI and use that dataset to train a new entity extraction model. Training takes a few hours, after which the model is ready for deployment and testing.

Evaluating the model

Once model training is complete, you can review the model’s evaluation results in the Google Cloud Console. The Vertex AI documentation explains in more detail how to evaluate AutoML models.

Model evaluation results

Putting it all together

The diagram below shows the various components used to build the complete solution and how they interact with each other.


Solution diagram
Note: This diagram was created using the free Google Cloud Architecture Diagramming Tool, which makes it easy to document your Google Cloud architecture. Check it out and begin using it for your own project!

Summary

In this post, we’ve learned how to train a Vertex AI text entity extraction model by using BigQuery and Vision API to annotate ground truth data. With this approach, you can reuse existing datasets instead of annotating documents by hand, which helps accelerate your AI/ML journey.

Next Steps

You can try this solution by using this Jupyter Notebook. You can run this notebook on your machine, in Colab or in Vertex AI Workbench. You can also check out the Vertex AI Samples GitHub repository for more examples on developing and managing machine learning workflows using Google Cloud Vertex AI.

And if you’d like to review more of the latest tool set from Google Cloud for ML practitioners, you can watch recordings of the second Google Cloud Applied ML Summit. Catch up on the latest product announcements, insights from experts, and customer stories that can help you grow your skills at the pace of innovation.

We wish you a happy machine learning journey!

Special thanks to Karl Weinmeister, Andrew Ferlitsch and Daniel Wang for their help in reviewing this post’s content, and to Terrie Pugh for her editorial support. You rock!

 

By: Mohammad Al-Ansari (Customer Engineer, Infrastructure Modernization (GCloud Customers))
Source: Google Cloud Blog

