Agentic OCR: Gemini 3 Flash revolutionizing text extraction

Google has just announced a step forward in intelligent document processing with the introduction of Agentic Vision for Gemini 3 Flash. This new capability improves something I have been watching and testing for some time: document digitalization, better known as optical character recognition (OCR). It moves OCR from a passive, static scan to an active, investigative process that can zoom, rotate, and perform whatever steps are needed to understand your paper documents.

At the end of this text I will provide an example of how to start building with Agentic Vision and Gemini 3 Flash, but the whole article is worth reading to see what is even possible.

Just imagine how much paperwork is still part of everyday business, and how many people sit in front of computers acting as nothing more than an interface, typing data into forms. This dull kind of human interface can be replaced by AI much sooner than, for example, programmers. Those people can have their hands free to focus on tasks where creativity and edge-case judgment are really needed. The market is massive.

By combining visual reasoning with code execution, Agentic Vision lets the model formulate plans to zoom in, inspect, and manipulate images step by step. This significantly improves its ability to turn paper into digital, structured data.

According to Google, enabling code execution with Gemini 3 Flash delivers a consistent 5-10% quality boost across most vision benchmarks, with even higher gains in complex OCR tasks. In my experience, even the static model was already great at OCR. I used to switch to the Pro model when Flash failed, but these days most tasks are handled by the Flash model alone.

This covers everything from easy use cases like bills and simple office documents up to very sophisticated invoices and financial documents.

Just take a picture, provide a structure that your downstream financial software accepts, export the structured data, and even describe the logic Gemini should perform over the extracted data. This is power, super power. Now Agentic Vision can also zoom and take extra steps to be even more precise about details. In a practical workflow, the harder task is minimizing errors and building a user-friendly service around this.

The problem with static OCR

Traditional frontier AI models process a document in a single pass. If the text is too small, like a serial number on a scanned invoice or a footnote in a dense contract, the model often has to guess. This "static" approach fails when documents are low-resolution, dense, or contain mixed media. This is where those 5-10% of quality improvements come from. I was surprised how well Gemini can plan, look closer, and dig details out of a document.

Agentic Vision solves this by introducing a Think, Act, Observe loop specifically for extraction. It is truly great.

Think: The model identifies illegible or dense sections of a document.

Act: It executes Python code to "crop" and "zoom" into those specific regions, treating them as new, high-resolution inputs.

Observe: It reads the text from the cropped view with pixel-perfect precision and integrates it back into the final answer.

1. Zooming into fine detail

This is what the agent plans for you: where to zoom, and even the reasoning why. Instead of transforming the document yourself before submitting it for OCR, Gemini will do it if needed. It might cost more, but often you do not know in advance where to zoom, because each document can be slightly different. Gemini will recognize the spot and zoom in for the details. Wow.

Example agent action:

# The agent zooms into the serial number area
from PIL import Image
img = Image.open('invoice.jpg')
# Define coordinates for the tiny text
cropped_img = img.crop((100, 500, 300, 600)) 
cropped_img.save('zoomed_serial.jpg')

2. The perfect alignment

Ever taken a crooked photo of a receipt? Gemini can now write code to fix the rotation before it even tries to read the totals. Imperfect rotation of some area, or even the whole document, is a very common reason static recognition fails. As in the previous case, you do not know in advance how to preprocess the document; each one can be rotated differently. So let Gemini recognize what is too rotated to be read correctly.

Example agent action:

# The agent detects a 15-degree tilt and fixes it
rotated_img = img.rotate(-15, expand=True)
rotated_img.save('straightened_document.jpg')
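The interesting part is that the agent first has to detect the tilt before it can fix it. As a minimal sketch of how such detection can work (my own illustration, not code from Google's announcement), here is a Pillow/NumPy skew estimator: it tries candidate rotations and keeps the one whose row-wise ink profile is sharpest, since straight text lines produce strong peaks.

```python
import numpy as np
from PIL import Image


def estimate_skew_angle(img: Image.Image,
                        candidates=np.arange(-10, 10.5, 0.5)) -> float:
    """Return the rotation (in degrees) that best straightens the page.

    Straight text lines give a row-wise ink-density profile with sharp
    peaks, so we pick the candidate angle maximizing profile variance.
    """
    gray = img.convert("L")
    best_angle, best_score = 0.0, -1.0
    for angle in candidates:
        rotated = np.asarray(gray.rotate(angle, fillcolor=255), dtype=float)
        profile = 255.0 - rotated.mean(axis=1)  # ink density per row
        score = profile.var()
        if score > best_score:
            best_angle, best_score = float(angle), score
    return best_angle
```

Applying `img.rotate(estimate_skew_angle(img), fillcolor=255)` then straightens the scan. With Agentic Vision enabled, Gemini writes equivalent code on its own when it decides the page is tilted.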

3. Enhancing contrast for faded text

For those old, faded tax documents, the agent can boost the contrast to make the numbers pop. Poor contrast in scans and photos is another common reason OCR fails. Gemini can recognize darker spots and bring their contrast in line with the rest of the document. Wow again.

Example agent action:

from PIL import ImageEnhance
enhancer = ImageEnhance.Contrast(img)
enhanced_img = enhancer.enhance(2.0) # Doubles the contrast
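A related trick worth knowing (my addition, not from the announcement): instead of a fixed enhancement factor, Pillow's `ImageOps.autocontrast` stretches whatever narrow intensity range a faded scan actually uses out to the full black-to-white range.

```python
from PIL import Image, ImageOps


def normalize_contrast(img: Image.Image) -> Image.Image:
    """Stretch the histogram of a faded grayscale scan so the darkest
    ink maps to black and the background to white; cutoff=2 ignores
    2% of outlier pixels on each end (dust, specks)."""
    return ImageOps.autocontrast(img.convert("L"), cutoff=2)
```

This is the kind of preprocessing the agent can decide to write for itself when it observes that a region is washed out.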

This is where the improvements come from.

OCR benchmarks: non-agentic vs. agentic

The introduction of agentic capabilities has resulted in measurable improvements in extraction accuracy. Below is a breakdown of performance gains when comparing standard static processing (Non-Agentic) with the new Code Execution-enabled workflow (Agentic).

| Metric / Category | Static OCR (Standard Prompt) | Agentic OCR (Thinking + Code) | Improvement / Note |
|---|---|---|---|
| OmniDocBench (Accuracy) | 0.121 (Edit Distance) | 0.108 (Estimated) | +10% Error Reduction |
| Complex Extraction Quality | Baseline Performance | +5% to 10% Boost | Significant in handwriting/tables |
| Workflow Methodology | Single pass (Fixed resolution) | Think-Act-Observe Loop | Uses Python to crop/zoom docs |
| Handwritten Text Recall | High (95%+) | Near-Perfect (98.5%+) | Reduced hallucinations in "noisy" docs |
| Latency (Per Page) | ~1-2 seconds | 3-8 seconds | Higher due to multi-step reasoning |
| Token Efficiency | Standard usage | 30% fewer tokens avg. | Due to smarter visual targeting |


Real-world OCR use cases

Developers are already using agentic vision to solve previously impossible OCR challenges, and plenty of market opportunities are still open:

1. High-precision technical review

PlanCheckSolver.com reported a 5% increase in accuracy by using Agentic Vision to validate building codes.

  • The Challenge: Reading tiny measurements and annotations on large-format 4K architectural blueprints.

  • The Agentic Solution: The model writes code to dynamically crop specific sections (like roof edges or electrical diagrams) to read the text clearly, rather than trying to process the entire blueprint at once.

2. Financial data extraction

Dense tables in financial reports often confuse standard OCR models, leading to "hallucinated" numbers.

  • The Challenge: Extracting hundreds of specific fields (dates, amounts, line items) from mixed-format invoices and tax documents.

  • The Agentic Solution: Gemini 3 Flash uses code to normalize the data structure and verifies the numbers by performing visual arithmetic, ensuring the extracted data matches the visual evidence.
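Part of that normalization step can be illustrated with plain Python. A hedged sketch (the function is mine, not part of the Gemini SDK) of handling amounts written in mixed locale formats, where whichever separator appears last is treated as the decimal point:

```python
import re
from decimal import Decimal


def normalize_amount(raw: str) -> Decimal:
    """Normalize amounts written in mixed locale formats
    ('1,234.56', '1.234,56', '$ 1234.56') to a Decimal."""
    s = re.sub(r"[^\d.,-]", "", raw)
    if s.rfind(",") > s.rfind("."):
        # European style: dot groups thousands, comma is decimal
        s = s.replace(".", "").replace(",", ".")
    else:
        # US style (or no separator): commas group thousands
        s = s.replace(",", "")
    return Decimal(s)
```

Once every amount is a `Decimal`, cross-checking line items against the printed totals becomes trivial arithmetic.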

3. "Visual scratchpad" for annotation

Instead of just outputting text, the model can now verify its own work.

  • The Agentic Solution: When asked to count items or verify a list in a document, the model can generate a "visual scratchpad," drawing bounding boxes and labels on the image itself. This forces the model to ground every character it recognizes in specific pixels, significantly reducing transcription errors.
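The same idea can be approximated client-side with Pillow. A minimal sketch (my own; in practice the boxes would come from coordinates the model reports) that renders such a scratchpad for human review:

```python
from PIL import Image, ImageDraw


def draw_scratchpad(img: Image.Image, detections) -> Image.Image:
    """Draw a labelled box around each region the model claims to have
    read, so a human can audit the extraction.

    `detections` is a list of (label, (left, top, right, bottom)).
    Returns a new annotated image; the original is left untouched.
    """
    annotated = img.convert("RGB").copy()
    draw = ImageDraw.Draw(annotated)
    for label, box in detections:
        draw.rectangle(box, outline="red", width=2)
        # Put the label just above the box (clamped to the image edge)
        draw.text((box[0], max(box[1] - 12, 0)), label, fill="red")
    return annotated
```

Overlaying the model's claimed locations like this is exactly what makes human-in-the-loop validation cheap.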

Implementation: How to enable agentic OCR

Implementing this "active" OCR requires a simple update to your Gemini API configuration. By enabling the code_execution tool, you grant the model permission to write and run Python code (using libraries like opencv-python, pillow, and numpy) to manipulate the document image before reading it.

So in the example below, just focus on enabling code execution during the Gemini API call.

from google.genai import types

# Enable the Code Execution tool in the config
config = types.GenerateContentConfig(
    tools=[types.Tool(code_execution=types.ToolCodeExecution)]
)

The advantage in code

Unlike standard OCR API calls that return raw text, this approach allows the model to self-correct. If it cannot read a blurry section, it will generate a script to sharpen or crop that specific area and try again.

Code example

Here is how to initialize the Gemini 3 Flash model with Code Execution enabled for an OCR task:

from google import genai
from google.genai import types

client = genai.Client()

# Load your document image (e.g., a dense invoice or blueprint)
image = types.Part.from_uri(
    file_uri="https://example.com/dense-invoice.jpg",
    mime_type="image/jpeg"
)

# Enable the Code Execution tool in the config
config = types.GenerateContentConfig(
    tools=[types.Tool(code_execution=types.ToolCodeExecution)]
)

# Prompt the model to use its new tools
response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents=[
        image, 
        "Extract the serial number from the bottom-left barcode and the total amount. " 
        "If the text is small, use code to crop and zoom into those regions first."
    ],
    config=config
)

# The response will contain the model's thought process, the code it ran, and the final answer
print(response.text)
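The response is a sequence of parts: the model's text, the Python it executed, and the tool output it observed. A small helper for pulling those apart (the attribute names match the google-genai `Part` type; the helper itself is my own):

```python
def summarize_parts(parts):
    """Split response parts into the code the agent ran, the execution
    output it observed, and the final text answer."""
    ran_code, tool_output, answer = [], [], []
    for part in parts:
        if getattr(part, "executable_code", None):
            ran_code.append(part.executable_code.code)
        elif getattr(part, "code_execution_result", None):
            tool_output.append(part.code_execution_result.output)
        elif getattr(part, "text", None):
            answer.append(part.text)
    return {"code": ran_code, "output": tool_output, "answer": "".join(answer)}
```

Calling `summarize_parts(response.candidates[0].content.parts)` lets you log exactly which crops, rotations, and enhancements the agent decided to perform.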

Why this wins for OCR:

  • Active Cropping: The model implicitly detects small text and writes PIL or OpenCV scripts to crop image[y:y+h, x:x+w], effectively "zooming in" to boost resolution for the OCR engine.

  • De-skewing: It can write code to detect document rotation and correct it before attempting to read text.

  • Verifiable Extraction: You can ask the model to draw a box around the extracted text to prove it found the correct location, which allows for human-in-the-loop validation.

What’s next for intelligent document processing?

This release is just the beginning. Google plans to expand Agentic Vision for OCR in several key areas:

Implicit Visual Math: Future updates will allow the model to automatically perform calculations on extracted data without explicit prompting.

Broader Availability: The technology is rolling out to other sizes in the Gemini family, bringing high-end OCR reasoning to even more efficient models.

How to get started in Google AI Studio

Start in Google AI Studio by experimenting in the playground. You can generate an API key later and call Gemini using the sample code it provides based on your input. Use the Gemini 3 Flash model first; as far as I know, the Pro model does not support Agentic Vision with code execution. It can also be handy to instruct the model to create structured data.
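On the structured-data point: one lightweight pattern is to ask for a fixed JSON object in the prompt and parse it afterwards. A stdlib-only sketch (the field names are hypothetical examples of what you might request, not a fixed Gemini schema):

```python
import json
from dataclasses import dataclass


@dataclass
class InvoiceRecord:
    invoice_number: str
    total_gross: float


def parse_invoice_answer(response_text: str) -> InvoiceRecord:
    """Pull the first {...} object out of the model's answer (models
    sometimes wrap JSON in prose or a code fence) and type it."""
    start = response_text.find("{")
    end = response_text.rfind("}") + 1
    data = json.loads(response_text[start:end])
    return InvoiceRecord(str(data["invoice_number"]), float(data["total_gross"]))
```

Typed records like this are what you would hand off to the downstream financial software mentioned earlier.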


Upload a picture (I used an invoice from the Hugging Face invoices-and-receipts_ocr_v1 dataset) and provide these instructions.


Gemini then creates the following code to preprocess the image.


The code then crops the invoice into several images that are analyzed separately for better results. OCR is performed over each sub-region separately, and the results are combined and analyzed together.



This is the result of the invoice analysis. I still redact some of the details even though the dataset is public, but everything matches as it should.

------------------------------------------------------------------------------------------------------------------------

This invoice (No. 98-------1, issued on 09/02/2019) details a transaction between Pe-----n LLC (Seller) and Pa-----------s (Client).

Entities Involved

  • Seller:

    • Name: Pe------ LLC

    • Address: USS Da-----, FPO -----

    • Tax ID: 979-97-----

    • IBAN: G-----A1644----99-------

  • Client:

    • Name: Pat------on, Pie----- and Fue-----es

    • Address: 79----0 An-----on Fort, Mic-----port, SD 4-----

    • Tax ID: 932-------

Detailed Item List

| No. | Description | Qty | UM | Net Price | Net Worth | VAT [%] | Gross Worth |
|---|---|---|---|---|---|---|---|
| 1. | New Design Rectangular Hand Woven Chindi Rag Rug Floor Handmade Carpet Aztec Mat | 1.00 | each | 29.99 | 29.99 | 10% | 32.99 |
| 2. | Bedroom Decor Carpets Living Room Line Pattern Green Rug Modern Simple Area Rugs | 1.00 | each | 33.86 | 33.86 | 10% | 37.25 |
| 3. | YILONG 2.5'x4' Small Hand Knotted Silk Carpets Antistatic Floor Area Rug 844B | 1.00 | each | 2,000.00 | 2,000.00 | 10% | 2,200.00 |
| 4. | Moroccan Amazigh Handmade Carpet 14'2" x 7'5" | 5.00 | each | 342.00 | 1,710.00 | 10% | 1,881.00 |

Invoice Summary & Verification

| Category | Value |
|---|---|
| Total Net Worth | $ 3,773.85 |
| VAT (10%) | $ 377.38 |
| Total Gross Worth | $ 4,151.23 |

Calculations Check:

  1. Net Worth Sum: 29.99 + 33.86 + 2,000.00 + 1,710.00 = 3,773.85 (Correct)

  2. VAT Calculation: 10% of 3,773.85 = 377.385 (The invoice uses 377.38, likely applying a floor rounding or slight variation in precision).

  3. Gross Worth Sum: 3,773.85 + 377.38 = 4,151.23 (Correct)

  4. Individual Items: All item gross values (Net + 10% VAT) match their respective totals listed in the table.

The invoice data is consistent and correctly calculated.

------------------------------------------------------------------------------------------------------------------------
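As a side note, the rounding observation in the model's verification is easy to reproduce: with half-up rounding the VAT would be 377.39, so the invoice indeed rounds down. A quick check with Python's decimal module, using the item values from the table above:

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

# (quantity, net price) pairs from the item table above
items = [("1.00", "29.99"), ("1.00", "33.86"),
         ("1.00", "2000.00"), ("5.00", "342.00")]

net = sum(Decimal(q) * Decimal(p) for q, p in items).quantize(Decimal("0.01"))
vat_half_up = (net * Decimal("0.10")).quantize(Decimal("0.01"), ROUND_HALF_UP)
vat_floor = (net * Decimal("0.10")).quantize(Decimal("0.01"), ROUND_DOWN)

print(net, vat_half_up, vat_floor, net + vat_floor)
# 3773.85 377.39 377.38 4151.23
```

This is exactly the kind of "visual arithmetic" cross-check the agent performed on its own above.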

Final notes

Agentic Vision is available now via the Gemini API in Google AI Studio and Vertex AI. The easiest way in is Google AI Studio, where you can get an API key in the free tier and try Gemini. I am using a Gemini Pro subscription with increased limits and also have Vertex AI enabled, but a free tier exists. This is a huge advantage over ChatGPT, where you can pay for a subscription but get no API credits.

Developers can experiment with these OCR capabilities by enabling "Code Execution" directly in the Google AI Studio tools settings and uploading high-density documents to test the "Think, Act, Observe" loop and bring new ideas to life. 

Let me know what you are building. 

Reference:

 Google Developers Blog
