Agentic OCR: Gemini 3 Flash revolutionizing text extraction
Google has just announced a step forward in intelligent document processing with the introduction of Agentic Vision for Gemini 3 Flash. This new capability improves something I have been exploring and testing for quite some time: document digitization, better known as optical character recognition (OCR). It moves OCR from a passive, static scan to an active, investigative process that can zoom, rotate, and perform whatever steps are needed to better understand your paper documents.
At the end of this text, I will provide an example of how to start building with Agentic Vision and Gemini 3 Flash, but the whole article is worth reading to see what is even possible.
Just imagine how much paper is still needed in everyday workflows, and how many people sit in front of computers acting as little more than an interface, typing data into specific forms. This tedious kind of human interface can be replaced by AI much sooner than, for example, programmers. These people can have free hands and focus on tasks where creativity and edge-case judgment are really needed. The market is massive.
By combining visual reasoning with code execution, Agentic Vision lets the model formulate plans to zoom in, inspect, and manipulate images step by step. This significantly improves its ability to transfer paper into digital, structured data.
According to Google, enabling code execution with Gemini 3 Flash delivers a consistent 5-10% quality boost across most vision benchmarks, with even higher gains in complex OCR tasks. In my experience, even the static model was great at OCR. I used to switch to the Pro model when Flash failed, but currently most tasks are handled by the Flash model alone.
This ranges from easy use cases like bills and simple office documents to very sophisticated invoices and financial documents.
Just take a picture, provide a structure accepted by your downstream financial software, export the structured data, and even define logic that Gemini should apply to the extracted data. This is power, super power. Now Agentic Vision can also zoom and perform extra steps to be even more precise about details. In a practical workflow, the harder task is minimizing errors and building a user-friendly service.
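To make the "provide a structure" part concrete, here is a minimal sketch of structured extraction with the google-genai SDK. The Invoice schema and its field names are my own illustration, not a fixed API:
from google import genai
from google.genai import types
from pydantic import BaseModel

# Hypothetical target structure for a downstream financial system
class LineItem(BaseModel):
    description: str
    net_amount: float

class Invoice(BaseModel):
    invoice_number: str
    issue_date: str
    total_gross: float
    items: list[LineItem]

client = genai.Client()
response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents=[
        types.Part.from_uri(file_uri="https://example.com/invoice.jpg", mime_type="image/jpeg"),
        "Extract the invoice into the given structure.",
    ],
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=Invoice,
    ),
)
print(response.parsed)  # A validated Invoice instance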
The problem with static OCR
Traditional frontier AI models process a document in a single pass. If the text is too small, like a serial number on a scanned invoice or a footnote in a dense contract, the model often has to guess. This "static" approach fails when documents are low-resolution, dense, or contain mixed media. This is where the 5-10% quality improvement comes from. I was surprised how Gemini can digitize a document, look around, and plan how to extract the details.
Agentic Vision solves this by introducing a Think, Act, Observe loop specifically for extraction. It is truly great.
Think: The model identifies illegible or dense sections of a document.
Act: It executes Python code to "crop" and "zoom" into those specific regions, treating them as new, high-resolution inputs.
Observe: It reads the text from the cropped view with pixel-perfect precision and integrates it back into the final answer.
1. Zooming into tiny text
This is what the agent plans for you: where to zoom, and even the reasoning why. Instead of you transforming the document before submitting it for OCR, Gemini will do it if needed. It might cost more, but sometimes you do not know in advance where to zoom, because each document can be slightly different. Gemini will recognize this and zoom in on the details. Wow.
Example agent action:
# The agent zooms into the serial number area
from PIL import Image
img = Image.open('invoice.jpg')
# Define the crop box (left, upper, right, lower) around the tiny text
cropped_img = img.crop((100, 500, 300, 600))
cropped_img.save('zoomed_serial.jpg')
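Cropping alone only isolates the region; to actually "zoom", the crop is typically upscaled so the text occupies more pixels. A small sketch of that extra step (the 4x factor is my arbitrary choice):
# Upscale the crop so the next OCR pass sees the text at a higher effective resolution
w, h = cropped_img.size
zoomed_img = cropped_img.resize((w * 4, h * 4), Image.LANCZOS)
zoomed_img.save('zoomed_serial_4x.jpg')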
2. The perfect alignment
Ever taken a crooked photo of a receipt? Gemini can now write code to fix the rotation before it even tries to read the totals. Imperfect rotation of some area, or even the whole document, is a very common reason why static recognition fails. As in the previous case, you do not know in advance how to preprocess the document; each one can be rotated differently. So let Gemini recognize what needs to be straightened to be read correctly.
Example agent action:
# The agent detects a 15-degree tilt and fixes it
# expand=True enlarges the canvas so the corners are not clipped
rotated_img = img.rotate(-15, expand=True)
rotated_img.save('straightened_document.jpg')
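In reality the tilt angle is unknown, so it first has to be estimated. One classic way to do that, shown here as my own illustrative OpenCV sketch rather than the model's actual generated code:
import cv2
import numpy as np

# Estimate the skew angle from the orientation of the text pixels
gray = cv2.imread('receipt.jpg', cv2.IMREAD_GRAYSCALE)
# Otsu threshold with inversion so text becomes white foreground
_, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
# Fit a minimum-area rectangle around all text pixels
coords = np.column_stack(np.where(thresh > 0)).astype(np.float32)
angle = cv2.minAreaRect(coords)[-1]
# Recent OpenCV reports angles in (0, 90]; map to a small correction angle
if angle > 45:
    angle -= 90
print(f"Estimated skew: {angle:.1f} degrees")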
3. Enhancing contrast for faded text
For those old, faded tax documents, the agent can boost the contrast to make the numbers pop. Poor scan or photo contrast is another common reason OCR fails. Gemini can recognize darker spots and bring their contrast in line with the rest of the document. Wow.
Example agent action:
from PIL import ImageEnhance
enhancer = ImageEnhance.Contrast(img)
enhanced_img = enhancer.enhance(2.0)  # Factor 2.0 doubles the contrast
enhanced_img.save('enhanced_document.jpg')
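A fixed factor like 2.0 will not suit every faded document. As a hedged alternative, Pillow's ImageOps.autocontrast stretches the histogram adaptively instead of guessing a factor:
from PIL import ImageOps

# Remap the darkest/lightest pixels to full black/white, clipping 1% outliers
auto_img = ImageOps.autocontrast(img.convert('L'), cutoff=1)
auto_img.save('autocontrast_document.jpg')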
This is where the improvements come from.
OCR benchmarks: non-agentic vs. agentic
The introduction of agentic capabilities has resulted in measurable improvements in extraction accuracy. Below is a breakdown of performance gains when comparing standard static processing (Non-Agentic) with the new Code Execution-enabled workflow (Agentic).
| Metric / Category | Static OCR (Standard Prompt) | Agentic OCR (Thinking + Code) | Improvement / Note |
|---|---|---|---|
| OmniDocBench (Accuracy) | 0.121 (Edit Distance) | 0.108 (Estimated) | +10% Error Reduction |
| Complex Extraction Quality | Baseline Performance | +5% to 10% Boost | Significant in handwriting/tables |
| Workflow Methodology | Single pass (Fixed resolution) | Think-Act-Observe Loop | Uses Python to crop/zoom docs |
| Handwritten Text Recall | High (95%+) | Near-Perfect (98.5%+) | Reduced hallucinations in "noisy" docs |
| Latency (Per Page) | ~1-2 seconds | 3-8 seconds | Higher due to multi-step reasoning |
| Token Efficiency | Standard usage | 30% fewer tokens avg. | Due to smarter visual targeting |
Real-world OCR use cases
Developers are already using Agentic Vision to solve previously impossible OCR challenges, and plenty of market opportunities are still open:
1. High-precision technical review
PlanCheckSolver.com reported a 5% increase in accuracy by using Agentic Vision to validate building codes.
The Challenge: Reading tiny measurements and annotations on large-format 4K architectural blueprints.
The Agentic Solution: The model writes code to dynamically crop specific sections (like roof edges or electrical diagrams) to read the text clearly, rather than trying to process the entire blueprint at once.
2. Financial data extraction
Dense tables in financial reports often confuse standard OCR models, leading to "hallucinated" numbers.
The Challenge: Extracting hundreds of specific fields (dates, amounts, line items) from mixed-format invoices and tax documents.
The Agentic Solution: Gemini 3 Flash uses code to normalize the data structure and verifies the numbers by performing visual arithmetic, ensuring the extracted data matches the visual evidence.
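This "visual arithmetic" step can also be mirrored on your side as a cheap sanity check. A minimal sketch using the numbers from the invoice analyzed later in this article:
# Verify that extracted line items add up to the stated totals
line_items_net = [29.99, 33.86, 2000.00, 1710.00]  # net amounts per item
stated_net, stated_vat, stated_gross = 3773.85, 377.38, 4151.23

net_sum = round(sum(line_items_net), 2)
assert net_sum == stated_net, f"Net mismatch: {net_sum} != {stated_net}"
# VAT is 10%; the invoice floors 377.385 to 377.38, so allow a one-cent tolerance
assert abs(stated_net * 0.10 - stated_vat) <= 0.01, "VAT mismatch"
assert round(stated_net + stated_vat, 2) == stated_gross, "Gross mismatch"
print("All totals verified")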
3. "Visual scratchpad" for annotation
Instead of just outputting text, the model can now verify its own work.
The Agentic Solution: When asked to count items or verify a list in a document, the model can generate a "visual scratchpad," drawing bounding boxes and labels on the image itself. This forces the model to ground every character it recognizes in specific pixels, significantly reducing transcription errors.
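Here is a minimal sketch of what such a scratchpad could look like once the model returns box coordinates (the coordinates and label below are hypothetical):
from PIL import Image, ImageDraw

img = Image.open('invoice.jpg')
draw = ImageDraw.Draw(img)
# Hypothetical bounding box returned by the model for the "total amount" field
box = (420, 910, 560, 945)  # (left, upper, right, lower)
draw.rectangle(box, outline='red', width=3)
draw.text((box[0], box[1] - 15), 'total_amount', fill='red')
img.save('scratchpad.jpg')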
Implementation: How to enable agentic OCR
Implementing this "active" OCR requires a simple update to your Gemini API configuration. By enabling the code_execution tool, you grant the model permission to write and run Python code (using libraries like opencv-python, pillow, and numpy) to manipulate the document image before reading it.
So in the example below, just focus on enabling code execution in the Gemini API call.
# Enable the Code Execution tool in the config
config = types.GenerateContentConfig(
    tools=[types.Tool(code_execution=types.ToolCodeExecution())]
)
The advantage in code
Unlike standard OCR API calls that return raw text, this approach allows the model to self-correct. If it cannot read a blurry section, it will generate a script to sharpen or crop that specific area and try again.
Code example
Here is how to initialize the Gemini 3 Flash model with Code Execution enabled for an OCR task:
from google import genai
from google.genai import types

client = genai.Client()

# Load your document image (e.g., a dense invoice or blueprint)
image = types.Part.from_uri(
    file_uri="https://example.com/dense-invoice.jpg",
    mime_type="image/jpeg"
)

# Enable the Code Execution tool in the config
config = types.GenerateContentConfig(
    tools=[types.Tool(code_execution=types.ToolCodeExecution())]
)

# Prompt the model to use its new tools
response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents=[
        image,
        "Extract the serial number from the bottom-left barcode and the total amount. "
        "If the text is small, use code to crop and zoom into those regions first."
    ],
    config=config
)

# The response contains the model's thought process, the code it ran, and the final answer
print(response.text)
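If you want to audit what the agent actually did, the response parts also expose the generated code and its output. A sketch, assuming the google-genai response structure for code execution:
# Walk the response parts: model text, the Python it wrote, and what that code printed
for part in response.candidates[0].content.parts:
    if part.text is not None:
        print("Text:", part.text)
    if part.executable_code is not None:
        print("Model-generated code:\n", part.executable_code.code)
    if part.code_execution_result is not None:
        print("Execution output:", part.code_execution_result.output)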
Why this wins for OCR:
Active Cropping: The model implicitly detects small text and writes PIL or OpenCV scripts to crop image[y:y+h, x:x+w], effectively "zooming in" to boost resolution for the OCR engine.
De-skewing: It can write code to detect document rotation and correct it before attempting to read text.
Verifiable Extraction: You can ask the model to draw a box around the extracted text to prove it found the correct location, which allows for human-in-the-loop validation.
What’s next for intelligent document processing?
This release is just the beginning. Google plans to expand Agentic Vision for OCR in several key areas:
Implicit Visual Math: Future updates will allow the model to automatically perform calculations on extracted data without explicit prompting.
How to get started in Google AI Studio


This is the result of the invoice analysis. I still mask some of the details even though the dataset is public, but everything matches as it should.
------------------------------------------------------------------------------------------------------------------------
This invoice (No. 98-------1, issued on 09/02/2019) details a transaction between Pe-----n LLC (Seller) and Pa-----------s (Client).
Entities Involved
Seller:
Name: Pe------ LLC
Address: USS Da-----, FPO -----
Tax ID: 979-97-----
IBAN: G-----A1644----99-------
Client:
Name: Pat------on, Pie----- and Fue-----es
Address: 79----0 An-----on Fort, Mic-----port, SD 4-----
Tax ID: 932-------
Detailed Item List
Invoice Summary & Verification
Net Worth Sum: 29.99 + 33.86 + 2,000.00 + 1,710.00 = 3,773.85 (Correct)
VAT Calculation: 10% of 3,773.85 = 377.385 (the invoice uses 377.38, likely applying floor rounding or a slight variation in precision)
Gross Worth Sum: 3,773.85 + 377.38 = 4,151.23 (Correct)
Individual Items: All item gross values (Net + 10% VAT) match their respective totals listed in the table.
Final notes
Agentic Vision is available now via the Gemini API in Google AI Studio and Vertex AI. The easiest way in is Google AI Studio, where you can get an API key on the free tier and try Gemini. I use a Gemini Pro subscription with increased limits and also have Vertex AI enabled, but the free tier exists. This is a huge advantage over ChatGPT, where you can pay for a subscription but get no API credits.
Developers can experiment with these OCR capabilities by enabling "Code Execution" directly in the Google AI Studio tools settings and uploading high-density documents to test the "Think, Act, Observe" loop and bring new ideas to life.
Let me know what you are building.