Agentic Vision in Gemini
Agentic visual reasoning with code execution
Agentic Vision in Gemini 3 Flash is a new capability from Google DeepMind that transforms image understanding from a static process into an active, agentic investigation. Rather than processing an image in a single glance, as traditional models do, Agentic Vision combines visual reasoning with code execution: the model formulates a plan, then zooms, inspects, and manipulates the image step by step. Grounding its answers in visual evidence this way delivers a consistent 5-10% quality boost across most vision benchmarks compared with standard approaches.
Agentic Vision is available today via the Gemini API in Google AI Studio and Vertex AI, and is also rolling out in the Gemini app. Here is how to get started:
1. Access the Demo Visit the demo app in Google AI Studio at aistudio.google.com to experience Agentic Vision capabilities firsthand.
2. Use AI Studio Playground Go to the AI Studio Playground and select Gemini 3 Flash model to experiment with the feature.
3. Enable Code Execution In the playground, navigate to Tools settings and turn on "Code Execution" to unlock Agentic Vision behaviors.
4. Upload an Image Provide an image you want to analyze; the model can work with high-resolution inputs including documents, photos, and complex visual data.
5. Ask Detailed Questions Query the model about fine-grained details in your image; it will automatically zoom, inspect, and manipulate the image to find answers.
6. Review the Process Observe how the model uses the Think, Act, Observe loop to generate Python code, manipulate images, and ground its reasoning in visual evidence.
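The steps above map onto a single API call: supply the image, ask the question, and enable the code-execution tool. Here is a minimal sketch of the request body, following the field names shown in the public Gemini REST examples; the model name, image data, and question are placeholders rather than tested values:

```python
import json

# Sketch of a Gemini API request body with code execution enabled.
# Field names follow public REST examples; the image bytes here are
# a placeholder, not real data.
def build_request(question: str, image_b64: str) -> str:
    body = {
        "contents": [{
            "parts": [
                {"inline_data": {"mime_type": "image/png", "data": image_b64}},
                {"text": question},
            ]
        }],
        # Enabling this tool is what unlocks Agentic Vision behaviors.
        "tools": [{"code_execution": {}}],
    }
    return json.dumps(body)

payload = build_request("What is the serial number on the label?", "<base64 image>")
```

In AI Studio the same toggle lives under Tools settings; via the API, it is this one entry in the `tools` list.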
Think, Act, Observe Loop The model analyzes queries and images, formulates multi-step plans, executes Python code to manipulate images, and appends transformed images to its context window for better understanding.
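In pseudocode form, the loop looks roughly like this. It is a toy sketch: the real planner, tool calls, and stopping condition live inside the model, and every function name here is illustrative.

```python
# Toy sketch of the Think, Act, Observe loop. The stubs below stand in
# for the model's internal reasoning and its generated Python code.
def think(query, context):
    # Decide the next step from the query and images gathered so far.
    return "zoom" if len(context) < 2 else "answer"

def act(plan, image):
    # Execute code to transform the image (crop, zoom, annotate, ...).
    return f"{plan}({image})"

def answer(query, context):
    # Ground the final answer in the accumulated visual evidence.
    return context[-1]

def agentic_loop(query, image, max_steps=5):
    context = [image]                       # images accumulate in context
    for _ in range(max_steps):
        plan = think(query, context)        # Think: analyze query + images
        if plan == "answer":
            break
        new_image = act(plan, context[-1])  # Act: run code on the image
        context.append(new_image)           # Observe: append the result
    return answer(query, context)

result = agentic_loop("read label", "photo")
```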
Automatic Zooming and Inspection Gemini 3 Flash is trained to zoom in implicitly when a question hinges on fine-grained details, such as serial numbers or distant text, that would otherwise be missed.
Image Annotation Capability The model can execute code to draw directly on images, creating bounding boxes and numeric labels as a "visual scratchpad" to ensure pixel-perfect understanding.
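The "visual scratchpad" idea can be sketched in a few lines: stamp a numeric label at each detection so every counted object is visibly accounted for. A real run would draw on pixels (for example with an imaging library); here a character grid and invented detection coordinates stand in:

```python
# Sketch of the visual scratchpad: label each detection so the count
# is grounded in marks on the image rather than a single guess.
def annotate(grid, detections):
    grid = [row[:] for row in grid]  # copy so the original stays intact
    for i, (r, c) in enumerate(detections, start=1):
        grid[r][c] = str(i)          # numeric label at the detection site
    return grid

canvas = [["." for _ in range(6)] for _ in range(3)]
labeled = annotate(canvas, [(0, 1), (1, 4), (2, 2)])

# Counting the labels now matches counting the detections exactly.
count = sum(cell.isdigit() for row in labeled for cell in row)
```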
Visual Math and Plotting Agentic Vision can parse high-density tables and execute Python code to perform calculations and generate professional visualizations like Matplotlib charts.
Code Execution Integration By offloading computation to a deterministic Python environment, the model bypasses hallucination issues common in multi-step visual arithmetic.
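The payoff of deterministic offloading is easy to see with a small example: once the model has read a table out of an image, summing it in Python cannot drift the way multi-step mental arithmetic can. The table values below are invented for illustration:

```python
# Sketch of offloading visual math to code: arithmetic over a table
# extracted from an image is exact once it runs in Python.
table_text = """item,qty,unit_price
bolts,40,0.25
panels,12,18.50
hinges,8,3.75"""

rows = [line.split(",") for line in table_text.splitlines()[1:]]  # skip header
total = sum(int(qty) * float(price) for _, qty, price in rows)
```

The same extracted rows could then feed a Matplotlib chart for the visualization step.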
High-Resolution Input Processing Capable of working with complex, high-resolution inputs like building plans, improving accuracy through iterative inspection of specific patches.
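Iterative patch inspection reduces to cropping: pull a region of interest out of a high-resolution input so fine detail fills the model's view. A minimal sketch, with a nested list standing in for pixel data and invented coordinates:

```python
# Sketch of patch cropping for iterative inspection. The "image" is a
# nested list of (row, col) tuples standing in for pixels.
def crop(image, top, left, height, width):
    return [row[left:left + width] for row in image[top:top + height]]

image = [[(r, c) for c in range(100)] for r in range(100)]
patch = crop(image, 10, 80, 4, 8)  # e.g. a small patch near a roof edge
```

In the agentic loop, the cropped patch is appended back into the model's context for a closer look, and the process repeats until the detail is resolved.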
#1 Building Plan Validation PlanCheckSolver.com improved accuracy by 5% using Agentic Vision to iteratively inspect high-resolution building plans, cropping and analyzing specific patches like roof edges to confirm compliance with building codes.
#2 Counting and Object Detection When asked to count items like fingers on a hand, the model draws bounding boxes and numeric labels over each identified element to ensure accurate counting without errors.
#3 Data Visualization from Images Extract data from complex tables in images and automatically generate professional charts and visualizations using Python libraries like Matplotlib.
#4 Document Analysis Process high-density documents and technical specifications, zooming into specific sections to extract and verify detailed information.
#5 Quality Inspection Inspect products, components, or materials for defects by automatically zooming into areas of interest and annotating findings.
What is the difference between Agentic Vision and traditional image understanding? Traditional AI models process images in a single, static glance and must guess if they miss fine-grained details. Agentic Vision treats vision as an active investigation, using code execution to zoom, inspect, and manipulate images step-by-step, grounding answers in visual evidence.
How much does Agentic Vision improve performance? Enabling code execution with Gemini 3 Flash delivers a consistent 5-10% quality boost across most vision benchmarks.
What tools does Agentic Vision currently support? Code execution is one of the first tools supported by Agentic Vision. Google is exploring adding more tools including web and reverse image search to ground understanding even further.
Is Agentic Vision available in other model sizes? Currently, Agentic Vision is available in Gemini 3 Flash. Google plans to expand the capability to other model sizes in the future.
Do I need to explicitly prompt for certain behaviors? While zooming in on small details happens implicitly, other capabilities like rotating images or performing visual math currently require an explicit prompt nudge. Google is working to make all behaviors fully implicit in future updates.