Multimodal models: Gemini 2.0 Flash for computer vision applications

Multimodal models can process images, text, and audio, and also generate various output representations. This feels like magic, with tremendous potential waiting to be fully utilized in the coming years. We may not be ready for some computer vision applications yet, but sooner or later we will get there as stability and accuracy increase. Let's dive into an example and use case of how multimodal models can replace the full computer vision app workflow as a black box.
ChatGPT 4o image generation feature

I will use the Gemini 2.0 Flash model to demonstrate the application potential discussed in this text. In other words, these models need just an interface and a system prompt describing the expected output, and they can then replace a huge amount of code in your computer vision pipeline.

This post was not written by AI. The images, including the diagrams, were created with the just-released ChatGPT 4o image generation feature. The core of the app is demonstrated in Google AI Studio.

The design

The design is a computer vision pipeline that leverages a multimodal machine learning (ML) model to process camera feed data, store structured outputs, and visualize analytical results on a dashboard.

  • A camera continuously streams video data to the application layer for real-time analysis.
  • The application layer reads the camera feed and manages the IO between the camera, the multimodal model, and the output database.
  • The system prompt configures the multimodal model with global rules, the output format (JSON or SQL insert), and the required analytical tasks.
  • The user prompt dynamically selects specific frames for processing.
  • The output of the model is passed through the application layer to the database.
  • The real logic sits in the multimodal model instead of the application; a minimal sketch of this layer follows the diagram below.
Diagram generated with the new ChatGPT 4o image feature
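
To make the architecture concrete, here is a minimal sketch of that application layer in Python, assuming the google-generativeai SDK, OpenCV for the camera feed, and SQLite as the output store; the model name, prompts, table schema, and frame-sampling rate are illustrative assumptions, not part of the original design.

# Minimal sketch of the application layer: camera -> multimodal model -> database.
# Assumes pip install google-generativeai opencv-python pillow; details are illustrative.
import sqlite3
import time

import cv2
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # token generated in Google AI Studio

SYSTEM_PROMPT = "Analyze the image and return only the filled JSON, with no surrounding text."
model = genai.GenerativeModel("gemini-2.0-flash", system_instruction=SYSTEM_PROMPT)

db = sqlite3.connect("vision.db")
db.execute("CREATE TABLE IF NOT EXISTS results (ts REAL, payload TEXT)")

cap = cv2.VideoCapture(0)  # the camera feed
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame_idx += 1
    if frame_idx % 30 != 0:  # only selected frames are sent to the model
        continue
    image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    response = model.generate_content(["Analyze this frame.", image])
    db.execute("INSERT INTO results VALUES (?, ?)", (time.time(), response.text))
    db.commit()

Everything that would normally be a pipeline of detection, tracking, and classification code is reduced here to the system prompt and one generate_content call.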

Again, what you actually program is the IO. The computer vision task over the video happens inside the multimodal model. What should happen is defined in the system prompt, and what gets processed is selected by the user prompt.
Most of the detection, filtering, segmentation, and recognition is not programmed at all. Even the output format is not part of the program; it is simply defined in the prompt.

This is a powerful concept: you create the input and output for the model and let it perform its black-box magic, instead of manually defining all the transformations, resizing, feeding the CNN, collecting detection and segmentation outputs, performing action recognition, interpreting all the attributes, and converting them to the final output.
The ultimate benefit is that you can easily define multiple computer vision tasks on your multimodal model and reuse the IO application layer, as sketched below.
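
For illustration, the same IO code can serve several vision tasks just by swapping the system prompt; the task names and prompt texts below are hypothetical examples, not taken from a real deployment.

# Hypothetical task catalog: one IO layer, many vision "apps" defined only by prompts.
import google.generativeai as genai  # configured with an API key as in the sketch above

TASK_PROMPTS = {
    "pedestrian_analysis": "Return JSON describing every person: position, mood, action.",
    "vehicle_counting": "Return JSON with the count and type of vehicles in the frame.",
    "queue_length": "Return JSON with the number of people waiting in line.",
}

def analyze(frame_image, task):
    """Run one of the defined vision tasks on a frame; the model does the heavy lifting."""
    model = genai.GenerativeModel("gemini-2.0-flash",
                                  system_instruction=TASK_PROMPTS[task])
    return model.generate_content(["Analyze this frame.", frame_image]).text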

Application example

I usually use local models, but the accuracy of the models I can run locally is not good enough. For this example, I will use Google AI Studio with the Gemini 2.0 Flash model. I tried some smaller models on Ollama, such as gemma3:4b and the multimodal Phi-3.5 Vision, but I need the capabilities of a large model. All models filled the JSON and excluded the surrounding text. Gemma has some issues recognizing gender correctly, and the x, y locations are sometimes off, but it works reasonably well.
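
For reference, a local test like that can be run through the Ollama Python client roughly as follows; the prompt text and image path are placeholders, and the model tag may differ on your machine.

# Sketch: sending a selected frame to a local multimodal model through Ollama
# (pip install ollama; model pulled with `ollama pull gemma3:4b`).
import ollama

response = ollama.chat(
    model="gemma3:4b",
    messages=[{
        "role": "user",
        "content": "Fill the JSON template based on the image, output JSON only.",
        "images": ["frame_0001.jpg"],  # path to the selected frame
    }],
)
print(response["message"]["content"])  # the filled JSON; quality depends on the model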

This approach can replace common computer vision tasks like segmentation, detection of multiple categories, recognition, sentiment analysis, action recognition, and whole complex image analysis pipelines. The system prompt below defines the task and the output format:

Replace xxxx and produce filled json format only, without surrounding text,
based on image analysis. Find all person on picture.

add image_id Unique identifier for the image

Fields just once in json:
Scene_description: Describe scene in 100 characters
overall_mood: General mood of the scene (e.g., "tense", "cheerful")
number_of_people: Total number of people in the scene
scene_action: What is happening overall (e.g., "people are protesting")

For each person:
pedestrians Array containing data for each pedestrian
id Unique identifier for each pedestrian
body_center X, Y coordinates of the pedestrian's body center
(bounding box center)
head_location X, Y coordinates of the pedestrian's head
gender
mood General mood (e.g., happy, sad, neutral)
feeling More specific inner feeling (e.g., excited, anxious)
action What the pedestrian is doing (e.g., walking, standing, waving)
close Could be a boolean ("true"/"false") or a scale (e.g., "near", "medium", "far")


{
"image_id": "xxxx",
"scene_context": {
"scene_description": "xxxx",
"overall_mood": "xxxxx",
"number_of_people": "xxxx",
"scene_action": "xxxx"
},
"pedestrians": [
{
"id": "person_1",
"body_center": {
"x": "xxxx",
"y": "xxxx"
},
"head_location": {
"x": "xxxx",
"y": "xxxx"
},
"gender": "xxxx",
"mood": "xxxx",
"feeling": "xxxx",
"action": "xxxx",
"close": "xxxx"
},
// Another person can be added here if any
]
}

The produced output:

{
"image_id": "9949",
"scene_context": {
"scene_description": "Two joggers run down a path lined with trees and green grass on a sunny day. Cars are parked nearby.",
"overall_mood": "cheerful",
"number_of_people": "2",
"scene_action": "people are jogging"
},
"pedestrians": [
{
"id": "person_1",
"body_center": {
"x": "439",
"y": "627"
},
"head_location": {
"x": "424",
"y": "205"
},
"gender": "male",
"mood": "happy",
"feeling": "content",
"action": "jogging",
"close": "near"
},
{
"id": "person_2",
"body_center": {
"x": "837",
"y": "646"
},
"head_location": {
"x": "796",
"y": "215"
},
"gender": "female",
"mood": "neutral",
"feeling": "focused",
"action": "jogging",
"close": "near"
}
]
}
The response took about 3.7 seconds in Google AI Studio.


You can try this example with the free tier of Google AI Studio, or generate a token to access the Gemini 2.0 Flash model over the API, which is an advantage over ChatGPT.
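
With such a token, a direct call to the public REST endpoint should look roughly like the sketch below; the image path and prompt text are placeholders.

# Sketch: calling the Gemini 2.0 Flash REST endpoint directly with an API key.
import base64
import requests

API_KEY = "YOUR_API_KEY"  # generated in Google AI Studio
URL = ("https://generativelanguage.googleapis.com/v1beta/"
       f"models/gemini-2.0-flash:generateContent?key={API_KEY}")

with open("frame.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "contents": [{
        "parts": [
            {"text": "Fill the JSON template based on this image, JSON only."},
            {"inline_data": {"mime_type": "image/jpeg", "data": image_b64}},
        ]
    }]
}

r = requests.post(URL, json=payload, timeout=60)
print(r.json()["candidates"][0]["content"]["parts"][0]["text"])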

Concept evaluation

So, this is the core of the application. The IO layer, database, and dashboard are relatively easy to create, and the LLM handles the hard task for you very well. The dashboard can be based on Superset or Grafana. Only these relatively simple building blocks need to be programmed, or better, generated. The core of the application, which is the hard challenge here, is handled by the multimodal model itself: detection, classification, scene recognition, image sentiment analysis, and action recognition. That is a very complex machine learning pipeline that would occupy a huge amount of code if you wanted to program all of it. Now, with these beautiful models, the desired outcome is simply defined in your prompt, and the model does the job instead of code passing images through different stages and different models. A sketch of how the structured output could feed the dashboard database follows.
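
As a rough illustration, the snippet below flattens the model's JSON output into per-person rows in SQLite so that Superset or Grafana can query them; the table layout is an assumption made for this sketch.

# Sketch: flattening the model's JSON output into per-person rows for the dashboard DB.
import json
import sqlite3

db = sqlite3.connect("vision.db")
db.execute("""CREATE TABLE IF NOT EXISTS pedestrians (
    image_id TEXT, person_id TEXT, x INTEGER, y INTEGER,
    gender TEXT, mood TEXT, action TEXT, scene_mood TEXT)""")

def store(result_json):
    """Insert one model response (the JSON shown above) as flat rows."""
    data = json.loads(result_json)
    scene = data["scene_context"]
    for p in data["pedestrians"]:
        db.execute("INSERT INTO pedestrians VALUES (?,?,?,?,?,?,?,?)",
                   (data["image_id"], p["id"],
                    int(p["body_center"]["x"]), int(p["body_center"]["y"]),
                    p["gender"], p["mood"], p["action"], scene["overall_mood"]))
    db.commit()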

Application potential

If you are going to build an application to capture customer satisfaction in retail, line counting, real-world advertisement impact, traffic analysis, or other public-space metrics, this approach is a game changer. Soon we will replace our old computer vision pipeline concepts with one multi-purpose black box, where the vision app is simply defined. The time to market for your computer vision app will be dramatically reduced. This will be a significant shift in computer vision, I would say a whole new era. What do you think?