Gemini API RTSP video stream analysis - detection, satisfaction, gender and action recognition
About ten years ago, I was a big advocate of detection cascades learned by AdaBoost and WaldBoost, with features like LBP, Haar, and HOG, to detect object categories in images. There were no YOLO networks yet, and neural networks in those days were simply too expensive for our projects. That changed very quickly in favor of DNNs of all kinds. Today, multimodal models may seem like overkill for computer vision. Using the Gemini API instead of YOLO looks like bringing a hammer to a chess match, but Gemini can achieve far more than YOLO ever could. Multimodal models will get cheaper and smaller, and they already produce structured outputs, as Gemini does. One black-box model could replace entire computer vision pipelines. For now it is more expensive, but it is also more available and more capable than the classic methods. So let's see if history repeats itself and the new once again replaces the great old achievements of machine learning.
Let's dive into Gemini for vision apps, with JSON responses in Python while reading an RTSP video stream.
RTSP analysis using Google Gemini

Let me start with a prompt to Google's Gemini API for intelligent person detection. You can see the response in the picture above. You can achieve structured JSON output simply by defining a prompt for your application and sending it over the API together with an image. Before you build such an app, you can test it at https://aistudio.google.com. You will need to visit it anyway to get an access token (API key).
prompt = """
ANALYZE and return json by filling as example
Construct JSON where each detected person has:
(Person, Persons bounding box[ymin, xmin, ymax, xmax], mood of person, action of person, satisfaction estimate)
Example:
[
{
"Person": "Adult Male",
"Persons bounding box": [120, 200, 300, 400],
"mood the person": "Neutral",
"action of person": "Standing",
"satisfaction estimate": "Medium"
}
]
"""
The result is JSON. You can process it and draw bounding boxes over the image, which has dimensions of 1000x1000. As you can see, Gemini can estimate a person's action, mood, satisfaction, and much more. My use case further implements motion detection to significantly reduce costs and analyze only the images of interest. A typical response for a single detected person looks like this:
{
"Person": "Adult Male",
"Persons bounding box": [57, 636, 182, 666],
"mood the person": "Neutral",
"action of person": "Walking",
"satisfaction estimate": "Medium"
}
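Note the box format [ymin, xmin, ymax, xmax]. Gemini typically returns these coordinates normalized to a 0-1000 range, which is why the script below resizes every frame to 1000x1000 before drawing. If you keep the original resolution, you need a small conversion helper; the sketch below assumes the 0-1000 convention holds.
def to_pixels(box, width, height):
    # Convert a [ymin, xmin, ymax, xmax] box in 0-1000 space into pixel coordinates
    ymin, xmin, ymax, xmax = box
    return (int(xmin / 1000 * width), int(ymin / 1000 * height),
            int(xmax / 1000 * width), int(ymax / 1000 * height))

# Example: x1, y1, x2, y2 = to_pixels([57, 636, 182, 666], frame_width, frame_height)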
Key parts of the Gemini code
- RTSP stream capture by OpenCV
- Motion detection via background subtraction by OpenCV
- Integration with the Gemini 2.0 Flash API for image analysis
- JSON output parsing and annotation of the frames
This code begins by importing essential libraries.
import cv2
import numpy as np
import base64
import requests
import time
import json
from google import genai
from google.genai import types
These libraries provide the following functionality:
- OpenCV is used for video handling and image processing (motion detection).
- The Google GenAI SDK handles communication with Gemini.
- Requests can talk to the Google REST API directly (see the sketch below).
- JSON is used to process the model output.
- Base64 encodes the image for transport over the REST API.
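The requests and base64 imports are only needed when you skip the SDK and call the REST endpoint yourself. A rough sketch of such a call follows; the endpoint path and payload shape are my assumption and should be double-checked against the current Gemini REST documentation.
import base64
import requests

API_KEY = "YOUR_API_KEY"
URL = f"https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash-exp:generateContent?key={API_KEY}"

def analyze_jpeg_rest(jpeg_bytes, prompt):
    # Send a prompt plus a base64-encoded JPEG to the generateContent REST endpoint
    payload = {
        "contents": [{
            "parts": [
                {"text": prompt},
                {"inline_data": {"mime_type": "image/jpeg",
                                 "data": base64.b64encode(jpeg_bytes).decode()}},
            ]
        }]
    }
    resp = requests.post(URL, json=payload, timeout=60)
    resp.raise_for_status()
    # The first candidate's first text part holds the model's answer
    return resp.json()["candidates"][0]["content"]["parts"][0]["text"]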
In Google AI Studio, you need to create your API key. It is free.
client = genai.Client(api_key='YOUR_API_KEY')
The other important configuration is the RTSP stream source. Mine comes from OBS into rtsp-simple-server.
# === CONFIGURATION ===
RTSP_URL = 'rtsp://your_rtsp_stream'
Motion Detection Loop
Prepare the initial static frame (background) for motion comparison.
cap = cv2.VideoCapture(RTSP_URL)
_, first_frame = cap.read()
first_gray = cv2.cvtColor(first_frame, cv2.COLOR_BGR2GRAY)
first_gray = cv2.GaussianBlur(first_gray, (21, 21), 0)
Then capture frames from the RTSP stream with cap.read() inside the while loop.
ret, frame = cap.read()
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
gray = cv2.GaussianBlur(gray, (21, 21), 0)
diff = cv2.absdiff(first_gray, gray)
This is the absolute difference between the two images in time.
thresh = cv2.threshold(diff, MOTION_THRESHOLD, 255, cv2.THRESH_BINARY)[1]
Binary thresholding creates an image containing only the values 0 and 255, so black and white.
thresh = cv2.dilate(thresh, None, iterations=2)
contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
Determines if detected contours represent significant motion.
movement_detected = any(cv2.contourArea(c) > AREA_THRESHOLD for c in contours)
If there is motion, send the image to Gemini for analysis.
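The snippet above compares every frame against the very first frame. A more adaptive background-subtraction variant could use OpenCV's built-in MOG2 model instead; here is a minimal sketch of that alternative.
# Alternative motion check using OpenCV's MOG2 background subtractor
back_sub = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16, detectShadows=True)

def motion_in(frame, area_threshold=5000):
    # Returns True when the foreground mask contains a blob larger than area_threshold
    fg_mask = back_sub.apply(frame)
    fg_mask = cv2.threshold(fg_mask, 200, 255, cv2.THRESH_BINARY)[1]  # drop shadow pixels (value 127)
    contours, _ = cv2.findContours(fg_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return any(cv2.contourArea(c) > area_threshold for c in contours)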
Sending Frame to Gemini
if movement_detected:
    _, buffer = cv2.imencode('.jpg', frame)
    prompt = """
    ANALYZE and return json by filling 'yy' in the following structure...
    """
    client = genai.Client(api_key='YOUR_API_KEY')
    response = client.models.generate_content(
        model="gemini-2.0-flash-exp",
        contents=[
            prompt,
            types.Part.from_bytes(data=buffer.tobytes(), mime_type="image/jpeg")
        ]
    )
Encodes the frame, constructs a custom prompt for analysis, and sends it to Gemini.
Parsing and visualization
def draw_result(frame, result_json):
    results = json.loads(result_json)
    for person in results:
        box = person["Persons bounding box"]
        ymin, xmin, ymax, xmax = box
        label = f"{person['Person']}, {person['mood the person']}, {person['action of person']}, {person['satisfaction estimate']}"
        cv2.rectangle(frame, (xmin, ymin), (xmax, ymax), (0, 255, 0), 2)
        cv2.putText(...)
Draws bounding boxes and labels returned by Gemini over the video frame.
cv2.imshow("Live", frame)
cv2.imwrite(f"annotated_result_{int(time.time())}.jpg", frame)
Shows and optionally saves annotated frames locally.
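One gotcha: cv2.imshow only refreshes its window while cv2.waitKey pumps the GUI event loop. The full script below only writes the annotated frames to disk, but a live preview could be added inside the loop like this (a small sketch):
cv2.imshow("Live", frame)
# Give the HighGUI event loop ~1 ms per frame; without waitKey the window never updates
if cv2.waitKey(1) & 0xFF == ord('q'):
    break  # press 'q' to stop the preview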
Setup configuration
First, I need an RTSP stream for testing. There are many ways to get one. You can use FFmpeg to create an RTSP stream, for example. I chose OBS Studio, which opens a video file and sends it to rtsp-simple-server. The RTSP server takes the stream from the OBS client and serves it further. The output of rtsp-simple-server (rtsp://localhost:8554/mystream) is then captured by this simple Python program that tests Gemini. The Python script calls the Gemini API with your token and returns structured JSON that can be processed in many ways.
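If you do not want to run OBS, another option is to push a test video to the RTSP server with FFmpeg. The helper below is only a hypothetical convenience wrapper using Python's subprocess; it assumes ffmpeg is installed and an RTSP server such as rtsp-simple-server (MediaMTX) is already listening on port 8554.
import subprocess

def publish_test_stream(video_path, rtsp_url="rtsp://localhost:8554/mystream"):
    # Loop a local video file into an already-running RTSP server via FFmpeg
    cmd = [
        "ffmpeg",
        "-re",                 # read the input at its native frame rate
        "-stream_loop", "-1",  # loop the file forever
        "-i", video_path,
        "-c", "copy",          # no re-encoding
        "-f", "rtsp",
        rtsp_url,
    ]
    return subprocess.Popen(cmd)

# proc = publish_test_stream("test_video.mp4")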
Full source code of Gemini showcase
import cv2
import numpy as np
import base64
import requests
import time
import json
from google import genai
from google.genai import types
# === CONFIGURATION ===
RTSP_URL = 'rtsp://localhost:8554/mystream'
MOTION_THRESHOLD = 25          # minimum per-pixel intensity difference to count as change
AREA_THRESHOLD = 5000          # minimum contour area (in pixels) to treat as real motion
WAIT_SECONDS_AFTER_MOTION = 3  # cool-down between Gemini requests
cap = cv2.VideoCapture(RTSP_URL)
_, first_frame = cap.read()
if first_frame is None:
    raise Exception("Couldn't read from the stream")
first_frame = cv2.resize(first_frame, (1000, 1000))
first_gray = cv2.cvtColor(first_frame, cv2.COLOR_BGR2GRAY)
first_gray = cv2.GaussianBlur(first_gray, (21, 21), 0)
print("Monitoring RTSP stream...")
last_sent = time.time()
def draw_result(frame, result_json):
    try:
        # Remove markdown-style code fences if present
        result_json = result_json.strip()
        if result_json.startswith("```json"):
            result_json = result_json[7:]  # remove ```json\n
        if result_json.endswith("```"):
            result_json = result_json[:-3]
        results = json.loads(result_json)
        for person in results:
            box = person["Persons bounding box"]
            ymin, xmin, ymax, xmax = box
            label = f"{person['Person']}, {person['mood the person']}, {person['action of person']}, {person['satisfaction estimate']}"
            cv2.rectangle(frame, (xmin, ymin), (xmax, ymax), (0, 255, 0), 2)
            cv2.putText(frame, label, (xmin, ymin - 10),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 255, 255), 2)
    except Exception as e:
        print("Could not parse or draw results:", e)
while True:
    ret, framecap = cap.read()
    if not ret:
        print("Frame capture failed, retrying...")
        continue
    frame = cv2.resize(framecap, (1000, 1000))
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (21, 21), 0)
    diff = cv2.absdiff(first_gray, gray)
    thresh = cv2.threshold(diff, MOTION_THRESHOLD, 255, cv2.THRESH_BINARY)[1]
    thresh = cv2.dilate(thresh, None, iterations=2)
    contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    movement_detected = any(cv2.contourArea(c) > AREA_THRESHOLD for c in contours)
    if movement_detected and (time.time() - last_sent) > WAIT_SECONDS_AFTER_MOTION:
        print("Motion detected, sending to Gemini...")
        # Encode the current frame as a JPEG for the API
        _, buffer = cv2.imencode('.jpg', frame)
        # Prompt describing the JSON structure we expect back
        prompt = """
ANALYZE and return json by filling as example
Construct JSON where each detected person has:
(Person, Persons bounding box[ymin, xmin, ymax, xmax], mood of person, action of person, satisfaction estimate)
Example:
[
{
"Person": "Adult Male",
"Persons bounding box": [120, 200, 300, 400],
"mood the person": "Neutral",
"action of person": "Standing",
"satisfaction estimate": "Medium"
}
]
"""
        client = genai.Client(api_key='YOUR_API_KEY')
        response = client.models.generate_content(
            model="gemini-2.0-flash-exp",
            contents=[
                prompt,
                types.Part.from_bytes(data=buffer.tobytes(), mime_type="image/jpeg")
            ]
        )
        result_text = response.text.strip()
        print("Gemini JSON Output:\n", result_text)
        draw_result(frame, result_text)
        cv2.imwrite(f"annotated_result_{int(time.time())}.jpg", frame)
        last_sent = time.time()
cap.release()
Final words
Clap, share, like, and subscribe. This pipeline combines the power of old-school motion detection with Google's new multimodal Gemini model API for smarter surveillance systems. These models are still expensive, but already capable of replacing networks like YOLO, not on cost, but on capabilities. This has already happened many times, and the cost reduction is just a matter of time. Detection, gender estimation, and action recognition will no longer be several steps of your computer vision pipeline but handled by one multimodal black box. How cool is that?
I am curious how models such as Gemini will transform computer vision. Stay tuned for the next experiments with this incredible model.