Gemini API RTSP video stream analysis - detection, satisfaction, gender and action recognition
About ten years ago, I was a big advocate of detection cascades learned by AdaBoost and WaldBoost, with features like LBP, Haar, and HOG, to detect object categories in images. There were no YOLO networks yet, and neural networks in those days were simply too expensive for our projects. That changed very quickly in favor of DNNs of all kinds. Today, multimodal models may seem like overkill for computer vision. Using the Gemini API instead of YOLO looks like bringing a hammer to a chess match, but Gemini can achieve far more than YOLO ever could. Multimodal models will get cheaper and smaller, and they already produce structured outputs, as Gemini does. One black-box model could replace entire computer vision pipelines. For now it is more expensive, but it is also more available and more capable than the classic methods. So let's see if history repeats itself and the new once again replaces the great old achievements of machine learning.
Let's dive into Gemini for vision apps, with JSON responses in Python while reading an RTSP video stream.
RTSP analysis using Google Gemini

Let me start with a prompt to Google's Gemini API for intelligent person detection. You can see the response in the picture above. You can achieve structured JSON output simply by defining a prompt for your application and sending it over the API together with an image. Before you build such an app, you can test it at https://aistudio.google.com. You will need to visit it anyway to get an access token (API key).
prompt = """
ANALYZE and return json by filling as example
Construct JSON where each detected person has:
(Person, Persons bounding box[ymin, xmin, ymax, xmax], mood of person, action of person, satisfaction estimate)
Example:
[
{
"Person": "Adult Male",
"Persons bounding box": [120, 200, 300, 400],
"mood the person": "Neutral",
"action of person": "Standing",
"satisfaction estimate": "Medium"
}
]
"""
The result is JSON. You can process it and draw bounding boxes over the image, which has dimensions of 1000x1000. As you can see, Gemini can estimate a person's action, mood, satisfaction, and much more. My use case further implements motion detection to significantly reduce costs and analyze only the images of interest. A typical response for a single detected person looks like this:
{
"Person": "Adult Male",
"Persons bounding box": [57, 636, 182, 666],
"mood the person": "Neutral",
"action of person": "Walking",
"satisfaction estimate": "Medium"
}
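Note the box format [ymin, xmin, ymax, xmax]. Gemini typically returns these coordinates normalized to a 0-1000 range, which is why the script below resizes every frame to 1000x1000 before drawing. If you keep the original resolution, you need a small conversion helper; the sketch below assumes the 0-1000 convention holds.
def to_pixels(box, width, height):
    # Convert a [ymin, xmin, ymax, xmax] box in 0-1000 space into pixel coordinates
    ymin, xmin, ymax, xmax = box
    return (int(xmin / 1000 * width), int(ymin / 1000 * height),
            int(xmax / 1000 * width), int(ymax / 1000 * height))

# Example: x1, y1, x2, y2 = to_pixels([57, 636, 182, 666], frame_width, frame_height)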
Key parts of the Gemini code
- RTSP stream capture by OpenCV
- Motion detection via background subtraction by OpenCV
- Integration with the Gemini 2.0 Flash API for image analysis
- JSON output parsing and annotation of the frames
This code begins by importing essential libraries.
import cv2
import numpy as np
import base64
import requests
import time
import json
from google import genai
from google.genai import types
These libraries provide the following functionality:
- OpenCV is used for video handling and image processing (motion detection).
- The Google GenAI SDK handles communication with Gemini.
- Requests can talk to the Google REST API directly (see the sketch below).
- JSON is used to process the model output.
- Base64 encodes the image for transport over the REST API.
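The requests and base64 imports are only needed when you skip the SDK and call the REST endpoint yourself. A rough sketch of such a call follows; the endpoint path and payload shape are my assumption and should be double-checked against the current Gemini REST documentation.
import base64
import requests

API_KEY = "YOUR_API_KEY"
URL = f"https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash-exp:generateContent?key={API_KEY}"

def analyze_jpeg_rest(jpeg_bytes, prompt):
    # Send a prompt plus a base64-encoded JPEG to the generateContent REST endpoint
    payload = {
        "contents": [{
            "parts": [
                {"text": prompt},
                {"inline_data": {"mime_type": "image/jpeg",
                                 "data": base64.b64encode(jpeg_bytes).decode()}},
            ]
        }]
    }
    resp = requests.post(URL, json=payload, timeout=60)
    resp.raise_for_status()
    # The first candidate's first text part holds the model's answer
    return resp.json()["candidates"][0]["content"]["parts"][0]["text"]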
In Google AI Studio, you need to create your API key. It is free.
client = genai.Client(api_key='YOUR_API_KEY')
The other important configuration is the RTSP stream source. Mine comes from OBS into rtsp-simple-server.
# === CONFIGURATION ===
RTSP_URL = 'rtsp://your_rtsp_stream'
Motion Detection Loop
Prepare the initial static frame (background) for motion comparison.
cap = cv2.VideoCapture(RTSP_URL)
_, first_frame = cap.read()
first_gray = cv2.cvtColor(first_frame, cv2.COLOR_BGR2GRAY)
first_gray = cv2.GaussianBlur(first_gray, (21, 21), 0)
Then capture frames from the RTSP stream with cap.read() inside the while loop.
ret, frame = cap.read()
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
gray = cv2.GaussianBlur(gray, (21, 21), 0)
diff = cv2.absdiff(first_gray, gray)
This is the absolute difference between the two images in time.
thresh = cv2.threshold(diff, MOTION_THRESHOLD, 255, cv2.THRESH_BINARY)[1]
Binary thresholding creates an image containing only the values 0 and 255, so black and white.
thresh = cv2.dilate(thresh, None, iterations=2)
contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
Determines if detected contours represent significant motion.
movement_detected = any(cv2.contourArea(c) > AREA_THRESHOLD for c in contours)
If there is motion, send the image to Gemini for analysis.
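The snippet above compares every frame against the very first frame. A more adaptive background-subtraction variant could use OpenCV's built-in MOG2 model instead; here is a minimal sketch of that alternative.
# Alternative motion check using OpenCV's MOG2 background subtractor
back_sub = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16, detectShadows=True)

def motion_in(frame, area_threshold=5000):
    # Returns True when the foreground mask contains a blob larger than area_threshold
    fg_mask = back_sub.apply(frame)
    fg_mask = cv2.threshold(fg_mask, 200, 255, cv2.THRESH_BINARY)[1]  # drop shadow pixels (value 127)
    contours, _ = cv2.findContours(fg_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return any(cv2.contourArea(c) > area_threshold for c in contours)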
Sending Frame to Gemini
if movement_detected:
    _, buffer = cv2.imencode('.jpg', frame)
    prompt = """
    ANALYZE and return json by filling 'yy' in the following structure...
    """
    client = genai.Client(api_key='YOUR_API_KEY')
    response = client.models.generate_content(
        model="gemini-2.0-flash-exp",
        contents=[
            prompt,
            types.Part.from_bytes(data=buffer.tobytes(), mime_type="image/jpeg")
        ]
    )
Encodes the frame, constructs a custom prompt for analysis, and sends it to Gemini.
Parsing and visualization
def draw_result(frame, result_json):
    results = json.loads(result_json)
    for person in results:
        box = person["Persons bounding box"]
        ymin, xmin, ymax, xmax = box
        label = f"{person['Person']}, {person['mood the person']}, {person['action of person']}, {person['satisfaction estimate']}"
        cv2.rectangle(frame, (xmin, ymin), (xmax, ymax), (0, 255, 0), 2)
        cv2.putText(...)
Draws bounding boxes and labels returned by Gemini over the video frame.
cv2.imshow("Live", frame)
cv2.imwrite(f"annotated_result_{int(time.time())}.jpg", frame)
Shows and optionally saves annotated frames locally.
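One gotcha: cv2.imshow only refreshes its window while cv2.waitKey pumps the GUI event loop. The full script below only writes the annotated frames to disk, but a live preview could be added inside the loop like this (a small sketch):
cv2.imshow("Live", frame)
# Give the HighGUI event loop ~1 ms per frame; without waitKey the window never updates
if cv2.waitKey(1) & 0xFF == ord('q'):
    break  # press 'q' to stop the preview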
Setup configuration
First, I need an RTSP stream for testing. There are many ways to get one. You can use FFmpeg to create an RTSP stream, for example. I chose OBS Studio, which opens a video file and sends it to rtsp-simple-server. The RTSP server takes the stream from the OBS client and serves it further. The output of rtsp-simple-server (rtsp://localhost:8554/mystream) is then captured by this simple Python program that tests Gemini. The Python script calls the Gemini API with your token and returns structured JSON that can be processed in many ways.
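If you do not want to run OBS, another option is to push a test video to the RTSP server with FFmpeg. The helper below is only a hypothetical convenience wrapper using Python's subprocess; it assumes ffmpeg is installed and an RTSP server such as rtsp-simple-server (MediaMTX) is already listening on port 8554.
import subprocess

def publish_test_stream(video_path, rtsp_url="rtsp://localhost:8554/mystream"):
    # Loop a local video file into an already-running RTSP server via FFmpeg
    cmd = [
        "ffmpeg",
        "-re",                 # read the input at its native frame rate
        "-stream_loop", "-1",  # loop the file forever
        "-i", video_path,
        "-c", "copy",          # no re-encoding
        "-f", "rtsp",
        rtsp_url,
    ]
    return subprocess.Popen(cmd)

# proc = publish_test_stream("test_video.mp4")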
Full source code of Gemini showcase
import cv2
import numpy as np
import base64
import requests
import time
import json
from google import genai
from google.genai import types
# === CONFIGURATION ===
RTSP_URL = 'rtsp://localhost:8554/mystream'
MOTION_THRESHOLD = 25          # minimum per-pixel intensity difference to count as change
AREA_THRESHOLD = 5000          # minimum contour area (in pixels) to treat as real motion
WAIT_SECONDS_AFTER_MOTION = 3  # cool-down between Gemini requests
cap = cv2.VideoCapture(RTSP_URL)
_, first_frame = cap.read()
if first_frame is None:
    raise Exception("Couldn't read from the stream")
first_frame = cv2.resize(first_frame, (1000, 1000))
first_gray = cv2.cvtColor(first_frame, cv2.COLOR_BGR2GRAY)
first_gray = cv2.GaussianBlur(first_gray, (21, 21), 0)
print("Monitoring RTSP stream...")
last_sent = time.time()
def draw_result(frame, result_json):
    try:
        # Remove markdown-style code fences if present
        result_json = result_json.strip()
        if result_json.startswith("```json"):
            result_json = result_json[7:]  # remove ```json\n
        if result_json.endswith("```"):
            result_json = result_json[:-3]
        results = json.loads(result_json)
        for person in results:
            box = person["Persons bounding box"]
            ymin, xmin, ymax, xmax = box
            label = f"{person['Person']}, {person['mood the person']}, {person['action of person']}, {person['satisfaction estimate']}"
            cv2.rectangle(frame, (xmin, ymin), (xmax, ymax), (0, 255, 0), 2)
            cv2.putText(frame, label, (xmin, ymin - 10),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 255, 255), 2)
    except Exception as e:
        print("Could not parse or draw results:", e)
while True:
    ret, framecap = cap.read()
    if not ret:
        print("Frame capture failed, retrying...")
        continue
    frame = cv2.resize(framecap, (1000, 1000))
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (21, 21), 0)
    diff = cv2.absdiff(first_gray, gray)
    thresh = cv2.threshold(diff, MOTION_THRESHOLD, 255, cv2.THRESH_BINARY)[1]
    thresh = cv2.dilate(thresh, None, iterations=2)
    contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    movement_detected = any(cv2.contourArea(c) > AREA_THRESHOLD for c in contours)
    if movement_detected and (time.time() - last_sent) > WAIT_SECONDS_AFTER_MOTION:
        print("Motion detected, sending to Gemini...")
        # Encode the current frame as a JPEG for the API
        _, buffer = cv2.imencode('.jpg', frame)
        # Prompt describing the JSON structure we expect back
        prompt = """
ANALYZE and return json by filling as example
Construct JSON where each detected person has:
(Person, Persons bounding box[ymin, xmin, ymax, xmax], mood of person, action of person, satisfaction estimate)
Example:
[
{
"Person": "Adult Male",
"Persons bounding box": [120, 200, 300, 400],
"mood the person": "Neutral",
"action of person": "Standing",
"satisfaction estimate": "Medium"
}
]
"""
        client = genai.Client(api_key='YOUR_API_KEY')
        response = client.models.generate_content(
            model="gemini-2.0-flash-exp",
            contents=[
                prompt,
                types.Part.from_bytes(data=buffer.tobytes(), mime_type="image/jpeg")
            ]
        )
        result_text = response.text.strip()
        print("Gemini JSON Output:\n", result_text)
        draw_result(frame, result_text)
        cv2.imwrite(f"annotated_result_{int(time.time())}.jpg", frame)
        last_sent = time.time()
cap.release()
Final words
Clap, share, like, and subscribe. This pipeline combines the power of old-school motion detection with Google's new multimodal Gemini model API for smarter surveillance systems. These models are still expensive, but already capable of replacing networks like YOLO, not on cost, but on capabilities. This has already happened many times, and the cost reduction is just a matter of time. Detection, gender estimation, and action recognition will no longer be several steps of your computer vision pipeline but handled by one multimodal black box. How cool is that?
I am curious how models such as Gemini will transform computer vision. Stay tuned for the next experiments with this incredible model.