
What happens when you give a camera the ability to understand what it sees — and let anyone use it for free?

The Problem with Traditional Auto-Tracking
Last month, a church volunteer asked me how to auto-track their pastor during Sunday services. They’d looked at dedicated auto-tracking cameras: $5,000 and up, and limited to following faces.
- What if the pastor walks behind the pulpit?
- What if they want to track the worship leader instead?
This required a new camera, new training, and new expense.

A Zero-Shot Solution
I showed them a browser tab. They typed “person at podium” into a text field, pointed their existing PTZOptics camera at the stage, and clicked Start. The camera followed the pastor. When they changed the text to “person with guitar,” it followed the worship leader.
- No retraining.
- No new hardware.
- No code.
That tool is one of 17 we just open-sourced in the Visual Reasoning Playground.
- Live Demo: github.io/visual-reasoning-playground
- Source: com/streamgeeks/visual-reasoning-playground
The Shift: From Training Models to Talking to Them
Traditional computer vision requires you to train a model for every object you want to detect. For example, to track a basketball, you would need to collect thousands of images, label them, train, validate, and deploy. To track a coffee mug next week, you start over.
Vision Language Models (VLMs) break this loop. A VLM takes an image and a text prompt and returns structured answers. You can ask it to detect objects, describe scenes, or answer questions—all in natural language, all zero-shot.
- No training pipeline.
- No dataset curation.
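In code, that zero-shot loop collapses to one request per frame. A minimal sketch in plain JS, assuming a hypothetical VLM HTTP endpoint; the URL, field names, and response shape are illustrative, not any specific provider's API:

```javascript
// Ask a VLM for every object matching a free-text prompt.
// Endpoint, request fields, and response shape are illustrative only.
async function detect(frameBase64, prompt, apiKey) {
  const res = await fetch('https://api.example.com/v1/detect', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', 'X-API-Key': apiKey },
    body: JSON.stringify({ image: frameBase64, object: prompt }),
  });
  if (!res.ok) throw new Error(`detect failed: ${res.status}`);
  return (await res.json()).objects; // e.g. [{ x_min, y_min, x_max, y_max }]
}

// When several objects match the prompt, pick the largest detection
// (usually the subject closest to the camera).
function largestBox(objects) {
  return objects.reduce((best, o) =>
    (o.x_max - o.x_min) * (o.y_max - o.y_min) >
    (best.x_max - best.x_min) * (best.y_max - best.y_min) ? o : best);
}
```

Swapping the prompt from "basketball" to "coffee mug" is a one-string change; nothing gets retrained.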
The model powering our tools is Moondream (https://moondream.ai), a small, fast VLM designed for real-time visual reasoning. Send it a frame and a prompt like “find the person in the red shirt,” and it returns bounding box coordinates in ~200ms, fast enough to drive a PTZ camera in real time.

What’s in the Playground

All 17 tools are standalone HTML+JS applications:
- No React.
- No build step.
- No npm install.
You can open index.html in a browser and go. While they share a lightweight library for common tasks, each tool is self-contained enough to read and understand quickly.
Here are five tools that tend to stop people mid-scroll:

1. PTZ Auto-Tracker
Describe any object, and the camera follows it.
Workflow:
- Camera Frame → Moondream /detect “person in blue shirt”
- → Bounding box [0.3, 0.2, 0.6, 0.8]
- → Calculate offset from center
- → PTZ command: pan left 3°, tilt up 1°
- → Repeat every 500ms
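The offset math in that loop fits in a few lines. A sketch using normalized coordinates (0..1); the deadband and gain constants are illustrative tuning values, not the Playground's own:

```javascript
// Turn a normalized bounding box [x_min, y_min, x_max, y_max] into a
// pan/tilt correction in degrees. DEADBAND and the gains are tuning
// values chosen for illustration.
const DEADBAND = 0.02; // ignore offsets this close to frame center
const PAN_GAIN = 60;   // degrees of pan per unit of horizontal offset
const TILT_GAIN = 40;  // degrees of tilt per unit of vertical offset

function ptzCorrection([xMin, yMin, xMax, yMax]) {
  const cx = (xMin + xMax) / 2; // subject center, 0..1
  const cy = (yMin + yMax) / 2;
  const dx = cx - 0.5;          // horizontal offset from frame center
  const dy = cy - 0.5;          // vertical offset (y grows downward)
  return {
    pan: Math.abs(dx) > DEADBAND ? dx * PAN_GAIN : 0,    // + = right, − = left
    tilt: Math.abs(dy) > DEADBAND ? -dy * TILT_GAIN : 0, // + = up, − = down
  };
}
```

With the example box [0.3, 0.2, 0.6, 0.8] and these particular gains, the subject sits slightly left of center, so the function returns a 3° pan toward it and no tilt (the box is vertically centered). In the tool, the resulting values are sent to the camera's HTTP API as the next move command.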
It works with PTZOptics cameras or any PTZ camera with an HTTP API. The tracking loop runs entirely in the browser.

2. Gesture OBS Control
Hold up a thumbs-up and your stream switches scenes. An open palm starts recording.
- No StreamDeck.
- No hotkeys.
- No touching anything.
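The mapping from gesture to action can be a small lookup table. A sketch assuming OBS WebSocket v5 request names (`SetCurrentProgramScene`, `StartRecord`); the gesture strings and scene names are illustrative:

```javascript
// Map a VLM's free-text gesture answer to an OBS WebSocket v5 request.
// Gesture phrases and scene names are examples, not the tool's actual set.
const GESTURE_ACTIONS = {
  'thumbs up': { requestType: 'SetCurrentProgramScene', requestData: { sceneName: 'Camera 2' } },
  'open palm': { requestType: 'StartRecord', requestData: {} },
  'peace sign': { requestType: 'StopRecord', requestData: {} },
};

function gestureToRequest(answer) {
  const normalized = answer.toLowerCase();
  // Substring match tolerates verbose answers like
  // "The person is giving a thumbs up."
  for (const [gesture, request] of Object.entries(GESTURE_ACTIONS)) {
    if (normalized.includes(gesture)) return request;
  }
  return null; // no recognized gesture: do nothing this frame
}
```

Returning `null` for unrecognized answers matters in practice: the model is queried every frame, and most frames contain no deliberate gesture.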
The tool sends frames to Moondream asking “what gesture is the person making?” and maps responses to OBS WebSocket commands.

3. Smart Counter
Draw a virtual line across a doorway. The system counts people crossing it, tracking entries in one direction and exits in the other. This can be used for retail foot traffic analytics, event occupancy tracking, or simply counting how many times a pet uses a door.

4. Tracking Comparison: MediaPipe vs. Moondream
This tool is for engineers. It runs MediaPipe (local, ~10ms, limited to faces/hands/poses) and Moondream (cloud, ~200ms, tracks anything you describe) side by side on the same camera feed, allowing real-time comparison of latency, accuracy, and flexibility tradeoffs.

5. Voice Triggers
OpenAI’s Whisper model runs entirely in the browser via WebGPU/WASM.
- No API key.
- No server.
- No data leaving your machine.
You can define trigger phrases—such as “camera one,” “start recording,” or “wide shot”—and map them to actions. The ~40MB model downloads once and caches locally.
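Matching a transcript against trigger phrases is simpler than it sounds. A sketch with illustrative phrases and action names; normalization ensures "Camera one!" still matches "camera one":

```javascript
// Match a speech-to-text transcript against user-defined trigger phrases.
// The phrase list and action names are examples.
const TRIGGERS = new Map([
  ['camera one', 'switchToCamera1'],
  ['start recording', 'startRecording'],
  ['wide shot', 'recallWidePreset'],
]);

function matchTrigger(transcript) {
  // Lowercase and strip punctuation so spoken phrasing variations still hit.
  const clean = transcript
    .toLowerCase()
    .replace(/[^a-z0-9 ]/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
  for (const [phrase, action] of TRIGGERS) {
    if (clean.includes(phrase)) return action;
  }
  return null; // nothing recognized in this utterance
}
```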
Every tool follows the same pattern: a video source feeds frames to a client, which calls the Moondream API, and the response is mapped to an action.
The shared library handles the plumbing:
- moondream-client.js manages API calls and retries.
- video-source-adapter.js normalizes webcams and video files into a consistent frame source.
- api-key-manager.js stores credentials in localStorage.
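That shared pattern is small enough to sketch as one function with injected dependencies. The method names here (`nextFrame`, `detect`) are illustrative, not the shared library's actual API:

```javascript
// The frame -> inference -> action loop every tool follows, with the
// video source, API client, and action handler injected. In the real
// tools these come from video-source-adapter.js and moondream-client.js;
// the method names here are assumed for illustration.
async function runLoop({ source, client, act, prompt, intervalMs = 500 }) {
  for (;;) {
    const frame = await source.nextFrame(); // webcam or sample video file
    if (frame === null) break;              // source exhausted
    const result = await client.detect(frame, prompt);
    act(result);                            // PTZ move, OBS call, counter++
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Injecting the pieces keeps each tool's app.js down to wiring: pick a source, pick a prompt, and decide what the action does.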
Vanilla JS was a deliberate choice. These tools are meant to be read, forked, and modified by people who aren’t primarily software developers—broadcast engineers, AV integrators, and streaming hobbyists. A 200-line app.js that anyone can follow beats a webpack bundle every time.

Why Open Source, Why Now
VLMs crossed a practical threshold in the last year: models like Moondream now run inference fast enough for real-time camera control. Browser APIs (WebRTC, WebGPU, Web Speech) are also mature enough to build serious applications without a backend.
The broadcast and ProAV industry is full of experts who intimately understand cameras, switchers, and production workflows but haven’t had an on-ramp to AI that speaks their language. These tools are that on-ramp. Each tool includes sample videos for experimentation without any hardware.

The companion book, Visual Reasoning AI for Broadcast and ProAV (https://visualreasoning.ai/book), covers the theory. The Playground is the hands-on half: working code you can run in the next 30 seconds.

Try It
Everything runs live on GitHub Pages right now. No installation, no sign-up.
- Grab a free API key from console.moondream.ai.
- Open any tool.
- Paste in the key.
To run locally or contribute: git clone, python server.py, done.
The repo is MIT-licensed. You’re encouraged to fork it, break it apart, and use pieces in your own projects.
About the Author
Paul Richards is CRO at PTZOptics and Chief Streaming Officer at StreamGeeks. He has authored over 10 books on audiovisual and live streaming technology. His latest book, Visual Reasoning AI for Broadcast and ProAV (https://visualreasoning.ai/book), is available now.
