How It Works
Visual search and semantic transcript search run simultaneously when you type a query. Results are ranked by combined relevance so the best matches surface first.
SigLIP visual embeddings (a CLIP-style image-text model) match your text description against scene thumbnails. Search for “sunset over water” or “person writing on whiteboard” and get frame-accurate results from scenes that look like what you described.
Sentence embeddings (MiniLM) match the meaning of your query against transcript segments. “Discussing the project timeline” finds “we need to figure out when this ships” even though the words are completely different.
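The combined ranking can be sketched as a weighted merge of the two result lists. This is an illustrative sketch, not the app's actual code: it assumes each backend returns scene-id-to-score maps already normalized to 0..1, and the function and parameter names are hypothetical.

```python
def merge_results(visual, transcript, visual_weight=0.5):
    """Merge two ranked result lists into one combined ranking.

    `visual` and `transcript` map scene_id -> relevance score in [0, 1].
    A scene found by both backends gets a weighted sum of both scores,
    so it naturally outranks scenes matched by only one backend.
    """
    combined = {}
    for scene_id, score in visual.items():
        combined[scene_id] = visual_weight * score
    for scene_id, score in transcript.items():
        combined[scene_id] = combined.get(scene_id, 0.0) + (1 - visual_weight) * score
    # Best combined relevance first
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

# "s2" matches both visually (0.8) and in the transcript (0.9),
# so it surfaces above scenes matched by a single backend.
ranking = merge_results(
    visual={"s1": 0.9, "s2": 0.8},
    transcript={"s2": 0.9, "s3": 0.7},
)
```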
Find Similar
Click “Find Similar” on any scene card to search for visually similar scenes across your library. Results are ranked by cosine similarity between SigLIP embeddings.
Search within the current video or across your entire library.
Results show thumbnails, similarity scores, and match reason pills.
Click any result to jump to that scene in its video.
Works without visual models, too: if the SigLIP model isn't downloaded, Find Similar falls back to text-based similarity (Tantivy BM25).
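The similarity ranking above amounts to scoring every scene embedding against the query scene's embedding. A minimal sketch, assuming embeddings are plain float vectors; the function names are illustrative, not the app's API:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def find_similar(query_embedding, library, top_k=5):
    """Rank scenes in `library` (scene_id -> embedding) by visual similarity."""
    scored = [(sid, cosine_similarity(query_embedding, emb))
              for sid, emb in library.items()]
    scored.sort(key=lambda kv: kv[1], reverse=True)
    return scored[:top_k]

# A scene pointing the same direction as the query ranks first;
# an orthogonal one ranks last.
results = find_similar(
    [1.0, 0.0],
    {"same": [2.0, 0.0], "close": [0.7, 0.3], "different": [0.0, 1.0]},
)
```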
Search Features
Filter results by dominant scene color. Click a color swatch in the search filters to find scenes with matching color palettes.
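Under the hood, a color filter like this reduces to a distance check between each scene's dominant color and the chosen swatch. A hypothetical sketch using Euclidean distance in RGB space; the threshold and names are assumptions, not the app's implementation:

```python
def color_distance(c1, c2):
    """Euclidean distance between two (r, g, b) colors."""
    return sum((a - b) ** 2 for a, b in zip(c1, c2)) ** 0.5

def filter_by_color(scenes, swatch, max_distance=60):
    """Keep scenes whose dominant color is close to the chosen swatch.

    `scenes` maps scene_id -> dominant (r, g, b) tuple; `max_distance`
    is an illustrative tolerance, not a real product setting.
    """
    return [sid for sid, rgb in scenes.items()
            if color_distance(rgb, swatch) <= max_distance]

# A red swatch keeps the red-ish sunset scene and drops the blue one.
matches = filter_by_color(
    {"sunset": (230, 60, 40), "ocean": (20, 80, 200)},
    swatch=(255, 40, 30),
)
```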
Automatically detected objects (people, vehicles, animals, text, props) are searchable alongside visual descriptions. Search for “laptop” or “red car” to find every occurrence.
Both AI models (SigLIP and MiniLM ONNX) run locally on your machine. No API calls, no per-query cost. Visual embeddings are stored in your local search index.
Visual search models are optional and download on demand. SigLIP handles image-text matching, MiniLM handles semantic text similarity. Both are ONNX format and run with CUDA or Metal acceleration when available, with CPU fallback. Your search index works without them (keyword search only) until you choose to enable visual search.
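The acceleration-with-CPU-fallback behavior maps onto ONNX Runtime's execution-provider list, where providers are tried in preference order. The provider names below are real ONNX Runtime identifiers (on macOS, Metal-class acceleration is reached via the CoreML provider), but this selection helper is an illustrative sketch, not the app's actual code:

```python
def pick_providers(available):
    """Choose an execution-provider order: GPU/NPU first, CPU as fallback.

    `available` mimics the output of onnxruntime's get_available_providers().
    Providers absent from the machine are skipped, and CPU is always a
    valid last resort.
    """
    preferred = [
        "CUDAExecutionProvider",    # NVIDIA GPUs
        "CoreMLExecutionProvider",  # Apple silicon (Metal/ANE via CoreML)
        "CPUExecutionProvider",     # always-available fallback
    ]
    chosen = [p for p in preferred if p in available]
    return chosen or ["CPUExecutionProvider"]

# On a CUDA machine the GPU provider leads; CPU remains the fallback.
order = pick_providers(["CPUExecutionProvider", "CUDAExecutionProvider"])
```

The chosen list would then be passed as the `providers` argument when creating an ONNX Runtime inference session.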