Intelligent Surveillance Beyond Detection
A hybrid multimodal pipeline combining lightweight edge screening with Vision Language Model verification for violence detection in CCTV environments.
We record almost everything. We understand almost none of it.
Modern surveillance is bottlenecked not by cameras, but by interpretation. The footage exists. The question is how to read it intelligently — without watching every frame, and without paying to run a heavy model on all of it.
Traditional CCTV systems generate enormous volumes of video. A mid-sized transit hub can produce more footage in a single day than a team could review in a year.
Monitoring everything manually is unrealistic. Operator vigilance falls off within minutes, and the events that matter — assaults, altercations, crowd violence — are rare, brief, and easy to miss.
Yet processing every frame with a large multimodal model is prohibitively expensive. At scale, the inference bill alone makes naïve “run the big model everywhere” deployments commercially impossible.
Efficiency vs. Robustness
Lightweight edge models are cheap and fast, but brittle — they miss nuance and raise false alarms. Large vision–language models are accurate and explainable, but slow and costly.
A two-tier pipeline that spends compute only where it counts
Cheap screening runs everywhere at the edge. Expensive reasoning runs rarely, in the cloud. A single routing decision separates the two — and defines the system.
CCTV Footage
INGEST: Continuous multi-camera RTSP streams are decoded into frame sequences at the edge node.
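As a rough illustration of how a decoded frame stream might be cut into clips for the screener, the sketch below groups frames into fixed-length, overlapping windows. The function name `sliding_clips` and the `clip_len` and `stride` values are assumptions made for this example; the source does not state the clip length or overlap actually used.

```python
from collections import deque
from typing import Iterable, Iterator, List, TypeVar

Frame = TypeVar("Frame")


def sliding_clips(
    frames: Iterable[Frame], clip_len: int = 16, stride: int = 8
) -> Iterator[List[Frame]]:
    """Yield overlapping fixed-length clips from a (possibly endless) frame stream.

    clip_len and stride are hypothetical defaults, not values from the source.
    """
    if clip_len <= 0 or not 0 < stride <= clip_len:
        raise ValueError("require clip_len > 0 and 0 < stride <= clip_len")
    window: deque = deque(maxlen=clip_len)  # keeps only the newest clip_len frames
    since_last = 0                          # frames seen since the last emitted clip
    for frame in frames:
        window.append(frame)
        since_last += 1
        if len(window) == clip_len and since_last >= stride:
            yield list(window)
            since_last = 0


# Integers stand in for decoded frames.
for clip in sliding_clips(range(10), clip_len=4, stride=2):
    print(clip)
```

Because the window is a bounded deque, memory stays constant no matter how long the stream runs, which suits an always-on edge node.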
Edge Screening
STAGE 1: A lightweight spatiotemporal classifier runs on every clip. It is cheap, fast and always-on.
Suspicion Scoring
STAGE 2: Motion energy, pose dynamics and classifier confidence are fused into a single calibrated suspicion score.
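One plausible way to fuse the three signals is a logistic combination, sketched below. This is an illustrative assumption, not the dissertation's confirmed method: the `ClipSignals` type, the weights and the bias are all hypothetical, and in practice they would be fitted on labelled validation clips so that the output is calibrated.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipSignals:
    motion_energy: float    # assumed normalised to [0, 1]
    pose_dynamics: float    # assumed normalised to [0, 1]
    classifier_conf: float  # screener's violence probability in [0, 1]


# Hypothetical parameters for illustration only.
WEIGHTS = (2.0, 1.5, 3.0)
BIAS = -3.0


def suspicion_score(s: ClipSignals) -> float:
    """Fuse the three signals into one score in (0, 1) with a logistic function."""
    z = (
        BIAS
        + WEIGHTS[0] * s.motion_energy
        + WEIGHTS[1] * s.pose_dynamics
        + WEIGHTS[2] * s.classifier_conf
    )
    return 1.0 / (1.0 + math.exp(-z))


calm = ClipSignals(motion_energy=0.1, pose_dynamics=0.1, classifier_conf=0.05)
tense = ClipSignals(motion_energy=0.9, pose_dynamics=0.8, classifier_conf=0.7)
print(suspicion_score(calm) < suspicion_score(tense))  # True
```

With positive weights the score rises monotonically in each signal, so a sharp increase in any one of them pushes the clip towards escalation.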
Selective Escalation
ROUTER: Only clips above an uncertainty-aware threshold are escalated. The rest are logged and discarded.
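A minimal sketch of such a routing decision follows. It assumes one particular reading of "uncertainty-aware": the threshold is lowered when the screener's confidence sits near 0.5, so ambiguous clips escalate more readily. The `route` function, the `base_threshold` and `margin` values, and the uncertainty formula are all hypothetical choices for illustration.

```python
from enum import Enum


class Route(Enum):
    ESCALATE = "escalate"
    LOG_AND_DISCARD = "log_and_discard"


def route(
    score: float,
    classifier_conf: float,
    base_threshold: float = 0.6,  # hypothetical value
    margin: float = 0.2,          # hypothetical value
) -> Route:
    """Escalate when the suspicion score clears an uncertainty-adjusted threshold."""
    # Uncertainty is 1.0 when the screener outputs 0.5 and 0.0 when it outputs 0 or 1.
    uncertainty = 1.0 - abs(2.0 * classifier_conf - 1.0)
    effective_threshold = base_threshold - margin * uncertainty
    return Route.ESCALATE if score >= effective_threshold else Route.LOG_AND_DISCARD


print(route(score=0.7, classifier_conf=0.95))  # Route.ESCALATE
print(route(score=0.5, classifier_conf=0.50))  # Route.ESCALATE
print(route(score=0.5, classifier_conf=0.02))  # Route.LOG_AND_DISCARD
```

The second and third calls share the same suspicion score but route differently, which is the sense in which confidence acts as a routing signal rather than a binary gate.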
VLM Verification
STAGE 3: A Vision Language Model inspects escalated clips, reasoning over the scene and producing a natural-language judgement.
Decision Engine
OUTPUT: Verified events are scored, explained, evidenced and routed to operators with a full audit trail.
One threshold decides whether the expensive model ever runs
Selective escalation is the heart of the system. Watch how identical infrastructure routes two very different scenes — and why most footage never costs a cloud call.
A sharp rise in motion energy and abnormal pose dynamics pushes suspicion past threshold. The clip is escalated for Vision Language Model verification.
Upload footage. Get instant AI analysis.
Drop any CCTV clip and the pipeline screens at the edge, then escalates only what's ambiguous to the Vision Language Model — with a full natural-language rationale attached.
This is a research demonstration interface using pre-computed dissertation experiment outputs. It is not a deployed surveillance product.
Robustness held, cost collapsed
Evaluated on 20 clips from the RWF-2000 dataset validation partition (10 violence, 10 non-violence). The hybrid pipeline retains strong accuracy while routing only 11 of 20 clips to the cloud model.
The hybrid average stays close to edge latency because the costly VLM path is taken rarely.
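The relationship between escalation rate and average latency can be written down under a simple assumed cost model: every clip pays the edge screening cost, and only escalated clips additionally pay the VLM cost (network transfer included). The symbols below are introduced here for illustration and do not come from the dissertation.

```latex
\bar{T}_{\text{hybrid}} = T_{\text{edge}} + p_{\text{esc}} \, T_{\text{VLM}},
\qquad
p_{\text{esc}} = \frac{\text{clips escalated}}{\text{clips screened}}
```

In the evaluated configuration, $p_{\text{esc}} = 11/20 = 0.55$. Under this model the hybrid average approaches $T_{\text{edge}}$ as $p_{\text{esc}}$ falls towards zero.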
What this work adds to the field
Five contributions, each addressing a gap left by single-model surveillance systems.
Hybrid AI Architecture
A formalised two-tier design that composes a lightweight discriminative screener with a generative vision–language verifier, rather than treating them as alternatives.
Edge–Cloud Collaboration
A routing contract that keeps always-on inference local and reserves bandwidth-heavy reasoning for the cloud, only when justified.
Computational Efficiency
A 45% reduction in multimodal verification requests in the evaluated configuration (11/20 clips escalated), with 85% classification accuracy maintained.
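The 45% figure follows directly from the escalation count, as the short check below shows. It assumes the baseline is a VLM-everywhere deployment that sends all 20 clips to the cloud model; the helper name `cloud_call_reduction` is hypothetical.

```python
def cloud_call_reduction(escalated: int, total: int) -> float:
    """Fraction of cloud verification requests avoided versus sending every clip."""
    if total <= 0 or not 0 <= escalated <= total:
        raise ValueError("require 0 <= escalated <= total and total > 0")
    return (total - escalated) / total


print(f"{cloud_call_reduction(11, 20):.0%}")  # 45%
```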
Uncertainty Handling
An uncertainty-aware escalation threshold that treats the screener's confidence as a first-class routing signal, not a binary gate.
Deployment Realism
Designed around real-world constraints — bandwidth, cost ceilings, operator load and auditability — rather than benchmark conditions alone.
The research, in brief
This platform is the public-facing artefact accompanying a Coventry University Computer Science dissertation.
Can a hybrid pipeline that screens at the edge and verifies with a vision–language model match single-model accuracy for violence detection, while substantially reducing computational cost?
Objectives
- Design a hybrid edge–cloud pipeline that screens all footage cheaply and verifies selectively.
- Define an uncertainty-aware escalation policy that minimises cloud cost at fixed recall.
- Quantify the accuracy–cost trade-off against single-model baselines.
Methodology
- A lightweight spatiotemporal classifier provides a calibrated suspicion score at the edge.
- A vision–language model verifies only escalated clips, producing an explainable judgement.
- Threshold selection is framed as a constrained optimisation over a labelled validation set.
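The constrained optimisation in the last point can be sketched as a search over candidate thresholds: choose the highest threshold, and therefore the fewest escalations, that still escalates at least a target fraction of the violent validation clips. The function name `select_threshold`, the `min_recall` default and the sample scores are assumptions for illustration; "recall" here means recall at the router, before VLM verification.

```python
from typing import Sequence


def select_threshold(
    scores: Sequence[float], labels: Sequence[int], min_recall: float = 0.9
) -> float:
    """Return the highest threshold whose escalation recall is at least min_recall.

    A clip is escalated when score >= threshold; labels use 1 for violence.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    positives = sum(1 for y in labels if y == 1)
    if positives == 0 or not 0.0 <= min_recall <= 1.0:
        raise ValueError("need at least one positive and min_recall in [0, 1]")
    best = min(scores)  # the lowest score escalates everything, so recall is 1.0
    for t in sorted(set(scores)):
        caught = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        if caught / positives >= min_recall:
            best = t  # thresholds ascend, so the last feasible one is the highest
    return best


# Made-up validation scores, not dissertation data.
scores = [0.1, 0.2, 0.4, 0.6, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1]
print(select_threshold(scores, labels, min_recall=1.0))  # 0.4
```

Because the number of escalated clips can only fall as the threshold rises, picking the highest feasible threshold is equivalent to minimising cloud calls under the recall constraint.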
Experimental Evaluation
- Benchmarked on a balanced held-out set (20 RWF-2000 validation clips: 10 violence, 10 non-violence) against edge-only and VLM-everywhere baselines.
- Reported accuracy, precision, recall, F1, per-clip latency and cloud-call volume.
- Ablations isolate the contribution of the uncertainty-aware threshold.
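For reference, the classification metrics listed above can be computed from per-clip labels and predictions as in the sketch below. The helper name `classification_metrics` and the sample labels are made up for the example and are not the dissertation's results.

```python
from typing import Dict, Sequence


def classification_metrics(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for binary labels (1 = violence)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Illustrative labels only.
metrics = classification_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])
print(metrics)
```

Undefined ratios (for example, precision when nothing is predicted positive) are reported as 0.0 here, which is one common convention rather than the only one.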
Future Research
- Online adaptation of the escalation threshold to scene and time-of-day drift.
- Multi-camera spatial reasoning for tracking events across overlapping views.
- Human-in-the-loop feedback to continually recalibrate the screener.