Coventry University · CS Dissertation

Intelligent Surveillance Beyond Detection

A hybrid multimodal pipeline combining lightweight edge screening with Vision Language Model verification for violence detection in CCTV environments.

F1 Score: 86%
VLM Calls: −45%
Edge Latency: 38 ms

01 The Research Problem

We record almost everything. We understand almost none of it.

Modern surveillance is bottlenecked not by cameras, but by interpretation. The footage exists. The question is how to read it intelligently — without watching every frame, and without paying to run a heavy model on all of it.

Traditional CCTV systems generate enormous volumes of video. A mid-sized transit hub can produce more footage in a single day than a team could review in a year.

Monitoring everything manually is unrealistic. Operator vigilance falls off within minutes, and the events that matter — assaults, altercations, crowd violence — are rare, brief, and easy to miss.

Yet processing every frame with a large multimodal model is prohibitively expensive. At scale, the inference bill alone makes naïve “run the big model everywhere” deployments commercially impossible.

The Core Tension

Efficiency vs. Robustness

Lightweight edge models are cheap and fast, but brittle — they miss nuance and raise false alarms. Large vision–language models are accurate and explainable, but slow and costly.

Edge classifier, cost: 12%
Edge classifier, robustness: 58%
VLM everywhere, cost: 96%
VLM everywhere, robustness: 91%

Hours of footage per camera, monthly

A 64-camera site streams continuously, around the clock.

Of footage a human operator actually watches

Attention degrades sharply after 20 minutes of monitoring.

Compute cost of frame-by-frame VLM analysis

Relative to a lightweight edge classifier baseline.

02 System Architecture

A two-tier pipeline that spends compute only where it counts

Cheap screening runs everywhere at the edge. Expensive reasoning runs rarely, in the cloud. A single routing decision separates the two — and defines the system.

Edge Tier

Cloud Tier

INGEST · CCTV Footage: Continuous multi-camera RTSP streams are decoded into frame sequences at the edge node.
STAGE 1 · Edge Screening: A lightweight spatiotemporal classifier runs on every clip; it is cheap, fast and always-on.
STAGE 2 · Suspicion Scoring: Motion energy, pose dynamics and classifier confidence are fused into a single calibrated suspicion score.
ROUTER · Selective Escalation: Only clips above an uncertainty-aware threshold are escalated. The rest are logged and discarded.
STAGE 3 · VLM Verification: A Vision Language Model inspects escalated clips, reasoning over the scene and producing a natural-language judgement.
OUTPUT · Decision Engine: Verified events are scored, explained, evidenced and routed to operators with a full audit trail.
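
The suspicion-scoring stage can be illustrated with a minimal sketch. It assumes the three signals are already normalised to [0, 1] and fuses them with a weighted sum passed through a logistic function. The weights, the bias and the function name `suspicion_score` are hypothetical. The page does not say how the fusion or its calibration is actually done, and a calibrated version would fit these parameters on labelled data rather than picking them by hand.

```python
import math

# Hypothetical fusion weights and bias; the source does not publish its values.
WEIGHTS = {"motion_energy": 2.0, "pose_dynamics": 1.5, "classifier_conf": 3.0}
BIAS = -3.0


def suspicion_score(motion_energy: float, pose_dynamics: float,
                    classifier_conf: float) -> float:
    """Fuse three signals in [0, 1] into one suspicion score in (0, 100)."""
    z = (WEIGHTS["motion_energy"] * motion_energy
         + WEIGHTS["pose_dynamics"] * pose_dynamics
         + WEIGHTS["classifier_conf"] * classifier_conf
         + BIAS)
    # The logistic function squashes the weighted sum into (0, 1).
    return 100.0 / (1.0 + math.exp(-z))


calm = suspicion_score(0.1, 0.1, 0.05)
agitated = suspicion_score(0.9, 0.8, 0.9)
print(f"calm scene: {calm:.0f}/100, agitated scene: {agitated:.0f}/100")
```

With these illustrative weights a calm scene scores well below 50 and an agitated one well above it, which is the separation the router relies on.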

03 Selective Escalation

One threshold decides whether the expensive model ever runs

Selective escalation is the heart of the system. Watch how identical infrastructure routes two very different scenes — and why most footage never costs a cloud call.

Suspicion Timeline · t0 → t6 · Threshold 50

Peak Suspicion: 88/100
Edge Cost: 38 ms
Routing Decision: Escalated to VLM

A sharp rise in motion energy and abnormal pose dynamics pushes suspicion past threshold. The clip is escalated for Vision Language Model verification.
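
The routing decision described above can be sketched in a few lines. This is an illustration under stated assumptions, not the dissertation's implementation: it assumes a clip is escalated when its peak suspicion exceeds the threshold. The names `route_clip` and `RoutingDecision` and the per-step scores are hypothetical. Only the threshold of 50 and the peak of 88 come from the page.

```python
from dataclasses import dataclass
from typing import Sequence

THRESHOLD = 50  # threshold value shown on the page


@dataclass
class RoutingDecision:
    peak: int
    escalate: bool


def route_clip(timeline: Sequence[int], threshold: int = THRESHOLD) -> RoutingDecision:
    """Escalate a clip if its peak suspicion exceeds the threshold."""
    peak = max(timeline)
    return RoutingDecision(peak=peak, escalate=peak > threshold)


# Hypothetical per-step suspicion scores for t0..t6; only the peak of 88
# and the threshold of 50 are taken from the page.
violent_scene = [12, 18, 35, 61, 88, 74, 40]
quiet_scene = [8, 10, 14, 11, 9, 12, 10]

for name, scene in [("violent", violent_scene), ("quiet", quiet_scene)]:
    decision = route_clip(scene)
    target = "cloud VLM" if decision.escalate else "local log"
    print(f"{name}: peak {decision.peak}/100 -> {target}")
```

The quiet scene never crosses the threshold, so it is logged locally and costs no cloud call.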

Data Path: Edge Screening → (escalate to cloud VLM) → VLM Verification

04 Video Upload & Analysis

Upload footage. Get instant AI analysis.

Drop any CCTV clip and the pipeline screens at the edge, then escalates only what's ambiguous to the Vision Language Model — with a full natural-language rationale attached.

This is a research demonstration interface using pre-computed dissertation experiment outputs. It is not a deployed surveillance product.

Drop your CCTV footage here

Supports MP4, AVI, MOV, MKV — up to 2 GB

Edge screening: MobileNetV2 screener
Suspicion scoring: Threshold-based routing
AI rationale: Natural-language output

05 Research Evaluation

Robustness held, cost collapsed

Evaluated on 20 clips from the validation partition of the RWF-2000 dataset (10 violence, 10 non-violence). The hybrid pipeline retains 85% classification accuracy while routing only 11 of 20 clips to the cloud model.

Accuracy

85%

Overall correctness

Precision

Few false alarms

Recall

Few missed events

F1 Score

86%

Balanced measure

Confusion Matrix · n = 20 · rows: Actual Violent, Actual Normal · columns: Pred. Violent, Pred. Normal · legend: Correct, Error
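
The four headline metrics all follow from the confusion-matrix counts. The sketch below defines them, then searches every possible matrix on a 10 violent / 10 normal split for counts whose rounded accuracy and F1 agree with the 85% and 86% figures quoted elsewhere on this page. It assumes both figures describe this same 20-clip evaluation. Any counts it finds are an inference from the rounded figures, not values reported by the source.

```python
def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Search every matrix on a 10 violent / 10 normal split for counts whose
# rounded accuracy and F1 match the quoted 85% and 86%.
matches = []
for tp in range(1, 11):        # tp >= 1 avoids division by zero
    for tn in range(0, 11):
        fn, fp = 10 - tp, 10 - tn
        m = metrics(tp, fp, fn, tn)
        if round(m["accuracy"] * 100) == 85 and round(m["f1"] * 100) == 86:
            matches.append((tp, fp, fn, tn))

print(matches)  # tuples are (tp, fp, fn, tn)
```

Under these assumptions the only match is 9 true positives, 2 false positives, 1 false negative and 8 true negatives, which would imply a precision of about 82% and a recall of 90%.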

Latency Analysis · per clip

Edge screen only: 38 ms
Hybrid pipeline (avg): 142 ms
VLM on every clip: 920 ms

The hybrid average (142 ms) sits far below the VLM-on-every-clip figure (920 ms) because only escalated clips pay for the costly VLM path.

−45%

VLM API calls

11 of 20 clips escalated; 9 handled locally by the edge screener.

API calls saved

9 of 20 baseline VLM calls avoided; baseline = 20, hybrid = 11.
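
The reduction follows directly from the two call counts above. A short check, with a hypothetical helper name `call_reduction`:

```python
def call_reduction(baseline_calls: int, hybrid_calls: int) -> float:
    """Fraction of baseline VLM calls avoided by the hybrid pipeline."""
    return (baseline_calls - hybrid_calls) / baseline_calls


saved = call_reduction(baseline_calls=20, hybrid_calls=11)
print(f"{saved:.0%} of VLM calls avoided")  # 9 / 20 = 45%
```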

06 Research Contribution

What this work adds to the field

Five contributions, each addressing a gap left by single-model surveillance systems.

Hybrid AI Architecture

A formalised two-tier design that composes a lightweight discriminative screener with a generative vision–language verifier, rather than treating them as alternatives.

Edge–Cloud Collaboration

A routing contract that keeps always-on inference local and reserves bandwidth-heavy reasoning for the cloud, only when justified.

Computational Efficiency

A 45% reduction in multimodal verification requests in the evaluated configuration (11/20 clips escalated), with 85% classification accuracy maintained.

Uncertainty Handling

An uncertainty-aware escalation threshold that treats the screener's confidence as a first-class routing signal, not a binary gate.

Deployment Realism

Designed around real-world constraints — bandwidth, cost ceilings, operator load and auditability — rather than benchmark conditions alone.

07 About the Dissertation

The research, in brief

This platform is the public-facing artefact accompanying a Coventry University Computer Science dissertation.

Research Question

Can a hybrid pipeline that screens at the edge and verifies with a vision–language model match single-model accuracy for violence detection, while substantially reducing computational cost?

Objectives

Design a hybrid edge–cloud pipeline that screens all footage cheaply and verifies selectively.
Define an uncertainty-aware escalation policy that minimises cloud cost at fixed recall.
Quantify the accuracy–cost trade-off against single-model baselines.

Methodology

A lightweight spatiotemporal classifier provides a calibrated suspicion score at the edge.
A vision–language model verifies only escalated clips, producing an explainable judgement.
Threshold selection is framed as a constrained optimisation over a labelled validation set.
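
One way to read "constrained optimisation over a labelled validation set" is: choose the threshold that escalates the fewest clips while keeping recall on violent clips at or above a target. The sketch below implements that reading. The function name `select_threshold`, the `score >= threshold` escalation rule, the validation scores and the 0.8 recall target are all assumptions for illustration. It also assumes every escalated violent clip is confirmed by the verifier. The dissertation's actual formulation may differ.

```python
from typing import Optional, Sequence


def select_threshold(scores: Sequence[float], labels: Sequence[int],
                     min_recall: float) -> Optional[float]:
    """Pick the threshold that escalates the fewest clips while keeping
    recall on violent clips (label 1) at or above min_recall.

    Assumes a clip is escalated when score >= threshold, and that every
    escalated violent clip counts as recalled.
    """
    positives = sum(labels)
    if positives == 0:
        return None  # recall is undefined without any violent clips
    # Candidate thresholds are the observed scores, highest first, so the
    # number of escalations grows as we move down the list. The first
    # feasible candidate is therefore the cheapest one.
    for t in sorted(set(scores), reverse=True):
        escalated = [y for s, y in zip(scores, labels) if s >= t]
        if sum(escalated) / positives >= min_recall:
            return t
    return None


# Hypothetical validation scores (0-100) and labels (1 = violent).
val_scores = [91, 84, 77, 62, 55, 48, 40, 33, 21, 10]
val_labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

threshold = select_threshold(val_scores, val_labels, min_recall=0.8)
calls = sum(s >= threshold for s in val_scores)
print(f"threshold={threshold}, cloud calls={calls}/{len(val_scores)}")
```

Raising the recall target pushes the chosen threshold down and the number of cloud calls up, which is the accuracy-cost trade-off the objectives set out to quantify.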

Experimental Evaluation

Benchmarked on a balanced held-out set against edge-only and VLM-everywhere baselines.
Reported accuracy, precision, recall, F1, per-clip latency and cloud-call volume.
Ablations isolate the contribution of the uncertainty-aware threshold.

Future Research

Online adaptation of the escalation threshold to scene and time-of-day drift.
Multi-camera spatial reasoning for tracking events across overlapping views.
Human-in-the-loop feedback to continually recalibrate the screener.

Full methodology, related work and complete results are documented in the accompanying dissertation manuscript.
