Coventry University · CS Dissertation

Intelligent Surveillance Beyond Detection

A hybrid multimodal pipeline combining lightweight edge screening with Vision Language Model verification for violence detection in CCTV environments.

86%
F1 Score
−45%
VLM Calls
38ms
Edge Latency
01 The Research Problem

We record almost everything. We understand almost none of it.

Modern surveillance is bottlenecked not by cameras, but by interpretation. The footage exists. The question is how to read it intelligently — without watching every frame, and without paying to run a heavy model on all of it.

Traditional CCTV systems generate enormous volumes of video. A mid-sized transit hub can produce more footage in a single day than a team could review in a year.

Monitoring everything manually is unrealistic. Operator vigilance falls off within minutes, and the events that matter — assaults, altercations, crowd violence — are rare, brief, and easy to miss.

Yet processing every frame with a large multimodal model is prohibitively expensive. At scale, the inference bill alone makes naïve “run the big model everywhere” deployments commercially unviable.

The Core Tension

Efficiency vs. Robustness

Lightweight edge models are cheap and fast, but brittle — they miss nuance and raise false alarms. Large vision–language models are accurate and explainable, but slow and costly.

Edge classifier, cost: 12%
Edge classifier, robustness: 58%
VLM everywhere, cost: 96%
VLM everywhere, robustness: 91%
0+
Hours of footage per camera, monthly
A 64-camera site streams continuously, around the clock.
0%
Of footage a human operator actually watches
Attention degrades sharply after 20 minutes of monitoring.
0.0×
Compute cost of frame-by-frame VLM analysis
Relative to a lightweight edge classifier baseline.
02 System Architecture

A two-tier pipeline that spends compute only where it counts

Cheap screening runs everywhere at the edge. Expensive reasoning runs rarely, in the cloud. A single routing decision separates the two — and defines the system.

Edge Tier
Cloud Tier
  • CCTV Footage

    INGEST

    Continuous multi-camera RTSP streams are decoded into frame sequences at the edge node.

  • Edge Screening

    STAGE 1

    A lightweight spatiotemporal classifier runs on every clip — cheap, fast, always-on.

  • Suspicion Scoring

    STAGE 2

    Motion energy, pose dynamics and classifier confidence are fused into a single calibrated suspicion score.

  • Selective Escalation

    ROUTER

    Only clips above an uncertainty-aware threshold are escalated. The rest are logged and discarded.

  • VLM Verification

    STAGE 3

    A Vision Language Model inspects escalated clips, reasoning over the scene and producing a natural-language judgement.

  • Decision Engine

    OUTPUT

    Verified events are scored, explained, evidenced and routed to operators with a full audit trail.
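The stages above can be sketched end to end. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the fusion weights, the bias, the logistic mapping onto a 0 to 100 scale, and the `verify_with_vlm` stub are all hypothetical. Only the stage order and the threshold-based routing come from the description above.

```python
import math
from dataclasses import dataclass


@dataclass
class ClipFeatures:
    motion_energy: float    # assumed normalised to [0, 1]
    pose_dynamics: float    # assumed normalised to [0, 1]
    classifier_conf: float  # screener's violence probability in [0, 1]


# Hypothetical fusion parameters; real values would be fitted on labelled data.
WEIGHTS = (2.0, 1.5, 3.0)
BIAS = -3.0
THRESHOLD = 50.0  # same scale as the 0-100 suspicion score


def suspicion_score(f: ClipFeatures) -> float:
    """Stage 2: fuse the three signals into one 0-100 score via a logistic."""
    z = (BIAS
         + WEIGHTS[0] * f.motion_energy
         + WEIGHTS[1] * f.pose_dynamics
         + WEIGHTS[2] * f.classifier_conf)
    return 100.0 / (1.0 + math.exp(-z))


def verify_with_vlm(clip_id: str) -> str:
    """Stage 3 placeholder: stands in for the cloud VLM call."""
    return f"[VLM rationale for {clip_id} would appear here]"


def process_clip(clip_id: str, f: ClipFeatures) -> dict:
    """Router: only clips scoring above THRESHOLD reach the VLM."""
    score = suspicion_score(f)
    if score > THRESHOLD:
        return {"clip": clip_id, "score": score, "route": "vlm",
                "rationale": verify_with_vlm(clip_id)}
    return {"clip": clip_id, "score": score, "route": "edge", "rationale": None}


print(process_clip("calm", ClipFeatures(0.1, 0.1, 0.05))["route"])   # edge
print(process_clip("fight", ClipFeatures(0.9, 0.8, 0.85))["route"])  # vlm
```

Keeping the router as a single comparison makes the cost behaviour easy to audit: every cloud call traces back to one score and one threshold.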

03 Selective Escalation

One threshold decides whether the expensive model ever runs

Selective escalation is the heart of the system. Watch how identical infrastructure routes two very different scenes — and why most footage never costs a cloud call.

Suspicion Timeline · t0 → t6 (escalation threshold: 50)
Peak Suspicion: 88/100
Edge Cost: 38 ms
Routing Decision: Escalated to VLM

A sharp rise in motion energy and abnormal pose dynamics pushes suspicion past threshold. The clip is escalated for Vision Language Model verification.
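A routing rule of this kind might look like the sketch below. It assumes the router compares the peak suspicion over the clip with the threshold; the per-step scores are invented for illustration, and only the threshold of 50 and the peak of 88 come from the demo.

```python
THRESHOLD = 50  # threshold shown on the demo timeline


def route_clip(timeline, threshold=THRESHOLD):
    """Escalate when the peak suspicion across the clip exceeds the threshold."""
    peak = max(timeline)
    decision = "escalate_to_vlm" if peak > threshold else "handled_at_edge"
    return decision, peak


# Hypothetical scores for t0..t6; only the peak value of 88 is from the demo.
violent_scene = [12, 18, 35, 61, 88, 74, 40]
quiet_scene = [8, 11, 9, 14, 12, 10, 9]

print(route_clip(violent_scene))  # ('escalate_to_vlm', 88)
print(route_clip(quiet_scene))    # ('handled_at_edge', 14)
```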

Data Path: Edge Screening → Escalate to Cloud VLM → VLM Verification
04 Video Upload & Analysis

Upload footage. Get instant AI analysis.

Drop any CCTV clip and the pipeline screens at the edge, then escalates only what's ambiguous to the Vision Language Model — with a full natural-language rationale attached.

This is a research demonstration interface using pre-computed dissertation experiment outputs. It is not a deployed surveillance product.

Drop your CCTV footage here

Supports MP4, AVI, MOV, MKV — up to 2 GB

Edge screening: MobileNetV2 screener
Suspicion scoring: Threshold-based routing
AI rationale: Natural-language output
05 Research Evaluation

Robustness held, cost collapsed

Evaluated on 20 clips from the RWF-2000 dataset validation partition (10 violence, 10 non-violence). The hybrid pipeline retains strong accuracy while routing only 11 of 20 clips to the cloud model.

Accuracy
85%
Overall correctness
Precision
0.0%
Few false alarms
Recall
0.0%
Few missed events
F1 Score
86%
Balanced measure
Confusion Matrix · n = 20
Pred. Violent
Pred. Normal
Actual Violent
0
TP
0
FN
Actual Normal
0
FP
0
TN
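The headline figures quoted on this page (85% accuracy, 86% F1, 10 violent and 10 non-violent clips) constrain the confusion matrix quite tightly. The sketch below enumerates every matrix on that split whose rounded accuracy and F1 match those figures. It assumes both figures are rounded to the nearest percent, which the page does not state, so the result is an inference rather than a reported value.

```python
def metrics(tp, fn, fp, tn):
    """Return (accuracy, precision, recall, f1) for a binary confusion matrix."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return accuracy, precision, recall, f1


def consistent_matrices(n_pos=10, n_neg=10, accuracy_pct=85, f1_pct=86):
    """List matrices whose rounded accuracy and F1 match the quoted figures."""
    found = []
    for tp in range(n_pos + 1):
        for tn in range(n_neg + 1):
            fn, fp = n_pos - tp, n_neg - tn
            acc, _, _, f1 = metrics(tp, fn, fp, tn)
            if round(acc * 100) == accuracy_pct and round(f1 * 100) == f1_pct:
                found.append({"TP": tp, "FN": fn, "FP": fp, "TN": tn})
    return found


print(consistent_matrices())
# [{'TP': 9, 'FN': 1, 'FP': 2, 'TN': 8}]
```

Under these assumptions the implied precision is 9/11 (about 81.8%) and the implied recall is 9/10 (90%). The dissertation manuscript remains the authority for the actual counts.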
Latency Analysis · per clip
Edge screen only: 38 ms
Hybrid pipeline (avg): 142 ms
VLM on every clip: 920 ms

The hybrid average stays close to edge latency because the costly VLM path runs only for escalated clips.
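One way to reason about the hybrid average is a simple additive model: every clip pays the edge cost, and escalated clips additionally pay the VLM cost. This model is an assumption for illustration; the page does not say how the 142 ms average was measured.

```python
def expected_latency_ms(edge_ms, vlm_ms, escalation_rate):
    """Every clip pays the edge cost; escalated clips also pay the VLM cost."""
    return edge_ms + escalation_rate * vlm_ms


def implied_escalation_rate(avg_ms, edge_ms, vlm_ms):
    """Invert the additive model to find the escalation rate behind an average."""
    return (avg_ms - edge_ms) / vlm_ms


# Figures from the latency panel: 38 ms edge, 920 ms VLM, 142 ms hybrid average.
print(round(expected_latency_ms(38, 920, 11 / 20), 1))         # 544.0
print(round(implied_escalation_rate(142, 38, 920), 3))         # 0.113
```

Under this model an 11-of-20 escalation rate would give roughly 544 ms per clip, whereas a 142 ms average corresponds to an escalation rate of about 11%. The reported average may therefore reflect a different measurement setup, such as a different clip mix or edge and cloud work that overlaps; the dissertation manuscript is the place to check.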

55%
VLM API calls
11 of 20 clips escalated; 9 handled locally by the edge screener.
9
API calls saved
9 of 20 baseline VLM calls avoided; baseline = 20, hybrid = 11.
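The call-volume figures follow directly from the two counts above. A short worked calculation, using only the 20-clip baseline and the 11 escalations reported here (the function name is illustrative):

```python
def call_savings(total_clips, escalated):
    """Compare hybrid VLM call volume with a VLM-on-every-clip baseline."""
    saved = total_clips - escalated
    return {
        "vlm_call_rate": escalated / total_clips,
        "calls_saved": saved,
        "reduction": saved / total_clips,
    }


print(call_savings(20, 11))
# {'vlm_call_rate': 0.55, 'calls_saved': 9, 'reduction': 0.45}
```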
06 Research Contribution

What this work adds to the field

Five contributions, each addressing a gap left by single-model surveillance systems.

01

Hybrid AI Architecture

A formalised two-tier design that composes a lightweight discriminative screener with a generative vision–language verifier, rather than treating them as alternatives.

02

Edge–Cloud Collaboration

A routing contract that keeps always-on inference local and reserves bandwidth-heavy reasoning for the cloud, only when justified.

03

Computational Efficiency

A 45% reduction in multimodal verification requests in the evaluated configuration (11/20 clips escalated), with 85% classification accuracy maintained.

04

Uncertainty Handling

An uncertainty-aware escalation threshold that treats the screener's confidence as a first-class routing signal, not a binary gate.

05

Deployment Realism

Designed around real-world constraints — bandwidth, cost ceilings, operator load and auditability — rather than benchmark conditions alone.
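To illustrate what treating confidence as a routing signal rather than a binary gate could mean in practice, the sketch below escalates both likely-violent clips and clips where the screener is unsure. The use of binary entropy and both cut-off values are assumptions made for this example; the page does not specify how uncertainty is measured.

```python
import math


def binary_entropy(p):
    """Uncertainty of a yes/no prediction in bits (0 = certain, 1 = maximally unsure)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def should_escalate(p_violence, score_threshold=0.5, entropy_limit=0.9):
    """Escalate likely-violent clips, plus clips the screener is unsure about.

    Both cut-offs are hypothetical and would need tuning on labelled data.
    """
    return p_violence > score_threshold or binary_entropy(p_violence) > entropy_limit


print(should_escalate(0.95))  # True: confidently violent
print(should_escalate(0.45))  # True: below the score cut-off but highly uncertain
print(should_escalate(0.10))  # False: confidently normal, handled at the edge
```

A plain binary gate would drop the 0.45 clip; the entropy term is what sends it to the verifier.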

07 About the Dissertation

The research, in brief

This platform is the public-facing artefact accompanying a Coventry University Computer Science dissertation.

Research Question

Can a hybrid pipeline that screens at the edge and verifies with a vision–language model match single-model accuracy for violence detection, while substantially reducing computational cost?

01

Objectives

  • Design a hybrid edge–cloud pipeline that screens all footage cheaply and verifies selectively.
  • Define an uncertainty-aware escalation policy that minimises cloud cost at fixed recall.
  • Quantify the accuracy–cost trade-off against single-model baselines.
02

Methodology

  • A lightweight spatiotemporal classifier provides a calibrated suspicion score at the edge.
  • A vision–language model verifies only escalated clips, producing an explainable judgement.
  • Threshold selection is framed as a constrained optimisation over a labelled validation set.
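Threshold selection as a constrained optimisation can be sketched as a small search: among candidate thresholds, keep those that still escalate enough of the violent clips, then pick the one that escalates the fewest clips overall. The recall target, the `>=` comparison, and the sample data are assumptions for illustration, and the sketch treats escalation as the only route to detection, ignoring any errors made by the VLM itself.

```python
def select_threshold(scores, labels, min_recall=0.9):
    """Return (threshold, clips_escalated, recall) with the fewest escalations
    among thresholds whose recall on violent clips is at least min_recall."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("validation set needs at least one violent clip")
    best = None
    for t in sorted(set(scores)):
        escalated = [s >= t for s in scores]
        caught = sum(1 for e, y in zip(escalated, labels) if e and y == 1)
        recall = caught / n_pos
        cost = sum(escalated)  # number of cloud calls at this threshold
        if recall >= min_recall and (best is None or cost < best[1]):
            best = (t, cost, recall)
    return best


# Hypothetical validation scores (0-100) and labels (1 = violent).
scores = [5, 12, 20, 33, 41, 47, 55, 62, 70, 88]
labels = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

print(select_threshold(scores, labels, min_recall=0.8))  # (55, 4, 0.8)
print(select_threshold(scores, labels, min_recall=1.0))  # (41, 6, 1.0)
```

Raising the recall target from 0.8 to 1.0 moves the threshold down and costs two extra cloud calls on this toy set, which is the accuracy versus cost trade-off the objectives describe.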
03

Experimental Evaluation

  • Benchmarked on a balanced held-out set of 20 RWF-2000 validation clips against edge-only and VLM-everywhere baselines.
  • Reported accuracy, precision, recall, F1, per-clip latency and cloud-call volume.
  • Ablations isolate the contribution of the uncertainty-aware threshold.
04

Future Research

  • Online adaptation of the escalation threshold to scene and time-of-day drift.
  • Multi-camera spatial reasoning for tracking events across overlapping views.
  • Human-in-the-loop feedback to continually recalibrate the screener.
Full methodology, related work and complete results are documented in the accompanying dissertation manuscript.