This review systematizes and analyzes modern deep learning approaches to detecting anomalous human behavior in video surveillance systems. It examines key methods, including hybrid architectures, generative models, and multimodal approaches. The main aim of the study is to identify the key limitations of existing solutions and to propose a new conceptual architecture that addresses them.
The analysis shows that modern models achieve high accuracy on standard datasets (F1-scores in the range of 90-95%) but face three fundamental problems: a scarcity of labeled anomaly data, high computational complexity that prevents real-time operation on edge devices, and poor robustness to external interference.
To address these problems, a hybrid multimodal architecture is proposed that uses compressed-domain analysis to speed up inference and a Gated Cross-Attention mechanism to fuse the video and audio streams. The proposed architecture shows potential as a basis for a reliable, scalable, and proactive monitoring system.
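As a rough illustration of the gated cross-attention idea mentioned above, the following minimal NumPy sketch lets video features attend over audio features and then scales the attended audio with a learned sigmoid gate before adding it back to the video stream. It is not the proposed architecture itself: the function names (`init_params`, `gated_cross_attention`), the single-head formulation, the residual form `video + gate * attended`, and the random weights are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(d, seed=0):
    """Random projection weights (hypothetical; a real model would learn these)."""
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(d)
    return {
        "Wq": rng.normal(scale=scale, size=(d, d)),      # video -> queries
        "Wk": rng.normal(scale=scale, size=(d, d)),      # audio -> keys
        "Wv": rng.normal(scale=scale, size=(d, d)),      # audio -> values
        "Wg": rng.normal(scale=scale, size=(2 * d, d)),  # gate projection
        "bg": np.zeros(d),                               # gate bias
    }


def gated_cross_attention(video, audio, params):
    """Fuse audio into video features.

    video: array of shape (T_v, d), one feature vector per video step.
    audio: array of shape (T_a, d), one feature vector per audio step.
    Returns the fused features (T_v, d) and the gate values (T_v, d).
    """
    q = video @ params["Wq"]
    k = audio @ params["Wk"]
    v = audio @ params["Wv"]

    # Scaled dot-product attention: each video step attends over all audio steps.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attended = softmax(scores, axis=-1) @ v

    # Sigmoid gate computed from the video features and the attended audio.
    # Values near 0 suppress the audio contribution (e.g. when it is noisy).
    gate_in = np.concatenate([video, attended], axis=-1)
    gate = 1.0 / (1.0 + np.exp(-(gate_in @ params["Wg"] + params["bg"])))

    # Residual fusion: the video stream passes through unchanged when gate -> 0.
    return video + gate * attended, gate


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(4, 8))   # 4 video steps, 8-dim features
    audio = rng.normal(size=(6, 8))   # 6 audio steps, 8-dim features
    fused, gate = gated_cross_attention(video, audio, init_params(8, seed=1))
    print(fused.shape, float(gate.min()), float(gate.max()))
```

Under these assumptions, the gate gives the model a way to fall back on the video stream alone when the audio is unreliable, which is one plausible way a fusion layer could tolerate external interference in a single modality.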
https://orcid.org/0009-0001-4436-5154