Abstract:
In crowd analysis, video data poses challenges due to occlusion, high crowd densities, and dynamic environmental conditions. To address
these challenges and enhance accuracy, we propose Behavioral Crowd Counting (BCC), which combines the Congested Scene
Recognition Network (CSRNet) with U-Net for video data. CSRNet comprises two components: (1) a frontend for feature extraction and
(2) a backend for density-map generation. It effectively counts individuals within densely populated regions, addressing the constraint of
high crowd density. The U-Net builds a semantic map and refines both the semantic and density maps produced by CSRNet. The U-Net unravels
complex patterns and connections among individuals in crowded settings, capturing spatial dependencies within densely populated scenes. It
also offers the flexibility to incorporate attention maps as optional inputs to differentiate crowd regions from the background. We have also
developed a new video dataset, the Behavioral Video Dataset, derived from the fine-grained crowd-counting image dataset, to evaluate the BCC
model. The dataset includes standing vs. sitting, waiting vs. non-waiting, towards vs. away, and violent vs. non-violent videos, offering insights into
posture, activity, directional movement, and aggression across varied environments. Empirical results show that our approach outperforms
existing methods for behavioral crowd counting on video datasets with congested scenes, as measured by the MSE, MAE,
and CMAE metrics.
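To make the evaluation setup concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of the density-map counting principle used by CSRNet-style models and of the MAE/MSE metrics reported above: the predicted count for a frame is the sum over the predicted density map, and errors are averaged over frames. All function names and the toy data are illustrative assumptions.

```python
import numpy as np

def count_from_density(density_map):
    """Estimated head count = integral (pixel sum) of the density map."""
    return float(np.sum(density_map))

def mae(pred_counts, gt_counts):
    """Mean Absolute Error between predicted and ground-truth counts."""
    pred, gt = np.asarray(pred_counts, float), np.asarray(gt_counts, float)
    return float(np.mean(np.abs(pred - gt)))

def mse(pred_counts, gt_counts):
    """Mean Squared Error between predicted and ground-truth counts."""
    pred, gt = np.asarray(pred_counts, float), np.asarray(gt_counts, float)
    return float(np.mean((pred - gt) ** 2))

# Toy example: three "frames" whose density maps sum to 10, 20, 30 people.
maps = [np.full((4, 4), c / 16.0) for c in (10, 20, 30)]
preds = [count_from_density(m) for m in maps]
gts = [12, 18, 30]
print(round(mae(preds, gts), 2))  # errors 2, 2, 0 -> 1.33
print(round(mse(preds, gts), 2))  # errors 4, 4, 0 -> 2.67
```

CMAE (class-wise MAE, used for behavior categories such as standing vs. sitting) would be computed the same way as MAE, but per behavior class before averaging.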