OPT-STVIT: Video Recognition through Optimized Spatial-Temporal Video Vision Transformers

Authors

  • Dr. Divya Nimma, PhD in Computational Science, University of Southern Mississippi; Data Analyst at UMMC
  • Arjun Uddagiri, Chief Executive Officer, Gloom Dev Pvt Ltd, Penamaluru, Vijayawada, Andhra Pradesh, India

DOI:

https://doi.org/10.70135/seejph.vi.2341

Keywords:

Vision Transformer, Object Detection, Object Classification, Pyramid Vision Transformer, Adaptive Patch, Intelligent method

Abstract

In this paper, we address the computational challenges of video recognition, where video transformers achieve impressive results but at high computational cost. We introduce Opt-STViT, a token selection framework that dynamically chooses a subset of informative tokens in both the temporal and spatial dimensions, conditioned on the input video. Specifically, we frame token selection as a ranking problem and use a lightweight scorer network to estimate the importance of each token. Only the top-scoring tokens are retained for downstream processing. In the temporal dimension, we identify and keep the frames most relevant to the action categories; in the spatial dimension, we pinpoint the most discriminative regions in the feature maps without disturbing the spatial context that most video transformers use hierarchically. Because token selection is non-differentiable, we employ a perturbed-maximum-based differentiable Top-K operator to enable end-to-end training. Our extensive experiments, conducted primarily on the Kinetics-400 and Something-Something-V2 datasets with the recently introduced MViT video transformer backbone, demonstrate that our framework achieves similar results while requiring 20 percent less computation. We also establish the versatility of our approach across different transformer architectures and video datasets.
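
The perturbed-maximum Top-K selection mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`hard_top_k`, `perturbed_top_k`), the noise scale `sigma`, the sample count `num_samples`, and the toy scores standing in for the scorer network's output are all assumptions made for illustration. Only the forward pass is shown; the gradient estimate needed for end-to-end training is omitted.

```python
import numpy as np


def hard_top_k(scores, k):
    """Return a (k, n) one-hot matrix selecting the k highest-scoring tokens.

    Rows follow the original token order so the selected tokens keep
    their relative positions.
    """
    idx = np.sort(np.argsort(scores)[-k:])
    one_hot = np.zeros((k, scores.shape[0]))
    one_hot[np.arange(k), idx] = 1.0
    return one_hot


def perturbed_top_k(scores, k, num_samples=500, sigma=0.05, seed=0):
    """Smooth Top-K: average hard Top-K indicators over Gaussian-perturbed scores.

    sigma and num_samples are illustrative values, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((num_samples, scores.shape[0]))
    samples = [hard_top_k(scores + sigma * z, k) for z in noise]
    return np.mean(samples, axis=0)


# Toy example: 6 tokens with 4 features each, and made-up importance scores
# standing in for the output of a scorer network.
scores = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.3])
tokens = np.arange(6 * 4, dtype=float).reshape(6, 4)

soft_select = perturbed_top_k(scores, k=3)  # shape (3, 6), each row sums to 1
selected = soft_select @ tokens             # shape (3, 4): the retained tokens
```

Each row of `soft_select` is a distribution over input tokens, so the selection becomes a matrix product that varies smoothly with the scores. As `sigma` shrinks, the rows approach the one-hot rows of the hard Top-K.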

Published

2024-11-21

How to Cite

Nimma, D. D., & Uddagiri, A. (2024). OPT-STVIT: Video Recognition through Optimized Spatial-Temporal Video Vision Transformers. South Eastern European Journal of Public Health, 2103–2118. https://doi.org/10.70135/seejph.vi.2341

Section

Articles