OPT-STVIT: Video Recognition through Optimized Spatial-Temporal Video Vision Transformers

Dr. Divya Nimma; Arjun Uddagiri

doi:10.70135/seejph.vi.2341

Authors

Dr. Divya Nimma, PhD in Computational Science, University of Southern Mississippi; Data Analyst at UMMC
Arjun Uddagiri, Chief Executive Officer, Gloom Dev Pvt Ltd, Penamaluru, Vijayawada, Andhra Pradesh, India

DOI:

https://doi.org/10.70135/seejph.vi.2341

Keywords:

Vision Transformer, Object Detection, Object Classification, Pyramid Vision Transformer, Adaptive Patch, Intelligent method

Abstract

In this paper, we address the computational challenges of video recognition, where video transformers achieve impressive results but at high computational cost. We introduce Opt-STViT, a token selection framework that dynamically chooses a subset of informative tokens in both the temporal and spatial dimensions, conditioned on the input video sample. Specifically, we frame token selection as a ranking problem and use a lightweight scorer network to estimate the importance of each token; only the top-scoring tokens are retained for downstream processing. In the temporal dimension, we identify and keep the frames most relevant to the action categories. In the spatial dimension, we pinpoint the most discriminative regions in the feature maps without affecting the spatial context that most video transformers use hierarchically. To enable end-to-end training despite the non-differentiable nature of token selection, we employ a perturbed-maximum-based differentiable Top-K operator. Our extensive experiments, conducted primarily on the Kinetics-400 and Something-Something-V2 datasets with the recently introduced MViT video transformer backbone, demonstrate that our framework achieves similar results while requiring 20 percent less computation. We also establish the versatility of our approach across different transformer architectures and video datasets.
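The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the Gaussian noise, and the values of `sigma` and `num_samples` are assumptions chosen for illustration, and the scores are given by hand rather than produced by a scorer network. The sketch shows hard Top-K selection and a Monte Carlo estimate of the smoothed selection indicator that perturbed-maximum methods use in the forward pass; the gradient estimator needed for training is omitted.

```python
import random


def top_k_indices(scores, k):
    """Indices of the k largest scores, returned in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def select_tokens(tokens, scores, k):
    """Hard selection: keep the k highest-scoring tokens, preserving their order."""
    return [tokens[i] for i in top_k_indices(scores, k)]


def perturbed_top_k(scores, k, num_samples=500, sigma=0.05, seed=0):
    """Monte Carlo estimate of a smoothed Top-K indicator (illustrative only).

    For each token, returns the fraction of noisy draws in which it lands
    in the top k. Gaussian noise and these defaults are assumptions.
    """
    rng = random.Random(seed)
    counts = [0] * len(scores)
    for _ in range(num_samples):
        noisy = [s + sigma * rng.gauss(0.0, 1.0) for s in scores]
        for i in top_k_indices(noisy, k):
            counts[i] += 1
    return [c / num_samples for c in counts]


# Hypothetical importance scores for four tokens (frames or spatial patches).
tokens = ["t0", "t1", "t2", "t3"]
scores = [0.10, 0.90, 0.30, 0.80]

print(select_tokens(tokens, scores, k=2))  # -> ['t1', 't3']
print(perturbed_top_k(scores, k=2))        # soft indicators; they sum to k
```

As `sigma` shrinks, the soft indicators approach the hard 0/1 selection; larger values spread weight across tokens with similar scores, which is what makes the relaxed operator informative for gradient-based training.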
