Opt-STViT: Video Recognition through Optimized Spatial-Temporal Video Vision Transformers
DOI:
https://doi.org/10.70135/seejph.vi.2341
Keywords:
Vision Transformer, Object Detection, Object Classification, Pyramid Vision Transformer, Adaptive Patch, Intelligent method
Abstract
In this paper, we address the computational challenges of video recognition, where video transformers have shown impressive results but incur high computational costs. We introduce Opt-STViT, a token selection framework that dynamically chooses a subset of informative tokens in both the temporal and spatial dimensions, conditioned on the input video sample. Specifically, we frame token selection as a ranking problem and use a lightweight scorer network to estimate the importance of each token. Only the tokens with the top scores are retained for downstream processing. In the temporal dimension, we identify and keep the frames most relevant to the action categories; in the spatial dimension, we pinpoint the most discriminative regions in the feature maps without affecting the spatial context that most video transformers use hierarchically. Because token selection is non-differentiable, we employ a perturbed-maximum-based differentiable Top-K operator to enable end-to-end training. Our extensive experiments, conducted primarily on the Kinetics-400 and Something-Something-V2 datasets with the recently introduced MViT video transformer backbone, demonstrate that our framework achieves similar results while requiring 20 percent less computation. We also establish the versatility of our approach across different transformer architectures and video datasets.
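The perturbed-maximum idea mentioned in the abstract can be illustrated with a small sketch. The version below is not the authors' implementation: it is a plain-Python, forward-pass-only illustration that assumes Gaussian noise and a Monte Carlo average, and the names `top_k_indicator`, `perturbed_top_k`, `select_tokens`, `sigma`, and `num_samples` are hypothetical. It shows how averaging hard Top-K masks over noisy copies of the scores gives a smooth mask whose entries vary continuously with the scores, which is what makes gradient-based training of a scorer possible.

```python
import random


def top_k_indicator(scores, k):
    """Return a 0/1 mask marking the k highest-scoring tokens (hard Top-K)."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    mask = [0.0] * len(scores)
    for i in top:
        mask[i] = 1.0
    return mask


def perturbed_top_k(scores, k, sigma=0.05, num_samples=500, seed=0):
    """Monte Carlo estimate of E[top_k_indicator(scores + sigma * Z)], Z ~ N(0, I).

    Assumption: Gaussian perturbations and a fixed sample count; both are
    illustrative choices, not values taken from the paper.
    """
    rng = random.Random(seed)
    soft = [0.0] * len(scores)
    for _ in range(num_samples):
        noisy = [s + sigma * rng.gauss(0.0, 1.0) for s in scores]
        for i, m in enumerate(top_k_indicator(noisy, k)):
            soft[i] += m / num_samples
    return soft


def select_tokens(tokens, scores, k):
    """Hard selection: keep the k top-scoring tokens in their original order."""
    mask = top_k_indicator(scores, k)
    return [t for t, m in zip(tokens, mask) if m == 1.0]


# Hypothetical scores, standing in for the output of a lightweight scorer network.
scores = [0.9, 0.1, 0.55, 0.5, 0.05]
tokens = ["t0", "t1", "t2", "t3", "t4"]

soft_mask = perturbed_top_k(scores, k=2)    # smooth relaxation of the Top-K mask
kept = select_tokens(tokens, scores, k=2)   # hard selection without noise
```

In this sketch, tokens whose scores are close (here `t2` and `t3`) share probability mass in `soft_mask`, while the hard selection keeps only `t0` and `t2`. A trainable version would additionally need a gradient estimator for the expectation, which is omitted here.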
License

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.