A Framework: Region-Frame-Attention-Compact Bilinear Pooling Layer Based S2VT For Video Description
DOI: https://doi.org/10.14738/aivp.74.6717

Abstract
In the video description task, the temporal and visual information of a video are essential for video understanding, and the high-level semantic information contained in fused text and video features plays an important role in caption generation. To generate accurate and appropriate video captions, we propose a video description neural network framework (RFAC-S2VT), built on the S2VT (sequence to sequence: video to text) framework, which combines two-level attention with compact bilinear pooling (CBP) fusion. We first train on the visual and category information in the dataset as a classification task, and then use the trained CNN to extract visual features. In the encoding stage, we design a region-level attention mechanism that dynamically focuses on salient regions within each video frame; the region-weighted 2D visual features are then fused with C3D visual features, which carry temporal information. The sequence-modeling capability of the framework is used to model these fused visual features over time. In the decoding stage, we design a frame-level attention mechanism and use compact bilinear pooling (CBP) to perform a fine-grained fusion of the attended video features with the text features in the dataset, from which the model generates the corresponding video caption. We validate the proposed framework on the MSR-VTT dataset; the results show that it is competitive with current state-of-the-art methods.
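The abstract does not specify the attention formulation in detail. For concreteness, the PyTorch sketch below shows one standard way to realize frame-level attention (additive, Bahdanau-style) over per-frame features, conditioned on the decoder hidden state; the class name `FrameAttention` and all dimensions are illustrative assumptions, not the paper's exact design. The region-level attention in the encoder could be built the same way, attending over spatial regions of a frame's CNN feature map instead of over frames.

```python
import torch
import torch.nn as nn

class FrameAttention(nn.Module):
    """Illustrative additive attention: weights per-frame features by the
    current decoder hidden state (a generic sketch, not the paper's design)."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)
        self.w_hidden = nn.Linear(hidden_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, frame_feats, hidden):
        # frame_feats: (batch, n_frames, feat_dim); hidden: (batch, hidden_dim)
        scores = self.v(torch.tanh(self.w_feat(frame_feats)
                                   + self.w_hidden(hidden).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)         # weights over frames
        context = (alpha * frame_feats).sum(dim=1)   # attended context vector
        return context, alpha.squeeze(-1)
```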
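Compact bilinear pooling commonly refers to the Tensor Sketch approximation of Gao et al.: each input vector is count-sketched into a d-dimensional space, and the two sketches are circularly convolved via FFTs, approximating the full outer product of the video and text features without materializing it. A minimal NumPy sketch, assuming this standard algorithm; the output dimension d, the feature sizes, and all function names below are chosen purely for illustration:

```python
import numpy as np

def make_sketch_params(in_dim, out_dim, rng):
    """Fixed random hash buckets and signs for one count sketch."""
    h = rng.integers(0, out_dim, size=in_dim)   # bucket per input index
    s = rng.choice([-1.0, 1.0], size=in_dim)    # random sign per input index
    return h, s

def count_sketch(x, h, s, out_dim):
    """Project x into R^out_dim: out[h[i]] += s[i] * x[i]."""
    out = np.zeros(out_dim)
    np.add.at(out, h, s * x)
    return out

def compact_bilinear_pooling(x, y, params_x, params_y, out_dim):
    """Approximate the flattened outer product of x and y via Tensor Sketch:
    count-sketch both inputs, then circularly convolve the sketches with FFTs."""
    fx = np.fft.rfft(count_sketch(x, *params_x, out_dim))
    fy = np.fft.rfft(count_sketch(y, *params_y, out_dim))
    return np.fft.irfft(fx * fy, n=out_dim)

# Usage: fuse an attended video feature with a text feature (sizes assumed).
rng = np.random.default_rng(0)
video_feat = rng.standard_normal(2048)
text_feat = rng.standard_normal(512)
d = 4096                                        # fused feature dimension
px = make_sketch_params(video_feat.size, d, rng)
py = make_sketch_params(text_feat.size, d, rng)
fused = compact_bilinear_pooling(video_feat, text_feat, px, py, d)  # shape (d,)
```

The hash parameters are drawn once and held fixed, so the pooling is a deterministic layer at both training and inference time; this is what makes CBP practical compared with the full bilinear outer product, whose dimensionality here would be 2048 × 512.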