A Framework: Region-Frame-Attention-Compact Bilinear Pooling Layer Based S2VT For Video Description
In the video description task, both the temporal and the visual information of a video are important for video understanding, and the high-level semantic information contained in fused text and video features plays an important role in caption generation. To generate accurate and appropriate video captions, we build on the S2VT (sequence to sequence: video to text) framework and propose a video description neural network framework (RFAC-S2VT) that combines two-level attention with compact bilinear pooling (CBP) fusion. We first use the visual information and category information in the dataset for classification training, and then use the trained CNN to extract visual features. In the encoding stage, we design a region attention mechanism that dynamically focuses on regions within each video frame, and then fuse the region-weighted 2D visual features with C3D visual features that contain temporal information. The model's sequence-modeling capability is then used to model the fused visual features together with their temporal information. In the decoding stage, we design a frame-level attention mechanism and use a compact bilinear pooling (CBP) layer to perform fine-grained fusion of the video features selected by the frame-level attention with the text features from the dataset; the model then generates the corresponding video caption. We validate the proposed framework on the MSR-VTT dataset; the results show that it is competitive with the current state of the art on this dataset.
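As an illustration of the fusion step, the following is a minimal NumPy sketch of compact bilinear pooling in its commonly used Count Sketch plus FFT form. It is not the RFAC-S2VT implementation: the function names, feature dimensions, output size `d`, and random inputs are all assumptions chosen for illustration, and the abstract does not specify which CBP variant or dimensions the authors use.

```python
import numpy as np


def count_sketch(x, h, s, d):
    """Count Sketch projection: out[h[i]] += s[i] * x[i]."""
    out = np.zeros(d)
    np.add.at(out, h, s * x)  # unbuffered add, so repeated indices accumulate
    return out


def compact_bilinear_pool(x, y, hx, sx, hy, sy, d):
    """Fuse x and y into a d-dimensional vector.

    The circular convolution of the two count sketches equals a count
    sketch of the outer product x (outer) y; it is computed here with FFTs.
    """
    fx = np.fft.rfft(count_sketch(x, hx, sx, d))
    fy = np.fft.rfft(count_sketch(y, hy, sy, d))
    return np.fft.irfft(fx * fy, n=d)


# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
n_video, n_text, d = 8, 6, 16

# Fixed random hash indices and signs, one pair per modality.
hx = rng.integers(0, d, size=n_video)
sx = rng.choice([-1.0, 1.0], size=n_video)
hy = rng.integers(0, d, size=n_text)
sy = rng.choice([-1.0, 1.0], size=n_text)

# Stand-ins for an attended video feature and a text feature.
video_feat = rng.standard_normal(n_video)
text_feat = rng.standard_normal(n_text)

fused = compact_bilinear_pool(video_feat, text_feat, hx, sx, hy, sy, d)
print(fused.shape)
```

The point of this design is that the fused vector captures pairwise interactions between video and text features without ever forming the full `n_video * n_text` outer product, which would be very large for realistic feature sizes.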
Copyright (c) 2019 Advances in Image and Video Processing
This work is licensed under a Creative Commons Attribution 4.0 International License.