[1] H. Sang and G. Hai, “A Framework: Region-Frame-Attention-Compact Bilinear Pooling Layer Based S2VT For Video Description,” EJAS, vol. 7, no. 4, pp. 17–30, Sep. 2019.