In Section 3.4 of your paper, "Recursive Joint Cross-Modal Attention", you mention the use of a fully connected layer in equation (1). In equations (2)-(4) and (5)-(7), the learnable weight matrices appear to be inconsistent with the implementation in the DCNLayer class in the code you provided. Additionally, the paper states that "In order to obtain more refined feature representations, the attended features of each modality are again fed as input to the respective joint cross-modal attention module," referring to equation (11), but I cannot find a corresponding implementation of this step. Could you kindly clarify these points? If possible, could you also share more detailed code for this part?