Abstract
With the explosive growth of multimodal data, cross-modal retrieval has drawn increasing research interest. Hashing-based methods have made great advances in cross-modal retrieval thanks to their low storage cost and fast query speed. However, improving the accuracy of cross-modal retrieval remains a crucial challenge because of the heterogeneity gap between modalities. To tackle this problem, in this paper we propose a new two-stage cross-modal retrieval method called Deep Semantic Hashing with Dual Attention (DSHDA). In the first stage of DSHDA, a Semantic Label Network (SeLabNet) is designed to extract label semantic features and hash codes by training on the multi-label annotations, which places the learning of different modalities in a common semantic space and bridges the modality gap effectively. In the second stage of DSHDA, we propose a deep neural network that integrates feature and hash code learning for each modality into the same framework; its training is guided by the label semantic features and hash codes generated by SeLabNet to maximize cross-modal semantic relevance. Moreover, dual attention mechanisms are used in our neural networks: (1) Lo-attention extracts the local key information of each modality and improves the quality of modality features; (2) Co-attention strengthens the relationship between different modalities to produce more consistent and accurate hash codes. Extensive experiments on two real-world datasets with image-text modalities demonstrate the superiority of the proposed method in cross-modal retrieval tasks.
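To make the attention idea above concrete, the following is a minimal, hypothetical sketch of a Lo-attention style block: a \(1 \times 1\) convolution scores each spatial location of a modality's feature maps and re-weights them (the appendix notes that the Lo-attention module contains a \(1 \times 1\) convolutional layer). The class name, the softmax normalization, and all sizes are illustrative assumptions, not the paper's exact implementation.

import torch
import torch.nn as nn

class LoAttention(nn.Module):
    """Sketch of a local (Lo-) attention block: a 1x1 convolution produces one
    attention score per spatial location, and the normalized scores re-weight
    the convolutional feature maps of one modality. All details are assumptions."""
    def __init__(self, in_channels):
        super().__init__()
        self.score = nn.Conv2d(in_channels, 1, kernel_size=1)  # 1x1 conv -> one score per location

    def forward(self, feats):                                  # feats: (batch, C, H, W)
        scores = self.score(feats)                             # (batch, 1, H, W)
        weights = torch.softmax(scores.flatten(2), dim=-1)     # normalize over the H*W locations
        weights = weights.view(feats.size(0), 1, feats.size(2), feats.size(3))
        return feats * weights                                 # attended feature maps, same shape as input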
References
Deng C, Chen Z, Liu X, Gao X, Tao D (2018) Triplet-based deep hashing network for cross-modal retrieval. IEEE Trans Image Process 27(8):3893
Cao Y, Long M, Wang J, Liu S (2017) Collective deep quantization for efficient cross-modal retrieval. In: 31st AAAI conference on artificial intelligence, pp 3974–3980
Wang B, Yang Y, Xu X, Hanjalic A, Shen H (2017) Adversarial cross-modal retrieval. In: Proceedings of the 2017 ACM multimedia conference, pp 154–162
Wu Y, Wang S, Huang Q (2017) Online asymmetric similarity learning for cross-modal retrieval. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4269–4278
Kang C, Xiang S, Liao S, Xu C, Pan C (2015) Learning consistent feature representation for cross-modal multimedia retrieval. IEEE Trans Multimedia 17(3):370
Zhang Y, Lu H (2018) Deep cross-modal projection learning for image-text matching. In: Proceedings of the European Conference on Computer Vision, pp 686–701
Gu J, Cai J, Joty SR, Niu L, Wang G (2018) Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7181–7189
Hu D, Nie F, Li X (2018) Deep binary reconstruction for cross-modal hashing. IEEE Trans Multimedia 21(4):973
Zhu X, Li X, Zhang S, Xu Z, Yu L, Wang C (2017) Graph pca hashing for similarity search. IEEE Trans Multimedia 19(9):2033
Shi Y, You X, Zheng F, Wang S, Peng Q (2019) Equally-guided discriminative hashing for cross-modal retrieval. In: Twenty-eighth international joint conference on artificial intelligence, pp 4767–4773
Zhang J, Peng Y (2020) Multi-pathway generative adversarial hashing for unsupervised cross-modal retrieval. IEEE Trans Multimedia 22(1):174
Wang D, Cui P, Ou M, Zhu W (2015) Learning compact hash codes for multimodal representations using orthogonal deep structure. IEEE Trans Multimedia 17(9):1404
Liu W, Mu C, Kumar S, Chang S (2014) Discrete graph hashing. In: Advances in Neural Information Processing Systems, pp 3419–3427
Ding K, Fan B, Huo C, Xiang S, Pan C (2017) Cross-modal hashing via rank-order preserving. IEEE Trans Multimedia 19(3):571
Zhen L, Hu P, Wang X, Peng D (2019) Deep supervised cross-modal retrieval. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 10386–10395
Cao Y, Long M, Wang J, Zhu H (2016) Correlation autoencoder hashing for supervised cross-modal search. In: Proceedings of the 2016 ACM on international conference on multimedia retrieval, pp 197–204
Lin Z, Ding G, Hu M, Wang J (2015) Semantics-preserving hashing for cross-view retrieval. In: 2015 IEEE conference on computer vision and pattern recognition, pp 3864–3872
Mandal D, Chaudhury K, Biswas S (2017) Generalized semantic preserving hashing for n-label cross-modal retrieval. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2633–2641
Jiang Q, Li W (2017) Deep cross-modal hashing. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3232–3240
Yang E, Deng C, Liu W, Liu X, Tao D, Gao X (2017) Pairwise relationship guided deep hashing for cross-modal retrieval. In: Thirty-First AAAI conference on artificial intelligence, pp 1618–1625
Weng W, Wu J, Yang L, Liu L, Hu B (2019) Label-based deep semantic hashing for cross-modal retrieval. In: Neural Information Processing, pp 24–36
Li C, Deng C, Li N, Liu W, Gao X, Tao D (2018) Self-supervised adversarial hashing networks for cross-modal retrieval. In: 2018 IEEE conference on computer vision and pattern recognition, pp 4242–4251
Zhang X, Lai H, Feng J (2018) Attention-aware deep adversarial hashing for cross-modal retrieval. In: Proceedings of the European conference on computer vision, pp 591–606
Zhou J, Ding G, Guo Y (2014) Latent semantic sparse hashing for cross-modal similarity search. In: Proceedings of the 37th international ACM SIGIR conference on research & development in information retrieval, pp 415–424
Song J, Yang Y, Yang Y, Huang Z, Shen HT (2013) Inter-media hashing for large-scale retrieval from heterogeneous data sources. In: Proceedings of the 2013 ACM SIGMOD international conference on management of data, pp 785–796
Ding G, Guo Y, Zhou J (2014) Collective matrix factorization hashing for multimodal data. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2075–2082
Zhang D, Li W (2014) Large-scale supervised multimodal hashing with semantic correlation maximization. In: Twenty-Eighth AAAI Conference on Artificial Intelligence, pp 2177–2183
Lin Z, Ding G, Hu M, Wang J (2015) Semantics-preserving hashing for cross-view retrieval. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3864–3872
Xu X, Shen F, Yang Y, Shen HT, Li X (2017) Learning discriminative binary codes for large-scale cross-modal retrieval. IEEE Trans Image Process 26(5):2494
Zhang J, Peng Y, Yuan M (2018) Unsupervised generative adversarial cross-modal hashing. In: 32nd AAAI conference on artificial intelligence, pp 539–546
Yang M, Zhao W, Xu W, Feng Y, Zhao Z, Chen X, Lei K (2019) Multitask learning for cross-domain image captioning. IEEE Trans Multimedia 21(4):1047
Wu Q, Shen C, Wang P, Dick A, Hengel A (2018) Image captioning and visual question answering based on attributes and external knowledge. IEEE Trans Pattern Anal Mach Intell 40(6):1367
Chen K, Zhao T, Yang M, Liu L, Tamura A, Wang R, Utiyama M, Sumita E (2018) A neural approach to source dependence based context model for statistical machine translation. IEEE/ACM Trans Audio Speech Language Process 26(2):266
Yang Z, He X, Gao J, Deng L, Smola A (2016) Stacked attention networks for image question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 21–29
Peng Y, Qi J, Yuan Y (2018) Modality-specific cross-modal similarity measurement with recurrent attention network. IEEE Trans Image Process 27(11):5585
Chatfield K, Simonyan K, Vedaldi A, Zisserman A (2014) Return of the devil in the details: Delving deep into convolutional nets. arXiv:1405.3531
He K, Sun J (2015) Convolutional neural networks at constrained time cost. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5353–5360
Huiskes M, Lew M (2008) The MIR flickr retrieval evaluation. In: Proceedings of the 1st ACM international conference on multimedia information retrieval, pp 39–43
Chua T, Tang J, Hong R, Li H, Luo Z, Zheng Y (2009) NUS-WIDE: a real-world web image database from National University of Singapore. In: Proceedings of the ACM international conference on image and video retrieval, p 48
Liu W, Mu C, Kumar S, Chang SF (2014) Discrete graph hashing. In: Proceedings of the 27th international conference on neural information processing systems, pp 3419–3427
Kumar S, Udupa R (2011) Learning hash functions for cross-view similarity search. In: Twenty-second international joint conference on artificial intelligence, pp 1360–1365
Wang D, Gao X, Wang X, He L (2015) Semantic topic multimodal hashing for cross-media retrieval. In: Twenty-fourth international joint conference on artificial intelligence, pp 3890–3896
Acknowledgements
This research is supported by the National Natural Science Foundation of China under Grant Nos. 61872191 and 41571389.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Appendices
Optimization algorithm
It is intractable to optimize Eq. (5) directly since it is non-convex with respect to the variables \(\omega^p\), \(\omega^t\) and \(\mathbf{B}\). However, it is convex in any one of these variables when the other two are fixed. Therefore, we adopt an alternating learning strategy that fixes two parameters and updates the remaining one, repeating until convergence. The whole alternating learning procedure is shown in Algorithm 1 (a code sketch of the overall procedure is given after Step 3), and the detailed optimization steps are as follows:
Step 1: Optimize \(\omega^p\) with \(\omega^t\) and \(\mathbf{B}\) fixed. We use SGD with the back-propagation (BP) algorithm to optimize the CNN parameters \(\omega^p\) of the image modality. For each sampled point \(\mathbf{p}_i\), we compute the gradient with respect to \(\hat{\mathbf{F}}_{*i}^p\) as follows:
We then compute the gradient with respect to \(\hat{\mathbf{H}}_{*i}^p\) as follows:
Then \(\frac{\partial {J}}{\partial {\omega^p}}\) can be computed from \(\frac{\partial {J}}{\partial {\hat{\mathbf{F}}_{*i}^p}}\) and \(\frac{\partial {J}}{\partial {\hat{\mathbf{H}}_{*i}^p}}\) by the chain rule, based on which BP is used to update the parameter \(\omega^p\).
Step 2: Optimize \(\omega^t\) with \(\omega^p\) and \(\mathbf{B}\) fixed. We use SGD with the BP algorithm to optimize the deep neural network parameters \(\omega^t\) of the text modality. For each sampled point \(\mathbf{t}_i\), we compute the gradient with respect to \(\hat{\mathbf{F}}_{*i}^t\) as follows:
We then compute the gradient with respect to \(\hat{\mathbf{H}}_{*i}^t\) as follows:
Then \(\frac{\partial {J}}{\partial {\omega^t}}\) can be computed from \(\frac{\partial {J}}{\partial {\hat{\mathbf{F}}_{*i}^t}}\) and \(\frac{\partial {J}}{\partial {\hat{\mathbf{H}}_{*i}^t}}\) by the chain rule, based on which BP is used to update the parameter \(\omega^t\).
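As a rough illustration of Steps 1 and 2, the sketch below performs one SGD/BP update of a single modality network. It assumes the network returns its semantic features and hash-code outputs and that `loss_fn` evaluates the objective J of Eq. (5); both the interface and the loss are placeholders rather than the paper's exact implementation.

import torch

def update_modality_network(net, optimizer, batch, loss_fn):
    # One SGD step for omega^p (or omega^t) with B and the other modality fixed.
    # `net` is assumed to return (features, hash_outputs); `loss_fn` stands in
    # for the objective J of Eq. (5), which is not reproduced here.
    optimizer.zero_grad()
    feats, codes = net(batch)     # \hat{F}_{*i} and \hat{H}_{*i} for the sampled points
    loss = loss_fn(feats, codes)  # J evaluated on the mini-batch
    loss.backward()               # chain rule: dJ/dF, dJ/dH -> dJ/d(omega)
    optimizer.step()              # SGD update of the modality parameters
    return loss.item()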
Step 3: Optimize \(\mathbf{B}\) with \(\omega^p\) and \(\omega^t\) fixed. The objective function in Eq. (5) can be reformulated as follows:
which can be rewritten as follows:
To maximize the above formulation, the two factors of each product must have the same sign. Therefore, the following formulation can be obtained:
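Putting the three steps together, a minimal sketch of the alternating procedure could look as follows. It reuses the `update_modality_network` helper from the sketch above; the closed-form update of \(\mathbf{B}\) is written as an element-wise sign of a weighted sum of the two networks' hash-code outputs, which is only an assumed stand-in for the exact expression derived from Eq. (5), and the `hash_outputs()` helpers and data-loader layout are likewise hypothetical.

import torch

def alternating_optimization(img_net, txt_net, loader, loss_fn, epochs, gamma=1.0):
    # Alternating scheme of Algorithm 1: Steps 1 and 2 update the two modality
    # networks by SGD/BP, then Step 3 recomputes the binary codes B.
    opt_p = torch.optim.SGD(img_net.parameters(), lr=1e-2)
    opt_t = torch.optim.SGD(txt_net.parameters(), lr=1e-2)
    B = None
    for _ in range(epochs):
        for batch in loader:
            update_modality_network(img_net, opt_p, batch["image"], loss_fn)  # Step 1: omega^p
            update_modality_network(txt_net, opt_t, batch["text"], loss_fn)   # Step 2: omega^t
        with torch.no_grad():                                                 # Step 3: update B
            H_p = img_net.hash_outputs()  # assumed helper: \hat{H}^p for all training points
            H_t = txt_net.hash_outputs()  # assumed helper: \hat{H}^t for all training points
            B = torch.sign(gamma * (H_p + H_t))  # keep B sign-aligned with the network outputs
    return B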

Time complexity of DSHDA
As shown in Fig. 1, DSHDA consists of two stages, each of which further consists of several modules. We analyze the time complexity of these modules one by one as follows.
The first stage of DSHDA is the SeLabNet, which is a network model with multiple fully connected layers. The time complexity of such a network model can be expressed as follows:
\(O\left( \sum _{l=1}^{d_F} w_{l-1} \cdot w_l \right)\)    (19)
where \(w_l\) is the number of neurons in the l-th layer and \(d_F\) is the total number of fully connected layers.
Let \(w_0=k\) be the dimension of the input multi-label annotation. Then, according to Eq. (19) and Table 1, we can obtain the time complexities of the semantic feature generation module \(G^l\) and the semantic hash code generation module \(D^l\) as in Eqs. (20) and (21), respectively.
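As a small sanity check of Eq. (19), the following snippet counts the multiply–accumulate operations of a stack of fully connected layers. The layer widths in the example are hypothetical, since Table 1 is not reproduced here.

def fc_complexity(widths):
    # widths = [w_0, w_1, ..., w_{d_F}]; cost = sum over l of w_{l-1} * w_l (Eq. (19))
    return sum(w_prev * w_next for w_prev, w_next in zip(widths, widths[1:]))

# Hypothetical SeLabNet-style widths: w_0 = k input labels -> two hidden layers -> c-bit codes
print(fc_complexity([24, 4096, 512, 64]))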
The second stage of DSHDA comprises the image and text networks, each of which consists of five modules: feature learning (\(E^p\) and \(E^t\)), Lo-attention (\(A^p\) and \(A^t\)), semantic feature generation (\(G^p\) and \(G^t\)), B-structure (\(V^p\) and \(V^t\)), and semantic hash code generation (\(D^p\) and \(D^t\)).
For the image network, the image feature learning module \(E^p\) is a five-layer convolutional network. The time complexity of a network model with multiple convolutional layers is given by [37]:
\(O\left( \sum _{l=1}^{d_L} n_{l-1} \cdot s_l^2 \cdot n_l \cdot m_l^2 \right)\)    (22)
where \(d_L\) is the number of convolutional layers, \(s_l\) is the spatial size (side length) of the filter in the l-th layer, \(n_l\) is the number of filters in the l-th layer, \(m_l\) is the spatial size of the output feature map of the l-th layer, and \(n_{l-1}\) is also the number of input channels of the l-th layer.
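For illustration, the sum in Eq. (22) can be evaluated directly in code. The per-layer configuration below is hypothetical (the real values of \(s_l\), \(n_l\) and \(m_l\) for \(E^p\) come from Table 2); the snippet only demonstrates the counting.

def conv_complexity(layers):
    # layers = [(n_{l-1}, s_l, n_l, m_l), ...]; cost = sum of n_{l-1} * s_l^2 * n_l * m_l^2 (Eq. (22))
    return sum(n_prev * s * s * n * m * m for n_prev, s, n, m in layers)

# Hypothetical five-layer configuration with n_0 = 3 color channels:
layers = [(3, 11, 64, 55), (64, 5, 256, 27), (256, 3, 256, 13),
          (256, 3, 256, 13), (256, 3, 256, 13)]
print(conv_complexity(layers))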
Let \(t_l\) be the stride of the filter in the l-th layer; then we have
Besides, the length of the original image feature \(d_p\) can be expressed as:
where \(m_0\) and \(n_0\) can be considered as the spatial size and the number of channels of the input feature map for the first convolutional layer, respectively.
Then, Eq. (22) can be transformed as:
Now, let \(n_0=3\) represent the three color channels of the input image. According to Eq. (25) and Table 2, we can obtain the time complexity of module \(E^p\) as follows:
Since the dimension of the feature maps output by module \(E^p\) is \(d_C \cdot d_H \cdot d_W\), we have \(d_C=265\) and \(d_H \cdot d_W=m_5^2=d_p/48\). These feature maps are the input of the Lo-attention module \(A^p\), which contains a convolutional layer with a \(1 \times 1\) kernel. The time complexity of module \(A^p\) is therefore:
With the same approach, we can obtain the time complexity of the other modules in the image and text networks of DSHDA. Due to space limitations, we omit the detailed analysis and only list the time complexities of the DSHDA modules in Table 11.
Thus, the total time complexity of DSHDA is the sum of the complexities of all modules in Table 11. Besides, noting that the values of k and c are \(O(10^2)\) and those of \(d_p\) and \(d_t\) are \(O(10^3)\) in practice, we obtain the time complexity of DSHDA as:
where \(\epsilon >1\) is a scale factor of the time complexity.
Cite this article
Wu, J., Weng, W., Fu, J. et al. Deep semantic hashing with dual attention for cross-modal retrieval. Neural Comput & Applic 34, 5397–5416 (2022). https://doi.org/10.1007/s00521-021-06696-y