Abstract: Cross-modal retrieval uses a query in one modality to retrieve multimedia data in another modality (e.g., retrieving images from a text query, or text from an image query). It can ...