Learning shared semantic space for speech-to-text translation