Research: Current Projects
Latest Projects
ADAPTATIONS IN CONVERSATION: ENGAGING VOICES, FACES, BRAINS AND MACHINES
Funding: Natural Sciences and Engineering Research Council of Canada (NSERC)
Research Team: Yue Wang (PI, Linguistics, SFU), Paul Tupper (Mathematics, SFU), Maggie Clarke (SFU ImageTech Lab), Dawn Behne (Psychology, Norwegian University of Science and Technology), Allard Jongman (Linguistics, University of Kansas), Joan Sereno (Linguistics, University of Kansas), and members of the SFU Language and Brain Lab.
We are routinely engaged in face-to-face conversations, voice-only phone chats, and increasingly, video-based online communication. We interact with people with different language backgrounds, and nowadays even with AI chatbots. Our experiences thus involve adjustments in our speech production and perception based on whom we communicate with and in what environments. In this research, we explore how conversation partners from different backgrounds (e.g., native-nonnative, human-AI) adjust speech for successful communication. Specifically, we collect audio-video recordings of live conversations involving interactive computer-game tasks that elicit words with specific sound contrasts and examine how interlocutors adapt their speech to resolve miscommunications as the conversation progresses. Speakers' facial movements during target sound productions and acoustic correlates of the same productions are analyzed, along with how these differences are perceived. Neural data are also collected to study brain activity during the conversation. Finally, visual, acoustic, perceptual and neural data are brought together using computational modeling to develop predictions of which adaptive attributes improve the likelihood of accurate communication.
CREATING ADAPTIVE VOCAL INTERFACES IN HUMAN-AI INTERACTIONS
Funding: FASS Breaking Barriers Grant, SFU Faculty of Arts and Social Sciences
Research Team: Yue Wang (PI, Linguistics, SFU), Henny Yeung (Co-Investigator, Linguistics, SFU), Angelica Lim (Co-Investigator, Computer Science, SFU), and members of the Language and Brain Lab, Language and Development Lab, and Rosie Lab.
AI-powered vocal interfaces are rapidly increasing in prevalence. A pressing issue, consequently, is that communication with these interfaces can break down, especially when speaking or listening is challenging (for language learners, children, speech- or hearing-impaired individuals, in noisy conditions, etc.). The goal of this research is to investigate, in three experiments, how humans and vocal interfaces adapt their speech in the face of these misunderstandings. Specifically, Study 1 asks how human speech production changes in response to misperceptions from AI-powered vocal interfaces. Study 2 creates an adaptive AI-powered vocal interface that communicates better when humans misunderstand. Study 3 brings this work outside SFU to the community, examining naturalistic interactions between humans and social robots that implement the adaptive conversational platform developed in Studies 1 and 2. Findings will improve the technology behind existing virtual assistants, fostering technological engagement in education and in other diverse, multilingual environments.
HYPER-ARTICULATION IN AUDITORY-VISUAL COMMUNICATION
Funding: Social Sciences and Humanities Research Council of Canada (SSHRC)
Research Team: Yue Wang (PI, Linguistics, SFU), Allard Jongman (Linguistics, University of Kansas), Joan Sereno (Linguistics, University of Kansas), Dawn Behne (Psychology, Norwegian University of Science and Technology), Ghassan Hamarneh (Computer Science, SFU), Paul Tupper (Mathematics, SFU), and members of the SFU Language and Brain Lab.
Human speech involves multiple styles of communication. In adverse listening environments or in challenging linguistic conditions, speakers often alter their speech productions using a clarified articulation style termed hyperarticulation, with the intention of improving listener intelligibility and comprehension. Questions thus arise as to what strategies speakers use to enhance their speech and whether they are effective in improving intelligibility and comprehension. This research examines hyperarticulation in words differing in voice and facial cues to identify which speech-enhancing cues are important to make words more distinctive. We examine (1) both acoustic voice properties and visual mouth configurations in hyperarticulated words, using innovative computerized sound and image analysis techniques; (2) the intelligibility of hyperarticulated words presenting speaker voice and/or speaker face to perceivers for word identification, and (3) the relationship of speaker-perceiver behavior based on computational and mathematical modeling to determine how speakers and perceivers cooperate to encode and decode hyperarticulated cues in order to achieve optimal communication.
VISUAL PROCESSING OF PROSODIC AND SEGMENTAL SPEECH CUES: AN EYE-TRACKING STUDY
Funding: Social Sciences and Humanities Research Council of Canada (SSHRC)
Research Team: Yue Wang (Co-PI, Linguistics, SFU), Henny Yeung (Co-PI, Linguistics, SFU), and members of the SFU Language and Brain Lab and Language and Development Lab.
Facial gestures carry important linguistic information and improve speech perception. Research including our own (Garg, Hamarneh, Jongman, Sereno, and Wang, 2019; Tang, Hannah, Jongman, Sereno, Hamarneh, and Wang, 2015) indicates that movements of the mouth help convey segmental information while eyebrow and head movements help convey prosodic and syllabic information. Perception studies using eye-tracking techniques have also shown that familiarity with a language influences looking time at different facial areas (Barenholtz, Mavica, and Lewkowicz, 2016; Lewkowicz and Hansen-Tift, 2012). However, it is not clear to what extent attention to different facial areas (e.g., mouth vs. eyebrows) differs for prosodic versus segmental information, or as a function of language familiarity. Using eye-tracking, the present study investigates three questions. First, we focus on differences in eye-gaze patterns to see how different prosodic structures are processed in a familiar versus a non-familiar language. Second, we focus on monolingual processing of segmental and prosodic information. Third, we compare segmental and prosodic differences in familiar versus non-familiar languages. Results of this research have significant implications for improving strategies for language learning and early intervention.
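As a concrete illustration of the kind of measure such eye-tracking studies rely on (a hypothetical sketch, not the study's actual analysis pipeline), gaze samples can be tallied by facial area of interest (AOI); the AOI names and pixel coordinates below are invented for illustration.

```python
# Hypothetical sketch: proportion of eye-tracking samples falling in each
# facial area of interest (AOI). AOI labels and coordinates are invented;
# real studies use calibrated, speaker-specific regions.

# Rectangular AOIs as (x_min, y_min, x_max, y_max) in screen pixels.
AOIS = {
    "eyes_brows": (300, 100, 500, 200),
    "mouth": (330, 320, 470, 420),
}

def classify_sample(x, y):
    """Return the AOI containing the gaze point, or None."""
    for name, (x0, y0, x1, y1) in AOIS.items():
        if x0 <= x <= x1 and y0 <= y <= y1:
            return name
    return None

def aoi_proportions(samples):
    """Proportion of samples in each AOI (a simple dwell-time proxy)."""
    counts = {name: 0 for name in AOIS}
    for x, y in samples:
        name = classify_sample(x, y)
        if name:
            counts[name] += 1
    total = len(samples)
    return {name: n / total for name, n in counts.items()}

gaze = [(400, 150), (410, 160), (350, 400), (900, 900)]
print(aoi_proportions(gaze))  # {'eyes_brows': 0.5, 'mouth': 0.25}
```

Relative dwell proportions of this kind are what allow mouth-directed versus eyebrow-directed attention to be compared across prosodic conditions and language-familiarity groups.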
CONSUMING TEXT AND AUDIO FAKE NEWS IN A FIRST AND SECOND LANGUAGE
Funding: SFU FASS Kickstarter Grant
Research Team: Henny Yeung (PI), Maite Taboada (Co-PI), Yue Wang (Collaborator), Linguistics, SFU.
Interest in digital media, particularly in disinformation, or "fake news," has surged. Almost all work on this topic, however, has looked only at native speakers' consumption of English-language media. We ask here how fake news is consumed in one's first language (L1) vs. in a second language (L2), since decision-making, moral judgments, and lie detection are all influenced by whether one uses a first or second language. Only a few prior studies have asked how we consume fake news in an L2, and results are mixed, limited to a single dimension of media evaluation (believability), and explore only text, not audio speech. Objective 1 of this study is to ask how textual and acoustic signatures of truthfulness differ in written and audio news excerpts in English, French, and Mandarin. Objective 2 asks how L1-English vs. L1-French and L1-Mandarin speakers may show distinct tendencies to believe in, change attitudes about, and engage with (true or fake) text and audio news clips in English. Results have profound implications for Canada, where L2 consumers of English-language media are numerous.
Recent Projects
AUTOMATED LIP-READING: EXTRACTING SPEECH FROM VIDEO OF A TALKING FACE
Funding: Next Big Question Fund, SFU's Big Data Initiative
Research Team: Yue Wang (PI, SFU Linguistics); Ghassan Hamarneh (SFU Computing Science); Paul Tupper (SFU Mathematics); Dawn Behne (Psychology, Norwegian University of Science and Technology, Norway); Allard Jongman and Joan Sereno (Linguistics, University of Kansas, USA).
In face-to-face communication, voice and coordinated facial movements are used simultaneously to perceive speech. In noisy environments, seeing a speaker's facial movements makes speech perception easier. Similarly, with multimedia, we rely on visual cues when the audio is not transmitted well (e.g., during video conferencing) or in noisy backgrounds. In the current era of social media, we increasingly encounter multimedia-induced challenges where the audio signal in the video is of poor quality or misaligned (e.g., via Skype). The next big question for speech scientists, and relevant for all multimedia users, is what speech information can be extracted from a face and whether the corresponding audio signal can be recreated from it to enhance speech intelligibility. This project tackles the issue by integrating machine-learning and linguistic approaches to develop an automatic face-reading system that identifies and extracts attributes of visual speech to reconstruct the acoustic information of a speaker's voice.
COMMUNICATING PITCH IN CLEAR SPEECH
Funding: Natural Sciences and Engineering Research Council of Canada (NSERC)
Research Team: Yue Wang (PI, SFU Linguistics); Allard Jongman, Joan Sereno, and Rustle Zeng (Linguistics, University of Kansas, USA); Paul Tupper (SFU Mathematics); Ghassan Hamarneh (SFU Computing Science); Keith Leung (SFU Linguistics); Saurabh Garg (SFU Language and Brain Lab, and Pacific Parkinson's Research Centre, UBC).
This research investigates the role of clear speech in communicating pitch-related information: lexical tone. The objectives are to identify how speakers modify their tone production while still maintaining tone category distinctions, and how perceivers utilize tonal enhancement and categorical cues from different forms of input. These questions are addressed in a series of inter-related studies examining articulation, acoustics, intelligibility, and neuro-processing of clear-speech tones.
ROLE OF LINGUISTIC EXPERIENCE IN AUDIO-VISUAL SYNCHRONY PERCEPTION
Research Team: Dawn Behne (PI), Yue Wang, and members of the Speech Lab (Norwegian University of Science and Technology) and Language and Brain Lab (SFU)
The temporal alignment of what we hear and see is fundamental for the cognitive organization of information from our environment. Research indicates that a perceiver's experience influences sensitivity to audio-visual (AV) synchrony. We theorize that experience that enhances sensitivity to speech sound distinctions in the temporal domain would also enhance sensitivity in AV synchrony perception. On this basis, a perceiver whose native language (L1) involves duration-based phonemic distinctions would be expected to be more sensitive to AV synchrony in speech than a perceiver whose L1 makes less use of temporal cues. In the current study, simultaneity judgment data from participants differing in L1 experience with phonemic duration (e.g., English, Norwegian, Estonian) were collected using speech tokens with different degrees of AV alignment: from audio preceding the video (audio-lead), to the audio and video being physically aligned (synchronous), to video preceding the audio (video-lead). Findings of this research contribute to understanding the underpinnings of experience and AV synchrony perception.
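A minimal sketch of how simultaneity-judgment data of this kind can be summarized, assuming each condition is reduced to a proportion of "simultaneous" responses per stimulus-onset asynchrony (SOA). The threshold rule and the example proportions below are simplifications for illustration; published analyses typically fit full psychometric functions.

```python
# Hypothetical sketch: summarizing simultaneity-judgment (SJ) data.
# Input: dict mapping SOA in ms (negative = audio-lead, positive =
# video-lead) to the proportion of "simultaneous" responses.

def synchrony_window(data, threshold=0.5):
    """Return (pss, window_lo, window_hi): the point of subjective
    simultaneity and the SOA range judged simultaneous at least
    `threshold` of the time."""
    # PSS: SOA with the highest proportion of "simultaneous" responses.
    pss = max(data, key=data.get)
    # Window: SOAs whose proportion meets the threshold.
    above = sorted(soa for soa, p in data.items() if p >= threshold)
    return pss, above[0], above[-1]

# Invented example data for one listener.
judgments = {-300: 0.1, -150: 0.4, -50: 0.8, 0: 0.95, 100: 0.9,
             250: 0.55, 400: 0.2}
pss, lo, hi = synchrony_window(judgments)
print(pss, lo, hi)  # 0 -50 250
```

Note the asymmetry in the invented example: the tolerance window extends further on the video-lead side, a pattern commonly reported in AV synchrony research.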
Ongoing Projects
Multi-lingual and Multi-modal Speech Perception, Processing, and Learning
This research investigates how linguistic information from auditory, visual and gestural modalities affects the perception and production of speech sounds. In particular, the research addresses how native and nonnative speakers with different linguistic backgrounds use multi-modal speech information and what factors affect their perception from different input modalities. Furthermore, given that face-to-face communication often occurs between native and nonnative speakers, we also investigate how visual information resulting from errors in nonnative speech production may affect speech intelligibility, and how speakers modify their speech productions in response to communicative needs in different speech contexts.
EXAMINING VISIBLE ARTICULATORY FEATURES IN CLEAR AND CONVERSATIONAL SPEECH
Research Team: Lisa Tang, Ghassan Hamarneh (Computing Science, SFU), Allard Jongman, Joan Sereno (University of Kansas), Beverly Hannah, Keith Leung, Yue Wang
This project examines the effects of speech style (conversational and clear) and modality (auditory and visual) on articulatory and acoustic characteristics as well as intelligibility of speech sounds. Using state-of-the-art computer-vision and image-processing techniques, we examine videos of speakers' faces and extract movements in different speech styles. Their acoustic correlates are examined through detailed acoustic measurements. We also examine how native and nonnative perceivers use visual articulatory information in their perception of these speech sounds differing in style.
CAN CO-SPEECH HAND GESTURES FACILITATE THE LEARNING OF NON-NATIVE SPEECH SOUNDS?
Research Team: Allard Jongman, Joan Sereno, Katelyn Eng, Beverly Hannah, Keith Leung, Yue Wang
This project tests the prediction that incorporating hand gestures indicating lexical tone directionality during training will yield higher post-test tone identification accuracy than training with no hand gestures, or with no face information at all. Each of the 54 English-speaking participants is trained on the four Mandarin tones using video trials.
THE EFFECTS OF VISUAL INFORMATION ON PERCEIVING ACCENTED SPEECH
Research Team: Saya Kawase, Yue Wang, Beverly Hannah
This study examines how visual phonetic information in nonnative speech productions affects native listeners' perception of foreign accent. Native English listeners judge stimuli spoken by non-native Japanese speakers in an accent-rating task; the Japanese speakers are also matched with a group of native English controls. Given that native listeners perceive errors in L2 production both visually and auditorily, audiovisual stimuli are expected to be perceived as having a stronger foreign accent, especially for the more visually salient errors.
AUDITORY AND ARTICULATORY PRIMING EFFECTS ON THE PERCEPTION AND PRODUCTION OF SPEECH SOUNDS
Research Team: Lindsay Walker, Trude Heift, Yue Wang
This research investigates how auditory and articulatory priming segments affect the production and perception of speech sounds, respectively. Specifically, this study looks at late learners of English whose native language is Cantonese and who have difficulty perceiving and producing English voiced obstruents. Auditory primes of these difficult segments are followed by a production task in order to assess whether priming facilitates more accurate pronunciation. Additionally, articulatory primes are followed by a perception task in order to assess whether articulating a segment can facilitate better perception. It is expected that priming in either domain will be facilitative given that previous research has shown a strong connection between speech production and speech perception.
The Processing and Learning of Prosody
Using EEG and behavioral testing methods, this project addresses how linguistic prosody is processed in the brain, and how its neural organization may be affected by linguistic and non-linguistic experience and learning such as musical training experience. The goal of this study is to investigate the extent to which neural processing in second language (L2) learning is influenced by linguistic experience, or reflects a human hardwired ability to process general physical properties.
EFFECTS OF LINGUISTIC AND MUSICAL TRAINING EXPERIENCE ON THE PERCEPTION OF LEXICAL AND MELODIC PITCH INFORMATION
Research Team: Daniel Chang, Yue Wang, Nancy Hedberg
This research examines how tone-language experience influences the perception of music. Native Cantonese speakers, native English speakers, and early English-Cantonese bilinguals will participate in a relative-pitch task and an absolute-pitch task. The study explores whether early exposure to a tone language such as Cantonese facilitates the musical abilities of absolute pitch and relative pitch, that is, whether speaking a tone language is beneficial to music perception.
ACOUSTIC-PERCEPTUAL PROPERTIES OF CROSS-LANGUAGE LEXICAL-TONE SYSTEMS
Research Team: Jennifer Alexander, Yue Wang
Lexical-tone systems use pitch to signal word meaning; they exist in 70% of languages but are under-studied compared to segmental (consonant/vowel) systems. We extend a well-studied model of second-language sound-structure perception (the Perceptual Assimilation Model, Best and Tyler, 2007), which has traditionally focused on segments, to lexical tones. In doing so, we aim to determine the effect of native-language tone experience on perception of novel lexical tones.
We first aim to evaluate whether and how experience with a tone language affects the organization of non-native tones in acoustic-perceptual space. Listeners will use a free classification paradigm (Clopper, 2008) to classify native- and non-native lexical tones. We then will examine how perceptual proximity affects identification of non-native tone categories: listeners are expected to more quickly and accurately identify tones belonging to contrastive categories present in their native inventories. Finally, we investigate how perceptual proximity affects discrimination of non-native tones. We predict that listeners will more quickly and accurately discriminate, and will be more sensitive to differences between, tones judged to be highly dissimilar (relative to tones judged to be highly similar).
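The free classification step can be illustrated with a small hypothetical sketch: tokens that listeners sort into the same group more often are treated as more perceptually similar, yielding a similarity matrix for the acoustic-perceptual space. The token labels and groupings below are invented, not the study's stimuli.

```python
# Hypothetical sketch: pairwise perceptual similarity from free
# classification, where each listener sorts tone tokens into groups.
from itertools import combinations

def similarity_matrix(sorts, tokens):
    """sorts: one grouping per listener, each a list of sets of tokens.
    Returns {(a, b): proportion of listeners who grouped a with b}."""
    pairs = {frozenset(p): 0 for p in combinations(tokens, 2)}
    for grouping in sorts:
        for group in grouping:
            for p in combinations(sorted(group), 2):
                pairs[frozenset(p)] += 1
    n = len(sorts)
    return {tuple(sorted(k)): v / n for k, v in pairs.items()}

tokens = ["T1", "T2", "T3"]
# Two listeners: one groups T1 with T2; the other keeps all tones apart.
sorts = [[{"T1", "T2"}, {"T3"}],
         [{"T1"}, {"T2"}, {"T3"}]]
print(similarity_matrix(sorts, tokens))
# {('T1', 'T2'): 0.5, ('T1', 'T3'): 0.0, ('T2', 'T3'): 0.0}
```

A matrix of this kind can then be submitted to multidimensional scaling or clustering to visualize the perceptual space, and its proximities compared against identification and discrimination performance.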
ELECTROPHYSIOLOGICAL STUDY OF LINGUISTIC AND NON-LINGUISTIC PITCH PROCESSING
Research Team: Yang Zhang (University of Minnesota), Dawn Behne (Norwegian University of Science and Technology), Angela Cooper, Yue Wang
Using high-density ERP, this research examines speech and non-speech pitch processing by both tone and non-tone language speakers. We intend to investigate the extent to which early perceptual sensitivities and late categorization abilities are influenced by linguistic and/or musical experience with pitch, and whether such experience is transferable for tonal pattern processing between speech and non-speech. Additionally, we examine learning-induced brain plasticity through training nonnative tone language learners to perceive linguistic tones.
EFFECTS OF LINGUISTIC AND MUSICAL EXPERIENCE ON NON-NATIVE PERCEPTION OF THAI VOWEL DURATION
Research Team: Angela Cooper, Richard Ashley (Bienen School of Music, Northwestern University), Yue Wang
The present study investigates the influence of linguistic and musical experience on non-native perception of speaking-rate-varied Thai phonemic vowel length distinctions. Utilizing identification and AX discrimination tasks, we hypothesized that native Thai listeners would be more accurate at identifying and discriminating these native vowel length contrasts than the English group across speaking rates. Furthermore, the native group was not predicted to be as sensitive to within-category differences (such as long vowels at fast and normal rates) as the non-native group. Finally, given that musicians are trained to discern temporal distinctions in music, English musicians were predicted to be more accurate at identifying and discriminating non-native vowel length distinctions than the English non-musicians, particularly at faster rates of speech.
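Discrimination sensitivity in AX tasks of this kind is commonly summarized as d'; the sketch below shows one standard computation (with a log-linear correction), using invented response counts rather than data from this study.

```python
# Hypothetical sketch: sensitivity (d') for an AX discrimination task.
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """d' = z(hit rate) - z(false-alarm rate), with a log-linear
    correction so rates of 0 or 1 never yield infinite z-scores."""
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)

# A listener detecting "different" vowel-length pairs: 45 hits, 5 misses,
# 10 false alarms, 40 correct rejections (invented counts).
print(round(d_prime(45, 5, 10, 40), 2))
```

Because d' separates sensitivity from response bias, it allows the within-category versus between-category comparisons described above to be made across listener groups and speaking rates.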
HEMISPHERIC PROCESSING OF PITCH ACCENT IN JAPANESE BY NATIVE AND NON-NATIVE LISTENERS
Research Team: Xianghua Wu, Jung-yueh Tu, Saya Kawase, Yue Wang
Using the dichotic listening paradigm, this study investigates the hemispheric processing of Japanese pitch accent by native and non-native listeners. The main questions addressed include the extent to which the temporal window and functional load of speech prosody, as well as listeners' linguistic experience, affect the hemispheric specialization for Japanese pitch accent. Specifically, this study examines: (1) how native and non-native speakers process a type of lexical prosody (pitch accent) imposed on disyllabic words, (2) whether processing of disyllabic prosody differs from that of monosyllabic prosody, such as lexical tones in Mandarin, and (3) for non-native listeners, whether their tone/stress language background affects the processing of Japanese pitch accent. We are currently investigating how learners of Japanese process pitch accent patterns.
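Dichotic-listening results are often summarized with an ear-advantage (laterality) index computed from correct reports per ear; a minimal sketch with invented counts:

```python
# Hypothetical sketch: ear-advantage (laterality) index for a dichotic
# listening task. Positive values indicate a right-ear advantage,
# conventionally interpreted as left-hemisphere specialization.

def laterality_index(right_correct, left_correct):
    """(R - L) / (R + L); returns 0.0 when there are no responses."""
    total = right_correct + left_correct
    return (right_correct - left_correct) / total if total else 0.0

# A listener correctly reporting 30 right-ear and 20 left-ear items.
print(laterality_index(30, 20))  # 0.2
```

Comparing this index across listener groups (native, tone-language, stress-language) is one way the hemispheric-specialization questions above can be quantified.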
NEURAL CORRELATES OF MATHEMATICAL PROCESSING IN BILINGUALS
Research Team: Ping Li, Shin-Yi Fang (Pennsylvania State University), Yue Wang
This research investigates the roles of working memory capacity, strategies for solving mathematics problems, and level of proficiency in modulating neural response patterns during mathematical processing in English-Chinese bilinguals.