Research: Current Projects
Latest Projects
ADAPTATIONS IN CONVERSATION: ENGAGING VOICES, FACES, BRAINS AND MACHINES
Funding: Natural Sciences and Engineering Research Council of Canada (NSERC)
Research Team: Yue Wang (PI, Linguistics, SFU), Paul Tupper (Mathematics, SFU), Maggie Clarke (SFU ImageTech Lab), Dawn Behne (Psychology, Norwegian University of Science and Technology), Allard Jongman (Linguistics, University of Kansas), Joan Sereno (Linguistics, University of Kansas), and members of the SFU Language and Brain Lab.
We are routinely engaged in face-to-face conversations, voice-only phone chats, and, increasingly, video-based online communication. We interact with people from different language backgrounds, and nowadays even with AI chatbots. Our experiences thus involve adjustments in speech production and perception based on whom we communicate with and in what environments. In this research, we explore how conversation partners from different backgrounds (e.g., native-nonnative, human-AI) adjust their speech for successful communication. Specifically, we collect audio-video recordings of live conversations involving interactive computer-game tasks that elicit words with specific sound contrasts, and we examine how interlocutors adapt their speech to resolve miscommunications as the conversation progresses. Speakers' facial movements during target sound productions and the acoustic correlates of the same productions are analyzed, along with how these differences are perceived. Neural data are also collected to study brain activity during the conversation. Finally, the visual, acoustic, perceptual and neural data are brought together through computational modeling to develop predictions about which adaptive attributes improve the likelihood of accurate communication.
CREATING ADAPTIVE VOCAL INTERFACES IN HUMAN-AI INTERACTIONS
Funding: FASS Breaking Barriers Grant, SFU Faculty of Arts and Social Sciences
Research Team: Yue Wang (PI, Linguistics, SFU), Henny Yeung (Co-Investigator, Linguistics, SFU), Angelica Lim (Co-Investigator, Computer Science, SFU), and members of the Language and Brain Lab, Language and Development Lab, and Rosie Lab.
AI-powered vocal interfaces are rapidly increasing in prevalence. Consequently, a pressing issue is that communication with these interfaces can break down, especially when speaking or listening is challenging (e.g., for language learners, children, or individuals with speech or hearing impairments, or in noisy conditions). The goal of this research is to investigate, in three experiments, how humans and vocal interfaces adapt their speech in the face of such misunderstandings. Specifically, Study 1 asks how human speech production changes in response to misperceptions by AI-powered vocal interfaces. Study 2 creates an adaptive AI-powered vocal interface that communicates better when humans misunderstand. Study 3 brings this work outside SFU to the community, examining naturalistic interactions between humans and social robots that implement the adaptive conversational platform developed in Studies 1 and 2. Findings will improve the technology behind existing virtual assistants, fostering technological engagement in education and in other diverse, multilingual environments.
HYPER-ARTICULATION IN AUDITORY-VISUAL COMMUNICATION
Funding: Social Sciences and Humanities Research Council of Canada (SSHRC)
Research Team: Yue Wang (PI, Linguistics, SFU), Allard Jongman (Linguistics, University of Kansas), Joan Sereno (Linguistics, University of Kansas), Dawn Behne (Psychology, Norwegian University of Science and Technology), Ghassan Hamarneh (Computer Science, SFU), Paul Tupper (Mathematics, SFU), and members of the SFU Language and Brain Lab.
Human speech involves multiple styles of communication. In adverse listening environments or challenging linguistic conditions, speakers often alter their productions using a clarified articulation style termed hyperarticulation, with the intention of improving intelligibility and comprehension for listeners. Questions thus arise as to what strategies speakers use to enhance their speech and whether these strategies are effective in improving intelligibility and comprehension. This research examines hyperarticulation in words differing in voice and facial cues to identify which speech-enhancing cues are important for making words more distinctive. We examine (1) both acoustic voice properties and visual mouth configurations in hyperarticulated words, using innovative computerized sound and image analysis techniques; (2) the intelligibility of hyperarticulated words when speaker voice and/or speaker face are presented to perceivers for word identification; and (3) the relationship between speaker and perceiver behavior, based on computational and mathematical modeling, to determine how speakers and perceivers cooperate to encode and decode hyperarticulated cues in order to achieve optimal communication.
VISUAL PROCESSING OF PROSODIC AND SEGMENTAL SPEECH CUES: AN EYE-TRACKING STUDY
Funding: Social Sciences and Humanities Research Council of Canada (SSHRC)
Research Team: Yue Wang (Co-PI, Linguistics, SFU), Henny Yeung (Co-PI, Linguistics, SFU), and members of the SFU Language and Brain Lab and Language and Development Lab.
Facial gestures carry important linguistic information and improve speech perception. Research, including our own (Garg, Hamarneh, Jongman, Sereno, and Wang, 2019; Tang, Hannah, Jongman, Sereno, Hamarneh, and Wang, 2015), indicates that movements of the mouth help convey segmental information, while eyebrow and head movements help convey prosodic and syllabic information. Perception studies using eye-tracking techniques have also shown that familiarity with a language influences looking time at different facial areas (Barenholtz, Mavica, and Lewkowicz, 2016; Lewkowicz and Hansen-Tift, 2012). However, it is not clear to what extent attention to different facial areas (e.g., mouth vs. eyebrows) differs for prosodic and segmental information, or as a function of language familiarity. Using eye-tracking, the present study investigates three questions. First, we focus on differences in eye-gaze patterns to see how different prosodic structures are processed in a familiar versus a non-familiar language. Second, we focus on monolingual processing of segmental and prosodic information. Third, we compare the results for segmental and prosodic differences in familiar versus non-familiar languages. Results of this research have significant implications for improving strategies for language learning and early intervention.
CONSUMING TEXT AND AUDIO FAKE NEWS IN A FIRST AND SECOND LANGUAGE
Funding: SFU FASS Kickstarter Grant
Research Team: Henny Yeung (PI), Maite Taboada (Co-PI), Yue Wang (Collaborator), Linguistics, SFU.
Interest in digital media, particularly in disinformation, or "fake news," has surged. Almost all work on this topic, however, has looked only at native speakers' consumption of English-language media. We ask here how fake news is consumed in one's first language (L1) vs. a second language (L2), since decision-making, moral judgments, and lie detection are all influenced by whether one uses a first or second language. Only a few prior studies have asked how we consume fake news in an L2, and results are mixed, limited to a single dimension of media evaluation (believability), and explore only text, not audio speech. Objective 1 of this study is to ask how textual and acoustic signatures of truthfulness differ in written and audio news excerpts in English, French, and Mandarin. Objective 2 asks how L1-English vs. L1-French and L1-Mandarin speakers may show distinct tendencies to believe in, change attitudes about, and engage with (true or fake) text and audio news clips in English. Results have profound implications for Canada, where L2 consumers of English-language media are numerous.
Recent Projects
AUTOMATED LIP-READING: EXTRACTING SPEECH FROM VIDEO OF A TALKING FACE
Funding: Next Big Question Fund, SFU's Big Data Initiative
Research Team: Yue Wang (PI, SFU Linguistics); Ghassan Hamarneh (SFU Computing Science); Paul Tupper (SFU Mathematics); Dawn Behne (Psychology, Norwegian University of Science and Technology, Norway); Allard Jongman and Joan Sereno (Linguistics, University of Kansas, USA).
In face-to-face conversation, voice and coordinated facial movements are used simultaneously to perceive speech. In noisy environments, seeing a speaker's facial movements makes speech perception easier. Similarly, with multimedia, we rely on visual cues when the audio is not transmitted well (e.g., during video conferencing) or in noisy backgrounds. In the current era of social media, we increasingly encounter multimedia-induced challenges where the audio signal in a video is of poor quality or misaligned (e.g., via Skype). The next big question for speech scientists, and one relevant to all multimedia users, is what speech information can be extracted from a face and whether the corresponding audio signal can be recreated from it to enhance speech intelligibility. This project tackles the issue by integrating machine-learning and linguistic approaches to develop an automatic face-reading system that identifies and extracts attributes of visual speech to reconstruct the acoustic information of a speaker's voice.
COMMUNICATING PITCH IN CLEAR SPEECH
Funding: Natural Sciences and Engineering Research Council of Canada (NSERC)
Research Team: Yue Wang (PI, SFU Linguistics); Allard Jongman, Joan Sereno, and Rustle Zeng (Linguistics, University of Kansas, USA); Paul Tupper (SFU Mathematics); Ghassan Hamarneh (SFU Computing Science); Keith Leung (SFU Linguistics); Saurabh Garg (SFU Language and Brain Lab, and Pacific Parkinson's Research Centre, UBC).
This research investigates the role of clear speech in communicating pitch-related information: lexical tone. The objectives are to identify how speakers modify their tone production while still maintaining tone category distinctions, and how perceivers utilize tonal enhancement and categorical cues from different forms of input. These questions are addressed in a series of inter-related studies examining articulation, acoustics, intelligibility, and neuro-processing of clear-speech tones.
MULTI-LINGUAL AND MULTI-MODAL SPEECH PERCEPTION, PROCESSING, AND LEARNING
Funding: Social Sciences and Humanities Research Council of Canada (SSHRC) (2012-2017)
Research Team: Yue Wang (PI, Linguistics, SFU), Joan Sereno (Linguistics, University of Kansas), Allard Jongman (Linguistics, University of Kansas), Ghassan Hamarneh (Computer Science, SFU), and members of the SFU Language and Brain Lab.
The temporal alignment of what we hear and see is fundamental to the cognitive organization of information from our environment. Research indicates that a perceiver's experience influences sensitivity to audio-visual (AV) synchrony. We theorize that experience that enhances sensitivity to speech sound distinctions in the temporal domain would also enhance sensitivity in AV synchrony perception. On this basis, a perceiver whose native language (L1) involves duration-based phonemic distinctions would be expected to be more sensitive to AV synchrony in speech than a perceiver whose L1 makes less use of temporal cues. In the current study, simultaneity judgment data from participants differing in L1 experience with phonemic duration (e.g., English, Norwegian, Estonian) were collected using speech tokens with different degrees of AV alignment: from audio preceding video (audio-lead), to audio and video physically aligned (synchronous), to video preceding audio (video-lead). Findings of this research contribute to understanding the relationship between experience and AV synchrony perception.