
Image credit: AWC2024

Graduate Studies Research Spotlight

SFU Linguistics has a strong presence at the joint 186th Meeting of the ASA and AWC2024

May 19, 2024

By Nicole North

The joint 186th Meeting of the Acoustical Society of America (ASA) and the 2024 Acoustics Week in Canada conference (AWC2024) was held at the Shaw Convention Centre in Ottawa from May 13th to 17th. This international conference featured the latest developments in acoustics and vibration.

The Acoustical Society of America is an international scientific society, founded in 1929, dedicated to generating, disseminating, and promoting the knowledge of acoustics and its practical applications.

The Canadian Acoustical Association is a professional, interdisciplinary organization that fosters communication among people working in all areas of acoustics in Canada, promotes the growth and practical application of knowledge in acoustics, and encourages education, research, protection of the environment, and employment in acoustics.

Scroll to the bottom of the article to view the extensive photo gallery. 

SFU Linguistics PhD student Sylvia Cho presented her poster titled Perception of speaker identity for bilingual voices.

Abstract
Voice is often described as an “auditory face”; it provides important information concerning speaker identity (e.g., age, height, sex). The acoustic properties related to voice can also vary substantially within a speaker based on one’s emotional, social, and linguistic states. Recent work suggests that biological components have the greatest impact on the acoustic variability found in voice, followed by language-specific factors and speaking style [Lee & Kreiman, J. Acoust. Soc. Am. 153, A295 (2023)]. The effects of such within- vs. between-speaker acoustic variability on the perception of speaker identity, however, have not been explored. The present study therefore examines the perception of speaker identity in bilingual voices. The prediction is that acoustic variability will also affect speaker identity perception: voices will be discriminated best for between-speaker samples, while within-speaker variability will not affect perception of speaker identity to the same extent. To test this prediction, listeners participated in a voice discrimination task using bilingual voice data produced by Korean heritage speakers across different languages (Korean, English) and speech styles (read, extemporaneous). The data will be analyzed to measure the effects of speaker, language, and speech style on voice discrimination. The results will be reported in relation to the effects of bilingualism and speech style on voice quality and speaker identity.
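
For readers unfamiliar with voice discrimination tasks, same/different responses of this kind are often summarized with a sensitivity score such as d′. The short Python sketch below illustrates that general technique only; it is not the author's analysis code, and the trial fields and example data are hypothetical.

# Illustrative sketch: computing d' (sensitivity) from same/different
# voice-discrimination trials. Field names and example data are hypothetical,
# not taken from the study.
from statistics import NormalDist

def d_prime(trials):
    """trials: list of dicts with 'same_speaker' (bool) and 'said_same' (bool)."""
    hits = sum(t['same_speaker'] and t['said_same'] for t in trials)
    same_total = sum(t['same_speaker'] for t in trials)
    false_alarms = sum((not t['same_speaker']) and t['said_same'] for t in trials)
    diff_total = sum(not t['same_speaker'] for t in trials)
    # Log-linear correction keeps z-scores finite when a rate hits 0 or 1.
    hit_rate = (hits + 0.5) / (same_total + 1)
    fa_rate = (false_alarms + 0.5) / (diff_total + 1)
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

example_trials = [
    {'same_speaker': True,  'said_same': True},
    {'same_speaker': True,  'said_same': True},
    {'same_speaker': False, 'said_same': False},
    {'same_speaker': False, 'said_same': True},
]
print(round(d_prime(example_trials), 2))  # higher d' = better discrimination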

Ivan Fong, an MA student at SFU Linguistics, presented his poster titled Phonetic adaptation in conversation: The case of Cantonese tone merging. Ivan was selected as a winner of the student poster presentation award, which came with a US$300 cash prize.

Abstract
Phonetic adaptation occurs when one interlocutor adjusts their speech to converge to or diverge from that of their conversation partner to enhance intelligibility. While most research investigates segmental adaptations, our study focuses on suprasegmentals, specifically Cantonese tone merging. Some Cantonese speakers (“mergers”) are found to merge certain lexical tones (e.g., mid-level Tone3 and low-level Tone6), which may cause confusions when interacting with non-merger speakers. Previous research has shown that a merger may unmerge a level tone pair (Tone3/Tone6) when shadowing a non-merger. However, still unclear is whether such changes result from automatic acoustic mimicking or reflect goal-oriented adaptations for intelligibility benefits. This study uses an unscripted conversation task involving a merger and a non-merger playing a video game, where productions of merged tones may cause confusions, thus motivating goal-oriented adaptations. Initial acoustic analyses focus on average F0 and F0 taken at 10 points along the contour in target Tone3 and Tone6 productions by mergers. Differences in these values for Tone3 versus Tone6 provide evidence that a merger is unmerging the tone pair. Preliminary results show increasing unmerging trends as the task progresses, suggesting progressive alignment toward a non-merger’s productions for intelligibility gains. 
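
The F0 sampling Fong describes, measuring pitch at evenly spaced points across a tone, can be illustrated with a brief sketch. The Python example below uses parselmouth, a Python interface to Praat, as one possible tool; it is not the study's actual pipeline, and the file name and interval boundaries are hypothetical placeholders.

# Illustrative sketch: sampling F0 at ten evenly spaced points across a
# labelled tone interval with parselmouth (a Python wrapper for Praat).
# The audio file and interval boundaries below are hypothetical.
import numpy as np
import parselmouth

def f0_contour(wav_path, start, end, n_points=10):
    snd = parselmouth.Sound(wav_path)
    pitch = snd.to_pitch()                     # pitch track with default settings
    times = np.linspace(start, end, n_points)  # ten equidistant time points
    f0 = [pitch.get_value_at_time(t) for t in times]  # Hz; NaN where unvoiced
    return np.array(f0)

contour = f0_contour("speaker01_tone3_token05.wav", start=0.12, end=0.38)
print(contour)              # F0 at each of the ten points along the contour
print(np.nanmean(contour))  # average F0 over the contour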

Jetic Gū, a Computational Linguistics PhD student and one of the managers at the Language and Brain Lab, presented his poster titled A new experimental design to study speech adaptations in spontaneous human-computer conversations.

Abstract
Interest is growing in how human interlocutors make phonetic adaptations during spontaneous conversations. Given the increasing popularity of AI chatbots, research also needs to account for adaptations in human-computer interactions, an area under-investigated, presumably due to methodological challenges in generating controlled conversational responses. Most studies involve scripted computer output, which may obstruct the dynamicity and the oral-aural medium of a natural conversation. To circumvent these constraints, we present a new experimental design that generates unscripted audio computer responses in human-computer conversations during a collaborative game played on Zoom. This design is unique in several aspects. First, the game (Escape Room) requires discussions on placing pictures (depicting target words/sounds) in specific locations, where misperceptions of target words between interlocutors may cause confusions, thus motivating natural adaptations. Second, to enable real-time computer responses, we adopt the wizard-of-oz paradigm typically used in the field of human-computer interaction, where a human confederate inputs text responses behind the scenes. Third, a programmable text-to-speech synthesiser converts the text input to audio output. The design demonstrated in this presentation opens the door to new analyses, tracking the dynamicity of speech adjustments over time. Moreover, it is generalisable to studying speech adaptations across interlocutor backgrounds.
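
The wizard-of-oz idea, a hidden confederate typing responses that are immediately rendered as synthetic speech, can be sketched in a few lines of Python. The example below is only a minimal illustration of that idea; pyttsx3 stands in for whatever programmable synthesiser the actual setup uses, and the voice settings are arbitrary.

# Illustrative sketch of a wizard-of-oz relay: the hidden "wizard" types a
# reply, and a text-to-speech engine speaks it to the participant in real time.
# pyttsx3 is a stand-in here; settings are arbitrary placeholders.
import pyttsx3

engine = pyttsx3.init()
engine.setProperty("rate", 170)  # approximate speaking rate (words per minute)

print("Wizard console: type a reply and press Enter (empty line to quit).")
while True:
    reply = input("> ").strip()
    if not reply:
        break
    engine.say(reply)    # queue the typed response as synthetic speech
    engine.runAndWait()  # block until the utterance has finished playing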

Fenqi Wang, a postdoctoral fellow at SFU Linguistics, presented his poster titled Real-time speech adaptations in conversations between human interlocutor and AI confederate.

Abstract
Compared to human-directed adaptations, less is known about how humans adjust their speech for intelligibility benefits while interacting with an AI-powered voice interface. In this study, we investigate human speech adaptations in human-to-human versus human-to-AI unscripted conversations. Specifically, we examine the production of words containing intervocalic /t-d/ in a conversation between a speaker who distinguishes these two stops (e.g., metal-medal) and a speaker (“flapper”) who merges the two stops into a flap /ɾ/. We predict that misperceptions of intervocalic /t-d/ may cause confusions, thus motivating adaptations. We record native Canadian-English speakers (flappers) while playing a video game on Zoom in two conversation settings: with (1) a human non-flapper and (2) an AI non-flapper (computer-generated speech). Acoustic analyses of the productions by human flapper speakers include features specific to stop-flap distinctions as well as global features (e.g., overall duration). In both human- and AI-directed speech, we expect human interlocutors to change flapped productions to stops to enhance intelligibility, particularly late in the conversation. Moreover, we expect differences between human- and AI-directed adaptations, with the former dominantly employing sound-specific features and the latter relying more on global hyperarticulation. Understanding these interlocutor-oriented adaptations may inform the technology behind human-computer interfaces. 
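
To make the contrast between sound-specific and global acoustic features concrete, the small sketch below computes one of each from hand-labelled segment boundaries: the duration of the intervocalic /t, d/ segment and the duration of the whole word. It is only an illustration; the annotation format and the numbers are hypothetical, not data from the study.

# Illustrative sketch: one sound-specific feature (duration of the intervocalic
# /t,d/ segment) and one global feature (whole-word duration) computed from
# hypothetical segment annotations of the word "metal".
word = "metal"
segments = [              # (label, start_s, end_s) from a hand-corrected alignment
    ("m", 0.000, 0.060),
    ("ɛ", 0.060, 0.155),
    ("t", 0.155, 0.215),  # the intervocalic stop/flap of interest
    ("ə", 0.215, 0.290),
    ("l", 0.290, 0.380),
]

stop_flap_dur = next(end - start for label, start, end in segments
                     if label in ("t", "d", "ɾ"))
word_dur = segments[-1][2] - segments[0][1]

print(f"{word}: /t,d/ segment = {stop_flap_dur * 1000:.0f} ms, "
      f"word = {word_dur * 1000:.0f} ms")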

SFU Linguistics PhD student Han Zhang presented her poster during a special session called VowelFest. Zhang’s poster is titled Speech adaptation in conversation: Effects of segment cue-weighting strategy for non-native speakers.


Photo gallery

SFU Linguistics PhD student Sylvia Cho
Patricia Keating (UCLA) and Ivan Fong
Fenqi Wang and Jetic Gū
Ivan Fong, Jetic Gū, Han Zhang, and Fenqi Wang
Brian Diep (UBC) and Sylvia Cho
Computational Linguistics PhD student Jetic Gū
SFU Linguistics PhD student Han Zhang
Brian Diep, Han Zhang, Sylvia Cho, and Ivan Fong
SFU Linguistics MA student Ivan Fong
SFU Linguistics postdoc Fenqi Wang
Han Zhang presents to a large crowd
Han Zhang, Brian Diep, Sylvia Cho, and Ivan Fong