Some questions that motivate my research:

  • How does language vary and change?
  • How do people use language variation as a part of their personal style?
  • How do we build tools that are sensitive to these stylistic differences?

Supporting the linguistics community through software

Modern research relies on computer software to automate data extraction and analysis, and this software doesn’t write itself. Within my field of sociolinguistics, researchers rely on a set of tools which time-align transcripts to speech sounds (forced alignment) and perform acoustic analyses of the speech sounds identified. One common tool is the Forced Alignment and Vowel Extraction (FAVE) package, developed around the turn of the millennium in part for work on Philadelphia English. While FAVE is widely used in sociolinguistics research, it has three main problems:

  1. A transcript is required, but transcription is time-consuming and expensive. This is a bottleneck in research, and contemporary transcription tools are not well suited to this task out of the box.
  2. The system is biased towards particular North American English accents. Using the software on Englishes from Europe or Asia often leads to errors which can cause incorrect findings.
  3. The software is old and difficult to maintain, leading to lost time as research teams need to fix errors.

I have worked to address these shortcomings in various ways. To reduce the transcription bottleneck, I have built an interface that allows researchers to transcribe their data using language models optimized for the data we frequently encounter. Sociolinguists often work with conversational data, and this includes overlaps, stutters, ums-and-uhs, and other disfluencies which are important data points. Most popular language models are optimized for transcribing a speech or narrative rather than a conversation, and they tend to discard disfluencies. I tested and selected open-source language models which can reliably identify different speakers and retain disfluencies, with a focus on speakers of non-standard Englishes. For the data extraction and measurement stage, I work with Josef Fruehwald (University of Kentucky) on a system which is language-agnostic so that researchers of any language can have a standard, reliable data pipeline. While these and other software projects are in the works, I have updated and maintained the legacy version as a stop-gap for researchers. I migrated the prior code base to Python 3, implemented tests and a CI/CD pipeline, and introduced a build system for distribution via PyPI to simplify user installation. With a consistent, reliable, and standard set of tools for end-to-end processing of sociolinguistic data, researchers can save time using a solution that just works, and the community can be more confident in research results because the pipeline is open and easily inspected to ensure programming errors are not a confound.

Speech in non-urban California

Most research into how Californians talk focuses on residents of large, coastal cities. In my PhD program, I researched how Californians outside the urban centers talk, going to towns across the state and doing oral history interviews with life-long residents of these communities. With Rob Podesva (Stanford University), I found that speakers in these communities used different parts of the “California” accent to signal aspects of their personality and values, and that these patterns differed across the state from one community to another. For example, the Latinx communities in Bakersfield and Merced are long-established and make up a large proportion of the local populations, but while they are similar, not all Latinx speakers or communities are the same. I found evidence that speakers in these communities differ in how they pronounce the “a” sound in “trap” and “tram”, and these differences relate to how Latinx identity is performed in these different communities. This and other findings demonstrate that our understanding of language in California is impoverished when we do not consider the racial and geographic diversity of communities in the state.

History of a family of languages in Papua New Guinea

Kate Lindsey (Boston University) and colleagues have recently documented a family of languages in southern Papua New Guinea along the Pahoturi River. These languages share a common ancestor, but we don’t know where that common ancestor was spoken or how it was related to other languages in the area. One hypothesis is that speakers of these languages moved north from Australia. Another hypothesis is that speakers moved south from other parts of Papua New Guinea. To find evidence for these historical migrations, I have been reconstructing the common ancestor of these languages and evaluating which neighboring languages share traits with the reconstructed language. I have focused on a series of sounds, /kw/ and /gw/, which linguists call labialized velars. These sounds are found in some neighbors of the Pahoturi River languages, and some contemporary Pahoturi River languages have these sounds too. Were these sounds borrowed into Pahoturi River languages from neighbors, or were these sounds in their common ancestor? If the sounds were in the common ancestor of the Pahoturi River languages, that suggests these languages are related to their Papuan neighbors rather than their Australian neighbors. By comparing the contemporary Pahoturi River languages, I showed that we can build a phylogenetic tree relating all the contemporary languages to a common ancestor through a series of language changes that happened in some communities but not others. Based on this evidence and previous work, I argue in favor of the Papuan hypothesis over the Australian hypothesis.