Building Inclusive Natural Language Processing for Speakers of Marginalised and Under-Represented Languages


Every day, millions of people rely on technologies such as digital assistants, speech-to-text software, and voice-operated GPS systems. However, for speakers of African American Vernacular English (AAVE), these technologies can be problematic because large natural language processing (NLP) models are often unable to understand or generate AAVE words. Because these models can also be trained on data scraped from the web, they can perpetuate racial bias and stereotypical associations.

When companies use biased models to make important decisions, the consequences for AAVE speakers can be serious, such as unfairly restricted access to social media, housing, and loans, and unfair treatment in the law enforcement and judicial systems. Companies themselves can also face serious repercussions under regulations that have recently been enacted and others that are in the pipeline.

For the past couple of years, machine learning specialists like Jazmia Henry, a fellow at the Stanford Institute for Human-Centered Artificial Intelligence (HAI) and the Center for Comparative Studies in Race and Ethnicity (CCSRE), have been working to incorporate AAVE into NLP models in a responsible and inclusive manner. Ms. Henry alone has created an open-source database of more than 141,000 AAVE words to help researchers and builders design models that are less susceptible to bias. The goal is for social and computational linguists, computer scientists, anthropologists, social scientists, and others to use this database for research and testing, ultimately growing it into a true representation of AAVE and providing feedback for future algorithm improvements.

AAVE is a language of perseverance and uplift, created from African languages that were thought to have been lost during the slave trade migration. Specialists like Ms. Henry became interested in including AAVE and other marginalised languages in NLP models because of their personal experience of such languages being spoken by close relatives and friends (in Ms. Henry’s case, creole), and of the shame and stigma attached to those languages in their communities.

In the particular case of AAVE, creating the database was not without obstacles, as AAVE evolves much more quickly than other languages and often gives words unique meanings. The database is broken down into four collections: a lyric collection of songs by 105 artists, a leadership collection of speeches by influential individuals, a book collection of works from historically Black book archives, and a social media collection of video transcripts, blog posts, and tweets.
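As a rough illustration of how a researcher might use a corpus organised into these four collections for testing, the sketch below measures, per collection, the share of word tokens that a model's vocabulary does not cover. It is a minimal sketch under stated assumptions: the collection keys, the `oov_rates` helper, the simple tokenizer, and the placeholder texts are all hypothetical, and none of them reflects the actual layout or contents of Ms. Henry's database.

```python
import re
from collections import Counter

# Hypothetical keys mirroring the four collections described above;
# the real database's file layout and field names are not specified here.
COLLECTIONS = ("lyrics", "leadership", "books", "social_media")

# Deliberately simple tokenizer: runs of lowercase letters and apostrophes.
TOKEN_RE = re.compile(r"[a-z']+")


def tokenize(text):
    """Lowercase a text and split it into simple word tokens."""
    return TOKEN_RE.findall(text.lower())


def oov_rates(corpus, vocabulary):
    """Return, per collection, the share of tokens missing from `vocabulary`."""
    rates = {}
    for name in COLLECTIONS:
        counts = Counter(
            tok for text in corpus.get(name, []) for tok in tokenize(text)
        )
        total = sum(counts.values())
        missing = sum(n for tok, n in counts.items() if tok not in vocabulary)
        # An empty collection is reported as 0.0 rather than raising an error.
        rates[name] = missing / total if total else 0.0
    return rates


# Placeholder texts only; these are not entries from the actual database.
sample_corpus = {
    "lyrics": ["alpha beta beta"],
    "leadership": ["alpha gamma"],
    "books": ["alpha alpha"],
    "social_media": [],
}
sample_vocabulary = {"alpha"}  # stands in for a model's known vocabulary
rates = oov_rates(sample_corpus, sample_vocabulary)
```

A high out-of-vocabulary share in one collection would suggest where a model's coverage is weakest, although a real evaluation would need the model's own tokenizer rather than the simplified one assumed here.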

The hope is that this project will inspire researchers to question the field and push it forward, so that marginalised and under-represented languages are represented in NLP models. It may also help social and computational linguists determine whether such languages are languages in their own right or dialects, and explore their links to other languages that emerged from the regions where they originated.

The database Ms. Henry has created has the potential to remove the shame associated with AAVE and instil pride in its speakers. AAVE may be proof that the legacy of African languages has been retained, despite the losses that may have been suffered during the slave trade. By including AAVE in NLP models, researchers can not only improve technology for its speakers but also further understand and celebrate the rich cultural heritage it represents. Nor is this an isolated case: it can pave the way for other such initiatives.

Bernard Mallia

Article Author
