Building Inclusive Natural Language Processing for Speakers of Marginalised and Under-Represented Languages

Every day, millions of people rely on technologies such as digital assistants, speech-to-text software, and voice-operated GPS systems. For speakers of African American Vernacular English (AAVE), however, these technologies can be problematic because large natural language processing (NLP) models are often unable to understand or generate AAVE words. These models may also be trained on data scraped from the web, which can perpetuate racial bias and stereotypical associations.

When companies use biased models to make important decisions, the consequences for AAVE speakers can be serious: unfairly restricted access to social media, housing, and loans, and even unfair treatment in the law enforcement and judicial systems. Companies, too, can face serious repercussions under regulations that have recently been enacted and others that are in the pipeline.

For the past couple of years, specialists such as Jazmia Henry, a machine learning specialist and fellow at the Stanford Institute for Human-Centered Artificial Intelligence (HAI) and the Center for Comparative Studies in Race and Ethnicity (CCSRE), have been working to incorporate AAVE into NLP models in a responsible and inclusive manner. Ms. Henry alone has created an open-source database of more than 141,000 AAVE words to help researchers and builders design models that are less susceptible to bias. The goal is for social and computational linguists, computer scientists, anthropologists, social scientists, and others to use this database for research and testing, and ultimately to grow it into a true representation of AAVE while providing feedback for future algorithm improvements.

AAVE is a language of perseverance and uplift, created from African languages that were thought to have been lost during the slave trade migration. Specialists like Ms. Henry became interested in including AAVE and other marginalised languages in NLP models because of their personal experience with such languages (in Ms. Henry’s case, creole), spoken by close relatives and friends, and because of the shame and stigma attached to these languages in their communities.

In the particular case of AAVE, the creation of the database was not without obstacles, as AAVE evolves much more quickly than other languages and often gives words unique meanings. The database is broken down into four collections: a lyric collection of songs by 105 artists, a leadership collection of speeches by influential individuals, a book collection of works from historically Black book archives, and a social media collection of video transcripts, blog posts, and tweets.
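As a minimal sketch of how a database organised this way might be used for testing, the Python example below groups a handful of words under the four collections and measures what share of each collection's unique words is missing from a model's vocabulary. The collection keys, the sample words, the `model_vocab` set, and the `oov_rate` helper are all illustrative assumptions; none of them is taken from the actual database or from any particular NLP model.

```python
from typing import Dict, List, Set

# Hypothetical miniature stand-in for the four collections described above.
# The words are illustrative only and are NOT drawn from the real database.
collections: Dict[str, List[str]] = {
    "lyrics": ["finna", "love", "tryna"],
    "leadership": ["freedom", "justice"],
    "books": ["river", "gon"],
    "social_media": ["finna", "lowkey", "today"],
}

# Hypothetical vocabulary of some NLP model (an assumption for this sketch).
model_vocab: Set[str] = {"love", "freedom", "justice", "river", "today"}


def oov_rate(words: List[str], vocab: Set[str]) -> float:
    """Return the fraction of unique words that are absent from vocab."""
    unique = {w.lower() for w in words}
    if not unique:
        return 0.0
    missing = unique - vocab
    return len(missing) / len(unique)


# Out-of-vocabulary rate per collection.
rates = {name: oov_rate(words, model_vocab) for name, words in collections.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.0%} of unique words missing from the vocabulary")
```

A high out-of-vocabulary rate in one collection would suggest, under these assumptions, where a model's coverage is weakest and where additional data or evaluation may be most useful.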

The hope is that this project will inspire researchers to question the field and push it forward, so that marginalised and under-represented languages are represented in NLP models. It may also help social and computational linguists determine whether such varieties are languages in their own right or dialects, and explore their links to other languages that emerged from the regions where they originated.

In Ms. Henry’s case, the database she has created has the potential to remove the shame associated with AAVE and instil pride in its speakers. AAVE may be proof that the legacy of African languages has been retained, despite the losses that may have been suffered during the slave trade. By including AAVE in NLP models, researchers can not only improve technology for its speakers but also further understand and celebrate the rich cultural heritage it represents. This is, however, not an isolated case and can pave the way for other such initiatives to be undertaken.



Article Author: Bernard Mallia

January 9, 2023

