Inclusive AI for a Culturally Rich Future – Bea Guevarra

Large Language Models (LLMs) are rapidly weaving themselves into the fabric of our lives. However, their true potential can only be unlocked if they understand the rich tapestry of human cultures and languages. Culturally inclusive LLMs hold the key to transforming how we interact with technology and ensuring it serves the global population fairly.

The Importance of Multilingual AI

For AI to be truly impactful, it needs to interact with people from diverse backgrounds. Edward Sapir and Benjamin Lee Whorf proposed what is known as the theory of linguistic relativity: the language we speak influences how we perceive the world. We categorize and understand experiences through the words available to us.

If AI is restricted to a few languages, it can only understand the world through those specific lenses. This creates a narrow and potentially biased view of reality. Language is not monolithic, and opportunities may be missed in developing generative AI tools for non-standard languages and dialects.

To illustrate the concept, imagine AI as a chef. If it only knows recipes from one culture (one language), its cooking will be limited. But with access to recipes from various cultures (multiple languages), it can create a richer and more diverse culinary experience.  Similarly, multilingual AI requires access to the vast spectrum of human experiences expressed through languages from all corners of the globe.

The Digital Language Divide: A Barrier for Both People and AI 

The digital age presents a significant challenge: the digital language divide. This gap not only limits access to online information for speakers of less dominant languages but also hinders the development of AI itself.

Currently, the vast majority of online content is in English, leaving speakers of other languages at a disadvantage. With limited data available in other languages, AI systems struggle to learn and process information effectively. This, in turn, perpetuates the digital divide by leading to AI applications that primarily cater to English-speaking users.

The problem is further compounded by the digital footprint. Speakers of under-resourced languages often have limited access to digital services, resulting in less data being available about their languages. This means their languages are less likely to be included in training data for AI systems, further marginalizing them in the digital world.

A stark example of this divide can be seen in Wikipedia, a beacon of free knowledge. The English Wikipedia hosts over 6 million articles, while Bengali Wikipedia, representing the sixth most-spoken native language, has only about 100,000 articles. This “digital language divide” hinders the global exchange of knowledge, and has become a persistent problem within collaborative platforms that aim to democratize access. 
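The gap between language editions can be checked directly, since every Wikipedia exposes its statistics through the public MediaWiki API. A minimal sketch (the sample response below is illustrative, not live data; a real comparison would fetch the URLs this builds):

```python
import json
from urllib.parse import urlencode

def siteinfo_url(lang_code):
    """Build the MediaWiki API URL for a wiki's site statistics."""
    params = {"action": "query", "meta": "siteinfo",
              "siprop": "statistics", "format": "json"}
    return f"https://{lang_code}.wikipedia.org/w/api.php?" + urlencode(params)

def article_count(siteinfo_json):
    """Extract the article count from a decoded siteinfo API response."""
    return siteinfo_json["query"]["statistics"]["articles"]

# Abridged example of the JSON shape the API returns (illustrative numbers):
sample = {"query": {"statistics": {"articles": 100000, "pages": 1500000}}}

print(siteinfo_url("bn"))        # URL for Bengali Wikipedia's statistics
print(article_count(sample))     # 100000
```

Fetching and comparing `siteinfo_url("en")` and `siteinfo_url("bn")` makes the divide concrete in two lines of data.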

Why is it important to have content in multiple languages?

Throughout history, language has been wielded as a tool to silence and marginalize. Colonial powers like Britain in India and France in Vietnam imposed their languages, aiming to weaken local identities and establish dominance.

Today, the battleground has shifted to language-based technologies like generative AI. If not developed inclusively, AI risks becoming a gatekeeper, perpetuating inequalities in the digital age.

When content is available in someone’s native language, people can better understand information, participate in online communities, and feel more included in the digital world. It can also help preserve cultural heritage and traditions. It allows people to connect with their roots and express themselves authentically. 

How can the youth address the digital language divide? 

Addressing these issues will not only enhance the performance of generative AI but also ensure it serves all communities equitably. 

As a young person passionate about bridging the digital language divide, you can be a powerful changemaker. Contribute data by sharing creative infographics, stories, and social media content in your native language. Or become an AI tester, providing feedback on the accuracy of regional dialects, the slang commonly used by young people, and cultural sensitivity in AI models designed for your language. You can also identify and report instances of bias in AI outputs, fostering a culture of inclusivity in AI development.

Talk to local government officials about promoting the development of standardized data formats and protocols for multilingual AI models to ensure interoperability. Raise awareness within your communities about the importance of multilingual AI and the need for participation in data collection and testing.

These are just a few examples, but by taking such actions and collaborating with your peers, you can help ensure AI empowers everyone, regardless of language.

The Future of Multilingual AI

Promising machine translation (MT) technologies, such as Google Translate, are emerging as effective tools to address this language divide. 

  • Google Translate added 110 new languages, including the highly requested Cantonese, the Shahmukhi variety of Punjabi, five Philippine languages such as Hiligaynon and Kapampangan, and the Papua New Guinean creole Tok Pisin. This expansion leverages the power of Google’s PaLM 2 AI model.
  • Mozilla’s Common Voice invites global participation to contribute voice data in various languages. This helps develop voice recognition systems that understand the global linguistic spectrum, spanning accents and dialects from Jakarta to Johannesburg.
  • BharatGPT is a powerful example of the nuanced application of AI in multilingual contexts like India. Supporting more than 14 Indian languages, this LLM taps into the country’s rich linguistic tapestry, providing access across video, voice, and text mediums.

To combat bias in generative AI, we need a proactive approach that embraces regional and linguistic nuances from the very beginning. This requires a diverse “human-in-the-loop” approach, involving communities from the start. Their varied inputs, including dialects and idioms, ensure large language models (LLMs) represent the richness of human language.
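One simple way such human-in-the-loop feedback can be used is to aggregate reviewer ratings by language or dialect and flag varieties the model serves poorly. This is a hypothetical sketch, not any particular platform’s pipeline; the dialect names and scores are invented for illustration:

```python
from collections import defaultdict

def flag_underserved(ratings, threshold=3.0):
    """Given (dialect, score) pairs from human reviewers (scores 1-5),
    return the dialects whose average rating falls below the threshold."""
    by_dialect = defaultdict(list)
    for dialect, score in ratings:
        by_dialect[dialect].append(score)
    return sorted(d for d, scores in by_dialect.items()
                  if sum(scores) / len(scores) < threshold)

# Invented community feedback on model outputs:
feedback = [("Hiligaynon", 2), ("Hiligaynon", 3), ("English", 5),
            ("Tok Pisin", 2), ("English", 4)]
print(flag_underserved(feedback))  # ['Hiligaynon', 'Tok Pisin']
```

The point of the sketch is the loop itself: community ratings come in per variety, and the varieties the model handles worst become explicit targets for more data collection and testing.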

But inclusion goes beyond technical solutions. Efforts to involve underrepresented communities must be transparent and respectful. Language is not just a communication tool; it is deeply personal.

Imagine an AI that speaks every language and navigates cultural nuances. It could revolutionize how we interact with technology, fostering global connection and understanding. The future of AI shouldn’t be about exclusion, but about embracing the diversity of our languages and cultures. By building inclusivity into the foundation of AI development, we can pave the way for a truly global AI landscape that benefits everyone.


Written by Bea Guevarra