Imagine walking down a bustling street in one of Africa’s cosmopolitan cities, listening to the cacophony of conversations. You hear words – but more than that, you hear both foreign and familiar words, a mix of languages, even in one conversation. Take this example from Nigeria, where English and Yoruba merge in one message: “Mo fe ra motor fun omo mi ni next week”, which translates as “I want to buy a car for my child next week.” This phenomenon is known as code-switching – a common linguistic practice in multilingual cultures where people switch between languages in a single discourse.
About 40% of the world’s population is bilingual and almost 20% is multilingual. Africa alone accounts for about 2 000 out of an estimated 7 000 languages in the world, making code-switching an inevitable occurrence.
However, most African languages are low-resourced and under-represented in recent natural language processing technologies, the most popular being large language models such as ChatGPT. It is therefore imperative to move towards an equitable representation of these languages, and ensure that everyone has equal access to these technologies.
Research into code-switching has received increased attention over the past decade, with the biggest focus on corpus creation, benchmark development and the evaluation of downstream tasks for English paired with Spanish, Hindi or Chinese. Very little research or available data exists for African languages.
So what do you do when you don’t have data? You become creative.
To address the issue of data scarcity, a research team in the Department of Computer Science at the University of Pretoria (UP) has turned to large language models like ChatGPT to generate code-switched text. Its methods include linguistic prompting, in-context learning, and zero-shot and few-shot fine-tuning.
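To make the idea of in-context learning concrete, here is a minimal sketch of how a few-shot prompt for code-switched text might be assembled. It is an illustration only, not the UP team's actual prompts or pipeline: the `Example` class, the `build_few_shot_prompt` function and the instruction wording are all hypothetical, and the single demonstration pair is the Yoruba-English sentence quoted earlier in this story. The sketch only builds the prompt text; sending it to a language model is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One demonstration pair: an English sentence and a code-switched version."""
    english: str
    code_switched: str


def build_few_shot_prompt(examples, english_sentence,
                          matrix_language="Yoruba", embedded_language="English"):
    """Assemble an in-context learning prompt (hypothetical wording).

    With an empty `examples` list this becomes a zero-shot prompt;
    with one or more demonstrations it becomes a few-shot prompt.
    """
    lines = [
        f"Rewrite each sentence as a speaker who mixes {matrix_language} "
        f"and {embedded_language} in one sentence would say it.",
        "",
    ]
    # Each demonstration shows the model the input/output pattern to imitate.
    for ex in examples:
        lines.append(f"English: {ex.english}")
        lines.append(f"Code-switched: {ex.code_switched}")
        lines.append("")
    # The final pair is left open for the model to complete.
    lines.append(f"English: {english_sentence}")
    lines.append("Code-switched:")
    return "\n".join(lines)


# Demonstration taken from the example sentence in this story.
demos = [
    Example(
        english="I want to buy a car for my child next week.",
        code_switched="Mo fe ra motor fun omo mi ni next week",
    )
]

prompt = build_few_shot_prompt(demos, "I will call you tomorrow.")
print(prompt)
```

The design choice here is to keep the demonstrations separate from the instruction, so the same function can produce zero-shot prompts (no demonstrations) or few-shot prompts (several demonstrations) for any language pair.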
“Initial results suggest that the embedded knowledge of high-resource languages can be useful in closing the gap in low-resource language availability,” says PhD candidate Michelle Terblanche. “The goal is to develop sustainable methods for corpus creation and to make these resources available to the larger community to continuously support advancements in this research field.”
This research opens up possibilities for transdisciplinary collaboration across a variety of disciplines.
“Data quality is at the centre of the research,” Michelle says. As part of her research, she is focusing on developing methods for evaluating synthetic data with minimal human intervention and low-cost computational resources. “We can then address a broader range of applications, catering for language diversity,” she adds.
“Our aim is to position Africans at the forefront of shaping artificial intelligence (AI) for our own benefit,” says Professor Vukosi Marivate, who holds the ABSA UP Chair of Data Science. “This research is a step towards preserving a rich culture and focusing attention on developing technologies that serve the ever-growing multilingual population.”
In alignment with this goal, Prof Marivate – who received a Faculty Research Award from JP Morgan Chase in the category AI to Empower Employees – is channelling a portion of these funds into a specific aspect of Dr Kayode Olaleye’s broader research. This collaboration aims to enhance financial inclusion irrespective of linguistic habits or educational background.
“Through breakthroughs in code-switching modelling, we envision financial institutions being able to offer more relevant, accessible and engaging experiences to a wide array of customers,” Dr Olaleye says. “This research could facilitate the provision of bilingual customer service, the creation of multilingual documentation and the crafting of marketing materials specifically designed to resonate with the diverse linguistic habits of customers.”
Read more research stories like this in the Re.Search magazine.