By John Walubengo
As artificial intelligence (AI) continues to transform global industries, a significant oversight in its development remains largely unacknowledged: the underrepresentation of African datasets in most global AI training data.
This gap curtails the effectiveness of AI applications across the African continent and perpetuates a broader issue of bias within these global AI systems.
This situation presents an urgent need for inclusive AI training datasets that genuinely reflect the diversity of global populations.
It is no secret that AI systems perform best when trained on comprehensive and diverse datasets. However, the data used to train these systems comes predominantly from North America, Europe, and Asia.
This exclusion of African data leaves AI models ill-equipped to understand or interact appropriately with the nuances of African environments, languages, and cultural contexts.
Impact of unrepresentative AI training data
For instance, let us consider a healthcare AI system designed to diagnose skin diseases. Such a system trained primarily on data from Caucasian skin types will inevitably underperform when diagnosing conditions on darker African skin, leading to misdiagnoses and inadequate healthcare delivery.
Similarly, the voice recognition and natural language technologies that power everything from customer service bots to intelligent assistants are less responsive to the accents and languages of the African continent. This frustrates users and limits the adoption and utility of AI technologies in Africa.
The consequences of these data gaps are not abstract—they are real and palpable.
They affect you and me at the grassroots level. One common example is the Google Maps voice assistant, which pronounces the names of African roads with an English or French accent that confuses rather than assists the African user.
Similarly, local businesses, educational institutions, and healthcare providers that attempt to implement global AI solutions encounter barriers that impede their efficiency and effectiveness, simply because the AI does not understand the local context.
Other examples are easy to imagine. An AI-driven educational tool might fail to recognise local dialects in spoken answers, unfairly penalising students for their linguistic background.
A financial service using a global AI tool for credit scoring might misjudge an individual’s creditworthiness simply because its training data reflects credit card transactions rather than local economic activities and norms, such as M-PESA payments.
The Debate on Data Inclusion
The discussion about integrating African datasets into AI training regimes involves a spectrum of opinions. On one side, some argue that the integration process is neither convenient nor cost-effective, especially given the legal and logistical hurdles of data collection in diverse regulatory environments across Africa.
They are also concerned about the feasibility of harmonising such vast amounts of data.
On the other side, proponents of data inclusivity argue that the long-term benefits—creating AI systems that are truly global and non-discriminatory—far outweigh the initial challenges. They point out that more representative data would lead to more robust AI systems that can perform effectively across all geographies.
Addressing this critical AI data gap requires concerted efforts from multiple stakeholders, some of which are suggested below:
Local Data Initiatives: African tech companies and research institutions must collect, curate, and make available extensive local datasets that reflect our context.
Global Partnerships: International AI developers should partner with local African entities to understand and incorporate local data needs into their development processes.
Regulatory Frameworks: Governments across Africa should develop frameworks that facilitate data sharing while protecting individual privacy, to support the creation of inclusive AI systems.
In conclusion, the absence of African datasets in AI development is a glaring oversight that needs immediate and sustained attention.
Ensuring that AI systems are trained on diverse global datasets is crucial for accuracy and fairness.
It’s time for the AI development community to embrace this diversity fully to enhance technological effectiveness across Africa and ensure equitable AI impacts worldwide.
John Walubengo is an ICT Lecturer and Consultant. @jwalu.