When does an AI model become “Arab”?

Jan 29, 2026

Training an Arabic LLM that reflects local values

Artificial intelligence competition in the Middle East is often framed as an infrastructure story. Data centers are being built in the desert. Chip access is being negotiated at the highest levels. Sovereign wealth funds are sending billions toward Silicon Valley to build their own AI infrastructure based on U.S. models, and are using hyperscalers to advance their AI ecosystems.

Observers tend to focus on the physical hardware of AI ecosystems in the Middle East, largely because this is where the most consequential decisions about long-term access to technology are made. If the UAE and Saudi Arabia succeed in transforming their countries into AI hubs, this will be an important step for the Middle East. But a lesser-known competition I find interesting is the question of indigenous “Arab AI”. As Middle Eastern nations move to catch up in the rapidly advancing global AI race, how will 400 million Arabic speakers interact with artificial intelligence, and how does AI itself come to understand the Arab world?

Arabic remains one of the most significant unsolved problems in natural language processing. The language spans 30+ dialects across 22 countries, features morphological complexity that defeats standard tokenization approaches, and carries cultural and religious context that generic models routinely mishandle. According to recent analysis, Arabic is spoken by over 491 million people across 22+ countries—yet only 0.5% of natural language processing (NLP) research focuses on it. This gap creates an opening that multiple actors have recognized, though they’re approaching it in fundamentally different ways, with different implications for the region.
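To make the tokenization point concrete, here is a minimal sketch using only the Python standard library. The example word and its hand-written segmentation are my own illustration, not output from any of the models discussed here. It shows how one Arabic word can carry what English spreads across five words, and why byte-level approaches spend two bytes on each Arabic letter.

```python
# Illustrative only: the example word and its segmentation are assumptions
# chosen for demonstration, not output from any real tokenizer.
arabic = "وسيكتبونها"              # roughly: "and they will write it"
english = "and they will write it"

# Hand segmentation of the single Arabic word into its meaningful pieces:
# "and" + future marker + "they write" + "it"
morphemes = ["و", "س", "يكتبون", "ها"]


def whitespace_tokens(text: str) -> list[str]:
    """Naive tokenizer: split on whitespace only."""
    return text.split()


def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per character, ignoring spaces."""
    chars = text.replace(" ", "")
    return len(chars.encode("utf-8")) / len(chars)


# Token counts under the naive whitespace tokenizer (Arabic, then English).
print(len(whitespace_tokens(arabic)), len(whitespace_tokens(english)))
# Average bytes per character (Arabic, then English).
print(utf8_bytes_per_char(arabic), utf8_bytes_per_char(english))
```

A whitespace tokenizer sees the Arabic word as a single unit, hiding the four pieces inside it, while a byte-level scheme starts from twice as many units per letter as it does for English. Real subword tokenizers sit somewhere in between, and how well they split Arabic depends heavily on how much Arabic they were trained on.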

Hardware over software?

The dominant AI narrative in the Middle East focuses on hardware. Last year, the Commerce Department approved the export of 70,000 advanced chips to the UAE and Saudi Arabia, conditioned on Gulf partners distancing themselves from Chinese technology firms. Meanwhile, American hyperscalers like Amazon, Microsoft Azure, Google Cloud and Oracle Cloud have been welcomed by the region, where they are building data centers and partnering with top AI companies, including the UAE’s G42 and Saudi Arabia’s HUMAIN. China still maintains a foothold, but it is currently dwarfed by the GCC’s increasing lock-in to American innovation.

The hardware story matters because compute is a prerequisite for serious AI development in the Arab Gulf. But hardware infrastructure does not solve the Arabic problem. The flagship models from American companies remain English-first systems with Arabic capability added on. They can translate and respond in Arabic, but they reason through conceptual frameworks developed in English and learned predominantly from English-language data. When these models process Arabic queries, they’re performing sophisticated translation rather than native comprehension.

Compute alone doesn’t produce AI systems that understand the Arab world on its own terms. A Saudi researcher querying an American model about regional dynamics receives answers filtered through frameworks that reflect how Silicon Valley conceptualizes the world, not how Riyadh does. The same goes for a UAE official querying a Chinese model. If the roots of Arab AI are non-Arab (assuming they are built and trained on Chinese or American models), when does an AI model become distinctively “Arab” in thought, reasoning, and language?

The Gulf’s Arab LLM vision

The Gulf states are taking on the challenge. Both Saudi Arabia and the UAE have launched serious indigenous Arabic AI initiatives.

Saudi Arabia’s SDAIA developed ALLaM, training it on a 500-billion-token Arabic dataset—the world’s largest—assembled by mobilizing 16 government entities. Over 400 subject matter experts tested the model through more than a million prompts. The resulting system, now deployed through HUMAIN Chat, explicitly encodes Islamic values and regional cultural context. Saudi officials describe it as “sovereign AI”—built in the Kingdom, by Saudi talent, for Arabic speakers.

The UAE’s Jais, developed by G42’s Inception unit in collaboration with the Mohamed bin Zayed University of Artificial Intelligence, takes a different approach. The latest Jais 2 release, built with 70 billion parameters and trained on the largest Arabic-first dataset, covers a wide range of Arabic dialects and aims to accurately reflect and respond appropriately to the cultural norms, values, and references of Arabic-speaking communities. The developers have gathered conversational datasets across regional dialects, enabling the model to respond in Lebanese Arabic to Lebanese users and in Gulf Arabic to Gulf users, something other global models cannot yet do. Why is that important? If you ask ChatGPT to translate a text passage into Syrian dialect, it can do a surprisingly good job. But it is still translating from English into Syrian Arabic through a complex algorithm. Imagine the difference if Arab AI had a strong Arabic ontological and context layer that could think and reason in Arabic.

There are structural challenges. Researchers from the Association for Computing Machinery (ACM) note that “...limited regional collaboration and infrastructure across Arabic-speaking countries continue to hinder large-scale development. While resource-rich nations have invested in AI research, the absence of a cohesive research network and weak industry-academia integration prevent widespread progress.”

As they note, resource-rich nations are investing in AI, but are they investing in Arab AI? This may be one of the important turning points in enhancing AI diffusion for Arabic-speaking users in the Arab world.

China’s growing role

While the U.S. has been particularly restrictive (at least until the Trump administration) on the hardware side, Chinese AI firms have taken a markedly different approach. Instead of just focusing on hardware (where they are losing to the United States), Chinese firms are engaging directly with Arabic linguistic and cultural challenges through sustained research collaboration with key Arab AI institutions.

The most striking example is AceGPT, a large language model built through partnership between Saudi Arabia’s King Abdullah University of Science and Technology (KAUST) and two Chinese institutions: the Chinese University of Hong Kong, Shenzhen, and the Shenzhen Research Institute of Big Data. The project was launched by a Chinese-American professor at KAUST and built on Meta’s open-source Llama 2 architecture. Unlike models adapted for Arabic after the fact, AceGPT was designed from inception with Arabic linguistic and cultural alignment as primary objectives—what the developers describe as training for “cultural sensitivity and alignment with local values.”

The results have been significant. According to the Shenzhen Research Institute, by the end of 2024, AceGPT offered models across multiple sizes (7B, 13B, 32B, and 70B parameters), and “significantly outperforms competitors—specifically the Jais model developed in the UAE—making it the world’s leading open-source Arabic large language model.” The trilingual design encompassing Arabic, Chinese, and English is reflective of the research team’s composition, but also a strategic calculation that Arabic AI development might advance faster through Chinese collaboration than through adaptation of English-first American models.

Huawei has pursued a parallel track through infrastructure. In May 2024, Huawei Cloud revealed a 100-billion-parameter Arabic large language model based on PanGu, originally trained on Chinese language data and then adapted for Arabic. The model, launched through Huawei’s new Cairo data center, positions Chinese AI infrastructure at the entry point for Arabic speakers across North Africa and the broader Middle East, especially in regions where American tech presence remains thinner and Chinese infrastructure investments are more substantial.

The financial ties run deeper still. Saudi Aramco’s venture arm, Prosperity7 Ventures, invested $400 million in Zhipu AI, one of China’s leading generative AI companies and a direct open-source competitor to OpenAI. Zhipu AI’s recently released GLM-4.7 model is said to both outcompete ChatGPT and rival Claude’s Sonnet 4.5. The investment valued Zhipu at approximately $3 billion and made Prosperity7 the sole foreign investor in China’s flagship effort to build a domestic open-source AI champion.

China’s engagement moves beyond infrastructure investment and anchors direct partnerships in the linguistic and cultural dimensions that make Arabic AI genuinely difficult. Perhaps this is a move to cement GCC access in a world where Chinese technology still cannot outcompete U.S. hardware, but it offers the Arab world something it desperately wants as AI advances: an Arab AI.

But the collaboration is not benign. What appears on the surface as economic cooperation may involve exposing critical national data to Beijing. One observer wrote regarding Saudi Arabia that these deals involve “sharing Saudi Arabia’s national data with Chinese algorithms. In exchange for transferring these technologies, Beijing gains access to data flowing from Saudi smart cities to its oil industries—data that constitutes 21st century raw gold for China.” This is the type of exposure which tends to provoke pushback by U.S. policymakers who oppose giving the most advanced hardware to Gulf Arab partners.

An Arab Ontology?

Both the great power approaches and the indigenous Gulf efforts share a common gap: they focus on language fluency. A genuinely Arab AI will likely also need an indigenous knowledge architecture. Speaking fluent Arabic (in all of its dialects) is necessary but insufficient for an AI model to genuinely “understand the Arab world”. One challenge is that “understanding the Arab world” is a data question. The model has to be trained on Arabic data, which brings with it all the complexities of the region, including competing theories of governance, religion, history, and society, but also poetry, worldviews, and more. That is hard to capture when training an AI model. Meanwhile, “understanding the Arab world” also requires structured ontologies that define which entities, data, and sources matter, how these relate to each other, and how regional dynamics actually unfold.

What does ontology mean in the case of Arab AI? Let’s use a couple of case studies:

If you ask: Tell me about PIF

A language model trained on Arabic text knows that “PIF” refers to Saudi Arabia’s Public Investment Fund. But an ontology would encode that PIF is distinct from the Saudi state, that it operates through dozens of subsidiary vehicles, that its investment decisions reflect both commercial logic and strategic priorities set by the Crown Prince, and that its portfolio companies maintain relationships with specific ministries, regional governments, and international partners. The ontology captures not just what PIF is, but how it functions within a web of entities—sovereign wealth funds, ruling family offices, state-owned enterprises, private conglomerates, tribal networks—that collectively constitute Gulf political economy.
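One way to picture this is as a small graph of entities and typed relations. The sketch below is hypothetical: the relation labels and the placeholder entities are mine, chosen to mirror the description above rather than to represent real data or any existing ontology.

```python
from collections import defaultdict

# Hypothetical schema: entity names, relation labels, and the triples below
# are illustrative placeholders based on the description in the text.
TRIPLES = [
    ("PIF", "is_a", "sovereign wealth fund"),
    ("PIF", "distinct_from", "Saudi state"),
    ("PIF", "operates_through", "subsidiary vehicle (placeholder)"),
    ("PIF", "priorities_set_by", "Crown Prince"),
    ("portfolio company (placeholder)", "owned_by", "PIF"),
    ("portfolio company (placeholder)", "works_with", "ministry (placeholder)"),
]


def build_graph(triples):
    """Index triples in both directions so relations can be followed either way."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
        graph[obj].append((f"inverse:{relation}", subject))
    return graph


def neighborhood(graph, entity, depth=2):
    """Collect every entity reachable from `entity` within `depth` hops."""
    seen, frontier = {entity}, [entity]
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            for _, other in graph[node]:
                if other not in seen:
                    seen.add(other)
                    next_frontier.append(other)
        frontier = next_frontier
    return seen - {entity}


graph = build_graph(TRIPLES)
print(sorted(neighborhood(graph, "PIF", depth=1)))  # direct relations only
print(sorted(neighborhood(graph, "PIF", depth=2)))  # also reaches the ministry
```

The point of the second hop is the one made above: the ministry never appears in a definition of PIF, but it shows up once the model can follow relationships between entities instead of looking up a name.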

If you ask: Tell me about Fairuz

A language model trained on Arabic text is likely to know that Fairuz is a legendary Lebanese singer. But an ontology would enable the AI model to understand that Fairuz emerged from the Rahbani brothers’ musical compositions, which blended Lebanese folk traditions with Western orchestration to create something deliberately pan-Arab in its aspirations. A strong Arab ontology would connect this historical foundation to the region-wide love and devotion her songs inspired. This includes works that became anthems of Lebanese national identity and later became background music for nostalgic Instagram posts shared thousands of times in English and Arabic. It would trace how her influence threads through contemporary music, from Mashrou’ Leila’s indie rock in Beirut to Cairokee’s post-revolution anthems in Egypt and Omar Souleyman’s Syrian dabke-electronic fusion, which found audiences in Berlin, the UK, and Brooklyn. The ontology captures not just who Fairuz is, but how she functions within a web of cultural production: composers, lyricists, national broadcasting elites who decided what millions would hear (think of Egyptian radio’s dominance through the 1970s), and the generational memory that made her voice synonymous with morning rituals across the Arab world.

Data and sources require their own ontological mapping. AI models need to know which information streams are authoritative for which questions.

Official Saudi press releases signal government positions but obscure internal debates.
Emirati business publications capture commercial activity but underreport political dynamics.
Arabic-language social media reflects popular sentiment but amplifies certain voices over others.
Lebanese financial journalism offers regional perspective but carries its own biases.

A structured ontology maps these sources to their domains of reliability, their known blind spots, and their interlinking relationships to power centers, which enables the AI model to weight and triangulate information rather than treating all inputs equivalently.
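A rough sketch of what weighting and triangulating could look like follows. Every source label, domain name, and weight here is invented for illustration; a real system would need these values to be researched and argued over, not typed in.

```python
# All source labels, domains, and weights are invented for illustration.
RELIABILITY = {
    "official_press_release": {"government_position": 0.9, "internal_debate": 0.1},
    "business_publication": {"commercial_activity": 0.8, "political_dynamics": 0.3},
    "social_media": {"popular_sentiment": 0.7, "government_position": 0.2},
}
DEFAULT_WEIGHT = 0.5  # assumed fallback when a source has no rating for a domain


def weighted_support(claims, domain):
    """Score in [0, 1] for how strongly the sources support a statement.

    `claims` is a list of (source_type, supports) pairs, where `supports`
    is True if that source backs the statement and False if it disputes it.
    Each source counts in proportion to its reliability for `domain`.
    """
    total = 0.0
    agree = 0.0
    for source_type, supports in claims:
        weight = RELIABILITY.get(source_type, {}).get(domain, DEFAULT_WEIGHT)
        total += weight
        if supports:
            agree += weight
    return agree / total if total else 0.0


claims = [("official_press_release", True), ("social_media", False)]
# Same two sources, different question, different answer.
print(round(weighted_support(claims, "government_position"), 3))
print(round(weighted_support(claims, "popular_sentiment"), 3))
```

The same pair of sources yields a high score when the question is about the government's position and a low one when it is about popular sentiment, which is the difference between weighting sources by domain and treating all inputs equivalently.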

If you ask: What is an Arab

This is the most fundamental challenge: how does an AI model define what “Arab” actually is and what narrative of “Arab” is the correct one to portray? The Arab world is not a monolith, and any ontology must grapple with competing frameworks for understanding what binds—or divides—the region.

An ontology would have to reason through fundamental identity questions:

Is “Arab” primarily a linguistic category, encompassing everyone from Moroccan Berbers who speak Darija to Lebanese Christians who code-switch between Arabic, French, and English?
Is “Arab” a political identity forged through twentieth-century nationalism, with its centers in Cairo, Damascus, and Baghdad?
Is it a cultural sphere defined by shared literary traditions, musical forms, and social customs—or a religious civilization where Islam provides the organizing grammar?

An Arab AI model trained on Arabic text absorbs all these competing frameworks without distinguishing between them. An ontology forces a choice, or at minimum makes the competing frameworks explicit so that users understand which version of “Arab” the system is reasoning through.

Building such ontologies for the Arab world will be a large-scale undertaking. Even if models can be trained to speak Arabic fluently, without a strong ontology they are no better at understanding the Arab world than I was on my first trip to Jordan, as a 17-year-old with no Arabic going to work on an archaeological dig. Broader diffusion of AI in the Arab world therefore requires both strong Arabic language fluency and the right way of thinking and reasoning in Arabic about the Arab world.

What comes next?

There is clear interest and demand from the Arab world for an “Arab AI”. The U.S. is overly focused on hardware deployment and cooperation through infrastructure, but seems less interested in the region’s Arabic LLM pursuits. Meanwhile, China has used Arabic language processing needs as an entry point for deepening AI partnerships. This, however, does not necessarily give the Gulf Arab states the technology they need to build and sustain the compute capacity necessary to scale their AI ecosystems (Arabic LLMs are not the only objective). One commentator writing in The National argues that Arabic AI represents “a huge gap and a major opportunity: whoever builds the best models for Arabic will gain a strategic data advantage in a massive underserved market.” Both great powers recognize this. So do the Gulf states themselves.

Several dynamics are important to watch:

First, the AI entity that achieves superior Arabic language capability will gain significant advantages in regional AI adoption, whether through indigenous development, American adaptation, or Chinese collaboration. This capability seems to be genuinely contested, and no actor has established clear dominance.

Second, language capability and knowledge architecture are distinct challenges requiring different investments. The Gulf states have focused primarily on infrastructure investments and hardware acquisition. They have also advanced Arab AI models through several regional initiatives, but there is still much work to be done. AI hardware and chips will create compute capacity, but the development of a stronger Arab ontology and knowledge architecture will be critical for building an AI model for the Arab world.

Third, the Gulf’s indigenous models represent something genuinely new in the space: AI systems designed explicitly to encode regional values and cultural context. This is not too far off from what Chinese authorities seek to create, though in their case within the context of reinforcing CCP governance and Chinese values. This raises questions about how designers pick and choose which values to keep or cast out. This is the ethical layer, which requires concurrent discussion as Arab AI is being developed. Whether HUMAIN Chat’s “Islamic values” alignment or Jais’s dialect coverage translates into competitive advantage remains to be seen, but the attempt itself signals that the region has deeper ambitions to build an AI model for broader diffusion. While the Arab world is majority Muslim, it is not entirely Muslim, and indigenous models will need to account for that diversity.

Fourth, the Gulf Arab states still prefer American platforms despite their Arabic limitations. OpenAI’s ChatGPT holds roughly 90% market share in Saudi Arabia and the UAE. Naturally, users are defaulting to the most capable general-purpose system even when it handles their language imperfectly. This creates a window for Arabic-native alternatives but also demonstrates how quickly that window could close.

In the end, I think discourse on indigenous Arab AI is important because when (not if) Arab AI is developed, its owners and operators will have the potential to shape how nearly half a billion people experience artificial intelligence, and how AI models come to represent a deeply complex, historic, yet beautiful region.

Kai-Fu Lee, author of AI Superpowers and a leading thinker on AI in China, summarized the position well in his op-ed for Arab News:

“It may take time for countries to figure out their strategy for building a sovereign AI. But it is critical for the Arab world to quickly catalyze the creation of culturally appropriate LLMs and build a rich ecosystem to allow AI-powered Arabic apps to blossom.”

Coffee in the Desert
