Understanding Foundation Models

Training Data

How training data quality, language coverage, and domain coverage shape foundation model capability, cost, and reliability.

An AI model is only as good as the data it was trained on. Training data determines what a model can do, where it struggles, and which applications it can support reliably.

Data Sets the Boundary

If there's no Vietnamese in the training data, the model won't be able to translate from English into Vietnamese. Similarly, if an image classification model sees only animals in its training set, it won't perform well on photos of plants.

More Task Data Can Help

If you want a model to improve on a certain task, you might want to include more data for that task in the training data.

Collection Is Expensive

Collecting sufficient data for training a large model isn't easy, and it can be expensive.

Available Data Shapes Models

Model developers often have to rely on available data, even if this data doesn't exactly meet their needs.

Common Crawl and C4

For example, a common source for training data is Common Crawl, created by a nonprofit organization that sporadically crawls websites on the internet. In 2022 and 2023, this organization crawled approximately 2-3 billion web pages each month. Google provides a clean subset of Common Crawl called the Colossal Clean Crawled Corpus, or C4 for short.

The data quality of Common Crawl, and C4 to a certain extent, is questionable -- think clickbait, misinformation, propaganda, conspiracy theories, racism, misogyny, and every sketchy website you've ever seen or avoided on the internet.

A study by the Washington Post shows that the 1,000 most common websites in the dataset include several media outlets that rank low on NewsGuard's scale for trustworthiness. In lay terms, Common Crawl contains plenty of fake news.

Yet, simply because Common Crawl is available, variations of it are used in most foundation models that disclose their training data sources, including OpenAI's GPT-3 and Google's Gemini. I suspect that Common Crawl is also used in models that don't disclose their training data. To avoid scrutiny from both the public and competitors, many companies have stopped disclosing this information.

Some teams use heuristics to filter out low-quality data from the internet. For example, OpenAI used only the Reddit links that received at least three upvotes to train GPT-2. While this does help screen out links that nobody cares about, Reddit isn't exactly the pinnacle of propriety and good taste.
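A minimal sketch of such an upvote-threshold heuristic, in Python. The `Link` record, the sample URLs, and the `filter_by_upvotes` helper are hypothetical illustrations, not OpenAI's actual pipeline; only the threshold of three upvotes comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    url: str      # outbound link shared in a post (hypothetical example data)
    upvotes: int  # community score of the post that shared it


def filter_by_upvotes(links, min_upvotes=3):
    """Keep only links whose post received at least `min_upvotes` upvotes."""
    return [link for link in links if link.upvotes >= min_upvotes]


# Hypothetical sample; a real pipeline would read these from a crawl dump.
links = [
    Link("https://example.com/in-depth-tutorial", 57),
    Link("https://example.com/spam-page", 0),
    Link("https://example.com/borderline", 3),
]

kept = filter_by_upvotes(links)
print([link.url for link in kept])
```

A filter like this is cheap to apply, but it only measures popularity. It says nothing about whether the linked page is accurate or well written, which is the limitation noted above.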

Curating for the Work You Need

The "use what we have, not what we want" approach may lead to models that perform well on tasks present in the training data but not necessarily on the tasks you care about.

To address this issue, it's crucial to curate datasets that align with your specific needs.

This section focuses on curating data for specific languages and domains, providing a broad yet specialized foundation for applications within those areas. Chapter 8 explores data strategies for models tailored to highly specific tasks.

While language- and domain-specific foundation models can be trained from scratch, it's also common to create them by finetuning general-purpose models.

The impact of data quality is discussed more in Chapter 8.

Multilingual Models

English dominates the internet. An analysis of the Common Crawl dataset shows that English accounts for almost half of the data (45.88%), making it nearly eight times more prevalent than the second-most common language, Russian (5.97%) (Lai et al., 2023). See Table 2-1 for a list of languages that each make up at least 1% of Common Crawl. Languages with limited availability as training data, typically those not included in this list, are considered low-resource.

Table 2-1. The most common languages in Common Crawl, a popular dataset for training LLMs. Source: Lai et al. (2023).

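| Language   | Code | Pop. (M) | CC size (%) | Cat. |
|------------|------|----------|-------------|------|
| English    | en   | 1,452    | 45.8786     | H    |
| Russian    | ru   | 258      | 5.9692      | H    |
| German     | de   | 134      | 5.8811      | H    |
| Chinese    | zh   | 1,118    | 4.8747      | H    |
| Japanese   | jp   | 125      | 4.7884      | H    |
| French     | fr   | 274      | 4.7254      | H    |
| Spanish    | es   | 548      | 4.4690      | H    |
| Italian    | it   | 68       | 2.5712      | H    |
| Dutch      | nl   | 30       | 2.0585      | H    |
| Polish     | pl   | 45       | 1.6636      | H    |
| Portuguese | pt   | 257      | 1.1505      | H    |
| Vietnamese | vi   | 85       | 1.0299      | H    |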

Many other languages, despite having many speakers today, are severely under-represented in Common Crawl. Table 2-2 shows some of these languages. Ideally, the ratio between a language's share of the world population and its share of Common Crawl should be 1. The higher this ratio, the more under-represented the language is in Common Crawl.

Table 2-2. Examples of under-represented languages in Common Crawl. The last row, English, is for comparison. The numbers for % in Common Crawl are taken from Lai et al. (2023).

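| Language | Speakers (million) | % world population [1] | % in Common Crawl | World: Common Crawl ratio |
|----------|--------------------|------------------------|-------------------|---------------------------|
| Punjabi  | 113                | 1.41%                  | 0.0061%           | 231.56                    |
| Swahili  | 71                 | 0.89%                  | 0.0077%           | 115.26                    |
| Urdu     | 231                | 2.89%                  | 0.0274%           | 105.38                    |
| Kannada  | 64                 | 0.80%                  | 0.0122%           | 65.57                     |
| Telugu   | 95                 | 1.19%                  | 0.0183%           | 64.89                     |
| Gujarati | 62                 | 0.78%                  | 0.0126%           | 61.51                     |
| Marathi  | 99                 | 1.24%                  | 0.0213%           | 58.10                     |
| Bengali  | 272                | 3.40%                  | 0.0930%           | 36.56                     |
| English  | 1,452              | 18.15%                 | 45.88%            | 0.40                      |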
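The ratio column of Table 2-2 can be reproduced with a few lines of Python. This is a sketch under the assumption stated in the footnotes (a world population of eight billion); the function and variable names are illustrative, and the speaker counts and Common Crawl percentages are copied from the table.

```python
WORLD_POPULATION_MILLIONS = 8_000  # eight billion, per the table's footnote

# language -> (speakers in millions, % of Common Crawl), from Table 2-2
LANGUAGES = {
    "Punjabi": (113, 0.0061),
    "Swahili": (71, 0.0077),
    "Urdu": (231, 0.0274),
    "Kannada": (64, 0.0122),
    "Telugu": (95, 0.0183),
    "Gujarati": (62, 0.0126),
    "Marathi": (99, 0.0213),
    "Bengali": (272, 0.0930),
    "English": (1452, 45.88),
}


def underrepresentation_ratio(speakers_millions, cc_percent):
    """Share of world population divided by share of Common Crawl."""
    world_percent = 100 * speakers_millions / WORLD_POPULATION_MILLIONS
    return world_percent / cc_percent


ratios = {
    name: underrepresentation_ratio(speakers, cc_percent)
    for name, (speakers, cc_percent) in LANGUAGES.items()
}

# Print the most under-represented languages first.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<9}{ratio:>8.2f}")
```

A ratio above 1 means the language has a larger share of speakers than of web text. English is the only row below 1, meaning it is over-represented relative to its share of speakers.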

What Underrepresentation Does

Given the dominance of English in internet data, it's not surprising that general-purpose models work much better for English than for other languages, according to multiple studies.

MMLU

On the MMLU benchmark, a suite of 14,000 multiple-choice problems spanning 57 subjects, GPT-4 performed much better in English than in under-represented languages like Telugu, as shown in Figure 2-1 (OpenAI, 2023).

Project Euler

Testing GPT-4 on six math problems from Project Euler, Yennie Jun found that it solved problems in English more than three times as often as in Armenian or Farsi. [2]

Figure 2-1. On the MMLU benchmark, GPT-4 performs better in English than in any other language. To obtain MMLU in other languages, OpenAI translated the questions using Azure AI Translator.

GPT-4 failed on all six questions in Burmese and Amharic, as shown in Figure 2-2.

Figure 2-2. GPT-4 is much better at math in English than in other languages.

Under-representation is a big reason for this underperformance. The three languages that have the worst performance on GPT-4's MMLU benchmarks -- Telugu, Marathi, and Punjabi -- are also among the languages that are most under-represented in Common Crawl.

However, under-representation isn't the only reason. A language's structure and the culture it embodies can also make a language harder for a model to learn.

Why Translation Is Not Enough

Given that LLMs are generally good at translation, can we just translate all queries from other languages into English, obtain the responses, and translate them back into the original language? Many people indeed follow this approach, but it's not ideal.

Translation Requires Understanding

First, this approach requires a model that understands under-represented languages well enough to translate them.

Translation Can Lose Information

Some languages, like Vietnamese, have pronouns to denote the relationship between the two speakers. When translating into English, all these pronouns are translated into I and you, causing the loss of the relationship information.

Models can also have unexpected performance challenges in non-English languages. For example, NewsGuard found that ChatGPT is more willing to produce misinformation in Chinese than in English.

In April 2023, NewsGuard asked ChatGPT-3.5 to produce misinformation articles about China in English, simplified Chinese, and traditional Chinese. For English, ChatGPT declined to produce false claims for six out of seven prompts. However, it produced false claims in simplified Chinese and traditional Chinese all seven times. It's unclear what causes this difference in behavior. [3]

Tokenization, Latency, and Cost

Beyond quality issues, models can also be slower and more expensive for non-English languages. A model's inference latency and cost are proportional to the number of tokens in the input and response. It turns out that tokenization can be much more efficient for some languages than others.

Benchmarking GPT-4 on MASSIVE, a dataset of one million short texts translated across 52 languages, Yennie Jun found that, to convey the same meaning, languages like Burmese and Hindi require a lot more tokens than English or Spanish.

English

For the MASSIVE dataset, the median token length in English is 7.

Hindi

The median token length in Hindi is 32.

Burmese

The median token length in Burmese is 72, roughly ten times that of English.

Assuming that the time it takes to generate a token is the same in all languages, GPT-4 takes approximately ten times longer in Burmese than in English for the same content. For APIs that charge by token usage, Burmese costs roughly ten times as much as English.
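The arithmetic behind this comparison can be sketched in Python. The median token lengths come from the text above; the helper names and the per-token price are hypothetical placeholders, not real API pricing.

```python
# Median tokens per MASSIVE text, as reported above.
MEDIAN_TOKENS = {"English": 7, "Hindi": 32, "Burmese": 72}


def relative_cost(language, baseline="English"):
    """How many times more tokens `language` needs than `baseline`.

    If latency and price scale linearly with token count, this is also
    the latency and cost multiplier.
    """
    return MEDIAN_TOKENS[language] / MEDIAN_TOKENS[baseline]


def estimated_cost(language, num_texts, price_per_1k_tokens):
    """Estimated cost of processing `num_texts` median-length texts.

    `price_per_1k_tokens` is a hypothetical price for illustration.
    """
    total_tokens = MEDIAN_TOKENS[language] * num_texts
    return total_tokens / 1000 * price_per_1k_tokens


for language in MEDIAN_TOKENS:
    multiplier = relative_cost(language)
    print(f"{language}: {multiplier:.1f}x the tokens of English")
```

With these medians, Hindi needs about 4.6 times as many tokens as English and Burmese about 10.3 times as many, which is where the "ten times" figure comes from.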

Language-Specific Models

To address this, many models have been trained to focus on non-English languages. The most active language, other than English, is undoubtedly Chinese, with ChatGLM, YAYI, Llama-Chinese, and others.

There are also models in French (CroissantLLM), Vietnamese (PhoGPT), Arabic (Jais), and many more languages.

Domain-Specific Models

General-purpose models like Gemini, GPTs, and Llamas can perform incredibly well on a wide range of domains, including but not limited to coding, law, science, business, sports, and environmental science. This is largely thanks to the inclusion of these domains in their training data.

Figure 2-3 shows the distribution of domains present in C4, the cleaned subset of Common Crawl, according to the Washington Post's 2023 analysis. [4]

Figure 2-3. Distribution of domains in the C4 dataset. Reproduced from the statistics from the Washington Post. One caveat of this analysis is that it only shows the categories that are included, not the categories missing.

As of this writing, there haven't been many analyses of domain distribution in vision data. This might be because images are harder to categorize than texts. [5] However, you can infer a model's domains from its benchmark performance.

Table 2-3 shows how two models, CLIP and Open CLIP, perform on different benchmarks. These benchmarks show how well these two models do on birds, flowers, cars, and a few more categories, but the world is so much bigger and more complex than these few categories.

Table 2-3. Open CLIP and CLIP's performance on different image datasets.

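| Dataset                                  | CLIP: accuracy of ViT-B/32 (OpenAI) | Open CLIP: accuracy of ViT-B/32 (Cade) |
|------------------------------------------|-------------------------------------|----------------------------------------|
| ImageNet                                 | 63.2                                | 62.9                                   |
| ImageNet v2                              | -                                   | 62.6                                   |
| Birdsnap                                 | 37.8                                | 46.0                                   |
| Country211                               | 17.8                                | 14.8                                   |
| Oxford 102 Category Flower               | 66.7                                | 66.0                                   |
| German Traffic Sign Recognition Benchmark | 32.2                               | 42.0                                   |
| Stanford Cars                            | 59.4                                | 79.3                                   |
| UCF101                                   | 64.5                                | 63.1                                   |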

When General-Purpose Data Is Not Enough

Even though general-purpose foundation models can answer everyday questions about different domains, they are unlikely to perform well on domain-specific tasks, especially if they never saw these tasks during training.

Drug Discovery

Drug discovery involves protein, DNA, and RNA data, which follow specific formats and are expensive to acquire. This data is unlikely to be found in publicly available internet data.

Cancer Screening

Cancer screening typically involves X-ray and fMRI (functional magnetic resonance imaging) scans, which are hard to obtain due to privacy concerns.

To train a model to perform well on these domain-specific tasks, you might need to curate very specific datasets.

AlphaFold

One of the most famous domain-specific models is perhaps DeepMind's AlphaFold, trained on the sequences and 3D structures of around 100,000 known proteins.

BioNeMo

NVIDIA's BioNeMo focuses on biomolecular data for drug discovery.

Med-PaLM2

Google's Med-PaLM2 combined the power of an LLM with medical data to answer medical queries with higher accuracy.

Domain-specific models are especially common for biomedicine, but other fields can benefit from domain-specific models too. It's possible that a model trained on architectural sketches can help architects much better than Stable Diffusion, or a model trained on factory plans can be optimized for manufacturing processes much better than a generic model like ChatGPT.

This section gave a high-level overview of how training data impacts a model's performance. Next, let's explore how a model's design affects its performance.

Footnotes

  1. A world population of eight billion was used for this calculation.
  2. "GPT-4 Can Solve Math Problems--but Not in All Languages" by Yennie Jun. You can verify the study using OpenAI's Tokenizer.
  3. It might be because of some biases in pre-training data or alignment data. Perhaps OpenAI just didn't include as much data in the Chinese language or China-centric narratives to train their models.
  4. "Inside the Secret List of Websites That Make AI like ChatGPT Sound Smart", Washington Post, 2023.
  5. For texts, you can use domain keywords as heuristics, but there are no obvious heuristics for images. Most analyses I could find about vision datasets are about image sizes, resolutions, or video lengths.