Translation of Large Language Models

An Excursion Into How Generative AI Models, such as ChatGPT, Have An English Bias

Written by stephens on 9 Aug 2023

Amazing Times

In the world of artificial intelligence, language models have been making quite a splash. Large language models (LLMs), such as OpenAI's GPT family, have garnered considerable attention in the machine learning and natural language processing communities. Their use is not limited to English, either; they can comprehend and generate text in other languages, including Korean. This blog post highlights the benefits of using a translation-aware app for generative AI prompts, such as Translaite.

Translaite first translates non-English prompts into English (via DeepL), sends the English prompt to OpenAI, and then translates the output back into the input language. This process lets users engage with advanced AI models in their own language, making the models more accessible and user-friendly. But why should one do this?
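To make this concrete, here is a minimal sketch of what such a round trip could look like in Python. It assumes the deepl and openai client libraries, a placeholder API key, a placeholder model name, and a hard-coded example prompt; Translaite's actual implementation is its own, so treat this as an illustration of the idea rather than its code.

```python
# Sketch of a Translaite-style round trip: translate prompt -> ask model -> translate answer back.
# Assumes `pip install deepl openai`, a DeepL API key, and OPENAI_API_KEY set in the environment.
import deepl
from openai import OpenAI

translator = deepl.Translator("DEEPL_API_KEY")  # placeholder key
client = OpenAI()                               # reads OPENAI_API_KEY from the environment

def ask_in_any_language(prompt: str) -> str:
    # 1. Translate the user's prompt into English and remember the detected source language.
    to_en = translator.translate_text(prompt, target_lang="EN-US")
    source_lang = to_en.detected_source_lang    # e.g. "KO" for Korean

    # 2. Query the model with the English prompt, which tokenizes more efficiently.
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",                  # placeholder model name
        messages=[{"role": "user", "content": to_en.text}],
    )
    answer_en = completion.choices[0].message.content

    # 3. Translate the English answer back into the user's language.
    #    (If the prompt was already English, you would skip this step.)
    back = translator.translate_text(answer_en, target_lang=source_lang)
    return back.text

print(ask_in_any_language("한국에서 가장 높은 산은 무엇인가요?"))
```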

Understanding Tokenization for Language Models

Tokenization, the process of splitting input text into pieces, or tokens, is a crucial step in how LLMs work. GPT models can process, say, Japanese text because they use a flexible byte pair encoding (BPE) tokenizer, tiktoken. Tiktoken is OpenAI's open-source tokenizer; it is commonly used to count the number of tokens in a given piece of text, and its vocabulary was learned from training data that is heavily skewed towards English.

The tokenization process involves splitting a piece of text into smaller units, or tokens, which can be individual words, phrases, or even single characters. This process is language-dependent, as different languages have different rules for how words and sentences are formed and structured.
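To get a feel for this, you can inspect the tokens yourself with tiktoken. The snippet below is a small illustration; it assumes tiktoken is installed and uses the cl100k_base encoding (the one behind gpt-3.5-turbo and gpt-4), and the example sentence is arbitrary.

```python
# Inspect how a sentence is split into tokens with OpenAI's tiktoken library.
# Assumes `pip install tiktoken`.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization splits text into pieces."
token_ids = enc.encode(text)

print(len(token_ids))                        # how many tokens the sentence costs
print([enc.decode([t]) for t in token_ids])  # the individual text pieces behind each token
```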

tokenization.png

Since tiktoken's vocabulary is tailored to English, it tokenizes text in other languages far less efficiently. Accented characters, language-specific punctuation, and non-Latin scripts are often broken into several tokens apiece. Treating each word, or part of a word, as a separate token works well for English and other languages that use spaces to separate words. Languages like Japanese or Chinese, which do not use spaces, fare worse: a single character can represent a whole word or concept, and because such characters take more bytes to encode digitally, they are frequently split into multiple tokens.

For instance, the Japanese character for 'dog' takes three tokens to represent in GPT models, compared to just one token for the English word 'dog'. This means that processing Japanese text requires more tokens than processing the equivalent English text (check this great article out for more detail).

Why does this matter? OpenAI charges for the use of its models per token. Therefore, processing non-English languages like Korean can be significantly more expensive than processing English. This unequal tokenization process, favoring English and disadvantaging other languages, contributes to the higher cost of using AI models for non-English languages.
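One way to see this asymmetry is to count tokens for roughly equivalent sentences and multiply by a per-token price. The sketch below does exactly that; the sentences are rough translations of each other and the price is a made-up placeholder, so treat the output as illustrative rather than as a quote of OpenAI's actual rates.

```python
# Compare token counts (and therefore rough cost) for equivalent sentences across languages.
# Assumes `pip install tiktoken`; the per-token price below is a placeholder, not a real rate.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
PRICE_PER_TOKEN = 0.000002  # hypothetical USD per token, for illustration only

samples = {
    "English":  "The dog is sleeping in the garden.",
    "Japanese": "犬は庭で寝ています。",
    "Korean":   "개가 정원에서 자고 있어요.",
}

for language, sentence in samples.items():
    n_tokens = len(enc.encode(sentence))
    print(f"{language:8s} {n_tokens:3d} tokens  ~${n_tokens * PRICE_PER_TOKEN:.6f}")
```

Running this, the non-English sentences typically come out at noticeably more tokens than the English one, which translates directly into higher cost per request.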

The same applies to Korean: it simply takes more tokens to represent Korean text. Translaite bridges this gap by translating non-English prompts into English before they reach the model, ensuring efficient tokenization.

Biased Training Data in AI

GPT-3, like its predecessors, was trained on a vast amount of data, and the language distribution of that training material is a significant concern. English overwhelmingly dominates the dataset, accounting for 92.1% of total characters. The second most common language, French, makes up only 1.78%, followed closely by German at 1.68%. Spanish, Italian, and Portuguese also feature, but each represents less than 1% of the total characters. Even Japanese, a widely spoken language, accounts for just 0.16%.

This disproportionate representation of English in the training data inevitably biases the performance of ChatGPT towards English, potentially affecting its performance in non-English tasks, and it underscores the need for more balanced and diverse training data to ensure equitable performance across languages. The language bias in AI models like GPT-3 is problematic for several reasons:

black_and_white.png

Performance Disparity: The model will perform better in English tasks than in other languages. This means that users who do not speak English as their first language will have a less effective and potentially frustrating experience.

Cultural Bias: Language is closely tied to culture. By primarily training on English-language text, the model may unintentionally perpetuate biases inherent in English-language material and fail to understand or respect cultural nuances present in other languages.

Accessibility and Inclusivity: AI has the potential to be a universal tool, accessible and useful to people regardless of their language or location. However, a bias towards English limits this potential and excludes a significant portion of the global population.

Misinterpretation and Miscommunication: For languages that are underrepresented in the training data, the model may misunderstand or misinterpret text inputs, leading to incorrect or inappropriate responses. This can also lead to miscommunication in critical situations.

Ethical Considerations: From an ethical standpoint, it is important that AI systems are fair and equitable. A system that is biased towards one language over others raises questions about fairness and representation.

Limitation in Global Adoption: For AI to be truly globally adopted and effective, it needs to understand and generate all languages accurately. The current bias may limit its adoption in non-English-speaking regions or applications.

Therefore, it's crucial to work towards more balanced representation in training data, not just in terms of language, but also in terms of the cultural, social, and demographic aspects that language carries with it.

Enhancing Performance

Despite the tokenization bias and training imbalances, GPT models perform well in Korean. They can understand your instructions, answer back in Korean fluently and naturally, and reject inappropriate requests. However, they are slower (and, as discussed above, more expensive) in Korean due to the suboptimal tokenization. Translaite mitigates this by translating prompts into English, thereby enhancing performance.

In conclusion, the use of language models in artificial intelligence has revolutionized the field of machine learning and natural language processing. However, their application in non-English languages has faced challenges due to tokenization biases and training data imbalances. Tokenization, the process of splitting text into smaller units, can be problematic for languages like Korean that have different linguistic structures. This unequal tokenization process leads to higher costs and slower performance for non-English languages compared to English. Additionally, the biased training data, with English dominating the dataset, affects the performance of AI models in non-English tasks and perpetuates cultural biases.

all_inclusive.png

To address these issues, Translaite provides a solution by translating non-English prompts into English, allowing users to engage effectively with advanced AI models in their own language. This approach enhances performance and mitigates tokenization biases, making AI more accessible, inclusive, and equitable for users of all languages. It also highlights the importance of balanced representation in training data, not only in terms of language but also in terms of the cultural and demographic aspects that language carries. By working towards more diverse and representative training data, we can ensure the fair and effective adoption of AI models globally, benefiting users in Korean and beyond.

Curious about how Translaite works? Go ahead and try it out!