In the world of artificial intelligence, language models have been making quite a splash. Large language models (LLMs), such as OpenAI's GPT family, have garnered considerable attention in the machine learning and natural language processing communities. Their use is not limited to English; they can also comprehend and produce output in other languages. However, this comes with some crucial limitations.
Translaite is made with non-English users in mind. It first translates non-English prompts into English, prompts OpenAI, and then translates back the output into the input language. This process allows users to engage with the advanced AI models in their language, making it more accessible and user-friendly.
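The flow above can be sketched in a few lines. This is a minimal illustration, not Translaite's actual implementation: `translate` and `complete` are hypothetical placeholders standing in for a real translation API and the OpenAI API.

```python
def translate(text: str, source: str, target: str) -> str:
    """Placeholder for a real translation service (e.g. DeepL or Google Translate)."""
    # A real implementation would call an external API here.
    return f"[{source}->{target}] {text}"

def complete(prompt: str) -> str:
    """Placeholder for an OpenAI completion call."""
    return f"Answer to: {prompt}"

def translaite(prompt: str, lang: str) -> str:
    # Step 1: translate the user's prompt into English.
    english_prompt = translate(prompt, source=lang, target="en")
    # Step 2: query the model in English, where it performs best.
    english_answer = complete(english_prompt)
    # Step 3: translate the answer back into the user's language.
    return translate(english_answer, source="en", target=lang)

print(translaite("¿Qué es la tokenización?", lang="es"))
```

The key design choice is that the model only ever sees English text, so both the prompt and the completion benefit from the model's strongest language.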
GPT-3, and likely its successors, was trained on a vast amount of data. However, a significant concern is the language distribution in the training material. English overwhelmingly dominates the dataset, accounting for 92.1% of total characters (see Training data). The second most common language, French, makes up only 1.78%, followed closely by German at 1.68%. Spanish, Italian, and Portuguese also feature, but each represents less than 1% of the total characters. Even Japanese, a widely spoken language, accounts for only 0.16% of the total characters. This disproportionate representation of English in the training data inevitably biases the model towards English, potentially hurting its performance on non-English tasks.
Tokenization, the process of splitting input text into pieces or tokens, is a crucial step in how LLMs work. The GPT models can process non-English text because they use a flexible byte-level tokenizer, tiktoken. However, the tokenization process is biased towards English: the tokenizer's vocabulary is learned mostly from English text, so the same content typically fragments into more tokens in other languages.
Tokenization often treats each word or part of a word as a separate token. This works well for English and other languages that use spaces to separate words. However, languages like Japanese or Chinese, which do not use spaces, face challenges. In these languages, a single character can represent a whole word or concept, and these characters often require more bytes to represent in digital form than English words, making usage slower and more expensive.
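The byte-cost difference is easy to see with plain UTF-8 encoding. Byte counts are only a proxy for token counts (the actual mapping is done by the byte-level BPE in tiktoken, which is not used here), but they show why non-English text starts at a disadvantage:

```python
# Roughly equivalent greetings, five characters each.
english = "Hello"
japanese = "こんにちは"  # "hello" in Japanese

# ASCII letters take one byte each in UTF-8;
# hiragana characters take three bytes each.
print(len(english.encode("utf-8")))   # 5 bytes
print(len(japanese.encode("utf-8")))  # 15 bytes
```

A byte-level tokenizer builds its merges from these raw bytes, so a longer byte sequence with fewer learned merges generally ends up as more tokens, which means higher latency and cost per request.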
Despite the training imbalances and tokenization bias, GPT models perform well in languages such as Japanese. They can understand Japanese instructions, answer back in Japanese fluently and naturally, and reject inappropriate requests. However, they are slower and more expensive in Japanese due to the suboptimal tokenization, and weaker owing to the limited training material. Translaite mitigates this by translating non-English prompts into English, and the response back into the input language, thereby improving performance.
Translaite offers significant benefits, especially for non-English users. By prompting the model in English, it keeps tokenization efficient, sidesteps the training-data bias, and improves response quality. Moreover, it makes advanced AI models more accessible and user-friendly, fostering a more inclusive AI environment.