3 questions for... our Data Scientist Jan

1. How are language models defined in the context of natural language processing and what are the differences between large and small language models?

Large neural language models have been on everyone's lips at least since the release of ChatGPT. In fact, since the pioneering days of computer science, linguists have been concerned with the question of whether language can be viewed as a probability distribution over the set of all words, and whether this distribution can be learned from large collections of texts. The historical roots of language models can be traced back to Claude Shannon's article "A Mathematical Theory of Communication", which founded the field of information theory in 1948. The simplest language model, the n-gram model, expresses the probability of a word as a conditional probability given the previous n-1 words. Such a model can already learn that "the sun" is more likely to be followed by "shines" than by "blows". However, it is quickly overwhelmed by the task of generating longer, coherent texts. Modern language models replace this simple statistical model of conditional probabilities with a large neural network, often with billions of parameters. The task, however, remains the same: predict the next word in a sequence of words. Large language models are successfully applied to problems such as sentiment analysis, automatic summarization, and machine translation.

The difference between "small" and "large" language models is not clearly defined. One possibility would be to draw the line at models the size of GPT-3. First, these models are so large that they currently cannot be run on a single workstation and instead require the processing power of a small to medium-sized data center. Second, GPT-3 demonstrated that, above a certain size, language models exhibit emergent behavior: they can correctly solve problems "on demand" without having been specifically trained to do so beforehand.
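The n-gram idea described above can be made concrete with a few lines of Python. The following is only a minimal sketch: the function names and the four-sentence toy corpus are invented for illustration, and the probabilities are plain relative frequencies without any smoothing.

```python
from collections import Counter, defaultdict


def train_ngram_model(sentences, n):
    """Count which word follows each context of n-1 preceding words."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(len(words) - n + 1):
            context = tuple(words[i:i + n - 1])
            counts[context][words[i + n - 1]] += 1
    return counts


def next_word_probability(counts, context, candidate):
    """Estimate P(candidate | context) as a relative frequency."""
    followers = counts.get(tuple(context), Counter())
    total = sum(followers.values())
    if total == 0:
        return 0.0  # this context never occurred in the training text
    return followers[candidate] / total


# Toy corpus, made up for this example only.
corpus = [
    "the sun shines today",
    "the sun shines brightly",
    "the sun sets late",
    "the wind blows hard",
]

model = train_ngram_model(corpus, n=3)  # trigram: two words of context
p_shines = next_word_probability(model, ["the", "sun"], "shines")
p_blows = next_word_probability(model, ["the", "sun"], "blows")
print(f"P(shines | the sun) = {p_shines:.2f}")
print(f"P(blows  | the sun) = {p_blows:.2f}")
```

On this toy corpus, "shines" receives a higher probability than "blows" after "the sun". The sketch also hints at the weakness mentioned above: any context that never occurred in the training text gets probability zero, and the model never sees further back than n-1 words.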

2. What are the limitations of using large language models and how can they be circumvented?

A short conversation with ChatGPT is enough to recognize the limits of such a model: for a language model, it is only a small step from fact to fiction; what matters is that the text produced looks plausible. If solving a calculation or logic task requires several steps, the model often produces obviously wrong solutions. In addition to facts, the model also learns common stereotypes and prejudices and may pass them on unfiltered. Reviewing the generated output is therefore critical to ensure that it is factually accurate and ethically sound.

At second glance, further problems come to light: to limit the required computing power, the "attention span" of a language model is also limited, currently to roughly the amount of text on one printed page. With each new model, the maximum length of text that can be processed grows, but even GPT-4 cannot digest Tolstoy's "War and Peace" as a whole. And would you want to pay for that? A price of 0.02 US dollars per 1000 tokens (GPT-3) sounds cheap at first. But if you take into account that a typical German word consists of three to four tokens, you pay OpenAI roughly 20 dollars to have GPT-3 read Tolstoy's classic.

At least "War and Peace" is now free of copyright and, as a literary work, does not contain any content relevant to data protection law. With corporate data, the situation is different. Less is therefore often more: a small model operated "on premises" can be superior to the larger cloud solution for the practical reasons mentioned. However, without the emergent problem-solving capabilities of a large language model, the model has to be trained for the problem to be solved. Compared to classic statistical models, a smaller language model also gets by with surprisingly few examples, but the success of the training depends to a large extent on the experience of the responsible data scientist.
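The cost estimate above is simple arithmetic and can be reproduced as a back-of-the-envelope calculation. In the sketch below, the price per 1000 tokens and the tokens-per-word range come from the text; the word count of 300,000 is a placeholder assumption chosen for illustration, not a measured figure for the novel, and the function name is hypothetical.

```python
def estimate_cost_usd(word_count, tokens_per_word, usd_per_1000_tokens):
    """Rough API cost for processing a text once."""
    tokens = word_count * tokens_per_word
    return tokens / 1000 * usd_per_1000_tokens


USD_PER_1000_TOKENS = 0.02   # GPT-3 price quoted in the text
TOKENS_PER_WORD = 3.5        # midpoint of "three to four tokens" per German word
ASSUMED_WORD_COUNT = 300_000  # assumption for illustration only

cost = estimate_cost_usd(ASSUMED_WORD_COUNT, TOKENS_PER_WORD, USD_PER_1000_TOKENS)
print(f"Estimated cost: {cost:.2f} USD")
```

Under these assumptions the estimate comes out at about 21 US dollars. The result scales linearly with every input, so a longer text, a less favorable tokenization, or a different price changes the bill proportionally.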

3. How can the accuracy and relevance of a large language model be improved for specific domains and applications?

At the time of the first large language models, retraining with problem-specific examples ("fine-tuning") was the only way to specifically improve the performance of a language model. In recent years, numerous promising new approaches have been presented to make language models more powerful and flexible. The adapter framework allows learned skills to be separated from the underlying language model and, within certain limits, to be made interchangeable and combinable. Models such as the "Toolformer" and the conceptually related plugin system from OpenAI open up access to external sources of information. Dialogue models such as ChatGPT have shown that training on question-answer pairs, combined with targeted human feedback, can induce a language model to preferentially generate helpful responses to queries.
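The idea of giving a language model access to external tools, mentioned above, can be illustrated with a small sketch. It assumes that the model has learned to emit a marker such as "[Calculator(17 * 24)]" in its output and that a thin wrapper replaces each marker with the tool's result. The marker syntax, the function names, and the example sentence are hypothetical; the sketch is only loosely inspired by the inline calls of the Toolformer approach and is neither its actual implementation nor the OpenAI plugin interface.

```python
import operator
import re

# Supported arithmetic operations of the toy calculator tool.
OPERATORS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}

# Hypothetical marker format: [Calculator(<number> <operator> <number>)]
CALL_PATTERN = re.compile(
    r"\[Calculator\(\s*(\d+(?:\.\d+)?)\s*([-+*/])\s*(\d+(?:\.\d+)?)\s*\)\]"
)


def calculator(left, symbol, right):
    """External tool: evaluate one binary arithmetic expression."""
    return OPERATORS[symbol](float(left), float(right))


def resolve_tool_calls(text):
    """Replace every calculator marker in the text with its result."""
    def replace(match):
        return f"{calculator(*match.groups()):g}"
    return CALL_PATTERN.sub(replace, text)


# Stand-in for text generated by a language model.
generated = "The invoice total is [Calculator(17 * 24)] euros."
print(resolve_tool_calls(generated))
```

The point of the design is the division of labor: the language model only has to decide where a calculation is needed and write down the call, while the arithmetic itself is done by ordinary, reliable code.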