THIS CONTENT IS BROUGHT TO YOU BY University of Oslo

Researchers Vladislav Mikhailov, Andrey Kutuzov, David Samuel, and Erik Velldal are all from the research group for language technology at the University of Oslo's Department of Informatics.

Norwegian answer to ChatGPT is on its way

Several new Norwegian language models have already been launched. These are not yet easy to use for the average person.

“There are many problems associated with the tech giants' language models. They appear as black boxes to the outside world. We need Norwegian alternatives,” says Erik Velldal.

He is a professor at the University of Oslo's Department of Informatics.

Before Christmas, a crucial development sped up work on the Norwegian counterpart to ChatGPT, which is being developed by the Language Technology Group (LTG) at the University of Oslo.

They were granted computing time on Europe's most powerful computer, LUMI in Finland. It is so sought after that researchers have to apply for time to use it. This allowed them to start training large language models on Norwegian data. Within a couple of weeks, enough data was processed for the researchers to launch three Norwegian language models. 

Training requires computing capacity

“Training large language models requires a lot of computing capacity from GPUs, graphics processing units. This training scales well, which means that if you double the number of GPUs involved, the training will finish roughly twice as fast. The advantage of LUMI is that there are a lot of GPUs available, over 10,000,” says Hans A. Eide.

He is a special adviser at Sigma2, a non-profit responsible for the national e-infrastructure for research.

“The National Library of Norway and the University of Oslo have made several Norwegian language models available earlier, but these are the largest we've made so far. They are trained on over 30 billion words,” says Velldal. 

All models have around seven billion parameters. This is something the researchers consider optimal in relation to the amount of Norwegian training data available. 

“A language model becomes poor if it's trained on too little material in relation to its size. It's about finding the right balance,” says Velldal.

Training on the same data in several rounds has proven effective if there is enough processing power. The models the University of Oslo has trained on LUMI have been fed the same training data six times, he explains.
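As a rough illustration of the balance Velldal describes, the figures quoted above can be combined in a few lines of Python. This is a back-of-the-envelope sketch, not the researchers' own calculation. It counts words rather than the tokens a model is actually trained on, it rounds "over 30 billion" and "around seven billion" to exact values, and the function name is made up for this example.

```python
# Back-of-the-envelope sketch using the rounded figures from the article.
# Assumptions: exactly 30 billion words, 7 billion parameters, 6 passes.
# Real training counts tokens, which are not the same as words.


def words_seen_per_parameter(corpus_words: int, passes: int, parameters: int) -> float:
    """Return how many training words the model sees per parameter."""
    return corpus_words * passes / parameters


CORPUS_WORDS = 30_000_000_000  # "over 30 billion words"
PASSES = 6                     # the same data was fed six times
PARAMETERS = 7_000_000_000     # "around seven billion parameters"

total_words_seen = CORPUS_WORDS * PASSES
ratio = words_seen_per_parameter(CORPUS_WORDS, PASSES, PARAMETERS)

print(f"Words seen in total: {total_words_seen:,}")   # 180,000,000,000
print(f"Words seen per parameter: {ratio:.1f}")       # 25.7
```

Under these assumptions, six passes over the data give the model roughly 180 billion word exposures, or about 26 per parameter. A much larger model trained on the same material would see far fewer words per parameter.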


Norway must have technological independence 

The language technology group believes it is important to have Norwegian counterparts to OpenAI's ChatGPT and Google's LaMDA. 

The Norwegian language makes up just 0.1 per cent of the language data that ChatGPT has been trained on. Furthermore, the specifics of the data used for training have not been fully disclosed. This lack of transparency is concerning for several reasons, according to Velldal.

He explains that Microsoft and OpenAI allow Norwegian users to access the model through a web interface. The underlying model is closed.

“In many contexts, it can also be problematic to send data to a commercial third party. If you work with sensitive health data, for example, it's important to be able to control where and how the data is processed. Then it's essential to have access to open and free models that developers can run on their own machines,” he says.

Several large Norwegian public actors have nevertheless jumped on board and bought access to OpenAI's ChatGPT.

“It's important to ensure that open, Norwegian-developed models become available as an alternative. Perhaps especially for the public sector,” Velldal points out.

The Oslo School is among the latest to announce that it will use the American service, despite several unresolved questions about rights and copyrighted material in the models.

Researchers are also working with the National Library of Norway on a project to compare language models developed on freely available and copyright-protected material.

In the long run, this might be able to provide guidelines for a future compensation scheme for the use of copyrighted material in language models.

Language models highlight stereotypes 

There are several important reasons why we need Norwegian language models, according to Associate Professor Andrey Kutuzov at the University of Oslo. ChatGPT is to a very small extent adapted to the knowledge and value base in Norway, he points out.

“The tech giants' language models are essentially trained on English and American languages. They thus also reflect an American set of values and culture. For example, the American language models may reflect a gender distribution of professions that is more stereotypical than is the case in Norway,” says Kutuzov.

In addition, one often sees that English expressions rub off on the Norwegian wording. 

“A Norwegian language model will to a much greater extent reflect society as we know it in Norway,” he says.

Must be trained to solve tasks 

The Norwegian language models have been launched and have already been downloaded by several thousand users. 

The models are initially intended for researchers and developers. 

Kutuzov explains that the Norwegian versions have not been launched in web interfaces that are easy to use for everyone. He admits that they are still far from being able to offer the possibilities that the commercial language models provide. The models are trained to be general base models. 

A language model is trained in several steps. These Norwegian models have received basic training, which means they can predict the next word in a text. 
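To make "predicting the next word" concrete, here is a toy sketch in Python. It is deliberately not how the Norwegian models work: they are large neural networks, whereas this example only counts which word most often follows another in a tiny made-up text. The corpus and function names are invented for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(text: str) -> dict:
    """Count, for every word, which words follow it in the text."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        followers[current_word][next_word] += 1
    return followers


def predict_next_word(model: dict, word: str):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    candidates = model.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Made-up training text; a real model is trained on billions of words.
toy_corpus = "the model reads text and the model predicts the next word"
model = train_bigram_model(toy_corpus)

print(predict_next_word(model, "the"))     # model
print(predict_next_word(model, "norway"))  # None
```

In this text "the" is followed by "model" twice and by "next" once, so the sketch predicts "model". A base model makes the same kind of guess, but from far richer statistics learned over its whole training corpus.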

For the Norwegian models to reach the same level as ChatGPT or similar models, they need more so-called instruction training. This will enable them to solve various tasks to a greater extent. This work is already underway at the University of Oslo. New and updated versions of the language models will be launched continuously. 

Even though the race with the American models seems tough, the researchers point out that Norwegian language models need to be further developed. 

“It's an important principle that we create models that are free from restrictions. We must have such models that are based on openly available resources and that are transparent for the research community and industry. Large language models will increasingly serve as basic infrastructure for solving various tasks in research, industry, administration, and society in general,” says Velldal. 

Three new Norwegian language models

Three new Norwegian language models have been launched, based on the GPT-like architectures BLOOM and Mistral, all with an 'open source' license.

The models have been developed by the research environment at the University of Oslo in collaboration with Sigma2 and the National Library of Norway. Together with other actors in the national AI network NORA, the partners are planning a national infrastructure for the development and use of large Norwegian language models.

Two of the models have been trained from scratch in Norwegian.

The third is based on a model pre-trained for English by the French company Mistral AI, which has then been further trained for Norwegian.

The models are available here.

——— 

The article has been translated by Sigma. 

Read the Norwegian version of this article on forskning.no
