THIS CONTENT IS BROUGHT TO YOU BY University of Oslo
Norwegian answer to ChatGPT is on its way
Several new Norwegian language models have already been launched. These are not yet easy to use for the average person.
“There are many problems associated with the tech giants' language models. They appear as black boxes to the outside world. We need Norwegian alternatives,” says Erik Velldal.
He is a professor at the University of Oslo's Department of Informatics.
Before Christmas, something crucial happened that sped up development of the Norwegian counterpart to ChatGPT. It is being developed by the Language Technology Group (LTG) at the University of Oslo.
They were granted computing time on Europe's most powerful computer, LUMI in Finland. It is so sought after that researchers have to apply for time to use it. This allowed them to start training large language models on Norwegian data. Within a couple of weeks, enough data was processed for the researchers to launch three Norwegian language models.
Training requires computing capacity
“Training large language models requires a lot of computing capacity from GPUs, graphics processing units. This training scales well, which means that if you double the number of GPUs involved, the training will finish roughly twice as fast. The advantage of LUMI is that there are a lot of GPUs available, over 10,000,” says Hans A. Eide.
He is a special adviser at Sigma2, a non-profit responsible for the national e-infrastructure for research in Norway.
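The scaling Eide describes can be sketched as simple arithmetic. The snippet below is a minimal illustration that assumes perfectly linear scaling, which real training runs only approximate. The function name and the baseline figures are hypothetical and do not come from the LUMI runs.

```python
def ideal_training_hours(baseline_hours: float, baseline_gpus: int, gpus: int) -> float:
    """Estimate training time under perfectly linear scaling (an idealization)."""
    if baseline_gpus <= 0 or gpus <= 0:
        raise ValueError("GPU counts must be positive")
    # Total work (GPU-hours) is assumed constant, so time shrinks as GPUs grow.
    return baseline_hours * baseline_gpus / gpus


# Hypothetical baseline: 1,000 hours on 64 GPUs. Not a figure from the article.
for gpus in (64, 128, 256):
    hours = ideal_training_hours(1000.0, 64, gpus)
    print(f"{gpus} GPUs: about {hours:.0f} hours")
```

In practice, communication between GPUs adds overhead, which is why the quote says "roughly" twice as fast.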
“The National Library of Norway and the University of Oslo have made several Norwegian language models available earlier, but these are the largest we've made so far. They are trained on over 30 billion words,” says Velldal.
All models have around seven billion parameters. This is something the researchers consider optimal in relation to the amount of Norwegian training data available.
“A language model becomes poor if it's trained on too little material in relation to its size. It's about finding the right balance,” says Velldal.
Training on the same data in several rounds has proven effective if there is enough processing power. The models the University of Oslo has trained on LUMI have been fed the same training data six times, he explains.
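The balance between model size and training data can be made concrete with the rounded figures quoted above. This is a back-of-the-envelope sketch only: the article says "over 30 billion words" and "around seven billion parameters", so the results are approximate, and it counts words, not the subword tokens that models actually process.

```python
# Rounded figures from the article; the exact values are not given.
UNIQUE_WORDS = 30e9   # "over 30 billion words"
PARAMETERS = 7e9      # "around seven billion parameters"
EPOCHS = 6            # the same data was fed to the models six times

# Total words processed during training, counting every repeated pass.
words_seen = UNIQUE_WORDS * EPOCHS

# Ratios of training data to model size, with and without repetition.
unique_words_per_parameter = UNIQUE_WORDS / PARAMETERS
words_seen_per_parameter = words_seen / PARAMETERS

print(f"Words seen in total: {words_seen / 1e9:.0f} billion")
print(f"Unique words per parameter: {unique_words_per_parameter:.1f}")
print(f"Words seen per parameter: {words_seen_per_parameter:.1f}")
```

Repeating the data raises the amount of text the model sees per parameter without requiring more Norwegian text to be collected.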
Norway must have technological independence
The Language Technology Group believes it is important to have Norwegian counterparts to OpenAI's ChatGPT and Google's LaMDA.
The Norwegian language makes up just 0.1 per cent of the language data that ChatGPT has been trained on. Furthermore, the specifics of the data used for training have not been fully disclosed. This lack of transparency is concerning for several reasons, according to Velldal.
He explains that Microsoft and OpenAI allow Norwegian users to access the model through a web interface, but the underlying model is closed.
“In many contexts, it can also be problematic to send data to a commercial third party. If you work with sensitive health data, for example, it's important to be able to control where and how the data is processed. Then it's essential to have access to open and free models that developers can run on their own machines,” he says.
Several large Norwegian public actors have nevertheless jumped on board and bought access to OpenAI's ChatGPT.
“It's important to ensure that open, Norwegian-developed models become available as an alternative. Perhaps especially for the public sector,” Velldal points out.
The Oslo School is among the latest to announce that it will use the American service. This is despite several unresolved questions about rights and copyrighted material in the models.
Researchers are also working with the National Library of Norway on a project to compare language models developed on freely available and copyright-protected material.
In the long run, this might be able to provide guidelines for a future compensation scheme for the use of copyrighted material in language models.
Language models highlight stereotypes
There are several important reasons why we need Norwegian language models, according to Associate Professor Andrey Kutuzov at the University of Oslo. ChatGPT is to a very small extent adapted to the knowledge and value base in Norway, he points out.
“The tech giants' language models are essentially trained on English and American languages. They thus also reflect an American set of values and culture. One example is that the American language models reflect a gender distribution across professions that is more stereotypical than is the case in Norway,” says Kutuzov.
In addition, one often sees that English expressions rub off on the Norwegian wording.
“A Norwegian language model will to a much greater extent reflect society as we know it in Norway,” he says.
Must be trained to solve tasks
The Norwegian language models have been launched and have already been downloaded by several thousand users.
The models are initially intended for researchers and developers.
Kutuzov explains that the Norwegian versions have not been launched in web interfaces that are easy to use for everyone. He admits that they are still far from being able to offer the possibilities that the commercial language models provide. The models are trained to be general base models.
A language model is trained in several steps. These Norwegian models have received basic training, which means they can predict the next word in a text.
For the Norwegian models to reach the same level as ChatGPT or similar models, they need more so-called instruction training. This will enable them to solve various tasks to a greater extent. This work is already underway at the University of Oslo. New and updated versions of the language models will be launched continuously.
Even though the race with the American models seems tough, the researchers point out that Norwegian language models need to be further developed.
“It's an important principle that we create models that are free from restrictions. We must have such models that are based on openly available resources and that are transparent for the research community and industry. Large language models will increasingly serve as basic infrastructure for solving various tasks in research, industry, administration, and society in general,” says Velldal.
———
The article has been translated by Sigma.
This content is paid for and presented by the University of Oslo
This content is created by the University of Oslo's communication staff, who use this platform to communicate science and share results from research with the public. The University of Oslo is one of more than 80 owners of ScienceNorway.no.