From the Labs: Training big NLP models with TPU Research Cloud

Training large NLP models may seem like an enormous, perhaps impossible, task for smaller companies with limited resources. This is, however, a misconception. Thanks to Google’s TPU Research Cloud (TRC) program, even a small three-person team can successfully train large masked language models. Read on to learn how this is possible in practice.

Written by — Rasmus Toivanen, Data Engineer

This year, a few eager data fellows from Finnish consulting companies and I shared our frustrations with the current pace of development in the Finnish NLP space, particularly in the development of masked language models (MLMs), which are key for many downstream tasks (text classification, named entity recognition, etc.). It seems that no new open-source models have been released since TurkuNLP launched FinBERT in 2019. What is especially concerning is that these natural language models keep growing, so training them seems to be out of reach for non-academic entities and smaller companies due to the huge compute needs.

Luckily, these concerns can be resolved with Google’s TPU Research Cloud Program (TRC), which enables researchers to apply for a cluster of more than 1,000 Cloud Tensor Processing Units (TPUs) basically for free! Researchers accepted into the TRC program have access to v2 and v3 devices at no charge and can leverage a variety of frameworks, including TensorFlow, PyTorch, Julia, and JAX, to accelerate the next wave of open research breakthroughs.

Sounds too good to be true? Well, it is true. In the following, I will share my experience on how our small team of three trained large NLP models successfully with the help of TPU Research Cloud.

Have enough data to train the MLM models

The three of us applied to the TRC program with a few objectives in mind:

To see whether we could train these huge MLM models successfully on TPU instances,
to find out whether data from the Finnish Language Bank and other sources would be sufficient to train the MLM models,
and to do all of this alongside our daily work at different companies.

Once we were accepted to the program, we divided the roles: I would focus on project management, infrastructure, and data engineering, while my fellow data experts, Aapo and Tommi, would focus on training the model.

For training our MLM models, we started with data from the Finnish Language Bank and found a few sufficiently large text datasets. In addition to these, we found the large multilingual mC4 dataset. With these resources, we were confident that we would have enough data to train our models.

The process of training large MLM models

I did the preprocessing on my own machine and uploaded the data to a Google Cloud Storage bucket. We then iterated a little on our research setup and landed on an arrangement with one bucket in one region, from which each TPU instance downloaded the data to a disk attached to that instance. In GCP, you can also mount storage buckets directly on the instances with Cloud Storage FUSE (gcsfuse), but we decided to dedicate a disk to each instance as we found it more convenient.
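The download step described above could be scripted roughly as follows. This is only a minimal sketch and not the script we actually ran: the bucket name and the local mount path are hypothetical placeholders, and it assumes the gsutil command-line tool is installed and authenticated on the TPU instance. By default the function only returns the command as text, so nothing is copied unless dry_run is switched off.

```python
import shlex
import subprocess
from typing import List


def build_sync_command(bucket_uri: str, local_dir: str) -> List[str]:
    """Return a gsutil command that mirrors a bucket prefix onto a local disk."""
    if not bucket_uri.startswith("gs://"):
        raise ValueError("bucket_uri must start with gs://")
    # -m runs the transfer in parallel; rsync -r recurses into sub-directories.
    return ["gsutil", "-m", "rsync", "-r", bucket_uri, local_dir]


def sync_dataset(bucket_uri: str, local_dir: str, dry_run: bool = True) -> str:
    """Return the command as a string, and run it only when dry_run is False."""
    command = build_sync_command(bucket_uri, local_dir)
    if not dry_run:
        subprocess.run(command, check=True)
    return shlex.join(command)


if __name__ == "__main__":
    # Hypothetical bucket and disk path, used purely for illustration.
    print(sync_dataset("gs://example-finnish-corpus/preprocessed",
                       "/mnt/data/preprocessed"))
```

Running the same command on every TPU instance gives each one its own local copy of the data, which matches the one-bucket, one-disk-per-instance arrangement described above.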

After this, we started training the model. The popular NLP company Hugging Face maintains the open-source Transformers Python library for different NLP tasks. Among its example scripts, we found pretraining scripts for the NLP models we were aiming to train (RoBERTa, ELECTRA, DeBERTa, etc.) that use the Flax framework, which is built on JAX. Together with TPUs, Flax shortens training times considerably compared to the more commonly used GPU training with other frameworks.
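For readers unfamiliar with what masked language model pretraining does with the text, the core idea can be sketched in a few lines. The snippet below is a simplified illustration in plain Python and not the code from the Hugging Face scripts: the function name, the 15% masking rate, the 80/10/10 split, and the -100 "ignore" label are assumptions based on the conventional BERT-style recipe, not settings confirmed for our runs.

```python
import random
from typing import List, Optional, Sequence, Tuple

# Label value meaning "no loss is computed at this position" (a common convention).
IGNORE_INDEX = -100


def mask_tokens(
    token_ids: Sequence[int],
    mask_id: int,
    vocab_size: int,
    special_ids: Sequence[int] = (),
    mask_prob: float = 0.15,
    rng: Optional[random.Random] = None,
) -> Tuple[List[int], List[int]]:
    """Return (model inputs, labels) for one sequence of token ids."""
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(inputs)
    for i, token in enumerate(token_ids):
        # Special tokens (e.g. sequence start/end) are never selected.
        if token in special_ids or rng.random() >= mask_prob:
            continue
        labels[i] = token  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # most selected tokens become the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # some become a random token
        # the remaining selected tokens are left unchanged
    return inputs, labels


if __name__ == "__main__":
    # Hypothetical token ids: 0 and 2 stand in for start/end tokens, 4 for the mask.
    example = [0, 11, 12, 13, 14, 15, 2]
    print(mask_tokens(example, mask_id=4, vocab_size=100, special_ids=(0, 2)))
```

During pretraining the model sees the corrupted inputs and is trained to recover the original tokens at the selected positions. Because the masking is drawn at random each time, the same sentence can be masked differently on every pass over the data.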

After a few weeks of iterating on the RoBERTa model and testing it on downstream tasks, we were able to achieve results that were almost on par with the current state-of-the-art Finnish MLM model from the TurkuNLP group (a research group at the University of Turku). Although we aimed to beat the TurkuNLP group’s model, this was a satisfying result for us.

With more data and better data filtering, we aim to keep improving our model and hopefully reach the Finnish state of the art (SOTA) for MLM models on selected downstream tasks, text classification in particular. We just got a 90-day extension to our TRC access, so we hope to accomplish our goal by the end of the year. We are also planning to train automatic speech recognition (ASR) models during this 90-day period - but more on that later…

Want to learn more about TRC?

As this article shows, training huge NLP models is possible even for groups with limited resources, thanks to programs like TRC.

If you’re keen to try out the program yourself, go ahead and apply to the Google TRC Program here. Our model is also openly available, so go ahead and check it out!

If you have any questions regarding TRC and our project, or you would like to join our team, please don’t hesitate to contact me at rasmus.toivanen@recordlydata.com.
