From the Labs: Processing Finnish Reddit Data
I saw on the AI Nordics Discord group that a Scandi-Reddit dataset had been created. Why the heck wouldn’t we do this also for Finnish?
Written by — Rasmus Toivanen, Data Engineer
For the last 2 years, I have been contributing to Finnish AI open source by publishing models and datasets to the Finnish-NLP organization on Hugging Face. The work started after Turku-NLP released FinBERT, their Finnish version of the BERT model, and my friend Aapo and I recognized that the Finnish open-source AI space was lagging behind. In my previous blog post, I shed some light on how we train models, but this short story tells you how I processed the entire Reddit data from 2006 to 2022 on my local PC.
I have been part of the AI Nordics Discord group (you can join here) for some time and saw that Dan Saattrup Nielsen from The Alexandra Institute had created a Scandi-Reddit dataset. That inspired me. I thought: why the heck wouldn’t we do this also for Finnish? And so my data processing project started.
I began by looking into how the processing had been done in the Scandi-Reddit project and used their code at first. After a while, however, I decided to try to get it done on my own by putting different pieces together. My first approach was to decompress the .zst files and then process the extracted data with Polars. After using this method for some time, I stumbled upon a way to process the data directly from the .zst files, and I ended up using it together with Pandas for the remaining files.
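As a rough illustration of reading the compressed files directly, here is a minimal sketch. It is not the code I actually ran: it assumes the dumps are newline-delimited JSON, that the third-party zstandard package is installed, and that the files need a larger-than-default decompression window. The function names and the field list are made up for this example.

```python
import io
import json
from itertools import islice


def open_zst_lines(path):
    """Yield decoded text lines from a .zst file without unpacking it to disk."""
    # Third-party package, imported lazily so the helpers below work without it.
    import zstandard

    with open(path, "rb") as fh:
        # Assumption: the dumps were compressed with a long window,
        # so the default window limit has to be raised.
        decompressor = zstandard.ZstdDecompressor(max_window_size=2**31)
        reader = decompressor.stream_reader(fh)
        yield from io.TextIOWrapper(reader, encoding="utf-8", errors="replace")


def iter_records(lines, fields):
    """Parse newline-delimited JSON and keep only the requested fields."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the whole run
        if not isinstance(record, dict):
            continue
        yield {name: record.get(name) for name in fields}


def batched(iterable, size):
    """Yield lists of at most `size` items so memory use stays bounded."""
    iterator = iter(iterable)
    while batch := list(islice(iterator, size)):
        yield batch
```

With these pieces, each batch from something like batched(iter_records(open_zst_lines(path), fields), 10_000_000) can be handed to pandas.DataFrame(batch) for further processing, where path and fields are whatever file and columns you are interested in.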
As Reddit's popularity has grown quite a lot over the years, the file sizes started to become an issue, and I ended up splitting the processing into batches of 10 million messages at a time. During the processing, I kept only certain interesting fields from the messages, filtered them with a 30-character limit, and ran language identification with fastText. Lastly, I filtered the final datasets with a 70 percent limit for Finnish.
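The filtering step can be sketched roughly as follows. This is an illustration under assumptions, not the published processing code: it treats the 30-character limit as a minimum message length and the 70 percent limit as a minimum fastText confidence for the Finnish label, and it applies both in one function. The function names are invented here, and the model path is whichever fastText language identification model file you have downloaded.

```python
def make_fasttext_predictor(model_path):
    """Wrap a fastText language-ID model as predict(text) -> (language, confidence)."""
    # Third-party package, imported lazily so keep_message works without it.
    import fasttext

    model = fasttext.load_model(model_path)

    def predict(text):
        # fastText predicts on a single line, so newlines are replaced first.
        labels, scores = model.predict(text.replace("\n", " "), k=1)
        return labels[0].replace("__label__", ""), float(scores[0])

    return predict


def keep_message(text, predict, min_chars=30, min_confidence=0.7, language="fi"):
    """Return True for messages that are long enough and confidently in `language`."""
    # Assumption: the 30-character limit is a minimum length.
    if text is None or len(text) < min_chars:
        return False
    label, confidence = predict(text)
    # Assumption: the 70 percent limit is a minimum model confidence.
    return label == language and confidence >= min_confidence
```

Because the predictor is passed in as a plain function, the filter can be tried out with a stand-in predictor before loading a real model.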
Once the data was processed, I published it on Hugging Face, a platform for hosting and sharing datasets. I copied the README file from the Scandinavian language repository, giving credit to the original creator of course, and added my own modifications to ensure users were aware of potential toxicity in the data. You can find the dataset, which has 4,476,667 unique messages along with some metadata, and the processing code at https://huggingface.co/datasets/Finnish-NLP/Reddit_fi_2006_2022.
BUILDING FINNISH NLP TOGETHER
The project was a challenging but rewarding experience.
Downloading and processing over 2 TB of data might sound like a lot of work, but we need more contributions like this from the Finnish AI community. I would like this project to be seen as a small demonstration that anyone with enough interest and willingness can contribute to the Finnish NLP community.
If you’re interested in working with natural language data, I highly recommend taking on a project like this! On the other hand, if you represent an organization with a good amount of text data in Finnish, I would highly recommend, or even challenge you, to evaluate whether you could share it with the open-source community and upload it to Hugging Face. A recent example of this kind of work: people at Datacrunch, a company that provides GPU instances and AI services, decided to translate the Alpaca dataset (an instruction-following NLP dataset from Stanford) into Finnish and push it to Hugging Face as a public dataset. You can find their dataset here.
We need all corporations and brave individuals in Finland to join forces and work together on this so we can keep up with the bigger players.
Did you swoon over the story of this article? You should then consider working with us! We at Recordly are constantly looking for kick-ass Data Engineers and Data Architects to join our happy bunch of data troops!