
Bielik AI Unveils Two New Models with Polish-Optimized Tokenizer at KUKDM 2026

Zakopane, Poland — April 14, 2026 — At the 18th High-Performance Computing Users’ Conference (KUKDM 2026) in Zakopane, the Bielik AI team presented two new language models: Bielik-PL-11B-v3.0-Instruct and Bielik-PL-Minitron-7B-v3.0-Instruct. These are the first variants in the Bielik family to feature a dedicated tokenizer optimized specifically for the Polish language.

Presentation at KUKDM 2026

The premiere took place on April 14 at 10:35 AM during the presentation “Architecture Optimization in Bielik Models” (original title: Optymalizacja architektury w modelach Bielik), delivered by Krzysztof Ociepa and Krzysztof Wróbel from the Bielik AI team. The KUKDM conference, organized by the Academic Computer Centre Cyfronet AGH, runs from April 13–15, 2026 at Hotel Bachleda Kasprowy in Zakopane, bringing together Poland’s scientific and technology community focused on high-performance computing.

What’s New in the Bielik PL Models?

The key innovation in the new models is the transition from a universal Mistral-based tokenizer to a dedicated Polish-optimized vocabulary. Previous Bielik models relied on a tokenizer designed to cover a broad spectrum of languages, which resulted in higher fertility ratios for Polish text, increased inference costs, and restricted effective context windows.
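The "fertility" mentioned above is simply the average number of tokens a tokenizer produces per word: the lower it is, the more text fits into a fixed context window and the cheaper inference becomes. The sketch below illustrates the metric with toy token lists; the tokenizations shown are illustrative stand-ins, not the actual Mistral or Bielik vocabularies.

```python
# Sketch: comparing tokenizer "fertility" (tokens per word) on Polish text.
# The token lists below are toy examples, not output of the real tokenizers.

def fertility(tokens: list[str], text: str) -> float:
    """Average number of tokens produced per whitespace-delimited word."""
    words = text.split()
    return len(tokens) / len(words)

text = "Przykładowe polskie zdanie do tokenizacji"

# A broad multilingual vocabulary tends to split Polish words into many pieces...
multilingual_tokens = ["Przy", "kład", "owe", "pol", "skie",
                       "zda", "nie", "do", "token", "iza", "cji"]
# ...while a Polish-optimized vocabulary keeps more words whole.
polish_tokens = ["Przykładowe", "polskie", "zdanie", "do", "tokeniz", "acji"]

print(f"multilingual fertility:     {fertility(multilingual_tokens, text):.2f}")
print(f"polish-optimized fertility: {fertility(polish_tokens, text):.2f}")
```

With the toy tokenizations above, the multilingual tokenizer needs almost twice as many tokens for the same sentence, which is exactly the overhead a dedicated vocabulary removes.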

The new models employ FOCUS-based embedding initialization, a multi-stage pretraining curriculum, and an advanced post-training pipeline comprising:

  • Supervised Fine-Tuning (SFT)
  • Direct Preference Optimization (DPO)
  • Group Relative Policy Optimization (GRPO) with verifiable rewards
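To make the DPO stage above concrete, here is a minimal sketch of the DPO objective on a single preference pair, written with plain floats for readability. The log-probability values are made up for illustration; a real implementation batches this over model outputs (e.g. in PyTorch), and nothing here is specific to the Bielik training code.

```python
import math

# Sketch of Direct Preference Optimization (DPO) on one (chosen, rejected) pair.
# All log-probability values below are illustrative, not from any real model.

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss: push the policy to prefer the chosen answer more than
    the frozen reference model does, scaled by temperature beta."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(logits)): small when the policy's preference margin is large.
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Policy likes the chosen answer more than the reference does -> low loss.
loss = dpo_loss(-10.0, -14.0, -12.0, -13.0)
print(f"DPO loss: {loss:.4f}")
```

GRPO, the final stage, replaces the fixed preference pairs with groups of sampled completions scored by verifiable reward functions (e.g. exact-match checks), but the preference-margin idea is the same.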

Benchmark Results

According to the accompanying research paper, Bielik-11B-v3.0-Instruct achieves a 5-shot average of 65.93 on the Open LLM Leaderboard, ranking among the top models and outperforming several significantly larger solutions, including Meta-Llama-3.1-70B-Instruct.

The Polish tokenizer variants achieve scores of 64.11 (Bielik-PL-11B) and 61.66 (Bielik-PL-Minitron-7B), confirming that tokenizer optimization does not come at the expense of model quality. In domain-specific evaluations for the Polish language, including the Polish Medical Benchmark, the 11B model demonstrated a significant improvement over the base version.

Compression Without Compromise

The Bielik-PL-Minitron-7B model was created by compressing the 11B variant using structured pruning and knowledge distillation techniques developed in collaboration with NVIDIA engineers. This approach achieved a 33% reduction in model size and up to 50% faster inference, while retaining approximately 90% of the full model’s quality. The compression methodology was first introduced at NVIDIA GTC in March 2026.
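The distillation half of that recipe trains the pruned "student" to match the larger "teacher" token-by-token. A common formulation, sketched below with toy logits, is the KL divergence between temperature-softened output distributions; the exact loss used for Bielik-PL-Minitron is described in the paper, so treat this only as an illustration of the general technique.

```python
import math

# Sketch of a knowledge-distillation signal: KL divergence between the
# teacher's and student's temperature-softened next-token distributions.
# Logit values are illustrative, not taken from the Bielik models.

def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_kl(teacher_logits: list[float], student_logits: list[float],
                    temperature: float = 2.0) -> float:
    """KL(teacher || student) over softened vocabulary distributions;
    a higher temperature exposes more of the teacher's 'dark knowledge'
    about low-probability tokens."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [2.0, 1.0, 0.1]
student = [1.8, 1.1, 0.2]
print(f"KL: {distillation_kl(teacher, student):.4f}")
```

Minimizing this divergence lets the smaller model recover most of the teacher's behavior, which is consistent with the ~90% quality retention reported above.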

Research Publication

The technical details of the new models are described in the paper Advancing Polish Language Modeling through Tokenizer Optimization in the Bielik v3 7B and 11B Series, published on April 12, 2026 on the arXiv platform (ID: 2604.10799). The paper is authored by Krzysztof Ociepa, Łukasz Flis, Remigiusz Kinas, Krzysztof Wróbel and Adrian Gwoździej.



Summary of the Year 2023

Today we bring you a summary of a year of activity from the Speakleash project. 🚀

When we set off with the project, our goal seemed very far away; some said it was even impossible. 🎯 During this year we have not only formalised our organisation as a foundation, but most importantly we have managed to collect as much as 837.45 GB of data, meaning that we are already very close to the target of 1 TB! 💪 It is worth mentioning that we are currently creating the world’s largest (or one of the largest) single-language text datasets developed in an open-science model. Additionally, we focus not only on the quantity of data but also on its quality: all the data we collect is categorised in detail and assessed for quality, and you can follow the progress of our work in real time on our dashboard. 🌐

Speakleash’s work is not only about collecting data, but also about sharing knowledge and inspiring others. 💡 This year we have had the pleasure of attending a number of conferences, including the Data Science Summit and the Data Science Summit Machine Learning Edition, two editions of the Pytech Summit, the CLARIN-PL anniversary conference “Ten years of CLARIN open science infrastructure in Poland” and the Deviniti JIRA DAY Night Talk. 📣 We were partners of events such as ML in PL and Hack To The Rescue. We were also a guest on Michal Dulemba‘s Nieliniowy podcast; notably, the episode featuring us was named the most popular podcast episode of the past year by Crossweb.pl. 🎙

This year has also been very important for the development of AI in Poland, marking the beginning of work on PLLuM (Polish Large Language Universal Model). Key organisations involved in AI in Poland are working on this initiative, and we are very curious to see the results!

None of this would be possible if it were not for the hard work of our team, a mix of competences and characters. More than 160 people have already joined the Speakleash Discord (https://discord.gg/NN99d3Uv), and more are constantly joining. 👥 Sebastian Kondracki, Maria Filipkowska, PhD, Krzysztof (Chris) Ociepa, Adrian Gwoździej, Paweł Kiszczak, Grzegorz Urbanowicz, Szymon Baczyński, Igor Ciuciura, Pawel Cyrta, Izabela Babis, Waldemar Boszko, Andrzej Cyboroń and Jacek Chwiła are just a small part of our team; it’s impossible to mention everyone here.

Thank you to everyone involved for your hard work, and we can’t wait to see what 2024 will bring! 🎉


The Speakleash dataset of texts in the Polish language has expanded by over 470 GB in 3.5 months!

Since the last update, which we proudly announced on September 6th, the dataset of texts in the Polish language that Spichlerz is working on has significantly expanded. Currently, the text database has reached an impressive size of 833.36 GB, indicating a growth of over 470 GB in just 3.5 months.

The key changes include:

  1. Increase in the text database: The size of the Spichlerz dataset has grown from 370 GB to an impressive 833.36 GB. This substantial increase in the amount of collected data reflects the project’s intensified efforts in the collection and analysis of Polish texts.
  2. Surpassing The Pile dataset size: Speakleash has surpassed the size of the well-known The Pile project’s dataset, confirming the project’s position as one of the largest sources of textual data globally and certainly the largest for the Polish language.
  3. New data from internet forums: Over 100 GB of content, mainly from various internet forums, has been added to our database.
  4. Data from the CulturaX dataset: We have introduced new data from the CulturaX dataset, which has undergone a detailed analysis using Speakleash metrics. Additionally, the data has been precisely categorized, increasing its usability and analytical value.

Collecting over 460 GB in such a short time is a testament to the incredible commitment and high pace of work by those supporting the project’s development. We have no intention of slowing down!

If you would like to contribute to achieving our main goal of collecting 1 TB of Polish textual data, we invite you to collaborate!


Pytech Summit 2023

🚀 This Thursday (7 December) at Pytech Summit (https://pytechsummit.pl/) you will have the opportunity to hear from the Speakleash team, represented by Szymon Baczyński and Igor Ciuciura. 🎙️

The talk will be about creating the Speakleash package as a data management tool.

The conference takes place online. Book your ticket today! 🎟️


New podcast episode!

Welcome to the premiere of the latest episode of the NIELINIOWY podcast on YouTube, in which Michal Dulemba talks to the technical team of our project:

Maria Filipkowska, PhD, Jacek Chwiła, Adrian Gwoździej, Grzegorz Urbanowicz and Paweł Kiszczak.
https://lnkd.in/dESakTbb

The conversation covers, among other topics:

  • how much data has already been collected
  • variety of text data
  • artificially generated text data
  • kilometers of data in state archives
  • the motivation of the team members
  • the benefits of joining the Granary
  • the variety of technical skills useful in the Granary
  • the impact of Spichlerz on recruitment
  • pride in working on a distinctly Polish project

We are partners!

We are very proud to announce that we have become official partners of a unique hackathon – Hack to the Rescue!

Hack to the Rescue is the world’s largest Generative AI event. Its goal this year is to search for the most effective solutions to help nonprofit organizations deal with the most pressing challenges of the modern world. It is an online event that will take place on June 14-15.

We would like to point out that the mentors of this extraordinary hackathon include Maria Filipkowska and Adrian Gwoździej, who work with us daily on the Speakleash project! We are extremely pleased to have them with us; their attitude motivates us to keep developing.

We invite you to read the details of the event at the link:
https://hacktotherescue.org/


Another webinar

We have barely finished enjoying the Pytech Summit webinar, and another presentation is already waiting for us!
You are warmly invited to the conference, which will be held on June 15-16. Among the speakers, in addition to the regular duo of Sebastian Kondracki from Deviniti and Adrian Gwozdziej from BTC and Bank Pekao S.A., will be Maria Filipkowska and Grzegorz Urbanowicz.
The presentation will discuss the achievements of the SpeakLeash project to date and compare them with other initiatives. There will certainly be other interesting topics as well.
For everyone willing to attend the conference, we have a 20% discount code.
See you there!

https://ml.dssconf.pl/


We are 300!

The temperature outside is rising, but that has nothing to do with our data collection rate. With the end of May behind us, we have passed the 300 GB mark: 302 GB, to be exact! It is worth mentioning that only two months ago, at the end of March, we had just 120 GB. This gives an optimistic outlook for further updates, which will appear as soon as possible.

The most recent nearly 50 GB includes women’s, sports and health forums, as well as public information.

Please visit our dashboard, where you will learn much more about the data we have collected.

Have a great week!


We’re doing a webinar!

Data is not the only thing a person lives by. Since interest in our project is greater than we could have expected, we are reaching out to you.

Tomorrow our representatives, Sebastian Kondracki from Deviniti and Adrian Gwozdziej from BTC and Bank Pekao S.A., will talk about how to effectively obtain large text datasets in Python, using the SpeakLeash.org project as an example.

The conference takes place tomorrow, May 18, at 1:00 p.m. You are cordially invited; don’t miss it!

Link to the event: 👇

Pytech Summit 2024 (online) | The largest Python conference


A Quarter of the Goal


Another week, another update!
This time we have exceeded the magic number that marks a quarter of our goal. 255.1 GB, or 255,100 MB (which sounds even more impressive), is the exact volume of Polish text data we have managed to collect so far. As last time, the new data came from the forums and education categories.

Knowing our researchers, next week we will be even closer to our goal, as the pace of data collection is growing rapidly. This is also thanks to the people who have joined the project in recent weeks, inspired by the idea and eager to help.
If you are interested in being part of something big, don’t hesitate to write to us!
Contact link in the comments. 👇

#data #project #education #people