AI dataset

New podcast episode!

Welcome to the premiere of the latest episode of the NIELINIOWY podcast on YouTube, in which Michal Dulemba talks to the technical team of our project:

Maria Filipkowska, PhD, Jacek Chwiła, Adrian Gwoździej, Grzegorz Urbanowicz and Paweł Kiszczak.

The conversation covers, among other topics:

  • how much data has already been collected
  • variety of text data
  • artificially generated text data
  • kilometers of data in state archives
  • the motivation of the team members
  • the benefits of joining the Granary
  • the variety of technical skills useful in the Granary
  • the impact of Spichlerz on recruitment
  • the pride of working on a distinctly Polish project

We are partners!

We are very proud to announce that we have become official partners of a unique hackathon – Hack to the Rescue!

Hack to the Rescue is the world’s largest Generative AI event. Its goal this year is to search for the most effective solutions to help nonprofit organizations deal with the most pressing challenges of the modern world. It is an online event that will take place on June 14-15.

We would like to point out that the mentors of this extraordinary hackathon include Maria Filipkowska and Adrian Gwoździej, who work with us on the SpeakLeash project every day! We are extremely pleased to have them with us, and their attitude motivates us to keep developing.

We invite you to read the details of the event at the link:


Another webinar

We have barely finished enjoying the Python Data Summit webinar, and another presentation is already waiting for us!
You are warmly invited to the conference, which will be held on June 15-16. Among the speakers, in addition to the usual duo of Sebastian Kondracki from Deviniti and Adrian Gwoździej from BTC and Bank Pekao S.A., will be Maria Filipkowska and Grzegorz Urbanowicz.
The presentation will cover the achievements of the SpeakLeash project to date and compare them with other initiatives. There will certainly be other interesting topics as well.
For everyone who would like to attend the conference, we have a code for a 20% discount.
See you there, you can’t miss it!


We are 300!

The temperature outside is rising, but this has nothing to do with our data collection rate. The end of May is behind us, and our counter now starts with a “3”: 302 GB, to be exact! It is worth mentioning that just two months ago, at the end of March, we had only 120 GB. This gives an optimistic outlook for further updates, which will appear as soon as possible.

The most recent batch of nearly 50 GB includes women’s, sports, and health forums, as well as public information.

Please visit our dashboard, where you will learn much more about the data we collected. 

Have a great week!


We are hosting a webinar

Data is not the only thing a person lives by. As the interest in our project is greater than we could have expected, we are reaching out to you.

Tomorrow our representatives, Sebastian Kondracki from Deviniti and Adrian Gwoździej from BTC and Bank Pekao S.A., will talk about how to effectively obtain large text datasets in Python, using our project as an example.

The conference will take place tomorrow, May 18, at 1:00 p.m. You are cordially invited. Don’t miss it!

Link to the event: 👇

Pytech Summit 2024 (online) | The largest Python conference



Another week, another update!
This time we have passed the magic number that marks a quarter of our goal. 255.1 GB, or 255,100 MB (which sounds even more impressive), is the exact volume of Polish text data we have managed to collect so far. As last time, the newly collected data falls into the forums and education categories.

Knowing our researchers, next week we will be even closer to our goal, as the pace of data collection is growing exponentially. This is also thanks to the people who have joined the project in recent weeks, inspired to help with our work.
If you are interested in being a part of something big, don’t hesitate to write to us!
Contact link in the comments. 👇



We kept our word!


We come with positive news! As promised, we managed to exceed 200 GB before May. Moreover, the counter currently stands at over 217 GB, although we are not sure whether it has already changed while we are writing this post 🙂

The main sources of the newly acquired data are lifestyle and beauty forums.

We can’t describe the enormity of the work and dedication of our experts. Thank you!

As a result, based on the last update of the OSCAR project from 23.01, we surpass that project in terms of dataset size by as much as 40%.

We will keep you updated on further changes. We are looking forward to the next news.

Have a productive afternoon 🙂


We are in the lead

As promised, more data from the blogs and education category is now in our granary! To give you an idea of the task we’re facing, the data from this category alone amounts to 2.9 million files, and that’s just a fraction of what we’ve collected. Another newly added dataset covers job listings. As a result, our project currently holds the largest collection of Polish text data!


Happy Easter!

We wish you much peace and joy in the coming days!

In the meantime, we can report the import of more data. As promised, another batch from the blogs and education category has arrived, which, together with the previous texts, gives us more than 145 GB of text data. You can see more details on our dashboard: Speakleash Dashboard – Streamlit

Happy Easter!



Another 3 datasets are already in our granary! They come from general media as well as from blog-related sites. Our collection currently stands at 141 GB, and you can be sure that more data from these areas, media and blogs, will follow in the near future.
The pie chart below shows the distribution across categories.