Speakleash: the dataset of Polish texts now exceeds 830 GB!

Since our last update, which we proudly announced on September 6th, the Polish-language text dataset that Speakleash (Spichlerz) is building has expanded significantly. The text database has now reached 833.36 GB, a growth of roughly 463 GB in just 3.5 months.

The key changes include:

Increase in the text database: The Spichlerz dataset has grown from 370 GB to 833.36 GB. This increase reflects the project’s intensified efforts in collecting and analyzing Polish texts.
Surpassing The Pile: The Speakleash dataset is now larger than the well-known The Pile dataset, confirming the project’s position as one of the largest sources of textual data globally and certainly the largest for the Polish language.
New data from internet forums: Over 100 GB of content, mainly from various internet forums, has been added to our database.
Data from the CulturaX dataset: We have introduced new data from the CulturaX dataset, which has undergone a detailed analysis using Speakleash metrics. Additionally, the data has been precisely categorized, increasing its usability and analytical value.

Collecting more than 460 GB in such a short time is a testament to the commitment and pace of everyone supporting the project’s development. We have no intention of slowing down!

If you would like to help us reach our main goal of collecting 1 TB of Polish text data, we invite you to collaborate!