Category: AI dataset

SUMMARY OF THE YEAR 2023

Post author By Lila
Post date 5 January 2024

Today we bring you a summary of a year of activity from the Speakleash project. 🚀

When we set off with the project, our goal seemed very far away; some said it was even impossible. 🎯 During this year we have not only formalised our organisation as a foundation but, most importantly, we have collected as much as 837.45 GB of data, which brings us very close to our target of 1 TB! 💪 It is worth mentioning that we are currently building one of the largest, if not the largest, single-language text datasets in the world developed in an open-science model. We focus not only on the quantity of data but also on its quality: all the data we collect is categorised in detail and assessed for quality. You can follow the progress of our work in real time on our dashboard. 🌐

Speakleash’s work is not only about collecting data, but also about sharing knowledge and inspiring others. 💡 This year we had the pleasure of attending a number of conferences, including the Data Science Summit and Data Science Summit Machine Learning Edition, two editions of the Pytech Summit, the CLARIN-PL anniversary conference “Ten years of CLARIN open science infrastructure in Poland” and the Deviniti JIRA DAY Night Talk. 📣 We were partners of events such as ML in PL and Hack To The Rescue. We were also guests on Michal Dulemba’s Nieliniowy podcast; the episode featuring us was named the most popular podcast episode of the past year by Crossweb.pl. 🎙

This year has been very important for the development of AI in Poland and saw the beginning of work on PLLuM (Polish Large Language Universal Model).
Key institutions involved in AI in Poland are working on this initiative. We are very curious to see the results!

None of this would be possible without the hard work of our team, a mix of competences and characters. More than 160 people have already joined the Speakleash Discord (https://discord.gg/NN99d3Uv), and more are joining all the time. 👥 Sebastian Kondracki, Maria Filipkowska, PhD, Krzysztof (Chris) Ociepa, Adrian Gwoździej, Paweł Kiszczak, Grzegorz Urbanowicz, Szymon Baczyński, Igor Ciuciura, Pawel Cyrta, Izabela Babis, Waldemar Boszko, Andrzej Cyboroń and Jacek Chwiła are just a small part of our team; it’s impossible to mention everyone here.

Thank you to everyone involved for your work. We can’t wait to see what 2024 will bring! 🎉

AI dataset

The Speakleash dataset of texts in the Polish language has expanded by over 470 GB in 3.5 months!

Post author By Waldemar Boszko
Post date 17 December 2023

Speakleash - the largest dataset of texts in the Polish language.

Since the last update, which we proudly announced on September 6th, the dataset of texts in the Polish language that Spichlerz is working on has significantly expanded. Currently, the text database has reached an impressive size of 833.36 GB, indicating a growth of over 470 GB in just 3.5 months.

The key changes include:

Increase in the text database: The size of the Spichlerz dataset has grown from 370 GB to an impressive 833.36 GB. This substantial increase in the amount of collected data reflects the project’s intensified efforts in the collection and analysis of Polish texts.
Surpassing The Pile dataset size: Speakleash has surpassed the size of the well-known The Pile project’s dataset, confirming the project’s position as one of the largest sources of textual data globally and certainly the largest for the Polish language.
New data from internet forums: Over 100 GB of content, mainly from various internet forums, has been added to our database.
Data from the CulturaX dataset: We have introduced new data from the CulturaX dataset, which has undergone a detailed analysis using Speakleash metrics. Additionally, the data has been precisely categorized, increasing its usability and analytical value.

Collecting this much data in such a short time is a testament to the incredible commitment and high pace of work of those supporting the project’s development. We have no intention of slowing down!
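
The figures in this update can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes that 1 TB means 1000 GB, and the function names are made up for this example. With the rounded 370 GB starting point the growth works out to about 463 GB, so the “over 470 GB” figure presumably rests on a more precise starting size than the one quoted here.

```python
# Figures as published in this post (rounded).
PREVIOUS_GB = 370.0   # dataset size at the 6 September update
CURRENT_GB = 833.36   # dataset size at this update
GOAL_GB = 1000.0      # assumption: the 1 TB goal taken as 1000 GB


def growth_gb(previous: float, current: float) -> float:
    """Absolute growth between two snapshots, in GB."""
    return current - previous


def share_of_goal(current: float, goal: float = GOAL_GB) -> float:
    """Fraction of the goal already collected (0.0 to 1.0)."""
    return current / goal


if __name__ == "__main__":
    print(f"Growth since last update: {growth_gb(PREVIOUS_GB, CURRENT_GB):.2f} GB")
    print(f"Share of the 1 TB goal: {share_of_goal(CURRENT_GB):.1%}")
```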

If you would like to contribute to achieving our main goal of collecting 1 TB of Polish textual data, we invite you to collaborate!

Tags: the biggest Polish language dataset

AI dataset

Pytech Summit 2023

Post author By Lila
Post date 4 December 2023

🚀 This Thursday (7 December) at Pytech Summit (https://pytechsummit.pl/) you will have the opportunity to hear from two members of the Speakleash team, Szymon Baczyński and Igor Ciuciura. 🎙️

The talk will be about creating the Speakleash package as a data management tool.

The conference takes place online. Book your ticket today! 🎟️

AI dataset

New podcast episode!

Post author By Maciej Ogrodnik
Post date 20 June 2023

Welcome to the premiere of the latest episode of the NIELINIOWY podcast on YouTube, in which Michal Dulemba talks to the technical team of our project:

Maria Filipkowska, PhD, Jacek Chwiła, Adrian Gwoździej, Grzegorz Urbanowicz and Paweł Kiszczak.
https://lnkd.in/dESakTbb

The conversation covers, among other topics:

how much data has already been collected
variety of text data
artificially generated text data
kilometers of data in state archives
the motivation of the team members
the benefits of joining the Granary
the variety of technical skills useful in the Granary
the impact of Spichlerz on recruitment
pride in working on a distinctly Polish project

AI dataset

We are partners!

Post author By Maciej Ogrodnik
Post date 14 June 2023

We are very proud to announce that we have become official partners of a unique hackathon – Hack to the Rescue!

Hack to the Rescue is the world’s largest Generative AI event. Its goal this year is to search for the most effective solutions to help nonprofit organizations deal with the most pressing challenges of the modern world. It is an online event that will take place on June 14-15.

We would like to point out that the mentors of this extraordinary hackathon include Maria Filipkowska and Adrian Gwoździej, who work with us on the Speakleash project on a daily basis! We are extremely pleased to have them with us, and their attitude motivates us to keep developing.

We invite you to read the details of the event at the link:
https://hacktotherescue.org/

AI dataset

Another webinar

Post author By Maciej Ogrodnik
Post date 6 June 2023

We have barely had time to enjoy the Python Data Summit webinar, and another presentation is already waiting for us!
You are warmly invited to the conference, which will be held on June 15-16. Among the speakers, in addition to the standard duo of Sebastian Kondracki from Deviniti and Adrian Gwoździej from BTC and Bank Pekao S.A., will be Maria Filipkowska and Grzegorz Urbanowicz.
The presentation will discuss the achievements of the SpeakLeash project to date and compare them with other initiatives. There will certainly be other interesting topics as well.
For everyone who would like to attend the conference, we have a code for a 20% discount.
See you there – you can’t miss it!

https://ml.dssconf.pl/

AI dataset

We are 300!

Post author By Maciej Ogrodnik
Post date 29 May 2023

The temperature outside is rising, but this has nothing to do with our data collection rate. The end of May is behind us, and our counter now starts with a “3”: 302 GB, to be exact! It is worth mentioning that only two months ago, at the end of March, we had just 120 GB. This gives an optimistic outlook for further updates, which will appear as soon as possible.

The most recent batch of nearly 50 GB includes women’s, sports and health forums, as well as public information.

Please visit our dashboard, where you will learn much more about the data we collected.

Have a great week!

AI dataset

We are hosting a webinar

Post author By Maciej Ogrodnik
Post date 17 May 2023

Data is not the only thing a person lives by. As the interest in our project is greater than we could have expected, we are reaching out to you.

Tomorrow our representatives, Sebastian Kondracki from Deviniti and Adrian Gwoździej from BTC and Bank Pekao S.A., will talk about how to effectively obtain large text datasets in Python, using the SpeakLeash.org project as an example.

The conference will take place tomorrow, May 18, at 1:00 p.m. You are cordially invited. Don’t miss it!

Link to the event: 👇

Pytech Summit 2024 (online) | The largest Python conference

AI dataset

Quarter goal

Post author By Maciej Ogrodnik
Post date 17 May 2023

Another week, another update!
This time we have passed the magic number that marks a quarter of our goal. 255.1 GB, or 255,100 MB (which sounds even more impressive), is the exact volume of Polish text data we have managed to collect so far. As last time, the newly collected data falls into the forums and education categories.
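
The unit conversion and the quarter-of-goal check can be reproduced with a short sketch. It is an illustration only: it assumes decimal units (1 GB = 1000 MB, 1 TB = 1000 GB), and the helper names are invented for this example. `Decimal` is used so that the published figure is handled exactly.

```python
from decimal import Decimal

# Assumptions: decimal units and a goal of 1 TB = 1000 GB.
MB_PER_GB = Decimal("1000")
GOAL_GB = Decimal("1000")


def to_mb(size_gb: Decimal) -> Decimal:
    """Convert a size in GB to MB."""
    return size_gb * MB_PER_GB


def goal_fraction(size_gb: Decimal) -> Decimal:
    """Fraction of the 1 TB goal that has been collected."""
    return size_gb / GOAL_GB


def reached_quarter(size_gb: Decimal) -> bool:
    """True once at least a quarter of the goal has been collected."""
    return goal_fraction(size_gb) >= Decimal("0.25")


if __name__ == "__main__":
    collected = Decimal("255.1")  # figure quoted in the post
    print(f"{collected} GB = {to_mb(collected)} MB")
    print(f"Quarter of the goal reached: {reached_quarter(collected)}")
```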

Knowing our researchers, next week we will be even closer to our goal, as the pace of data collection is growing exponentially. This is also thanks to the people who have joined the project in recent weeks, inspired to help with our work.
If you are interested in being a part of something big, don’t hesitate to write to us!
Contact link in the comments. 👇

#data #project #education #people

AI dataset

We kept our word!

Post author By Maciej Ogrodnik
Post date 28 April 2023


We come with positive news! As promised, we managed to exceed 200 GB before May. In fact, the counter currently stands at over 217 GB, although it may well have changed while we were writing this post 🙂

The main sources of the newly acquired data are lifestyle and beauty forums.

We can’t describe the enormity of the work and dedication of our experts. Thank you!

As a result, based on the last update of the OSCAR project from 23.01 (https://lnkd.in/gEgAQygg), we now surpass that project in terms of dataset size by as much as 40%.

We will keep you updated on further changes. We are looking forward to the next news.

Have a productive afternoon 🙂