Inkyvoices is a first step towards teaching machines to understand spoken Arabic. Although 25 countries consider Arabic their first language and approximately 375 million people speak it, the language is still inadequately represented in the field of natural language processing (NLP).
Through this platform, we aim to bring Arabic NLP to the same level as English NLP.
The first step in teaching human speech to a machine is to provide it with enough samples. Inkyvoices is an initiative to gather as much data as possible through the power of crowdsourcing, so that we can push the community to innovate and be creative.
The project invites members of the Arabic-speaking community to contribute their own recordings to what is intended to be the largest available corpus of annotated Arabic audio.
The Inkyvoices dataset will be composed of texts with corresponding audio recordings. We aim to attach further information to each recording so that it can be categorised, which widens the range of potential uses for the dataset. The goal is to improve the state of Arabic audio processing by drawing on the innovative power of a community with unrestricted access to a rich, public dataset.
One of the main reasons we created the platform is the desire for a dataset that can be used to build systems that convert Arabic text into speech.
The dataset will also help build systems that automatically transcribe speech into subtitles, a capability with a very wide range of uses.
The ability to determine a speaker's origin, age, gender and similar attributes can be valuable in a wide variety of research fields.
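To make the idea of a text paired with an audio recording and speaker information more concrete, here is a minimal sketch in Python of what a single entry might look like. It is an illustration only, not the actual Inkyvoices schema: the class name `Recording`, the field names, the file path and the sample values are all hypothetical.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass
class Recording:
    """One hypothetical dataset entry: a text and the audio of it being read."""
    text: str                      # the Arabic text that was read aloud
    audio_path: str                # location of the audio file (assumed layout)
    origin: Optional[str] = None   # speaker's origin, if provided
    age: Optional[int] = None      # speaker's age, if provided
    gender: Optional[str] = None   # speaker's gender, if provided


def to_json_line(rec: Recording) -> str:
    """Serialise an entry as one JSON line, keeping Arabic characters readable."""
    return json.dumps(asdict(rec), ensure_ascii=False)


def with_label(records: List[Recording], field: str) -> List[Recording]:
    """Keep only the entries where the given speaker attribute is filled in."""
    return [r for r in records if getattr(r, field) is not None]


# Example values are invented for illustration.
samples = [
    Recording(text="مرحبا", audio_path="clips/0001.wav", origin="Morocco", age=31),
    Recording(text="شكرا", audio_path="clips/0002.wav"),
]
labelled = with_label(samples, "origin")
```

The speaker attributes are optional in this sketch because a contributor may not wish to share them; entries without them remain usable for text-to-speech or transcription work, while classification tasks can select only the labelled entries.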
By making this dataset open and public, we open the door to new and innovative tasks that the community can collectively envision and realise.