CoMNIST - a dataset of cyrillic and latin hand-written letters for machine learning

Posted by Gregory Vial on February 28, 2017

Background

Anyone interested in machine learning has heard about MNIST dataset, a repository of hand-written digits made famous by the neural network researcher Yann LeCun.

This dataset contains 70,000 images of digitalized hand-written digits (characters from 0 to 9), and is very often used for educational purposes. Being able to identify which digit an image represents is a classical classification exercise, and people learning neural networks often build their first using MNIST. It can be considered the “Hello, world!” of neural nets.

MNIST stands for Modified NIST and is a subset of a larger dataset released by the NIST (US National Institute for Standards and Technology). NIST provides 800,000 images of digits and letters of the latin alphabet.

Hitorically the NIST database was made available at a prohibitive cost, and although it can now be downloaded for free, it is still copyrighted to the US Secretary of Commerce, so usage that can be made of it is unclear.

Introducing CoMNIST

CoMNIST is a free, crowd-sourced version of MNIST that contains digitalized letters from the cyrillic and latin alphabet.

CoMNIST stands for Cyrillic-oriented MNIST, but also makes reference to the latin meaning of co-: “with”, “together”

CoMNIST is available as a Kaggle dataset

CoMNIST is distributed under the Creative Commons Attribyte/ShareAlike license, which means it can be freely distributed, modified or used, including for commercial purposes.

It is expected that the size of CoMNIST will grow over time, so although it may start small, my hope is that data lovers of all boards will find it useful and contribute largely to it.

Contribute

English version - Russian version - French version

Method

Collection

The initial set of data is being collected through a website (see links in the “contribute” section) that kindly requests anyone interested in the initiative to draw some specified letters, with the finger and on a device with touch screen. The drawings are then saved as an image file on the server.

Users will be reached out mostly through social networks (Facebook, VKontakte, LinkedIn) but also during some data science related meetups in Berlin. We hope to encourage people to participate in this initiative thanks to the objectives, i.e. non commercial, educational applications related to language learning for kids and/or foreign adults (more on this on a later blog post). Parents are encouraged to let their children participate in order to obtain more images, and maybe from a less “canonical” form, as that will potentially help betters train classification models.

Some concepts of gamificaiton have been used to encourage the users to spend more time drawing letters, in order to maximise the number of images collected: rewards, encouragements etc

At the end of the experience (that can be stopped at anytime), user are asked to provide their name and/or email address to be kept informed of further developmens of the project.

Users are also encouraged to share their experience of facebook to increase the number of recruited “drawers”

Sanitation

There is a risk that some of the data input is incorrectly classified or simply not matching any real letter (vandalism, children using the app without supervision). Therefore sanitation is an important part of the process.

To do so, anomaly detection technic have been implemented as described by Andrew Ng in his Coursera course on machine learning. Each letter is analysed and if not considered “standard” based on statistics calculated on the whole corpus of similar letter it is rejected

Next steps

The intention is to build some web apps for language learning as soon as enough data has been collected to properly trained a neural network able to recognize letters/words with a high accuracy.

Stay tuned!