URDU WORD RECOGNITION MODEL WITH COMPUTER-GENERATED IMAGES
Understanding Deep Learning
A branch of Machine learning.
The technology behind modern image recognition, self-driving cars, modern machine translation and election hacking.
Allows rigorous, repeated estimations on small parts of an image, e.g. tens of pixels at a time.
Requires a very large labelled dataset for training. Requires huge computational power – GPUs (Graphics Processing Units).
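To make "repeated estimations on small parts of an image" concrete, here is a minimal sketch of the sliding-window operation that convolutional layers are built on. The 4x4 toy image and the 2x2 edge kernel are made up for illustration. As in most CNN libraries, the kernel is not flipped, so strictly speaking this is a cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (lists of lists) and return one response per patch."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            # Weighted sum over one small patch of pixels.
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Toy 4x4 image: dark left half (0), bright right half (1).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hypothetical 2x2 kernel that responds to a dark-to-bright vertical edge.
kernel = [
    [-1, 1],
    [-1, 1],
]

response = convolve2d(image, kernel)
print(response)  # the middle column lights up where the edge is
```

In a trained network the kernel values are learned from the labelled data rather than written by hand.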
CONVOLUTIONAL NEURAL NETWORKS
The technology behind image recognition
The following are a few examples:
Image Segments Detection
This is part of Deep Learning
Let’s take a brief look at the history of the Urdu language
It’s approximately 800 years old.
Grammar structure is based on Sanskrit.
Vocabulary from Arabic, Persian and Sanskrit.
Script from Persian, written right-to-left.
Originated around Delhi, Lucknow and parts of Punjab, India. Also one of the official languages of India.
The national language of Pakistan.
Urdu uses a complex non-Latin writing script!
This makes it harder to OCR (Optical Character Recognition) than Latin scripts.
Let me show you some Urdu words. Let’s start with one letter and form words out of it. We will start from the right, because that’s how you write Urdu.
A letter from Urdu, pronounced “Sheh”
A word created with this letter on the right. Pronounced “Koshish”
A word created with this letter on the right. Pronounced “Ashaat”
A word created with this letter on the right. Pronounced “Shuru”
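A small sketch of why this is harder than Latin text, using only Python's standard unicodedata module. It assumes the letter on the slides is ش (Unicode name ARABIC LETTER SHEEN). The Urdu spellings of the three words are also assumptions, since the slides give only the pronunciations. The sketch shows that the letter is right-to-left and that the same letter has four different joined shapes depending on where it sits in a word.

```python
import unicodedata

SHEEN = "\u0634"  # ش (assumed to be the letter shown on the slide)

# Assumed Urdu spellings of the three example words (not given in the slides).
words = {
    "Koshish": "\u06A9\u0648\u0634\u0634",       # کوشش
    "Ashaat": "\u0627\u0634\u0627\u0639\u062A",  # اشاعت
    "Shuru": "\u0634\u0631\u0648\u0639",         # شروع
}

# Bidirectional class "AL" marks a right-to-left Arabic-script letter.
bidi_class = unicodedata.bidirectional(SHEEN)

# Logical (typing-order) positions of the letter; index 0 is drawn rightmost.
positions = {
    name: [i for i, ch in enumerate(word) if ch == SHEEN]
    for name, word in words.items()
}

# The four contextual shapes of the same letter (Arabic Presentation Forms-B).
presentation_forms = ["\uFEB5", "\uFEB6", "\uFEB7", "\uFEB8"]
form_names = [unicodedata.name(ch) for ch in presentation_forms]

# NFKC normalization folds every shape back to the single base letter.
folded = {unicodedata.normalize("NFKC", ch) for ch in presentation_forms}

print(bidi_class)
print(positions)
for name in form_names:
    print(name)
```

A recognizer therefore cannot treat a letter as one fixed glyph: the shape depends on the letter's neighbours in the word.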
Therefore, Urdu has particular text recognition requirements!
Requires a very large labelled dataset of approximately 140,000 words.
For comparison, an MNIST-style digits dataset of 280,000 images in total (the size of EMNIST Digits) has 28,000 images per class, since there are only 10 digit classes.
Tweaking a Convolutional Neural Network Algorithm.
Tuning the hyperparameters for optimal results.
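The tuning step can be sketched as a plain grid search. Everything here is illustrative: the search space values, the `evaluate` callback and the 28-pixel input size are assumptions, not the settings ML Sense used. The helper `conv_output_size` uses the standard formula for the spatial size of a convolution output.

```python
from itertools import product


def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial size of a convolution output along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# Hypothetical search space; real values depend on the model and the data.
search_space = {
    "learning_rate": [1e-2, 1e-3],
    "batch_size": [64, 128],
    "kernel_size": [3, 5],
}


def grid(space):
    """Yield every combination of hyperparameters as a dict."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


def best_config(space, evaluate):
    """Return the configuration with the highest score from `evaluate`.

    `evaluate` is a caller-supplied function that trains a model with the
    given configuration and returns its validation score.
    """
    return max(grid(space), key=evaluate)


configs = list(grid(search_space))
print(len(configs), "configurations to try")
print(conv_output_size(28, kernel_size=5))  # assumed 28-pixel input side
```

Grid search is the simplest option; random search or other strategies plug into the same `evaluate` interface.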
Labelling data for machine learning is…
ML Sense has developed a programmatic way to label this large dataset:
(140,000 words x 20,000 images per word = 2,800,000,000 images)
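The scale is easy to check with a few lines of arithmetic. The word and image counts come from the slides. The image size (28x28 grayscale, one byte per pixel, uncompressed) is purely an assumption, added to give a rough feel for storage.

```python
WORDS = 140_000            # vocabulary size, from the slides
IMAGES_PER_WORD = 20_000   # images per word, from the slides

total_images = WORDS * IMAGES_PER_WORD

# Assumption for illustration only: 28x28 grayscale, 1 byte per pixel, uncompressed.
BYTES_PER_IMAGE = 28 * 28
total_bytes = total_images * BYTES_PER_IMAGE
total_terabytes = total_bytes / 10**12

print(f"{total_images:,} images")
print(f"{total_terabytes:.2f} TB uncompressed")
```

At this scale, labelling by hand is clearly out of reach.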
Our Solution at ML Sense
We generated synthetic data!
We created a way to Automate the Data Labelling Process.
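The idea can be sketched in a few lines: because each image is generated from a known word, the label comes for free. This is not the ML Sense pipeline. `render_word` is a hypothetical stand-in that produces a toy bitmap, where a real system would rasterize the word with an Urdu font. The shift and noise settings are arbitrary.

```python
import random


def render_word(word, width=32, height=8):
    """Stand-in renderer: a deterministic toy bitmap built from the code points.

    A real pipeline would rasterize the word with an Urdu font here.
    """
    bitmap = [[0] * width for _ in range(height)]
    for col, ch in enumerate(word[:width]):
        code = ord(ch)
        for row in range(height):
            bitmap[row][col] = (code >> row) & 1
    return bitmap


def augment(bitmap, rng, noise=0.02):
    """Shift the bitmap horizontally by up to 2 pixels and flip a few pixels."""
    width = len(bitmap[0])
    shift = rng.randint(-2, 2)
    out = []
    for row in bitmap:
        shifted = [row[c - shift] if 0 <= c - shift < width else 0
                   for c in range(width)]
        out.append([1 - p if rng.random() < noise else p for p in shifted])
    return out


def generate_dataset(words, images_per_word, seed=0):
    """Return (image, label) pairs; the label is known by construction."""
    rng = random.Random(seed)
    dataset = []
    for word in words:
        base = render_word(word)
        for _ in range(images_per_word):
            dataset.append((augment(base, rng), word))
    return dataset


# Two assumed example words: کوشش and شروع.
vocabulary = ["\u06A9\u0648\u0634\u0634", "\u0634\u0631\u0648\u0639"]
dataset = generate_dataset(vocabulary, images_per_word=3)
print(len(dataset), "labelled images")
```

Scaling up means a larger vocabulary and a larger `images_per_word`; no manual labelling step is added.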
We generated data for a complex machine learning problem: recognizing words in a language with a non-Latin script.
We found that generated (synthetic) data is as good as real, manually labelled data. It works well for machine learning. In fact, it can be even better, because the generated data is specific to the machine learning problem.
Hence, we can generate synthetic data for any niche problem.