The Sign Language MNIST database contains pictures of American Sign Language hand gestures. Each picture belongs to one of 24 letter classes (excluding J and Z, which require motion). The task is to train a computer vision model on this database to identify which letter a given picture represents.
We build a Convolutional Neural Network (CNN) using TensorFlow (Keras frontend) to solve this multi-class classification problem. The original dataset is relatively small (1,704 hand gesture pictures) and has been augmented to 27,455 training samples and 7,172 test samples by applying various image transformations (producing believable-looking variations of the original pictures). Most samples are therefore strongly correlated with the few original ones, so overfitting will be the main issue.
Our strategy is to add a dense classifier on top of a VGG16 convolutional base pre-trained on the ImageNet dataset. We fine-tune the last VGG16 block (which encodes more specialized features than the earlier layers, and is not necessarily useful for our dataset as-is) and freeze the others to reduce overfitting.
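A minimal sketch of this setup is shown below. It assumes the 28x28 grayscale images have been resized to 64x64 and replicated to 3 channels to match VGG16's expected input, and that the 24 letter labels have been remapped to consecutive integers 0 to 23; the layer sizes and learning rate are illustrative, not the exact values from the notebook.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Assumptions: images resized to 64x64 RGB, labels remapped to 0..23.
IMG_SHAPE = (64, 64, 3)
NUM_CLASSES = 24

# Convolutional base pre-trained on ImageNet, without its dense top.
conv_base = keras.applications.VGG16(
    weights="imagenet", include_top=False, input_shape=IMG_SHAPE
)

# Freeze everything except the last convolutional block (block5),
# so only the most specialized features are re-learned.
conv_base.trainable = True
for layer in conv_base.layers:
    layer.trainable = layer.name.startswith("block5")

# Dense classifier on top of the (mostly frozen) convolutional base.
model = keras.Sequential([
    conv_base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

# Low learning rate so fine-tuning does not wreck the pre-trained weights.
# Integer labels assumed; use categorical_crossentropy for one-hot labels.
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-4),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()
```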
Another strategy to mitigate overfitting would have been data augmentation at training time (i.e., applying random image transformations at each training epoch, so that the training algorithm never sees exactly the same image twice). However, the dataset has already been artificially enlarged this way, so further random transformations do not seem likely to significantly reduce overfitting.
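For reference, such an on-the-fly augmentation stage (not used in the final model) could be expressed with Keras preprocessing layers along these lines; the transformation types and ranges here are purely illustrative.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical augmentation pipeline: random transformations are applied
# on the fly, so each epoch sees slightly different training images.
data_augmentation = keras.Sequential([
    layers.RandomRotation(0.05),
    layers.RandomZoom(0.1),
    layers.RandomTranslation(0.1, 0.1),
])
```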
The code is available here: Jupyter notebook.
Evaluating the model on the test set leads to 99.14% accuracy.
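For completeness, the evaluation step would look roughly like the following, assuming `x_test` and `y_test` (hypothetical names) hold the preprocessed test images and integer labels.

```python
# x_test / y_test: preprocessed test images and integer labels (assumed names).
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)
print(f"Test accuracy: {test_acc:.2%}")
```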