When AlexNet (paper) won the ImageNet Large Scale Visual Recognition Challenge in 2012, it sent a shock wave through the computer vision research community. Even though neural networks had been around for decades, it was AlexNet that established the deep convolutional neural network (CNN) as a widely recognised solution to many computer vision problems.
There are now many other CNN architectures that are more sophisticated and more powerful. However, in many cases a “simple” AlexNet can still be very effective. In this post I’m going to use the AlexNet architecture for the task of character recognition.
The dataset
The dataset comes from the Chars74K dataset, which contains 74K images across 64 classes (0-9, A-Z, a-z). It includes characters cropped from natural images (7,705), hand-drawn characters (3,410), and characters synthesised from computer fonts (62,992). For this post, I’m going to use only the synthesised characters from computer fonts, restricted to digits (0-9) and uppercase letters (A-Z). The reduced dataset can be downloaded here; it has 36,576 images across 36 classes.
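Once extracted, each class sits in its own subdirectory (in the Fnt naming convention, Sample001 through Sample010 are the digits and Sample011 through Sample036 the uppercase letters), which is exactly the layout Keras’ flow_from_directory expects later. A quick sanity-check sketch, assuming the dataset was extracted to the same path used in the training script below:

import os

DATASET_PATH = './English/Fnt/'  # assumed extraction path, matching the training script below

# Each SampleXXX folder is one class; print how many images each contains
for name in sorted(os.listdir(DATASET_PATH)):
    class_dir = os.path.join(DATASET_PATH, name)
    if os.path.isdir(class_dir):
        print(name, len(os.listdir(class_dir)))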
Random samples from the dataset are shown below.
AlexNet Architecture
The AlexNet architecture has eight layers: five convolutional layers followed by three fully connected layers. The first convolutional layer has 96 kernels of size 11×11 with a stride of 4. The second has 256 kernels of size 5×5. The third and fourth have 384 kernels of size 3×3 each, and the fifth has 256 kernels of size 3×3. The first two fully connected layers have 4096 neurons each, and the third is the output layer. Every convolutional and fully connected layer is followed by a ReLU activation (the output layer uses softmax), and max pooling with a 3×3 window and a stride of 2 is applied after the first, second, and fifth convolutional layers.
The input image in the original AlexNet paper is 224×224 pixels. However, the 74K dataset has images of size 128×128, and I stick with the width and height of the 74K dataset.
The AlexNet-like architecture for the 74K dataset is illustrated in Fig. 2. Please note the input image size is different from that of the original paper.
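Because the input is 128×128 rather than 224×224, the intermediate feature map sizes also differ from the paper. Here is a quick back-of-the-envelope check using the standard output-size formula for ‘valid’ padding (the ‘same’-padded convolutions in the implementation below leave the size unchanged):

def conv_out(size, kernel, stride):
    # 'valid' padding: floor((size - kernel) / stride) + 1
    return (size - kernel) // stride + 1

size = 128
size = conv_out(size, 11, 4)  # conv1 (11x11, stride 4) -> 30
size = conv_out(size, 3, 2)   # pool1 (3x3, stride 2)   -> 14
size = conv_out(size, 3, 2)   # pool2 after conv2       -> 6
size = conv_out(size, 3, 2)   # pool5 after conv3-5     -> 2
print(size * size * 256)      # flattened vector: 2*2*256 = 1024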
Implementation
The model can be implemented in Keras (running on the TensorFlow backend) as follows:
from keras.models import Sequential
from keras.layers.core import Activation, Flatten, Dropout, Dense
from keras.layers import Conv2D, MaxPooling2D

def model(num_classes, input_shape):
    model = Sequential()

    # 1st Convolutional Layer
    model.add(Conv2D(filters=96, input_shape=input_shape,
                     kernel_size=(11,11), strides=(4,4), padding='valid'))
    model.add(Activation('relu'))
    # Max Pooling
    model.add(MaxPooling2D(pool_size=(3,3), strides=(2,2), padding='valid'))

    # 2nd Convolutional Layer
    model.add(Conv2D(filters=256, kernel_size=(5,5), strides=(1,1), padding='same'))
    model.add(Activation('relu'))
    # Max Pooling
    model.add(MaxPooling2D(pool_size=(3,3), strides=(2,2), padding='valid'))

    # 3rd Convolutional Layer
    model.add(Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), padding='same'))
    model.add(Activation('relu'))

    # 4th Convolutional Layer
    model.add(Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), padding='same'))
    model.add(Activation('relu'))

    # 5th Convolutional Layer
    model.add(Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), padding='same'))
    model.add(Activation('relu'))
    # Max Pooling
    model.add(MaxPooling2D(pool_size=(3,3), strides=(2,2), padding='valid'))

    # Pass the pooled feature maps to the fully connected layers
    model.add(Flatten())

    # 1st Fully Connected Layer
    model.add(Dense(4096))
    model.add(Activation('relu'))
    # Add Dropout to prevent overfitting
    model.add(Dropout(0.5))

    # 2nd Fully Connected Layer
    model.add(Dense(4096))
    model.add(Activation('relu'))
    # Add Dropout to prevent overfitting
    model.add(Dropout(0.5))

    # Output Layer
    model.add(Dense(num_classes))
    model.add(Activation('softmax'))

    return model
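As a quick sanity check, we can instantiate the network for 36 classes and a 128×128 RGB input and print its layer summary:

# Build the network for 36 classes and 128x128 RGB input, then inspect the layers
alexnet = model(36, (128, 128, 3))
alexnet.summary()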
Now that the model is ready, we can set up the data and hyperparameters to train it.
First we define some constants for the program: DATASET_PATH (where to find the dataset), MODEL_PATH (where to save the model after training), BATCH_SIZE (the batch size), EPOCHS (the number of epochs), and TARGET_WIDTH, TARGET_HEIGHT, and TARGET_DEPTH (the width, height, and depth of the input image, respectively).
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import ReduceLROnPlateau
import os

# Define constants
DATASET_PATH = './English/Fnt/'
MODEL_PATH = '.'
BATCH_SIZE = 128
EPOCHS = 20
TARGET_WIDTH = 128
TARGET_HEIGHT = 128
TARGET_DEPTH = 3
Next we split the data into two parts: 80% for training and 20% for validation. This is done via Keras’ ImageDataGenerator:
# Set up the data generator to flow data from disk
print("[INFO] Setting up Data Generator...")
data_gen = ImageDataGenerator(validation_split=0.2, rescale=1./255)

train_generator = data_gen.flow_from_directory(
    DATASET_PATH,
    subset='training',
    target_size=(TARGET_WIDTH, TARGET_HEIGHT),
    batch_size=BATCH_SIZE
)

val_generator = data_gen.flow_from_directory(
    DATASET_PATH,
    subset='validation',
    target_size=(TARGET_WIDTH, TARGET_HEIGHT),
    batch_size=BATCH_SIZE
)
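One detail worth verifying: flow_from_directory assigns class indices in alphabetical order of the subdirectory names, and the hard-coded label list in predict.py below must follow the same order. You can inspect the mapping with:

# Inspect which folder maps to which class index
print(train_generator.class_indices)
# e.g. {'Sample001': 0, 'Sample002': 1, ...} with the Fnt folder naming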
We build the AlexNet model using the model function defined earlier, then compile it:
# Build model
print("[INFO] Compiling model...")
alexnet = model(train_generator.num_classes,
                (TARGET_WIDTH, TARGET_HEIGHT, TARGET_DEPTH))

# Compile the model
alexnet.compile(optimizer='adam',
                loss='categorical_crossentropy',
                metrics=['accuracy'])
Before we train the network, we set up a learning-rate schedule. The ReduceLROnPlateau callback monitors the loss value and, if no improvement is seen for 2 epochs, multiplies the learning rate by a factor of 0.2:
# Set the learning rate decay
# min_lr must be below the initial rate (Adam's default is 1e-3),
# otherwise the reduction never takes effect
reduce_lr = ReduceLROnPlateau(monitor='loss', factor=0.2, patience=2, min_lr=1e-5)
Finally we can train the model and save it to disk:
# Train the network
print("[INFO] Training network ...")
H = alexnet.fit_generator(
    train_generator,
    validation_data=val_generator,
    steps_per_epoch=train_generator.samples // BATCH_SIZE,
    validation_steps=val_generator.samples // BATCH_SIZE,
    epochs=EPOCHS,
    verbose=1,
    callbacks=[reduce_lr])

# Save the model to disk
print("[INFO] Serializing network...")
alexnet.save(MODEL_PATH + os.path.sep + "trained_model")
print("[INFO] Done!")
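The History object returned by fit_generator records the per-epoch metrics, so the learning curves can be plotted after training. Here is a minimal sketch using matplotlib, which is not part of the original script (note that older Keras versions store accuracy under 'acc'/'val_acc', newer ones under 'accuracy'/'val_accuracy'):

import matplotlib.pyplot as plt

# Plot training and validation accuracy per epoch from the History object
plt.plot(H.history['acc'], label='train accuracy')
plt.plot(H.history['val_acc'], label='validation accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.savefig('training_curve.png')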
This model achieved 97.29% accuracy on the training set and 95.46% accuracy on the validation set. Considering that many characters are difficult even for humans to distinguish, such as 1 and I, 2 and Z, 0 and O, 5 and S, this is quite an impressive result.
After training, we can use the trained model to predict handwritten characters that it has not seen before. Save the following code in a file called predict.py:
import argparse
import numpy as np
from keras.models import load_model
from keras.preprocessing.image import img_to_array
import cv2

# Construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True, help="path to input image")
args = vars(ap.parse_args())

# Class labels in the same order as the training generator's class indices
labels = [
    '0','1','2','3','4','5','6','7','8','9','A','B','C','D','E','F','G',
    'H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z'
]

# Define constants
TARGET_WIDTH = 128
TARGET_HEIGHT = 128
MODEL_PATH = './trained_model'

# Load the image
original_image = cv2.imread(args["image"])

# Preprocess the image the same way as during training
image = cv2.resize(original_image, (TARGET_WIDTH, TARGET_HEIGHT))
image = image.astype("float") / 255.0
image = img_to_array(image)
image = np.expand_dims(image, axis=0)

# Load the trained convolutional neural network
print("[INFO] Loading my model...")
model = load_model(MODEL_PATH, compile=False)

# Classify the input image, then find the index of the class with the *largest* probability
print("[INFO] Classifying image...")
prob = model.predict(image)[0]
idx = np.argmax(prob)

# Display the original image
cv2.imshow("Original Image", original_image)
cv2.waitKey(0)

# Display the predicted label on the image
cv2.putText(original_image, 'Character is ' + labels[idx], (10, 100),
            cv2.FONT_HERSHEY_SIMPLEX, 2, (255,0,255), 2)
cv2.imshow("Recognised Image", original_image)
cv2.waitKey(0)
To test it on an image, just run:
python predict.py --image test1.png
Bravo, it guessed the character correctly.