The You Only Look Once (YOLO) model is one of the fastest and most efficient object detection algorithms. There are currently three state-of-the-art approaches to object detection:
- You Only Look Once – YOLO (paper)
- R-CNN and its variants Fast R-CNN, Faster R-CNN, etc. (paper)
- Single Shot Detector – SSD (paper)
This post shows how to run YOLO object detection with OpenCV. Training YOLO from scratch is a non-trivial task, so a pre-trained model is used instead.
Prerequisites:
- Python 3
- OpenCV 4
- Numpy
- Pre-trained YOLOv3 model (see below)
Update (18th August 2020): the ‘mish’ activation function has been added to OpenCV 4.4.0, so YOLOv4 can be used with OpenCV 4.4.0 and above.
Note (9th June 2020): at the time of writing, the latest version of YOLO is YOLOv4. However, when using it with OpenCV 4.0 I got an error: Unsupported activation: mish in function ‘cv::dnn::darknet::ReadDarknetFromCfgStream’. It seems that YOLOv4 uses the ‘mish’ activation function, which is not yet available in OpenCV 4.0. The ‘mish’ activation function is defined as follows (https://github.com/opencv/opencv/issues/17148):
float softplus(float x, float threshold = 20)
{
    if (x > threshold) return x;              // too large
    else if (x < -threshold) return expf(x);  // too small
    return logf(expf(x) + 1);
}

float mish_activation(float input)
{
    const float MISH_THRESHOLD = 20;
    float output = input * tanh(softplus(input, MISH_THRESHOLD));
    return output;
}
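For readers following along in Python, an equivalent of the above in NumPy could look like the small sketch below. It is for illustration only and is not used by the detection code later in this post.

# NumPy sketch of the mish activation described above (illustration only)
import numpy as np

def softplus(x):
    # Numerically stable softplus: log(1 + exp(x))
    return np.logaddexp(0.0, x)

def mish(x):
    # mish(x) = x * tanh(softplus(x))
    return x * np.tanh(softplus(x))

print(mish(np.array([-2.0, 0.0, 2.0])))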
As a result, I used YOLOv3 in this demo; once the above issue is resolved I will upgrade to YOLOv4 in due course.
YOLO source code and pre-trained models are available at https://github.com/AlexeyAB/darknet. You can download the config and weights for YOLOv3 at https://raw.githubusercontent.com/AlexeyAB/darknet/master/cfg/yolov3.cfg and https://pjreddie.com/media/files/yolov3.weights respectively. In addition, a list of object names in the pre-trained model is available at https://github.com/pjreddie/darknet/blob/master/data/coco.names. There are 80 different types of objects: person, bicycle, car, etc.
After downloading the pre-trained config, weights and class names, we can start writing code.
Implementation
First we define two constants, CONF_THRESHOLD and NMS_THRESHOLD. Each detection from YOLO comes with a confidence score (i.e. a probability); by setting CONF_THRESHOLD we keep only detections whose confidence score is higher than this threshold. NMS_THRESHOLD is the Intersection over Union (IoU) threshold used in non-max suppression to remove overlapping bounding boxes.
# Imports needed by the snippets below
import cv2
import numpy as np

# Define constants
# CONF_THRESHOLD is the confidence threshold. Only detections with confidence
# greater than this will be retained
# NMS_THRESHOLD is the IoU threshold used for non-max suppression
CONF_THRESHOLD = 0.3
NMS_THRESHOLD = 0.4
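For intuition about what NMS_THRESHOLD controls, here is a minimal sketch (not part of the post's pipeline) of how IoU between two boxes in [left, top, width, height] format can be computed:

# Illustration only: IoU of two boxes given as [left, top, width, height]
def iou(box_a, box_b):
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    # Intersection rectangle
    ix1 = max(ax1, bx1)
    iy1 = max(ay1, by1)
    ix2 = min(ax1 + aw, bx1 + bw)
    iy2 = min(ay1 + ah, by1 + bh)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# Two heavily overlapping boxes -> IoU above NMS_THRESHOLD, so NMS would drop one of them
print(iou([10, 10, 100, 100], [20, 20, 100, 100]))   # about 0.68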
Get the list of object names:
# Read COCO dataset classes
with open('coco.names', 'rt') as f:
    classes = f.read().rstrip('\n').split('\n')
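A quick optional sanity check confirms the file was parsed correctly; the pre-trained COCO model has 80 class names:

# Optional sanity check on the parsed class names
print(len(classes))   # expected: 80
print(classes[:3])    # expected: ['person', 'bicycle', 'car']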
Darknet, the deep learning framework in which YOLO is trained, is built into OpenCV’s deep neural network (dnn) module and ready to use. We can load the network in just one line of code, passing in the YOLOv3 config and weights:
# Load the network with YOLOv3 weights and config using the darknet framework
net = cv2.dnn.readNet("yolov3.weights", "yolov3.cfg", "darknet")
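Optionally, you can also tell OpenCV which backend and target to use for inference. These two calls are part of OpenCV's dnn API (they are not in the original script) and simply pin inference to the CPU:

# Optional: run inference with OpenCV's own backend on the CPU
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_OPENCV)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)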
Read an input image and create a blob from it to feed into the neural network later:
# Read image
image = cv2.imread(args.image)

# Create blob from image, normalize and don't crop
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
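If you are curious about what blobFromImage produces, the blob is a 4-D NCHW tensor (batch, channels, height, width):

# The blob layout is NCHW: 1 image, 3 channels, 416x416 pixels
print(blob.shape)   # (1, 3, 416, 416)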
Set the input for the neural network and run a forward pass to get the result:
# Get the output layer names used for the forward pass
outNames = net.getUnconnectedOutLayersNames()

# Set the input
net.setInput(blob)

# Run forward pass
outs = net.forward(outNames)
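YOLOv3 has three output (yolo) layers, so outs contains three 2-D arrays. Each row is a candidate detection with 4 box coordinates, an objectness score and 80 class scores (85 columns for COCO). For a 416x416 input the shapes should be roughly as commented below:

# Inspect the raw network output (illustration only)
for out in outs:
    print(out.shape)   # (507, 85), (2028, 85), (8112, 85) for a 416x416 input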
Now we process the output from the neural network and draw the predictions:
# Process output and draw predictions
process_frame(image, outs, classes, CONF_THRESHOLD, NMS_THRESHOLD)
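To inspect the result, you can display the annotated image and write it to disk; the output filename here is just an example:

# Display the annotated image and save a copy (filename is arbitrary)
cv2.imshow('YOLO detections', image)
cv2.waitKey(0)
cv2.imwrite('output.jpg', image)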
The process_frame function is where detections with low confidence scores are eliminated and overlapping bounding boxes are removed using non-max suppression:
def process_frame(frame, outs, classes, confThreshold, nmsThreshold):
    # Get the width and height of the image
    frameHeight = frame.shape[0]
    frameWidth = frame.shape[1]

    # The network produces an output blob with shape NxC where N is the number
    # of detected objects and C is the number of classes + 5: the first 4 numbers
    # are [center_x, center_y, width, height], the fifth is the objectness score,
    # followed by one score per class
    classIds = []
    confidences = []
    boxes = []
    for out in outs:
        for detection in out:
            scores = detection[5:]
            classId = np.argmax(scores)
            confidence = scores[classId]
            if confidence > confThreshold:
                # Scale the detected coordinates back to the frame's original width and height
                center_x = int(detection[0] * frameWidth)
                center_y = int(detection[1] * frameHeight)
                width = int(detection[2] * frameWidth)
                height = int(detection[3] * frameHeight)
                left = int(center_x - width / 2)
                top = int(center_y - height / 2)
                # Save the classId, confidence and bounding box for later use
                classIds.append(classId)
                confidences.append(float(confidence))
                boxes.append([left, top, width, height])

    # Apply non-max suppression
    indices = cv2.dnn.NMSBoxes(boxes, confidences, confThreshold, nmsThreshold)
    for i in indices:
        i = i[0]
        box = boxes[i]
        left = box[0]
        top = box[1]
        width = box[2]
        height = box[3]
        draw_prediction(frame, classes, classIds[i], confidences[i],
                        left, top, left + width, top + height)
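One note for newer OpenCV releases: in more recent versions cv2.dnn.NMSBoxes returns a flat 1-D array of indices, in which case the i = i[0] unwrapping above raises an error. A version-tolerant variant of that loop (a sketch, not from the original script) is:

# Works whether NMSBoxes returns an Nx1 array (older OpenCV) or a flat array (newer OpenCV)
indices = cv2.dnn.NMSBoxes(boxes, confidences, confThreshold, nmsThreshold)
for i in np.array(indices).flatten():
    left, top, width, height = boxes[i]
    draw_prediction(frame, classes, classIds[i], confidences[i],
                    left, top, left + width, top + height)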
Prediction boxes are drawn along with the class name and confidence score:
# Draw a prediction box with confidence and title
def draw_prediction(frame, classes, classId, conf, left, top, right, bottom):
    # Draw a bounding box
    cv2.rectangle(frame, (left, top), (right, bottom), (0, 255, 0))

    # Assign confidence to label
    label = '%.2f' % conf

    # Prepend the class name to the label
    if classes:
        assert(classId < len(classes))
        label = '%s: %s' % (classes[classId], label)

    labelSize, baseLine = cv2.getTextSize(label, cv2.FONT_HERSHEY_SIMPLEX, 0.5, 1)
    top = max(top, labelSize[1])
    cv2.rectangle(frame, (left, top - labelSize[1]), (left + labelSize[0], top + baseLine),
                  (255, 255, 255), cv2.FILLED)
    cv2.putText(frame, label, (left, top), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 0))
To run YOLO object detection on an image:
python yolo_detect_image.py --image name_of_your_image_here
For example, with this input:
The output will be:
Each bounding box comes with an object type (e.g. person, car, motorbike, traffic light, etc.) and a confidence score (e.g. 0.97 means 97% confident).
Similarly, to run YOLO object detection on a video:
python yolo_detect_video.py --video name_of_your_video_here
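The video script follows the same pipeline, just applied to every frame. A rough sketch of how the per-frame loop might look, reusing net, classes and process_frame from above (this is an illustration, not the exact source from the repository):

# Rough sketch of a per-frame detection loop for video (illustration only)
cap = cv2.VideoCapture(args.video)
while cv2.waitKey(1) < 0:
    hasFrame, frame = cap.read()
    if not hasFrame:
        break
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    outs = net.forward(net.getUnconnectedOutLayersNames())
    process_frame(frame, outs, classes, CONF_THRESHOLD, NMS_THRESHOLD)
    cv2.imshow('YOLO detections', frame)
cap.release()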
An example can be seen in the video below:
Full source code can be downloaded at https://github.com/minhthangdang/ObjectDetectionYOLO.
Credits: The implementation in this post took inspiration from https://github.com/opencv/opencv/blob/8c25a8eb7b10fb50cda323ee6bec68aa1a9ce43c/samples/dnn/object_detection.py