This project is also published as a paper in the COMMIT (Communication and Information Technology) Journal under the title "Hand Symbol Classification for Human-Computer Interaction Using the Fifth Version of YOLO Object Detection".
Human-Computer Interaction (HCI) today mostly relies on physical contact, such as using a mouse to select something in an application. However, conventional HCI poses certain problems. For example, when someone needs to press the submit button of an online form but their hands are dirty or wet, using the mouse could leave it dirty or even damaged. In this research, the authors try to overcome some of these problems with a Computer Vision approach. The research focuses on creating and evaluating an object detection model that classifies hand symbols using YOLO version five. The hand gesture classes defined in this research are 'ok', 'cancel', 'previous', 'next', and 'confirm'. The YOLO version five architecture used here is yolov5m. Overall, the model's performance in classifying hand symbols is 80% accuracy, 95% precision, 84% recall, and 89% F1 score.
YOLO is one of the most popular computer vision algorithms for detecting and classifying objects in an image or video. The name stands for You Only Look Once, and YOLO is categorized as a single-shot detector (SSD) algorithm. In terms of inference speed, SSD-style algorithms outperform algorithms such as R-CNN and Faster R-CNN; the drawback is that their accuracy is not as good as R-CNN's. Joseph Redmon developed the YOLO algorithm on the Darknet framework, and it gained wide popularity starting with YOLO version 3; YOLO version 4 further increased its speed and accuracy. Later on, YOLO version 5 was developed in PyTorch by Glenn Jocher, and that is the version used in this research.
YOLO works by dividing an image into an S × S grid of cells. Each cell is responsible for the objects whose centre point falls inside it. For each cell, the algorithm outputs a bounding box and a confidence value for the detected object class. A bounding box consists of its width, height, class, and centre (bx, by). When detecting an object, the confidence value corresponds to the Intersection over Union (IoU) calculated between the predicted bounding box and the ground truth.
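Since the confidence score is tied to IoU, a minimal sketch of the IoU computation may help. The function below assumes boxes in the (bx, by, w, h) centre format described above; the function name is illustrative, not from the paper:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (bx, by, w, h),
    where (bx, by) is the box centre. Illustrative sketch only."""
    # Convert centre format to corner format (x1, y1, x2, y2).
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2

    # Intersection rectangle; width/height are zero when there is no overlap.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    # Union = total area of both boxes minus the doubly counted overlap.
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - intersection
    return intersection / union if union > 0 else 0.0
```

A perfect prediction gives an IoU of 1, while boxes that do not overlap at all give 0.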
This experiment uses YOLO version 5, specifically yolov5m. The yolov5m architecture consists of a backbone, a neck, and a head.
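For reference, the pretrained yolov5m model can be loaded through PyTorch Hub using the standard Ultralytics interface, as in this sketch (the image path is a placeholder):

```python
import torch

# Load the medium YOLOv5 variant (yolov5m) with pretrained weights
# from the Ultralytics repository via PyTorch Hub.
model = torch.hub.load('ultralytics/yolov5', 'yolov5m', pretrained=True)

# Run inference on a single image (placeholder path).
results = model('hand_symbol_example.jpg')
results.print()  # summary of detected classes and confidences
```

A custom-trained checkpoint can be loaded the same way with `torch.hub.load('ultralytics/yolov5', 'custom', path='best.pt')`.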

The five hand symbol classes are illustrated by example images in the repository: Cancel, Confirm, Previous, Ok, and Next.
The images themselves were captured with the laptop's camera, then processed using LabelImg and saved in YOLO format, which stores the x, y, w, and h of each object. Labelling was done by drawing a bounding box on each image; the box should fit the object as tightly as possible so that not too much background is included.
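For concreteness, a YOLO-format label file contains one line per object: the class index followed by the normalized centre coordinates and box size. The sketch below converts a pixel-space box, as drawn in LabelImg, into that line format; the class index and image size in the example are assumptions for illustration:

```python
def to_yolo_label(class_id, x1, y1, x2, y2, img_w, img_h):
    """Convert a corner-format pixel box (x1, y1, x2, y2) into a YOLO
    label line: 'class x_center y_center width height', normalized to [0, 1]."""
    x = (x1 + x2) / 2 / img_w   # normalized centre x
    y = (y1 + y2) / 2 / img_h   # normalized centre y
    w = (x2 - x1) / img_w       # normalized width
    h = (y2 - y1) / img_h       # normalized height
    return f"{class_id} {x:.6f} {y:.6f} {w:.6f} {h:.6f}"

# Example: a hand symbol box (class index 0 assumed) in a 640x480 frame.
print(to_yolo_label(0, 200, 120, 440, 400, 640, 480))
# -> 0 0.500000 0.541667 0.375000 0.583333
```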
Inference produced a confusion matrix, from which we calculated the accuracy, precision, recall, and F1 score, reported below.
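For reference, all four metrics follow directly from the confusion-matrix counts of true/false positives and negatives. A minimal sketch, with counts chosen only to illustrate the formulas (not the actual results):

```python
def evaluation_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f1        = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Placeholder counts, for illustration only.
acc, prec, rec, f1 = evaluation_metrics(tp=8, fp=2, fn=1, tn=9)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```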
In conclusion, the model achieves an accuracy of 80%, a precision of 95%, a recall of 84%, and an F1 score of 89%. The inference results are lower than those from the training process because inference is affected by variables such as lighting, background, and camera angle.
This research was conducted successfully with the help of several people. I would like to thank: