Abstract

Machine Learning (ML) is now used to build applications in domains such as object detection, image classification, and speech-to-text. Deep Neural Networks (DNNs) are at the core of modern ML, as they offer remarkable accuracy and performance across a wide range of tasks. Despite their powerful capabilities, DNNs often require substantial computational resources, which makes them challenging to deploy on edge devices; consequently, these models have to be optimized before deployment. Optimizing a model means making it smaller and more efficient while preserving as much of its performance as possible: even though techniques such as pruning reduce the number of parameters, the goal is to keep accuracy and inference speed close to those of the original model. We present a hybrid solution that combines two techniques, pruning and quantization. Pruning is the process of eliminating inessential weights and connections in order to reduce the model size. Once the unnecessary parameters are removed, the model is quantized by converting the weights of the remaining parameters from 32-bit floating-point precision (FP32) to 16-bit half precision (FP16). We verify and validate the performance of this hybrid approach using the COCO dataset (80 classes) and the pre-trained YOLOv8 model. Finally, the hybrid model is deployed on an edge device, the NVIDIA Jetson Nano (4 GB).
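As a rough illustration of the two-step pipeline described above, the sketch below applies magnitude pruning followed by FP16 conversion in PyTorch. The toy convolutional model and the 30% pruning ratio are assumptions for demonstration only, not the paper's actual YOLOv8 configuration:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy model standing in for a detector backbone (illustrative only)
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1),
)

# Step 1: pruning -- zero out the 30% smallest-magnitude weights
# in each convolutional layer (assumed ratio, for illustration)
for module in model.modules():
    if isinstance(module, nn.Conv2d):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the pruning permanent

# Step 2: quantization -- convert the remaining weights from FP32 to FP16
model = model.half()

# Inspect the result: ~30% of the weights are zero and stored in FP16
w = model[0].weight
sparsity = (w == 0).float().mean().item()
print(f"sparsity: {sparsity:.2f}, dtype: {w.dtype}")
```

In practice, a pruned model is usually fine-tuned before quantization to recover accuracy; the sketch omits that step for brevity.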