Abstract

The widespread adoption of ML models has created the need to accelerate them directly in hardware. Hardware acceleration satisfies the high-performance demands of ML applications with real-time response requirements, while also enabling the energy-efficient implementations required by edge and mobile applications. In this short report, we present a proposed architecture for ML acceleration that combines (a) a vector processor that supports structured sparsity, (b) a systolic-array architecture with a fine-tuned pipeline organization for integer and reduced-precision floating-point arithmetic, and (c) a low-cost RISC-V scalar core that allows dual issue of compressed 16-bit instructions with minimal hardware overhead. Together, these features substantially improve instruction throughput and reduce execution time. The programmable nature of the proposed architecture allows it to handle both current and future ML applications efficiently, while its overall organization lets it approach the energy efficiency of application-specific designs.