# Co-Design for Edge Artificial Intelligence: Application-Specific System on Chip

#### **NOVEMBER 13, 2024**

**RESEARCH REVIEW 2024** 

Dr. John G. Wohlbier Principal Research Scientist Advanced Computing Lab Lead

[DISTRIBUTION STATEMENT A] Approved for public release and unlimited distribution.

©2024 Carnegie Mellon University



#### **Document Markings**

Copyright 2024 Carnegie Mellon University.

This material is based upon work funded and supported by the Department of Defense under Contract No. FA8702-15-D-0002 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center.

The view, opinions, and/or findings contained in this material are those of the author(s) and should not be construed as an official Government position, policy, or decision, unless designated by other documentation.

NO WARRANTY. THIS CARNEGIE MELLON UNIVERSITY AND SOFTWARE ENGINEERING INSTITUTE MATERIAL IS FURNISHED ON AN "AS-IS" BASIS. CARNEGIE MELLON UNIVERSITY MAKES NO WARRANTIES OF ANY KIND, EITHER EXPRESSED OR IMPLIED, AS TO ANY MATTER INCLUDING, BUT NOT LIMITED TO, WARRANTY OF FITNESS FOR PURPOSE OR MERCHANTABILITY, EXCLUSIVITY, OR RESULTS OBTAINED FROM USE OF THE MATERIAL. CARNEGIE MELLON UNIVERSITY DOES NOT MAKE ANY WARRANTY OF ANY KIND WITH RESPECT TO FREEDOM FROM PATENT, TRADEMARK, OR COPYRIGHT INFRINGEMENT.

[DISTRIBUTION STATEMENT A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.

This material may be reproduced in its entirety, without modification, and freely distributed in written or electronic form without requesting formal permission. Permission is required for any other use. Requests for permission should be directed to the Software Engineering Institute at permission@sei.cmu.edu.

DM24-1471


#### **Agenda**

- Application-specific systems on chip (SoCs)
- Modulation recognition
- Accelerator implementation
- SoC integration
- Future work
- Resources

#### **Application-Specific SoCs**





- Edge devices, such as drones, are powered by microelectronics.
  - Accelerometer
  - Camera
  - Flight controller
  - GPS module
  - Speed controller
  - Radios (receive and transmit [Tx/Rx])

What is an SoC?

- A chip containing components that comprise a system
- Components: CPU cores, GPU cores, memory cores, accelerators
- Accelerators: fast Fourier transform (FFT), natural language processing (NLP), NVIDIA Deep Learning Accelerator (NVDLA), Tx/Rx


#### **Application-Specific SoCs**

[Figure: block diagram of an example application-specific SoC. Tile labels include Ariane, Mem, LPDDR, FPGA, NLP EdgeBERT, NCA (Audio), NVDLA w/ PM, inline Crypto, FFT w/ PM, FFT w/o PM, Systolic, Viterbi w/ PM, and Night Vision.]

Image courtesy of IBM EPOCHS

- Electronics' efficiency plays a major role in system performance.
  - Designing an SoC for a particular deployment can yield major benefits.
- What makes an SoC application specific?
  - The mix of system components is chosen to fit the application.
  - Examples: object detection, object classification, signal characterization
- How do you design an application-specific SoC?
  - Hardware description languages (HDL): Chisel, Verilog
  - High-level synthesis (HLS): Bambu, oneAPI, AMD/Xilinx
  - Frameworks: Chipyard, Embedded Scalable Platforms (ESPs)

"In general, *compute* can improve mission time and lower energy consumption by as much as 5X."

Boroujerdian, Behzad; Genc, Hasan; Krishnan, Srivatsan; Faust, Aleksandra; & Reddi, Vijay Janapa. Why Compute Matters for UAV Energy Efficiency? International Symposium on Aerial Robotics. 2018.


## **Modulation Recognition**



- Edge environment senses radio frequency spectrum
- Workload designed to make sense of detected signals
- Signal processing for detection and data rate computation
- Machine learning (ML) for modulation classification


### **Accelerator Design Flow**

- Identify hot spots of application(s) to accelerate.
  - Profiling, source code analysis, expert knowledge
- Accelerator design options
  - HDL: code the accelerator by hand (time consuming, but can achieve optimal performance)
  - HLS: tools lower code to HDL (much faster, but unlikely to achieve optimal performance)
- High-level synthesis
  - HLS tools supporting PyTorch and TensorFlow are immature
- Emerging approaches
  - Multi-Level Intermediate Representation (MLIR) compiler framework for model lowering
  - Dialect abstraction to enable different types of optimizations
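The hot-spot identification step can be illustrated with a small profiling sketch. This is not the project's profiling setup: the stage functions `naive_dft` and `estimate_mean_power` are hypothetical stand-ins for an FFT-heavy stage and a lightweight stage, and the sketch only shows how Python's standard `cProfile` and `pstats` modules can rank stages by cumulative time.

```python
import cProfile
import cmath
import pstats


def naive_dft(samples):
    """Hypothetical heavy stage: O(N^2) discrete Fourier transform."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def estimate_mean_power(samples):
    """Hypothetical light stage: mean power of the samples."""
    return sum(abs(s) ** 2 for s in samples) / len(samples)


def run_pipeline(samples):
    spectrum = naive_dft(samples)
    return estimate_mean_power(samples), spectrum


samples = [complex(i % 7, 0) for i in range(128)]

profiler = cProfile.Profile()
profiler.enable()
run_pipeline(samples)
profiler.disable()

# pstats keys are (filename, line, function name); index 3 of the value
# tuple is the cumulative time spent in the function and its callees.
stats = pstats.Stats(profiler)
stage_names = {"naive_dft", "estimate_mean_power"}
cumulative = {
    func[2]: timing[3]
    for func, timing in stats.stats.items()
    if func[2] in stage_names
}
hot_spot = max(cumulative, key=cumulative.get)
print("Candidate for acceleration:", hot_spot)
```

In this toy case the transform stage dominates, which mirrors why an FFT-heavy kernel is a natural accelerator candidate.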

## Strip Spectral Correlation Algorithm (SSCA)

**Definition:** The cyclic auto-correlation function of a time-series x(t) is calculated as follows:

$$R_x^{\alpha}(\tau) = \int_{-\infty}^{\infty} x\left(t - \frac{\tau}{2}\right) x^*\left(t + \frac{\tau}{2}\right) e^{-i2\pi\alpha t} dt$$

where (\*) denotes complex conjugation. By the Wiener-Khinchin theorem, the spectral correlation density is then:

$$S_x^{\alpha}(f) = \int_{-\infty}^{\infty} R_x^{\alpha}(\tau) e^{-i2\pi f\tau} d\tau$$

#### Cyclostationary processes

- Signals whose statistical properties vary cyclically with time
- Functions: cyclic autocorrelation, spectral correlation function (SCF)
- $\alpha$ is the cyclic frequency (CF)
- SSCA estimates the SCF for all CFs
- For example, an 8-phase-shift keying (8-PSK) signal at a specific modulation rate produces the detection shown on the next slide.
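A discrete-time sketch of the cyclic autocorrelation may make the idea of a cyclic frequency more concrete. It is a direct time-average estimate, not the SSCA, and it makes several assumptions that are not from the slides: an asymmetric lag, $x[n]\,x^*[n+\tau]$, instead of the symmetric $\pm\tau/2$ form (this changes only the phase of the result), rectangular pulses, 8 samples per symbol, and a fixed random seed.

```python
import cmath
import random


def cyclic_autocorr(x, alpha, lag):
    """Time-average estimate of the cyclic autocorrelation.

    alpha is in cycles per sample and lag is in samples. Uses the
    asymmetric product x[n] * conj(x[n + lag]) (an assumption of this sketch).
    """
    n_terms = len(x) - lag
    acc = 0j
    for n in range(n_terms):
        acc += x[n] * x[n + lag].conjugate() * cmath.exp(-2j * cmath.pi * alpha * n)
    return acc / n_terms


rng = random.Random(0)   # fixed seed so the example is repeatable
sps = 8                  # samples per symbol (assumed)
num_symbols = 256

# Random 8-PSK symbols held for `sps` samples each (rectangular pulse).
symbols = [cmath.exp(2j * cmath.pi * rng.randrange(8) / 8) for _ in range(num_symbols)]
x = [s for s in symbols for _ in range(sps)]

lag = sps // 2
on_cycle = abs(cyclic_autocorr(x, 1 / sps, lag))     # alpha at the symbol rate
off_cycle = abs(cyclic_autocorr(x, 1.5 / sps, lag))  # alpha between harmonics

print(f"|R| at the symbol rate:    {on_cycle:.3f}")
print(f"|R| away from the rate:    {off_cycle:.3f}")
```

The estimate is large when $\alpha$ equals the symbol rate and close to zero elsewhere, which is the feature a detector can use to recover the modulation rate.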

**SSCA** 



Detect Coherence vs. Cyclic Frequency





8-PSK Symbol Constellation
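For reference, the ideal 8-PSK constellation can be generated in a few lines. This sketch assumes unit-energy symbols with the first point at a phase of 0 degrees; the slide's figure may use a different rotation or scaling.

```python
import cmath
import math

# Eight points equally spaced on the unit circle, 45 degrees apart.
constellation = [cmath.exp(1j * 2 * math.pi * k / 8) for k in range(8)]
phases_deg = [round(math.degrees(cmath.phase(p))) % 360 for p in constellation]

for k, (point, phase) in enumerate(zip(constellation, phases_deg)):
    print(f"symbol {k}: {point.real:+.3f}{point.imag:+.3f}j  ({phase} deg)")
```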



### **Modulation Recognition Profile**



- Scan
  - Radio sweeps
- Detect
  - Noise floor estimate
- Characterize
  - SSCA symbol rate
  - ML classification of modulation types
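The detect stage can be sketched as a noise floor estimate followed by a threshold test. The slides do not say how the noise floor is estimated, so the median-based estimate, the 6 dB margin, and the sample power values below are all assumptions chosen for illustration.

```python
import statistics


def detect_bins(power_db, margin_db=6.0):
    """Return (noise floor in dB, indices of bins above floor + margin).

    The median is used as a simple noise floor estimate that a few
    occupied bins do not pull upward (an assumption of this sketch).
    """
    noise_floor_db = statistics.median(power_db)
    hits = [i for i, p in enumerate(power_db) if p > noise_floor_db + margin_db]
    return noise_floor_db, hits


# Hypothetical power readings (dB) from one radio sweep; bins 3 and 4 hold a signal.
power_db = [-100.2, -99.8, -100.5, -80.0, -79.5, -100.1, -99.9, -100.3]
noise_floor_db, hits = detect_bins(power_db)

print(f"Noise floor: {noise_floor_db:.1f} dB, detections in bins {hits}")
```

Bins flagged here would then be passed to the characterize stage for symbol rate estimation and modulation classification.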

#### **SSCA** Accelerator



| SSCA Algorithm Implementation           | Total Duration (s) |
|-----------------------------------------|--------------------|
| Reference Application                   | 334.12             |
| SSCA RTL Accelerator - 1 Length N FFT   | 67.04              |
| SSCA RTL Accelerator - 16 Length N FFTs | 15.59              |

- SSCA: FFT-heavy workload
- Implemented in VHDL with Xilinx FFT IP
- Synthesized into an SoC with a RISC-V CPU core; SSCA run on the VCU118
- Defined a benchmark to compare the PyTorch implementation on Orin to the field-programmable gate array (FPGA) implementation
  - Approximately 5x to 20x acceleration
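The quoted acceleration range can be checked against the durations in the table. This sketch only divides the reference duration by each accelerator duration; it assumes the three rows were measured on the same benchmark.

```python
# Total durations in seconds, copied from the table above.
durations_s = {
    "Reference Application": 334.12,
    "SSCA RTL Accelerator - 1 Length N FFT": 67.04,
    "SSCA RTL Accelerator - 16 Length N FFTs": 15.59,
}

reference_s = durations_s["Reference Application"]
speedups = {
    name: reference_s / duration
    for name, duration in durations_s.items()
    if name != "Reference Application"
}

for name, speedup in speedups.items():
    print(f"{name}: {speedup:.2f}x")
```

The ratios come out to roughly 5.0x with one FFT and 21.4x with 16 FFTs, consistent with the range stated on the slide.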


### **SSCA** Accelerator


| Device Type            | LUTs (%) | Registers (%) | F7 Muxes (%) | F8 Muxes (%) | LUT as Logic (%) | LUT as Memory (%) | Block RAM Tile (%) | DSPs (%) | Bonded IOB (%) |
|------------------------|-------------|-------------------|-----------------|-----------------|---------------------|----------------------|-----------------------|----------|-------------------|
| profpga-xc7v2000t      | 35          | 24                | 1               | 0               | 28                  | 24                   | 18                    | 91       | 12                |
| profpga-xcvu440        | 17          | 12                | 1               | 0               | 13                  | 18                   | 9                     | 68       | 10                |
| xilinx-vc707-xc7vx485t | 139         | 97                | 4               | 2               | 112                 | 64                   | 23                    | 70       | 20                |
| xilinx-vcu118-xcvu9p   | 36          | 25                | 1               | 1               | 29                  | 14                   | 11                    | 29       | 17                |
| xilinx-vcu128-xcvu37p  | 32          | 23                | 1               | 0               | 26                  | 14                   | 12                    | 22       | 23                |
| xilinx-zcu102-xczu9eg  | 154         | 107               | 5               | 2               | 124                 | 59                   | 26                    | 78       | 44                |
| xilinx-zcu106-xczu7ev  | 184         | 128               | 6               | 3               | 147                 | 83                   | 76                    | 113      | 40                |
| xilinx-z7020-xc7z020   | 795         | 553               | 24              | 11              | 637                 | 485                  | 170                   | 889      | 72                |
| xilinx-z7045-xc7z045   | 193         | 135               | 6               | 3               | 155                 | 120                  | 44                    | 217      | 40                |
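One way to read the utilization table is to flag which targets can hold the design. The sketch below assumes that any resource above 100% means the accelerator does not fit on that device; device labels are shortened to the board name where the table gives both board and part.

```python
COLUMNS = (
    "LUTs", "Registers", "F7 Muxes", "F8 Muxes", "LUT as Logic",
    "LUT as Memory", "Block RAM Tile", "DSPs", "Bonded IOB",
)

# Utilization percentages copied from the table above.
utilization_pct = {
    "profpga-xc7v2000t": (35, 24, 1, 0, 28, 24, 18, 91, 12),
    "profpga-xcvu440": (17, 12, 1, 0, 13, 18, 9, 68, 10),
    "xilinx-vc707": (139, 97, 4, 2, 112, 64, 23, 70, 20),
    "xilinx-vcu118": (36, 25, 1, 1, 29, 14, 11, 29, 17),
    "xilinx-vcu128": (32, 23, 1, 0, 26, 14, 12, 22, 23),
    "xilinx-zcu102": (154, 107, 5, 2, 124, 59, 26, 78, 44),
    "xilinx-zcu106": (184, 128, 6, 3, 147, 83, 76, 113, 40),
    "xilinx-z7020": (795, 553, 24, 11, 637, 485, 170, 889, 72),
    "xilinx-z7045": (193, 135, 6, 3, 155, 120, 44, 217, 40),
}


def over_budget(row):
    """Names of the resources that exceed 100% utilization."""
    return [name for name, value in zip(COLUMNS, row) if value > 100]


fits = [device for device, row in utilization_pct.items() if not over_budget(row)]

print("Fits:", ", ".join(fits))
for device, row in utilization_pct.items():
    exceeded = over_budget(row)
    if exceeded:
        print(f"{device} exceeds: {', '.join(exceeded)}")
```

Under that assumption, four of the nine targets can hold the design, including the VCU118 used for the measurements.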

### **SoC Integration**

[Figure: screenshot of the ESP SoC configuration tool. Recoverable settings: target technology virtexup; target FPGA board xilinx-vcu118-xcvu9p; CPU core ariane with FPU ETH FPnew; shared local memory 256 KB per tile; ESP RTL caches with L2 sets 512, L2 ways 4, LLC sets (per mem tile) 1024, LLC ways 16; Ethernet debug link (EDCL) with IP address 192.168.1.128 (hex C0A80180) and MAC address 000A3504DB5A; NoC configuration 2 rows x 2 columns with tiles (0,0) mem, (0,1) cpu, (1,0) IRIS_HLS4ML accelerator, and (1,1) I/O.]

- Using ESP from Columbia University
  - Includes selectable CPUs, memory cores, input/output (I/O) cores, and a selection of accelerators
- Example: four-tile design with a RISC-V CPU core, memory tile, I/O tile, and PyTorch-Iris accelerator
- Implementation on FPGA with AMD/Xilinx tools
  - Design flow for application-specific integrated circuit (ASIC) also available
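The four-tile example can be written down as a small data structure and checked for consistency. This is an illustrative sketch only, not the ESP configuration file format: the tile positions and the debug-link address are read from the configuration screenshot, and the tile type names are hypothetical labels.

```python
import ipaddress
from collections import Counter

# Tile layout by (row, column), as read from the configuration screenshot.
tiles = {
    (0, 0): "mem",
    (0, 1): "cpu",
    (1, 0): "accelerator",  # IRIS_HLS4ML (PyTorch-Iris) accelerator tile
    (1, 1): "io",
}

rows = 1 + max(row for row, _ in tiles)
cols = 1 + max(col for _, col in tiles)
tile_counts = Counter(tiles.values())

# The screenshot lists the debug-link IP address in hex and in dotted decimal;
# converting the hex form shows that the two agree.
edcl_ip = str(ipaddress.IPv4Address(int("C0A80180", 16)))

print(f"NoC grid: {rows} x {cols}, tiles: {dict(tile_counts)}")
print(f"EDCL IP address: {edcl_ip}")
```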


### SoC Testing: FPGA Implementation



Image 1: FPGA development board (Xilinx VCU118) connected to the host

Image 2: Booting Linux on the RISC-V core, then running the application that invokes the accelerator (terminal screenshot of a minicom 2.7.1 serial console on port /dev/ttyUSB1)




#### Future Work



- Accelerator synthesis for ResNet neural network
- Integration of SoC containing neural network accelerator, SSCA accelerator, and RISC-V CPU core
- Performance quantification and comparison to NVIDIA Orin


#### Resources

- Reproduce these results using open source code
- CMU-SEI GitHub
  - PyTorch-Iris hls4ml branch (simple neural network code): https://github.com/cmu-sei/pytorch-iris
  - Docker environment for using hls4ml: https://github.com/cmu-sei/hls4ml-docker
  - Docker environment for using ESP: https://github.com/cmu-sei/esp-docker
- ESP tutorials
  - https://www.esp.cs.columbia.edu/docs/


#### Presenter

**Dr. John G. Wohlbier**, Principal Research Scientist, Advanced Computing Lab Lead

Telephone: +1 412.268.5800

Email: info@sei.cmu.edu
