# Learning to Design Accurate Deep Learning Accelerators with Inaccurate Multipliers

#### Paras Jain<sup>2\*</sup>

with Safeen Huda<sup>1</sup>, Martin Maas<sup>1</sup>, Joseph Gonzalez<sup>2</sup>, Ion Stoica<sup>2</sup> and Azalia Mirhoseini<sup>1</sup>

<sup>1</sup>Google

<sup>2</sup> UC Berkeley

\* Work done while an intern at Google Brain

Google Research



## Deep learning's inference energy problem



- Rise of high-parameter models
- Inference is 80%+ of DNN workloads (AWS, Facebook)



# Approximate computing as a new way to save power on DNN accelerators






- Deep learning models are tolerant to approximations like quantization
- We study: emerging approximate multipliers + adders that trade off accuracy for power
- *Complementary* approach to quantization and sparsity
- **Challenge:** how to maintain high accuracy under approximation?


# How to achieve power savings with an approximate inference accelerator without any accuracy loss on a large-scale dataset?


# **Background:** approximate MACs to trade off power and accuracy

- Parts of fully accurate circuits can be removed to trade off accuracy for better power efficiency
- Example: truncate the carry chain in an 8-bit adder
- Extensive prior work to produce such multipliers/adders [1] [2] [survey].
- <u>Functionally</u> approximate circuits only

[1] https://dl.acm.org/doi/10.1145/2228360.2228509

[2] https://ieeexplore.ieee.org/abstract/document/7926993

[survey] https://www.osti.gov/pages/servlets/purl/1286958
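The carry-chain example above can be sketched in software. This is a minimal illustration, not a specific circuit from the cited work: the function name `approx_add8` and the cut position `k` are assumptions made for this sketch. The 8-bit addition is split into a low and a high segment and the carry between them is dropped, so each result is either exact or too small by exactly `2**k`.

```python
def exact_add8(a: int, b: int) -> int:
    """Reference 8-bit addition (9-bit result including carry-out)."""
    return a + b


def approx_add8(a: int, b: int, k: int = 4) -> int:
    """Add two 8-bit values with the carry chain cut at bit k (hypothetical design).

    The low k bits and the remaining high bits are added independently;
    the carry out of the low segment is discarded instead of propagated.
    """
    mask = (1 << k) - 1
    low = ((a & mask) + (b & mask)) & mask   # carry out of the low segment is dropped
    high = (a >> k) + (b >> k)               # high segment never sees that carry
    return (high << k) | low


if __name__ == "__main__":
    # Exhaustively compare against the exact adder over all 8-bit operand pairs.
    errors = [exact_add8(a, b) - approx_add8(a, b)
              for a in range(256) for b in range(256)]
    print("max error:", max(errors))
    print("fraction of exact results:", errors.count(0) / len(errors))
```

A shorter carry chain is where the hardware saving would come from; the exhaustive loop shows the accuracy cost that is paid for it.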

V. Mrazek, R. Hrbacek, Z. Vasicek and L. Sekanina, EvoApprox8b: Library of approximate adders and multipliers for circuit design and benchmarking of approximation methods. Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017

## **Challenge:** Prior designs with approximate MACs degrade accuracy

|                           | Largest dataset | Model MACs | Retrain free? | Zero loss? |
|---------------------------|-----------------|------------|---------------|------------|
| Venkataramani et al. [43] | CIFAR-10        | <1M        | ×             | ×          |
| Zhang et al. [45]         | CALTECH         | <1M        | ×             | ×          |
| Sarwar et al. [37]        | CIFAR-100       | <1M        | ×             | ×          |
| Mrazek et al. [34]        | CIFAR-10        | 21M        | ✓             | ×          |
| Mrazek et al. [33]        | CIFAR-10        | 120M       | ✓             | ×          |

- Evaluated on CIFAR w/ small models
- Must incur accuracy penalty!

# This work: We show it is possible to use approximation and maintain accuracy

|                           | Largest dataset | Model MACs | Retrain free? | Zero loss? |
|---------------------------|-----------------|------------|---------------|------------|
| Venkataramani et al. [43] | CIFAR-10        | <1M        | ×             | ×          |
| Zhang et al. [45]         | CALTECH         | <1M        | ×             | ×          |
| Sarwar et al. [37]        | CIFAR-100       | <1M        | ×             | ×          |
| Mrazek et al. [34]        | CIFAR-10        | 21M        | ✓             | ×          |
| Mrazek et al. [33]        | CIFAR-10        | 120M       | ✓             | ×          |
| AutoApprox (ours)         | ImageNet-1k     | 2B         | ✓             | ✓          |

**Key Insight:** Add approximate units next to exact units as a low-power "fast path"



At inference, router selects <u>one systolic array</u>

#### **Error-tolerant workloads:**

→ Save power by using approximate MAC

#### **Sensitive workloads:**

→ Maintain accuracy by using exact MAC
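The routing idea can be illustrated with a small software model. This is a sketch under stated assumptions, not the hardware router itself: the per-layer routing granularity, the `route` function and the `truncated_mac` unit (which zeroes the four low bits of each product) are all hypothetical stand-ins.

```python
from typing import Callable, Sequence

Mac = Callable[[int, int, int], int]


def exact_mac(acc: int, a: int, b: int) -> int:
    """Exact multiply-accumulate."""
    return acc + a * b


def truncated_mac(acc: int, a: int, b: int) -> int:
    """Hypothetical approximate MAC: zero the 4 low bits of each product."""
    return acc + ((a * b) & ~0xF)


# One exact array plus one approximate "fast path" array (assumed layout).
ARRAYS = {"exact": exact_mac, "approx": truncated_mac}


def route(layer_is_sensitive: bool) -> str:
    """Select exactly one systolic array for a layer."""
    return "exact" if layer_is_sensitive else "approx"


def dot(mac: Mac, xs: Sequence[int], ws: Sequence[int]) -> int:
    """Dot product computed with the chosen MAC unit."""
    acc = 0
    for x, w in zip(xs, ws):
        acc = mac(acc, x, w)
    return acc


def run_layer(xs: Sequence[int], ws: Sequence[int], layer_is_sensitive: bool) -> int:
    return dot(ARRAYS[route(layer_is_sensitive)], xs, ws)


if __name__ == "__main__":
    print(run_layer([3, 5], [7, 9], layer_is_sensitive=True))   # exact path
    print(run_layer([3, 5], [7, 9], layer_is_sensitive=False))  # approximate path
```

Because only one array is selected per call, a sensitive layer never pays the approximation error, while a tolerant layer never uses the more expensive exact unit.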

# **AutoApprox:** full-stack framework to design zero-loss approximate accelerators



#### **Contributions:**

1. Approx. TPU architecture w/ exact fallback
2. Whole-chip PPA estimates
3. **Fast e2e accuracy simulation:** 7000x simulation speedup
4. **ML-guided search:** Novel Bayesian optimizer for large combinatorial space of circuits

## Automatically generate diverse approximate accelerators

**TPUv3-based architectural template**



- Systolic array generator instantiates diverse set of approximate TPU designs
- Architectural template: TPU w/ sister approximate matrix multipliers
- Approximate MAC bank: 36 MACs from prior work, can be augmented w/ new designs
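The size of the design space produced by such a generator can be sketched by enumeration. This is an illustrative model only: the placeholder MAC names, the `generate_designs` function and the limit of two sister arrays per chip are assumptions, and a real generator would emit hardware descriptions rather than tuples.

```python
from itertools import combinations
from typing import Iterator, Tuple

# Placeholder names for the 36 approximate MACs in the bank.
MAC_BANK = [f"approx_mac_{i:02d}" for i in range(36)]


def generate_designs(max_sisters: int = 2) -> Iterator[Tuple[str, ...]]:
    """Yield candidate chips as (exact MXU, sister approximate MXUs...)."""
    for n in range(1, max_sisters + 1):
        for sisters in combinations(MAC_BANK, n):
            yield ("exact_mxu",) + sisters


if __name__ == "__main__":
    designs = list(generate_designs())
    # 36 single-sister designs + C(36, 2) = 630 two-sister designs
    print(len(designs))  # 666
```

Adding a new MAC design to the bank only extends `MAC_BANK`; the enumeration logic stays unchanged.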


## Performance estimation of candidates




### Performance estimation of candidates: POWER, AREA

Power and area estimates are whole-chip, not per-multiplier



- Whole-chip view captures effects such as lower power usage from interconnect
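Why whole-chip numbers differ from per-multiplier numbers can be shown with a back-of-the-envelope model. This is not the PPA flow used in this work; it is an Amdahl-style sketch in which `mxu_share` (the fraction of chip energy spent in the matrix units) and the example values are purely hypothetical.

```python
def whole_chip_relative_energy(mxu_share: float, mac_relative_energy: float) -> float:
    """Relative chip energy when only the MXU share scales with the MAC energy.

    mxu_share: assumed fraction of total chip energy spent in the MXUs (0..1).
    mac_relative_energy: energy of the approximate MAC relative to the exact one.
    """
    if not 0.0 <= mxu_share <= 1.0:
        raise ValueError("mxu_share must be between 0 and 1")
    return (1.0 - mxu_share) + mxu_share * mac_relative_energy


if __name__ == "__main__":
    # Hypothetical numbers: a MAC that uses 20% less energy, on a chip
    # where the MXUs account for half of the total energy.
    print(whole_chip_relative_energy(mxu_share=0.5, mac_relative_energy=0.8))
```

A per-multiplier saving is diluted by everything on the chip that does not change, which is why per-multiplier figures alone overstate the benefit.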


### Performance estimation of candidates: <u>ACCURACY</u>

Very large datasets are important but cost-prohibitive to simulate

### ImageNet:

- 1.2M training samples
- 50K validation samples
- 224x224 images

- Most important use-cases involve large, diverse datasets
- MNIST, CIFAR-10 not representative!


Evaluating a single ImageNet sample with a commercial simulator takes <u>4.2 hours</u>
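The scale of the problem follows from simple arithmetic. The sketch below assumes the validation set is simulated sequentially on one machine and that the reported 7000x speedup is measured against this commercial-simulator baseline; both are assumptions, not statements from the slides.

```python
HOURS_PER_SAMPLE = 4.2        # commercial simulator, one ImageNet sample
VALIDATION_SAMPLES = 50_000   # ImageNet validation set
SPEEDUP = 7000                # reported simulation speedup


def total_hours(samples: int, hours_per_sample: float, speedup: float = 1.0) -> float:
    """Sequential simulation time for a full evaluation pass."""
    return samples * hours_per_sample / speedup


if __name__ == "__main__":
    baseline = total_hours(VALIDATION_SAMPLES, HOURS_PER_SAMPLE)
    fast = total_hours(VALIDATION_SAMPLES, HOURS_PER_SAMPLE, SPEEDUP)
    print(f"baseline: {baseline:,.0f} hours (~{baseline / (24 * 365):.0f} years)")
    print(f"with speedup: {fast:.0f} hours")
```

Under these assumptions one accuracy evaluation drops from decades to roughly a day, which is what makes a search over many candidate designs feasible.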







Related approach (caching only): V. Mrazek, L. Sekanina, and Z. Vasicek. Using libraries of approximate circuits in design of hardware accelerators of deep neural networks. AICAS, 2020.
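The slides do not spell out the caching mechanism, so the following is a sketch under an assumption: that caching means tabulating the approximate multiplier's output for every pair of unsigned 8-bit operands once, then replacing circuit simulation with table lookups. The multiplier `approx_mul8` is a hypothetical example, not one of the MACs from the bank.

```python
from typing import List, Sequence


def approx_mul8(a: int, b: int) -> int:
    """Hypothetical approximate 8-bit multiplier: ignore the LSB of each operand."""
    return (a & 0xFE) * (b & 0xFE)


# Evaluate the (potentially slow) multiplier model once per operand pair.
LUT: List[List[int]] = [[approx_mul8(a, b) for b in range(256)] for a in range(256)]


def lut_dot(xs: Sequence[int], ws: Sequence[int]) -> int:
    """Approximate dot product using cached products instead of re-simulation."""
    return sum(LUT[x][w] for x, w in zip(xs, ws))


if __name__ == "__main__":
    xs, ws = [3, 5], [7, 9]
    print("approximate:", lut_dot(xs, ws))
    print("exact:      ", sum(x * w for x, w in zip(xs, ws)))
```

The table has only 65,536 entries per multiplier, so its one-time cost is negligible next to the billions of multiplications in a single forward pass.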


#### **ML-guided search to jointly search for hardware + mapping**

Two-phase search space finds zero-loss chips but is enormous



Accelerate search w/ Bayesian optimization, pruning and continuous relaxation

$$\begin{array}{ll} \min_{Z} & \sum_{i=1}^{N} q_{i}^{\mathsf{T}} Z_{i} \\ \text{s.t.} & \operatorname{ACC}(Z) \geq \tau \\ & \operatorname{AREA}(Z) \leq \phi \\ & \sum_{j=1}^{K} Z_{ij} = 1 \quad \forall i \in \{1, \dots, N\} \\ & Z \in \{0, 1\}^{N \times K} \end{array}$$

Search space O(2<sup>268</sup>)

- (a) Bayesian optimization to balance exploration + exploitation w/ learned surrogate cost function
- (b) Prune catastrophic trials using greedy lower bound
- (c) Relax combinatorial search space into continuous space
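The structure of the optimization problem can be made concrete on a toy instance small enough to solve by brute force. This is not the search method described above (which uses Bayesian optimization, pruning and relaxation); every number is hypothetical, and the additive accuracy-loss model and the area model are stand-ins for ACC(Z) and AREA(Z). Each assignment tuple `z` encodes the one-hot matrix Z: `z[i] = j` means `Z[i][j] = 1`.

```python
from itertools import product
from typing import Tuple

# Hypothetical toy instance: N = 3 layers, K = 3 MAC options (option 0 = exact).
Q = [[10, 8, 6], [10, 7, 5], [10, 9, 3]]         # q_i: energy of layer i on option j
DROP = [[0, 10, 50], [0, 20, 100], [0, 0, 30]]   # assumed accuracy loss (basis points)
ARRAY_AREA = [10, 3, 2]                           # area of each systolic array type
MAX_DROP = 40                                     # stands in for ACC(Z) >= tau
PHI = 13                                          # area budget, AREA(Z) <= phi

N, K = len(Q), len(ARRAY_AREA)


def energy(z: Tuple[int, ...]) -> int:
    """Objective: sum over layers of q_i^T Z_i."""
    return sum(Q[i][j] for i, j in enumerate(z))


def accuracy_ok(z: Tuple[int, ...]) -> bool:
    return sum(DROP[i][j] for i, j in enumerate(z)) <= MAX_DROP


def area(z: Tuple[int, ...]) -> int:
    """Exact array is always present; each distinct approximate option adds an array."""
    return ARRAY_AREA[0] + sum(ARRAY_AREA[j] for j in set(z) if j != 0)


def solve() -> Tuple[int, ...]:
    """Enumerate all K**N one-hot assignments and keep the cheapest feasible one."""
    feasible = [z for z in product(range(K), repeat=N)
                if accuracy_ok(z) and area(z) <= PHI]
    return min(feasible, key=energy)


if __name__ == "__main__":
    best = solve()
    print(best, energy(best))  # (0, 0, 2) 23
```

Enumeration visits `K**N` assignments, which is fine for 27 candidates and hopeless at the scale quoted above; that gap is what the surrogate model, pruning and relaxation are meant to close.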

# **Results:** Evaluating AutoApprox on large-scale workload + dataset

- **Workload:** ResNet-50 on ImageNet-1k
- Evaluating routed TPU design w/ approximate cores
- Energy, perf. and area evaluated at <10nm PDK

| Hardware design                    | <b>Total chip energy</b> (relative to exact) | <b>Total chip area</b> (exact + approx) | Top-1 accuracy | Top-5 accuracy |
|------------------------------------|----------------------------------------------|-----------------------------------------|----------------|----------------|
| Exact 8-bit MXU                    | 1.0×                                         | 1.0×                                    | 72.1%          | 90.7%          |
| Greedy layerwise search            | 0.976×                                       | 1.281×                                  | 71.2%          | 90.3%          |
| Google Vizier [12]                 | 0.969×                                       | 2.712×                                  | 65.82%         | 86.2%          |
| AutoApprox-S (power optimized)     | 0.939×                                       | 1.844×                                  | 66.5%          | 87.42%         |
| AutoApprox-L (balanced)            | 0.968×                                       | 0.948×                                  | 72.5%          | 90.7%          |
| AutoApprox-XL (accuracy optimized) | 1.024×                                       | 1.189×                                  | 73.1%          | 91.1%          |

- Greedy layerwise search and Google Vizier: 1%-6% lower accuracy than baseline
- AutoApprox-S and AutoApprox-L: 3.2%-6.1% energy savings!
- **Significant energy savings for TPU with zero accuracy loss**



## **Results:** AutoApprox system Pareto-optimal relative to baselines

- **Workloads:** ResNet-50 on ImageNet-1k; VGG-19 on CIFAR-10
- Evaluating routed TPU design w/ approximate cores
- Energy, perf. and area evaluated at <10nm PDK
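The Pareto claim can be checked against the ResNet-50 energy and top-1 numbers reported in the results table. The sketch assumes the two axes of interest are total chip energy (lower is better) and top-1 accuracy (higher is better) and ignores area; the helper names are illustrative.

```python
from typing import Dict, List, Tuple

# (relative total chip energy, top-1 accuracy in %) from the ResNet-50 results table.
DESIGNS: Dict[str, Tuple[float, float]] = {
    "Exact 8-bit MXU": (1.0, 72.1),
    "Greedy layerwise search": (0.976, 71.2),
    "Google Vizier": (0.969, 65.82),
    "AutoApprox-S": (0.939, 66.5),
    "AutoApprox-L": (0.968, 72.5),
    "AutoApprox-XL": (1.024, 73.1),
}


def dominates(p: Tuple[float, float], q: Tuple[float, float]) -> bool:
    """p dominates q if it is no worse on both axes and differs on at least one."""
    return p[0] <= q[0] and p[1] >= q[1] and p != q


def pareto_front(designs: Dict[str, Tuple[float, float]]) -> List[str]:
    """Names of designs that no other design dominates."""
    return [name for name, point in designs.items()
            if not any(dominates(other, point) for other in designs.values())]


if __name__ == "__main__":
    print(pareto_front(DESIGNS))
    # ['AutoApprox-S', 'AutoApprox-L', 'AutoApprox-XL']
```

On these two axes the exact baseline and both search baselines are dominated by at least one AutoApprox design.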



### Learning to Design Accurate Deep Learning Accelerators with Inaccurate Multipliers

Paras Jain, Safeen Huda, Martin Maas, Joseph Gonzalez, Ion Stoica, Azalia Mirhoseini

Please reach out! parasj@berkeley.edu

**Problem:** How to achieve power savings with an approximate inference accelerator without any accuracy loss on a large-scale dataset?

**Approach:** Pack heterogeneous approximate MXUs as sidekicks to a fallback exact MXU

#### **Contributions:**

- Approx. TPU architecture w/ exact fallback
- Whole chip PPA estimates
- Fast e2e accuracy simulation
- ML-guided search

#### **Key results:**

- Save up to 6% MXU power end-to-end on real TPU design (<10nm)
- Method significantly outperforms competitive baselines
- Opens new orthogonal avenue for chip efficiency beyond quantization + sparsity