# Leveraging Prior Knowledge for Effective Design-Space Exploration in High-Level Synthesis

Lorenzo Ferretti<sup>10</sup>, *Student Member, IEEE*, Jihye Kwon, Giovanni Ansaloni, Giuseppe Di Guglielmo, Luca P. Carloni, and Laura Pozzi

Abstract-High-Level Synthesis (HLS) tools allow the generation of a large variety of hardware implementations from the same specification by setting different optimization directives. Each combination of HLS directives returns an implementation of the target application that is based on a particular microarchitecture. Designers are interested only in the subset of implementations that correspond to Pareto-optimal points in the performance versus cost design space. Finding this subset is hard because the relationship between the HLS directives and the Pareto-optimal implementations cannot be foreseen. Hence, designers must default to an exploration of the design space through many time-consuming HLS runs. We present a methodology that infers knowledge from past design explorations to identify high-quality directives for new target applications. To this end, we formulate a novel abstract representation of applications and their associated configuration spaces, introduce a similarity metric to compare quantitatively the configuration spaces of different applications, and a method to infer actionable information from a source space to a target space. The experimental results with the MachSuite benchmarks show that our approach retrieves close approximations of the Pareto frontier of best-performing implementations for the target application, in exchange for a small number of HLS runs.

*Index Terms*—Design-space exploration (DSE), hardware acceleration, high-level synthesis (HLS), knowledge transfer.

#### I. INTRODUCTION

**H** IGH-LEVEL synthesis (HLS) enables the automatic generation of hardware designs from high-level specifications given, for example, as C/C++ or SystemC code [1]. With HLS, designers can first specify complex functionalities in a simpler way by working at a higher level of abstraction than register-transfer level (RTL), and then synthesize many different implementations from the same specification by applying optimization directives before running the HLS

Lorenzo Ferretti, Giovanni Ansaloni, and Laura Pozzi are with the Faculty of Informatics, Università della Svizzera italiana, 6900 Lugano, Switzerland (e-mail: lorenzo.ferretti@usi.ch).

Jihye Kwon, Giuseppe Di Guglielmo, and Luca P. Carloni are with the Department of Computer Science, Columbia University, New York, NY 10027 USA.

Digital Object Identifier 10.1109/TCAD.2020.3012750

tool. Examples of directives include the unrolling factor of a loop, the inlining of functions, and the mapping of arrays to memory structures. The combined application of directives has a major impact on the microarchitecture of the synthesized implementation. Hence, HLS directives enable a broad exploration of the design space in search of implementations that are Pareto optimal with respect to conflicting objectives, such as performance (latency and throughput) and cost (area and power) [2], [3]. For complex designs, however, the relationships between the combination of many directives and the quality of the synthesized implementations are difficult to foresee before the execution of time-consuming HLS runs. Moreover, the number of alternative implementations increases exponentially with the number of considered directives, thus making exhaustive explorations infeasible in practice even for simple cases.

Several research efforts (summarized in Section VI) have proposed strategies to discover, in the context of HLS-based designs, the most effective implementations from a cost and performance perspective, while minimizing the number of HLS runs. This problem is named HLS-driven design-space exploration (DSE). Our contribution tackles this problem from a novel perspective: for the first time, we explore the feasibility of effectively *harnessing the knowledge of past synthesis outcomes to guide the optimization of new designs*.

Fig. 1 illustrates our approach. When exploring a new *target* design with HLS, we first select a *source* design from a library of previously completed explorations. The source design is the one whose specification is most similar to the target one. Then, we consider the combinations of directives that led to the best results for the source design. These are Pareto-optimal implementations (marked with red crosses in Fig. 1), i.e., those for which no other implementations resulted in both less area and lower latency. The main idea is that the similarity among the two designs justifies the translation of the directives from the source design into directives for the target design, in order to lead the search for an approximation of the set of Pareto-optimal target implementations.

In more detail, our strategy employs a novel *similarity metric* to identify the most appropriate source design for a given target design. Then, it adopts a novel *mapping strategy* to link the directives between the related, but not identical, source and target designs. Ultimately, our methodology derives a set of Pareto-optimal implementations for a new HLS design from a prior knowledge at a minimal cost in terms of synthesis runs. Our contribution is therefore threefold.

0278-0070 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

Manuscript received April 18, 2020; revised June 12, 2020; accepted July 6, 2020. Date of publication October 2, 2020; date of current version October 27, 2020. This work was supported in part by the National Science Foundation under Grant 1527821; in part by the MagicISEs under Grant 200021-156397 and the ML-Edge under Grant 200020-182009 projects evaluated by the Swiss NSF; and in part by the MyPreHealth under Grant 16073 project funded by the Hasler Stiftung. This article was presented in the International Conference on Compilers, Architecture, and Synthesis for Embedded Systems 2020 and appears as part of the ESWEEK-TCAD special issue. (*Corresponding author: Lorenzo Ferretti.*)



Fig. 1. High-level description of the methodology proposed in this work. A target design is compared to available sources in a knowledge base. The most similar source design is identified and leveraged to learn effective optimizations for the target, approximating the Pareto frontier of its design space while requiring few HLS runs.

- 1) We explore the feasibility of harnessing prior knowledge in the context of HLS-driven DSEs.
- 2) We introduce an abstract representation of HLS design space, a metric to assess the similarity between sources and targets, and a way to infer optimizations across different DSEs.
- 3) We demonstrate the effectiveness of our approach across an extensive set of 39 different designs from MachSuite [4]. We retrieve close approximations of the Pareto set of the best implementations, achieving a Pareto frontier distance (Average Distance from Reference Set) of 0.009 in the median case. On average, we only require the exploration of less than 8% of target design spaces, and only an average of 38 synthesis runs per design.

#### II. MOTIVATION

When optimizing a design with HLS, an expert designer starts by identifying which directives are applicable. For example, given the code in Snippet 1 of Fig. 2, the designer may be interested in exploring unrolling factors for loops, combined with different degrees of partitioning for the input/output arrays. Furthermore, the designer may recall to have already optimized in the past a design with a similar code structure, such as the one reported in Snippet 2 (also shown in Fig. 2). Indeed, even if they are not identical (e.g., the loop boundaries and the memory access patterns differ), the two code snippets have some structural similarities: they both iterate over two nested loops and process data provided in input to the function through pointers. These similarities may be sufficient to suggest adapting those directives that lead to optimal implementations for Snippet 2 to the case of Snippet 1, instead of starting the DSE by trying anew many combinations of directives for Snippet 1.

The designer's empirical strategy to tackle the DSE task hence consists of three main steps: 1) identify the main structural characteristics in the code of the target design; 2) pinpoint a similar already-explored design; and 3) transfer the knowledge from the source design to the new target design.

Our methodology performs these steps, but, differently from the above-described scenario, operates in a systematic and automated way. In Section IV, we show that our explorations, guided by prior knowledge, yield close approximations of the Pareto-optimal results from an exhaustive approach, while requiring very few synthesis runs. Our methodology answers the following three research questions.

### *R.Q.1:* From an HLS perspective, how can similarities among designs be quantified?

In general, code written in a high-level programming language such as C/C++ or SystemC is ill-suited for the automatic identification of structural similarities. Therefore, we propose an abstract representation *that only retains the characteristics of interest for HLS optimizations*, e.g., the structure of loops and that of memory access patterns. We extract such a representation [that we termed specification encoding (SE)] with a custom compiler pass. Since the representation is in the form of a string of symbols, we can use string-similarity algorithms to quantify the similarity in terms of computational patterns that exist between a source design (from a library capturing prior knowledge) and the target design.

## *R.Q.2:* How can the similarity between directive choices for different designs be assessed?

Besides the specification code, the other aspect affecting the HLS results is the choice of HLS directives. Indeed, a proper source of prior knowledge should have a choice of directives values similar to the one of the target. As an example, if a loop can be unrolled by only a small degree in a source, little information can be leveraged to optimize a loop in the target for very high unrolling factors. We introduce a domain-specific language to describe succinctly the set of directives associated with a design and a metric to measure the similarity between sets of directives associated with the source and target designs. Then, in the *source selection strategy* step, which is shown in the top part of Fig. 1, we combine design and directive similarities to identify the most promising source for the given target design.

*R.Q.3:* How to infer from prior knowledge HLS directives that give optimal results?

We have designed a strategy that transforms the HLS directives for the source design into HLS directives of the target design, as shown in the lower part of Fig. 1.

In the next section, we describe in detail the answers to these three research questions.

#### III. METHODOLOGY

#### A. Terminology

An *HLS design* (or *design*) is a functionality to be realized in hardware. A *specification* is a high-level description of the design in a programming language such as SystemC or C/C++, given in input to the HLS tool. An *implementation* of the design is the output of a run of the HLS tool. This output is typically expressed as an automatically generated Snippet 1: last\_step\_scan (target).

```
1
    void last_step_scan(int bucket[SIZE], int sum[RADIX]){
2
       int i, j, k;
3
       loop_1:for(i = 0; i < RADIX;i++){
4
         loop_2:for(j = 0; j < BLOCK; j++) {
5
            \mathbf{k} = (\mathbf{i} * \text{BLOCK}) + \mathbf{j};
6
            bucket[k] = bucket[k] + sum[i];
7
         }
8
       }
9
    }
```

Snippet 2: get\_delta\_matrix\_weights2 (source).

```
1
   void get_delta_matrix_weights2(double delta_weights2[
        N_NODES*N_NODES], double output_difference[
        N_NODES], double last_activations[N_NODES]) {
2
     int i. i:
3
     loop_1:for(i = 0; i < N_NODES; i++) 
4
       loop_2:for(j = 0; j < N_NODES; j++) {
5
         delta_weights2[i * N_NODES + j] = last_activations[i]
               * output_difference[j];
6
7
     }
8
   }
```

Fig. 2. Running example. Code snippets of two different functions from MachSuite [4]. Snippet 1 shows the C code of last\_step\_scan used in this article as example of target design. Snippet 2 shows the C code of get\_delta\_matrix\_weights2 used as example of source design. The code has been rewritten to increase readability, without modifying the functionality.

RTL code written in Verilog or VHDL. Each implementation is characterized by the values of a performance metric and a cost metric.

A synthesis configuration (or, simply, configuration) defines the transformations that a design undergoes through HLS. A designer controls these transformations with constraint and optimization directives, such as loop unrolling or pipelining, array manipulation, and other control and datapath optimizations. A directive is associated with a location in the code specification. A location could be either a label in the code or a language construct; for example, a loop or an array declaration. A designer can further customize some directives by specifying the values for the directive parameters; for example, the designer can customize the amount of parallelism in the implementation by unrolling a loop a certain number of times or by setting a certain initiation interval for the pipeline implementing the loop.

A *design space* is the set of all the possible design configurations and the associated costs and performance results from HLS. A *Pareto configuration* of a design is a configuration that leads to an implementation that is Pareto optimal in the biobjective optimization space defined by the performance and cost metrics. A (first-rank) Pareto frontier is the set of Paretooptimal points. Finally, an *i*-th *rank Pareto frontier* (for i > 1) is defined as the Pareto frontier obtained after removing the lower rank frontiers from the design space.

#### B. Problem Description

For a design *T*, let  $\mathcal{X}_T$  denote the set of all possible synthesis configurations. In general,  $\mathcal{X}_T$  is a very large set, possibly of infinite size. In practice, designers explore a portion of the design space of *T* by trying a subset  $X_T \subset \mathcal{X}_T$ , whose elements they choose carefully based on their experience running HLS. Given *T* and  $X_T$ , the *DSE task* returns a subset of  $X_T$  that consists of all Pareto configurations, i.e.,

$$P(T, X_T) = \{x | x \in X_T \text{ and } x \text{ is Pareto}\}.$$
 (1)

The subset  $P(T, X_T)$  is obtained by first 1) performing  $|X_T|$  HLS runs on *T*, one run for each  $x \in X_T$ , and then 2) by selecting only those configurations that turn out to be Pareto configurations.

Now, assume that before performing the DSE task for T (the *target* design), the designer has performed the DSE task



Fig. 3. (Top) Standard approach: the designer defines a set of configurations to be explored,  $X_T$ , given a target design T. Only after synthesizing all the  $X_T$  configurations, Pareto optimal ones  $P(T, X_T)$  are identified. (Bottom) Approach leveraging prior knowledge instead: the configurations to be synthesized  $X_T^S$  are *inferred* from the  $P(S, X_S)$  of a similar design S. By synthesizing T with  $X_T^S \ll X_T$  configurations, a close approximation  $\hat{P}(T, X_T)$  of the Pareto frontier is obtained.

for another design *S* (the *source* design), thereby obtaining  $P(S, X_S)$  for a given subset  $X_S$  of the configuration set  $\mathcal{X}_S$ . Furthermore, assume that a function  $g: X_S \to X_T$  exists that transforms a configuration for the source design into one for the target design, i.e.,

$$g(x_s) = x_t \tag{2}$$

with  $x_s \in X_S$  and  $x_t \in X_T$ . With the help of function *g*, the designer can leverage the *prior knowledge* on the source design in order to perform a DSE for the target design with a potentially much smaller number of HLS runs.

Let  $X_T^S$  be the set of all configurations for the target design T that are obtained by transforming the Pareto configurations (up to a certain Pareto frontier rank) of the source design, i.e.,

$$X_T^S = \{g(x_s) | x_s \in P(S, X_S)\}.$$
 (3)

By synthesizing the target design *T* with the configurations in  $X_T^S$ , we can obtain the set  $P(T, X_T^S)$  as an approximation  $\widehat{P}(T, X_T)$  of the set of Pareto configurations  $P(T, X_T)$ .

Fig. 3 showcases the difference between a standard approach and one leveraging prior knowledge.



Fig. 4. Methodology flow. The target design and its configurations space are encoded into a signature, which is compared to the ones of existing DSEs. The source having the most similar signature is selected to drive the inference process and generate the target configurations.

Note that this approximation requires  $|X_T^S|$  HLS runs, while the derivation of the actual set of Pareto configurations would require  $|X_T|$  HLS runs. Tuning the maximum Pareto frontier rank whose configurations are transformed from source to target, the synthesis effort and the approximation of  $P(T, X_T^S)$ by  $\hat{P}(T, X_T)$  can be traded-off. We explore the effect of varying this parameter in Section IV. Since for a given design T, the number of Pareto configurations  $|P(T, X_T)|$  is, in general, much smaller than the number of configurations  $|X_T|$ , if sets  $P(T, X_T)$  and  $P(S, X_S)$  are of comparable sizes then leveraging prior knowledge allows a major reduction in the number of time-consuming HLS runs while deriving the approximated set  $\hat{P}(T, X_T)$ .

Moreover, the degree to which  $P(T, X_T)$  is approximated by  $\widehat{P}(T, X_T)$  depends on the choice of a proper source design *S* to derive the prior knowledge for the given target *T*. To this end, we introduce a novel and concise representation to encode the specification and a configuration space of each design via an abstract characterization called *signature*. Then, we define a similarity metric between the signatures. If the signatures of two designs have a high similarity, then Pareto configurations for one design—when transformed to configurations for the other design. Moreover, signatures are also employed to automate the transformation of Pareto configurations between source and target spaces, thereby realizing function  $g(\cdot)$  of (2).

Fig. 4 illustrates the overall flow of our methodology. Given as input a target design's specification and configuration space, our strategy (A) derives the signature of the design, and (B) employs a similarity metric over such signature to search, in a database of already performed DSEs (the sources), for the most similar one. Once a source is selected, the Pareto configurations *for that source* are extracted, and (C) they are transformed by an inference process into valid configurations for the target.

Each of these three steps—signature encoding, similarity evaluation, and inference—is detailed in the remainder of this section.

#### C. Signature Encoding

This step aims to characterize a DSE with a compact representation that abstracts the specification (code) and the associated configurations (set of applied directives). The proposed Specification Encoding (SE) and Configuration

TABLE I SE OF DESIGN SOURCE CODE

| Symbols  | Code constructs                        | HLS directives            |
|----------|----------------------------------------|---------------------------|
| F        | Function definition                    | None                      |
| V        | Function parameter<br>passed by value  | None                      |
| Р        | Function parameter passed by reference | Partitioning and resource |
| А        | Arrays definition or declaration       | Partitioning and resource |
| S        | Struct definition or declaration       | Partitioning and resource |
| L        | Loops                                  | Unrolling                 |
| R        | Read operations                        | None                      |
| W        | Write operations                       | None                      |
| $C_{id}$ | Function call                          | Inlining                  |
| { }      | Scope                                  | None                      |

Space Descriptor (CSD) capture these two aspects. The combination of SE and CSD uniquely defines a *signature encoding*.

**Specification Encoding.** An SE describes those aspects of an HLS specification that can be targeted by HLS directives, such as the presence of loops and read/write operations, while disregarding anything that is not interesting from an HLSdriven DSE perspective. The encoding process generates a string representation of the specification that highlights the source code structure.

Table I shows the encoding scheme adopted and the correspondence between the string symbols and the code constructs. We derive the SE from the C/C++ specification through an LLVM [5] pass. We extended the compiler to parse the abstract syntax tree and produce the SE string. The last column of Table I also shows the HLS directives that can be associated with each code construct. We use the curly braces to identify the scope associated with symbols, thus allowing hierarchical representations (e.g., a function containing multiple nested loops).

*Running Example:* Given the function last\_step\_scan from Snippet 1 of Fig. 2 and the encoding in Table I, the proposed SE is F{PP}L{L{RRW}}. The encoding states that the function (F) receives two parameters by reference (PP), it has two nested loops and the innermost loop performs two reads and one write operations (L{L{RRW}}). Likewise, the SE for Snippet 2 from Fig. 2 is F{PPP}L{L{RRW}}, showcasing a similar, but not identical, structure.

**Configuration Space Descriptor.** We defined a domainspecific language to concisely describe a user-defined



Fig. 5. (Left) Signature similarity matrix obtained with  $\alpha = 0.2$ . (Center) SE similarity matrix. (Right) CSD similarity matrix. Darker color expresses a higher similarity. Each row of the matrix shows the similarity between a target design and the source ones. The indices on the axes corresponds to the function IDs in Table II.

configuration space. For source designs, CSDs describe which configurations are available in its design space, while for a target, a CSD indicates the set of configurations that a designer wishes to explore. Each line of the descriptor encodes a *knob* (a type of directives, the location, and selected parameter values) that the designer considers for the DSE task. For a directive with multiple parameters, a set of values for each parameter is specified.

*Running Example:* Given the function last\_step\_scan in Snippet 1 of Fig. 2, the associated CSD is shown in Snippet 3 of Fig. 6. The descriptor defines seven different knobs that can be associated with the function last\_step\_scan. Line 1 of Snippet 3 shows a knob with a single value: it associates a dual-ported block RAM (BRAM) to the array bucket that is the input of the function.

Line 3 instead defines a knob governing the array partitioning directive defined by all the pairs having one of two partitioning strategies (cyclic and block) as the first component, and the ten possible partitioning factors (all the powers of two from 1 up to 512) as the second one.

The CSD is parsed and the resulting set of configurations of the design space is generated as follows:

$$CS = K_1 \times K_2 \times \dots \times K_N \tag{4}$$

where *N* is the number of considered knobs, and  $K_i$  is the set of values related to each *i* knob, i.e., the set of values that the directive associated to the knob *i* can assume. For a directive with multiple parameters,  $K_i$  is the Cartesian product among each set of values. The size of the configuration space is then given by its cardinality (|CS|).

### D. Similarity Evaluation

To choose a candidate source design for a target design, we compute a similarity metric between their signature encodings. We compute such a similarity given the similarities of the design SEs and CSDs

$$\operatorname{Sim} = \alpha \operatorname{Sim}_{\operatorname{SE}} + (1 - \alpha) \operatorname{Sim}_{\operatorname{CSD}} \quad \alpha \in [0, 1].$$
 (5)

The parameter  $\alpha$  in (5) weights the contributions of the SE and CSD similarities. The best source candidate, selected

according to the similarity metric, is used to transfer knowledge from source to target design during the inference step.

Fig. 5-left shows the similarity matrix for the 39 functions in the MachSuite Benchmarks. Each row shows the similarity between a target design and all the candidate source designs; the diagonal elements show the similarity of the design to itself. The figure highlights a high similarity variance that discriminates well between similar and dissimilar sources. In Section IV, we show that our chosen similarity metric leads to an effective selection of the source for the given target.

**SE** Similarity: Since we express the SE as a string, we can use string-based algorithms to assess the similarity between SEs. In our approach, we adopt the Longest Common Subsequence (LCS) metric [6]. This metric returns a score  $Sim_{SE} \in [0, 1]$ , whose value is closer to 1 the more two strings are alike.

Fig. 5-center expresses the  $Sim_{SE}$  matrix for the same 39 functions, where each row shows the similarity between a target design and all the candidate sources.

*Running Example:* Given the SE of the target design F{PP}L{L{RRW}} and the source design F{PPPLL{LRRW}} in Fig. 2, the resulting SE similarity score is 0.93.

**CSD** Similarity. The similarity between two CSDs is assessed by comparing the knobs  $K_i$  for a target configuration space  $X_T$  (for design T) to the knobs  $K_j$  for a source configuration space  $X_S$  (for design S) using a mapping function  $M_{T,S}$ , which relates each knob of the target CSD to a specific knob of the source CSD

$$M_{T,S}(K_i) = K_j. (6)$$

The function  $M_{T,S}$  is determined through an alignment procedure. By iterating over the knobs of the target CSD from top to bottom, each knob is mapped to the first nonmapped knob of the same type belonging to the source CSD. Eventually, if no more knobs are available in the source, some target knobs may be left unmapped.

*Running Example:* Let us consider the target and source designs in Fig. 2 and their CSDs in Snippet 3 and Snippet 4 of Fig. 6, respectively. The CSD of function

Snippet 3: Configuration Space Descriptor of last\_step\_scan.

- 1 resource;last\_step\_scan;bucket;{RAM\_2P\_BRAM}
- 2 resource;last\_step\_scan;sum;{RAM\_2P\_BRAM}
- 3 array\_partition;last\_step\_scan;bucket;1;{cyclic,block };{1->512,pow\_2}
- 4 array\_partition;last\_step\_scan;sum;1;{cyclic,block};{1->128, pow\_2}
- 5 unroll;last\_step\_scan;last\_1;{1->128,pow\_2}
- 6 unroll;last\_step\_scan;last\_2;{1,2,4,8,16}
- 7  $clock; \{10\}$

- Snippet 4: Configuration Space Descriptor of
  get\_delta\_matrix\_weights2.
- 1 array\_partition;get\_delta\_matrix\_weights2;delta\_weights2;1;{
   cyclic,block};{1->256,pow\_2}
- 2 array\_partition;get\_delta\_matrix\_weights2;output\_difference ;1;{cyclic,block};{1->64,pow\_2}
- 3 array\_partition;get\_delta\_matrix\_weights2;last\_activations;1;{
   cyclic,block};{1->64,pow\_2}
- 4 unroll;get\_delta\_matrix\_weights2;loop\_1;{1->64,pow\_2}
- 5 unroll;get\_delta\_matrix\_weights2;loop\_2;{1->64,pow\_2}
- 6  $clock; \{10\}$



| Domain          |                            |                            |                                              | Maj                 | pping               | g           |                           |          |           |                          |                         |
|-----------------|----------------------------|----------------------------|----------------------------------------------|---------------------|---------------------|-------------|---------------------------|----------|-----------|--------------------------|-------------------------|
| Target<br>knobs | K <sub>1</sub><br>Resource | K <sub>2</sub><br>Resource | K <sub>3</sub><br>Part. type<br>Part. factor | K<br>Part.<br>Part. | 4<br>type<br>factor |             |                           | K<br>Unr | 5<br>roll | K <sub>6</sub><br>Unroll | K7<br>Clock             |
| Source<br>knobs |                            |                            | K <sub>1</sub><br>Part. type<br>Part. factor | K<br>Part.<br>Part. | 2<br>type<br>factor | Par<br>Part | K3<br>t. type<br>. factor | K<br>Unr | 4<br>roll | K5<br>Unroll             | K <sub>6</sub><br>Clock |
|                 | _                          | Dom                        | ain                                          |                     |                     | k           | (nob                      | valu     | es        |                          |                         |
|                 | _                          | Set of va<br>target k      | lues for<br>nob K <sub>6</sub>               | 1                   | 2                   | 4           | 8                         | 16       | 32        | 2 64                     | 128                     |
|                 |                            | Set of va<br>source k      | lues for<br>nob K5                           | 1                   | 2                   | 4           | 8                         | ↓<br>16  | 32        | 2 64                     |                         |

Fig. 7. Top: Mapping between the knobs of source and target CSD shown as an example (Snippets 1 and 2). Bottom: Correspondence between the knob value sets of knobs  $K_6$  and  $K_5$  in target and source, leading to a distance  $\Delta$  of 1.

last\_step\_scan has seven different knobs—each knob is one line of the descriptor—while the CSD of function get\_delta\_matrix\_weights2 has only six knobs. Fig. 7 shows the mapping  $M_{T,S}$  between the two CSDs. Five out of seven knobs of function last\_step\_scan can be mapped by using get\_delta\_matrix\_weights2 as a source design; while knobs  $K_1$  and  $K_2$  of last\_step\_scan are unmapped.

Once a mapping  $M_{T,S}$  is defined between a target configuration space  $X_T$ , defined with *I* different knobs, and a source configuration space  $X_S$ , defined with *J* different knobs, we compute their similarity as follows:

$$\operatorname{Sim}_{\operatorname{CSD}} = 1 - \left[\frac{1}{I} \sum_{i=1}^{I} \Delta(K_i, M_{T,S}(K_i)) / D_{\operatorname{MAX}}\right]$$
(7)

where  $D_{\text{MAX}}$  is a normalization factor—constant across all the source candidates for a given target—such that  $\text{Sim}_{\text{CSD}} \in [0, 1]$ . Then,  $\Delta(\cdot)$  is a function measuring the minimum distance between a source knob and a target knob

$$\Delta(K_i, K_j) = \sqrt{\sum_{n=1}^{|K_i|} {\binom{|K_j|}{\min} |\delta(k_n, k_m)|}^2} \quad k_n \in K_i, \quad k_m \in K_j.$$
(8)

The above equation sums up the distance between each target knob value  $k_n$  and the one that is closest to it among all source knob values  $k_m$ . The function  $\delta(k_n, k_m)$  computes the distance

between two knob values of the same directive type that has Z parameters, e.g.,  $k_n = (k_{n,1}, \ldots, k_{n,Z})$ 

$$\delta(k_n, k_m) = \sqrt{\sum_{z=1}^{Z} |k_{n,z}, k_{m,z}|^2}$$
(9)

where numerical parameter values are casted to their respective  $log_2$  value, and categorical parameter values are represented with one-hot encoding.

Since for unmapped knobs, there is no correspondence between source and target, the distance  $\Delta(\cdot)$  in (8) is computed between the values of the target knob and the default value of the directive.

Fig. 5-right shows the resulting Sim<sub>CSD</sub> matrix for the functions in the MachSuite Benchmark Suite considered in this work.

*Running Example:* Given the mapping between the functions in our running example (Fig. 2), for each target knob, we measure the distance with respect to the source one. Fig. 7 (bottom) shows the computation of the distance for the target knob  $K_6$  mapped to the source node  $K_5$ , each having a single value set of possible unrolling factors.  $K_6$  specifies eight factors (from 1 to 128, all of them powers-of-two), while  $K_5$  comprises seven values (from 1 to 64). Since the  $\delta$  is calculated among numerical values, the directive values are casted to their respective log2; therefore, the knobs discrepancy leads to a  $\Delta$  equal to  $(log_2 128 - log_2 64) = 1$ . When accounting for all target knobs, the CSD similarity between the source and target is 0.97.

#### E. Inference

After a source design is identified for a target design, the inference process transfers knowledge from the source to the target configuration space, hence implementing (2). In the first step of such a process, we extract the configurations belonging to the Pareto frontier in the source configuration space from a library of prior knowledge. These are peeled from the source design space, allowing the identification of second-rank Pareto configurations. Then, we proceed iteratively to extract higher ranked Pareto frontiers, until a certain number of these have been retrieved from the source design space.

Each selected configuration is transformed into a valid one in the target CSD. To this end, first, knobs in the source and target spaces are mapped according to the mapping function

| Domain          |                                                 | Inference                                              |                                       |                          |                                             |                             |
|-----------------|-------------------------------------------------|--------------------------------------------------------|---------------------------------------|--------------------------|---------------------------------------------|-----------------------------|
| Source<br>knobs |                                                 | K <sub>1</sub><br>cyclic<br>256                        | K <sub>2</sub><br>cyclic<br>8         | K4<br>32                 | K5<br>64                                    | K <sub>6</sub><br>10        |
| Target<br>knobs | K <sub>1</sub> K <sub>2</sub><br>2P_BRAM 2P_BRA | K <sub>3</sub><br><u>M</u> cyclic, block<br>1,,256,512 | K4<br><u>cyclic</u> , block<br>18,128 | K5<br>1, <u>32</u> ,,128 | <b>K</b> <sub>6</sub><br>1,2,4,8, <u>16</u> | K <sub>7</sub><br><u>10</u> |

Fig. 8. Inference from source to target design spaces from the running example. The inferred values of the HLS directive knobs are underlined in the bottom part of the figure.

 $M_{T,S}$  described in the previous section. If a target knob  $K_i$  is not be mapped to a source knob, the value of  $x_T^i$  is fixed to the directive default, since no prior knowledge related to that knob can be leveraged. The values of all other knobs are instead inferred from the source design space.

Then, given a source configuration  $x_S = [x_S^1, \ldots, x_S^J] \in X_S$ , with  $x_S^j$  being the value for knob *j*, the corresponding target configuration is set as  $x_T = [x_T^1, \ldots, x_T^J] \in X_T$ , where each component  $x_T^i$  is knob values associated to the knob *i*.

For each configuration component, we define the inference function  $g: X_S \rightarrow X_T$ , introduced in (2), as follows:

$$x_T^i = \arg\min_n \left\{ \delta\left(k_n, x_S^j\right) \right\}$$
(10)

where  $\delta(\cdot)$  is the distance function defined in (9),  $x_S^i$  is the values assigned to the *j*th knob of the source, and  $k_n \in K_i$  is the set of all possible sets of values that the knob  $K_i$  of the target design can assume, as specified by its CSD. Therefore, each target directive value  $x_T^i$  is assigned to the closest value to  $x_S^i$  among those specified in the target knob set for knob *i*.

*Running Example:* Let us assume that, among the many Pareto configurations of the source design in Fig. 2, one configuration has the directive values shown in Fig. 8. Given the mapping function from Fig. 7 and (10), we transform the source Pareto configuration into a valid target configuration. We map the partitioning factors—256 and 8 for  $K_1$  and  $K_2$  of the source—to the closest partitioning factor values of the target knobs—respectively, 256 and 8 for  $K_1$  and  $K_2$  of the target. Similarly, we infer the same partitioning type—cyclic—from the source Pareto configuration for the target ones. Finally, we map the source unrolling factors and the clock, 32, 64, and 10 for  $K_4$ ,  $K_5$ , and  $K_6$  to 32, 16, and 10 for  $K_5$ ,  $K_6$ , and  $K_7$  in the target, respectively.

#### **IV. EXPERIMENTS**

#### A. Experimental Setup

We implemented the similarity evaluation and inference algorithms in Python. We implemented the SE encoding in C++ as a custom compiler pass within the LLVM infrastructure, as described in Section III-C.

Our experiments targeted all of the functions in the MachSuite benchmarks collection [4], except those that expose very small design spaces, and those having a variable latency for different invocations during the benchmark execution due to input-dependent control flows. In total, 11 designs were discarded. The resulting suite comprises 39 functions, which have on average 40 lines of code and 308 in the biggest case.

For each design, we performed an extensive DSE across their configuration spaces up to tens of thousands of design points. We used Vivado HLS [7] to run synthesis with a target clock period of 10 ns and targeting a ZynqMP Ultrascale+(xczu9eg) FPGA chip. We collected the design configurations and synthesis results in a MySQL database.

In order to control the configuration space size,<sup>1</sup> we only employed power-of-two values for directives having a numerical range (e.g., loop-unrolling and array-partitioning factors), and, in some cases, we forced related knobs to have the same value (e.g., the partitioning factor of an array and the unrolling of a loop accessing it once every iteration). Such a decision corresponds to the intuitive choice of constraints that a designer would impose when tasked with the exploration of the design space. More than four years of single-core machine time are required to generate the knowledge base. To speed up this process, we collected synthesis data from up to 60 instances of Vivado HLS running in parallel. This allowed us to reduce to produce the database in approximately 25 days.

We use these extensive DSEs in two ways. On the one hand, we use these results as ground truth to assess the performance of our approach. On the other hand, we use them as a source of prior knowledge. In the latter case, we adopted a leave-oneout cross-validation, considering each design as a target using all others as candidate knowledge sources.

Similar to [8]–[11], we used as quality metric the Average Distance from Reference Set (ADRS), which expresses the distance between a reference curve P (the Pareto frontier from ground-truth data), and an approximated curve  $\bar{P}$ . The ADRS for two objective functions is defined as

$$ADRS(\bar{P}, P) = \left\lfloor \frac{1}{|P|} \sum_{p \in P} \min_{\bar{p} \in \bar{P}} (d(\bar{p}, p)) \right\rfloor$$
(11)

$$d(\bar{p}, p) = \max\{0, (A_{\bar{p}} - A_p)/A_p, (L_{\bar{p}} - L_p)/L_p\}$$
(12)

Low ADRS values are better, because they imply proximity between P and  $\overline{P}$ . In our scenario, the first objective function is the FPGA resource requirement (area, A) of an implementation, expressed as the average utilization of Flip-Flops, Look-Up Tables, DSP, and BRAMs. The second objective function is runtime performance, i.e., latency (L) in nanoseconds.

#### B. Results

**Outcome of Explorations.** Table II summarizes the results of the explorations performed with our methodology. It reports the target function IDs (used as indexes in Fig. 5), their benchmarks and the function names. Moreover, for each case, it provides the function IDs and the function names of the source having the highest similarity score, the obtained ADRS values in the target space, the number of synthesized configurations derived from that source, and the size of the related configuration space (|CS|).

For the vast majority of the targets, our approach requires very few syntheses to reach low ADRS scores. As an example, when targeting aes\_addRoundKey (row index: 20) while leveraging the knowledge of the

<sup>&</sup>lt;sup>1</sup>Even for the simple case in Snippet 2, considering all loop unrolling factors, two types of resources, two types of partitioning, and all partitioning factors would result in more than  $10^8$  configurations.



Fig. 9. ADRS evolution while changing the value of  $\alpha$ .

add\_bias\_to\_activations source, only 13 out of 500 possible synthesis rounds are performed, still resulting in a perfect identification of the Pareto frontier of best performing implementations (ADRS = 0). Only 35 out of thousands of configurations are synthesized for the gemm target (row index: 5) while reaching a very close Pareto frontier approximation (ADRS = 0.012).

We obtained the results of Table II by inferring up to the 10th-ranked Pareto frontier (as defined in Section III-A) and by fixing the tradeoff between SE and CSD similarity (introduced as  $\alpha$  in Section III-D) to 0.2. We further explore both settings in the rest of this section.

**Tuning of the Similarity Function.** Fig. 9 shows the ADRS scores achieved when selecting the most similar candidates according to the similarity metric in (5) while varying the parameter  $\alpha$ , i.e., the relative weight of SE and configuration space similarity. Data is shown on a logarithmic scale and in an aggregated form across all targets. Boxes encompass the first and third quartile of the ADRS values obtained by the DSEs of all targets, while the lines inside them indicate the median case. The skewers above and below each box are the upper and lower 1.5 interquartile. As before, we inferred configurations up until the 10th-ranked Pareto frontier in the source design space.

Fig. 9 that both SE and CSD have an impact on the quality of results and that CSD similarity generally has a more significant impact than the SE one. An  $\alpha$  value of 0.2 both minimizes the interquartile range and the median ADRS.

Effectiveness of the Similarity Metric. Fig. 10 highlights the importance of a proper source of prior knowledge in order to achieve effective explorations. It depicts four DSEs of the target design last\_step\_scan, from the running example, leveraging different sources of prior knowledge. Each plot depicts the ground truth of the target design space resulting from its exhaustive exploration—gray dots—as well as the Pareto frontier retrieved with the inference process—dark blue line. The top-left DSE shows the result of inferring configuration from get\_delta\_matrix\_weights2 (ID 27), the best-ranked source according to our similarity metric. In this case, the Pareto frontier is very well approximated and obtains a small ADRS of 0.004.

The top-right picture shows an example of DSE characterized by a low SE similarity, resulting in a partial approximation of the Pareto frontier, since only a few knobs can be mapped from source to target. In this case, the



Fig. 10. Example of DSEs targeting last\_step\_scan, inferring from different sources. (Top-left) Good Pareto approximation obtained with the best candidate source. (Top-right, bottom-left, and bottom-right) Low-quality Pareto approximations, from sources having low CSD and/or AS similarity. Gray dots represent the ground truth for the target design last\_step\_scan while the dark blue line represents the Pareto frontier obtained performing the inference with different sources.

inference process uses as source design the function bulk (ID 2), which is ranked 30th in order of similarity score. Similarly, the bottom-left picture shows the result of the DSE when product with bias output layer (ID 34) is employed as a source design. In this case, the similarity score is penalized by a low CSD similarity. Therefore, only a portion of the target design space can be explored, due to inadequate coverage of the knob values of  $X_T$  by the ones in  $X_S$ . The Pareto frontier is hence well approximated only for the  $X_T$ region for which prior knowledge is available, resulting in an ADRS of 1.5. Finally, in the plot at the bottom-right of Fig. 10, we show the result of the inference from backprop (ID 35). In this case, we observed both a low SE similarity (few knobs can be mapped from source to target) and a low CSD similarity (for mapped knobs, knob values are distant between source and target spaces) resulting in an extremely poor approximation of the Pareto frontier, since little prior knowledge can be harnessed. In this case, as reported in the figure, the retrieved Pareto frontier only comprises a single design point.

Fig. 11 generalizes these findings by reporting the aggregated ADRS values when selecting the sources with the highest similarity score for each target, the second-best choices, etc. As in Fig. 9, we plot the data on a logarithmic scale. Boxes encompass the first and third quartile of the ADRS scores across all targets, while the line inside the boxes indicates the median case. An order of magnitude separates the

#### ID Benchmark **Target function** Source ID Source function ADRS # Synth. CS spmv ellpack ellpack 28 get\_delta\_matrix\_weights2 0.034 65 1600 bfs bulk 28 39 2352 2 bulk get\_delta\_matrix\_weights2 0.01030 25 3 md knn md\_kernel 0.006 1600 get\_oracle\_activations1 28 4 viterbi viterbi get\_delta\_matrix\_weights2 $8.7e^{-4}$ 12 1152 5 gemm ncubed gemm 28 get\_delta\_matrix\_weights2 0.012 35 2744 6 gemm blocked 28 get\_delta\_matrix\_weights2 1.689 35 1600 bbgemm 7 fft strided fft 30 get\_oracle\_activations1 0.0007 15 1600 8 twiddles8 32 fft transpose product\_with\_bias\_input\_layer 0.0002 24 64 9 27 0.322 31 1024 ms\_mergesort get\_delta\_matrix\_weights1 sort merge 10 30 0.262 19 4096 merge get\_oracle\_activations1 28 11 stencil stencil2d stencil get\_delta\_matrix\_weights2 0.015 46 1344 28 get\_delta\_matrix\_weights2 12 stencil stencil3d stencil3d 1.88 16 1536 13 update 30 get\_oracle\_activations1 0.009 28 2400 14 28 0.007 46 4704 hist get\_delta\_matrix\_weights2 15 18 init local\_scan 0.07868 484 16 radix sort sum\_scan 36 add\_bias\_to\_activations 0.136 25 1280 90 17 last step scan 28 get delta matrix weights2 0.004 800 18 local\_scan 17 last\_step\_scan 0.005 71 704 19 32 product\_with\_bias\_input\_layer 0.0005 21 1792 ss\_sort 20 aes\_addRoundKey 36 add\_bias\_to\_activations 0 13 500 21 29 0 aes\_subBytes get\_delta\_matrix\_weights3 8 -50 aes\_addRoundKey\_cpy 22 28 get\_delta\_matrix\_weights2 0 71 625 23 0.013 16 8 20 aes aes\_shiftRows sum\_scan 24 aes mixColumns 25 aes\_expandEncKey 0 15 18 25 13 0.003 33 aes\_expandEncKey update 216 26 aes256\_encrypt\_ecb 4 viterbi 0.030 22 1944 27 get\_delta\_matrix\_weights1 28 get\_delta\_matrix\_weights2 0.002 139 21952 28 get\_delta\_matrix\_weights2 27 0.010 77 get\_delta\_matrix\_weights1 31213 29 28 0.030 222 21952 get\_delta\_matrix\_weights3 get\_delta\_matrix\_weights2 30 31 2.907 67 2401 get\_oracle\_activations1 get\_oracle\_activations2 31 get\_oracle\_activations2 get\_delta\_matrix\_weights3 29 0.051 19 1372 32 34 product\_with\_bias\_input\_layer product\_with\_bias\_output\_layer 3.560 3 1372 33 30 backprop 32 0 686 product\_with\_bias\_second\_layer product\_with\_bias\_input\_layer $2.5e^{-5}$ 34 32 24 392 product\_with\_bias\_output\_layer product\_with\_bias\_input\_layer

1

20

29

29

4

ellpack

viterbi

aes\_addRoundKey

get\_delta\_matrix\_weights3

get\_delta\_matrix\_weights3



#### LIST OF FUNCTIONS EXPLORED FROM MACHSUITE [4] (GROUPED BY BENCHMARK). THE TABLE REPORTS: TARGET FUNCTION *ID*, BENCHMARK NAME, TARGET FUNCTION NAME, SOURCE FUNCTION *IDs*, SOURCE FUNCTION NAMES, ADRS VALUE, NUMBER OF SYNTHESIZED CONFIGURATIONS, AND SIZE OF THE CONFIGURATION SPACE (|*CS*|)



backprop

soft\_max

take\_difference

update\_weights

add\_bias\_to\_activations

Fig. 11. ADRS according to source selection by metric ranking.



 $4.4e^{-5}$ 

0.002

0.053

0.0002

 $1.1e^{-4}$ 

4

5

9

8

3

2048

1372

64

512

1024

Fig. 12. ADRS according to source selection criterion.

best and third-best choice, after which performance remains almost constant and tangibly worse than in the case of the best-ranked source.

Finally, Fig. 12 compares the aggregated ADRS values obtained by our strategy (named Prior Knowl. in the figure), i.e., leveraging the source with the highest similarity, with three alternatives: 1) a perfect oracle always choosing *a-posteriori* the best source (Oracle); 2) the average of selecting all sources for each target (Average); and 3) a random

sampling of the target design space (Random)—disregarding knowledge transfer altogether. For the latter case, we performed several samplings equal to the ones required by our methodology, and we averaged the results over 100 different runs to minimize noise. A choice that is driven by the proposed similarity metric (Prior Knowl.) performs three orders of magnitude better than a choice that is not considering past explorations at all (Random) and  $12 \times$  better when compared with a blind choice for the source design (Average).

35

36

37

38

39



Fig. 13. Cumulative evolution of the ADRS while inferring from multiple Pareto frontiers. The average number of synthesis performed while the number of Pareto frontiers increases is shown with the dashed line.

TABLE III QUALITATIVE COMPARISON WITH SOA METHODOLOGIES. Average # of Synthesis Required to Obtain an ADRS  $\leq 0.04$ 

| CS      | Prior Knowl. | Lattice | Cluster | RF-TED | Zhong |
|---------|--------------|---------|---------|--------|-------|
|         |              | [10]    | [9]     | [8]    | [11]  |
| < 200   | 7            | 36      | 37      | 155    | NA    |
| < 700   | 10           | 64      | 64      | 391    | 19    |
| < 1800  | 22           | 230     | 290     | 1588   | 31    |
| < 6000  | 19           | 460     | 460     | 1903   | 32    |
| < 16000 | NA           | NA      | NA      | NA     | 35    |
| < 32000 | 38           | NA      | NA      | NA     | NA    |

**Tuning the Number of Source Pareto Frontiers.** In a further round of experiments (Fig. 13), we investigated the effect of varying the number of selected source Pareto frontiers. Each boxplot shows the aggregated ADRS outcomes when inferring an increasing number of Pareto frontiers from the highest similarity source to the target. While increasing the number of frontiers always lowers ADRS scores, diminishing returns can be observed for a number of frontiers  $\geq$  7. The number of required syntheses, instead, linearly increases with the amount of inferred Pareto frontiers.

**Comparison With State of the Art.** Table III compares our methodology (named Prior Knowl. in the table) with four related works that also aim to automate the optimization of HLS designs. According to the taxonomy of Section VI, three of them are refinement-based approaches [8]–[10], while one is a model-based approach [11]. For fairness, we group the results in different brackets according to the configuration size of the employed benchmarks. When no benchmark is reported for a given size on past works, we marked the corresponding table cell with *NA*. In other cases, data show the average number of synthesis runs required to reach an ADRS of 0.04, in which Zhong *et al.* considered as an excellent Pareto frontier approximation [11] [and which is attained by our approach in 29 out of 39 cases (see Table II)].

The numbers in this table show that our approach greatly outperforms the refinement-based approaches (see the required number of synthesis in the first four columns), and this advantage grows with the size of the configuration space. Our strategy is *even* competitive with the model-based strategy of Zhong *et al.* [11] (see the last column), while being agnostic to the number and type of optimizations that can be considered.<sup>2</sup>

Leveraging Prior Knowledge Across Different Clock Constraints and Platforms. All previous results assumed that

TABLE IV ADRS OBTAINED BY LEVERAGING THE KNOWLEDGE OF THE GET\_DELTA\_MATRIX\_WEIGHTS2 SOURCE WHILE EXPLORING THE LAST\_STEP\_SCAN TARGET, VARYING THE CLOCK CONSTRAINTS AND FPGA MODELS EMPLOYED FOR THE TARGET BENCHMARK

| Clock period (ns) | Technology  | ADRS   |  |  |
|-------------------|-------------|--------|--|--|
| 10                | ZynqMPU+    | 0.0044 |  |  |
| 5                 |             | 0.0044 |  |  |
| 25                | 25 ZynqMPU+ |        |  |  |
| 50                |             | 0.0044 |  |  |
|                   | Artix       | 0.0049 |  |  |
| 10                | Virtex      | 0.0045 |  |  |
|                   | Kintex      | 0.0045 |  |  |

the same clock constraint (10 ns) and FPGA model (Xilinx Zynq xczu0eg) are employed for all synthesis runs. In practice, both these conditions may not be satisfied, as often past explorations may be performed for an FPGA than is different from the one of interest for a new design. Similarly, the clock constraints may, in general, not be the same for sources and targets. Nonetheless, our methodology is robust toward variations of FPGA models and operating frequencies because it relies on the Pareto-dominance relationship in the cost/performance space of implementations in each DSE composing the knowledge base, as opposed to relying on the actual values of area and latency. This relationship, and consequently the set of Pareto configurations of a design, is not tangibly affected by the employed clock period and FPGA.

To investigate this characteristic, we have observed the ADRS obtained by identifying the implementations of the firstrank Pareto frontier of the last\_step\_scan benchmark (13 out of 1600 implementations) synthesized with various clock periods, and inferring the related configurations for a different clock constraint. We have evaluated the result of the inference for target and sources with clock constraints of 5, 10, 25, and 50 ns. For all the experiments no changes in the inferred Pareto frontier have been observed, and hence good approximations of the Pareto frontier have been obtained. These results confirm that indeed the set of Pareto configurations is not tightly dependent on the operating frequency.

Similar remarks are obtained when varying the FPGA employed for the synthesis of source and target. We have observed the ADRS obtained by identifying the implementations of the first-rank Pareto frontier of the last\_step\_scan benchmark synthesized with various FPGAs (ZynqMPUltrascale+ xczu9eg, Virtex xc7vh580, Kintex xc7k352, and Artix xc7a100), and inferring the related configurations for a different platform. For all the combinations of target and source platforms, the Pareto configurations are the same. Even in this case, the inferred Pareto frontier perfectly approximates the one obtained by an exhaustive exploration.

Concluding this round of experiments, Table IV shows an example of applying our methodology, again considering last\_step\_scan as a target, while employing the knowledge base in Table II. The design get\_delta\_matrix\_weights2 (ID 28) is identified as the most similar source and it is used to infer configurations up to its 10th-rank Pareto frontier—the same setting adopted

<sup>&</sup>lt;sup>2</sup>Zhong *et al.* only considered the loop unrolling and the dataflow directives.

for the results in Table II. In the first row of the table, provided for reference, both the clock constraint and the FPGA of the target design are equal to the ones used for the source. In rows 2–4, three different target clock constraints are used, while in rows 5–7, the target FPGA is different from the source one. In all cases, very similar, and small, ADRS are achieved, showcasing the robustness of our methodology.

#### V. EXTENSIONS AND FUTURE WORKS

Across a large variety of designs, leveraging prior knowledge can lead to the identification of high-quality implementations while requiring a low budget of synthesis.

However, the quality of the implementations discovered during a DSE relies on the presence of a similar DSE in the knowledge base. While we have shown that this scenario is not unrealistic-75% (29 out of 39) of the benchmarks have obtained an ADRS lower than 0.04-it is still possible that none of the available sources has a high degree of similarity for a given target. This drawback could be alleviated by partitioning target applications, and inferring prior knowledge for each of them separately, possibly from different sources. Moreover, a better coverage of target design spaces could be achieved by interpolating and extrapolating from the data in the source ones. In addition, more complex mapping algorithms could be conceived for matching source and target knobs (see Section III-D), e.g., taking into account loop nesting levels, the presence of loop carried dependency, control flow, etc. Implementing such strategies would be a natural extension of this work.

Our framework is, for the most part, independent from the metrics used to measure cost and performance (for the experiments in Section IV we considered area and latency, respectively) as these are only employed to identify Pareto configurations in the source space. Further tradeoffs can therefore be explored, for example, including energy efficiency as an alternative or additional dimension.

Finally, experimental evidence confirms the intuition that high similarity between sources and targets results in better approximation of the Pareto frontier of a given design space. We hope such findings will spur follow-up efforts investigating the link between presynthesis evaluations (e.g., similarity) and post-synthesis validation (e.g., ADRS).

#### VI. RELATED WORK

Recent works proposing strategies to navigate HLS design spaces can be organized into two main categories. On the one hand, model-based approaches cite [11]–[14] rely on an estimation of performance and resource requirement of a given optimization. While mandating very few synthesis runs, such strategies struggle when coping with multiple, interdependent optimizations. Hence, they are often limited to capturing the effect of only a few directives.

On the other hand, refinement-based frameworks rely on the outcome of some synthesis runs as a starting point, and aim to improve on this initial solution using different strategies, such as random forest [8], genetic algorithms [15], simulated annealing [16], clustering [9], or local search techniques [10].

Refinement-based methods are agnostic to the set of considered directives, but usually exhibit a slower convergence rate, because they must incrementally build a knowledge of the design space being explored.

Our proposed methodology neither relies on an *a-priori* model nor tries to infer it from an ongoing exploration. By focusing on *previous* explorations, we can instead tap on a knowledge base which is both rich in terms of available data points and robust toward different directives and their combination. Indeed, as summarized in the surveys by Pan and Yang [17] and Weiss *et al.* [18], useful information can be extracted from previous sets of experiments for which the outcomes are known, in order to efficiently explore new ones. Nonetheless, most of the strategies described in these surveys deal with domains that are distant from HLS, such as classification and object recognition applications.

In a recent editor's note, Doppa et al. [19] highlighted the importance of leveraging prior knowledge to effectively reduce the complexity of DSE problems. A small number of works take this stance in the context of hardware design. However, they do so from a different and somehow limited perspective: Dai et al. [20] aimed at improving the accuracy of HLS estimations using post-synthesis data, while the goal of Liu et al. is to estimate the performance on FPGA from an ASIC synthesis report [21]. Deshwa et al. [22] leveraged prior knowledge in the context of network-on-chip DSEs, to identify the promising starting point for their exploration methodology. More recently, Wang et al. [23] proposed a method to accelerate the process of HLS-driven DSE by precharacterizing micro-kernels offline and creating predictive models of these. Finally, Martins et al. [24] also presented a strategy to harness prior knowledge based on a similarity metric, but their framework is geared toward the selection of *compiler* optimizations, as opposed to targeting the hardware domain of HLS.

#### VII. CONCLUSION

In this work, we have proposed a methodology for leveraging prior knowledge in HLS-driven DSE. We consider the Pareto-dominance relationship among directive configurations in a source design, and translate Pareto configurations from a source into corresponding ones in a target. By transferring knowledge from similar DSEs, we are able to retrieve highquality implementations with a very sparse sampling of target DSEs, requiring few synthesis runs.

The proposed strategy assesses similarities between sources and targets—therefore identifying the most promising source to learn from—and performs inference of configurations based on a novel abstract representation. Such representation provides a succinct view of the design code and of the configuration space of a DSE.

Our methodology greatly outperforms state-of-the-art refinement-based strategies, requiring much fewer synthesis to derive the same DSE quality. Results are in line with modelbased methods, but, as opposed to them, we are not restricted in the type of supported directives.

#### REFERENCES

- G. Martin and G. Smith, "High-level synthesis: Past, present, and future," *IEEE Design Test Comput.*, vol. 26, no. 4, pp. 18–25, Jul./Aug. 2009.
- [2] H.-Y. Liu, M. Petracca, and L. P. Carloni, "Compositional system-level design exploration with planning of high-level synthesis," in *Proc. Conf. Design Autom. Test Eur. Conf. Exhibit.*, Dresden, Germany, Mar. 2012, pp. 641–646.
- [3] L. Piccolboni, P. Mantovani, G. Di Guglielmo, and L. P. Carloni, "COSMOS: Coordination of high-level synthesis and memory optimization for hardware accelerators," *ACM Trans. Embedded Comput. Syst.*, vol. 16, no. 5s, pp. 1–22, Sep. 2017.
- [4] B. Reagen, R. Adolf, Y. S. Shao, G.-Y. Wei, and D. Brooks, "MachSuite: Benchmarks for accelerator design and customized architectures," in *Proc. IEEE Int. Symp. Workload Characterization*, Raleigh, NC, USA, Oct. 2014, pp. 110–119.
- [5] C. Lattner and V. Adve, "LLVM: A compilation framework for lifelong program analysis & transformation," in *Proc. Int. Symp. Code Gener. Optim.*, San Jose, CA, USA, Mar. 2004, p. 75.
- [6] M. Paterson and V. Dančík, "Longest common subsequences," in Proc. Int. Symp. Math. Found. Comput. Sci., Jun. 1994, pp. 127–142.
- [7] Vivado High-Level Synthesis. Accessed: Aug. 10, 2020. [Online]. Available: https://www.xilinx.com/products/design-tools/vivado/ integration/esl-design.html
- [8] H.-Y. Liu and L. P. Carloni, "On learning-based methods for design-space exploration with high-level synthesis," in *Proc. 50th ACM/EDAC/IEEE Design Autom. Conf.*, Austin, TX, USA, Jun. 2013, pp. 1–6.
- [9] L. Ferretti, G. Ansaloni, and L. Pozzi, "Cluster-based heuristic for high level synthesis design space exploration," *IEEE Trans. Emerg. Topics Comput.*, early access, Jan. 15, 2018, doi: 10.1109/TETC.2018.2794068.
- [10] L. Ferretti, G. Ansaloni, and L. Pozzi, "Lattice-traversing design space exploration for high level synthesis," in *Proc. IEEE 36th Int. Conf. Comput. Design*, Orlando, FL, USA, Oct. 2018, pp. 210–217.
- [11] G. Zhong, V. Venkataramani, Y. Liang, T. Mitra, and S. Niar, "Design space exploration of multiple loops on FPGAs using high level synthesis," in *Proc. IEEE 32nd Int. Conf. Comput. Design*, Seoul, South Korea Dec. 2014, pp. 456–463.
- [12] N. K. Pham, A. K. Singh, A. Kumar, and M. M. A. Khin, "Exploiting loop-array dependencies to accelerate the design space exploration with high level synthesis," in *Proc. Design Autom. Test Eur. Conf. Exhibit.* (*DATE*), Grenoble, France, 2015, pp. 157–162.

- [13] G. Zhong, A. Prakash, Y. Liang, T. Mitra, and S. Niar, "Lin-analyzer: A high-level performance analysis tool for FPGA-based accelerators," in *Proc. 53rd ACM.EDAC/IEEE Design Autom. Conf.*, Austin, TX, USA, Jun. 2016, pp. 1–6.
- [14] J. Zhao, L. Feng, S. Sinha, W. Zhang, Y. Liang, and B. He, "COMBA: A comprehensive model-based analysis framework for high level synthesis of real applications," in *Proc. Int. Conf. Comput.-Aided Design*, Irvine, CA, USA, Oct. 2017, pp. 430–437.
- [15] B. C. Schafer and K. Wakabayashi, "Machine learning predictive modelling high-level synthesis design space exploration," *IET Comput. Digit. Techn.*, vol. 6, no. 3, pp. 153–159, May 2012.
- [16] A. Mahapatra and B. C. Schafer, "Machine-learning based simulated annealer method for high level synthesis design space exploration," in *Proc. Electron. Syst. Level Synth. Conf.*, San Francisco, CA, USA, 2014, pp. 1–6.
- [17] S. J. Pan and Q. Yang, "A survey on transfer learning," *IEEE Trans. Knowl. Data Eng.*, vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
- [18] K. Weiss, T. M. Khoshgoftaar, and D. Wang, "A survey of transfer learning," *SpringerOpen J. Big Data*, vol. 3, no. 1, p. 9, May 2016.
- [19] J. R. Doppa, J. Rosca, and P. Bogdan, "Autonomous design space exploration of computing systems for sustainability: Opportunities and challenges," *IEEE Design Test*, vol. 36, no. 5, pp. 35–43, Oct. 2019.
- [20] S. Dai, Y. Zhou, H. Zhang, E. Ustun, E. F. Young, and Z. Zhang, "Fast and accurate estimation of quality of results in high-level synthesis with machine learning," in *Proc. 26th IEEE Symp. Field Program. Custom Comput. Mach.*, Boulder, CO, USA, Apr. 2018, pp. 129–132.
- [21] S. Liu, F. Lau, and B. C. Schafer, "Accelerating FPGA prototyping through predictive model-based HLS design space exploration," in *Proc. 56th ACM/IEEE Design Autom. Conf.*, Las Vegas, NV, USA, Jun. 2019, p. 97.
- [22] A. Deshwal, N. K. Jayakodi, B. K. Joardar, J. R. Doppa, and P. P. Pande, "Moos: A multi-objective design space exploration and optimization framework for NoC enabled manycore systems," *ACM Trans. Embed. Comput. Syst.*, vol. 18, no. 5s, p. 77, Oct. 2019. [Online]. Available: https://doi.org/10.1145/3358206
- [23] Z. Wang, J. Chen, and B. C. Schafer, "Efficient and robust highlevel synthesis design space exploration through offline micro-kernels pre-characterization," in *Proc. Design Autom. Test Eur. Conf. Exhibit.* (DATE), Grenoble, France, 2020, pp. 145–150.
- [24] L. G. Martins, R. Nobre, J. M. Cardoso, A. C. Delbem, and E. Marques, "Clustering-based selection for the exploration of compiler optimization sequences," ACM Trans. Archit. Code Optim., vol. 13, no. 1, p. 8, Apr. 2016.