A Domain-Specific Language and Compiler for Computation-in-Memory Skeletons

Jintao Yu
Delft University of Technology
Mekelweg 4, 2628 CD
Delft, The Netherlands
j.yu-1@tudelft.nl

Tom Hogervorst
Delft University of Technology
Mekelweg 4, 2628 CD
Delft, The Netherlands

Razvan Nane
Delft University of Technology
Mekelweg 4, 2628 CD
Delft, The Netherlands
r.nane@tudelft.nl

ABSTRACT

Computation-in-Memory (CiM) is a new computer architecture template based on the in-memory computing paradigm. CiM can solve the memory-wall problem of classical Von Neumann-based computing systems by exploiting application-specific computational and data-flow patterns with the capability of performing both storage and computations of emerging resistive RAM technologies (e.g., memristors). However, to efficiently explore and design such radically new application-specific CiM architectures, we require fundamentally new algorithm specification and compilation techniques. In this paper, we introduce a domain-specific language to express not only the computational patterns of an algorithm but also its spatial characteristics. Furthermore, we design a compiler that is able to transform these patterns into highly-optimized CiM designs. Experiments demonstrate the functional correctness of the language and the compiler as well as an order of magnitude speedup improvement over a multicore system in both performance and energy costs.

Keywords

Domain specific language; computation-in-memory; algorithmic skeleton; memristor

1. INTRODUCTION

In classical Von Neumann-based computing systems, memory access and data transfer operations are becoming a big bottleneck for Big Data applications [5]. This is not only because of the limited transfer speed at which data is retrieved from and written back to memory, but also because performing a large amount of memory accesses incurs high energy costs. A solution is to design application-specific memristor-based Computation-in-Memory (CiM) architectures [9], which feature a large memristor crossbar to execute massively parallel applications. Because we can perform both computations and storage using memristors [19], no off-crossbar data transfers are required during execution, enabling CiM-based architectures to solve the memory wall bottleneck.

Application-specific CiM-based solutions can therefore result in significant performance gains over multicore systems [10]. To implement highly-optimized CiM architectures, a designer would need to explicitly spatially program the application-specific computational and data-flow patterns onto the crossbar. This is a new form of Spatial Computation paradigms [3], which map programs into completely distributed hardware. Although general purpose languages can be used for spatial computation [3], Domain-Specific Languages (DSLs) facilitate domain experts to generate optimal solutions [16]. Some DSLs have been developed for specific platforms, such as MaxJ for Data-Flow Engine [15] and ANML for Automata Processor Engines [8]. However, specifying and compiling customized application layouts for a memristor crossbar poses a new challenge. Due to the passive nature of the memristor, the mapping and routing results directly influence the scheduling phase [20] because data movements on the crossbar have to be controlled as well. As a result, existing DSLs for spatial computation are not applicable to memristor-based CiM architectures.

To solve aforementioned challenge, a new type of skeleton was proposed: the CiM skeleton [20]. Skeletons (formally referred to as algorithmic skeletons) are high-level software constructs used to hide the complexity of parallel computer systems from a programmer [6]. Additionally, CiM skeletons provide the scheduling, mapping, and routing information needed to program applications on the CiM architecture. In this paper, we propose a DSL and a DSL-based compilation flow to express and implement CiM skeletons. We make the following contributions:

- A skeleton-based Domain Specific Language, CiM DSL, to describe the low-level crossbar details of the spatial patterns of an algorithm.
- A compiler that generates a scheduled, mapped, and routed system from this DSL using CiM Skeletons.
- The verification of this compiler and comparison with a multicore system.

The paper is organized as follows. Section 2 provides background information about the CiM architecture. Section 3 and Section 4 describe the CiM DSL and the implementation of its compiler. Thereafter, Section 5 validates the compiler with multiple applications. Finally, Section 6 concludes the paper.
2. BACKGROUND

The CiM architecture is a computational template in which application-specific accelerators are instantiated and executed in-memory under the control of a CPU. Figure 1a shows a high-level view of the CPU/CiM heterogeneous system. CiM’s hardware consists of two parts: a large reconfigurable crossbar of horizontal and vertical nanowires with memristors on every intersection, and a CMOS-based controller that activates voltages on the nanowires to control the memristors. Memristors can be configured as memory elements as well as computation elements [19], allowing both to be performed in the crossbar. The high density of a memristor crossbar enables this architecture to fully exploit the parallelism in embarrassingly parallel applications. Furthermore, since a CiM crossbar coexists with the general memory space (RRAM in Figure 1a), we do not have any off-chip memory accesses. This leads to significant improvements in both performance and energy consumption.

Figure 1c shows CiM’s programming model. Before programmers can use application-specific accelerators in CiM, an algorithm designer is required to create the library functions (i.e., accelerators) optimized for minimum communication maximum parallelism on the crossbar. The algorithms are specified in the CiM DSL, using skeleton operators to specify not only the floorplan but also the communication paths\(^1\) between the computational units. Please note that in the CiM crossbar, each corner point that changes the direction of the path needs to be explicitly controlled [18]. As a result, the scheduling is dependent on the mapping and routing. The defined CiM skeletons are used further to specify the complete algorithm that is stored in the function library. Subsequently, an application programmer can use this library in a different high-level language (e.g., C). Therefore, CiM DSL is designed for the library developers rather than the application developers. Functions not contained in the library are processed by a CiM High-Level Synthesis (HLS) tool. Finally, the compilation results of different tools are linked together. Figure 1a illustrates one possible compilation result of the sample program shown in Figure 1b. One circuit is generated using the library functions (colored green), and another one is generated by the CiM HLS tool (colored blue). Code that doesn’t need to be executed on the CiM crossbar (colored red), is compiled regularly for the CPU. In this paper we focus only on the CiM DSL compiler.

3. CiM DSL

In this section, we describe the rules of CiM DSL’s syntax using one variant of Extended Backus-Naur Form (EBNF) [2]. We design the DSL with the goal to create and use CiM skeletons easily. Figure 2 shows the dependence between different CiM concepts. First, we define skeleton language operators, which are language constructs that are used to specify different connections between functional blocks and their relative position. Subsequently, the operators are used to define CiM Skeletons. Finally, both skeleton operators and CiM skeletons are used to build library functions.

3.1 Circuits and Expressions

The CiM DSL creates systems by connecting functional blocks together in expressions. The syntax related to this is as follows:

\[
\begin{align*}
\text{Circuit}_\text{decl} &::= 1D \text{ File}\_name & (1) \\
\text{Exp} &::= \text{Circuit} \mid \text{Int} \mid 1D \text{ Exp}_1 \text{ Exp}_2 \ldots & (2) \\
&\quad \mid \text{Exp}_1 \text{ Op} \text{ Exp}_2 \mid \text{Map} \mid \text{Fold} \mid \text{Repeat} & (3) \\
\text{Op} &::= \ast \text{D}_\ast \mid \ast \text{H}_\ast \mid \ast \text{I}_\ast & (4) \\
\text{Map} &::= 1D \text{ Range Exp} & (5) \\
\text{Fold} &::= \text{Op} \text{ Exp} & (6) \\
\text{Repeat} &::= \text{Int} \text{ Exp} & (7) \\
\text{Range} &::= \text{Int} \mid \text{Int} \left[ \text{Ar}_{\text{Op}} \text{ Int} \right] \text{ Int} & (8) \\
\text{Ar}_{\text{Op}} &::= \ast \ast \mid \ast \ast \mid \ast \ast \mid \ast \ast & (9)
\end{align*}
\]

A circuit declaration (Circuit\_decl) declares the name of the primitive circuit used in the program and specify its

\(^1\)We refer to communication paths simply as paths in the remainder of the paper.
3.2 Statements and Signals

Statements connect the circuits declared in expressions to signals, to give explicit control over the data flow to and from those circuits. A signal is a connection path between two data locations. The relevant CiM DSL syntax is as follows:

\[
\begin{align*}
\text{Statement} & := \text{Signal} [\text{Exp} \ \text{Signal} | \text{Loop}] \\
\text{Loop} & := \text{1D Range Statement} + \\
\text{Signal} & := \text{1D Range} | \text{Signal} + + \text{Signal} | \text{Zip}(\text{Signal}, \text{Signal})
\end{align*}
\]

A statement contains an input signal, an output signal and optionally an expression (Rule 10). It will be translated into a group of primitive circuits. Loops can be used to compactly express a group of similar statements, which is similar to the semantics of map (Rule 11). The symbol “++” means the loop structure accepts one or more statements. Signals have a name and a range, and they can be built using signal operators “++” (Rule 12) and “zip” (Rule 13). “++” concatenates two signals sequentially and “zip” builds a single signal from the elements of two signals interleaved with one another.

Statements are allowed to specify only the connection between input and output signals, without including any component (Rule 10). This feature is useful for shuffling the signals. Listing 1 shows an example of a signal shuffle, specifically one named the butterfly pattern, which is used in a bitonic sort function.

Listing 1: Butterfly Shuffle Statement

1. \[
\text{zip([in[0:2:m/2], in[2:2:m]] ++ zip([in[1:2:m/2], in[1+n/2:2:m]]) => out[0:m];}
\]

3.3 Programs and Components

A complete CiM DSL program consists of one or more circuit declarations and one or more components (Rule 14). A component describes a library function or a CiM skeleton. The CiM DSL syntax concerning these two language constructs is as follows:

\[
\begin{align*}
\text{Program} & := \text{Component}* \\
\text{Component} & := \text{1D Sig dec}[\ \text{Par dec}] \text{ Statement}* \\
\text{Sig dec} & := \{ \text{ID Int}\}^{*} \{ \text{ID Int}\}^{*} \\
\text{Par dec} & := \{ \text{Type ID}\}^{*} \\
\text{Type} & := \text{int} | \text{comp}
\end{align*}
\]

The head of a component contains its name (ID), signal declaration (Sig dec), and parameter declaration (Par dec) (Rule 15). The square brackets surrounding Par dec means it is optional. Among all the components in a program, one and only one component should be named as “main”. It represents the library function defined by the program. The signal declaration declares input and output signals, containing names (ID) and their sizes (Int) (Rule 16). Parameter declaration specifies the type and names of the parameters (Rule 17). Currently, CiM DSL supports two types of parameters, which are integer (“int”) and components (“comp”) (Rule 18). The body of a component is one or more statements (Rule 15), which link expressions with input and output signals (Rule 10).
As an example of a complete CiM DSL program, Listing 2 shows the code for matrix multiply $A_{m \times n} \times B_{n \times k}$ in CiM DSL. It is built based on the inner product function. To achieve the best performance, the primitive circuits used in the inner product are arranged following an H-tree pattern [10]. Please refer to Figure 6a for a visualization of the target layout. The inner product hardware is duplicated to calculate all the elements of the result matrix in parallel.

Listing 2: Matrix Multiply in CiM DSL

```plaintext
libmod mul(add.lib);
libmod mul(mul.lib);

comp main<in[32]|out[16]>() {
    in[0:32] => matrix_multiply(4,4,4) => out[0:16];
}

comp matrix_multiply<in[m*n], B[n*k] | out[m*k]>
    (int m, int n, int k) {
    forV i = 0:m do
        A[m*n+i:n+i+n] ++ B[0:n*k] =>
        row(n, k) => out[k+i:k+i+k];
    end
}

comp row<in[n], b[n*k] | out[n]>(int n, int k) {
    forV i = 0:k do
        a[0:n] + b[1:n+i+n] => inner_product(n) => out[i];
    end
}

comp inner_product<in[a[n]], B[n*k] | out[1]>(int n) {
    zip(a[0:n], b[0:n]) => repeat[n](mul)
    *H => reduce(n/2, add) => out[0];
}.

comp reduce<in[2*n] | out[1]>(int n, comp c) {
    in[0:2*n] => foldR<*,H=>(map<1 = n: n/2: 0>
    (repeat[i](c))) => out[0];
}
```

Line 1 and line 2 declare two primitive circuits, i.e. the adder (add) and the multiplier (mul). The keyword of this declaration is “libmod”, and the name suffix of the library files is “.lib”. After these declarations, the program defines five components with the key word “comp”. The “main” component (line 3 to line 5) specifies that the sizes of two matrices are both four by four. The declaration of input and output ports are surrounded with a pair of angle brackets (“<”,”>”), which means that input locations (e.g., registers/storage locations in the crossbar) are transferred via signals to the matrix multiplication’s input ports. Note the use of the “++” signal operator in lines 10 and 15 to concatenate two arrays into one, and of the “zip” signal operator in line 19 to zip the elements into two arrays into one. `matrix_multiply` and `row` components both employ a fork-loop, which expresses a group of statements. It arranges the circuits that are obtained from the loop statements vertically (forkV) or horizontally (forkH). Therefore, line 15 duplicates $k$ inner-product function in a row, and line 10 further extends these rows into a matrix. The `repeat` expression in lines 19 and 24 represents duplicates of a circuit, which are executed in parallel. It does not contain mapping information and will be further mapped using operators. Line 23 builds a binary tree as shown in Figure 4. The map expression duplicates a circuit using a variable (i) in the range (n/2:0). The range is divided into three fields by colons (;). The first field (n) is the starting number. The second field (/2) is applied to n repetitively until the value reaches the third field (0). All these intermediate values, not including the third field, constitute the range. Finally, the group of circuits represented by map are linked with the operator specified in the fold expression (i.e., “*H”).

4. IMPLEMENTATION

We developed CiM DSL compiler using Spoofax, a language workbench in the Eclipse IDE [11]. The development consists of two phases. First, we define the syntax and the name binding rules, which are used to parse user programs into an Intermediate Representation (IR). Second, we define transformation and generation rules, which annotate and transform the IR according to our defined skeleton operators, and emit the code for the desired circuits, respectively. Spoofax has good support for both phases so that the work-load of developing the new DSL is significantly reduced. After defining all the rules, the CiM DSL compiler is generated by Spoofax.

A CiM DSL program is compiled into circuits following four steps, namely parsing, dereferencing, transforming, and emitting. The parsing builds an Abstract Syntax Tree (AST), which is subsequently dereferenced to eliminate language structures such as fork-loop, fold, map, and repeat. During dereferencing, the loops are fully unrolled, and fold, map, and repeat are applied. Figure 5 shows this process for the expression in the reduce component in Listing 2 (line 22). In this case, we assume n is four. The left side illustrates the AST obtained after parsing, where the range is calculated into a noncontinuous array. Then, map, repeat, and foldR are applied successively as described in Section 3. The result of the dereferencing is also an AST in which all the leaf nodes are primitive circuits, and all the internal nodes are skeleton operators. The result AST is shown in the right part of Figure 5, where the *H operator is replaced by the name HTree that implements this pattern.

The transform process constructs circuits based on the AST that is obtained in the previous step. The circuit construction is performed using a post-order, depth-first tree traversal using the Transform_ske function that is dynamically dispatched according to the type of the operators. The function contains the scheduling, placement, and routing algorithms that will be applied to the circuits according to the particular CiM skeleton that it implements. Figure 7 shows an example of transforming the AST on the right of Figure 5. Its left part shows the leaf nodes of the dereferenced AST and the circuits they represent. The right part is the transformation of this AST, which is done in two steps. The lower Htree node is transformed first, which combines four adders and two adders into two group adders. Next, these two groups are transformed by the root node to form a whole circuit together with another adder. The small squares shown in this figure represent mirrors.

Finally, the code generation phase emits three types of
files, which are VHDL, mapping constraints, and graphic outputs. The VHDL contains the port maps and an FSM, which are produced according to the mapped, routed, and scheduled Data Flow Graph (DFG) of the algorithm. This VHDL can be used for behavioral simulation. We generate graphic output to examine the placement and routing results. This output is in the C language, invoking a graphic library named pslib to produce a graph in postscript (.ps) format.

5. EXPERIMENTAL RESULTS

We use four functions to validate CiM DSL and its compiler, which are vector inner product, matrix multiply, Finite Impulse Response (FIR) filter, and bitonic sort. The code of the first two functions and a small part of bitonic sort is shown in Listing 2 and Listing 1. The data type used in these functions are 32-bit integer.

5.1 Compiler Outputs

The attributes of primitive circuits we used in this case study are listed in Table 1. The *adder* (*Add*) and the *multiplier* (*Mul*) are designed by previous works [12, 1]. The original design is not based on a crossbar, so the area and latency are moderately different when we adapt it to CiM. We will not discuss these changes since the hardware design is beyond the scope of this work. The *Greater than* (*Gt*) component, which is used in bitonic sort, cannot be found in existing works. Therefore we estimate their attributes.

![Figure 6: Graphic output.](image)

<table>
<thead>
<tr>
<th></th>
<th>Latency</th>
<th>Width</th>
<th>Height</th>
<th>Energy</th>
<th>Ref.</th>
</tr>
</thead>
<tbody>
<tr>
<td>Add</td>
<td>178</td>
<td>9</td>
<td>32</td>
<td>124.8</td>
<td>[12]</td>
</tr>
<tr>
<td>Mul</td>
<td>803</td>
<td>256</td>
<td>128</td>
<td>4407.8</td>
<td>[1]</td>
</tr>
<tr>
<td>Gt</td>
<td>27</td>
<td>128</td>
<td>192</td>
<td>93</td>
<td>–</td>
</tr>
<tr>
<td>Copy</td>
<td>3</td>
<td>–</td>
<td>–</td>
<td>12.8</td>
<td>[1]</td>
</tr>
</tbody>
</table>

The latency is listed in terms of Clock Cycle (CC). The area is represented by the required number of rows (Height) and columns (Width). The energy consumption of the adder and the multiplier is not given in the original papers [12, 1]. Actually, it is impossible to report accurate energy consumption at the design phase because this is input data dependent. However, we can estimate the maximum value by assuming every IMPLY or FALSE operation [12] consumes the energy of switching states. This energy varies among technologies, from 0.1 fJ [17] to 230 fJ [13]. In this paper, we use 100 fJ. We also calculated the energy consumption for copy operation following the same way. Its implementation, which is essentially two NOT operations, is taken from [1].

We generated the graphic outputs of inner product, FIR filter, and bitonic sort as shown in Figure 6. Matrix multiply is not presented because it is a matrix of inner products, which is very large but contains little information. The vector size of inner product is 16. The tap size of FIR filter is four and the input size is two. The input size of bitonic sort is eight. In Figure 6a and Figure 6b, the large and small boxes denote multipliers and mirrors respectively. The adders are tiny bars beside the mirrors. In Figure 6c, the small boxes are also mirrors while the large ones are *gls*. The red and green dashes in these figures indicate the input and output ports while the blue lines show the routing. The layouts are the same as we designed, which demonstrates the DSL works as intended.

5.2 Performance Evaluation

We enlarged the problem sizes and compared the performance of generated circuits with a multicore platform. The problem sizes and the attributes of the generated circuits are listed in Table 2. For the FIR filter, the tap size is 64 and the input size is 512. We calculated the area of the generated circuits based on the memrisor density predicted by ITRS [7], which is $2.38 \times 10^{11}$ bit/cm$^2$. These library functions are simulated using Sniper [4], and the energy consumption is reported by McPAT [14]. The targeted multicore system is Intel Xeon X7460 processor that consists of six cores on a


### Table 2: Experimental Results

<table>
<thead>
<tr>
<th>Problem</th>
<th>Lat/CC</th>
<th>Width</th>
<th>Height</th>
<th>Area/mm$^2$</th>
<th>Energy/mJ</th>
<th>Lat/µs</th>
<th>Energy/mJ</th>
<th>Speedup</th>
</tr>
</thead>
<tbody>
<tr>
<td>Inner product</td>
<td>32768</td>
<td>3653</td>
<td>20448</td>
<td>73696</td>
<td>0.1502</td>
<td>272.7</td>
<td>15.99</td>
<td>74.65</td>
</tr>
<tr>
<td>Matrix multiply</td>
<td>32×32</td>
<td>1753</td>
<td>19456</td>
<td>72704</td>
<td>0.5943</td>
<td>174.5</td>
<td>21.96</td>
<td>99.54</td>
</tr>
<tr>
<td>FIR filter</td>
<td>64/512</td>
<td>12773</td>
<td>8192</td>
<td>147456</td>
<td>0.5075</td>
<td>302.7</td>
<td>35.95</td>
<td>23.70</td>
</tr>
<tr>
<td>Bitonic Sort</td>
<td>256</td>
<td>1401</td>
<td>58240</td>
<td>32768</td>
<td>0.8019</td>
<td>0.0008</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

#### 5.3 FPGA Prototyping

To confirm the validity of the graphical output, we built also an FPGA prototype to simulate the layout of the CiM inner product design on FPGA. We synthesized the generated area constraint file and the VHDL file with Xilinx Vivado. The implemented design of inner product is shown in Figure 8. Primitive circuits are recognizable by their yellow borders, and connections between circuit are represented by orange lines. The sizes of the adders and multipliers are different from Figure 6a. Despite this difference in visualization of the connections, we can verify that the floorplanning and interconnect information was included in the VHDL and constraint files correctly.

#### 6. CONCLUSION

In this paper, we introduce a DSL and compiler to design programs to be run on future CiM-based systems. The skeleton-based DSL allows for the modular, high-level description of a system, and the compiler schedules, places, and routes the system using information provided by CiM skeletons. The functional correctness of the DSL is verified using VHDL files generated by the compiler, and the mapping and routing results are confirmed by generating graphical output files. This DSL can also be used for FPGA designs and it will be investigated in future work.

#### 7. REFERENCES


---

Figure 8: FPGA floorplanning for inner product.

die of 503 mm$^2$, running at 2.66 GHz each. Those cores have 64 kB L1 cache each and share a 16 MB L3 cache. Every two cores share an L2 cache of 3 MB. The latency (Lat) and energy consumption are also listed in Table 2. The speedup of CiM’s latency over multicore is calculated, which is between 23x and 99x. The area of the circuits generated by CiM compiler is very small compared with this processor. The energy consumption is less than 1% of the multicore. Please note that we do not include the CMOS controller in the energy evaluation because there is no backend synthesis tool yet for the CiM system. However, we do not expect the controller to have a big impact on the reported numbers due to its simplicity.

![Image of FPGA floorplanning](image-url)