# Towards Formal Verification of Optimized and Industrial Multipliers 

Alireza Mahzoon ${ }^{1} \quad$ Daniel Große ${ }^{1,2} \quad$ Christoph Scholl ${ }^{3} \quad$ Rolf Drechsler ${ }^{1,2}$<br>${ }^{1}$ Institute of Computer Science, University of Bremen, Germany<br>${ }^{2}$ Cyber-Physical Systems, DFKI GmbH, Bremen, Germany<br>${ }^{3}$ Institute of Computer Science, University of Freiburg, Germany<br>\{mahzoon,grosse,drechsle\}@informatik.uni-bremen.de, scholl@informatik.uni-freiburg.de


#### Abstract

Formal verification methods have made huge progress over the last decades. However, proving the correctness of arithmetic circuits involving integer multipliers still drives the verification techniques to their limits. Recently, Symbolic Computer Algebra (SCA) methods have shown good results in the verification of both large and non-trivial multipliers. Their success is mainly based on (1) reverse engineering and identifying basic building blocks, (2) finding converging gate cones which start from the basic building blocks and (3) early removal of redundant terms (vanishing monomials) to avoid the blow-up during backward rewriting. Despite these important accomplishments, verifying optimized and technology-mapped multipliers is an almost unexplored area. This creates major barriers for industrial use as most of the designs are area and delay optimized. To overcome the barriers, we propose a novel SCA-method which supports the formal verification of a large variety of optimized multipliers. Our method takes advantage of a dynamic substitution ordering to avoid the monomial explosion during backward rewriting. Experimental results confirm the efficiency of our approach in the verification of a wide range of optimized multipliers including industrial benchmarks.

\section*{I. Introduction}


An integer multiplier is one of the most frequent units in many applications (e.g. signal processing, cryptography, and machine learning). The wide variety of multiplication algorithms and a large number of building blocks make it one of the most complex parts of many designs. Integer multipliers are usually optimized in terms of area and delay particularly in industrial applications where having an efficient design is critical. Formal verification of the optimized multipliers is highly important to ensure the correctness of the circuit after optimization. However, the effects of optimization on the multiplier structure make the verification a tough challenge.

In the last 20 years, several formal verification methods have been proposed to prove the correctness of integer multipliers. However, they have serious disadvantages: (a) Decision Diagrams (DDs) (such as BDDs and *BMDs) [1] suffer from memory blow-up when the multipliers are large, (b) Boolean satisfiability (SAT) and Satisfiability Modulo Theories (SMT) face exponential run-times when the bit-width increases, (c) Theorem Proving [2] needs manual effort, and (d) Term Rewriting [3] is not fully automated as a manual update of rewrite rules is necessary for implementations which are not yet represented in the database.
Recently, Symbolic Computer Algebra (SCA) approaches have addressed many challenges of formal verification, see for instance [4], [5], [6], [7], [8], [9], [10], [11], [12]. SCA-based

[^0]multiplier verification generally consists of three main steps: 1) representing the function of the multiplier based on its inputs and outputs as a Specification Polynomial $S P$, 2) capturing the logical gates (nodes) as a set of polynomials $P_{G}$, and 3) proving the membership of $S P$ in the ideal generated by $P_{G}$ using Gröbner basis theory.

In the just mentioned 3rd step, $S P$ is divided by $P_{G}$ in a step-wise process called backward rewriting, and eventually, the remainder is evaluated. If this remainder is zero, the multiplier is correct; otherwise, it is buggy.

The local removal of redundant monomials (so-called vanishing monomials) and reverse engineering have boosted the efficiency of SCA-based methods in verifying both large and non-trivial multipliers. Recently, [10] showed that (i) the vanishing monomials cause an explosion in the number of monomials during backward rewriting of non-trivial multipliers, and (ii) the vanishing monomials are formed when substituting a converging gate, i.e. a gate to which both outputs of a Half Adder (HA) converge. Hence, algorithms for detecting Converging Gate Cones (CGCs) and early removal of vanishing monomials have been proposed to make the global backward rewriting phase vanishing-free. Furthermore, [13] demonstrated that identifying basic building blocks called atomic blocks, e.g. HAs, Full Adders (FAs), etc., is essential in revealing all vanishing monomials and finally to speed up the overall verification process.

Despite the progress in formal verification of multipliers, still proving the correctness of optimized designs is an almost unexplored area. Applying the state-of-the-art SCA-based verification methods to optimized multipliers leads to an explosion in the backward rewriting phase. The reason for the explosion is that optimization typically destroys the boundaries of several atomic blocks. As a consequence, the compact word-level polynomials for the "lost atomic blocks" are no longer available during backward rewriting. Likewise the static substitution order for backward rewriting of the state-of-the-art SCA-verifiers becomes now susceptible to the ordering of the nodes for lost atomic blocks.
Contribution: In this paper, we introduce the novel SCA-method DyPoSub ${ }^{1}$ to verify optimized and industrial multipliers. We first investigate how optimization changes the structure of a multiplier netlist. As a result of these changes, the state-of-the-art SCA-verifiers fail and we analyze the reasons, i.e. in particular the effect of destroyed boundaries for atomic blocks during backward rewriting. We then show that this effect can be mitigated by a dynamic substitution order which

[^1]

Fig. 1. $2 \times 2$ mult
allows keeping the size of the current specification polynomial moderate. To the best of our knowledge, a dynamic substitution order has not been considered before in SCA-verification. The experimental results confirm that our proposed method can verify a wide range of optimized multipliers including industrial benchmarks which was not possible before.
Related Work: As mentioned above, a series of SCA-based verification approaches emerged for integer multipliers in the last five years [4], [5], [6], [7], [8], [10], [11], [12]. While there has been significant progress from simple multiplier architectures to complex architectures as well as in the size of the multipliers, multipliers processed by strong logic synthesis and technology mapping have been still out of reach for the existing approaches. Moreover, the idea of dynamic substitution ordering has not been investigated so far.

## II. Preliminaries

## A. Multiplier Structure

Most integer multipliers consist of three stages: (1) Partial Product Generator (PPG) which creates partial products from two inputs, (2) Partial Product Accumulator (PPA) which reduces partial products and computes their sums, and (3) Final Stage Adder (FSA) which converts these sums to the corresponding binary output.
We use an AND-Inverter Graph (AIG) representation of a multiplier as input for our verification method (Fig. 1 shows the AIG of a $2 \times 2$ multiplier). A great advantage of AIGs is the possibility of advanced reverse engineering. Based on cut enumeration, atomic blocks can be identified very fast, even in multipliers with millions of nodes [14], [13].

## B. SCA Verification

First, we briefly summarize some basics:

- Monomial: power product of the variables, i.e. $M=x_{1}^{a_{1}} x_{2}^{a_{2}} \ldots x_{n}^{a_{n}}$ where $a_{i} \geq 0$
- Polynomial: finite sum of monomials, i.e. $P=c_{1} M_{1}+$ $\cdots+c_{j} M_{j}$ with coefficients in field $k$
- Division: Assuming $p$ is a polynomial and $F$ is a set of polynomials, the division of $p$ by $F$ is denoted by $p \xrightarrow{F} r$ where $r$ is called remainder
The goal of SCA-based verification is to formally prove that all signal assignments consistent with the AIG evaluate the Specification Polynomial $(S P)$ to 0 . The $S P$ determines the function of a multiplier based on its inputs and outputs, e.g. for the $2 \times 2$ multiplier of Fig. $1 S P=8 Z_{3}+4 Z_{2}+2 Z_{1}+$ $Z_{0}-\left(2 A_{1}+A_{0}\right)\left(2 B_{1}+B_{0}\right)$ where $8 Z_{3}+4 Z_{2}+2 Z_{1}+Z_{0}$ represents the word-level representation of the 4 -bit output, and $\left(2 A_{1}+A_{0}\right)\left(2 B_{1}+B_{0}\right)$ represents the product of the 2 -bit inputs.

Before verification, the nodes of an AIG should be modeled as polynomials describing the relation between inputs and outputs. Based on the type of nodes and edges, five different operations might happen in an AIG. Assuming $z$ is the output and $n_{i}$ and $n_{j}$ are the inputs of a node:

```
\(z=n_{i} \Rightarrow p_{N}:=z-n_{i}, \quad z=n_{i} \wedge n_{j} \Rightarrow p_{N}:=z-n_{i} n_{j}\),
\(z=\neg n_{i} \Rightarrow p_{N}:=z-1+n_{i}, \quad z=\neg n_{i} \wedge n_{j} \Rightarrow p_{N}:=z-n_{j}+n_{i} n_{j}\),
\(z=\neg n_{i} \wedge \neg n_{j} \Rightarrow p_{N}:=z-1+n_{i}+n_{j}-n_{i} n_{j}\)

The extracted node polynomials are in the form \(P_{N}=\) \(x-\operatorname{tail}\left(P_{N}\right)\) where \(x\) is the node's output, and \(\operatorname{tail}\left(P_{N}\right)\) is a function based on the node's inputs.
Based on the Gröbner basis theory, all signal assignments consistent with the AIG evaluate the specification polynomial \(S P\) to 0 , iff the remainder of dividing \(S P\) by the AIG node polynomials is equal to 0 (see [15], [8], [16] for more details).
The step-wise division of \(S P\) by node polynomials is shown in Fig. 2 for the \(2 \times 2\) multiplier. As the remainder is zero, the circuit is bug-free. In integer multipliers, dividing \(S P_{i}\) by a node polynomial \(P_{N_{i}}=x_{i}-\operatorname{tail}\left(P_{N_{i}}\right)\) is equivalent to substituting \(x_{i}\) with \(\operatorname{tail}\left(P_{N_{i}}\right)\) in \(S P_{i}\). For example, dividing \(S P_{3}\) by \(P_{n_{11}}\) in Fig. 2 is equivalent to substituting \(n_{11}\) with \(\operatorname{tail}\left(P_{n_{11}}\right)=n_{4} n_{7}\) in \(S P_{3}\). (In the results we always replace powers \(x_{i}^{a_{i}}\) with \(a_{i}>\) 1 by \(x_{i}\), since \(x_{i}\) can only take values from \(\{0,1\}\). In the theory this corresponds to adding \(x_{i}^{2}-x_{i}\) to the node polynomials.) The process of step-wise division (substitution) is called backward rewriting. The existing SCA-based verification methods use a static reverse topological order to substitute node polynomials in the intermediate specification polynomial. We refer to this intermediate polynomial as \(S P_{i}\) in the rest of the paper.
During the backward rewriting of complex multipliers, a huge number of redundant monomials known as vanishing monomials are generated. Removing the vanishing monomials is essential to avoid an explosion in the size of \(S P_{i}\). In [10] and [13] a method has been presented which first performs a local backward rewriting to remove the vanishing monomials early in the so-called Converging Gate Cones (CGCs). Overall, this allows a vanishing-free global backward rewriting.

\section*{III. Challenges of Verifying Optimized Multipliers}

In this section, we first demonstrate how optimization changes the structure of a multiplier netlist. As a result of these changes, a major challenge arises for formal verification of multipliers which we explain next.

\section*{A. Optimization and Multiplier Structure}

Multiplication algorithms determine how the building blocks are connected in each stage of a multiplier to reach a certain design goal, e.g. minimum area or minimum delay [17]. Although the type of building blocks and the connections largely vary from one algorithm to another, there are standard building blocks (also referred as atomic blocks). Examples include HAs, FAs, MUXs, and carry predictors [18].
Example 1. Fig. 3 a shows the AIG of a \(3 \times 3\) array multiplier after reverse engineering and identifying the atomic blocks. The first stage of the multiplier consists of nine AND gates (i.e. \(n_{0}, n_{1}, \ldots, n_{8}\) ) to generate partial products. The second and third stages are made of HA and FA networks \({ }^{2}\) to reduce partial products and perform the final addition. As can be seen

\footnotetext{
\({ }^{2} \mathrm{~A}\) HA is denoted as \(H_{i}\) and an FA as \(F_{k}\), respectively.
}


Fig. 3. AIGs of \(3 \times 3\) array multiplier
in Fig. 3a, the boundaries of atomic blocks and the connections between them are fully visible.

However, performing optimization on the multiplier netlist completely changes the situation. More precisely, the effect of optimization appears at two hierarchy levels:
1) Logic optimization inside building blocks: the number of nodes in a building block shrinks, e.g. 11 AIG nodes of an FA are reduced to 8 nodes. In this case, the overall structure of the multiplier does not change significantly (as long as the reverse engineering is still able to identify the respective atomic block).
2) Logic optimization across building blocks: blocks are merged to reduce the number of nodes and as a consequence some block boundaries are removed. In this case, identifying building blocks is no longer possible.
We now come back to the array multiplier example.
Example 2. Fig. \(3 b\) shows the AIG for the \(3 \times 3\) array multiplier after applying resyn3 optimization from abc [19]. While the overall number of AIG nodes is reduced by \(15 \%\), the optimization destroys the boundaries of the two FAs \(F_{1}\) and \(F_{2}\), respectively.

As becomes evident from the example, there are some nodes in the circuit which have been part of atomic blocks (FAs in the example) before optimization, but they cannot be identified as any atomic blocks anymore. Moreover, in non-trivial multipliers optimization may also change the boundaries of larger blocks, e.g. \(n\)-bit MUXs, carry operators, partial product generator units. Overall, losing block boundaries poses a major challenge to SCA verification of integer multipliers which we detail in the following section.

\section*{B. Challenge for Backward Rewriting of Optim. Multipliers}

An important property of atomic blocks is the compactness of the polynomials describing their input/output relations. For example, the two most common atomic blocks, i.e. HA and FA, can be expressed using the following word-level relations:
\[
\begin{align*}
& H A(\text { in }: X, Y \text { out }: C, S) \quad \Rightarrow \quad 2 C+S=X+Y \\
& F A(\text { in : } X, Y, Z \quad \text { out }: C, S) \quad \Rightarrow \quad 2 C+S=X+Y+Z \tag{2}
\end{align*}
\]

Let's have a look on backward rewriting when reaching atomic blocks and such word-level polynomials: Substituting
an atomic block polynomial in \(S P_{i}\) only slightly changes the size of \(S P_{i}\) (i.e. number of monomials). As a result, the size of \(S P_{i}\) remains almost identical.

Example 3. In Fig. 3a, substituting the HA and FA polynomials increases the size of \(S P_{i}\) by zero or one, respectively. For example, substituting the \(F_{3}\) polynomial ( \(2 O u t[5]+\) Out \([4]=\) \(\left.W_{0}+W_{1}+W_{2}\right)\) and the \(H_{3}\) polynomial \(\left(2 W_{0}+O u t[3]=\right.\) \(\left.W_{3}+W_{4}\right)\) in the first and the second steps of backward rewriting results in: \({ }^{3}\)
\[
\begin{align*}
& S P:=32 \mathrm{Out}[5]+16 \mathrm{Out}[4]+8 \mathrm{Out}[3]+\ldots \\
& S P \xrightarrow{F_{3}} S P_{1}:=16 W_{2}+16 W_{1}+16 W_{0}+8 \mathrm{Out}[3]+\ldots \\
& S P_{1} \xrightarrow{H_{3}} S P_{2}:=16 W_{2}+16 W_{1}+8 W_{3}+8 W_{4}+\ldots \tag{3}
\end{align*}
\]

On the other hand, substituting later \(n_{0}, n_{1}, \ldots, n_{8}\) (AND gates) polynomials results in cancellation of two terms per AND gate and reduces the size of \(S P_{i}\) by two per AND gate.

If we now look on the overall backward rewriting algorithm (and not only on a single substitution step), all existing SCAverifiers use a static pre-determined substitution order derived from the topological sorting of the AIG nodes/circuit elements. This worked out fine so far, as sharp increases in the size of \(S P_{i}\) are not occurring if most of the time atomic blocks are handled as a whole - either by substituting word-level polynomials (if possible) or at least by using a substitution order that substitutes polynomials for the single outputs of atomic blocks strictly one after the other. If we lose the block boundaries in optimized multipliers, the exploitation of this knowledge is not possible anymore. Thus, we observe much larger polynomials, and in the worst-case a blow-up.
Example 4. In Fig. 3b, after substituting the \(F_{3}\) and \(H_{3}\) polynomials, there are several degrees of freedom for the subsequent substitutions respecting the reverse topological order of the \(n_{0}, n_{1}, \ldots, n_{24}\) nodes and \(H_{1}, H_{2}\) components. A straightforward topological order results in a sudden increase in the \(S P_{i}\) size, and even an explosion in larger multipliers. For example, using a topological order \(S O_{1}\) during the backward rewriting of an optimized \(8 \times 8\) array multiplier results in an intermediate polynomial with 106, 938 monomials. However, by using a different order \(\mathrm{SO}_{2}\) we can limit the maximum size of \(S P_{i}\) to 203.

\section*{IV. SCA VErification using Dynamic Backward REWRITING}

In this section, we first give an overview of our proposed approach DyPoSub. Then, we introduce dynamic backward rewriting which allows verification of optimized multipliers.

\section*{A. Overview}

Algorithm 1 shows the top-level pseudo-code of our approach. The input is the AIG representation of a multiplier, and the algorithm returns true (false) if the multiplier is correct (buggy). In the first step, the specification polynomial \(S P\) is created based on the size of the multiplier (Line 1). Then, the atomic blocks are identified by reverse engineering (Line 2). Subsequently, the converging gate cones are detected in the remaining nodes which are not part of any atomic blocks

\footnotetext{
\({ }^{3}\) The case that the left hand sides of word-level relations for atomic blocks do not occur in the needed form in the \(S P_{i}\) is discussed later in Section IV-B.
}
```

Algorithm 1 SCA Verification
Input: Multiplier $M$
Output: TRUE if $M$ is correct, and $F A L S E$ otherwise
1: SP $\leftarrow$ create_specification_polynomial $(M)$
2: $A B \leftarrow$ reverse_engineering $(M)$
3: $C G C \leftarrow$ find_converging_gate_cones $(M, A B)$
4: $F F C \leftarrow$ find_fanout-free_cones $(M, A B, C G C)$
5: $C \leftarrow C G C \cup F F C$
6: $P \leftarrow$ extract_polynomials $(C)$
7: $P \leftarrow$ remove_vanishing_monomials $(P)$
8: $r \leftarrow$ dynamic_backward_rewriting $(S P, A B, C, P)$
9: if $r$ is equal to zero then return TRUE else return $F A L S E$

```
(Line 3). The rest of the nodes are grouped into fanout-free cones (Line 4). The polynomials for all cones are extracted (Line 5 - Line 6) and the vanishing monomials are removed (Line 7). Finally, we perform the proposed dynamic backward rewriting (for details see next section) which is based on the dynamic ordering of cone and block polynomials to obtain the remainder (Line 8). If the remainder is zero the circuit is correct; otherwise, it is buggy (Line 9).

\section*{B. Dynamic Backward Rewriting}

Before introducing the dynamic backward rewriting, we first make several definitions:

Definition 1. Atomic blocks, Converging Gate Cones (CGCs), and Fanout-Free Cones (FFCs) are called Components. A component has one output, if it is a CGC or an FFC; and may have several outputs if it is an atomic block. A multiplier consists of several components.

Definition 2. A component has a polynomial describing the (word-level) relation of output(s) and inputs. In components with one output (i.e. CGCs and FFCs), we have:
\[
\begin{equation*}
O u t=F\left(I N_{1}, I N_{2}, \ldots, I N_{n}\right) \tag{4}
\end{equation*}
\]
where Out is the sole output of the component, and \(F\left(I N_{1}, I N_{2}, \ldots, I N_{n}\right)\) is a polynomial based on the component inputs. In components with more than one output (i.e. atomic blocks), we have:
\[
\begin{align*}
\text { Out }_{1} & =F_{1}\left(I N_{1}, I N_{2}, \ldots, I N_{n}\right) \\
\text { Out }_{2} & =F_{2}\left(I N_{1}, I N_{2}, \ldots, I N_{n}\right) \\
& \ldots  \tag{5}\\
\text { Out }_{m} & =F_{m}\left(I N_{1}, I N_{2}, \ldots, I N_{n}\right)
\end{align*}
\]
where each output can be described as a polynomial based on the primary inputs. There is also a more compact relation between outputs and inputs in these components:
\[
\begin{equation*}
G\left(O u t_{1}, O u t_{2}, \ldots, O u t_{m}\right)=F\left(I N_{1}, I N_{2}, \ldots, I N_{n}\right) \tag{6}
\end{equation*}
\]
where \(G\) and \(F\) are polynomials based on the outputs and inputs, respectively.

Example 5. In an \(F A\) with the inputs \(X, Y\), and \(Z\), and outputs \(C\) and \(S\), we have:
\[
\begin{align*}
& C=X Y+X Z+Y Z-2 X Y Z  \tag{7}\\
& S=X+Y+Z-2 X Y-2 X Z-2 Y Z+4 X Y Z  \tag{8}\\
& 2 C+S=X+Y+Z \tag{9}
\end{align*}
\]
where (7) and (8) show the polynomials for carry and sum outputs, respectively. However, (9) indicates the compact relation between outputs and inputs.

Definition 3. Substituting a component polynomial in the intermediate specification polynomial (i.e. \(S P_{i}\) ) means finding and replacing all the occurrences of component output(s) by the corresponding polynomial(s) in \(S P_{i}\).


Fig. 4. Substitution candidate
```

Algorithm 2 Dynamic Backward Rewriting
Input: Specification Poly $S P$, Set of Components $B$, Set of Component Polys $P$
Output: Remainder $r$
$S P_{i} \leftarrow S P$
while $B$ is not empty do
$C B \leftarrow$ Find eligible candidates in $B$ for substitution
for each $b \in C B$ do
$O[b] \leftarrow$ count_occurrence $\left(b, S P_{i}\right)$
sorted $C B \leftarrow$ Sort $C B$ in ascending order based on $O$
$S P_{\text {old }} \leftarrow S P_{i}$
threshold $\leftarrow 0.1 ; j \leftarrow 0 ;$ no_candidate $\leftarrow T R U E$
while no_candidate is TRUE do
$S P_{i}^{-} \leftarrow$ Substitute $\left(P_{\text {sorted } B[j]}, S P_{i}\right)$
if $\left(\operatorname{size}\left(S P_{i}\right)-\operatorname{size}\left(S P_{o l d}\right)\right) / \operatorname{size}\left(S P_{\text {old }}\right)<$ threshold then
Remove sorted $C B[j]$ from $B$
no_candidate $\leftarrow F A L S E$
else
$S P_{i} \leftarrow S P_{\text {old }} ; j \leftarrow j+1$
if $j \geq \operatorname{size}(\operatorname{SortedCB}$ ) then
$j \leftarrow 0$; threshold $\leftarrow$ threshold $\times 2$
$r \leftarrow S P_{i}$
return $r$

```

To ensure correct substitution, we need two rules: (1) In components with more than one output (i.e. atomic blocks), which have a compact word-level relation between inputs and outputs (see Equation (6)), we first search for \(G\left(\right.\) Out \(_{1}\), Out \(\left._{2}, \ldots, O u t_{m}\right)\) in \(S P_{i}\). Then, we have to distinguish between two cases: If we have found \(G\left(\mathrm{Out}_{1}, \mathrm{Out}_{2}, \ldots, \mathrm{Out}_{m}\right)\), we directly substitute it by \(F\left(I N_{1}, I N_{2}, \ldots, I N_{n}\right)\); otherwise, we substitute each output with the corresponding polynomials. Finding the exact \(G\) polynomial is sometimes not possible, particularly in optimized multipliers. Therefore, following this rule guarantees the correct substitution. (2) Assume that \(C\) is a component with output \(W\), and \(C_{1}, C_{2}, \ldots, C_{n}\) are components having \(W\) as one of their inputs (see Fig. 4). The \(C\) polynomial should be substituted only after substituting the \(C_{1}, C_{2}, \ldots, C_{n}\) polynomials. Following this rule guarantees that a component polynomial needs to be substituted only once during the backward rewriting as for example the signal \(W\) never appears again in the \(S P_{i}\) after substituting the \(C\) polynomial. A component which satisfies this rule and is ready for substitution is called a Candidate.

At each step of backward rewriting, there are several substitution candidates. As we discussed in Section III, choosing a substitution order which keeps the size of \(S P_{i}\) as small as possible and avoids explosion in optimized multipliers is vital. Therefore, we propose dynamic backward rewriting.

Algorithm 2 shows our proposed dynamic backward rewriting. The algorithm receives the specification polynomial, the set of components, and their corresponding polynomials as inputs, and performs all possible backward rewriting steps. First, the substitution candidates are found (see Line 3 in Algorithm 2). Then, the number of times which the output of each candidate occurs in \(S P_{i}\) is counted (Line 4 - Line 5). We sort the candidates in ascending order based on the number of their output occurrences in \(S P_{i}\) (Line 6) and basically we try to substitute them in \(S P_{i}\) according to that order. The following example motivates this choice:

Example 6. Consider the polynomial \(P=a+4 a b c-2 a d-\) 2adc. Substituting a variable occurring \(k\) times in \(P\) by \(a\) polynomial containing \(h\) monomials increases the number of monomials in the result by \(k \cdot(h-1)\) in the worst case. E.g. substituting \(a=x+y+z+x z\) in \(P\) results in a new polynomial with 16 monomials. For this reason we refrain from substituting variables with large numbers of occurrences in \(P\) first. Instead we start with variables having smaller numbers of occurrences and hope that cancellation of monomials improves the situation when we finally arrive at the substitution of later variables. Assume in the example that we substitute \(b=x y\) first, then \(c=x z\) and then \(d=x y z\), leading to \(P=a\). After substituting \(a=x+y+z+x z\) we arrive at the final result of the series of substitutions without exceeding a number of four monomials in between.

In our experiments, we observed that the order mentioned above successfully keeps the sizes of intermediate polynomials small in many cases. Nevertheless, this method is only heuristical and may fail. Therefore, before each substitution we make a copy of \(S P_{i}\) (Line 7) to which we may backtrack if the size of the polynomial grows too much. Altogether, this leads to a dynamic substitution order based on the growth behavior we actually observe.

Example 7. Consider the polynomial \(P=a b x+a b y-2 a b x y+\) \(a b+a\). Substituting the 4 occurrences of \(b\) by \(m+n-m n\) leads to a polynomial with 13 monomials. After substituting \(a=x y\) in the resulting polynomial we arrive at 4 monomials. However, if we substitute a (with one occurrence more than b) first, we obtain 2 monomials after the first substitution and again 4 monomials after substituting b. Of course, we prefer the second order which we obtain if we discard the substitution of \(b\) due to the size of the intermediate result.

In Algorithm 2, this idea is implemented as follows: After making a copy of \(S P_{i}\) (Line 7), the polynomial for the first component in the sorted candidate list is substituted in \(S P_{i}\) (Line 10). We check the increase in size of the polynomials after substitution. To avoid a sharp increase in the number of monomials, we set the increase threshold to \(10 \%\) (Line 8). If the increase is less than the threshold, we recognize it as a successful substitution. Thus, the candidate is removed from the list of components (Line 12 - Line 13), and the process is repeated by finding a new set of candidates (Line 3). On the other hand, if the increase is more than the threshold, then \(S P_{i}\) is restored to its state before substitution, and the process continues by substituting the next candidate in the sorted list (Line 15). If there is no substitution of any candidate that satisfies the threshold limit, the value of the threshold is multiplied by two and the process is repeated from the first candidate in the sorted list (Line 16 - Line 17).

In the next section we present the experimental results.

\section*{V. Experimental Results}

We have implemented the proposed approach DyPoSub in \(\mathrm{C}++\). The experiments have been carried out on an Intel Xeon E3-1270 v3 with 3.50 GHz and 32 GByte of main memory. First, we have evaluated our approach on a wide range of optimized multipliers. The results are summarized in Table I. Please note that the Time-Out (TO) has been set to 24 hours.

The first column Size of Table I shows the size of the multiplier based on the input bits. The second column Benchmark lists the architecture of the multiplier based on its stages (abbreviations are given below the table). These multipliers have been generated using the Arithmetic Module Generator [20] and GenMul [21]. The third column Optimiz. reports the performed optimization: this is either none (-), or the well-known abc [19] optimizations \(d c 2\) and resyn 3 , respectively.

The verification data of our proposed approach is reported in the fourth column Verification data which consists of three subcolumns: Nodes shows the number of AIG nodes of the multiplier, Vanishing Monomials gives the total number of removed vanishing monomials, and Max Poly Size shows the maximum size of the current polynomial \(S P_{i}\) during backward rewriting by counting the number of monomials.

The fifth column Ours of Table I reports the overall runtime of our proposed approach in seconds. Finally, the runtimes of the state-of-the-art verification methods are shown in the sixth column which consists of seven subcolumns. While the first subcolumn Com. reports the run-times of the commercial verification tool from Onespin, the remaining subcolumns give the run-times of the most recent available SCA approaches.

As can be seen, our approach is able to verify all non-optimized multipliers and only produces a time out for two benchmarks with optimization \((S P \circ B D \circ K S\) and \(S P \circ W T \circ C L ; 4\) instances in total). In contrast, none of the state-of-the-art SCA methods can verify any of the optimized multipliers. The commercial tool can prove one multiplier type incl. the optim. version, however only the \(16 \times 16\) instance.

For one multiplier example we also demonstrate the effect of the proposed dynamic ordering in more detail. Fig. 5 shows the number of monomials (vertical axis) in each step of backward rewriting (horizontal axis) for the 32-bit \(S P \circ D T \circ L F\) multiplier with and without optimization. The black line represents the use of a static ordering and the red line shows the polynomial sizes when using the proposed dynamic backward rewriting. Before optimization, both static and dynamic ordering allow to verify the circuit (black and red lines overlap from approx substitution step 300). However, when looking at both optimized multipliers the static ordering leads to (several) peaks with the consequence that the verification fails (see Fig. 5b and Fig. 5c). In contrast, as can be seen in all three figures the proposed dynamic ordering keeps the maximum size of \(S P_{i}\) small (orders of magnitudes smaller in comparison to the static order). Therefore, the proposed approach successfully verifies both optimized multipliers.

In a second experiment, we consider industrial benchmarks from the Synopsys DesignWare Library. They have been optimized for delay. The gate-level Verilog description of these multipliers is generated by mapping the multiplier IP to a standard cell library consisting of up to 3-input logical gates using Synopsys Design Compiler. Then, the Verilog description is converted to AIG using abc. The results can be found in Table II. In addition, the table also includes the highly optimized multiplier from the EPFL combinational benchmark suite [22]. As can be seen, the proposed approach is able to prove the correctness for all these multipliers while the commercial tool can only verify the smallest instance and all other SCA methods fail.

TABLE I
RESULTS OF VERIFYING OPTIMIZED MULTIPLIERS
\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|c|c|c|c|}
\hline \multirow[b]{2}{*}{Size} & \multirow[b]{2}{*}{Benchmark} & \multirow[b]{2}{*}{Optimiz.} & \multicolumn{3}{|c|}{Verification data} & \multicolumn{8}{|c|}{Run-times (seconds)} \\
\hline & & & Nodes & \begin{tabular}{l}
Vanishing \\
Monomials
\end{tabular} & \[
\begin{aligned}
& \text { Max Poly } \\
& \text { Size }
\end{aligned}
\] & Ours & Com. & [13] & [10] & [6] & [5] & [11] & [8] \\
\hline \multirow{3}{*}{\(16 \times 16\)} & \multirow{3}{*}{\(S P \circ D T \circ L F\)} & \multirow[t]{3}{*}{\[
\begin{gathered}
\mathrm{dc} 2 \\
\text { resyn3 }
\end{gathered}
\]} & 2,884 & 728 & 431 & 0.13 & 43.00 & 0.27 & 0.58 & 3.21 & TO & TO & TO \\
\hline & & & 2,115 & 27 & 318 & 0.04 & 40.00 & TO & TO & TO & TO & TO & TO \\
\hline & & & 2,702 & 1,120 & 433 & 0.06 & 43.00 & TO & TO & TO & TO & TO & TO \\
\hline \multirow{18}{*}{\(64 \times 64\)} & \multirow{3}{*}{\(S P \circ D T \circ L F\)} & & 48,808 & 2,249 & 7,256 & 7.22 & TO & 40.72 & 120.76 & 2,273.64 & TO & TO & TO \\
\hline & & dc2 & 36,365 & 5,618 & 8,734 & 16.82 & TO & TO & TO & TO & TO & TO & TO \\
\hline & & resyn 3 & 45,115 & 3,038 & 6,908 & 7.58 & TO & TO & TO & TO & TO & TO & TO \\
\hline & \multirow{3}{*}{\(S P \circ A R \circ C K\)} & - & 48,073 & 0 & 4,284 & 5.33 & TO & 151.17 & TO & TO & TO & TO & TO \\
\hline & & dc2 & 36,154 & 0 & 4,285 & 5.30 & TO & TO & TO & TO & TO & TO & TO \\
\hline & & resyn 3 & 43,501 & 0 & 4,284 & 5.41 & TO & TO & TO & TO & TO & TO & TO \\
\hline & \multirow{3}{*}{\(S P \circ B D \circ K S\)} & & 50,756 & 613,454 & 5,607 & 14.30 & TO & 162.26 & TO & TO & TO & TO & TO \\
\hline & & dc2 & 37,655 & - & - & TO & TO & TO & TO & TO & TO & TO & TO \\
\hline & & resyn3 & 47,746 & - & - & TO & TO & TO & TO & TO & TO & TO & TO \\
\hline & \multirow{3}{*}{\(S P \circ W T \circ C L\)} & & 68,875 & 266,684 & 4,461 & 19.76 & TO & 96.27 & 224.43 & TO & TO & TO & TO \\
\hline & & dc2 & 44,643 & - & - & TO & TO & TO & то & TO & TO & TO & TO \\
\hline & & resyn3 & 63,962 & - & - & TO & TO & TO & TO & TO & TO & TO & TO \\
\hline & \multirow{3}{*}{\(B P \circ A R \circ R C\)} & & 38,439 & 0 & 20,099 & 21.50 & TO & 78.61 & 70.43 & 911.07 & 0.09 & TO & TO \\
\hline & & dc2 & 31,312 & 0 & 15,882 & 21.43 & TO & TO & TO & TO & TO & TO & TO \\
\hline & & resyn3 & 34,317 & 0 & 20,097 & 23.21 & TO & TO & TO & TO & TO & TO & TO \\
\hline & \multirow{3}{*}{\(B P \circ O S \circ C U\)} & - & 39,798 & 0 & 25,803 & 34.68 & TO & 302.01 & TO & TO & TO & TO & TO \\
\hline & & \[
\mathrm{dc} 2
\] & 31,925 & 0 & 26,200 & 50.78 & TO & TO & то & TO & TO & TO & TO \\
\hline & & resyn3 & 37,156 & 0 & 20,100 & 27.12 & TO & TO & TO & TO & TO & TO & TO \\
\hline \multirow{9}{*}{\(128 \times 128\)} & \multirow{3}{*}{\(S P \circ A R \circ R C\)} & & 162,304 & 0 & 16,640 & 108.82 & TO & 966.57 & TO & TO & 1.10 & TO & \\
\hline & & dc2 & 146,044 & 0 & 16,642 & 101.63 & TO & TO & то & TO & TO & TO & TO \\
\hline & & resyn 3 & 162,298 & 0 & 16,640 & 104.64 & TO & TO & TO & TO & TO & TO & TO \\
\hline & \multirow{3}{*}{\(S P \circ D T \circ L F\)} & & 164,572 & 3,642 & 29,811 & 194.09 & TO & 527.37 & TO & TO & TO & TO & TO \\
\hline & & dc2 & 146,655 & 43,086 & 57,708 & 1905.57 & TO & TO & то & TO & TO & TO & TO \\
\hline & & resyn 3 & 150,711 & 4,900 & 28,721 & 196.68 & TO & TO & TO & TO & TO & TO & TO \\
\hline & \multirow{3}{*}{\(S P \circ W T \circ B K\)} & & 166,938 & 1,623 & 22,406 & 172.14 & TO & 706.34 & TO & TO & TO & TO & TO \\
\hline & & dc2 & \[
149,353
\] & 4,160 & 26,911 & 329.81 & TO & TO & TO & TO & TO & TO & TO \\
\hline & & resyn3 & 154,146 & 1,808 & 22,347 & 161.30 & TO & TO & TO & TO & TO & TO & TO \\
\hline \multicolumn{6}{|l|}{Stage \(1 \Rightarrow\) SP: Simple partial product generator \(\quad\) BP: Booth partial product gener} & & & \multicolumn{3}{|l|}{\multirow[b]{2}{*}{BD: Balanced delay tree}} & \multicolumn{3}{|r|}{TO: Time-Out} \\
\hline Stage \(2 \Rightarrow\) & R: Array & DT: Dadda & & & & & & & & & & & \\
\hline Stage \(3 \Rightarrow\) & C: Ripple carry & BK: Brent & ng LF & Ladner-Fische & CK: Carr & -sip C & Condi & nal sum & CL: Car & look-ahea & KS & ogge- & one \\
\hline
\end{tabular}

TABLE II
RESULTS OF VERIFYING INDUSTRIAL MULTIPLIERS
\begin{tabular}{|c|c|c|c|c|c|c|c|c|}
\hline \multirow[b]{2}{*}{Source} & \multirow[b]{2}{*}{Size} & \multirow[b]{2}{*}{Nodes} & \multicolumn{6}{|c|}{Run-times (seconds)} \\
\hline & & & Ours & Com. & [13] & [10] [6] & [5] [1] & [11] [8] \\
\hline & \(16 \times 16\) & 2,432 & 0.13 & 40 & TO & TO TO & TO T & TO TO \\
\hline & \(32 \times 32\) & 7,240 & 2.27 & TO & TO & TO TO & TO T & TO TO \\
\hline & \(48 \times 48\) & 16,086 & 16.97 & & TO & TO TO & TO T & TO TO \\
\hline Synopsys & \(64 \times 64\) & 27,658 & 62.69 & TO & TO & то TO & TO T & TO TO \\
\hline DesignWare & \(96 \times 96\) & 61,180 & 506.86 & TO & TO & TO TO & TO T & TO TO \\
\hline Library & \(128 \times 128\) & 106,949 & 1,861.56 & TO & TO & TO TO & TO T & TO TO \\
\hline (pparch*) & \(160 \times 160\) & 166,492 & 4,569.96 & TO & TO & TO TO & TO T & TO TO \\
\hline & \(192 \times 192\) & 238,920 & 9,846.22 & TO & TO & TO TO & TO T & TO TO \\
\hline & \(256 \times 256\) & 422,077 & 29,988.20 & TO & TO & TO TO & TO T & TO TO \\
\hline EPFL mul. & \(64 \times 64\) & 27,190 & 76.89 & TO & TO & TO TO & TO T & TO TO \\
\hline
\end{tabular}
* \(\overline{\text { Delay-optimized flexible Booth Wallace after technology mapping }}\)


Fig. 5. \(\quad S P_{i}\) size during backward rewriting of 32-bit \(S P \circ D T \circ L F\) using static (black line) and dynamic (red line) orderings

\section*{VI. CONCLUSION}

In this paper, we have proposed our novel SCA-based verification approach DyPoSub. For the first time a dynamic substitution order is used during backward rewriting which allows to control the size of the intermediate polynomial. Our approach allows to verify optimized and technology mapped multipliers including industrial benchmarks which has not been possible before.

\section*{REFERENCES}
[1] R. Drechsler, B. Becker, and S. Ruppertz, "The K*BMD: A verification data structure," IEEE Design \& Test of Computers, vol. 14, no. 2, pp. 51-59, 1997.
[2] R. P. Kurshan and L. Lamport, "Verification of a multiplier: 64 bits and beyond," in CAV, 1993, pp. 166-179.
[3] S. Vasudevan, V. Viswanath, R. W. Sumners, and J. A. Abraham, "Automatic verification of arithmetic circuits in RTL using stepwise refinement of term rewriting systems," TC, vol. 56, no. 10, pp. 1401-1414, 2007.
[4] F. Farahmandi and B. Alizadeh, "Gröbner basis based formal verification of large arithmetic circuits using gaussian elimination and cone-based polynomial extraction," MICPRO, vol. 39, no. 2, pp. 83-96, 2015.
[5] C. Yu, M. Ciesielski, and A. Mishchenko, "Fast algebraic rewriting based on and-inverter graphs," TCAD, vol. 37, no. 9, pp. 1907-1911, 2017.
[6] A. Sayed-Ahmed, D. Große, U. Kühne, M. Soeken, and R. Drechsler, "Formal verification of integer multipliers by combining Gröbner basis with logic reduction," in DATE, 2016, pp. 1048-1053.
[7] C. Yu, W. Brown, D. Liu, A. Rossi, and M. Ciesielski, "Formal verification of arithmetic circuits by function extraction," TCAD, vol. 35, no. 12, pp. 2131-2142, 2016.
[8] D. Ritirc, A. Biere, and M. Kauers, "Column-wise verification of multipliers using computer algebra," in FMCAD, 2017, pp. 23-30.
[9] A. Mahzoon, D. Große, and R. Drechsler, "Combining symbolic computer algebra and boolean satisfiability for automatic debugging and fixing of complex multipliers," in ISVLSI, 2018, pp. 351-356.
[10] A. Mahzoon, D. Große, and R. Drechsler, "PolyCleaner: clean your polynomials before backward rewriting to verify million-gate multipliers," in ICCAD, 2018, pp. 129:1-129:8.
[11] D. Ritirc, A. Biere, and M. Kauers, "Improving and extending the algebraic approach for verifying gate-level multipliers," in DATE, 2018, pp. 1556-1561.
[12] D. Kaufmann, A. Biere, and M. Kauers, "Verifying large multipliers by combining SAT and computer algebra," in FMCAD, 2019, pp. 28-36.
[13] A. Mahzoon, D. Große, and R. Drechsler, "RevSCA: Using reverse engineering to bring light into backward rewriting for big and dirty multipliers," in DAC, 2019, pp. 185:1-185:6
[14] P. Pan and C.-C. Lin, "A new retiming-based technology mapping algorithm for lut-based fpgas," in FPGAs for Custom Computing Machines, 1998, pp. 35-42.
[15] D. A. Cox, J. Little, and D. O'Shea, Ideals Varieties and Algorithms. Springer, 1997.
[16] D. Kaufmann, A. Biere, and M. Kauers, "Incremental column-wise verification of arithmetic circuits using computer algebra," Formal Methods in Sys. Design, Feb. 2019.
[17] I. Koren, Computer Arithmetic Algorithms, 2nd ed. A. K. Peters, Ltd., 2001.
[18] R. Zimmermann, "Binary adder architectures for cell-based vlsi and their synthesis," Ph.D. dissertation, Swiss Federal Institute of Technology, 1997.
[19] "Abc: A system for sequential synthesis and verification," available at https://people.eecs. berkeley.edu/~alanmi/abc/, 2018.
[20] "Arithmetic module generator based on acg," available at https://www.ecsis.riec.tohoku.ac jp/topics/amg/i-amg, 2019.
[21] A. Mahzoon, D. Große, and R. Drechsler, "GenMul: Generating architecturally complex multipliers to challenge formal verification tools," in IWLS, 2019.
[22] L. Amaru, P.-E. Gaillardon, and G. De Micheli, "The EPFL combinational benchmark suite," in IWLS, 2015.```


[^0]:    This work was supported by the University of Bremen's graduate school SyDe funded by the German Excellence Initiative, and by the German Academic Exchange Service (DAAD).

[^1]:    ${ }^{1}$ Our tool DyPoSub is available on GitHub; links can be found at http://www.sca-verification.org/dyposub

