# Joint Optimization of Randomizer and Computing Core for Low-Cost Stochastic Circuits

Kuncai Zhong, Xuan Wang, Chen Wang, Weikang Qian\*

University of Michigan-Shanghai Jiao Tong University Joint Institute, Shanghai Jiao Tong University, China {kczhong,xuan.wang,wangchen\_2011,qianwk}@sjtu.edu.cn

# ABSTRACT

Stochastic computing (SC) is an unconventional computing paradigm that computes on stochastic bit streams. It is promising for implementing complex functions with low-cost circuitry. A stochastic circuit typically consists of a randomizer, which generates the stochastic bit streams, and an SC core, which computes on them. Many works have been proposed to optimize these two parts in order to design low-cost stochastic circuits. However, they optimize each part insufficiently because they overlook some of the optimization space, and they optimize the two parts separately without considering their mutual influence, so the final stochastic circuit is sub-optimal. To address this issue, we first introduce a low-cost randomizer architecture and a method for optimizing the SC core. Then, by combining these two techniques, we propose a method to jointly optimize the randomizer and the SC core. Our experimental results show that, compared to the conventional method, the proposed joint optimization method reduces the area by 39.70% and the power by 42.74% for the stochastic circuit.

# **CCS CONCEPTS**

- Hardware  $\rightarrow$  Arithmetic and datapath circuits.

# **KEYWORDS**

joint optimization, randomizer, stochastic computing core

### ACM Reference Format:

Kuncai Zhong, Xuan Wang, Chen Wang, Weikang Qian. 2022. Joint Optimization of Randomizer and Computing Core for Low-Cost Stochastic Circuits. In 17th IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH '22), December 7–9, 2022, Virtual, OR, USA. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3565478.3572540

NANOARCH '22, December 7–9, 2022, Virtual, OR, USA

© 2022 Association for Computing Machinery. ACM ISBN 978-1-4503-9938-8/22/12...\$15.00

# 1 INTRODUCTION

Stochastic computing (SC) is an unconventional computing paradigm proposed in the 1960s [1]. It computes on stochastic bit streams (SBSs), each of which consists of zeros and ones and encodes a value by its ratio of ones. Compared to binary computing, it can implement complex functions with simple circuitry and has strong fault tolerance. For example, it needs only an AND gate to implement multiplication.
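The AND-gate multiplication mentioned above can be illustrated with a minimal sketch. It assumes idealized, mutually independent SBSs drawn from Python's `random` module rather than a hardware randomizer; the helper names (`make_sbs`, `sbs_value`, `sc_multiply`) and the stream length are hypothetical choices made only for this illustration.

```python
import random

def make_sbs(value, length, rng):
    """Idealized SBS (assumption): each bit is 1 with probability `value`."""
    return [1 if rng.random() < value else 0 for _ in range(length)]

def sbs_value(bits):
    """Decode an SBS: the encoded value is the ratio of ones."""
    return sum(bits) / len(bits)

def sc_multiply(a_bits, b_bits):
    """Bitwise AND of two independent SBSs encodes the product of their values."""
    return [a & b for a, b in zip(a_bits, b_bits)]

rng = random.Random(0)               # fixed seed so the sketch is repeatable
a = make_sbs(0.5, 100_000, rng)      # SBS encoding 0.5
b = make_sbs(0.25, 100_000, rng)     # SBS encoding 0.25
product = sbs_value(sc_multiply(a, b))  # close to 0.5 * 0.25 = 0.125
```

The result is only approximately 0.125 because the streams are random; the approximation tightens as the streams get longer and degrades if the two streams are correlated.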

A stochastic circuit typically consists of a randomizer and an SC core as shown in Fig. 1. The randomizer generates  $d_i$ SBSs encoding the input binary number  $X_i$  for  $1 \le i \le k$ and *m* SBSs of constant values. The SC core computes on these SBSs to implement the target function. In this paper, we consider optimizing the stochastic circuits that implement *univariate functions*. In this case, we have k = 1. For simplicity, in the following, we denote  $X_1$ ,  $d_1$ , and  $x_{1,i}$  as X, d, and  $x_i$ , respectively.



Figure 1: Illustration of a stochastic circuit.

Many works optimize stochastic circuits by optimizing either the randomizer [2-5] or the SC core [6-8]. However, these methods generally optimize each part insufficiently because they overlook some of the optimization space, and they optimize the two parts separately without considering their mutual influence, which leaves the final stochastic circuit sub-optimal. For example, for the optimization of the randomizer, Ting *et al.* [3] propose a method to efficiently generate multiple SBSs of variable values by inserting D flip-flops (DFFs). The method can significantly reduce the hardware cost. Nevertheless, it does not give an efficient way to generate SBSs of constant values and does not consider further optimizing the SC core.

<sup>\*</sup>This work is supported by the National Key R&D Program of China under Grant 2020YFB2205501. Corresponding author: Weikang Qian. Weikang Qian is also with the MoE Key Lab of AI, Shanghai Jiao Tong University, China.

In this special session paper, we first review two of our previously proposed techniques for randomizer optimization and SC core optimization, respectively. To optimize the randomizer, we proposed a low-cost randomizer architecture and its associated configuration method in [9]. The architecture needs only a single random number source (RNS), a single comparator, and the minimum number of DFFs to generate all the SBSs of variable and constant values, and the configuration method leads to high accuracy. To optimize the SC core, we proposed a dynamic approximation method in [10]. It can efficiently obtain a low-cost SC core satisfying the accuracy requirement.

Then, based on the above two techniques, we propose a method to jointly optimize the randomizer and the SC core in this work. It uses an early-termination strategy for acceleration. The experimental results show that, compared to the conventional method, the proposed joint optimization method reduces the area by 39.70% and the power by 42.74% for the stochastic circuit.

We organize the rest of the paper as follows. Section 2 introduces the background and the related works. Section 3 introduces the low-cost randomizer architecture and the method to optimize the SC core, and presents the proposed joint optimization method. Section 4 shows the experimental results, and Section 5 concludes the paper.

# 2 BACKGROUND AND RELATED WORKS

We introduce some background and related works about the optimization of the stochastic circuit.

## 2.1 SC core optimization

In general, the SC core approximately implements the target function f(x) by a Bernstein polynomial  $\tilde{f}(x)$  as

$$\tilde{f}(x) = \sum_{i=0}^{d} \frac{G(i)}{2^m} x^i (1-x)^{d-i}, \tag{1}$$

where *d* and *m* are the *degree* and the *precision*, respectively, and each $G(i)$ is an integer in the range $\left[0, 2^m \binom{d}{i}\right]$ [6; 7]. The vector $(G(0), G(1), \ldots, G(d))$ is called the *feature vector*. As shown in Eq. (1), the degree, the precision, and the feature vector determine the approximation error between the Bernstein polynomial $\tilde{f}(x)$ and the target function $f(x)$. To obtain a high-accuracy SC core, we need to first obtain a Bernstein polynomial with a proper choice of degree, precision, and feature vector, and then design a core to implement this polynomial.

To achieve this, for a specified pair of *d* and *m*, Qian *et al.* first propose a method to obtain a feature vector giving the minimum approximation error [11]. Then, for this polynomial, Zhao *et al.* propose a core to implement it, as shown in Fig. 2 [6]. The core is based on combinational logic and takes as input *d* SBSs of the variable value *x* and *m* SBSs of the constant value 0.5. To efficiently synthesize the core with low hardware cost, Peng *et al.* propose a method based on cube assignment [7]. It tries many possible ways to iteratively split the feature vector into cubes, where a cube can be directly implemented by an AND gate, and the cubes of a split are ORed together to form a sum-of-products (SOP) expression. Then, it further simplifies each SOP expression by applying the two-level logic optimization tool ESPRESSO [12] and obtains the corresponding SC core. Finally, after obtaining the SC cores for all the feature vector splits, it chooses the one with the minimum hardware cost.

By the above methods, we can obtain a low-cost SC core approximately implementing the target function with a small approximation error. However, these methods only consider and synthesize a single feature vector for a pair of d and m. They ignore many other possible feature vectors, which also have small approximation errors and may lead to lower hardware costs.
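As a concrete reading of Eq. (1), the following sketch evaluates the polynomial for a given feature vector and checks the stated range of each $G(i)$. The function names are hypothetical and the code is only an illustration of the formula, not the synthesis method of [6], [7], or [11].

```python
from math import comb

def is_valid_feature_vector(G, d, m):
    """Check that G has d + 1 entries and 0 <= G(i) <= 2^m * C(d, i)."""
    return len(G) == d + 1 and all(
        0 <= g <= (2 ** m) * comb(d, i) for i, g in enumerate(G)
    )

def bernstein_eval(G, m, x):
    """Evaluate Eq. (1): sum over i of G(i) / 2^m * x^i * (1 - x)^(d - i)."""
    d = len(G) - 1
    return sum(g / 2 ** m * x ** i * (1 - x) ** (d - i) for i, g in enumerate(G))

G = (1, 1, 6, 2)                      # feature vector with d = 3, m = 2
valid = is_valid_feature_vector(G, d=3, m=2)
at_zero = bernstein_eval(G, 2, 0.0)   # G(0) / 2^m = 0.25
at_half = bernstein_eval(G, 2, 0.5)   # (1 + 1 + 6 + 2) / 4 / 8 = 0.3125
at_one = bernstein_eval(G, 2, 1.0)    # G(d) / 2^m = 0.5
```

Setting every $G(i)$ to its upper limit $2^m \binom{d}{i}$ makes the sum equal to $(x + (1-x))^d = 1$, which is why the upper limit keeps the polynomial within the probability range.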

## 2.2 Randomizer optimization

SC generally applies a stochastic number generator (SNG) to generate an SBS. As shown in Fig. 2, an SNG typically consists of an RNS and a comparator, where the RNS generates a random binary number *R* in each clock cycle, and the comparator outputs a 1 if *R* is less than the input *X*. In general, to ensure high accuracy for the stochastic circuit, we apply *d* different SNGs to generate *d* independent SBSs. However, this leads to a considerable hardware cost.



Figure 2: Conventional architecture of the stochastic circuit.

Several works have been proposed to reduce this cost [2–5]. Among them, the one proposed in [3] has a low hardware cost for generating SBSs of the same value. It can generate *d* SBSs of the same value by applying (*d* - 1) DFFs after an SNG as shown in Fig. 2, where a DFF delays an SBS by one clock cycle and generates another independent SBS of the same value. However, it does not propose an efficient way to generate SBSs of constant values. Note that an *n*-bit RNS generates *n* bits in each clock cycle, each of which is a one with probability 0.5, and we can use each of them to constitute an SBS of the value 0.5 [6; 13]. Therefore, to generate *m* SBSs of the value 0.5 and the length $2^n$, we can apply an *n*-bit RNS and select *m* outputs from it. The conventional architecture for the stochastic circuit is shown in Fig. 2, where BS is a bit selection module that selects *m* outputs from the *n*-bit RNS. Note that BS incurs no hardware overhead. The design needs only 2 RNSs, a comparator, and (*d* - 1) DFFs to generate (*d* + *m*) SBSs. However, it still needs 2 RNSs. In this paper, we will introduce a low-cost architecture that reduces the number of RNSs to 1.
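The conventional randomizer described above can be modeled behaviorally as follows. This is a sketch under stated assumptions, not the circuit of Fig. 2 itself: the RNS is taken to be an 8-bit Fibonacci LFSR with the commonly tabulated maximal-length taps (8, 6, 5, 4), the seeds are arbitrary, the DFFs are assumed to initially hold 0, and the stream length is set to the LFSR period of 255 rather than $2^n = 256$. All function names are hypothetical.

```python
def lfsr8(seed=1):
    """8-bit Fibonacci LFSR, taps (8, 6, 5, 4); visits all 255 nonzero states."""
    state = seed & 0xFF
    if state == 0:
        raise ValueError("seed must be nonzero")
    while True:
        yield state
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 4)) & 1
        state = (state >> 1) | (bit << 7)

def take(gen, n):
    """Collect the next n values of a generator."""
    return [next(gen) for _ in range(n)]

def sng(X, rns_values):
    """Comparator: output a 1 in every cycle where R < X."""
    return [1 if r < X else 0 for r in rns_values]

def delay(bits, cycles, fill=0):
    """Model a chain of `cycles` DFFs that initially hold `fill` (assumption)."""
    return [fill] * cycles + bits[:len(bits) - cycles]

def bit_select(rns_values, positions):
    """BS module: each selected RNS output bit forms one SBS of value ~0.5."""
    return [[(r >> p) & 1 for r in rns_values] for p in positions]

def conventional_randomizer(X, d, m, length=255, seed_x=1, seed_c=0x5A):
    """One SNG plus (d - 1) DFFs for x, and a second RNS with BS for 0.5."""
    r1 = take(lfsr8(seed_x), length)
    r2 = take(lfsr8(seed_c), length)
    x1 = sng(X, r1)
    x_streams = [delay(x1, k) for k in range(d)]
    c_streams = bit_select(r2, range(m))
    return x_streams, c_streams

x_streams, c_streams = conventional_randomizer(X=128, d=3, m=2)
```

Over one full period the LFSR takes every value from 1 to 255 exactly once, so the first variable stream contains $X - 1$ ones for $X \ge 1$, and each selected bit is a one in 128 of the 255 cycles.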

## 2.3 Conventional optimization method

A conventional method optimizes the stochastic circuit based on the above techniques. For a target function $f(x)$ with a given error bound, it first applies the method in [11] to obtain the feature vectors for many different pairs of *d* and *m*. Then, for these feature vectors, it applies the method in [7] to synthesize many SC cores. Afterward, it applies the conventional architecture shown in Fig. 2 to optimize the randomizer for each of these SC cores and obtains many stochastic circuits. Finally, among the stochastic circuits with an error less than the error bound, it outputs the one with the minimum hardware cost. The optimized stochastic circuit generally has a low hardware cost. However, the method optimizes the SC core and the randomizer separately without considering their mutual influence, and it needs to synthesize many different SC cores. This generally leads to a sub-optimal stochastic circuit and a long runtime.

# 3 METHODOLOGY

In this section, we first introduce a low-cost randomizer architecture and its associated configuration method. Then, we introduce a method to optimize the SC core. Finally, based on them, we propose a method to jointly optimize the randomizer and the SC core.

## 3.1 Method to optimize randomizer

For the optimization of the randomizer, we introduce a low-cost architecture and a configuration method proposed in our previous work [9]. The low-cost architecture is shown in Fig. 3, where BS, the $SR_i$'s, and the $NG_i$'s are the bit selection module, the scrambling modules, and the negation modules, respectively. Compared to the conventional architecture shown in Fig. 2, it shares a single RNS for generating the SBSs of both the value *x* and the value 0.5, and applies several scrambling and negation modules to permute and negate the output bits, respectively, to improve the accuracy [4; 9]. Note that the bit selection module, the scrambling modules, and the negation modules incur no hardware overhead, and the number of DFFs used is the minimum needed to generate *d* SBSs of the same value *x*. Thus, the randomizer architecture has a low hardware cost, with only a single RNS, a single comparator, and the minimum number of DFFs.



Figure 3: Low-cost architecture of the stochastic circuit [9].

To optimize the accuracy of the randomizer, we also proposed a method to configure the modules (RNS, BS, $NG_1$, $NG_2$, $SR_1$, $SR_2$, and $SR_3$), as shown in Fig. 4 [9]. It first initializes the modules. Then, it iteratively searches the configurations for some of them while fixing the others in a loop, which consists of 3 steps. In the first step, it exhaustively tries all possible configurations for BS, $NG_1$, and $NG_2$, and randomly tries configurations for RNS. In the second step, it first exhaustively tries all possible configurations for $SR_1$ and randomly tries configurations for $SR_2$ and RNS; then, it exhaustively tries all possible configurations for $SR_2$ and randomly tries configurations for $SR_1$ and RNS. In the third step, it tries all possible configurations for $SR_3$. The method updates the best configuration for the architecture whenever it obtains a higher accuracy, and terminates when a full pass through these 3 steps yields no update. As shown in [9], the method generally leads to high accuracy.



Figure 4: Flow chart of the configuration method [9].

## 3.2 Method to optimize the SC core

As discussed in Section 2.1, the previous method in [7] only considers a single feature vector for a specified pair of *d* and *m*. It ignores other feature vectors, which may lead to lower hardware costs. To address this issue, we proposed a *dynamic approximation* method in our previous work, as shown in Fig. 5 [10]. It synthesizes an SC core for a target function $f(x)$ satisfying a given error bound. It additionally takes a feature vector as input, which we call the *input feature vector*. It synthesizes an SC core iteratively, with 2 steps in each iteration.



Figure 5: Flow chart of the dynamic approximation method [10].

The first step is to try many possible ways to split the feature vector into a cube and a remaining feature vector. The cube is directly implemented by an AND gate, and the remaining feature vector is left for the next split step. Unlike the method proposed in [7], this step does not require that their sum exactly equal the original feature vector. For example, when *d* = 3 and *m* = 2, the feature vector (1, 1, 6, 2) can be split into the cube [0, 0, 2, 0] and the remaining feature vector (1, 1, 4, 2), or into the cube [0, 2, 4, 2] and the remaining feature vector (1, 0, 2, 0). The first split is exact, while the second is inexact, since its sum is (1, 2, 6, 2). The cube [0, 0, 2, 0] can be implemented by a 4-input AND gate with 3 SBSs of the value *x* and an SBS of the value 0.5 as inputs. The cube [0, 2, 4, 2] can be implemented by a 2-input AND gate with an SBS of the value *x* and an SBS of the value 0.5 as inputs. The remaining feature vectors (1, 1, 4, 2) and (1, 0, 2, 0) are left for the next split step. For the second split, as the sum of the cube and the remaining feature vector does not equal the original feature vector, the sum gives a new feature vector for the target function $f(x)$ with a new approximation error, which further leads to some new SC cores. To satisfy the accuracy requirement, we abandon any split with an approximation error larger than the error bound.
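The two cubes in this example can be reproduced with a short sketch. It assumes that an AND gate fed by `pos` SBSs of *x*, `neg` inverted SBSs of *x*, and `const` SBSs of 0.5 outputs a one with probability $x^{pos}(1-x)^{neg}/2^{const}$, and it rewrites that product in the basis of Eq. (1) by multiplying with $(x + (1-x))^{d - pos - neg}$. The inversion of one *x* input for the cube [0, 0, 2, 0] is an inference from the term $x^2(1-x)$ rather than a detail stated in the text, and the function names are hypothetical.

```python
from math import comb

def cube_vector(d, m, pos, neg, const):
    """Feature vector (degree d, precision m) of a single AND gate (a cube)."""
    free = d - pos - neg              # x inputs that the gate does not use
    if free < 0 or const > m:
        raise ValueError("cube does not fit the given degree and precision")
    scale = 2 ** (m - const)          # weight contributed by unused 0.5 inputs
    return [
        scale * comb(free, i - pos) if pos <= i <= d - neg else 0
        for i in range(d + 1)
    ]

def add_vectors(a, b):
    """Element-wise sum of a cube and a remaining feature vector."""
    return [u + v for u, v in zip(a, b)]

original = [1, 1, 6, 2]                                  # d = 3, m = 2
cube_a = cube_vector(3, 2, pos=2, neg=1, const=1)        # [0, 0, 2, 0], 4 inputs
cube_b = cube_vector(3, 2, pos=1, neg=0, const=1)        # [0, 2, 4, 2], 2 inputs
exact_sum = add_vectors(cube_a, [1, 1, 4, 2])            # equals the original
inexact_sum = add_vectors(cube_b, [1, 0, 2, 0])          # [1, 2, 6, 2]
```

Only the second sum differs from the original vector, so only the second split changes the polynomial being implemented and has to be re-checked against the error bound.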

By applying the first step, we obtain many different splits for the input feature vector and are more likely to find an SC core with a lower cost. However, as the process goes on, the number of splits grows exponentially, leading to a very large design space that takes a very long runtime to explore. To speed up the process, the second step prunes some unpromising splits, which are more likely to have high hardware costs than the others. For example, for the splits [0, 0, 2, 0] + (1, 1, 4, 2) and [0, 2, 4, 2] + (1, 0, 2, 0), the cubes need a 4-input and a 2-input AND gate, respectively. The former split is more likely to have a high hardware cost than the latter. Therefore, we can prune the former.

The loop terminates when no feature vector is left to split. Then, similar to [7], we construct many Boolean functions based on the cubes and further simplify them to SC cores by applying ESPRESSO [12]. Among the cores, we choose the one with the lowest hardware cost. Compared to the method proposed in [7], this method dynamically considers many possible feature vectors to approximate the target function in the synthesis process and generally obtains an SC core with a lower cost.

## 3.3 Joint optimization method

To optimize a stochastic circuit, a simple method is to separately apply the above randomizer and SC core optimization methods. Specifically, for a target function with a given error bound, we can first apply the dynamic approximation method to obtain an optimized SC core based on the initial input feature vector obtained by the method from [11], and then apply the low-cost architecture and the configuration method to optimize the randomizer. This method optimizes a stochastic circuit efficiently. However, it still optimizes the randomizer and the SC core *separately*. Such a separate method will sometimes lead to an optimized stochastic circuit with an error either much less or larger than the error bound. In the former case, we can further relax the accuracy constraint for the dynamic approximation method to obtain an SC core with a lower cost. In the latter case, the stochastic circuit does not satisfy the accuracy requirement, and we need to tighten the accuracy constraint for the dynamic approximation method. In the following, we call this method the separate optimization method.

To address the issue of the separate method, we propose a method for joint optimization of the randomizer and the SC core as shown in Algorithm 1. Its basic idea is to dynamically update the accuracy constraint for the dynamic approximation method by taking the influence of the randomizer configuration method into consideration. A proper accuracy constraint will lead to a properly optimized SC core and finally a low-cost stochastic circuit.

For a target function $f(x)$, based on the error bound for the entire stochastic circuit, $E_{bound}$, we first initialize the error bound for the dynamic approximation method, which optimizes the SC core, as $E_{core-bound} = tE_{bound}$, where *t* is a parameter. In general, to search a large design space and avoid missing the optimal design, we let $t \ge 1$ so that the error bound for the SC core optimization is no smaller than that for the entire stochastic circuit. Then, we optimize the stochastic circuit in a loop, which consists of 3 steps. First, in Line 5, we consider many different pairs of *d* and *m*, and apply the method in [11] to obtain the corresponding feature vectors. Note that a pair of *d* and *m* with a smaller sum is more likely to lead to a low-cost SC core [14]. Therefore, among the feature vectors whose approximation error $E_{input}$ is less than the error bound $E_{core-bound}$, we choose the one with the minimum sum of *d* and *m* as the input feature vector. Then, in Line 6, based on the input feature vector, we apply the dynamic approximation method to obtain an optimized SC core under the error bound $E_{core-bound}$. Finally, in Line 7, based on the obtained SC core, we apply the low-cost architecture and the configuration method to optimize the randomizer and obtain the error $E_{circuit}$ for the optimized stochastic circuit. If $E_{circuit}$ is larger than $E_{bound}$, Line 8 sets $E_{core-bound}$ to $E_{input}$ and we start the next iteration. Since the next input feature vector must have an error less than the new $E_{core-bound}$, it is strictly more accurate than the current one. Otherwise, Line 9 terminates the loop, and Line 10 outputs the optimized stochastic circuit.

Algorithm 1: Joint optimization method.

- 1 **input:** Target function $f(x)$ and error bound $E_{bound}$;
- 2 **output:** Optimized stochastic circuit;
- 3 $E_{core-bound} \leftarrow tE_{bound}$;
- 4 **while** true **do**
- 5 Apply the method in [11] to obtain an input feature vector $V$ with the error $E_{input} < E_{core-bound}$;
- 6 Based on the input feature vector $V$, apply the dynamic approximation method to obtain an optimized SC core under the error bound $E_{core-bound}$;
- 7 Based on the obtained SC core, apply the low-cost architecture and the configuration method to optimize the randomizer and obtain the error $E_{circuit}$;
- 8 **if** $E_{circuit} > E_{bound}$ **then** $E_{core-bound} \leftarrow E_{input}$;
- 9 **else** break;
- 10 **return** Optimized stochastic circuit;
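The control flow of Algorithm 1 can be sketched in Python as follows. The three callables stand in for the method in [11], the dynamic approximation method, and the randomizer configuration method; their names and signatures are hypothetical, and the toy stand-ins below use made-up error values purely to exercise the loop.

```python
def joint_optimize(e_bound, t, choose_input_vector, optimize_core,
                   optimize_randomizer):
    """Skeleton of Algorithm 1; the three callables are placeholders."""
    e_core_bound = t * e_bound                                    # Line 3
    while True:                                                   # Line 4
        vector, e_input = choose_input_vector(e_core_bound)       # Line 5
        core = optimize_core(vector, e_core_bound)                # Line 6
        circuit, e_circuit = optimize_randomizer(core, e_bound)   # Line 7
        if e_circuit > e_bound:                                   # Line 8
            e_core_bound = e_input
        else:                                                     # Line 9
            return circuit                                        # Line 10

# Toy stand-ins (assumptions): (d, m, approximation error) with made-up errors.
CANDIDATES = [(2, 2, 0.05), (3, 2, 0.03), (3, 3, 0.012), (4, 3, 0.004)]

def toy_choose(e_core_bound):
    """Pick the candidate with minimum d + m whose error is below the bound."""
    ok = [c for c in CANDIDATES if c[2] < e_core_bound]
    if not ok:
        raise ValueError("no feature vector satisfies the core error bound")
    d, m, err = min(ok, key=lambda c: c[0] + c[1])
    return (d, m), err

def toy_core(vector, e_core_bound):
    """Pretend synthesis: the 'core' just records the chosen (d, m)."""
    return {"dm": vector}

def toy_randomizer(core, e_bound):
    """Pretend the randomizer adds a fixed 0.005 to the core's error."""
    err = next(c[2] for c in CANDIDATES if (c[0], c[1]) == core["dm"])
    return core, err + 0.005

result = joint_optimize(0.02, 3, toy_choose, toy_core, toy_randomizer)
```

With these toy numbers the loop rejects the pairs (2, 2) and (3, 2), tightens the core bound twice, and returns the circuit built from (3, 3), whose toy circuit error of 0.017 is within the bound of 0.02.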

Note that we also slightly modify the randomizer configuration method for acceleration. The previous randomizer configuration method aims to minimize the error $E_{circuit}$ through an iterative improvement loop. However, we only require the optimized stochastic circuit to have an error $E_{circuit} \leq E_{bound}$. Thus, when applying the configuration method, we terminate its loop early once it finds a configuration with $E_{circuit} \leq E_{bound}$.
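The iterative improvement loop of Section 3.1 and the early exit described above can be sketched together. This is a simplified illustration rather than the configuration method of [9]: every step here proposes candidates exhaustively (the random tries for the RNS are omitted), and the module names, option ranges, and error function are all hypothetical.

```python
def configure(initial, steps, evaluate, e_bound=None):
    """Repeat the steps until a full pass gives no update; optionally stop
    early once the error is within e_bound."""
    best = dict(initial)
    best_err = evaluate(best)
    if e_bound is not None and best_err <= e_bound:
        return best, best_err
    improved = True
    while improved:
        improved = False
        for propose in steps:
            # Candidates are generated from the best configuration so far.
            for candidate in list(propose(best)):
                err = evaluate(candidate)
                if err < best_err:
                    best, best_err = candidate, err
                    improved = True
                    if e_bound is not None and best_err <= e_bound:
                        return best, best_err      # early termination
    return best, best_err

def vary(name, options):
    """Step that tries every option for one module, keeping the others fixed."""
    def propose(current):
        return [{**current, name: o} for o in options]
    return propose

def toy_error(cfg):
    """Made-up error surface with its minimum at BS = 3, SR1 = 2."""
    return 0.01 * (abs(cfg["BS"] - 3) + abs(cfg["SR1"] - 2))

steps = [vary("BS", range(4)), vary("SR1", range(4))]
start = {"BS": 0, "SR1": 0}
full, full_err = configure(start, steps, toy_error)
early, early_err = configure(start, steps, toy_error, e_bound=0.02)
```

In this toy run the full search continues until it reaches the zero-error configuration, while the early-terminating run stops as soon as the error drops to the bound, before the second module is touched. This mirrors the trade-off between runtime and final accuracy.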

Compared to the separate method, the joint method can obtain a stochastic circuit that always satisfies the accuracy requirement and is more likely to have a lower hardware cost.

# 4 EXPERIMENTAL RESULTS

In this section, we show the experimental results.

## 4.1 Experimental setup

In this work, we consider linear feedback shift registers (LFSRs) as the RNSs. We choose 6 functions as the test cases, which are listed in Table 1 together with their IDs. We compare the proposed joint method with the conventional and the separate optimization methods. For the joint optimization method shown in Algorithm 1, we set *t* = 3 and consider all possible pairs of positive *d* and positive *m* with their sums larger than 2 and less than 8. For a fair comparison, for the conventional optimization method, we consider the same pairs of *d* and *m* and optimize the randomizer by randomly configuring the feedback polynomials and the seeds of the LFSRs for half an hour. For the separate optimization method, we directly apply the error bound for the stochastic circuit as the accuracy constraint for the SC core optimization and obtain the input feature vector in the same way as the joint optimization method. Note that these 3 methods need to measure the hardware costs of some possible SC cores in the SC core optimization step. In this work, we apply the area-delay product (ADP) as the hardware cost measure, and obtain the ADP by applying ABC [15] based on the MCNC standard cell library [16]. For simplicity, in the following, we denote the conventional, the separate, and the joint optimization methods as *Conventional*, *Separate*, and *Joint*, respectively. We test them with a bit-width of 8; correspondingly, the length of an SBS is 256.

Table 1: The target functions.

| Function | $\sin(x)$ | $\cos(x)$ | $\tanh(x)$ | $e^{-x}$ | $\log(1+x)$ | $\frac{1}{1+e^{-x}}$ |
|----------|-----------|-----------|------------|----------|-------------|----------------------|
| ID       | 1         | 2         | 3          | 4        | 5           | 6                    |

## 4.2 Accuracy comparison

We first compare the accuracy of the 3 methods. We apply root mean square error (RMSE) as the accuracy measure, and compute it over 1000 different inputs. We set the error bound  $E_{bound}$  for the stochastic circuit as 0.02. We apply the 3 methods to optimize the stochastic circuits. The RMSEs of the optimized stochastic circuits to implement different functions are shown in Table 2.

Table 2: RMSE comparison for 3 methods.

| Function     | 1     | 2     | 3     | 4     | 5     | 6     | Average |
|--------------|-------|-------|-------|-------|-------|-------|---------|
| Conventional | 0.007 | 0.012 | 0.006 | 0.009 | 0.012 | 0.003 | 0.008   |
| Separate     | 0.006 | 0.005 | 0.007 | 0.005 | 0.007 | 0.004 | 0.006   |
| Joint        | 0.011 | 0.014 | 0.013 | 0.019 | 0.019 | 0.007 | 0.014   |

As shown in Table 2, the RMSEs of the optimized stochastic circuits for the 3 methods are all less than the error bound. Thus, the optimized stochastic circuits satisfy the accuracy requirement. Note that the RMSEs of *Joint* are higher than those of the other 2 methods. This is because we terminate the randomizer configuration process early once it finds a configuration satisfying the accuracy requirement, as introduced in Section 3.3. The accuracy of *Joint* can be further improved by continuing the configuration process, but this leads to a longer runtime.
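The RMSE measure used in this comparison can be computed as in the following sketch. How the 1000 inputs are chosen is not stated here, so evenly spaced points in [0, 1] are an assumption, and the "circuit" below is a hypothetical stand-in with a constant offset rather than a simulated stochastic circuit.

```python
import math

def rmse(target, circuit, inputs):
    """Root mean square error between circuit outputs and the target function."""
    squared = [(circuit(x) - target(x)) ** 2 for x in inputs]
    return math.sqrt(sum(squared) / len(inputs))

inputs = [i / 999 for i in range(1000)]      # 1000 inputs in [0, 1] (assumed)

def biased_sin(x):
    """Hypothetical circuit output: sin(x) with a constant error of 0.01."""
    return math.sin(x) + 0.01

error = rmse(math.sin, biased_sin, inputs)   # close to 0.01
within_bound = error <= 0.02                 # compare against E_bound = 0.02
```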

## 4.3 Hardware cost comparison

For the hardware cost comparison, we first compare the 3 methods in terms of area. We synthesize the optimized stochastic circuits by Synopsys Design Compiler [17] and obtain their areas based on the Nangate 45nm library [18]. The results are shown in Fig. 6.



Figure 6: Area comparison for 3 methods.

As shown in the figure, the optimized stochastic circuits obtained by *Joint* generally have the smallest area. Compared to *Conventional* and *Separate*, *Joint* can achieve 39.70% and 3.59% area reduction on average, respectively.

Then, we compare the 3 methods in terms of power. We analyze the optimized stochastic circuits by Synopsys PrimeTime [17] and obtain their power based on the Nangate 45nm library [18]. The results are shown in Fig. 7.



Figure 7: Power comparison for 3 methods.

As shown in the figure, the optimized stochastic circuits obtained by *Joint* generally have the smallest power. Compared to *Conventional* and *Separate*, *Joint* can achieve 42.74% and 3.57% power reduction on average, respectively. Therefore, the proposed joint optimization method can achieve a lower hardware cost compared to the conventional and the separate methods under the same accuracy bound.

Note that compared to *Separate*, *Joint* only slightly reduces the hardware cost. This is because these 2 methods generally produce randomizers with the same hardware cost and the improvement of *Joint* mainly lies in the SC core, which does not occupy a large portion of a stochastic circuit.

## 4.4 Runtime comparison

Finally, we compare the runtime of the methods. Note that for *Conventional*, we randomly configure the randomizer for half an hour, and the method needs to try many different pairs of degree and precision for each target function. This leads to a much longer runtime than the other 2 methods. Therefore, we only compare *Separate* and *Joint* on runtime. The results are listed in Table 3.

Table 3: Runtime (s) comparison for Separate and Joint.

| function | 1     | 2     | 3     | 4      | 5      | 6     | Average |
|----------|-------|-------|-------|--------|--------|-------|---------|
| Separate | 665.5 | 237.7 | 421.1 | 1100.6 | 812.9  | 177.8 | 569.3   |
| Joint    | 13.0  | 0.3   | 468.8 | 1066.0 | 1162.7 | 0.2   | 451.8   |

As shown in the table, compared to *Separate*, *Joint* has a shorter runtime on average. In particular, the runtime for functions 1, 2, and 6 is much shorter. This is because for these functions, *Joint* finds a valid stochastic circuit in the first round and the randomizer configuration process terminates early. In summary, *Joint* efficiently produces a stochastic circuit with a lower hardware cost.

# 5 CONCLUSION

In this work, aiming to design low-cost stochastic circuits, we first introduce a randomizer optimization method and an SC core optimization method. Then, by properly combining them, we propose a joint method to efficiently optimize the stochastic circuits. The experimental results show that compared to the conventional and the separate methods, the proposed joint method can generally achieve a lower hardware cost using a shorter synthesis time. In the future, we will explore a more efficient joint method to optimize stochastic circuits.

# REFERENCES

- [1] B. R. Gaines. Stochastic computing. In *AFIPS Spring Joint Computer Conference*, pages 149–156, 1967.
- [2] H. Ichihara, S. Ishii, et al. Compact and accurate stochastic circuits with shared random number sources. In *ICCD*, pages 361–366, 2014.
- [3] P. Ting and J. P. Hayes. Isolation-based decorrelation of stochastic circuits. In *ICCD*, pages 88–95, 2016.
- [4] J. H. Anderson, Y. Hara-Azumi, and S. Yamashita. Effect of LFSR seeding, scrambling and feedback polynomial on stochastic computing accuracy. In *DATE*, pages 1550–1555, 2016.
- [5] S. A. Salehi. Low-cost stochastic number generators for stochastic computing. *IEEE TVLSI*, 28(4):992–1001, 2020.
- [6] Z. Zhao and W. Qian. A general design of stochastic circuit and its synthesis. In DATE, pages 1467–1472, 2015.
- [7] X. Peng and W. Qian. Stochastic circuit synthesis by cube assignment. IEEE TCAD, 37(12):3109–3122, 2018.
- [8] K. Parhi and Y. Liu. Computing arithmetic functions using stochastic logic by series expansion. *IEEE TETC*, 7(1):44–59, 2019.
- [9] K. Zhong, Z. Li, and W. Qian. Towards low-cost high-accuracy stochastic computing architecture for univariate functions: Design and design space exploration. In *DATE*, pages 346–351, 2022.
- [10] C. Wang et al. Exploring target function approximation for stochastic circuit minimization. In *ICCAD*, pages 1–9, 2020.
- [11] W. Qian et al. An architecture for fault-tolerant computation with stochastic logic. *IEEE TC*, 60(1):93–105, 2011.
- [12] R. L. Rudell and A. Sangiovanni-Vincentelli. Multiple-valued minimization for PLA optimization. *IEEE TCAD*, 6(5):727–750, 1987.
- [13] J. Alspector et al. A VLSI-efficient technique for generating multiple uncorrelated noise sources and its application to stochastic neural networks. *IEEE TCAS*, 38(1):109–123, 1991.
- [14] X. Wang, Z. Chu, and W. Qian. MinSC: An exact synthesis-based method for minimal-area stochastic circuits under relaxed error bound. In *ICCAD*, pages 1–9, 2021.
- [15] A. Mishchenko et al. ABC: A system for sequential synthesis and verification, release 80916. http://people.eecs.berkeley.edu/~alanmi/abc/, 2021.
- [16] S. Yang. Logic synthesis and optimization benchmarks. Technical report, Microelectronics Center of North Carolina, 1991.
- [17] Synopsys Inc. http://www.synopsys.com, 2021.
- [18] Nangate Inc. http://www.nangate.com, 2021.