# A Design Method of Address Generators Using Hash Memories 

Tsutomu Sasao<br>Department of Computer Science and Electronics, Kyushu Institute of Technology, Iizuka 820-8502, Japan


#### Abstract

An address generator produces a unique address from 1 to $k$ when the input that matches one of $k$ registered vectors, and produces 0 for other inputs. This paper presents a method to design an address generator using a hash memory and an LUT cascade. The hash memory realizes about $90 \%$ of the registered vectors, while the LUT cascade realizes the remaining $10 \%$ of the registered vectors. This method uses a non-disjoint functional decomposition to reduce the size of memory. The experimental results using lists of English words show that the usefulness of the approach. The total amount of memory is only $20 \%$ to $25 \%$ of the memory necessary to implement the function by using an LUT cascade alone. Theoretical analysis supports the experimental results.


## 1 Introduction

Consider a set of $k$ distinct binary vectors of $n$ bits. An address generation function produces a unique address from 1 to $k$ for the input that matches a vector in the set, and produces 0 for vectors outside the set [12]. Address generation functions are used in hardware for the internet [1], memory patching circuits [7], etc. In this paper, we assume that the number of vectors $k$ in the set is much smaller than that of the maximal possible input combinations $2^{n}$.
For example, consider an address generation function with $n=32$ and $k=40,000$. The straightforward way to implement this address generation function is to store the truth table into a memory. However, this method require a memory with unrealistic size, since the size of the memory is proportional to $2^{n}$. Another method to implement the function is a two-level logic circuit or a Programmable Logic Array (PLA). Unfortunately, this method still requires large chip area.
The third method is a content addressable memory (CAM)[6]. It requires a special circuit that cannot be implemented by ordinary logic gates or memory. Recently, we developed a method that uses an LUT cascade [13]. The LUT cascade is realized by a series connection of ordinary memories. It realizes address generation functions
efficiently when $k$ is up to $k=1000$. However, when $k$ is larger, an LUT cascade requires cells with $\left\lceil\log _{2}(k+1)\right\rceil+1$ inputs. So, it can be too large for the available resource in an embedded system.
In this paper, we present an efficient method to implement an address generation function by a hybrid method that uses a hash memory and an LUT cascade. The hash memory implements about $90 \%$ of the vectors, while the LUT cascade realizes remaining $10 \%$ of the vectors. The hybrid method requires only 20 to $25 \%$ of the memory necessary to implement the function by an LUT cascade alone. Theoretical analysis is also done, which supports the experimental results.
Besides address generation functions, this design method can implement an $n$-variable function, where the number of non-zero outputs $k$ is much smaller than $2^{n}$.

## 2 Address Generation Function

Definition 2.1 Consider a set of $k$ binary vectors of $n$ bits. Denote these vectors as registered vectors. For each registered vector, assign unique integer from 1 to $k$. A registered vector table shows the relation of registered vectors and corresponding integers. An address generation function produces the corresponding integer if the input matches to a registered vector, and produces 0 otherwise. $k$ is the weight of the address generation function.

In this paper, we assume that $k$ is much smaller than $2^{n}$, the total number of input combinations.

Example 2.1 Table 2.1 shows a registered vector table consisting of 7 vectors. The corresponding address generation function is shown in Table 2.2, where only non-zero outputs are shown. It produces a 3-bit number (e.g., 001) corresponding to the integer of the matched vector. When no entry matches to the input vector, the function produces 000.
(End of Example)

Table 2.1: Registered vector table

| Address | Vector |
| :---: | :---: |
| 1 | 000010 |
| 2 | 010010 |
| 3 | 001010 |
| 4 | 001110 |
| 5 | 000001 |
| 6 | 111011 |
| 7 | 010111 |

Table 2.2: Address generation function

| $x_{1}$ | $x_{2}$ | $x_{3}$ | $x_{4}$ | $x_{5}$ | $x_{6}$ | $f_{2}$ | $f_{1}$ | $f_{0}$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 |
| 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
| 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 |
| 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |

## 3 Realization Using LUT Cascade

An address generation function can be directly implemented by an ordinary memory. For example, the address generation function shown in Table 2.2 can be directly implemented by a 64 -word memory, where each word consists of 3 bits. In the case of an address generation function of $n$ variables, the size of the memory is proportional to $2^{n}$ even if the registered vector table contains only a few elements. For such a case, the LUT cascade realization greatly reduces the necessary amount of memory.

Definition 3.1 Given a function $f(X): B^{n} \rightarrow\{0,1, \ldots, k\}$, where $B=\{0,1\}$ and $X=\left(x_{1}, x_{2}, \ldots, x_{n}\right)$. Let $\left(X_{1}, X_{2}\right)$ be a partition of $X$. Let $n_{1}$ be the number of variables in $X_{1}$, and let $n_{2}$ be the number of variables in $X_{2}\left(n_{1}+n_{2}=n\right)$. The decomposition chart of $f$ is a two-dimensional matrix with $2^{n_{1}}$ columns and $2^{n_{2}}$ rows. The column labels correspond to all possible binary numbers of $n_{1}$ bits, and the row labels correspond to all possible binary numbers of $n_{2}$ bits. And the corresponding matrix value is equal to $f\left(X_{1}, X_{2}\right)$. Among the decomposition charts for the function $f$, one with $X_{1}=\left(x_{1}, x_{2}, \ldots, x_{n_{1}}\right)$ and $X_{2}=\left(x_{n_{1}+1}, x_{n_{1}+2}, \ldots, x_{n}\right)$ is the standard decomposition chart. The number of different column patterns in a decomposition chart is the column multiplicity. As a special case of a decomposition chart, we also consider the case where $X_{1}=X$.

Definition 3.2 Let $\left(x_{1}, x_{2}, \ldots, x_{n}\right)$ be the ordering of the input variables. The $\mathbf{C}$-measure of $f$ is the maximum column multiplicity over all the standard decomposition charts for the function $f$.


Figure 3.1: Realization of logic function by decomposition.


Figure 3.2: LUT cascade with intermediate outputs
Example 3.1 The $C$-measure of $f_{1}=x_{1} x_{2} \vee x_{3} x_{4} \vee x_{5} x_{6}$ is 3, while the $C$-measure of $f_{2}=x_{1} x_{5} \vee x_{2} x_{6} \vee x_{3} x_{4}$ is 8 .
(End of Example)
Lemma 3.1 The C-measure of a logic function $f$ with weight $k$ is at most $k+1$.

Theorem 3.1 [2] For a given function $f$, let $X_{1}$ be the variables corresponding to the columns, and let $X_{2}$ be the variables corresponding to the rows of the decomposition chart. Let $\mu$ be the column multiplicity of the decomposition chart. Then, the function $f$ is realized by the circuit shown in Fig. 3.1. In this case, the number of connections that connect two blocks $H$ and $G$ is $\left\lceil\log _{2} \mu\right\rceil$.

When the number of connections that connect two blocks is smaller than the number of variables in $X_{1}$, we have a chance to reduce the amount of memory to realize the function. The decomposition shown in Fig. 3.1 is a disjoint decomposition, since $X_{1}$ and $X_{2}$ do not have a common element. By decomposing the given function iteratively, we have an LUT cascade[9] shown in Fig. 3.2. An LUT cascade consists of cells connected to each other by rails. A function with a small C-measure can be realized with a compact LUT cascade.

Theorem 3.2 [11] A logic function with $C$-measure $\mu$ can be realized by an LUT cascade consisting of cells with at most $q+1$ inputs and at most $q$ outputs, where $q=\left\lceil\log _{2} \mu\right\rceil$.

A function with a small C-measure can be efficiently realized by an LUT cascade. Thus, the C-measure can be used to predict the complexity of the LUT cascade.

Theorem 3.3 [11] Consider an LUT cascade that realizes a function $f$. Let $n$ be the number of primary inputs; let $s$ be the number of cells; let $q$ be the maximum number of rails (i.e., the number of signal line between adjacent cells); let $p$ be the maximum number of inputs to cells; and let $\mu$ be the $C$-measure of the function $f$. When $p \geq\left\lceil\log _{2} \mu\right\rceil+1$,
there exists an LUT cascade for $f$ that satisfies the following relation:

$$
s \leq\left\lceil\frac{n-q}{p-q}\right\rceil
$$

An address generation function with weight $k$ is realized by an LUT cascade with $\left\lceil\log _{2}(k+1)\right\rceil$ rails. However, when the value of $k$ is large, the single LUT cascade requires cells with many inputs, and this results in a large circuit. For example, when $k=40,000$, a cascade requires cells with 17 inputs and 16 outputs. In such a case, we can partition the set of vectors into several groups, and realize each group by a separate LUT cascade to reduce the total amount of memory. Such a method requires a special encoder to combine the outputs of cascades[8].

## 4 Hash-Based Design

### 4.1 Basic Idea

For an address generation function $f\left(X_{1}, X_{2}\right)$ with weight $k$, we transform the input variable $\left(X_{1}, X_{2}\right)$ into $\left(Y_{1}, X_{2}\right)$ to hash the address space. Let $\hat{f}\left(Y_{1}, X_{2}\right)$ be the function after hashing, and consider its decomposition chart (Fig.4.1). Let $p$ be the number of variables in $Y_{1}$. If the non-zero elements are uniformly distributed in the decomposition chart, then each column of the decomposition chart has at most one non-zero element when $2^{p}>k$. For simplicity, let us assume that each column of the decomposition chart has at most one non-zero element. Next, let

$$
\hat{h}\left(Y_{1}\right)=\max _{\vec{b} \in B^{n_{2}}} \hat{f}\left(Y_{1}, \vec{b}\right)
$$

and realize $\hat{h}\left(Y_{1}\right)$ by a hash memory.
Note that the function $\hat{f}$ depends on $\left(Y_{1}, X_{2}\right)$. On the other hand, the function $\hat{h}\left(Y_{1}\right)$ depends only on $Y_{1}$. Thus, the output of $\hat{h}\left(Y_{1}\right)$ may not be equal to $\hat{f}\left(Y_{1}, X_{2}\right)$. We will check the equality of the functions by the auxiliary (AUX) memory shown in Fig. 4.2. Note that the auxiliary memory has $q=\left\lceil\log _{2}(k+1)\right\rceil$ inputs and $r=n-p$ outputs. The auxiliary memory stores the $X_{2}$ part of the registered vector in the corresponding address. When the values of $X_{2}$ are equal to the values produced by the AUX memory, the output of the hash memory $\hat{f}\left(Y_{1}, X_{2}\right)$ is correct, and it is is sent to the OR gates. Otherwise, the hash memory produces an incorrect output, and the zero vector is sent to the OR gates. When a column has more than one non-zero element, we decompose $f\left(X_{1}, X_{2}\right)$ into two: $f\left(X_{1}, X_{2}\right)=\hat{f}_{1}\left(Y_{1}, X_{2}\right) \vee f_{2}\left(X_{1}, X_{2}\right)$, where $\hat{f}_{1}\left(Y_{1}, X_{2}\right)$ has at most one non-zero element in each column of the decomposition chart. On the other hand $f_{2}\left(X_{1}, X_{2}\right)$ is the remaining address generation function with smaller weights. Thus, $f_{2}\left(X_{1}, X_{2}\right)$ can be realized by a smaller LUT cascade.
The features of the realization method are as follows: The hash memory is efficient, but it realizes a limited class of


Figure 4.1: Decomposition chart for the hashed function $\hat{f}\left(Y_{1}, X_{2}\right)$.


Figure 4.2: Realization of address generation function by the hybrid method.
address generation functions. On the other hand, the LUT cascade realizes any address generation function, but it is not so efficient when $k$ is large. The hybrid method shown in Fig.4.2 combines these two methods to implement an address generation function efficiently.
The hybrid method is effective for the address generation function, where $k<2^{p}$. In this method, the hash memory and the auxiliary memory realize about $90 \%$ of the nonzero elements, while the LUT cascade realizes the remaining $10 \%$ of non-zero elements. This fact is confirmed by the the theoretical analysis in Section 5, and the experimental results in Section 6. Note that in Fig. 4.2, the upper part (i.e., the LUT cascade) uses a disjoint decomposition, while the lower part, uses a non-disjoint decomposition.

### 4.2 A Method to Generate A Hash Function

A hash function is used to scatter the non-zero elements of the address generation function uniformly in the decomposition chart. In this paper, we use the following function $Y_{1}=\left(y_{1}, y_{2}, \ldots, y_{p}\right)$, where $y_{i}=x_{i} \oplus x_{j}$ and $x_{j} \in\left\{X_{2}\right\}$.

## Generation of the hash function

In an address generation function $f\left(X_{1}, X_{2}\right)$, let $X_{1}=$ $\left(x_{1}, x_{2}, \ldots, x_{p}\right)$ be the bound variables, and let $X_{2}=$
$\left(x_{p+1}, x_{p+2}, \ldots, x_{n}\right)$ be the free variable. Let $\hat{f}\left(Y_{1}, X_{2}\right)$ be the function which is obtained by replacing the bound variables $X_{1}$ with $Y_{1}=\left(y_{1}, y_{2}, \ldots, y_{p}\right)$. Let $w$ the number of columns that have at least one non-zero element. To store more registered vectors in the hash memory, we choose a $Y_{1}$ that maximizes $w$. We use the following heuristic:

Algorithm 4.1 For each element $x_{i}(i=1,2, \ldots, p)$ in $X_{1}$, let $y_{i}=x_{i} \oplus x_{j}$. Select $x_{j} \in\left\{X_{2}\right\}$ that makes $w$ maximum. Continue this operation while $w$ increases.

The hash function is obtained by using $Y_{1}$.

### 4.3 Design of Address Generator

For an address generation function $f\left(X_{1}, X_{2}\right)$ with weight $k$, let $\hat{f}\left(Y_{1}, X_{2}\right)$ be the function that is obtained by replacing the the bound variables $X_{1}=\left(x_{1}, x_{2}, \ldots, x_{p}\right)$ with $\left(y_{1} \oplus x_{j_{1}}, y_{2} \oplus\right.$ $\left.x_{j_{2}}, \ldots, y_{p} \oplus x_{j_{p}}\right)$, where, $p>\left\lceil\log _{2}(k+1)\right\rceil$. For each $\vec{a} \in$ $B^{p}$, when $\hat{f}\left(\vec{a}, X_{2}\right)$ has more than one non-zero output, replace the non-zero elements except for the minimum value by 0 , to obtain the function $\hat{f}_{1}\left(Y_{1}, X_{2}\right)$. Next, let $\hat{f}_{2}\left(Y_{1}, X_{2}\right)=$ $\hat{f}\left(Y_{1}, X_{2}\right) \oplus \hat{f}_{1}\left(Y_{1}, X_{2}\right)$. Since $\hat{f}_{1}\left(Y_{1}, X_{2}\right) \cdot \hat{f}_{2}\left(Y_{1}, X_{2}\right)=0$, we have the relation: $\hat{f}\left(Y_{1}, X_{2}\right)=\hat{f}_{1}\left(Y_{1}, X_{2}\right) \vee \hat{f}_{2}\left(Y_{1}, X_{2}\right)$. Note that in the decomposition chart for $\hat{f}_{1}\left(Y_{1}, X_{2}\right)$, each column has at most one non-zero element. Next, let

$$
\hat{h}\left(Y_{1}\right)=\max _{\vec{b} \in B^{n_{2}}} \hat{f}_{1}\left(Y_{1}, \vec{b}\right)
$$

and realize $\hat{h}\left(Y_{1}\right)$ by the hash memory. Since the value of $\hat{h}\left(Y_{1}\right)$ can be different from the value of $\hat{f}_{1}\left(Y_{1}, X_{2}\right)$, we check if it is correct or not by using the auxiliary memory. Also, by transforming $x_{i}=y_{i} \oplus x_{j}$, we generate the function $f_{2}\left(X_{1}, X_{2}\right)$ from $\hat{f}_{2}\left(Y_{1}, X_{2}\right)$. Finally, realize $f_{2}\left(X_{1}, X_{2}\right)$ by an LUT cascade.

Example 4.1 Table 4.1 is a decomposition chart of the 6 variable function $f\left(X_{1}, X_{2}\right)$ with weight $k=7$ that appeared in Table 2.2. In this function, transform the bound variables $X_{1}=\left(x_{1}, x_{2}, x_{3}\right)$ into $Y_{1}=\left(y_{1}, y_{2}, y_{3}\right)=\left(x_{1} \oplus\right.$ $\left.x_{6}, x_{2} \oplus x_{5}, x_{3} \oplus x_{4}\right)$. The decomposition chart of the hashed function $\hat{f}\left(Y_{1}, X_{2}\right)$ is shown in Table 4.2. In the hashed function, the columns of the original decomposition tables are permutated. Also, each row has a different permutation. In the original table, three columns for $\left(x_{1}, x_{2}, x_{3}\right)=$ $(0,0,0),(0,1,0),(0,0,1)$ have two non-zero elements. On the other hand, in the decomposition table in Table 4.2 for the hashed function $\hat{h}\left(Y_{1}, X_{2}\right)$, only one column $\left(y_{1}, y_{2}, y_{3}\right)=$ $(0,1,0)$ has two non-zero elements. Let $\hat{f}_{1}\left(Y_{1}, X_{2}\right)$ be the function where the non-zero element 4 is replaced by 0. The decomposition chart is shown in Table 4.3. Table 4.4 shows the decomposition chart of the function $\hat{f}_{2}\left(Y_{1}, X_{2}\right)$ that is realized by the cascade. In this case, the function has only one non-zero element. $\hat{f}_{1}\left(Y_{1}, X_{2}\right)$ is implemented by the hash memory shown in Table 4.5 and the


Figure 4.3: Realization of a 6-variable function by the hybrid method.

Table 4.1: Decomposition chart for $f\left(X_{1}, X_{2}\right)$.

|  | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | $x_{3}$ |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
|  | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | $x_{2}$ |
|  | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | $x_{1}$ |
| 000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |  |
| 001 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |  |
| 010 | 1 | 0 | 2 | 0 | 3 | 0 | 0 | 0 |  |
| 011 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 |  |
| 100 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |  |
| 101 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |  |
| 110 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 |  |
| 111 | 0 | 0 | 7 | 0 | 0 | 0 | 0 | 0 |  |
| $x_{6} x_{5} x_{4}$ |  |  |  |  |  |  |  |  |  |

auxiliary memory shown in Table 4.6. The output of the hash memory $\hat{h}\left(Y_{1}\right)$ shows the non-zero value of the function $\hat{f}_{1}$ for the column $Y_{1}=\left(y_{1}, y_{2}, y_{3}\right)$. The auxiliary memory shown in Table 4.6 decides if the output is zero or not. The function that is implemented by the LUT cascade has non-zero output 4. The corresponding input values are $\left(x_{1}, x_{2}, x_{3}, x_{4}, x_{5}, x_{6}\right)=(0,0,1,1,1,0)$. Fig. 4.3 shows the whole network for function $f$. The AUX memory and comparator check if $\left(x_{4}, x_{5}, x_{6}\right)$ is the input that produces the non-zero output. The LUT cascade consists of AND gates. The non-zero output is 4, and its binary representation is $(1,0,0)$. This is implemented by ORing the most significant bit of the AND gates and the output of the cascade.
(End of Example)

Theorem 4.1 When $Y_{1}=\left(y_{1}, y_{2}, \ldots, y_{p}\right)$, where $y_{i}=x_{i} \oplus$ $x_{j}$, for $x_{j} \in\left\{X_{2}\right\}$, is used as the hash function, only the outputs for $X_{2}$ are necessary in the auxiliary memory and the comparator in Fig.4.2.

Table 4.2: Decomposition chart for $\hat{f}\left(Y_{1}, X_{2}\right)$ (hashed function).

|  | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | $y_{3}$ |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
|  | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | $y_{2}$ |
|  | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | $y_{1}$ |
| 000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |  |
| 001 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |  |
| 010 | 2 | 0 | 1 | 0 | 0 | 0 | 3 | 0 |  |
| 011 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 |  |
| 100 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 |  |
| 101 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |  |
| 110 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 |  |
| 111 | 0 | 0 | 0 | 0 | 0 | 7 | 0 | 0 |  |
| $x_{6} x_{5} x_{4}$ |  |  |  |  |  |  |  |  |  |

Table 4.3: Decomposition chart for $\hat{f}_{1}\left(Y_{1}, X_{2}\right)$.

|  | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | $y_{3}$ |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
|  | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | $y_{2}$ |
|  | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | $y_{1}$ |
| 000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |  |
| 001 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |  |
| 010 | 2 | 0 | 1 | 0 | 0 | 0 | 3 | 0 |  |
| 011 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |  |
| 100 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 |  |
| 101 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |  |
| 110 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 |  |
| 111 | 0 | 0 | 0 | 0 | 0 | 7 | 0 | 0 |  |
| $x_{6} x_{5} x_{4}$ |  |  |  |  |  |  |  |  |  |

(Proof) The hash memory realizes the function $\hat{h}\left(Y_{1}\right)$. The output value is derived using the outputs of the hash memory and the output of the auxiliary memory. In the decomposition chart of the hashed function, the row elements of the original decomposition chart are permuted. To verify the correctness of the outputs of the hash memory, we need

Table 4.4: Decomposition chart for $\hat{f}_{2}\left(Y_{1}, X_{2}\right)$

|  | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | $y_{3}$ |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
|  | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | $y_{2}$ |
|  | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | $y_{1}$ |
| 000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |  |
| 001 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |  |
| 010 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |  |
| 011 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 |  |
| 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |  |
| 101 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |  |
| 110 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |  |
| 111 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |  |
| $x_{6} x_{5} x_{4}$ |  |  |  |  |  |  |  |  |  |

Table 4.5: Function $\hat{h}\left(Y_{1}\right)$ realized by the hash memory.

| $y_{3}$ | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| $y_{2}$ | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
| $y_{1}$ | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| $\hat{h}\left(Y_{1}\right)$ | 2 | 5 | 1 | 0 | 6 | 7 | 3 | 0 |

Table 4.6: Contents of the auxiliary memory

| $z_{1} z_{2} z_{3}$ | $x_{4}$ | $x_{5}$ | $x_{6}$ |
| :---: | :---: | :---: | :---: |
| 000 | 0 | 0 | 0 |
| 001 | 0 | 1 | 0 |
| 010 | 0 | 1 | 0 |
| 011 | 0 | 1 | 0 |
| 100 | 0 | 0 | 0 |
| 101 | 0 | 0 | 1 |
| 110 | 0 | 1 | 1 |
| 111 | 1 | 1 | 1 |

only to check the value of $X_{2}$.
That is, if the output values of the auxiliary memory and the values of $X_{2}$ are the same, the values for $X_{1}$ are also the same, which is shown to be true by the relation $y_{i}=x_{i} \oplus$ $x_{j}$. Thus, the outputs of the hash memory are equal to the outputs. If the output values of the auxiliary memory and the value of $X_{2}$ are different, the output value is 0 . (Q.E.D.)

Example 4.2 For the network shown in Fig. 4.3, we need to check only $X_{2}=\left(x_{4}, x_{5}, x_{6}\right)$ in the output of the auxiliary memory to see if the input produces the non-zero output.
(End of Example)

## 5 Number of Registered Vectors Realized by Hash Memory

In this part, we assume that the non-zero elements are uniformly distributed in the decomposition chart, and obtain the fraction of registered vectors realized by the hash memory.

Theorem 5.1 Let $f$ be an n-variable address generation function with weight $k$, and the non-zero elements be uniformly distributed in the decomposition chart. Then, the fraction of registered vectors realized by the hash memory shown in Fig. 4.2 is given by

$$
\delta \simeq 1-\frac{1}{2}\left(\frac{k}{2^{p}}\right)+\frac{1}{6}\left(\frac{k}{2^{p}}\right)^{2}
$$

where $q=\left\lceil\log _{2}(k+1)\right\rceil$, and $p=\left|Y_{1}\right|$ denotes the number of bound variables in the decomposition chart for $f\left(Y_{1}, X_{2}\right)$, and $k<2^{p}$.
(Proof) Let $k$ be the total number of non-zero elements in the decomposition chart. Then, $\alpha=\frac{k}{2^{n}}$ denotes the fraction of the non-zero elements in the decomposition chart. Also, $\beta=1-\alpha$ denotes the fraction of the zero elements in the decomposition chart. And, $r=n-p=\left|X_{2}\right|$ denotes the number of free variables in the decomposition chart. In this case, we have the following:

1. The probability that a column has only 0 elements is $\beta^{2^{r}}$.
2. The probability that a column has at least one non-zero element is $1-\beta^{2^{r}}$.

In total, there are $2^{p}$ columns, and the total number of nonzero elements is $k$. For the column with more than one nonzero element, the hash memory realizes only one non-zero element. In this case, the fraction of registered vectors realized by the hash memory is given by

$$
\begin{aligned}
\delta & =\left(1-\beta^{2^{r}}\right) \cdot \frac{2^{p}}{k} \\
& =\left(1-(1-\alpha)^{2^{r}}\right) \cdot \frac{2^{p}}{k} \\
& =\left(1-\left(\sum_{i=0}^{2^{r}}(-1)^{i}\binom{2^{r}}{i} \alpha^{i}\right)\right) \cdot \frac{2^{p}}{k} \\
& =\left(\sum_{i=1}^{2^{r}}(-1)^{i+1}\binom{2^{r}}{i} \alpha^{i}\right) \cdot \frac{2^{p}}{k} .
\end{aligned}
$$

Note that when $2^{p}>k$, the absolute value of the each term of the above expression decreases with the increase of $i$. Next, by approximating $\delta$ by using first three terms of the above expression, we have

$$
\begin{aligned}
\delta & \simeq\left[2^{r} \alpha-\frac{2^{2 r} \alpha^{2}}{2}+\frac{2^{3 r} \alpha^{3}}{6}\right] \cdot \frac{2^{p}}{k} \\
& \simeq\left[2^{r}\left(\frac{k}{2^{n}}\right)-\frac{1}{2} 2^{2 r}\left(\frac{k}{2^{n}}\right)^{2}+\frac{1}{6} 2^{3 r}\left(\frac{k}{2^{n}}\right)^{3}\right] \cdot \frac{2^{p}}{k} \\
& \simeq 1-\frac{1}{2}\left(\frac{k}{2^{p}}\right)+\frac{1}{6}\left(\frac{k}{2^{p}}\right)^{2}
\end{aligned}
$$

(Q.E.D.)

For example, when $\frac{k}{2^{p}}=\frac{1}{4}$, we have $\delta \simeq 0.8854$.
Example 5.1 Consider the case of $n=40$ and $k=1730$ in Theorem 3.2. Since $q=\left\lceil\log _{2}(k+1)\right\rceil=\left\lceil\log _{2}(1730+\right.$ $1)\rceil=11$, the number of bound variables is $p=13$.

1. When the function is realized by an LUT cascade alone.
Let $p=13$ be the number of inputs for cells. Then, from Theorem 3.3, the number of levels of the cascade is given by

$$
\left\lceil\frac{n-q}{p-q}\right\rceil=\left\lceil\frac{40-11}{13-11}\right\rceil=\left\lceil\frac{29}{2}\right\rceil=15
$$

For each cell, the size of the memory is $2^{p} \times q=2^{13} \times$ 11 (bits). Thus, the total amount of memory is $2^{13} \times$ $11 \times 15=1,351,680$ (bits).
2. When the function is realized by the hybrid method.
Since $k<2^{p}$, the assumption for the approximation in Theorem 5.1 is valid. From Theorem 5.1, we have

$$
\begin{aligned}
\delta & \simeq 1-\frac{1}{2}\left(\frac{k}{2^{p}}\right)+\frac{1}{6}\left(\frac{k}{2^{p}}\right)^{2} \\
& =1-\frac{1}{2}\left(\frac{1730}{2^{13}}\right)+\frac{1}{6}\left(\frac{1730}{2^{13}}\right)^{2} \simeq 0.901 .
\end{aligned}
$$

The hash memory has $p=13$ inputs and $q=11$ outputs. The auxiliary memory has $q=11$ inputs and $r=n-p=27$ outputs. The LUT cascade realizes the address generation function with weight $1730 \times(1-$ $0.901)=170$. In this case, each cell in the cascade has $\left\lceil\log _{2}(170+1)\right\rceil=8$ outputs. Let the number of inputs of cells be 10 , then the number of levels in the LUT cascade is

$$
\left\lceil\frac{n-q}{p-q}\right\rceil=\left\lceil\frac{40-8}{10-8}\right\rceil=\left\lceil\frac{32}{2}\right\rceil=16
$$

Note that the size of a cell except for the last stage is $2^{10} \times 8$ (bits). The size of the cell in the last stage is $2^{10} \times 11$ (bits). Thus, the total amount of memory for the cascade is $2^{10} \times 8 \times 15+2^{10} \times 11=134,144$ (bits). The size of the hash memory is $2^{13} \times 11=90,112$ (bits). The size of the auxiliary memory is $2^{11} \times 27=$ 55,296 (bits). Thus, the total amount of memory is 279,552 (bits), which is $20.7 \%$ of the total memory for the LUT cascade-only realization.

In this example, the hybrid method requires smaller amount of memory than the LUT cascade alone. (End of Example)

## 6 Experimental Results

### 6.1 Realization of English Word Lists

As for examples of address generators, we realized lists of frequently used English words. Here, we use three kinds of English word lists: List 1, List 2, and List 3. The number of letters in the each word is at most 13, but we only consider the first 8 letters. For the English words consisting of fewer than 8 letters, we append blanks to the end of words to make them 8-letter words. Each English alphabet letter is represented by 5 bits. Thus, each English word is represented by 40 bits. The number of words in the lists are 1730,3366 , and 4705 , respectively. In each word list, each English word has a unique index, or an integer from 1 to $k$, where $k=1730$ or 3360 or 4705 . The numbers of bits for these indices are 11,12 , and 13 , respectively.

Table 6.1: Realization of English word Lists.

|  | List 1 | List 2 | List 3 |
| :--- | ---: | ---: | ---: |
| \# of words: $k$ | 1730 | 3366 | 4705 |
| \# of inputs: $n$ | 40 | 40 | 40 |
| \# of outputs: $q$ | 11 | 12 | 13 |
| \# of inputs for the hash <br> function: $p$ | 13 | 14 | 15 |
| \# of columns with only one <br> non-zero element | 1389 | 2752 | 4000 |
| \# of columns with two or <br> more non-zero elements | 165 | 293 | 342 |
| \# of registered vectors not <br> realized by hash memory | 176 | 321 | 363 |

Table 6.2: Memory sizes for English word lists.

| Realization by an LUT Cascade alone |  |  |  |  |
| :--- | :---: | ---: | ---: | ---: |
|  |  | List 1 | List 2 | List 3 |
| \# of inputs | $n$ | 40 | 40 | 40 |
| \# of outputs | $q$ | 11 | 12 | 13 |
| \# of inputs of cells | $p$ | 13 | 14 | 15 |
| \# of levels | $s$ | 15 | 14 | 14 |
| Total amount of <br> memory (bits) | $s q 2^{p}$ | $1,351,680$ | $2,752,512$ | $5,963,776$ |


| Realization by the hybrid method |  |  |  |  |
| :--- | ---: | ---: | :--- | :---: |
|  | List 1 | List 2 | List 3 |  |
| Size of hash memory | $q 2^{p}$ | 90,112 | 196,608 | 425,984 |
| Size of auxiliary <br> memory | $r 2^{q}$ | 55,296 | 110,592 | 221,189 |
| Size of cascade |  | 134,144 | 301,056 | 304,104 |
| Total amount of <br> memory (bits) |  | 279,552 | 608,256 | 951,277 |

Next, generate the function realized by the hash memory. The number of inputs for the hash function is $\left\lceil\log _{2}(k+\right.$ 1) $]+2$.

List 1 consists of $k=1730$ words. The number of bits for the index is $q=\left\lceil\log _{2}(1+k)\right\rceil=\left\lceil\log _{2}(1+1730)\right\rceil=11$. The number of bound variables is $p=q+2=13$. The number of columns in the decomposition chart is $2^{p}=2^{13}=8192$. The number of columns that has only one non-zero element is 1389 . The number of columns that has two or more nonzero elements is 165 . The number of registered vectors that are not realized by the hash table is 176 . In other words, about $90 \%$ of the registered vectors are realized by the hash memory, and the remaining $10 \%$ of the registered vectors are realized by the LUT cascade. Table 6.1 shows the experimental results for three English word lists.

Table 6.2 compares memory sizes. It shows that the hybrid method requires much smaller amount of memory than the LUT cascade alone.

Table 6.3: Realization of address generation functions produced by pseudo-random numbers (average of 100 functions).

|  | Function 1 | Function 2 | Function 3 |
| :--- | ---: | ---: | ---: |
| \# of words: $k$ | 1730 | 3366 | 4705 |
| \# of inputs: $n$ | 40 | 40 | 40 |
| \# of outputs: $q$ | 11 | 12 | 13 |
| \# of inputs for the hash <br> function: $p$ | 13 | 14 | 15 |
| \# of columns with only one <br> non-zero element | 1398.4 | 2737.7 | 4075.1 |
| \# of columns with two or more <br> non-zero elements | 160.0 | 302.7 | 307.0 |
| \# of registered vectors not <br> realized by hash memory | 171.6 | 325.6 | 322.9 |
| \# of registered vectors not <br> realized by hash memory <br> (obtained by Theorem 5.1) | 169.8 | 322.1 | 321.6 |

### 6.2 Realization of Randomly Generated Functions

Next, we generated address generations functions with the same sizes by pseudo-random numbers. Table 6.3 shows the average of 100 randomly generated functions. In this case, the average number of columns with only one nonzero element is 1398.4, the average number of columns with two or more non-zero elements is 160.0 , and the average number of registered vectors not realized by the hash memory is 171.6 . We did similar experiments for List 2 and List 3.

These results show that the experimental results using real word lists and the randomly generated functions do not have much difference with the theoretical results obtained in Section 5. This shows that the hash function generated by Algorithm 4.1 effectively scatters the non-zero elements in the decomposition charts.

## 7 Conclusions and Comments

In this paper, we presented a method to realize an address generation function using a hash memory and an LUT cascade. It uses both disjoint and non-disjoint functional decompositions. The basic idea is as follows: First decompose the address generation function $f$ into two non-overlapping address generation functions: $f\left(X_{1}, X_{2}\right)=$ $\hat{f}_{1}\left(Y_{1}, X_{2}\right) \vee f_{2}\left(X_{1}, X_{2}\right)$. In this case, $\hat{f}_{1}\left(Y_{1}, X_{2}\right)$ is decomposed as $\hat{f}_{1}\left(Y_{1}, X_{2}\right)=g\left(\hat{h}\left(Y_{1}\right), X_{2}\right)$, and each column of the decomposition chart has at most one non-zero element. The function $g$ is realized by the comparator and the auxiliary memory. Since $f_{2}$ is also an address generation function having smaller weight than $f_{1}$, it can be implemented by
an LUT cascade, or further decomposed by using the same technique.
This method is only useful for the functions with $k \ll 2^{n}$. Unfortunately, for most of the MCNC benchmark functions, this method is not useful. In a typical MCNC benchmark function, the number of non-zero outputs $k$ is not so small compared with the total number of the input combinations $2^{n}$.
Recent FPGAs contain embedded memories in addition to the LUTs with 4 or 5 inputs. If we implement the address generation function by using only LUTs with 4 or 5 inputs, the resulting circuit will be too large to fit into the FPGA. On the other hand, the method presented in this paper uses the embedded memories in FPGAs, so we have a compact and fast circuit.

## Acknowledgments

This research is supported in part by the Grants in Aid for Scientific Research of JSPS, and the grant of Kitakyushu Innovative Cluster Project. Discussion with Prof. Jon T. Butler improved English presentation. Mr. M. Matsuura did experiments.

## References

[1] ALTERA, "Designing switches and routers with APPEX CAM," White Paper, Oct. 2000, Altera Corporation.
[2] H. A. Curtis, A New Approach to the Design of Switching Circuits, Van Nostrand, Princeton, N.J., 1962.
[3] S. Dharmapurikar, P. Krishnamurthy, and D. E. Taylor, "Longest prefix matching using Bloom filters," ACM SIGCOMM'03, August 25-29, 2003, Karlsruhe, Germany.
[4] T. Kohonen, Content-Addressable Memories, Springer Series in Information Sciences, Vol. 1, Springer Berlin Heidelberg 1987.
[5] V. N. Kravets and K. A. Sakallah. "Constructive libraryaware synthesis using symmetries", Proc. DATE '00, pp. 208-216, 2000.
[6] P-F. Lin and J. B. Kuo, "A 1-V 128-kb four-way set-associative CMOS cache memory using wordlineoriented tag-compare (WLOTC) structure with the content-addressable-memory (CAM) 10-transistor tag cell," IEEE Journal of Solid-State Circuits, Vol. 36, pp. 666-675, April 2001.
[7] J. C. Moran, "Memory patching circuit," US Patent 4028678.
[8] H. Qin and T. Sasao, "Design of address generators using multiple LUT cascade on FPGA," SASIMI 2006, April 2006, pp.146-152.
[9] T. Sasao, M. Matsuura, and Y. Iguchi, "A cascade realization of multiple-output function for reconfigurable hardware," International Workshop on Logic and Synthesis (IWLSO1), Lake Tahoe, CA, June 12-15, 2001, pp. 225-230.
[10] T. Sasao, Switching Theory for Logic Synthesis, Kluwer Academic Publishers, 1999.
[11] T. Sasao, Y. Iguchi, M. Matsuura, "LUT cascades and emulators for realizations of logic functions," RM2005, Tokyo, Japan, Sept. 5 - Sept. 6, 2005, pp.63-70.
[12] T. Sasao, "Design methods for multiple-valued input address generators,"(invited paper) International Symposium on Multiple-Valued Logic, Singapore, May 2006.
[13] T. Sasao and J. T. Butler, "Implementation of multiplevalued CAM functions by LUT cascades," International Symposium on Multiple-Valued Logic, Singapore, May 1720, 2006 (accepted).
[14] H. Vandierendonck and K. D. Bosschere, "XOR-Based hash functions," IEEE Transactions on Computers, Vol. 54, No. 7, July 2005, pp.800-812.

