## Consiglio Nazionale delle Ricezche # ISTITUTO DI ELABORAZIONE DELLA INFORMAZIONE PISA ### POLYNOMIAL EVALUATION AND INTERPOLATION IN VLSI B.CODENOTTI,G.LOTTI,L.MATROPIETRO Nota interna B 4-14 Maggio 1987 #### POLYNOMIAL EVALUATION AND INTERPOLATION IN VLSI B. CODENOTTI1, G.LOTTI2 and L.MASTROPIETRO3 \*Istituto di Elaborazione dell'Informazione - CNR, via S.Maria 46, PISA, ITALY. Dipartimento di Scienze dell'Informazione - University of Pisa, Corso Italia 40, PISA, ITALY. 3Geomath, Via Cavour 43, PISA, ITALY. #### ABSTRACT The area-tima complexity of polynomial interpolation, and of the evaluation of a polynomial at given points is studied. We will show VLSI designs producing bounds close to the lower bounds also derived. KEY WORDS. Area-Time Complexity, VLSI, Interpolation, Polynomial Evaluation. #### 1. INTRODUCTION In this paper, the problem of finding a polynomial $$P(x) = \sum_{i=0}^{n} a_{i} x$$ of degree less or equal than n, The coefficients of the interpolating polynomial are the solution of the linear system $V = \underline{f}$ , where V is the $(n+1) \times (n+1)$ Vandermonde matrix $V = \{v_{ij}\}$ , $v_{ij} = x_{ij}$ , $v_{ij} = x_{ij}$ , and $$\underline{\mathbf{f}} = [\mathbf{f} \quad \mathbf{f} \quad \dots \quad \mathbf{f} \quad ]^{\mathrm{T}}$$ . It comes out that the interpolation problem is equivalent to the solution of a particular linear system, with vandermonde coefficient matrix. In the next section, the question of lower bounds is addressed. In section 3, a VLSI design for the computation of the value of the interpolating polynomial at a given point with $\frac{2}{4}$ AT = O(n) is obtained; another VLSI network producing the same upper bound, is shown for the computation of the coefficients of the required polynomial. Moreover, the VLSI implementation of Estrin algorithm [3] for the evaluation of a given polynomial at one point is described, with $AT^2 = O(n \log^3 n)$ . The evaluation of a given polynomial at n points is also considered; this problem is equivalent to the computation of the product of a Vandermonde matrix by a vector. A VLSI design derived from Estrin algorithm yields the AT2 upper bound $O(n^2 \log^3 n)$ , while a VLSI design for generic matrix-vector product, e.g. mesh of trees, produces $AT^2 = O(n^2 \log^4 n)$ [4]. #### 2. Area-Time lower bounds In this section we will prove a lower bound to the problem of evaluating a polynomial of degree less or equal than n at a given point, with respect to the areax(time) 2 measure. Let a(i), i=1,...,n be the coefficients of the given polynomial. As usual in numerical analysis, we will assume that a fixed number of digits are used to represent such coefficients. The evaluation of the polynomial is not easier than the computation of the sum of the coefficients a(i), i=1,...,n. For the latter problem it is possible to prove the lower bound AT2= $\Omega(n\log^2 n)$ . Indeed, we will use the fact that the lower bound $AT^2=\mathcal{L}(n\log^2 n)$ holds for the "counter" problem, i.e. the problem of counting the number of occurrences of 1-s in a binary string of length n. This result easily follows by applying the technique used by Johnson for integer addition. Such technique is based on the input-output functional dependence, which holds for the counter problem as well. In the special case of one digit used to represent the coefficients of the polynomial, we have that the problem of computing the sum of such coefficients is equivalent to the counter problem. Therefore the lower bound $AT^2 = \mathcal{Q}(n\log^2 n)$ holds for polynomial evaluation, as well. In the next section, we will show a design which produces an upper bound to $AT^2$ , optimal up to the exponent of the logarithm. For what concerns the interpolation problem, it is equivalent to the solution of a special linear system, with Vandermonde coefficient matrix. The upper bound derived in the next section attains the lower bound holding for general linear systems solution, and it is characterized by a simple and modular structure. 3. VLSI designs for polynomial interpolation and evaluation Let $\prod$ denote the set of complex or real polynomials of degree less or equal to n. Given n+1 arbitrary points (x, f), i = 0, 1, ..., n, the problem of finding the polynomial $P(x) \in \prod_{n}$ such that $$P(x) = f, i=0, 1, ..., n,$$ i.e. the interpolating polynomial, is considered. In this section, two well known algorithms for the computation of the interpolating polynomial are taken into account, namely Neville algorithm [5], which is particularly suitable to find the value of the interpolating polynomial at a given point, and Newton formula [5], which can be efficiently used to compute the coefficients of the required polynomial. . The basic step of Neville algorithm is described by the following recurrent formula: $$P(x) = f,$$ (2.1) $$P = (x-x) P = (x) - (x-x) P = (x)$$ $$0 1 k k k 0 k-1$$ $$x - x$$ $$i i \\ k 0$$ where P (x) is the required value of the polynomial. 0 1 ... n In fig. 1, a VLSI design is presented, in the case n = 3; processor [i i ... i ] receives as input the values P (x) 0 1 k i ... i 1 k and P (x), computed respectively by processors [i ...i ] i ...i 0 k-1 and [i --- i ], together with the values x, x , x and computes 0 k-1 i i 0 k the value P (x), according to formula (2.1). Observe that processors denoted by dashed squares in fig.1, send x (x ) to the adjacent processor, while 0 n each other processor [i ...i] send x (x ) to processor 0 k i i k 0 $$\begin{bmatrix} i & -1 & i & ---i \\ 0 & 0 & k \end{bmatrix}$$ ( $\begin{bmatrix} i & ---i & i \\ 0 & k & k+1 \end{bmatrix}$ ). At the end of the computation, processor [ 0 1 $\dots$ n ], holds the result. If each processor is supposed to have two storage cells, the values x ,...,x can be sent to the proper processor 0 n Neville algorithm is not practical to compute the coefficients of the interpolating polynomial. For this purpose, it is convenient to use Newton formula, that is The required coefficient a , corresponding to the k-th power k of x, results to be $a = f(x, x, \dots, x)$ , $k = 0,1,\dots,n$ . The design for this algorithm is very similar to that one shown in fig.1, a part from the different computation performed by each processor. Moreover, the required coefficients are available, at different stages of the computation, in processors [0], [0 1], ..., [0 1 ... n]. In both cases, it is convenient to rearrange the networks, in order to obtain the upper diagonal part of an nxn mesh of processors (see fig. 2). It readily follows that $A = O(n^2)$ and T = O(n). Note that the computation proceeds by diagonals; therefore the circuit is very suitable for a pipelined implementation. The output period P of the network, defined as the maximum time between two successive data passages at any output port when the circuit is used in a pipelined fashion, is P = O(1); then $AP^2 = O(n^2)$ and, moreover, the complexity $AT^2 = O(n^4)$ is obtained for the computation either of the interpolating polynomial at n given points or of n interpolating polynomials at one point. #### REMARK Note that if x = w , i = 1,2,...,n, w being a n-th principal i root of the unity, then the Vandermonde matrix associated to the interpolation problem becomes the Fourier matrix [2]; for this particular interpolation problem (trigonometric interpolation) the well known optimal VLSI designs for DFT [1] can be used. We turn now to the problem of evaluating a given polynomial at given points. Let a , a ,..., a be the coefficients of a polinomial 0 1 $\,$ n $$P \in \overline{\prod}$$ The following problems are considered: - 3.1 evaluate P at a given point x, - 3.2 evaluate P at n different points. Note that problem 3.1 is not easier than evaluating the expression $$\sum_{i=0}^{n} a_{i}$$ for which the twindak lower bound AT = s(n (ogn)) has beenproved, in section 2. In the following, some VLSI designs are derived, which are optimal up to logarithmic factors. Problem 3.1 can be solved by Estrin algorithm [3], which computes P(x) as (3.3) $$P(x) = Q(x) x + R(x).$$ k Let n = 2; then a recursive VLSI design can be readily obtained from (3.3) (see fig.3). Let us consider the network performing the n-th power of x, in time T (n) = 0 (log n) and area A (n) = L (n) H (n), where 1 1 1 1 the width L (n) and the height H (n) can be chosen respectively 1 of order $O(\log n)$ and O(1). Then, the complexity analysis of the network showed in fig.3, produces the following bounds: $$T(n) = \max \{ T_{n/2}, T_{n/2} \} = O(\log n),$$ $$L(n) = \max \{ L_{n/2}, L_{n/2} \} + O(1) = O(\log n),$$ $$H(n) = 2 H(n/2) + H_{n/2} = O(n).$$ Hence, the areax(time) complexity is $$O(n \log n).$$ Problem (3.1) can also be solved by Dorn algorithm [3], which computes hich computes $$\begin{pmatrix} k & k & k & (k-1)k \\ q(x) = a + a x + \cdots + a x \\ 0 & k & (k-1)k \end{pmatrix}$$ $$q(x) = a + a x + \cdots + a x \\ 1 & 1 & k+1 & (k-1)k+1 \end{pmatrix}$$ $$q(x) = a + a x + \cdots + a x \\ k & (k-1)k \\ q(x) = a + a x + \cdots + a x \\ k-1 & 2k-1 & n$$ and then $$P(x) = q_{0}(x) + x q_{1}(x) + ... + x q_{k-1}(x).$$ Let n=k; then a straightforward VLSI implementation of this algorithm can be obtained according to the following three stages: 0. Compute $$x, x, \dots, x$$ and $x, x, \dots, x$ 1. Compute the matrix-vector product 2. Compute P(x) as a scalar product. The area-time complexity of this design would be of the same order of the previous one up to logarithmic factors, if the input and the output nodes of the module performing matrix-vector product are not required to lie on the border of the circuit. Let us consider the evaluation of a given polynomial at n points; it is easy to see that this problem is equivalent to the computation of the product of a Vandermonde matrix by a vector. For this problem, the lower bound $\Omega$ (n<sup>2</sup>) holds, since the Fourier matrix, for which this lower bound has been proved [6], is a special Vandermonde matrix. A VLSI design for generic matrix-vector product, using the structure of mesh of trees [4], produces $AT^2 = O(n^2 \log^4 n)$ for this problem. #### REFERENCES - [1] G. Bilardi and M. Sarrafzadeh, Optimal Discrete Fourier Transform in VLSI, International Workshop on Parallel Computing and VLSI, Amalfi, 1984. - [2] P.Davis, Circulant Matrices, Wiley Interscience, New Jork, 1979. - [3] D.E. Knuth, The Art of Computer Programming, Vol.2, Seminumerical algorithms, Addison-Wesley, 1969. - [4] F. T. Leighton, New Lower Bound Tecniques for VLSI, Proceedings of the 22nd Annual IEEE Symposium on Foundations of Computer Science, 1981, pp. 1-12. - [5] J. Stoer, Einfuhrung in die Numerische Mathematik, alysis Springer Verlag, Berlin, 1973. - [6] C.D. Thompson, Area-Time Complexity for VLSI, Proc. 11th Annual ACM Symp. on the Theory of Computing (SIGACT), 1979, pp.81-88. #### FIGURE LEGENDS: Fig. 1 Module structure. Fig. 2 Mesh of PEs. Fig. 3. Recursive VLSI design. Fig. 1 Module structure. Fig. 3. Recursive VLSI design.