# Accurate and High Performance Computing on the Cell processor

Stef Graillat

LIP6/PEQUAN - Université Pierre et Marie Curie (Paris 6)

Young Investigators Symposium, Oak Ridge National Laboratory, Tennessee, USA, October 13-15, 2008



Can you count up to 6 with your computer?



1.0000000000000000

2.00000000000011

2.999999997414701

20 times) 4.00000000629434

NaN

Inf

#### Ariane 5 rocket failure



Conversion of the horizontal speed from a 64-bit floating point number to a 16-bit signed integer $\rightarrow$ overflow!

- Vancouver Stock Exchange: a new index was introduced in 1982 with an initial value of 1000.00. It was recomputed after each transaction and then truncated to 3 digits (e.g. 556.56 → 556). 22 months later the index stood at 520, whereas the correct value was 1098.
- German election in Schleswig-Holstein, April 5th, 1992: printing the percentage of votes for the Green party with only one digit after the decimal point changed the result and gave the majority to the SPD in the parliament (4.97% was printed as 5.0% after rounding).

Floating point system $\mathbb{F} \subset \mathbb{R}$:

$$x = \pm \underbrace{x_0.x_1\dots x_{p-1}}_{mantissa} \times b^e, \quad 0 \le x_i \le b-1, \quad x_0 \ne 0$$

$b$: base, $p$: precision, $e$: exponent such that $e_{\min} \le e \le e_{\max}$

Machine epsilon  $\epsilon = b^{1-p}$ 

Approximation of $\mathbb{R}$ by $\mathbb{F}$, rounding $\mathsf{fl} : \mathbb{R} \to \mathbb{F}$. Let $x \in \mathbb{R}$, then

$\mathsf{fl}(x) = x(1 + \delta), \quad |\delta| \le \mathsf{u}.$

Unit roundoff $\mathsf{u} = \epsilon/2$ for round-to-nearest.

Let $x, y \in \mathbb{F}$, then

$\mathsf{fl}(x \circ y) = (x \circ y)(1 + \delta), \quad |\delta| \le \mathsf{u}, \quad \circ \in \{+, -, \cdot, /\}$
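The relation between machine epsilon and unit roundoff can be checked on an ordinary IEEE host. The sketch below assumes binary floating point with round-to-nearest (ties to even) as the active rounding mode; the helper names are illustrative only.

```c
#include <assert.h>
#include <float.h>

/* Unit roundoff u = eps / 2 for round-to-nearest, with eps = b^(1-p). */
static double unit_roundoff_double(void) { return DBL_EPSILON / 2.0; }
static float  unit_roundoff_float(void)  { return FLT_EPSILON / 2.0f; }

/* Returns 1 if fl(1 + d) == 1, i.e. d is lost when added to 1.
   The volatile store forces the sum to be rounded to double. */
static int absorbed_by_one(double d) {
    assert(d >= 0.0);
    volatile double s = 1.0 + d;
    return s == 1.0;
}
```

With these definitions, `absorbed_by_one(unit_roundoff_double())` is true (the tie rounds to the even neighbour 1.0), while `absorbed_by_one(DBL_EPSILON)` is false.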

#### IEEE 754 standard (1985)

| Type   | Size    | Mantissa  | Exponent | Unit roundoff                                       | Range                    |
|--------|---------|-----------|----------|-----------------------------------------------------|--------------------------|
| Single | 32 bits | 23+1 bits | 8 bits   | $\mathsf{u} = 2^{-24} \approx 5.96 \times 10^{-8}$  | $\approx 10^{\pm 38}$    |
| Double | 64 bits | 52+1 bits | 11 bits  | $\mathsf{u} = 2^{-53} \approx 1.11 \times 10^{-16}$ | $\approx 10^{\pm 308}$   |

## The Cell processor (1/2)



Single precision: > 200 GFLOPS, double precision: 15 GFLOPS, memory bandwidth: 25 GB/s, EIB (Element Interconnect Bus): 300 GB/s

## The Cell processor (2/2)



The PPE is based on the 2-way Power Architecture with:

- 32 KB of L1 cache for instructions
- 32 KB of L1 cache for data
- 512 KB of L2 cache

The PPE is fully pipelined for double precision computation and fully IEEE compliant.

The SPE is a small processor with a vector unit.

- small memory (256 KB) for instructions and data, named "local store" (LS)
- 128 registers of 128 bits
- 1 SPU "Synergistic Processing Unit"
  - 4 units for single precision computation
  - 1 unit for double precision computation
- MFC "Memory Flow Controller" which manages memory access through DMA

Each 128-bit register holds:

- 16 integers of 8 bits,
- 8 integers of 16 bits,
- 4 integers of 32 bits,
- 4 single precision floating point numbers, or
- 2 double precision floating point numbers.

The SIMD processor is based on FMA (fused multiply-add) and is fully pipelined in single precision. Peak performance in SP per SPE (4 lanes, 2 flops per FMA, 3.2 GHz): $4 \times 2 \times 3.2 = 25.6$ GFLOPS

It is not fully pipelined in double precision.

Peak performance in DP per SPE: $2 \times 2 \times 3.2/7 = 1.8$ GFLOPS

#### 3 levels of parallelism

1. Processes run on Cell processors and exchange data through an MPI library
2. Data distribution and communication between PPE and SPE:
   - ALF (Accelerated Library Framework)
   - mailboxes
   - exchange through DMA
   - data need to be aligned on quadwords (16 bytes)
   - double buffering technique
3. On an SPE:
   - only 256 KB
   - Altivec programming
   - code and data dependencies: avoid breaking the SIMD pipeline
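The double buffering and quadword alignment items above can be sketched in portable C. This only illustrates the control flow: `memcpy` stands in for an asynchronous DMA transfer, and `fetch_chunk`, `sum_double_buffered`, and `CHUNK` are hypothetical names that do not belong to any Cell SDK. On a real SPE the fetch would be a non-blocking DMA request, and the code would wait for its completion before reading the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 4  /* floats per transfer: 16 bytes, i.e. one quadword */

/* Stand-in for an asynchronous DMA "get" into the local store (assumption). */
static void fetch_chunk(float *dst, const float *src, size_t n) {
    memcpy(dst, src, n * sizeof *dst);
}

/* Sums n floats (n must be a multiple of CHUNK) using two alternating buffers. */
static float sum_double_buffered(const float *main_mem, size_t n) {
    _Alignas(16) float buf[2][CHUNK];  /* quadword-aligned buffers */
    size_t nchunks = n / CHUNK;
    float total = 0.0f;
    int cur = 0;

    assert(n % CHUNK == 0);
    if (nchunks == 0) return 0.0f;

    fetch_chunk(buf[cur], main_mem, CHUNK);        /* prime the first buffer */
    for (size_t c = 0; c < nchunks; ++c) {
        int next = cur ^ 1;
        if (c + 1 < nchunks)                       /* request chunk c+1 ... */
            fetch_chunk(buf[next], main_mem + (c + 1) * CHUNK, CHUNK);
        for (size_t i = 0; i < CHUNK; ++i)         /* ... while working on chunk c */
            total += buf[cur][i];
        cur = next;
    }
    return total;
}
```

The point of the two buffers is that the transfer of the next chunk can overlap with the computation on the current one, which hides memory latency when the transfer really is asynchronous.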

No division instruction.

$1/x$ and $1/\sqrt{x}$: only the first 12 bits are exact.
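A common way to recover accuracy from a low-precision reciprocal estimate is Newton-Raphson refinement; whether the libraries presented here use exactly this scheme is not stated in the talk. The sketch below runs on an ordinary host and emulates a 12-bit estimate by truncating the fraction of `1.0f / x`; `recip_estimate12`, `recip_refined`, and `div_via_recip` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Emulates a hardware estimate of 1/x that is good to about 12 bits
   by clearing the low 11 of the 23 fraction bits (assumption: IEEE binary32). */
static float recip_estimate12(float x) {
    float y = 1.0f / x;
    uint32_t bits;
    memcpy(&bits, &y, sizeof bits);
    bits &= 0xFFFFF800u;  /* keep sign, exponent and top 12 fraction bits */
    memcpy(&y, &bits, sizeof y);
    return y;
}

/* One Newton-Raphson step for f(y) = 1/y - x: y1 = y0 * (2 - x * y0).
   If y0 = (1/x)(1 - e), then y1 = (1/x)(1 - e^2) up to rounding. */
static float recip_refined(float x) {
    assert(x != 0.0f);
    float y0 = recip_estimate12(x);
    return y0 * (2.0f - x * y0);
}

/* Division expressed as multiplication by the refined reciprocal. */
static float div_via_recip(float a, float b) {
    return a * recip_refined(b);
}
```

Each step roughly doubles the number of correct bits, so one step takes a 12-bit estimate close to full single precision. The result is not guaranteed to be correctly rounded.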

SPU float arithmetic is not IEEE compliant:

- The only rounding mode is round toward zero (truncation).
- The highest exponent (128) is not used for Infinity or NaN but to extend the range of floating point numbers.
- Inf and NaN are not recognized by arithmetic operations.
- Overflow results saturate to the largest representable positive or negative values rather than producing IEEE ±Infinity.
- No denormalized results: +0 is returned instead.

SPU double arithmetic is IEEE compliant except:

- FP trapping is not supported.
- Denormalized operands are treated as 0.
- NaN results are always the default QNaN (Quiet NaN).

#### Definition 1

An extended precision number of length $n$ is an unevaluated sum of $n$ floating point numbers $x = x_1 + x_2 + \cdots + x_n$.

Precision used on the Cell processor: single precision

- $n = 2$: single-single
- $n = 4$: quad-single

$\Rightarrow$ achieve 64-bit and 128-bit precision while working with the fast single precision SIMD units.
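A minimal sketch of single-single addition, built on the classical TwoSum and FastTwoSum error-free transformations. It assumes round-to-nearest on the host and no overflow; since the SPU single precision unit only truncates, the transformations used on the Cell may need adapting. The type `ssingle` and all function names are illustrative and not taken from the talk.

```c
/* value = hi + lo, stored as an unevaluated sum of two floats */
typedef struct { float hi, lo; } ssingle;

/* TwoSum (Knuth): s = fl(a + b) and e such that a + b = s + e exactly,
   assuming round-to-nearest and no overflow. */
static ssingle two_sum(float a, float b) {
    float s = a + b;
    float bb = s - a;
    float e = (a - (s - bb)) + (b - bb);
    return (ssingle){ s, e };
}

/* FastTwoSum (Dekker): same result with fewer operations when |a| >= |b|. */
static ssingle fast_two_sum(float a, float b) {
    float s = a + b;
    float e = b - (s - a);
    return (ssingle){ s, e };
}

static ssingle ss_from_float(float v) { return (ssingle){ v, 0.0f }; }

/* Single-single addition: add the high parts exactly, fold in the low
   parts, then renormalize so that lo is small compared with hi. */
static ssingle ss_add(ssingle x, ssingle y) {
    ssingle t = two_sum(x.hi, y.hi);
    t.lo += x.lo + y.lo;
    return fast_two_sum(t.hi, t.lo);
}
```

For example, adding $2^{-30}$ to 1 a total of 1024 times leaves a plain `float` accumulator at 1.0, whereas the `ssingle` accumulator ends at $1 + 2^{-20}$. The code must be compiled without value-changing optimizations such as `-ffast-math`, which would cancel the error terms algebraically.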

### Reliable Computing on the Cell processor

Intervals

$$\boldsymbol{x} = [\underline{x}; \overline{x}] = \{ x \in \mathbb{R} : \underline{x} \le x \le \overline{x} \}.$$

Given 2 intervals  $\pmb{x}$ ,  $\pmb{y}$  and  $\diamond \in \{+, -, \times, /\}$ , one can define

$$\boldsymbol{x} \diamond \boldsymbol{y} = \{ x \diamond y : x \in \boldsymbol{x}, y \in \boldsymbol{y} \}.$$

This can be computed as

$$\begin{array}{lll} \boldsymbol{x} + \boldsymbol{y} &= & [\underline{x} + \underline{y}; \overline{x} + \overline{y}], \\ \boldsymbol{x} \times \boldsymbol{y} &= & [\min\{\underline{x}\underline{y}, \underline{x}\overline{y}, \overline{x}\underline{y}, \overline{x}\overline{y}\}; \max\{\underline{x}\underline{y}, \underline{x}\overline{y}, \overline{x}\underline{y}, \overline{x}\overline{y}\}]. \end{array}$$

In floating point arithmetic, there are rounding errors!

$$\begin{array}{lll} \boldsymbol{x} + \boldsymbol{y} &= & [\nabla(\underline{x} + \underline{y}); \Delta(\overline{x} + \overline{y})] \supseteq \{x + y : x \in \boldsymbol{x}, y \in \boldsymbol{y}\} \\ \boldsymbol{x} - \boldsymbol{y} &= & [\nabla(\underline{x} - \overline{y}); \Delta(\overline{x} - \underline{y})] \supseteq \{x - y : x \in \boldsymbol{x}, y \in \boldsymbol{y}\} \end{array}$$

where $\nabla$ (resp. $\Delta$) represents rounding toward $-\infty$ (resp. rounding toward $+\infty$).
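When directed rounding modes are unavailable or expensive to switch, one simple substitute is to compute in round-to-nearest and then step one floating point number outward. This is not the $\nabla$ / $\Delta$ rounding defined above but a swapped-in technique: the result still encloses the exact set, although it can be slightly wider than with true directed rounding. The sketch assumes IEEE binary64 doubles and finite bounds; `interval`, `next_up`, `next_down`, `iv_add`, and `iv_sub` are hypothetical names.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>
#include <string.h>

typedef struct { double lo, hi; } interval;

/* Smallest double strictly greater than x (x itself for +inf or NaN). */
static double next_up(double x) {
    uint64_t bits;
    if (!(x <= DBL_MAX)) return x;       /* +inf or NaN: nothing above */
    if (x == 0.0) return 0x1p-1074;      /* smallest positive subnormal */
    memcpy(&bits, &x, sizeof bits);
    if (x > 0.0) bits++;                 /* magnitude grows */
    else bits--;                         /* magnitude shrinks */
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Largest double strictly less than x, by symmetry. */
static double next_down(double x) { return -next_up(-x); }

/* Outward-widened enclosure of x + y. */
static interval iv_add(interval x, interval y) {
    assert(x.lo <= x.hi && y.lo <= y.hi);
    return (interval){ next_down(x.lo + y.lo), next_up(x.hi + y.hi) };
}

/* Outward-widened enclosure of x - y: lower bound uses y.hi, upper uses y.lo. */
static interval iv_sub(interval x, interval y) {
    assert(x.lo <= x.hi && y.lo <= y.hi);
    return (interval){ next_down(x.lo - y.hi), next_up(x.hi - y.lo) };
}
```

The enclosure property follows because the round-to-nearest result is one of the two floating point neighbours of the exact value, so moving one step down (resp. up) always yields a valid lower (resp. upper) bound.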

We are looking for

- people needing accurate and reliable high performance computing for real-life applications
- people with high performance computing skills to improve the performance of our libraries

## Thank you for your attention