Why is the Cell architecture fast, if at all?

Technical discussion on the newly released and hard to find PS3.

Moderators: cheriff, emoon

Neels
Posts: 5
Joined: Tue Dec 19, 2006 7:08 am

Why is the Cell architecture fast, if at all?

Post by Neels »

Hi,

I did some basic benchmarking on the PS3 against a 2.8 GHz P4, on both the PPU and the SPU, and, well, my P4 beats the PS3 in every test. Only with 7 or more SPU threads does the PS3 beat my P4, and then only by a small margin. My 3.4 GHz Pentium D would beat the PS3 again.

Some tests:

Code:

Calculating prime numbers up to 200000:

Pentium 4 2.8 GHz:    13234 ms
Pentium 4 3.4 GHz:     9828 ms
Playstation3 PPU:     25247 ms
Playstation3 SPU:     33264 ms

Simple loop, 10 million iterations:

- 1 integer add
- 1 if
- 2 float adds
- 1 float mul
- 1 float div

Pentium 4 2.8 GHz, 1 thread:    946 ms
PS3 SPE, 1 thread:             4019 ms
PS3 SPE, 7 threads:            1147 ms
PS3 SPE, 16 threads:            772 ms

Calculating Mandelbrot fractal graphics:

Pentium 4 2.8 GHz:    11103 ms
PS3 SPE, 1 thread:    16137 ms
PS3 PPU:              40290 ms
What are good things to test on a PS3? I tested the vector types a bit, but a multiplication of vec_float4 takes four times as long as a multiplication of plain floats, so why should I use the PS3 vector types?
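One thing worth checking in a comparison like this is whether both loops do the same amount of work: a 4-wide multiply produces four results per operation, so it should be compared per element, not per multiply. A minimal sketch of the idea, using the GCC/Clang vector extension as a portable stand-in for vec_float4 (the names float4, mul_scalar and mul_vec4 are made up for illustration and are not from the SPU headers):

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a 4-float vector type: four floats in one 16-byte vector.
   This is the GCC/Clang vector extension, not the SPU intrinsics. */
typedef float float4 __attribute__((vector_size(16)));

/* Scalar version: one multiply per element, n loop iterations. */
void mul_scalar(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * b[i];
}

/* Vector version: each multiply handles four elements at once,
   so the loop runs only n / 4 times. n must be a multiple of 4. */
void mul_vec4(const float4 *a, const float4 *b, float4 *out, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n / 4; i++)
        out[i] = a[i] * b[i];
}
```

When timing the two, both should be given the same number of elements n; if the vector loop is run n times instead of n / 4 times, it does four times the work.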

Nils
Neels
Posts: 5
Joined: Tue Dec 19, 2006 7:08 am

Post by Neels »

Meanwhile, I did some memory tests which are quite interesting to look at:

Code:

memcpy 10 GB (10240 times 1 MB)

P4 2.8 GHz DDR1:       12024 ms
P4 3.4 GHz DDR2:        6234 ms
PS3 PPU:                5120 ms
PS3 SPU:                2736 ms
PS3 SPU, 7 threads:      781 ms
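
These timings are easier to compare as throughput. A small sketch of the conversion, assuming each figure is the time to copy the full 10240 MB (the helper name throughput_mb_per_s is made up for illustration):

```c
#include <assert.h>

/* Convert "megabytes copied" and "elapsed milliseconds" into MB per second. */
double throughput_mb_per_s(double megabytes, double elapsed_ms)
{
    assert(elapsed_ms > 0.0);
    return megabytes / (elapsed_ms / 1000.0);
}
```

For example, throughput_mb_per_s(10240.0, 5120.0) gives 2000 MB/s for the PPU figure. Whether the 7-thread run copied 10 GB in total or 10 GB per thread is not stated, so that number is harder to interpret.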

Last edited by Neels on Wed Dec 20, 2006 10:57 pm, edited 1 time in total.
ooPo
Site Admin
Posts: 2023
Joined: Sat Jan 17, 2004 9:56 am
Location: Canada

Post by ooPo »

Keep in mind that having a videogame console that has even that much raw computing power is quite an achievement on its own. Typically more effort is put into the graphics hardware and the cpu just limps along.

Perhaps you're seeing the lack of maturity in the PS3 toolchain, or maybe code could be better optimized to work around some of the quirks like the lack of branch prediction. Still, that memcpy speed is interesting. Keep at it - I'm interested in why it is performing like that.
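
On the branch prediction point, one common workaround is to rewrite a data-dependent "if" in a hot loop as arithmetic. A minimal sketch of that technique, not taken from the benchmark above (the function names are made up, and inputs are assumed to be finite):

```c
#include <assert.h>
#include <stddef.h>

/* Branchy version: one data-dependent conditional per element. */
float sum_positive_branchy(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (x[i] > 0.0f)
            sum += x[i];
    }
    return sum;
}

/* Branch-free version: the comparison yields 0 or 1, which is used as a
   multiplier, so the source has no data-dependent jump in the loop body.
   Assumes finite inputs (no NaN or infinity). */
float sum_positive_branchless(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * (float)(x[i] > 0.0f);
    return sum;
}
```

Whether the compiler actually emits a branch for either version depends on the compiler and flags, so it is worth checking the generated assembly before drawing conclusions.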
Shine
Posts: 728
Joined: Fri Dec 03, 2004 12:10 pm
Location: Germany

Re: Why is the Cell architecture fast, if at all?

Post by Shine »

Neels wrote: I did some basic benchmarking on the PS3 against a 2.8 GHz P4, on both the PPU and the SPU, and, well, my P4 beats the PS3 in every test. Only with 7 or more SPU threads does the PS3 beat my P4, and then only by a small margin. My 3.4 GHz Pentium D would beat the PS3 again.
Could you post the benchmarking code and some info about the compiler, the P4's OS and the compiler settings you used? Did you specify "-O2" on the PS3, and did you use a 64-bit build of GCC with a version newer than 4?
soks
Posts: 100
Joined: Tue May 25, 2004 1:25 am
Location: Chicago, IL

Post by soks »

Mind you, although GCC will compile for the Cell, it is FAR from optimized, and the IBM compiler will show much better numbers.

At least that's what I learned from the Cell online courses.
rapso
Posts: 140
Joined: Mon Mar 28, 2005 6:35 am

Re: Why is the Cell architecture fast, if at all?

Post by rapso »

Neels wrote:

Code:

PS3 SPE, 1 thread:             4019 ms
PS3 SPE, 7 threads:            1147 ms
PS3 SPE, 16 threads:            772 ms
If you still get a speedup when using more threads than there are SPUs available, then there is another bottleneck besides raw computation power. Try some simpler tests, like a few vector multiplications.
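
The quoted timings can be turned into speedup and parallel efficiency to make this visible. A small sketch (the helper names speedup and efficiency are made up for illustration):

```c
#include <assert.h>

/* Speedup of a parallel run relative to the single-thread run. */
double speedup(double t1_ms, double tn_ms)
{
    assert(tn_ms > 0.0);
    return t1_ms / tn_ms;
}

/* Parallel efficiency: speedup divided by the number of threads used.
   1.0 would mean perfect scaling. */
double efficiency(double t1_ms, double tn_ms, int threads)
{
    assert(threads > 0);
    return speedup(t1_ms, tn_ms) / threads;
}
```

With the numbers above, speedup(4019, 1147) is about 3.5 for 7 threads (an efficiency of roughly 0.5), and speedup(4019, 772) is about 5.2 for 16 threads.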
Neels
Posts: 5
Joined: Tue Dec 19, 2006 7:08 am

Post by Neels »

Here is the memcpy benchmark code:

Code:

#include <stdlib.h>
#include <string.h>

#define MEMCPY_SIZE	1048576		// 1MB

unsigned int dobenchmark_memcpy()
{
	unsigned char* pDataSrc = (unsigned char*)malloc(sizeof(unsigned char) * MEMCPY_SIZE);
	unsigned char* pDataDst = (unsigned char*)malloc(sizeof(unsigned char) * MEMCPY_SIZE);

	unsigned int i=0;

	// fill the source buffer with a repeating 0..255 pattern
	for( i=0; i<MEMCPY_SIZE; i++ )
		pDataSrc[i] = (unsigned char)i;

	for( i=0; i<1024*10; i++ )	// 10GB
	{
		// fortunately, this loop isn't optimized away, so just memcpy here...
		memcpy( pDataDst, pDataSrc, sizeof(unsigned char) * MEMCPY_SIZE );
	}

	// sum the destination so the copied data is actually used
	unsigned int result = pDataDst[0];
	for( i=1; i<MEMCPY_SIZE; i++ )
	{
		result += pDataDst[i];
	}

	free(pDataSrc);
	free(pDataDst);
	return result;
}
The whole function is timed, but with only a single memcpy it takes just 6 ms on my P4, so the overhead doesn't matter.
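
If the setup cost ever did matter, the timer could be placed around the copy loop only. A minimal sketch of that variant, assuming clock() from <time.h> is an acceptable timer (it reports processor time, not wall-clock time); the function name time_memcpy_ms and its parameters are made up for illustration:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Times only the memcpy loop; allocation, fill and checksum happen
   outside the timed region. Returns the elapsed processor time in ms,
   or -1.0 if allocation fails. The checksum is written to *checksum
   so the copied data is actually used. */
double time_memcpy_ms(size_t size, unsigned int iterations, unsigned int *checksum)
{
    assert(checksum != NULL);
    unsigned char *src = malloc(size);
    unsigned char *dst = malloc(size);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }

    for (size_t i = 0; i < size; i++)
        src[i] = (unsigned char)i;

    clock_t start = clock();
    for (unsigned int i = 0; i < iterations; i++)
        memcpy(dst, src, size);
    clock_t stop = clock();

    unsigned int sum = 0;
    for (size_t i = 0; i < size; i++)
        sum += dst[i];
    *checksum = sum;

    free(src);
    free(dst);
    return 1000.0 * (double)(stop - start) / CLOCKS_PER_SEC;
}
```

A call like time_memcpy_ms(1048576, 10240, &sum) would correspond to the 10 GB run. As with the original, an optimizing compiler is in principle free to drop repeated identical copies, so the result should be sanity-checked.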

The P4 is the following machine:

Intel Pentium 4 521, 2.8 GHz, 1 MB cache
RAM: no-name DDR1, 2 GB
Mainboard: ASRock 775i65GV
OS: Windows XP Professional SP2
Benchmark code compiled with Intel C++ Compiler 8.0
misfire
Posts: 110
Joined: Mon Sep 06, 2004 7:53 am
Location: Germany

Post by misfire »

The German c't magazine has recently tested the PS3's processor capabilities.

The PS3 reached a SPECint_2000 base score of 400, which is comparable to an Athlon at 1.33 GHz. They also wrote that neither the Cell nor the Xbox 360 processor is any good at single-threaded applications.

The Cell only excels at processing when using the SPUs, which are optimised for calculating single-precision floating-point numbers. With 6 active SPUs, the PS3 is twice as fast as a Core 2 Duo 6400. (Then again, the Core 2 easily beats the PS3 when it comes to double-precision.)

This might be an explanation for your test results, Neels.
Saotome
Posts: 182
Joined: Sat Apr 03, 2004 3:45 am

Post by Saotome »

@Neels:
You might find this thread on Beyond3D interesting, especially the posts from "inefficient". The people in that thread are also trying to benchmark the PS3 / PPU / SPEs, but they are already on page 9, so you might pick up some optimization ideas that they have already worked out ;)

Btw, I don't understand how you're using memcpy on the SPE: what memory are you copying, and to where? Is it LS (Local Store) -> LS, XDR -> XDR, etc.?
What confuses me the most is that the SPEs have 256 kB of memory, but you're copying 1 MB at once.
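
For reference, a buffer larger than the local memory would have to be streamed through it in pieces. A minimal portable sketch of that pattern, where plain memcpy calls stand in for the DMA transfers a real SPE program would issue; the 16 KB chunk size and the name copy_in_chunks are assumptions for illustration only:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE 16384  /* assumed chunk size: 16 KB, well below 256 kB */

/* Copies n bytes from src to dst through a small staging buffer.
   The staging buffer plays the role of the local store; the two memcpy
   calls stand in for "main memory -> local store" and
   "local store -> main memory" transfers. */
void copy_in_chunks(unsigned char *dst, const unsigned char *src, size_t n)
{
    static unsigned char staging[CHUNK_SIZE];
    size_t done = 0;

    assert((dst != NULL && src != NULL) || n == 0);
    while (done < n) {
        size_t len = n - done;
        if (len > CHUNK_SIZE)
            len = CHUNK_SIZE;
        memcpy(staging, src + done, len);  /* load one chunk */
        memcpy(dst + done, staging, len);  /* store it back out */
        done += len;
    }
}
```

A 1 MB copy done this way would take 64 chunk round trips, which is a different workload from a single memcpy between two buffers in the same memory.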
infj