two for loops very low frame rate

Glas · Post by **Glas** » Mon May 12, 2008 8:27 am

Hi.

The following code segment, which is the whole main loop, is very slow.
It has just two for loops, which probably account for most of the time.
Please have a quick look at the code.

Code: Select all

...
while( !bDone )
	{
		/////////////////////////////////////////////
		// Handle input
		while( SDL_PollEvent( &event ) )
		{
			switch( event.type )
			{
			case SDL_KEYDOWN:
				switch( event.key.keysym.sym )
				{
				case SDLK_ESCAPE:
					bDone = true ;
					break ;
				case SDLK_p:
					{
					char sz[25] ;
					sprintf( sz, "/tmp/screenshot\0" ) ;
					int i = SDL_SaveBMP(screen, sz) ;
					i=0 ;
					}
					break ;
				} // switch sym
				break ;
			case SDL_MOUSEMOTION:
				x = event.motion.x ;
				y = event.motion.y ;
				break ;
			case SDL_MOUSEBUTTONDOWN:
				break ;
			} // switch type
		} // poll event

		///////////////////////////////
		// start timing
		runtime = clock() ;
		// lets do something
		SDL_LockSurface( screen ) ;
		unsigned int *p = ((unsigned int*)(screen->pixels)) ;

		for( int j=0; j< SCREEN_HEIGHT; j++ )
		{
			for( int i=0; i< SCREEN_WIDTH; i++ ){

				vector float color = _load_vec_float4(1.0f,1.0f,0.0f,0.0f) ;

				// put the color to the screen
				p[j*SCREEN_WIDTH+i] = VEC_TO_A8R8G8B8(color) ;
			}

		}
		SDL_UnlockSurface( screen ) ;

		//time_int(1) ;
		elapsed = clock() - runtime ;
		SDL_Color sdlcolor={1,1,1,1} ;
		char szfps[50] ;
		float ftime = float(elapsed)/float(CLOCKS_PER_SEC) ;
		sprintf(szfps, "Frame Time: %.3f", ftime ) ;
		RenderText(screen, font, sdlcolor,5,50,szfps) ;
		sprintf(szfps, "FPS: %.3f", 1.0f/ftime ) ;
		RenderText(screen, font, sdlcolor,5,150,szfps) ;
		//printf( "s/frame: %f\n", float(elapsed)/float(CLOCKS_PER_SEC) ) ;
		SDL_Flip( screen ) ;
	}
...

With this code I get a maximum frame rate of just 5-7 fps, and that is without adding anything else to it.
Screen width and height are defined as follows:

Code: Select all

#define SCREEN_WIDTH 	1100 // 1100 max
#define SCREEN_HEIGHT 	600

So that's 660,000 loop iterations per frame on a 3.2 GHz PPC.
Is this possible?
I mean, what's wrong?

I use Fedora 7 on the PS3 with the libs SDL, SDL_ttf, zlib and truetype.
This code runs on the PPU only.

Thanks

Glas · Post by **Glas** » Mon May 12, 2008 9:24 pm

I found something out.

The function

Code: Select all

VEC_TO_A8R8G8B8(vector float &)

costs about 20 fps. This function is called in the inner for loop, that is,
660,000 times per frame.
It is defined as follows:

Code: Select all

unsigned int VEC_TO_A8R8G8B8( vector float  &x )
{
	return 	((unsigned int)( x[3]* 255) << 24 ) |
			((unsigned int)(x[0]*255) << 16) |
			((unsigned int)(x[1]*255)<<8) |
			(unsigned int)(x[2]*255) ;
}

Ok, here we have four multiplications.
Do you have any suggestion how to improve this function?
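
For reference, a minimal scalar sketch of the same conversion in portable C++. `Float4`, `to_byte` and `pack_a8r8g8b8` are hypothetical names used only for illustration; `Float4` stands in for the AltiVec `vector float` with components in r, g, b, a order, as the function above implies. The clamping is an addition that the original does not have.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical stand-in for the AltiVec `vector float`; component order r, g, b, a.
struct Float4 { float r, g, b, a; };

// Convert one component in [0, 1] to an 8-bit value, clamping out-of-range input.
inline std::uint32_t to_byte(float v) {
    float c = std::min(1.0f, std::max(0.0f, v));
    return static_cast<std::uint32_t>(c * 255.0f);
}

// Pack into A8R8G8B8: alpha in bits 24-31, red 16-23, green 8-15, blue 0-7.
inline std::uint32_t pack_a8r8g8b8(const Float4& c) {
    return (to_byte(c.a) << 24) | (to_byte(c.r) << 16) |
           (to_byte(c.g) << 8)  |  to_byte(c.b);
}
```

Taking the argument by const reference lets the compiler treat the call as a pure computation, which may make it easier to inline or hoist out of a loop; whether that actually happens on the PPU toolchain is something to check in the generated code.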

In the code from the first post, I changed the code like this:
From

Code: Select all

p[j*SCREEN_WIDTH+i] = VEC_TO_A8R8G8B8(color) ;

to

Code: Select all

p[j*SCREEN_WIDTH+i] = 0x00ff0000 ; //VEC_TO_A8R8G8B8(color) ;

where p is a pointer to the screen array.

Thanks
Alex

Glas · Post by **Glas** » Mon May 12, 2008 9:46 pm

In IBM's gmath lib there is a function called:

Code: Select all

unsigned int _pack_color8( vector float rgba )

With it I get nearly 20 fps.
But that's still very slow, isn't it?

Is 20 fps slow for two for loops with 660,000 iterations?

Ok, if I calculate the max loop count, this will result in:

Code: Select all

20 * 660000 = 13.2 million loop iterations per second

With a 3.2 GHz PPC, that's still too slow in my opinion!
I'd be grateful if somebody could agree or disagree, or give some answers!?

Thanks
Alex

Glas · Post by **Glas** » Mon May 12, 2008 9:57 pm

New results:

Now I'm at 50 fps.

I changed the following term in the inner loop:

Code: Select all

vector float color = _load_vec_float4(1.0f,1.0f,1.0f,0.0f) ;

to

Code: Select all

vector float color = (vector float){1.0f,1.0f,1.0f,1.0f} ;//_load_vec_float4(1.0f,1.0f,1.0f,0.0f) ;

That is a gain of 30 fps. Incredible!!!

Code: Select all

vector float _load_vec_float4( ... )

is also a function of the IBM vector lib from the example libraries.
Have I made mistakes in compiling these example libs from IBM?
Please give me your opinion about these libs.

Thanks

Mihawk · Post by **Mihawk** » Mon May 12, 2008 11:06 pm

Looks like the color doesn't change inside the loop anyway, so why not just define an unsigned int color outside the loop and calculate it just once there?

2nd:
maybe you also need to add "-O3" compiler flag in the Makefile?
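
A minimal sketch of the first suggestion, in portable C++ instead of the AltiVec/SDL code from the first post. `fill_constant` is a hypothetical helper, and a `std::vector` stands in for `screen->pixels`. The caller converts the color once, so the inner loop is just a store.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Fill a width*height buffer with one precomputed 32-bit color.
// The color is converted once by the caller, so the loop body is a plain store.
void fill_constant(std::vector<std::uint32_t>& pixels, int width, int height,
                   std::uint32_t packed_color) {
    pixels.resize(static_cast<std::size_t>(width) * height);
    for (int j = 0; j < height; ++j) {
        // Row start is computed once per row, not once per pixel.
        std::uint32_t* row = pixels.data() + static_cast<std::size_t>(j) * width;
        for (int i = 0; i < width; ++i) {
            row[i] = packed_color;
        }
    }
}
```

This only applies while the color really is constant; once each pixel gets its own ray-traced color, the conversion has to move back into the loop.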

Glas · Post by **Glas** » Wed May 14, 2008 8:38 am

Hi Mihawk.

The loop is empty because I was wondering why it is so slow.
And it's still too slow for just two for loops. I have no idea what's going on!

The -O3 flag is set.

Thanks

rapso · Post by **rapso** » Wed May 14, 2008 6:20 pm

Code: Select all

p[j*SCREEN_WIDTH+i]

maybe you could avoid that math, i'm not sure whether the compiler is optimizing this or if the ppu calculates the offset every loop, but calculating it could be slow due to the register dependencies.

you could try to do the same amount of work in one loop with just

Code: Select all

p[i]=...

and i think it's slower using

Code: Select all

_load_vec_float4

than the declaration, because the compiler cannot optimize the intrinsic, it's done every loop, while it can take the declaration out of the loop... just a guess.

you could also check if __restrict could give you any benefits ;)
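
A minimal sketch of the single-loop version in portable C++. `fill_linear` is a hypothetical name, and the sketch assumes the surface rows are contiguous, i.e. that the SDL surface `pitch` equals width * 4 bytes, which should be checked before treating the screen as one linear array.

```cpp
#include <cstddef>
#include <cstdint>

// Write `count` pixels through a single linear index, with no j*width+i multiply.
void fill_linear(std::uint32_t* p, std::size_t count, std::uint32_t color) {
    for (std::size_t i = 0; i < count; ++i) {
        p[i] = color;
    }
}
```

It would be called with `count` set to SCREEN_WIDTH * SCREEN_HEIGHT, so the same 660,000 stores happen with one loop counter instead of two.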

Jim · Post by **Jim** » Wed May 14, 2008 8:15 pm

Refactor so the screen is just linear
ie.
When you get to
*p++ = 0x0000ff00;

then you've made real progress :)

Jim

Glas · Post by **Glas** » Thu May 15, 2008 12:57 am

Thanks for replying.

Ok, I changed it.
But the frame rate is still 50fps.
The for loops look like this now.

Code: Select all

(some code here)...
		///////////////////////////////
		// start timing
		runtime = clock() ;
		// lets do something
		SDL_LockSurface( screen ) ;
		unsigned int *p = ((unsigned int*)(screen->pixels)) ;
		vector float vcHit = (vector float){MATHCONST_FLOATMAX,MATHCONST_FLOATMAX,MATHCONST_FLOATMAX,MATHCONST_FLOATMAX};

		for( int j=0; j< SCREEN_HEIGHT; j++ )
		{
			for( int i=0; i< SCREEN_WIDTH; i++ ){

				RAY ray1, ray2, ray3, ray4 ;


				ray1 = CreateRayFromSurfacePixel_Perspective_scalar(i,j,g_fSW,g_fSH,g_frecip_SW, g_frecip_SH,cam ) ;
				vector float color = (vector float){1.0f,1.0f,1.0f,1.0f} ;//_load_vec_float4(1.0f,1.0f,1.0f,0.0f) ;

				*p++ = _pack_color8(color) ;
			}

		}
		SDL_UnlockSurface( screen ) ;

		(some code here) ...

		SDL_Flip( screen ) ;
	}

So, unfortunately, the pointer arithmetic has no effect at all.
Instead, another problem occurs.
If I want to trace 4 rays at once, I have to reintroduce the index calculation.
For example, if I trace 4 rays (in screen coords):
ray 1: (y,x)
ray 2: (y, x+1)
ray 3: (y+1, x )
ray 4: (y+1, x+1)

With the array index technique I could just write again:

Code: Select all

p[y*width+x]= ray_color1
p[y*width+(x+1)] = ray_color2
...

The current results are:
The two for loops without the ray creation function

Code: Select all

CreateRayFromSurfacePixel_Perspective_scalar()

run at 50 fps.
With this function, which creates primary rays for width*height pixels, where
width = 1100
height = 800
the frame rate is 10 fps.

What can I look at next?
Are there any other options I could try?

Thanks

rapso · Post by **rapso** » Thu May 15, 2008 5:42 pm

Glas wrote:Thanks for replying.

Ok, I changed it.
But the frame rate is still 50fps.

how fast is it without the loops, with just the rest of the code? maybe the bottleneck isn't in these loops.

ray 1: (y,x)
ray 2: (y, x+1)
ray 3: (y+1, x )
ray 4: (y+1, x+1)

prefer p[index] over *p++ for two reasons: 1. the loop-iteration counter is incremented anyway 2. if you do several *p++ in a row, you can get register RAW hazards on most cpus (on ARM *p++ is actually faster :D)
if you need offsets by one line or one pixel, just add them directly without the mul

Code: Select all

p[index]
p[index+1]
p[index+SCREEN_WIDTH]
p[index+SCREEN_WIDTH+1]

or with Jim's version

Code: Select all

p[0]
p[1]
p[SCREEN_WIDTH]
p[SCREEN_WIDTH+1]
p+=2;

but this again can result in register RAW hazards. if you want to write 4 pixels in a row, you could combine them into a 128-bit quad and write them at once, this can also be beneficial for the buffer fill performance, especially if you do it with those altivec intrinsics.

of course, you won't write a pixel-quad anymore, but you can try to either trace 4*1 packets or to trace 2*2 but write them as 4*1 (swizzled), doing some post processing afterwards is a good chance to unswizzle the buffer (kinda for free).

sorry if something is incorrect, i'm not that deep into ppu assembler, i've just seen that there is a load for one indirection, so p[index] should be free, p[index+...] shouldn't.
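
a minimal sketch of the 2*2 indexing in portable C++, assuming even width and height and a contiguous buffer. `fill_quads` is a hypothetical name, and a constant color stands in for the four ray colors. the index only ever advances by additions inside the loops.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Write a frame in 2x2 blocks without a per-pixel multiply.
// Assumes width and height are even.
void fill_quads(std::vector<std::uint32_t>& p, std::size_t width, std::size_t height,
                std::uint32_t color) {
    p.assign(width * height, 0);
    // `row` is the index of the first pixel of the upper row of each block row.
    for (std::size_t y = 0, row = 0; y < height; y += 2, row += 2 * width) {
        std::size_t index = row;
        for (std::size_t x = 0; x < width; x += 2, index += 2) {
            p[index]             = color;  // (y,   x)
            p[index + 1]         = color;  // (y,   x+1)
            p[index + width]     = color;  // (y+1, x)
            p[index + width + 1] = color;  // (y+1, x+1)
        }
    }
}
```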

cheers

Jim · Post by **Jim** » Thu May 15, 2008 8:35 pm

vector float color = (vector float){1.0f,1.0f,1.0f,1.0f} ;//_load_vec_float4(1.0f,1.0f,1.0f,0.0f) ;
*p++ = _pack_color8(color) ;

this still has the potential to generate a load of fp code (depends on the optimiser/code analyser).

*p++ = 0xffffffff;

gives the same result.

How does this pair of loops compare with memset(screen->pixels, 0xff, w*h*4)?
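
One way to run that comparison off the PS3 first is sketched below in portable C++. All names (`seconds_for`, `fill_loop`, `fill_library`) are hypothetical, a `std::vector` stands in for the screen buffer, and `std::chrono::steady_clock` is used for wall-clock time instead of `clock()`.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Run `work` once and return the elapsed wall-clock time in seconds.
template <typename Work>
double seconds_for(Work work) {
    auto start = std::chrono::steady_clock::now();
    work();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Hand-written fill, equivalent to the loop under discussion.
void fill_loop(std::vector<std::uint32_t>& px, std::uint32_t color) {
    for (std::size_t i = 0; i < px.size(); ++i) px[i] = color;
}

// Library fill; for 0xffffffff this produces the same bytes as a memset with 0xff.
void fill_library(std::vector<std::uint32_t>& px, std::uint32_t color) {
    std::fill(px.begin(), px.end(), color);
}
```

Calling `seconds_for([&] { fill_loop(buf, 0xffffffffu); })` and the same for `fill_library` on a 1100 x 600 buffer gives two frame times to compare. A single run is noisy, so averaging over many frames is the safer measurement.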

Jim

d-range · Post by **d-range** » Fri May 16, 2008 12:09 am

Maybe this is a stupid question, but why are you doing this kind of stuff from the PPU anyway? Regardless of how you're writing your code, you're still doing ±50 million dword writes a second @25 fps, which will be pretty slow no matter how you do it.

Glas · Post by **Glas** » Fri May 16, 2008 12:06 pm

Hi all and thanks again.

First of all, rapso:
I cut the for loops out of the code and got a frame rate of ~50000 fps.
For a 3.2 GHz Cell, utilizing the PPU only, I consider this a bit too slow as well.
Putting the for loops back in shrinks the fps count by a factor of 1000?!

Jim:
Of course you are right. But later on, when ray tracing with color, this function will be needed anyway, so it doesn't matter whether I put it in or not. It also doesn't have a big impact on the fps count...
That is, I cut out the ray tracing algo just to see why the loops limit me to 50 fps.

d-range:
I don't really understand the question. Sorry.

Thanks

ldesnogu · Post by **ldesnogu** » Fri May 16, 2008 7:12 pm

What d-range meant is that your for loops write 1100 x 800 words per frame and that doing it this way on the PPU is far from optimal.

The "standard" way to write big amounts of data on the Cell is by using DMA with multiple SPU's.

However even at 50 fps, we are talking of 1100 x 800 x 4 x 50 = 176,000,000 B/s which is low.
If you look at this http://www-128.ibm.com/developerworks/f ... D=13975586& you will see that you are not limited by the memory BW from PPU to memory. (As a side note, I got higher results for STREAM than what the guy quoted by using some assembly with manual unrolling and cache preload.)
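
The bandwidth figure above is just width x height x bytes per pixel x frames per second. A small sketch of that arithmetic, with `bytes_per_second` as a hypothetical helper name:

```cpp
#include <cstdint>

// Bytes written per second when every pixel of a frame is stored once per frame.
constexpr std::uint64_t bytes_per_second(std::uint64_t width, std::uint64_t height,
                                         std::uint64_t bytes_per_pixel,
                                         std::uint64_t fps) {
    return width * height * bytes_per_pixel * fps;
}
```

For 1100 x 800 at 4 bytes per pixel and 50 fps this gives 176,000,000 B/s; with the 1100 x 600 from the first post it would be 132,000,000 B/s.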

Glas · Post by **Glas** » Sat May 17, 2008 11:30 am

Hi ldesnogu.

Ah, you are talking about the SPUs, but unfortunately I'm not considering using the SPUs at the moment.
I first want to test trees and algorithms on the PPU.

I'm just wondering why these for loops are so slow!
But thanks for the link. I will consider it when it's time for SPU programming and
DMA transfers. This is an interesting discussion.

I have another question.
What do you think of the example libs from IBM?
I mean libs like the misc lib, the gmath lib, the vector lib and so on...

Because currently I'm intersecting triangle data with rays at the resolution mentioned
above, 1100 x 600.
I got a frame time of 4 secs for just a cube with 2 tris per face, that is
2 * 6 tris = 12 tris

I'm a bit frustrated, because everything is so slow!!!

Thanks
Alex

ldesnogu · Post by **ldesnogu** » Sat May 17, 2008 9:34 pm

Stupid question just in case: do you use -O2 or -O3 when you compile?

And again: the Cell PPU is a very poor processor, using it alone will only give you very disappointing results.

Glas · Post by **Glas** » Mon May 19, 2008 2:42 am

Hello ldesnogu.

I compile it with -O3. I haven't tried -O2 yet. Though, for debugging I use no optimization of course.

ldesnogu wrote: And again: the Cell PPU is a very poor processor, using it alone will only give you very disappointing results.

But why?
What makes it so poor compared to an ordinary Intel-like CPU?

Alex

ldesnogu · Post by **ldesnogu** » Mon May 19, 2008 3:03 am

In-order execution, no (or poor, can't remember :D) branch prediction.

IBM has posted SPECint 2k results of 423 and fp of 387 (ref).

That basically means the PPU has performance similar to a PIII 800 MHz for integers and about 10% better than the PIII for FP.

So if you don't use the Cell "properly" you will be *very* disappointed :)

Glas · Post by **Glas** » Mon May 19, 2008 7:22 am

Hi ldesnogu.

http://spec.it.miami.edu/cgi-bin/osgresults
I found results here, but none for cbea.

ldesnogu · Post by **ldesnogu** » Mon May 19, 2008 7:27 am

1. I posted the link to IBM result in my previous post (it's close to the end of the article)
2. If you want to search SPEC 2000 results use the official site: http://www.spec.org/cgi-bin/osgresults?conf=cpu2000

Glas · Post by **Glas** » Mon May 19, 2008 1:06 pm

Hi.

Thanks.

Ok. I'll set the original problem aside for now. I'll use what I get and port everything to the SPUs soon.

Btw, I tried a simple console app on Windows on my Athlon X2 2.4 GHz, testing a while loop. It's the same construct as the one on the PPU.
Here I get ~600k fps. On the PPU, ~50k fps.
That is a big difference I think, especially with a CPU that is 800 MHz slower...

#edit#
It's just like ldesnogu said. It's like a PIII ;)
#/edit#

Thanks to all for your help.
Alex

rapso · Post by **rapso** » Mon May 19, 2008 5:20 pm

Glas wrote: First of all, rapso:
I cut the for loops out of the code and got a frame rate of ~50000 fps.
For a 3.2 GHz Cell, utilizing the PPU only, I consider this a bit too slow as well.
Putting the for loops back in shrinks the fps count by a factor of 1000?!

but you kept everything else like SDL_LockSurface( screen ) ; ?

50MB/s is extremely slow, i kinda doubt it's just the loop.

IronPeter · Post by **IronPeter** » Mon May 19, 2008 6:37 pm

>So if you don't use the Cell "properly" you will be *very* disappointed :)

Yes, that is the point.

Use PPU as IO-processor and SPU scheduler only.

HD · Post by HD » Tue May 20, 2008 2:27 am

I think you can get a lot more speed out of the ppu by using altivec, cache clearing and loop unrolling. An optimized ppu-memset similar to the one you need achieves appr. 5800MBytes/sec ~1650fps. Download from here:
http://www.fh-furtwangen.de/~dersch/memcpy_cell.c
If you need to do format conversions (float4->uchar4): these can
also be done quite efficiently in altivec-code.

Regards

HD

Glas · Post by **Glas** » Wed May 21, 2008 4:51 pm

Hi and thanks for your replies.

Sorry that my answer has taken so long.

@rapso:
You were right!
Now I cut everything out, even the display flip, and got an fps count of
~100k - ~120k fps.
But when I also cut out the input handling, I get
~600k - ~700k fps

This is all SDL code!
I never thought that this could be the bottleneck, because the input handling is just a bit of switch-case stuff.
Is this because of the in-order processing of the PPU?

I count the fps like this:

Code: Select all

unsigned long long start = __mftb() ;
unsigned long long elapsed = __mftb() - start ;

print out ( 1.0f/(elapsed / timebase Per Sec));

So, what do you suggest?
I could manage the PS3 ray tracer via my PC (over the network), but later in this real-time RT project I need some joypad input.

On the other hand, I will change the code to SPE code anyway, so this wouldn't make much
difference, because even 50k fps is enough for just management tasks.

But what makes the SDL code so slow? I don't use any rasterization or any other compute-intensive SDL code.

Alex

rapso · Post by **rapso** » Wed May 21, 2008 7:48 pm

Glas wrote:Hi and thanks for your replies.

Sorry that my answer has taken so long.

@rapso:
You were right!

i'm glad I guided you to finding an issue, although this particular input issue wasn't what I had in mind. so additionally:
I asked if you kept SDL_LockSurface( screen ) in your code when you tested without the loop, cause although it might look like a simple function, SDL might do a lot of format conversions before returning the ptr. this can be way more expensive than your loop.

regarding input. maybe you can post some of the 'expensive' code.

cheriff · Post by **cheriff** » Thu May 22, 2008 10:21 am

Glas wrote:Now I cut everything out, even the display flip, and got an fps count of
~100k - ~120k fps.
But when I also cut out the input handling, I get
~600k - ~700k fps

This is all SDL code!
I never thought that this could be the bottleneck, because the input handling is just a bit of switch-case stuff.

Hi, whilst I can't find the link right now, I do seem to remember an article on gamedev or something on the folly of relying on FPS for this kind of performance tuning (especially this early on in development). I can't recall the exact details, but it basically comes down to the fact that FPS measures the reciprocal of time, which is not linear, and so conflicts with intuition. Maybe you should be looking at average frame time instead.

So say your app as a bare loop runs at 600kfps (really? half a million?), so each frame is taking 1/600k = 1.7e-6 seconds.
With the SDL input routines being run, you're down to 1/100k = 1.0e-5 seconds per frame.

So a call to the input routines takes 8.3e-6 seconds, which is the same order of magnitude as the actual loop - so it's only natural that 99% of processing time is in SDL - where else would it be? Since the code doesn't attempt to do anything else, of course the few things you ARE doing will dominate execution time.

Now consider a project a bit further along, calculating graphics and stuff, running at a respectable 100fps, or 0.01 seconds per frame.

Now let's add input code, which we already know takes 8.3e-6 seconds, so each frame now requires 0.010008 seconds to render - which equates back to 99.9167 FPS. Does SDL still feel like a bottleneck to you now? :)

So the SDL code either costs 500k frames per second - or 0.08 frames per second - depending on what else is in the game loop.

In short - don't get too caught up with FPS early on in the project :) If you insist on trying to optimise this early on, at the very least, deal in seconds per frame on your mental graph paper as you plot performance - at least it is linear and will be a truer indication of the cost of features you add.

Glas · Post by **Glas** » Fri May 23, 2008 6:16 am

Hi all and thanks again.

rapso:
Here is the SDL input code. It's quite simple and short.

Code: Select all

/////////////////////////////////////////////
		// Handle input
		while( SDL_PollEvent( &event ) )
		{
			switch( event.type )
			{
			case SDL_KEYDOWN:
				switch( event.key.keysym.sym )
				{
				case SDLK_ESCAPE:
					bDone = true ;
					break ;
				case SDLK_p:
					{
					char sz[25] ;
					sprintf( sz, "/tmp/screenshot.png\0" ) ;
					int i = SDL_SaveBMP(screen, sz) ;
					i=0 ;
					}
					break ;
				} // switch sym
				break ;
			case SDL_MOUSEMOTION:
				x = event.motion.x ;
				y = event.motion.y ;
				break ;
			case SDL_MOUSEBUTTONDOWN:
				break ;
			} // switch type
		} // poll event

Actually, I just need SDL for input handling, because the fb utility from http://www.cellperformance.com works just fine, but I don't use it currently, as you can imagine.
But still, in the next one or two weeks I will switch to the SPUs, and then this shouldn't make much difference.

cheriff:
Actually I have more code than just the two for loops.
I have a ray tracer running. The problem was that I didn't know where all the performance had gone. So I cut out pieces and ended up with just what rapso
suggested.
With 20 spheres and 6 planes (all implicit) a frame took about 4 s.
Without the ray tracer, with just input, the for loops and primary ray gen, a frame took 0.5 s.
And with just the main while loop and no for loops and no input, I got 700k fps.
I should try the whole thing with just the fb utility from cellperformance.com, to
see how it actually is.

So what do you say about this?

Thanks
Alex