Framebuffer hello world and performance measurement

Investigation into how Linux on the PS3 might lead to homebrew development.

Moderators: cheriff, emoon

Shine
Posts: 728
Joined: Fri Dec 03, 2004 12:10 pm
Location: Germany

Framebuffer hello world and performance measurement

Post by Shine »

I've written a very unoptimized program that draws a fullscreen background image with a moving bar on top of it. In 720x480 resolution (mode 480i, set with "ps3videomode -v 1") the usable area is 648x432, and with a bar height of 20 pixels nearly 60 fps is possible. I think that with the SPEs doing the blitting and more optimized code, good 2D games such as jump-and-run games with multi-layer parallax scrolling should be no problem.

Code:

// performance test with VSync IRQ, inspired by the VSync example on the cell add-on CD
//
// compile:
// gcc -I /usr/src/linux-2.6.16-cell-r1/include -lm vsync.c -o vsync
//
// tested on Gentoo, installed with this guide: http://wiki.ps2dev.org/ps3:linux:installing_gentoo

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <stdint.h>
#include <math.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kd.h>
#include <sys/time.h>
#include <linux/fb.h>
#include <asm/ps3fb.h>

int width, height, memoryWidth, memoryHeight;
uint32_t* background;

void draw(uint32_t* fb)
{
	int x, y, yp;
	static int t = 0;
	int barHeight = 20;
	float amplitude = ((float) (height - barHeight)) / 2.0;
	float frequency = 40.0;

	// blit background (column by column, deliberately unoptimized)
	for (x = 0; x < width; x++) {
		for (y = 0; y < height; y++) {
			fb[y * memoryWidth + x] = background[y * width + x];
		}
	}

	// draw a bar
	yp = sin(((float)t) / frequency) * amplitude + amplitude;
	for (y = yp; y < yp + barHeight; y++) {
		if (y < height && y >= 0) {
			for (x = 0; x < width; x++) {
				fb[y * memoryWidth + x] = 0xffffff;
			}
		}
	}
	t++;
	if (t == height) t = 0;
}

void enableCursor(int enable)
{
	int fd = open("/dev/console", O_NONBLOCK);
	if (fd >= 0) {
		ioctl(fd, KDSETMODE, enable ? KD_TEXT : KD_GRAPHICS);
		close(fd);
	}
}

int main(int argc, char *argv[])
{
	int fd;
	void *addr;
	int length;
	struct ps3fb_ioctl_res res;
	int x, y;
	uint32_t frame = 0;
	struct timeval tv;
	uint64_t startTime, elapsed;
	int count;

	// switch to graphics mode (disable cursor)
	enableCursor(0);

	// access framebuffer
	fd = open("/dev/fb0", O_RDWR);
	if (fd < 0) {
		perror("open /dev/fb0");
		enableCursor(1);
		return 1;
	}
	ioctl(fd, PS3FB_IOCTL_SCREENINFO, (unsigned long)&res);
	printf("xres: %d, yres: %d, xoff: %d, yoff: %d, num_frames: %d\n",
		res.xres, res.yres, res.xoff, res.yoff, res.num_frames);
	length = res.xres * res.yres * 4 * res.num_frames;
	addr = mmap(NULL, length, PROT_WRITE, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED) {
		perror("mmap");
		close(fd);
		enableCursor(1);
		return 1;
	}

	// stop flipping in kernel thread with vsync
	ioctl(fd, PS3FB_IOCTL_ON, 0);

	// create test background image
	memoryWidth = res.xres;
	memoryHeight = res.yres;
	width = res.xres - 2 * res.xoff;
	height = res.yres - 2 * res.yoff;
	background = malloc(width * height * 4);
	if (background == NULL) {
		perror("malloc");
		ioctl(fd, PS3FB_IOCTL_OFF, 0);
		munmap(addr, length);
		close(fd);
		enableCursor(1);
		return 1;
	}
	for (x = 0; x < width; x++) {
		for (y = 0; y < height; y++) {
			background[y * width + x] = x*y << 3;
		}
	}

	// start timing (64 bit, so the microsecond count cannot overflow)
	gettimeofday(&tv, NULL);
	startTime = (uint64_t)tv.tv_sec * 1000000 + tv.tv_usec;

	// draw test
	count = 300;
	for (x = 0; x < count; x++) {
		// wait for vsync interrupt
		uint32_t crt = 0;
		ioctl(fd, FBIO_WAITFORVSYNC, (unsigned long)&crt);

		// draw frame (byte offset of the selected frame inside the mapping)
		draw((uint32_t*)((uint8_t*)addr + (size_t)frame * memoryWidth * 4 * memoryHeight));

		// blit and flip with vsync request
		ioctl(fd, PS3FB_IOCTL_FSEL, (unsigned long)&frame);

		// switch frame
		frame = 1 - frame;
	}

	// end timing
	gettimeofday(&tv, NULL);
	elapsed = (uint64_t)tv.tv_sec * 1000000 + tv.tv_usec - startTime;
	printf("fps: %d\n", (int)((uint64_t)count * 1000000 / elapsed));

	free(background);

	// start flipping in kernel thread with vsync
	ioctl(fd, PS3FB_IOCTL_OFF, 0);
	munmap(addr, length);

	// close device
	close(fd);

	// back to text mode
	enableCursor(1);

	return 0;
}
Shine
Posts: 728
Joined: Fri Dec 03, 2004 12:10 pm
Location: Germany

Re: Framebuffer hello world and performance measurement

Post by Shine »

No need to worry: When compiling with -O2, copying the memory with 64 bit access and reordering the access (line-by-line instead of column-by-column) you can do 10 full screen blits with an additional 50 pixel bar overlay in 13 ms :-)
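
The reordering described here can be sketched as follows. This is only a minimal sketch, not the optimized code from the post (which was not published in this thread): it assumes the same layout as the program above, with a packed source image and framebuffer rows that are `memoryWidth` pixels apart. The name `blit_rows` is hypothetical, and `memcpy` stands in for a hand-written 64 bit copy loop.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy a packed width x height image into a framebuffer whose rows are
 * fbStride pixels apart. The loop walks whole rows, so source and
 * destination are both touched in ascending address order. */
void blit_rows(uint32_t *fb, const uint32_t *src,
               int width, int height, int fbStride)
{
	int y;

	assert(fbStride >= width);
	for (y = 0; y < height; y++) {
		/* memcpy replaces the explicit 64 bit copy loop (assumption:
		 * the C library moves at least 8 bytes per access) */
		memcpy(fb + (size_t)y * fbStride, src + (size_t)y * width,
		       (size_t)width * sizeof(uint32_t));
	}
}
```

In the program above this would be called as `blit_rows(fb, background, width, height, memoryWidth)` in place of the nested background loops.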
mtb
Posts: 19
Joined: Thu Oct 19, 2006 6:55 am
Location: UK/Tokyo

Re: Framebuffer hello world and performance measurement

Post by mtb »

Shine wrote:No need to worry: When compiling with -O2, copying the memory with 64 bit access and reordering the access (line-by-line instead of column-by-column) you can do 10 full screen blits with an additional 50 pixel bar overlay in 13 ms :-)
Shine, do you have that updated code? It would be cool if we could get it tested at a number of resolutions :)
tweakoz
Posts: 21
Joined: Tue Feb 17, 2004 10:51 am
Location: Santa Cruz, CA

next baby step (spe offloaded computation of screen)

Post by tweakoz »

got a next baby step working!!!

I have spe offloaded computation of the screen (configurable from 1 to 6 spe's)
i have basic Blit / DMA working,
unfortunately it requires the PPE to initiate the DMA per screen line per tile
( i just get a hang when attempting to have the SPE initiate the DMA)
with the PPE initiating DMA requests in 1080, that would be 1080 * (1920/128) = 16200 individual dma requests, would be nice if possible to get this down to (1080/128) * (1920/128) = ~ 127 requests.
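
The request counts above can be reproduced with a few lines of C. This is just a sketch of the arithmetic, with hypothetical function names. Note that the ~127 figure uses a fractional 8.44 tile rows; if partial tile rows are rounded up to whole tiles, the count becomes 9 * 15 = 135.

```c
#include <assert.h>

/* Ceiling division for positive integers. */
unsigned ceil_div(unsigned a, unsigned b)
{
	assert(b > 0);
	return (a + b - 1) / b;
}

/* One DMA request per screen line per tile column
 * (the PPE-initiated scheme described in the post). */
unsigned requests_per_line(unsigned width, unsigned height, unsigned tile)
{
	return height * ceil_div(width, tile);
}

/* One DMA request per whole tile (the hoped-for scheme),
 * counting partial tiles at the bottom edge as full tiles. */
unsigned requests_per_tile(unsigned width, unsigned height, unsigned tile)
{
	return ceil_div(height, tile) * ceil_div(width, tile);
}
```

For 1920x1080 and 128 pixel tiles, `requests_per_line` gives 16200 and `requests_per_tile` gives 135.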

also note i am procedurally generating the screen (sort of like a simple pixel shader)

at any rate with this baby step i hit these framerates
23 FPS in 1920x1080i (vsync off)
52 FPS in 1280x720p (vsync off)
150 FPS in 720x480i (vsync off)

next baby step im gonna get working is spe initiated DMA
then try smaller tiles...
eventually i would like to turn this into a TBDR (Tile Based Deferred Renderer)

code at
http://www.tweakoz.com/portfolio/spurast1.tar

enjoy.....

michael t. mayers
tweakoz
Posts: 21
Joined: Tue Feb 17, 2004 10:51 am
Location: Santa Cruz, CA

almost forgot

Post by tweakoz »

those FPS numbers mentioned last post were with 4 SPU's enabled

michael t. mayers
tweakoz
Posts: 21
Joined: Tue Feb 17, 2004 10:51 am
Location: Santa Cruz, CA

1 more thing

Post by tweakoz »

upgrade to the latest libspe2 (think its 2.01 or 2.02) the code wont work with 2.0

mtm
tweakoz
Posts: 21
Joined: Tue Feb 17, 2004 10:51 am
Location: Santa Cruz, CA

spu initiated DMA working

Post by tweakoz »

problem was 64bit addresses vs 32bit addresses .

i was calling a 64bit address mfc function,
im now calling the 32 bit version.

peak blit transfer rate reached so far is > 800MB /sec
at tilesize=128 and nspus = 6 (~ same result for 4 spus)
max fps of just blitting so far at 720x480i is > 1000fps

im pretty sure im ppe limited now,
adding SPU's doesnt help unless i add more work to each spu

i am still single buffering the spe - calc / dma cycle.
next babystep is double buffering....

updated code posted at the same url

mtm.
tweakoz
Posts: 21
Joined: Tue Feb 17, 2004 10:51 am
Location: Santa Cruz, CA

sorry, 1 error in numbers

Post by tweakoz »

sorry - need to correct a misquote:

im not actually blitting 720x480, im blitting 512x384 (integer multiples of the tile size (128))

dma blit transfer rate still > 800MB/sec

mtm
Shine
Posts: 728
Joined: Fri Dec 03, 2004 12:10 pm
Location: Germany

Re: spu initiated DMA working

Post by Shine »

tweakoz wrote:peak blit transfer rate reached so far is > 800MB /sec
at tilesize=128 and nspus = 6 (~ same result for 4 spus)
max fps of just blitting so far at 720x480i is > 1000fps
This sounds better, the 150 fps was really too slow, because this is already possible with pure C loops from the main CPU and without DMA.

A library which provides fast blitting with alpha blending would be nice. An idea: the PPE sends jobs to the SPUs, which transfer the background to local memory (in stripes, because of limited memory), then the image to blit. The SPU then performs the alpha blending and finally transfers the result back to the framebuffer or to another memory region (for double buffering or for blitting to images). Alpha blending should be really fast with SIMD operations.
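
The per-pixel work that each SPU would do on one stripe can be sketched in plain scalar C. This is only an illustration of the blending step under some assumptions: pixels are packed as 0xAARRGGBB with the source alpha in the top byte, the stripe is already in local memory, and the names `blend_channel` and `blend_stripe` are hypothetical. A real SPU version would use SIMD intrinsics and DMA transfers, which are not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Blend one 8 bit channel: (src*a + dst*(255-a)) / 255, rounded. */
static uint32_t blend_channel(uint32_t src, uint32_t dst, uint32_t a)
{
	return (src * a + dst * (255 - a) + 127) / 255;
}

/* Blend count pixels from src over dst in place.
 * Assumed pixel format: 0xAARRGGBB, alpha taken from src. */
void blend_stripe(uint32_t *dst, const uint32_t *src, size_t count)
{
	size_t i;

	assert(dst != NULL && src != NULL);
	for (i = 0; i < count; i++) {
		uint32_t s = src[i], d = dst[i];
		uint32_t a = s >> 24;
		uint32_t r = blend_channel((s >> 16) & 0xff, (d >> 16) & 0xff, a);
		uint32_t g = blend_channel((s >> 8) & 0xff, (d >> 8) & 0xff, a);
		uint32_t b = blend_channel(s & 0xff, d & 0xff, a);

		/* keep the destination's top byte, replace the colour channels */
		dst[i] = (d & 0xff000000u) | (r << 16) | (g << 8) | b;
	}
}
```

With an alpha of 255 the source pixel replaces the destination, with 0 the destination is left unchanged.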
tweakoz
Posts: 21
Joined: Tue Feb 17, 2004 10:51 am
Location: Santa Cruz, CA

for fun

Post by tweakoz »

just for fun i put up a parallel juliaset -> framebuffer blitter...

http://www.tweakoz.com/portfolio/spurast_julia.tar

mtm
tweakoz
Posts: 21
Joined: Tue Feb 17, 2004 10:51 am
Location: Santa Cruz, CA

new spu blittest tar file up

Post by tweakoz »

now im getting the following performance:
1080i : >700fps (4.5GB/sec)
720p: >1350 fps (3.7GB/sec)
480i: > 2700 fps (2.3 GB/sec)

this is with 4 spu's and pure blitting (spu localmem -> framebuffer)
(no per pixel computation)

again the tar is at
http://www.tweakoz.com/portfolio/spurast1.tar
tweakoz
Posts: 21
Joined: Tue Feb 17, 2004 10:51 am
Location: Santa Cruz, CA

Post by tweakoz »

i noticed it goes up to these numbers when i use -f on ps3videomode
(fullscreen)

1080i: fps: 801 [6336 MB/Sec]
720p: fps: 1577 [5544 MB/Sec]
480i: fps: 2759 [3637 MB/Sec]

not only is the bandwidth going up, but the fps is going up too...
the non-fullscreen mode is reducing performance for some reason...
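
The bracketed bandwidth figures above are consistent with fps multiplied by the size of one full frame at 4 bytes per pixel, counted in units of 2^20 bytes. That relation is inferred from the posted numbers, not taken from the test program, and the function name below is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Bandwidth in units of 2^20 bytes per second for blitting
 * fps full frames of width x height 32 bit pixels. */
uint64_t blit_mib_per_sec(uint32_t width, uint32_t height, uint32_t fps)
{
	uint64_t bytesPerFrame = (uint64_t)width * height * 4;

	assert(bytesPerFrame > 0);
	return bytesPerFrame * fps / (1024 * 1024);
}
```

`blit_mib_per_sec(1920, 1080, 801)` gives 6336, `blit_mib_per_sec(1280, 720, 1577)` gives 5544 and `blit_mib_per_sec(720, 480, 2759)` gives 3637, matching the three fullscreen results.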

mtm
Arwin
Posts: 426
Joined: Tue Jul 12, 2005 7:00 pm

Post by Arwin »

I wonder how this works. Is the framebuffer in RSX (GDDR) memory and then directly manipulated by Cell (which in the schematics we've seen was rated at about 4GB/s, wasn't it?)? If so, then you already seem to have gotten better performance than in the original specs.

Or is the RSX using a framebuffer in XDR memory?
J.F.
Posts: 2906
Joined: Sun Feb 22, 2004 11:41 am

Post by J.F. »

tweakoz wrote:the non-fullscreen mode is reducing performance for some reason...
Windowed mode has to go through an extra layer of system functions to provide clipping to the window's layer... just in case the window isn't fully visible. Fullscreen mode doesn't. Most of the time, this amounts to only a minor performance decrease that is absorbed in the rest of the app (game), but you're pushing the edge to try to get absolute speed figures, so naturally you'll see this overhead.
tweakoz
Posts: 21
Joined: Tue Feb 17, 2004 10:51 am
Location: Santa Cruz, CA

Post by tweakoz »

Windowed mode has to go through an extra layer of system functions to provide clipping to the window's layer...
hmm - the SPU's DMA controller has no notion of this, it is just DMA'ing from the local store to the framebuffer with no clipping,
although maybe you are speaking of the DMA "from" the framebuffer to the RSX

mtm
tweakoz
Posts: 21
Joined: Tue Feb 17, 2004 10:51 am
Location: Santa Cruz, CA

new spu blit test build up

Post by tweakoz »

(getting closer to real world workloads)

new build has following features:

1. basic destination additive blending (read modify write cycle)
2. basic texture mapping with 2D rotated UV space
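
The two features can be sketched in scalar C as below. This is a rough illustration and not the code from the tar file: the function names are hypothetical, the rotation is passed in as a cosine and sine in 16.16 fixed point, texture coordinates are assumed to wrap at the 64x64 edge, and the additive blend is assumed to saturate per channel.

```c
#include <assert.h>
#include <stdint.h>

#define TEX_SIZE 64 /* texture edge length, as in the 64x64 test texture */

/* Saturating per-channel add of two packed 0x00RRGGBB pixels
 * (the read-modify-write step of additive blending). */
uint32_t add_saturate(uint32_t dst, uint32_t src)
{
	uint32_t out = 0;
	int shift;

	for (shift = 0; shift < 24; shift += 8) {
		uint32_t c = ((dst >> shift) & 0xff) + ((src >> shift) & 0xff);
		if (c > 255)
			c = 255;
		out |= c << shift;
	}
	return out;
}

/* Fetch the texel for screen position (x, y) with UV space rotated by
 * the angle whose cosine and sine are given in 16.16 fixed point. */
uint32_t sample_rotated(const uint32_t *tex, int x, int y,
                        int32_t cosFx, int32_t sinFx)
{
	/* Unsigned arithmetic wraps modulo 2^32, which is a multiple of
	 * TEX_SIZE << 16, so negative coordinates wrap onto the texture. */
	uint32_t u = (uint32_t)x * (uint32_t)cosFx - (uint32_t)y * (uint32_t)sinFx;
	uint32_t v = (uint32_t)x * (uint32_t)sinFx + (uint32_t)y * (uint32_t)cosFx;

	assert(tex != 0);
	return tex[((v >> 16) & (TEX_SIZE - 1)) * TEX_SIZE +
	           ((u >> 16) & (TEX_SIZE - 1))];
}
```

A frame would then be produced with something like `fb[i] = add_saturate(fb[i], sample_rotated(tex, x, y, c, s))` for every pixel, which reads and writes each framebuffer pixel once.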

performance numbers (4 spu's, single 64x64 texture)
1080i: 208fps (1645MB/sec read, 1645MB/sec write)
720p: 435fps (1529MB/sec read, 1529MB/sec write)
480i: 1004fps (1323MB/sec read, 1323MB/sec write)

as usual:
http://www.tweakoz.com/portfolio/spurast1.tar

mtm
DaveRoyal
Posts: 16
Joined: Sat Jun 25, 2005 3:52 am
Location: Northern California, USA

Hello

Post by DaveRoyal »

TO,

I just started following your work. I haven't tested your code yet, but hope to this evening.

Just this past weekend I got the frame buffer working after much trial and error. I wish I had found your work earlier.

I see you're using a main loop counter. My main interest is a gameloop, and I'm currently using the typical while(1)...

And then inside the loop, I got the joystick in nonblock mode, so I test to see if a particular button has been pressed, and exit that way.
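
A loop exit of this kind might look like the sketch below. It uses the Linux joystick interface from `<linux/joystick.h>`. The function names are hypothetical and the actual code from this post was not shown, so treat it as one possible way to do it.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>
#include <linux/joystick.h>

/* Open a joystick device in non-blocking mode; returns -1 on failure. */
int open_joystick(const char *path)
{
	return open(path, O_RDONLY | O_NONBLOCK);
}

/* Drain all pending events from a non-blocking joystick descriptor.
 * Returns 1 if the given button was pressed, 0 otherwise. */
int exit_button_pressed(int fd, int button)
{
	struct js_event ev;
	int pressed = 0;

	assert(fd >= 0);
	/* read() returns -1 (EAGAIN) as soon as the queue is empty */
	while (read(fd, &ev, sizeof ev) == (ssize_t)sizeof ev) {
		/* the exact type match skips synthetic JS_EVENT_INIT events */
		if (ev.type == JS_EVENT_BUTTON && ev.number == button && ev.value)
			pressed = 1;
	}
	return pressed;
}
```

The game loop then becomes `while (!exit_button_pressed(js, 0)) { /* draw one frame */ }`, with `js = open_joystick("/dev/input/js0")`. The device path and button number are assumptions and depend on the system.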

I'm hoping to get old-school demos working, from the early days of DOS coding, where they used to use mode 13h.

I've been looking over the articles at flipcode, and hope to come up to speed over the next few weeks where you are, so I can understand the relationship between the processors.

Thanks for your great work, I look forward to more!


Dr. Dave 'Wheels' Royal
J.F.
Posts: 2906
Joined: Sun Feb 22, 2004 11:41 am

Post by J.F. »

tweakoz wrote:
Windowed mode has to go through an extra layer of system functions to provide clipping to the window's layer...
hmm - the SPU's DMA controller has no notion of this, it is just DMA'ing from the local store to the framebuffer with no clipping,
although maybe you are speaking of the DMA "from" the framebuffer to the RSX

mtm
If you're simply setting the DMA to blit the data directly to the framebuffer without any regard to the windows, then yes, there won't be any clipping and what-not. If you're using system functions associated with the window, it will adjust things automatically to take into account the window borders and overlap and such. I guess I should probably look at the code to see exactly what you're doing rather than just guessing. :)
tweakoz
Posts: 21
Joined: Tue Feb 17, 2004 10:51 am
Location: Santa Cruz, CA

its getting harder and harder

Post by tweakoz »

its getting harder and harder to squeeze more performance out
of the 64x64 texture additive blending test.

spu-gcc compiler seems to be choking with too much inlining.
(after a point with unrolling/inlining the demo still compiles
but stops working)

current performance:
1080i: fps: 337 [2665 MB/Sec]
720p: fps: 688 [2418 MB/Sec]
480i: fps: 1380 [1819 MB/Sec]

so thats >5GB/sec ( in and out combined ) bandwidth

since there seems to be a lot of overhead in just loops (which i am unrolling to a point) there may be more effective use of bandwidth to be had via multiple texture lookups (higher math/bandwidth ratio) ....
Warren
Posts: 175
Joined: Sat Jan 24, 2004 8:26 am
Location: San Diego, CA

Post by Warren »

I just downloaded and tried your program tweakoz but it seems to lock up right after it starts the 4th (#3) SPU unit. I'm running YDL with libspe2 installed.
Minase
Posts: 6
Joined: Sun Apr 03, 2005 1:38 am

Post by Minase »

Any idea where I can get <asm/ps3fb.h> from? (Can't compile your code because it's missing)

It doesn't seem to be in the toolchain tarball from bsc.es.
If it just comes with the FC or Gentoo installs, well, I'm running Debian... :)
Minase
Posts: 6
Joined: Sun Apr 03, 2005 1:38 am

Post by Minase »

Oh pfft, forgot the obvious, nevermind :)

(For anyone else: apt-get install linux-headers-2.6.16-1-ps3pf, and add -I/usr/src/linux-headers-2.6.16-1-ps3pf/include)
JuSho
Posts: 4
Joined: Thu Dec 06, 2007 1:04 am
Location: Scottsdale, AZ

SDL frontend for native fb access ?

Post by JuSho »

I see this is a rather old topic but I haven't seen any update/post on SDL recently. So my question, has there been any work done to use a native fb interface for the SDL video drivers ? (I saw the media lib seems to be on track to allow some X acceleration, but anything low level for fb?)

Sorry to post it here, but this seems a pretty good thread for some fb performance programming.
IronPeter
Posts: 207
Joined: Mon Aug 06, 2007 12:46 am

Post by IronPeter »

Switch to fullscreen mode, map video ram, use SPU DMA scattering.

What do you want beyond this functionality?