Higher Level GPU Questions

cnlohr · Post by **cnlohr** » Tue Nov 13, 2007 6:29 pm

Since the other thread has been very eventful with important news, and it appears thousands of people are visiting it for news, I decided to post here for a general discussion for some of the noobs like myself who have been following the thread.

I had three questions, but if possible I would like this thread to be continued as a Q/A thread.

1. Do we know if swapping buffers works the same way as on the 7800?

2. I'm having fun writing an interactive program, except it crashes very quickly, in fact, whenever wptr (ps3gpu.c:511) approaches 16384. I noticed some brief discussion about trying to place a jump in the FIFO buffer, but I haven't found any information on how to do that. It would be great if someone could provide some insight there.

3. The kernel patch provided doesn't seem to take to _any_ kernel I applied it to. It wasn't until I manually edited the files and pasted in the code that I could get 3D working on my PS3. What kernel are you guys using!?

Additionally, I ran into the following problems and solved them. I think this may be useful for any other noobs like myself.

My system got a black screen after KBOOT, or failed to do much of anything, with the 2.6.22 and 2.6.23 kernels. My solution was to use the kernel in this post: http://forums.ps2dev.org/viewtopic.php? ... +git#59961.

I couldn't get my system to find the hard drives after I upgraded my kernel. There were two problems: (1) the name of the hard drive driver changed, and it now gets built into the kernel by default; (2) drive /dev/sdaX was changed to /dev/ps3daX.

I couldn't check out the files with the web URL for the subversion repo. Answer: the actual URL of the repo was svn://svn.ps2dev.org/ps3ware/trunk/libps3rsx

IronPeter · Post by **IronPeter** » Tue Nov 13, 2007 7:05 pm

1.) You can flip visual screen area with

http://wiki.ps2dev.org/ps3:hypervisor:l ... _attribute with L1GPU_CONTEXT_ATTRIBUTE_DISPLAY_FLIP attribute.

2.) Just place OUT_RING(0x20000000 | address_to_jump ); in your push buffer. This command does not have a subchannel id or tag mask.

It is a good idea to drive the GPU with jumps in the push buffer. Do not modify control registers at runtime; the kick is very slow. There are many other ways to keep the GPU in sync with the CPU.

3.) GIT head from Levand Geoff kernel, http://git.kernel.org/gitweb.cgi?p=linu ... -linux.git ( not sure, I am away from my console ).
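As a hedged illustration of answer 2 above, the sketch below only builds and decodes the 32-bit jump word in plain C. The 0x20000000 flag comes from the post; the helper names `fifo_jump_word` and `fifo_jump_target` are made up for this example and are not part of libps3rsx.

```c
#include <assert.h>
#include <stdint.h>

/* Flag value taken from the post above; the helpers are illustrative only. */
#define FIFO_JUMP_FLAG 0x20000000u

/* Build the word that would be handed to OUT_RING() to jump to `addr`. */
static uint32_t fifo_jump_word(uint32_t addr)
{
    /* The target address must not collide with the flag bit. */
    assert((addr & FIFO_JUMP_FLAG) == 0);
    return FIFO_JUMP_FLAG | addr;
}

/* Recover the target address from a jump word. */
static uint32_t fifo_jump_target(uint32_t word)
{
    return word & ~FIFO_JUMP_FLAG;
}
```

With a real push buffer the result of `fifo_jump_word()` would go straight into OUT_RING(); no method header precedes it, which matches the remark that the command has no subchannel id or tag mask.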

cnlohr · Post by **cnlohr** » Tue Nov 13, 2007 7:15 pm

Thank you for your prompt response. I will continue to follow your project closely, as well as experimenting and continuing learning about this on my own time.

And, I know you've heard it a lot but:

Thank you for your excellent work, making 3D possible for all of us!

cnlohr · Post by **cnlohr** » Wed Nov 14, 2007 5:09 am

Hmm -- whenever I perform the following code:

Code: Select all

        ctrl[0x10] &= (~(gpu->fifo.len-1));
        ctrl[0x10] |=   0x1448; //I've tried 0 here too.
        OUT_RING(0x20000000 | ctrl[0x10] );

after (or before) the fifo_push() and fifo_wait(), the app locks up. I can still talk to the system, but if I try running any more 3D applications or using the screen for anything, the whole PS3 locks up and I have to hard reboot it.

Note: I've checked to verify that ctrl[0x10] does look like a reasonable value. (before it was 0xe1f3c80, after the code execution, 0xe1f1448)
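The mask arithmetic in that note can be checked on a desktop machine. The sketch below assumes a FIFO length of 65536 bytes (the value the program prints later in this thread as "fifo 65536") and reproduces the before and after values quoted above. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Round a FIFO pointer down to the start of a power-of-two sized buffer. */
static uint32_t fifo_base(uint32_t ptr, uint32_t fifo_len)
{
    /* The mask trick only works for power-of-two lengths. */
    assert(fifo_len != 0 && (fifo_len & (fifo_len - 1)) == 0);
    return ptr & ~(fifo_len - 1);
}

/* Address of a byte offset inside the same buffer. */
static uint32_t fifo_addr(uint32_t ptr, uint32_t fifo_len, uint32_t offset)
{
    assert(offset < fifo_len);
    return fifo_base(ptr, fifo_len) | offset;
}
```

If wptr counts 32-bit words (an assumption, not confirmed in the thread), then 16384 words is exactly 65536 bytes, which would explain why the crash appears as wptr approaches 16384.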

I wouldn't be surprised if I'm just missing a lot considering all that's in the nouveau_dma_wait function that I don't understand ( http://cgit.freedesktop.org/~ahuillet/d ... 1d7e1613c2 )

Is there any simple answer to this one? (or some example code) If not, I don't mind waiting a few weeks for someone who knows what they're doing to come along and write some code.

IronPeter · Post by **IronPeter** » Wed Nov 14, 2007 5:35 am

Just put OUT_RING( 0x20000000 | 0xe1f0000 ); in the simple triangle code before fifo_push, and watch the nice picture :).

This loop can be broken by leave_direct ( after a 3 second timeout or by a Ctrl-C signal ).

It is a very good idea to loop your push buffer. You can create a semaphore as a jump to itself ( jump : goto jump; ) and have the CPU release it just by writing 0 over it. You can also use GPU DMA to insert this semaphore into the push buffer.
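The self-jump semaphore can be modelled with a toy command reader. This is a simulation under stated assumptions, not real RSX behaviour: the reader treats any word with the 0x20000000 bit set as a jump and steps over every other word (including 0). All names are invented for the sketch.

```c
#include <assert.h>
#include <stdint.h>

#define JUMP_FLAG 0x20000000u

/*
 * One fetch step of a toy command reader. `buf` is the push buffer,
 * `base` its address as seen by the reader, `get` the current read
 * address. Returns the next read address.
 */
static uint32_t toy_fetch_step(const uint32_t *buf, uint32_t base, uint32_t get)
{
    uint32_t word;

    assert(get >= base && (get - base) % 4 == 0);
    word = buf[(get - base) / 4];
    if (word & JUMP_FLAG)
        return word & ~JUMP_FLAG;   /* jump: follow the target */
    return get + 4;                 /* anything else: advance one word */
}

/* Arm the semaphore: the word at `addr` jumps to itself. */
static void sem_arm(uint32_t *buf, uint32_t base, uint32_t addr)
{
    buf[(addr - base) / 4] = JUMP_FLAG | addr;
}

/* Release it from the CPU side by overwriting the jump with 0. */
static void sem_release(uint32_t *buf, uint32_t base, uint32_t addr)
{
    buf[(addr - base) / 4] = 0;
}
```

On real hardware the buffer is shared memory, so the write in `sem_release()` would presumably need a volatile access or a memory barrier; the toy model ignores that.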

Any access to wptr/rptr is very slow.

cnlohr · Post by **cnlohr** » Wed Nov 14, 2007 4:50 pm

Oh man, I apologize. My problem was that I wasn't updating wptr and ptr before performing the jump.

Even with executing all of the unnecessary setup code every frame, I'm getting 200+ FPS. I have a wonderful animated spinning bunch of triangles.

I was able to use the built-in vsyncing stuff to lock the image to 60 FPS. I haven't figured out a clean way of doing the display flipping with lv1_gpu_context, as I can't find anywhere in the program where the context_handle is exposed, unless it's the ctrl member of 'gpu.' I'll mess around more in the morning.

Thanks for all the help.

cnlohr · Post by **cnlohr** » Thu Nov 15, 2007 4:09 am

I still can't seem to figure out how to get the GPU context and call the lv1 gpu command from userspace as root (ie from the simple_triangle demo). If anyone has a demo for calling those functions from outside the kernel space, any help would be appreciated. *Edit* I'm sure people other than iron peter have got to know the answer. I feel bad taking up his valuable time answering my noob questions.

But, I was just playing around, and even with jumping and clearing the buffers, I was able to get 4,400 FPS drawing the three triangles in immediate mode. When not clearing the screen, the number was around 22,000 FPS.

I tried doing the jump every frame vs every 20 frames. The speed difference was virtually non-existent (4.484s vs 4.487s (averaged over 3 runs) for 100,000 frames). I was under the impression jumping and messing with those variables would be slow.
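To put those timings in perspective, here is the arithmetic as a small C sketch. The inputs are the figures quoted above (100,000 frames, 4.484 s and 4.487 s); the function names are illustrative.

```c
#include <assert.h>

/* Frames per second for `frames` frames rendered in `seconds`. */
static double frames_per_second(double frames, double seconds)
{
    assert(seconds > 0.0);
    return frames / seconds;
}

/* Difference between two total run times, spread over the frames, in ns. */
static double delta_ns_per_frame(double t_a, double t_b, double frames)
{
    assert(frames > 0.0);
    return (t_b - t_a) * 1e9 / frames;
}
```

The run works out to roughly 22,300 frames per second, and the two runs differ by about 30 ns per frame (under 0.1%), which is small enough to be measurement noise.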

ralferoo · Post by **ralferoo** » Thu Nov 15, 2007 4:23 am

cnlohr wrote:I still can't seem to figure out how to get the GPU context and call the lv1 gpu command from userspace as root

You can't call any hypervisor functions from userspace. Write a small kernel module or more likely, if it's a graphics related call modify ps3fb.

IronPeter · Post by **IronPeter** » Thu Nov 15, 2007 4:45 am

Do not worry about my time, it is fun for me.

To call an lv1 function from userspace it is a good idea to use a driver ioctl ( you can refer to glaurung's patch to ps3fb.c ). Not very fast: ~10K CPU ticks for one single ioctl call, I think.

Jumps in push buffer are our friends.

Keep in mind that we totally miss L3 caching ( so called TILES ), Z and color compression. We will have good speedup with these features enabled.

Also we miss swizzle for framebuffer, it is great for locality and caching.

It is great if you can test something. Glaurung wrote:

- I played a bit with the values of lv1_gpu_memory_allocate(). The four values set to zero are actually referring to resources, probably two memory resources and 2 other resources. Here are the maximum values I could set before the call returns invalid parameters (-17):
status = lv1_gpu_memory_allocate(ps3fb.vram_size,
                                 512*1024,
                                 3075*1024,
                                 15,
                                 8,
                                 &ps3fb.memory_handle,
                                 &ps3fb.vram_lpar);

I think that this call is related to TILE and ZCOMP setup. Would you like to test performance with different memory properties?

cnlohr · Post by **cnlohr** » Thu Nov 15, 2007 7:18 am

Wow, that was easy. Thanks for the suggestion to use IOCTL to use context_attribute, I can now look at arbitrary locations easily.

I can't seem to change the COLOR0 offset. It seems like no matter where I put the following code, the actual location of where the output is drawn remains the same. Is this the wrong method for changing the offset of the draw buffer?

Code: Select all

  BEGIN_RING(Nv3D, NV40TCL_COLOR0_OFFSET, 1 );
  OUT_RING( WIDTH*HEIGHT*4);

*EDIT* If I put that code immediately before the clear buffers and after the NV40TCL_RT_FORMAT command, it works! I now have everything double-buffered and pretty.

Or, should I just be changing the clipping and other values to just force it to draw to the area immediately after where it's drawing now?
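The double-buffering arrangement described in the edit above amounts to alternating the color buffer offset between 0 and WIDTH*HEIGHT*4. A minimal sketch of that bookkeeping, assuming the two buffers sit back to back at the start of video memory and that 4 bytes per pixel is in use (the helper names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of color buffer `index` when buffers are packed back to back. */
static uint32_t color_buffer_offset(uint32_t width, uint32_t height,
                                    uint32_t bytes_per_pixel, uint32_t index)
{
    assert(bytes_per_pixel > 0);
    return index * width * height * bytes_per_pixel;
}

/* Buffer to draw into on a given frame: alternate between 0 and 1. */
static uint32_t draw_buffer_index(uint32_t frame)
{
    return frame & 1u;
}
```

Each frame, the value from `color_buffer_offset()` would be the argument written after NV40TCL_COLOR0_OFFSET, while the other buffer is the one being displayed.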

On the side of figuring out the values for memory_allocate -- What should I be modifying to mess around with that. I can see an instance in ps3fb.c (line 1370)

Code: Select all

        status = lv1_gpu_memory_allocate(0, 0, 0, 0, 0,
                                         &ps3fb.memory_handle,
                                         &ps3fb.vram_lpar);

Is this what I should be changing?

and

Is it possible for me to make the ps3fb a module so I can modify it without a reboot, or is there an issue with that?

cnlohr · Post by **cnlohr** » Thu Nov 15, 2007 1:59 pm

Ok, I tried modifying the values going into the function at ps3fb.c:1370. I couldn't seem to make it run without crashing. And I must be going about this the wrong way: each modify->compile->reboot cycle takes upwards of 15 minutes for me. Also, when I tried making it a module, I couldn't seem to get the init3d command to work.

Code: Select all

5000 frames

Real: 10.442s
user: 10.353s
sys:  .074s
0,0,0,0,0

crash on init3d
ps3fb.vram_size,256*1024, 512*1024, 15, 8,

crash on init3d
ps3fb.vram_size,16*1024, 32*1024, 15, 8,

crash on init3d
ps3fb.vram_size,0,0,0,0

FB is completely broken.  I can SSH in, but nothing works on the video.  KBOOT remains on-screen
ps3fb.vram_size/128,0,0,0,0,

(sanity check)
real    0m10.438s
user    0m10.369s
sys     0m0.054s
0,0,0,0,0

IronPeter · Post by **IronPeter** » Thu Nov 15, 2007 7:00 pm

Use only lv1_gpu_memory_allocate( 0, 0, 0, 0, 0 ... ); for context memory.

After the context is allocated you can make a sequence of lv1_gpu_memory_allocate calls ( make these calls via ioctl ). The hypervisor will place the memory regions sequentially from offset zero. Memory allocation seems to alter global GPU state by design.

So allocate the context as usual, then call lv1_gpu_memory_allocate( screen_size_up_to_megabyte, ?, ?, ?, ? ) via ioctl and test performance. Deallocate, then repeat the tests :).
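Two pieces of that recipe are easy to sketch: rounding the screen size up to a whole megabyte, and the sequential placement of regions from offset zero. The code below is a toy model of what the post describes, not hypervisor code. That each region is padded to a megabyte boundary is an assumption read into "screen_size_up_to_megabyte", and the names are invented.

```c
#include <assert.h>
#include <stdint.h>

#define MIB (1024u * 1024u)

/* Round `size` up to the next multiple of one megabyte. */
static uint64_t round_up_mib(uint64_t size)
{
    return (size + MIB - 1) / MIB * MIB;
}

/*
 * Toy model of sequential placement: returns the offset given to the
 * new region and advances `*next_free` past it.
 */
static uint64_t place_region(uint64_t *next_free, uint64_t size)
{
    uint64_t offset;

    assert(next_free != 0);
    offset = *next_free;
    *next_free += round_up_mib(size);
    return offset;
}
```

For a 720x480 screen at 4 bytes per pixel (1,382,400 bytes), the rounded size is 2 MiB.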

I think modular build is possible. But I did not try it. Good ioctl is enough for me.

cnlohr · Post by **cnlohr** » Thu Nov 15, 2007 7:22 pm

Ok. I think I get it. I will test it in about 12 hours. I didn't really understand what you were asking for earlier.

So, if I get this right:

Never change the way the allocate( 0,0,0,0,0...) call in the kernel works.

Just modify the un-vsynced, un-flipped animated program that just executes a few thousand frames.

Add in the program, before the gfx_test call, a call to lv1_gpu_memory_allocate( <various sizes>, <various numbers>, <various numbers>, <various numbers>, <various numbers>, handle, lpar ) via IOCTL call.

Run program over fixed number of frames (probably 5000.) And record total running time.

Modify parameters to the allocate function in code, recompile, re-run test.

If I find any interesting results (it not always being exactly the same) I will graph the results.

If I misunderstood anything, it'd be great if you can correct me.

I'm in the EST time zone in the USA, so it's about 4:30 AM here and I have classes tomorrow, so I'm not going to get to this until the evening.

Once again, thanks for the help, and with any luck, I may be able to contribute :).

IronPeter · Post by **IronPeter** » Thu Nov 15, 2007 7:26 pm

Ok, everything is ok.

> And record total running time.

and unmap the old memory.

cnlohr · Post by **cnlohr** » Thu Nov 15, 2007 7:30 pm

Understood. I am not used to systems that don't automatically relinquish allocations and rights upon exit, it's going to take some getting used to on my part.

cnlohr · Post by **cnlohr** » Fri Nov 16, 2007 5:06 am

Ok, I ran a bunch of tests with various sizes. I ran all tests at least twice, some four times. I was running it with all the triangles (Animated immediate mode and index buffer'd ones, all textured, no vsync or double buffering over 1000 frames)

Code: Select all

NO MALLOC TEST:Total Time: 2070861 us
NO MALLOC TEST:Total Time: 2070670 us
TEST: MA.size=0xfc00000 MA.p0=0x400 MA.p1=0x400 MA.p2 = 0xf MA.p3=0x8   Total Time: 2070760 us
TEST: MA.size=0xfc00000 MA.p0=0x400 MA.p1=0x400 MA.p2 = 0xf MA.p3=0x8   Total Time: 2070743 us
TEST: MA.size=0xfc00000 MA.p0=0x80000 MA.p1=0x300c00 MA.p2 = 0xf MA.p3=0x8   Total Time: 2070400 us
TEST: MA.size=0xfc00000 MA.p0=0x80000 MA.p1=0x300c00 MA.p2 = 0xf MA.p3=0x8   Total Time: 2069991 us
TEST: MA.size=0xfc00000 MA.p0=0x80000 MA.p1=0x300c00 MA.p2 = 0xf MA.p3=0x8   Total Time: 2070151 us
TEST: MA.size=0xfc00000 MA.p0=0x80000 MA.p1=0x300c00 MA.p2 = 0xf MA.p3=0x8   Total Time: 2070864 us
TEST: MA.size=0xfc00000 MA.p0=0x4000 MA.p1=0x8000 MA.p2 = 0xf MA.p3=0x8   Total Time: 2071025 us
TEST: MA.size=0xfc00000 MA.p0=0x4000 MA.p1=0x8000 MA.p2 = 0xf MA.p3=0x8   Total Time: 2070523 us
TEST: MA.size=0xfc00000 MA.p0=0x4000 MA.p1=0x8000 MA.p2 = 0xf MA.p3=0x8   Total Time: 2071002 us
TEST: MA.size=0xfc00000 MA.p0=0x4000 MA.p1=0x8000 MA.p2 = 0xf MA.p3=0x8   Total Time: 2070647 us
TEST: MA.size=0xfc00000 MA.p0=0x400 MA.p1=0x1800 MA.p2 = 0xf MA.p3=0x8   Total Time: 2070921 us
TEST: MA.size=0xfc00000 MA.p0=0x400 MA.p1=0x1800 MA.p2 = 0xf MA.p3=0x8   Total Time: 2070880 us
TEST: MA.size=0xfc00000 MA.p0=0x400 MA.p1=0x1800 MA.p2 = 0x8 MA.p3=0x4   Total Time: 2174259 us
TEST: MA.size=0xfc00000 MA.p0=0x400 MA.p1=0x1800 MA.p2 = 0x8 MA.p3=0x4   Total Time: 2070398 us
TEST: MA.size=0xfc00000 MA.p0=0x400 MA.p1=0x1800 MA.p2 = 0x8 MA.p3=0x4   Total Time: 2070177 us
TEST: MA.size=0xfc00000 MA.p0=0x400 MA.p1=0x1800 MA.p2 = 0x8 MA.p3=0x4   Total Time: 2070135 us
TEST: MA.size=0xfc00000 MA.p0=0x40 MA.p1=0x80 MA.p2 = 0xf MA.p3=0x8   Total Time: 2070610 us
TEST: MA.size=0xfc00000 MA.p0=0x40 MA.p1=0x80 MA.p2 = 0xf MA.p3=0x8   Total Time: 2070840 us
TEST: MA.size=0xfc00000 MA.p0=0x40 MA.p1=0x80 MA.p2 = 0xf MA.p3=0x8   Total Time: 2069852 us
TEST: MA.size=0xfc00000 MA.p0=0x40 MA.p1=0x80 MA.p2 = 0xf MA.p3=0x8   Total Time: 2070493 us
TEST: MA.size=0xfc00000 MA.p0=0x400 MA.p1=0x80 MA.p2 = 0x0 MA.p3=0x0   Total Time: 2070788 us
TEST: MA.size=0xfc00000 MA.p0=0x400 MA.p1=0x80 MA.p2 = 0x0 MA.p3=0x0   Total Time: 2070579 us
TEST: MA.size=0xfc00000 MA.p0=0x400 MA.p1=0x80 MA.p2 = 0x0 MA.p3=0x0   Total Time: 2070287 us
TEST: MA.size=0xfc00000 MA.p0=0x400 MA.p1=0x80 MA.p2 = 0x0 MA.p3=0x0   Total Time: 2070295 us
TEST: MA.size=0xfc00000 MA.p0=0x400 MA.p1=0x80 MA.p2 = 0x0 MA.p3=0x0   Total Time: 2070262 us
TEST: MA.size=0xfc00000 MA.p0=0x400 MA.p1=0x80 MA.p2 = 0x0 MA.p3=0x0   Total Time: 2074280 us
TEST: MA.size=0xfc00000 MA.p0=0x400 MA.p1=0x80 MA.p2 = 0xf MA.p3=0x0   Total Time: 2070501 us
TEST: MA.size=0xfc00000 MA.p0=0x400 MA.p1=0x80 MA.p2 = 0xf MA.p3=0x0   Total Time: 2070464 us
TEST: MA.size=0xfc00000 MA.p0=0x400 MA.p1=0x80 MA.p2 = 0xf MA.p3=0x0   Total Time: 2070883 us
TEST: MA.size=0xfc00000 MA.p0=0x400 MA.p1=0x80 MA.p2 = 0xf MA.p3=0x0   Total Time: 2078512 us
TEST: MA.size=0xfc00000 MA.p0=0x400 MA.p1=0x80 MA.p2 = 0x7 MA.p3=0x0   Total Time: 2070841 us
TEST: MA.size=0xfc00000 MA.p0=0x400 MA.p1=0x80 MA.p2 = 0x7 MA.p3=0x0   Total Time: 2070781 us
TEST: MA.size=0xfc00000 MA.p0=0x400 MA.p1=0x80 MA.p2 = 0x7 MA.p3=0x0   Total Time: 2070219 us
TEST: MA.size=0xfc00000 MA.p0=0x400 MA.p1=0x80 MA.p2 = 0x7 MA.p3=0x0   Total Time: 2070382 us
TEST: MA.size=0xfc00000 MA.p0=0x40000 MA.p1=0x80000 MA.p2 = 0x0 MA.p3=0x8   Total Time: 2070246 us
TEST: MA.size=0xfc00000 MA.p0=0x40000 MA.p1=0x80000 MA.p2 = 0x0 MA.p3=0x8   Total Time: 2070551 us
TEST: MA.size=0xfc00000 MA.p0=0x40000 MA.p1=0x80000 MA.p2 = 0x0 MA.p3=0x8   Total Time: 2070608 us
TEST: MA.size=0xfc00000 MA.p0=0x40000 MA.p1=0x80000 MA.p2 = 0x0 MA.p3=0x8   Total Time: 2070487 us
TEST: MA.size=0xfc00000 MA.p0=0x90a000 MA.p1=0x80000 MA.p2 = 0x0 MA.p3=0x8   Total Time: 2069996 us
TEST: MA.size=0xfc00000 MA.p0=0x90a000 MA.p1=0x80000 MA.p2 = 0x0 MA.p3=0x8   Total Time: 2069780 us  (Note: These last two failed to allocate since P0 was so large.)

I modified my kernel to add:

Code: Select all

	case PS3FB_IOCTL_MEMALLOC:
		{
			struct ps3fb_memalloc_attr arg;

			if (copy_from_user(&arg, argp, sizeof(arg)))
				break;

			dev_dbg(info->device,
				" PS3FB_IOCTL_MEMALLOC(0x%lx,0x%lx,0x%lx,0x%lx,0x%lx)\n",
				arg.size, arg.p0, arg.p1, arg.p2, arg.p3);

			retval = lv1_gpu_memory_allocate(
				     arg.size, arg.p0, arg.p1, arg.p2, arg.p3,
				     &arg.memory_handle, &arg.ddr_lpar );
			printk( "LV1 allocate handle: 0x%lx retval: %d\n", arg.memory_handle,retval );
			copy_to_user(argp, &arg, sizeof(arg));

			if (retval < 0) {
				dev_err(info->device,
					"lv1_gpu_memory_allocate failed (%d)\n",
					retval);
				retval = -EIO;
			}
		}
		break;
	case PS3FB_IOCTL_MEMFREE:
		{
			struct ps3fb_memfree_attr arg;

			if (copy_from_user(&arg, argp, sizeof(arg)))
				break;

			dev_dbg(info->device,
				" PS3FB_IOCTL_MEMFREE(0x%lx)\n",
				arg.memory_handle);
			retval = lv1_gpu_memory_free( arg.memory_handle );
			printk( "LV1 free handle: 0x%lx retval: %d\n", arg.memory_handle, retval );
			if (retval < 0) {
				dev_err(info->device,
					"lv1_gpu_memory_free failed (%d)\n",
					retval);
				retval = -EIO;
			}
		}
		break;
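The two argument structures are not shown in the thread. The sketch below is a guess at their layout, inferred only from the field names used in the code above and from the 0x%lx format strings, which suggest 64-bit fields. The `memalloc_attr_init` helper is invented for the example, and the ioctl numbers are deliberately left out because they are not given anywhere in the post.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout, inferred from the field names used in the thread. */
struct ps3fb_memalloc_attr {
    uint64_t size;            /* in:  bytes requested */
    uint64_t p0, p1, p2, p3;  /* in:  the four undocumented resource values */
    uint64_t memory_handle;   /* out: handle returned by the hypervisor */
    uint64_t ddr_lpar;        /* out: logical partition address */
};

struct ps3fb_memfree_attr {
    uint64_t memory_handle;   /* in:  handle to release */
};

/* Fill the input fields and zero the outputs before the ioctl call. */
static void memalloc_attr_init(struct ps3fb_memalloc_attr *a, uint64_t size,
                               uint64_t p0, uint64_t p1,
                               uint64_t p2, uint64_t p3)
{
    assert(a != NULL);
    memset(a, 0, sizeof *a);
    a->size = size;
    a->p0 = p0;
    a->p1 = p1;
    a->p2 = p2;
    a->p3 = p3;
}
```

Kernel and userspace must agree on this layout exactly, since the driver copies the whole structure in and back out.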

The ps3gpu.c file was modified to read approximately:

Code: Select all

MA.size = (0x0fc00000UL);//720*480*4*4;
MA.p0 = 256*1024;  // these numbers change
MA.p1 = 512*1024;
MA.p2 = 0;
MA.p3 = 8;
// allocate, then report what the hypervisor handed back
printf( "MALLOC RESPONSE: %d\n", ioctl(fb_fd, PS3FB_IOCTL_MEMALLOC, &MA));
printf( "Malloc MemHandle: 0x%lx\n",MA.memory_handle);
printf( "Malloc ddr_lpar: 0x%lx\n",MA.ddr_lpar);

gettimeofday( &T, 0 );
iCS = T.tv_sec;
iCMS = T.tv_usec;

for( i = 0; i < 1000; i++ )
{
  gfx_test( &gpu, 0xfeed0003 );
}

gettimeofday( &T, 0 );
msdiff = (T.tv_sec - iCS)*1000000 + (T.tv_usec - iCMS );  // elapsed microseconds
FILE * f = fopen( "results.txt", "a" );
// the MA fields are printed with %lx elsewhere, so use %lx here too
fprintf( f, "TEST: MA.size=0x%lx MA.p0=0x%lx MA.p1=0x%lx MA.p2 = 0x%lx MA.p3=0x%lx   Total Time: %d us\n", MA.size, MA.p0, MA.p1, MA.p2, MA.p3, msdiff );
fclose( f );

MF.memory_handle = MA.memory_handle;
printf( "FREE RESPONSE: %d\n", ioctl(fb_fd, PS3FB_IOCTL_MEMFREE, &MF));

Additionally, the gfx function was modified so it only executes the initialization code once and does a jump to the beginning of the fifo every frame.
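The elapsed-time calculation in the test harness can be pulled into a small helper. This is the same formula as in the listing, written with a 64-bit result so long runs cannot overflow a 32-bit int; the name `elapsed_us` is made up for the sketch.

```c
#include <assert.h>
#include <sys/time.h>

/* Microseconds between two gettimeofday() samples. */
static long long elapsed_us(const struct timeval *start,
                            const struct timeval *end)
{
    assert(start != 0 && end != 0);
    /* tv_usec may be smaller at the end than at the start; the signed
     * subtraction handles that because tv_sec carries the difference. */
    return (long long)(end->tv_sec - start->tv_sec) * 1000000LL
         + (long long)(end->tv_usec - start->tv_usec);
}
```

In the harness this would be called with one sample taken before the frame loop and one after it.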

The output of the program was:

Code: Select all

vram 264241152 fifo 65536 ctrl 4096
mmap: /dev/fb0 len 16777216
mmap: /dev/ps3gpu_vram len 264241152
mmap: /dev/ps3gpu_fifo len 65536
mmap: /dev/ps3gpu_ctrl len 4096
MALLOC RESPONSE: 0
Malloc MemHandle: 0x5a5a5a5b
Malloc ddr_lpar: 0x7001a0000000
FREE RESPONSE: 0

When I exceeded the normal size, the output was:

Code: Select all

vram 264241152 fifo 65536 ctrl 4096
mmap: /dev/fb0 len 16777216
mmap: /dev/ps3gpu_vram len 264241152
mmap: /dev/ps3gpu_fifo len 65536
mmap: /dev/ps3gpu_ctrl len 4096
MALLOC RESPONSE: -1
Malloc MemHandle: 0x0
Malloc ddr_lpar: 0x0
FREE RESPONSE: -1

So, provided I'm doing everything right and the program output looks good, it doesn't look like simply doing an alloc, running GPU code, then freeing has any effect on speed. (But don't take this to be absolute truth, since I have pretty much no idea what I'm doing.)
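That conclusion can be backed with a quick calculation over the posted numbers. The sketch below averages the two "NO MALLOC" runs and the four runs with the largest parameter set (p0=0x80000, p1=0x300c00, p2=0xf, p3=0x8), all copied from the results listing; the helper names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of `n` timings in microseconds. */
static double mean_us(const long *samples, size_t n)
{
    double sum = 0.0;
    size_t i;

    assert(n > 0);
    for (i = 0; i < n; i++)
        sum += (double)samples[i];
    return sum / (double)n;
}

/* Relative change of `value` against `baseline`, in percent. */
static double percent_change(double baseline, double value)
{
    return (value - baseline) * 100.0 / baseline;
}

/* Timings copied from the results listing (us per 1000 frames). */
static const long baseline_us[]   = { 2070861, 2070670 };
static const long max_params_us[] = { 2070400, 2069991, 2070151, 2070864 };
```

The means are 2,070,765.5 us and 2,070,351.5 us, a difference of about 0.02%, which is well inside the run-to-run scatter visible in the listing.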

IronPeter · Post by **IronPeter** » Fri Nov 16, 2007 5:18 am

A negative result is also a result :).

I'll dig into it more deeply. Thank you for the hard work.

I can explain why this memory allocation routine is interesting to me.

NV40 class hardware has some "channels" of L3 cache memory ( 16 + 8 ? ). Each channel can be mapped to a memory region; you can assign the amount of cache memory dedicated to that channel, define compression flags, etc...

I think that these performance tunings are important for large scenes with posteffects, HDR rendering, etc. Not critical for now, only critical if we want to beat commercial titles :).

cnlohr · Post by **cnlohr** » Fri Nov 16, 2007 5:43 am

Wouldn't it be necessary to allocate small chunks for each thing (depth, texture, framebuffer) and then set the offsets (NV40TCL_ZETA_OFFSET, NV40TCL_COLOR0_OFFSET, NV40TCL_TEX_OFFSET, etc.)? I am certainly not doing that yet.

Should I try to run another test with the last two values being different, and allocating two separate large chunks (one for FB, other for Z)? Or does that have nothing to do with it?

IronPeter · Post by **IronPeter** » Fri Nov 16, 2007 5:52 am

You may try, but the performance difference must be noticeable with "tiled memory" in any case.

It seems like a broken interface from Sony.

Not a really critical thing. It is better to tame DXT textures, vertex streams, shaders.

cnlohr · Post by **cnlohr** » Sat Nov 17, 2007 5:06 pm

Most of my experience is in higher level programming, IE game engines, games, tools, etc.

What has interested me most right now is trying to come up with some assembler for the NV Fragment/vertex stuff. Or writing a higher level C++ engine that on its back-end directly performs the RING_ calls.

Since the first thing is kinda required for the second, I guess I have more interest in the first.

I noticed that the Nouveau dumps,

http://nouveau.freedesktop.org/tests/g7 ... ram.txt.gz
and
http://nouveau.freedesktop.org/tests/g7 ... am2.txt.gz

are both very complete in the dump analysis. I was wondering if anyone knew offhand what that system is like, and if I could use it to my advantage, instead of manual transcoding. *EDIT: the renouveau CVS has a treasure trove of awesome stuff*

I have particular interest in supporting cgc (yes, I know it's intel-architecture only) simply because in my many run-ins with it, in almost every case, the assembly it put out was extremely good and tight. (Heck, once or twice, it even out-optimized me)

Being able to simply transcode the nv30 to nvidia binary seems like something that would be extremely useful. I noticed some talking about it in the other thread. Has anyone really dug into this?

IronPeter · Post by **IronPeter** » Sat Nov 17, 2007 6:54 pm

>Has anyone really dug into this?

Nouveau did. There is a link in the other thread to the full-featured NV_fp / NV_vp assembler. The Nouveau project has many branches; you must dig through these branches for information.

The fragment program assembler is simple: operation opcodes, src swizzles and result transforms, register opcodes, constants, stop bit, number of temporary registers. That's all.
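As one concrete piece of such an assembler, a source swizzle can be packed as four 2-bit component selectors. The sketch below shows that generic packing only; it is not the actual NV40 instruction bit layout, which the thread does not describe, and the names are invented.

```c
#include <assert.h>
#include <stdint.h>

enum comp { COMP_X = 0, COMP_Y = 1, COMP_Z = 2, COMP_W = 3 };

/* Pack four 2-bit selectors into one byte, first lane in the low bits. */
static uint8_t swizzle_pack(enum comp x, enum comp y, enum comp z, enum comp w)
{
    return (uint8_t)(x | (y << 2) | (z << 4) | (w << 6));
}

/* Read back the selector for one lane (0..3). */
static enum comp swizzle_get(uint8_t swz, unsigned lane)
{
    assert(lane < 4);
    return (enum comp)((swz >> (lane * 2)) & 3u);
}
```

A source operand written as ".xyzw" packs to 0xE4 under this scheme, and ".xxxx" packs to 0.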

You may start with the nouveau assembler ( MIT licensed ). Or write your own assembler; with yacc or antlr it is relatively easy.

The relatively hard thing is register compactification. You must reduce the number of temporary registers on NV40 hardware. Any program must be annotated with that number during setup.

You are welcome to commit to libps3rsx ( if you agree with the MIT license terms ). You may send patches with your animated demo to me and I'll commit these patches. Or I can ask the admin to grant you svn write access.

I want to reinstall Linux, so my ps3 will be closed for coding for a few days. After that I'll refactor libps3rsx into a { more } usable form.

cnlohr · Post by **cnlohr** » Sun Nov 18, 2007 4:49 am

Aah,

http://gitweb.freedesktop.org/?p=mesa/m ... ri/nouveau

I haven't looked into it too deeply but I am really confused already. I guess it will just take time. As far as I can tell though (which isn't very far), it looks pretty much fully featured for a shader assembler.

About 70% of everything I do is MIT, 20% New BSD, and 10% GNU. So, I have no problem putting my work under the MIT license, especially since something like AFL or GNU would put unacceptable restrictions on the work.

Why are you trying to do register compactification if we're already dealing with the assembly code? Wouldn't whatever compiler that takes us from high level (CG or GLSL) code handle the compactification for us?

And you say the number has to be reduced -- does the RSX have less temp registers than other NV40 chipsets or something?

If I can strip out and strip down the shader stuff from Nouveau effectively, then I may want to either tar you the package or have write access. I expect it to take me about a week to get a good handle on the code.

If anyone else wants to work on this as well and does it better or faster than me, I won't feel bad if my work doesn't get used.

EDIT: PS: I am starting to get excited about the prospect of writing a C++ game engine that isn't middleware.

IronPeter · Post by **IronPeter** » Sun Nov 18, 2007 5:54 pm

Ok, if we trust CG we do not need register optimization.

>And you say the number has to be reduced -- does the RSX have less temp registers than other NV40 chipsets or something?

I do not think so. There is a pretty full article about NV40 hardware http://www.digit-life.com/articles2/gff ... rt1-a.html

Now we are going to touch upon the most interesting facts. First, in contrast to earlier NV3Xs that only had one quad processor taking a block of four pixels (2x2) per clock, we now have four such processors. ... Then, each processor still has its own quad round queue (see DX Curent). Consequently, they also execute pixel shaders similarly to the way it's done in NV3X: more than a hundred quads are run through one setting (operation) followed by a setting change according to the shader code.

So the length of the round queue is ( total number of HW registers ) / ( number of temp registers in fp annotation ). Strictly speaking it is not true that a larger round queue means better performance ( though in general it is so ). A larger round queue means good latency hiding but poor texture caching.
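The relationship in that last paragraph is a plain integer division. A minimal sketch follows; note that the total register count of the hardware is not given in the thread, so the value 1024 in the usage note is purely a placeholder.

```c
#include <assert.h>

/* Quads in flight = total hardware registers / temps used per program. */
static unsigned round_queue_length(unsigned total_hw_regs,
                                   unsigned temps_per_program)
{
    assert(temps_per_program > 0);
    return total_hw_regs / temps_per_program;
}
```

With a hypothetical pool of 1024 registers, a program annotated with 4 temporaries would get a queue of 256 quads, and one annotated with 8 would get 128. This is why compacting registers matters even when the program already works.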

cnlohr · Post by **cnlohr** » Tue Nov 20, 2007 1:28 pm

Ok, it looks like my plan of attack is to use most of the header information from the nouveau mesa driver, parse the files myself (since before, mesa did that), and pop out a linked-list of sorts of torn apart shader.

In doing so, it will get all of the opcodes collected in that list. Each element can represent one instruction. For instructions that cannot fit in a single opcode, I will generate multiple nodes, and string them together.
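A minimal sketch of the kind of list described above, assuming one node per hardware instruction with a destination and up to three sources. The field names and the integer encoding of operands are placeholders, not taken from the nouveau headers.

```c
#include <assert.h>
#include <stdlib.h>

/* One hardware instruction in the torn-apart shader. */
struct shader_node {
    int opcode;
    int dst;
    int src[3];
    struct shader_node *next;
};

struct shader_list {
    struct shader_node *head;
    struct shader_node *tail;
    size_t count;
};

/* Append one instruction; returns the new node or NULL on allocation failure. */
static struct shader_node *shader_list_append(struct shader_list *l, int opcode,
                                              int dst, int s0, int s1, int s2)
{
    struct shader_node *n = malloc(sizeof *n);

    assert(l != NULL);
    if (n == NULL)
        return NULL;
    n->opcode = opcode;
    n->dst = dst;
    n->src[0] = s0;
    n->src[1] = s1;
    n->src[2] = s2;
    n->next = NULL;
    if (l->tail != NULL)
        l->tail->next = n;
    else
        l->head = n;
    l->tail = n;
    l->count++;
    return n;
}

/* Release every node and reset the list to empty. */
static void shader_list_free(struct shader_list *l)
{
    struct shader_node *n = l->head;

    while (n != NULL) {
        struct shader_node *next = n->next;
        free(n);
        n = next;
    }
    l->head = l->tail = NULL;
    l->count = 0;
}
```

A source instruction that needs several hardware opcodes would simply call `shader_list_append()` once per opcode, which keeps the expanded nodes adjacent in the stream.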

I'm going to focus on fragment programs first, then vertex programs.

I don't know how fully functional my code will be to begin with. But hopefully in a week or two, I will be able to put shader code in one side, and on the other side I will get a series of these structures that I can synthesize the opcode stream with.

One major note: Do you want me to do this using the NV assembly shading language or the GLSL Assembly shading language? Everything in Nouveau is set up for the GLSL Assembly shading language so it would be easier to code. Note that the NV asm shading language does have a little bit more expanded functionality.

I'm currently working using the GLSL Asm shading language, since that's where most of the work has been done for me.

Additionally -- if I would be working in the environment of the full Nouveau-Mesa implementation, this process would be much easier since all of the shader parsing and opcodizing would be done for us, are you sure we don't want to try to mod the Nouveau-Mesa drivers?

I'd understand the cons of being slower having more overhead, doing a lot of stuff we don't want, etc. So, I can understand why you probably wouldn't want to do it, but I'm just throwing it out there.

IronPeter · Post by **IronPeter** » Tue Nov 20, 2007 5:23 pm

Hi. Your only chance to survive is to stay at a very low level.

I have 10 years of experience with Mesa. When Voodoo2 launched, the roadmap of Mesa was "HW accelerated Quake in a year". Now the roadmap of Mesa/Nouveau is "HW accelerated Quake3 in a year".

You will die in bugfix with GLSL.

You will die with high-level concepts also.

One small example. A high-level interface has a SetPixelShaderConstant function. That's ok, but NV40 hw does not have pixel shader constants. Pixel shader constants are embedded in the pixel shader body. To set a constant you must rewrite all of its locations in the shader microcode ( one after each use of that constant ). You have two possibilities.

1.) Make fragment shaders double-buffered and patch microcode by CPU. What about many SetPixelShaderConstant calls per frame? Die...

2.) Patch shader constants by GPU. Via DMA or 2D blitting. Die...

The solution is simple. Keep synchronization on the user side. The user can use many instances of a pixel shader, patching by CPU and fencing. The user can use one instance for an immutable shader. The user can patch shaders via blit.
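The CPU patching option can be sketched as follows, assuming the assembler records the word index of every embedded copy of a constant and that each copy is four consecutive 32-bit floats. Both of those are assumptions made for this sketch; any byte or word reordering the hardware expects is not modelled, and the function name is invented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Overwrite every embedded copy of one constant in a shader image.
 * `offsets` holds the word index of each copy; each copy is assumed
 * to occupy four consecutive 32-bit words.
 */
static void patch_constant(uint32_t *code, size_t code_words,
                           const size_t *offsets, size_t n_offsets,
                           const float value[4])
{
    size_t i;

    assert(sizeof(float) == sizeof(uint32_t));
    for (i = 0; i < n_offsets; i++) {
        assert(offsets[i] + 4 <= code_words);
        memcpy(&code[offsets[i]], value, 4 * sizeof(float));
    }
}
```

With fencing on the user side, a call like this would only be made on a shader instance the GPU is known to have finished with.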

cnlohr · Post by **cnlohr** » Wed Nov 21, 2007 7:54 am

So, don't code using mesa but do code for GLSL ARB ASM?

IronPeter · Post by **IronPeter** » Wed Nov 21, 2007 3:45 pm

I think, do code for NV_FRAGMENT/VERTEX_PROGRAM. It is good low-level.

CG has NV_FRAGMENT/VERTEX_PROGRAM output, so we can have full toolchain.

ps2devman · Post by **ps2devman** » Wed Nov 21, 2007 5:33 pm

Iron Peter, do you feel a high level language is missing out there?
I mean one, that is well adapted to describe what will become both NV_FRAGMENT/VERTEX_PROGRAM binary micro-code and the code on CPU (or SPU) side that synchronizes well with it.

Feel free to keep on being our locomotive and just describe this language to create the binary micro-code and syncing code matching it. All coders will help to build the compiler of such language once spec exists.

Maybe, Cgc may insert itself as a lower part of the micro-code side chain.

IronPeter · Post by **IronPeter** » Wed Nov 21, 2007 6:08 pm

ps2devman, development is now in the very early stage.

At this moment I know that the NV_FP/NV_VP language is very close to the hardware. It is a very good idea to write an assembler for it ( not very hard ).

Also I want some high-level features. These things will be critical in production code.

For fragment programs it is static dispatching. For example, we want to have two versions of a shader: with fog and without fog. A good idea is to precompile these 2 versions of the shader. Also there are many other switches for a material like specular mapping, bump mapping, envmapping, selfillum, etc. You will have a multidimensional matrix of precompiled shaders for many switches. Without this feature you will be unable to develop fast production code with many materials/shaders.
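One common way to index such a matrix of precompiled variants is a bitmask with one bit per switch, used directly as a table index. The sketch below assumes the five switches named above; the type and function names are invented, and the table is left empty where real code would store the precompiled microcode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One bit per material switch named in the post. */
enum material_flag {
    MAT_FOG       = 1u << 0,
    MAT_SPECULAR  = 1u << 1,
    MAT_BUMP      = 1u << 2,
    MAT_ENVMAP    = 1u << 3,
    MAT_SELFILLUM = 1u << 4
};

#define MAT_FLAG_COUNT    5
#define MAT_VARIANT_COUNT (1u << MAT_FLAG_COUNT)

/* A precompiled shader: microcode words plus their count. */
struct shader_variant {
    const uint32_t *microcode;
    size_t words;
};

/* One slot per combination of switches, filled at build or load time. */
static struct shader_variant variant_table[MAT_VARIANT_COUNT];

/* Static dispatch: the flag combination is the table index. */
static struct shader_variant *select_variant(unsigned flags)
{
    assert(flags < MAT_VARIANT_COUNT);
    return &variant_table[flags];
}
```

Five switches already give 32 variants, so the table doubles with every switch added; that growth is the cost of avoiding run-time branching in the shader.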

For the vertex pipeline it is SPU geometry processing. We want to handle skeletal animation, vertex lighting, back face culling on the SPU. It is very high-level code. I can develop such a library in the future. A few months ago I coded full-featured SPU driven skeletal animation with COLLADA export; it worked just perfectly on real in-game models ( on software MesaGL :).

So do not worry about the high level, I have some homework. Our specs are NV_VP/NV_FP now.

IronPeter · Post by **IronPeter** » Wed Nov 21, 2007 6:24 pm

PS. Of course, I can write a header file for the NV_VP/NV_FP assembler, with interfaces for shader compiling, shader setting and constant setting. If cnlohr wants, I can do it.
