Revisiting GE performance

Discuss the development of new homebrew software, tools and libraries.

Moderators: cheriff, TyRaNiD

chp
Posts: 313
Joined: Wed Jun 23, 2004 7:16 am

Revisiting GE performance

Post by chp »

Since most of the GE has been explored in one way or another, I had a look back at the assumptions that both I and others have made, and I've found some interesting results.

First of all, the re-filling of the texture page seems to work quite differently from what I had assumed. By rewriting large parts of the blit sample I have managed to take more precise measurements than the ones I did 6 months ago (the results on which I based a lot of my conclusions). Refilling seems to be most efficient with 32-pixel wide stripes, not 64, and this applies to all pixel formats. Someone tested and mentioned this on the forums recently; apologies to that someone (can't remember who it was), you were RIGHT. :)

The maximum fillrate seems to vary quite a bit depending on the pixel format selected. In 16-bit it seems to level out at 523MB/s, or 2100 frames per second (35 fullscreen passes per frame), but 32-bit manages to push 716MB/s worth of pixels, or 1438 frames per second (24 fullscreen passes per frame). Enabling or disabling depth writes/reads does nothing to these results, which implies that the depth transactions are performed anyway and only the result writes are suppressed. Also, changing the strip width when filling doesn't seem to do much either, which means that the PSP either has a smarter page-write, or I'm missing something. :) This test was done with textures disabled, because with texturing you won't ever reach these numbers (more on that later).

Another surprise (for me at least): when using 16-bit or 32-bit textures, VRAM->VRAM blits are FASTER with linear textures than with swizzled textures. This means that rendertargets carry no speed penalty if dealt with correctly. Swizzling does have an impact with 4- and 8-bit CLUT, but when using those formats and copying optimally, you are not limited by the texture cache. Swizzling anything read from system RAM is a must.

Texture reads seem to max out at around 289MB/s, but unless you use 32-bit RGBA you'll never reach these speeds, since pixel-shading limitations seem to kick in once memory reads get really optimal. This means that texturing cuts pixel throughput by more than half, so you should avoid it if you can (you can do a lot with non-textured geometry, especially on lower LODs or distant geometry).

Doing no optimizations at all (no striping, no swizzling, textures in system RAM) gives you a maximum speed of 7MB/s, as figured out earlier. DO NOT DO THIS! Swizzling will boost it to 30MB/s, but seriously, ignoring the texture cache is not an option.

These tests were done on a 32-bit drawbuffer, with a strip-width of 32:

4-bit CLUT
  • VRAM->VRAM:
    • Linear: 80MB/s (FPS: 1295)
    • Swizzled: 97MB/s (FPS: 1570)
  • RAM->VRAM:
    • Linear: 48MB/s (FPS: 779)
    • Swizzled: 83MB/s (FPS: 1337)
8-bit CLUT
  • VRAM->VRAM:
    • Linear: 116MB/s (FPS: 938)
    • Swizzled: 155MB/s (FPS: 1247)
  • RAM->VRAM:
    • Linear: 54MB/s (FPS: 442)
    • Swizzled: 128MB/s (FPS: 1034)
16-bit RGBA
  • VRAM->VRAM:
    • Linear: 233MB/s (FPS: 940)
    • Swizzled: 220MB/s (FPS: 885)
  • RAM->VRAM:
    • Linear: 54MB/s (FPS: 217)
    • Swizzled: 167MB/s (FPS: 672)
32-bit RGBA
  • VRAM->VRAM:
    • Linear: 290MB/s (FPS: 582)
    • Swizzled: 283MB/s (FPS: 569)
  • RAM->VRAM:
    • Linear: 57MB/s (FPS: 115)
    • Swizzled: 204MB/s (FPS: 410)
Note that the swizzled 4-bit CLUT VRAM->VRAM blit is faster than just rendering flatshaded sprites on top of the screen, hitting a massive 1570 frames per second where the non-textured version only hits 1438. This could perhaps be because it does not have to do any interpolation of color components and can just output the texel color, but I'm not certain.

From these numbers, swizzled 4/8-bit CLUT textures are the way to go for static textures, while linear 16/32-bit textures should be used for rendertargets. I have not done any tests on compressed textures yet, but the sample has been committed to SVN (samples/gu/speed); you are welcome to extend it (or find errors in my conclusions :D).

Enjoy!
GE Dominator
ector
Posts: 195
Joined: Thu May 12, 2005 10:22 pm

Re: Revisiting GE performance

Post by ector »

chp wrote:Another surprise (for me atleast) is when using 16-bit or 32-bit textures, VRAM->VRAM blits is FASTER when using linear textures, than when using swizzled textures.
This changes when you are doing a rotated blit, I assume? It seems to...
http://www.dtek.chalmers.se/~tronic/PSPTexTool.zip Free texture converter for PSP with source. More to come.
Shazz
Posts: 244
Joined: Tue Aug 31, 2004 11:42 pm
Location: Somewhere over the rainbow

Post by Shazz »

Good job chp !

And it answers my past question (http://forums.ps2dev.org/viewtopic.php?t=4368) !

I did some tests too (not so precise, unfortunately), and yes, I also found that slices of 64 pixels are sometimes not a good choice (5551). But not always...

Something interesting to test (I found this while coding stuff): it is really better to do multiple texture blits into a 256x256 rendertarget buffer (and 128x128 is not that ugly, thanks to bilinear filtering) and then blit that final texture, filtered, to fullscreen (480x272) using good slices. It really seems that 480x272 is a bad size for the GU. I did not test it, but I don't think the number of frames you can blit at 480x272 is proportional to the number you can blit at 256x256 (256x256 gets a lot more, I think).

But where I also saw performance improve/decrease was:
- where I reserve memory for the vertex list (static global, local, or GuMemory)
- whether I call sceGumDrawArray one time or multiple times with smaller lists

my 2 cents....
Last edited by Shazz on Sat Dec 31, 2005 12:53 am, edited 1 time in total.
- TiTAN Art Division -
http://www.titandemo.org
Shazz
Posts: 244
Joined: Tue Aug 31, 2004 11:42 pm
Location: Somewhere over the rainbow

Post by Shazz »

I took a few minutes to run all the tests... the results are interesting :D

Forgive me, I just took the integer value :D
Results are in frames per second, red is the best value, texture is 512x512

[image: table of FPS results per format, slice width and swizzling]

So 64-pixel slices still seem to be the best value except for 4-bit CLUT... And swizzling is required for CLUT modes... even in VRAM...

What to conclude ?? :D
- TiTAN Art Division -
http://www.titandemo.org
chp
Posts: 313
Joined: Wed Jun 23, 2004 7:16 am

Re: Revisiting GE performance

Post by chp »

ector wrote:
chp wrote:Another surprise (for me atleast) is when using 16-bit or 32-bit textures, VRAM->VRAM blits is FASTER when using linear textures, than when using swizzled textures.
This changes when you are doing a rotated blit, I assume? It seems to...
Perhaps; it depends on how the cache fill really works and whether buffers in VRAM are treated differently. Without any official info (and for anyone reading: NO, I don't want any, so don't even try to send it), what we do now is mostly guesswork to get closer to the truth.
GE Dominator
chp
Posts: 313
Joined: Wed Jun 23, 2004 7:16 am

Post by chp »

Shazz wrote:So 64 pixels slices seems still to be the best value except for 4Bit clut... And swizzle is required for clut modes... even in VRAM...

What to conclude ?? :D
480x272 could be a bad size since it clips the last 64-width block in half, which makes 32-texel slices better when going fullscreen directly.

Have you seen the presentation about the PSP from Breakpoint 2005? If not, I really recommend it. Look for 'bp05_seminars_-_Markus_TRiNiTY_Glanzer_-_PSP_unleash_the_beast_-_xvid.avi'; it contains some nice info about improving performance, not related to textures but more about vertex performance (optimal vertex sizes, RAM vs. VRAM etc).
GE Dominator
ufoz
Posts: 86
Joined: Thu Nov 10, 2005 2:36 am
Location: Tokyo

Post by ufoz »

chp wrote:Have you seen the presentation about the PSP from Breakpoint 2005? If not, I really recommend it. Look for 'bp05_seminars_-_Markus_TRiNiTY_Glanzer_-_PSP_unleash_the_beast_-_xvid.avi' , it contains some nice info about improving performance not relating to textures but more on vertex performance (optimal vertex sizes, ram vs. vram etc).
Thanks for this, I just watched it and some of it was pretty informative, even though the guy didn't present it very well.
My favorite quote would be "The only thing you can't do from vram is execute code" - maybe the 2.0 tiff exploit would like to have a word with this guy :)
Shazz
Posts: 244
Joined: Tue Aug 31, 2004 11:42 pm
Location: Somewhere over the rainbow

Post by Shazz »

chp wrote:
480x272 could be a bad size since it clips the last 64-width block in half, which makes 32-texel slices better when going fullscreen directly.

Have you seen the presentation about the PSP from Breakpoint 2005?
Yes really makes sense... I'll try to test the stuff in 448x256, just to see... if it's linear or not.

And thx for the BP05 video !!! Damn, I knew I should have gone to the Breakpoint 2005 :D Makes me think that emoon never released his demo :( : http://forums.ps2dev.org/viewtopic.php? ... breakpoint

Hmmm, quite off topic, but... who is the last guy who asks a question??? I bet on NeoV, Nagra or emoon... Damn... using the DMAC was so fun :D
- TiTAN Art Division -
http://www.titandemo.org
jsgf
Posts: 254
Joined: Tue Jul 12, 2005 11:02 am

Re: Revisiting GE performance

Post by jsgf »

chp wrote:In 16-bit it seems to level out at 523MB/s or 2100 frames per second (35 fullscreen passes per frame), but 32-bit manages to push 716MB/s worth of pixels, or 1438 frames per second (24 fullscreen passes per frame).
This doesn't look right. Surely 716Mbyte/s should get more fps than 523Mbyte/s?
Enabling or disabling depthwrites/reads do nothing to these results which implies that they are performed anyway, but it's only the results that do not happen.
That would be odd. What if you don't allocate/configure a depth buffer? Does it still do dummy depth-buffer transactions? Or are depth tests just free because they're hidden in other latencies? It would be interesting to see how enabling depth testing but making the depth test always fail would affect the results (particularly when using texturing).
Another surprise (for me atleast) is when using 16-bit or 32-bit textures, VRAM->VRAM blits is FASTER when using linear textures, than when using swizzled textures. This means that rendertargets give no speed-penalty if dealt with correctly.
That's great. I've been trying to work out how to do a pipelined swizzle in hardware, but if it doesn't matter for VRAM textures, then I won't bother. Also, I wonder if it means that PSPGL should only swizzle system memory textures and unswizzle them in VRAM...
Texture-reads seems to be around a maximum of 289MB/s, but unless you use 32-bit RGBA you'll never reach these speeds, since pixel-shading limitations seem to kick in when you get really optimal memory reads. This means that texturing cuts pixel-throughput more than in half, so if you really can you should avoid it (you can do a lot with non-textured geometry, especially if you can use it on lower lods or distant geometry).
Did you measure how much blending affects fillrate?

Cool work!
chp
Posts: 313
Joined: Wed Jun 23, 2004 7:16 am

Re: Revisiting GE performance

Post by chp »

jsgf wrote:This doesn't look right. Surely 716Mbyte/s should get more fps than 523Mbyte/s?
This is destination write speed (not read speed), and if you have to write twice the data per pixel (32-bit vs. 16-bit), you can't get more frames per second unless that speed at least doubles.
jsgf wrote: That would be odd. What if you don't allocate/configure a depth buffer? It still does dummy depth-buffer transactions? Or depth tests are just free because they're hidden in other latencies? It would be interesting to how enabling depth testing but always making the depth test fail would affect the results (particularly when using texturing).
It may be that the page-buffer for depth tests is always filled, but not read from/written to depending on the result. These are very synthetic results though; it may be that it doesn't cost anything because the writes were optimal and that hid the depth-buffer refills.
jsgf wrote:Did you measure how much blending affects fillrate?
I think that Markus Glanzer mentioned that blending is free (and it was free on the PS2 as well), so I don't think it'll have an impact. I don't have that presentation here though, so I'll have to look into that later today.
GE Dominator
jsgf
Posts: 254
Joined: Tue Jul 12, 2005 11:02 am

Post by jsgf »

So these results apply to blitting full-screen images to the screen. But how do these results relate to a typical use of drawing textured meshes of triangles?

Using tiled blitting certainly gives huge performance wins. It seems to me that this will happen naturally if you're drawing relatively small triangles, and make sure adjacent triangles have similar UV coords. Am I right? Is it simply a matter of making sure your triangles don't get too big (which is a good thing anyway, because of the clipping issues)?
chp
Posts: 313
Joined: Wed Jun 23, 2004 7:16 am

Post by chp »

jsgf wrote:So these results apply to blitting full-screen images to the screen. But how do these results relate to a typical use of drawing textured meshes of triangles?

Using tiled blitting certainly gives huge performance wins. It seems to me that this will happen naturally if you're drawing relatively small triangles, and make sure adjacent triangles have similar UV coords. Am I right? Is it simply a matter of making sure your triangles don't get too big (which is a good thing anyway, because of the clipping issues)?
You still have the texture cache to care for. Revisiting the BP05 presentation confirmed the 8kB texture cache (you see something new every time you watch that one :D), but the refill logic is 4-way set associative (the PS2 is direct-mapped with only one page), which relaxes the restrictions (since it won't refill all of the cache when you cross the boundaries). From that information they also suggest that small polygons aren't a requirement (probably because the T&L pipe is a bit weak once you start doing more advanced stuff), but thanks to the clipping issues, keeping them reasonably small is a good idea.

Keeping the UVs within a reasonable range to help the texture cache is a must too, I'd think, but it might not be as high a priority as it is on the PS2, and keeping down the number of transformed vertices should be taken into account as well. I really should take a look at whether there is any index cache at all, which could help in that case...
GE Dominator
jsgf
Posts: 254
Joined: Tue Jul 12, 2005 11:02 am

Post by jsgf »

Someone mentioned that using indexing has a heavy performance hit, but without details. I've been meaning to compare transform throughput with different sized vertices, indexing by strips, indexing linearly, indexing randomly and unindexed. But I haven't got around to it...

Is the BP05 presentation available somewhere?

BTW, I optimised the optimised blit a bit by issuing a single SPRITES command rather than one per quad. It made a 1 Mbyte/s improvement to the 32bpp test (ie, < 1%), and nothing noticeable in any other test. As you'd expect.

I'm also wondering if doing a single blit then shutting down the pipeline is really measuring what we think it is, or if the startup/sync costs are making a significant contribution. Repeating the test to amortize the startup/shutdown cost would be a reasonable experiment...
chp
Posts: 313
Joined: Wed Jun 23, 2004 7:16 am

Post by chp »

jsgf wrote:Someone mentioned that using indexing has a heavy performance hit, but without details. I've been meaning to compare transform throughput with different sized vertices, indexing by strips, indexing linearly, indexing randomly and unindexed. But I haven't got around to it...
Yeah, perhaps reading indices has more impact than any caching saves (disrupting the memory transfers?). I could probably whip together a simple sample where you can control the size of each vertex and of the index, to measure raw performance. That could be good for figuring out how you want to optimize your vertices.
jsgf wrote:Is the BP05 presentation available somewhere?
This should give you a good link to the presentation.
jsgf wrote: BTW, I optimised the optimised blit a bit, by issuing a single SPRITES command rather than one per quad. It made a 1 Mbyte/s improvement to the 32bpp test (ie, < 1%), and nothing noticable for any other tests. As you'd expect.
It might be that the 32bpp test is the only one that is bandwidth-limited; the rest seem to hit their limits in other parts of the pipeline.
jsgf wrote:I'm also wondering if doing a single blit then shutting down the pipeline is really measuring what we think it is, or if the startup/sync costs are making a significant contribution. Repeating the test to amortize the startup/shutdown cost would be a reasonable experiment...
Running more than one blit per pass could possibly affect performance, that should be rather easy to test.
GE Dominator