Vertex Performance (Revisiting GE)

chp
Posts: 313
Joined: Wed Jun 23, 2004 7:16 am

Post by chp »

Ok, now the time has come for vertex performance, and as per usual I have a few interesting numbers and a sample that allows you to experiment for yourself.

First up: DO NOT USE INDEX BUFFERS. They are the succubus of performance for your applications. It has been suspected for a while, and people on the forum have mentioned getting a lot more speed by not using index buffers. Well, here are the cold, hard facts: if your vertices are 12 bytes in size (optimal size) and the vertex buffer is in VRAM, your performance drops from 13.67 million vertices per second to 6.31(!) million vertices per second. That's a 53% performance drop! And it gets even worse if you use system RAM: then it drops to 3.16(!!!) million vertices per second. That's 76 per cent, and a real system killer. Why Sony added index buffers is a complete mystery, since they really suck.
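The percentages quoted above can be re-derived from the raw rates. The following is a minimal sketch in plain C (nothing PSP-specific); the function name `percent_drop` is made up for illustration and is not part of the sample.

```c
#include <assert.h>

/* Relative throughput loss, in percent, when going from `baseline`
   to `measured`. Both values use the same unit (here: million
   vertices per second). Hypothetical helper, not from the sample. */
double percent_drop(double baseline, double measured)
{
    assert(baseline > 0.0);
    return (baseline - measured) / baseline * 100.0;
}
```

Called as `percent_drop(13.67, 6.31)` this gives roughly 53.8, and `percent_drop(13.67, 3.16)` gives roughly 76.9, in line with the 53% and 76 per cent figures in the post.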

Now, after this public announcement, here are a few numbers:

From the test I have running, which kicks as many batches of 1536 vertices as possible, the maximum T&L rate that the PSP can push is 13.67 million vertices per second. This is nowhere close to the 35 million vertices per second that has been stated before, so these numbers aren't final (I will research some more to make sure I haven't broken something). This kind of performance is reached when the vertex is around 8-12 bytes in size, and it seems to be very memory-sensitive: once you grow beyond that size, performance starts dropping rapidly, ending at 0.48mv/s (544 bytes with full skinning & morphing).
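To put the peak rate into per-frame terms, here is a small sketch in plain C. The helper names are hypothetical, and the 60 frames per second used in the note below is an assumption for illustration; the post does not state a frame rate.

```c
#include <assert.h>

/* Vertices available per frame at a given throughput (million
   vertices per second) and frame rate (frames per second).
   Hypothetical helper for illustration. */
double vertices_per_frame(double mverts_per_s, double fps)
{
    assert(fps > 0.0);
    return mverts_per_s * 1.0e6 / fps;
}

/* Number of fixed-size batches per second that a throughput
   corresponds to. */
double batches_per_second(double mverts_per_s, int verts_per_batch)
{
    assert(verts_per_batch > 0);
    return mverts_per_s * 1.0e6 / verts_per_batch;
}
```

At 13.67mv/s and an assumed 60 fps, `vertices_per_frame` comes out at roughly 228,000 vertices per frame, and `batches_per_second(13.67, 1536)` at roughly 8,900 batches per second. These are upper bounds for the raw transform test only, since no pixels were rendered.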

Raw numbers for transform (vram-numbers in parentheses):

4 bytes: 13.03mv/s (13.67mv/s)
8 bytes: 13.67mv/s (13.67mv/s)
10 bytes: 13.67mv/s (13.67mv/s)
12 bytes: 13.67mv/s (13.67mv/s)
16 bytes: 13.62mv/s (13.67mv/s)
20 bytes: 11.76mv/s (13.67mv/s)
24 bytes: 11.76mv/s (12.81mv/s)
28 bytes: 9.81mv/s (11.37mv/s)
32 bytes: 7.36mv/s (10.25mv/s)
36 bytes: 6.54mv/s (9.31mv/s)

Lighting, skinning, and morphing were all disabled for these numbers.
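One way to read the table above is to multiply vertex size by vertex rate, which gives the amount of vertex data fetched per second. The sketch below does that in plain C; the function name is made up, and treating the product as fetch bandwidth is an interpretation, not something the test measures directly.

```c
#include <assert.h>

/* Vertex data fetched per second, in MB/s (decimal megabytes):
   bytes per vertex times million vertices per second.
   Hypothetical helper for illustration. */
double vertex_fetch_mb_per_s(double vertex_bytes, double mverts_per_s)
{
    assert(vertex_bytes > 0.0 && mverts_per_s >= 0.0);
    return vertex_bytes * mverts_per_s;
}
```

For the system RAM column, the 20, 32 and 36 byte rows all come out at about 235 MB/s (for example 36 * 6.54 = 235.44), while the VRAM column reaches about 335 MB/s at 36 bytes. That would fit the idea that large vertices are limited by memory fetch rather than by the transform itself, but it is only an inference from these ten rows.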

Skinning & Morphing

These operations are really power-hungry and should be used with care. The numbers below are for optimal vertices with only weights or morphs added:

Skinning

Disabled: 13.67mv/s
2 weights: 6.42mv/s
3 weights: 4.55mv/s
4 weights: 3.53mv/s
5 weights: 2.88mv/s
6 weights: 2.43mv/s
7 weights: 2.10mv/s
8 weights: 2.10mv/s

Morphing

1 vertex (disabled): 13.67mv/s
2 vertices: 9.80mv/s
3 vertices: 6.55mv/s
4 vertices: 4.92mv/s
5 vertices: 3.93mv/s
6 vertices: 3.28mv/s
7 vertices: 2.81mv/s
8 vertices: 2.46mv/s

Combining both skinning & morphing gives 0.58mv/s and a vertex-size of 352 bytes (not really usable).
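The skinning and morphing tables above are easier to compare when the rates are turned into time per vertex. The sketch below is plain C with hypothetical helper names; it converts mv/s to nanoseconds per vertex and averages the added cost per extra weight or morph target across a table.

```c
#include <assert.h>

/* Time spent per vertex, in nanoseconds, for a rate given in
   million vertices per second. Hypothetical helper. */
double ns_per_vertex(double mverts_per_s)
{
    assert(mverts_per_s > 0.0);
    return 1000.0 / mverts_per_s;
}

/* Average added cost (ns) per step between the first and the last
   entry of a table of rates, where each entry adds one weight or
   one morph target to the previous one. */
double avg_step_ns(const double *mverts, int n)
{
    assert(n >= 2);
    return (ns_per_vertex(mverts[n - 1]) - ns_per_vertex(mverts[0])) / (n - 1);
}
```

Fed with the skinning rates for 2 to 7 weights (leaving out the 8-weight entry, which repeats the 7-weight figure), `avg_step_ns` gives about 64 ns per additional weight. Fed with the morphing rates for 2 to 8 vertices, it gives about 51 ns per additional morph target. For comparison, the plain 13.67mv/s case is about 73 ns per vertex.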

The sample I used for these values is available at gu/vertex/vertex.c. Hack away! I'm going to add a few real-world examples to this sample, but it shouldn't be too hard to do it yourself if you want to test your own code.

Please note that these values are raw performance values, and no pixels have been rendered. Your application will not receive the same benefits, but this may act as a guide.
GE Dominator
jsgf
Posts: 254
Joined: Tue Jul 12, 2005 11:02 am

Re: Vertex Performance (Revisiting GE)

Post by jsgf »

chp wrote:Well, here are the cold, hard facts: If your vertices are 12 bytes in size (optimal size) and vertex buffer in vram, your performance drops from 13.67 million vertices per second to 6.31(!) million vertices per second. That's a 53% performance drop! And it gets even worse if you use system ram, then it drops to 3.16(!!!) million vertices per second. That's 76 per cent, and a real system killer. Why Sony added index-buffers is a complete mystery, since they really suck.
Yeah, that's really bad. What order were you reading vertices in? Does linear versus random make much difference? I expect it would.

Also, what primitive were you using for this? I'm wondering whether strip vs fan vs independent triangles vs points makes a difference. I wonder if there's any evidence of a transform cache (i.e., independent triangles presented in strip order get better performance than completely independent tris).
chp wrote:From the test I have running, kicking as many batches containing 1536 vertices as possible, the maximum T&L that the PSP can push is 13.67 million vertices per second. This is nowhere close the 35 million vertices per second that they have stated before, so these numbers aren't final (will research some more to make sure I haven't broken something).
I suspect vertices generated by the subdivision operators are dealt with much more quickly than explicitly specified ones; I think you'll find that a bezier patch will approach 35Mvert/s.
chp
Posts: 313
Joined: Wed Jun 23, 2004 7:16 am

Re: Vertex Performance (Revisiting GE)

Post by chp »

jsgf wrote:
chp wrote:... Why Sony added index-buffers is a complete mystery, since they really suck.
Yeah, that's really bad. What order were you reading vertices in? Does linear versus random make much difference? I expect it would.
This was linear access to vertices, which should have shown maximum performance if there's any startup-cost for starting to read from memory.
jsgf wrote:Also, what primitive were you using for this? I'm wondering whether strip vs fan vs independent triangles vs points makes a difference. I wonder if there's any evidence of a transform cache (i.e., independent triangles presented in strip order get better performance than completely independent tris).
I used simple points, to avoid any possible issues with primitive assembly. Cache sizes tested were 4, 8, 16 and 32, and I saw no sign of improvement over running with a completely linear index buffer. It seems there's no transform cache, or it works in a way that we haven't figured out yet.
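For anyone who wants to repeat the cache experiment, the sketch below shows one possible way to build the two kinds of index buffer in plain C. It is an assumption about how such a test could be set up, not the code from the sample: the function names are made up, 16-bit indices are assumed, and "window" stands in for the cache size being probed.

```c
#include <assert.h>
#include <stddef.h>

/* Fully linear index buffer: 0, 1, 2, ... */
void fill_linear_indices(unsigned short *idx, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        idx[i] = (unsigned short)i;
}

/* Each group of `window` vertices is referenced twice in a row,
   so a transform cache holding at least `window` entries would be
   able to reuse results on the second pass over the group. */
void fill_windowed_indices(unsigned short *idx, size_t count, size_t window)
{
    assert(window > 0);
    for (size_t i = 0; i < count; ++i) {
        size_t group = i / (2 * window);
        idx[i] = (unsigned short)(group * window + i % window);
    }
}
```

With a window of 4 and 16 indices, the second function produces 0,1,2,3,0,1,2,3,4,5,6,7,4,5,6,7. If a transform cache existed, the windowed buffer should run measurably faster than the linear one for window sizes that fit the cache; no such difference was seen in the test described above.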
jsgf wrote:I suspect vertices generated by the subdivision operators are dealt with much more quickly than explicitly specified ones; I think you'll find that a bezier patch will approach 35Mvert/s.
Yes, that might be true. Should probably do some benchmarking on those kinds of primitives too.
GE Dominator
Shazz
Posts: 244
Joined: Tue Aug 31, 2004 11:42 pm
Location: Somewhere over the rainbow

Post by Shazz »

Again, interesting tests chp !

I looked at the test code, hard to find what to optimize to achieve the "Graphics sub-system running at 166 MHz on a 512-bit bus with 2 MB of DRAM, rendering [...] 35 million polygons per second" (sic) (what's a polygon ? 3 vertices ?)

Or maybe, as on the PS2, the PSP has different data paths with different priorities to the GE... The scratchpad could also help (as it really helps the "bad" path 3 way to not be that ridiculous)....

Yep other primitives could be interesting too.. 16bit draw buffer...Or maybe they bypass the Transform engine :D directly to the rasterizer... Quite difficult to find how to double the number of vertices...

by the way, good work ! Very interesting....
- TiTAN Art Division -
http://www.titandemo.org
chp
Posts: 313
Joined: Wed Jun 23, 2004 7:16 am

Post by chp »

Bypassing the transform pipe does nothing, it even lowers performance down to just below 13mv/s. Also, running 2D vertices from VRAM is a really bad idea, performance drops to 5mv/s for some reason.

Using an ortho-projection would really be a better way than blitting in 2D, since it seems to give better vertex performance, and it gives access to all the fun things like transforms, texture scaling, etc.
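As an illustration of the ortho-projection approach, here is a minimal standalone sketch in plain C that builds an orthographic matrix mapping pixel coordinates on the PSP's 480x272 screen to clip space. It uses the same column-major layout as OpenGL's glOrtho; the function names are hypothetical, and in a real program the SDK's own matrix helpers would normally be used instead.

```c
#include <assert.h>

/* Column-major 4x4 orthographic projection, laid out like the
   matrix produced by OpenGL's glOrtho. Hypothetical helper. */
void ortho_matrix(float m[16], float left, float right,
                  float bottom, float top, float near_z, float far_z)
{
    assert(right != left && top != bottom && far_z != near_z);
    for (int i = 0; i < 16; ++i)
        m[i] = 0.0f;
    m[0]  =  2.0f / (right - left);
    m[5]  =  2.0f / (top - bottom);
    m[10] = -2.0f / (far_z - near_z);
    m[12] = -(right + left) / (right - left);
    m[13] = -(top + bottom) / (top - bottom);
    m[14] = -(far_z + near_z) / (far_z - near_z);
    m[15] =  1.0f;
}

/* Apply the matrix to a 2D point at z = 0, w = 1 and return the
   resulting clip-space x and y. */
void project_2d(const float m[16], float x, float y,
                float *out_x, float *out_y)
{
    *out_x = m[0] * x + m[4] * y + m[12];
    *out_y = m[1] * x + m[5] * y + m[13];
}
```

Calling `ortho_matrix(m, 0, 480, 272, 0, -1, 1)` puts the origin in the top-left corner: pixel (0, 0) maps to (-1, 1), pixel (480, 272) to (1, -1), and the screen centre (240, 136) to (0, 0). Sprites can then be drawn as ordinary transformed geometry in pixel units.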
GE Dominator
rapso
Posts: 140
Joined: Mon Mar 28, 2005 6:35 am

Re: Vertex Performance (Revisiting GE)

Post by rapso »

nice benchmarks. u said, u've used pointsprites, could u try (backfacing) trianglestrips too? maybe pointsprites generate some overhead.
chp wrote:Why Sony added index-buffers is a complete mystery, since they really suck.
i guess indices are used to accelerate expensive vertices, if u enable all lights and all fancy stuff with really big vertices, indexed drawing could be a lot faster... maybe u could benchmark it.

i cannot try it, 'cause i have no psp yet.