IBM XL C compiler slower than gcc C compiler?

kroe · Post by **kroe** » Tue Aug 21, 2007 1:31 pm

I am a bit baffled by what I am seeing switching between the IBM XL C compiler and the GCC C compiler.

The code I am running is heavily optimized using the SPU C intrinsics. Almost all operations are done through intrinsics. I know this doesn't leave the compiler much to do, but the difference I am seeing makes no sense to me.

With the GCC compiler I am getting roughly triple the performance out of the six SPUs that I get from the two cores of my Athlon 64 X2 3800+; the app takes 47 seconds on the Cell when compiled with gcc. I read that the IBM XL compiler produces considerably faster binaries, so I tried it... 118 seconds for the exact same code on the exact same target PS3.

Since the code was so laden with intrinsics I figured that it would be about the same with both compilers, with the IBM compiler having the advantage since they designed the processor, wrote the intrinsics, and wrote the compiler.

Any ideas why I am not seeing the expected results?

I am compiling through Eclipse using SDK 2.1 on my Athlon 64 running Fedora Core 6 and am running on my PS3 using SDK 2.1 running Fedora Core 6.

Thanks,
-Ken

ps2devman · Post by **ps2devman** » Tue Aug 21, 2007 5:08 pm

I may be wrong, but the final speed of compiled code depends on how well the compiler understands the "time dependencies" between instructions.

Sometimes you start an instruction and the result can't be obtained immediately. So you can start another instruction that uses otherwise idle parts of the processor, instead of just waiting with a nop (or having the processor stall without any compilation warning).

The result is horrible code, very hard for a human to understand, because most instructions end up in a different order than in the source, but it runs very, very fast.

I guess two teams writing a compiler won't have the same level of understanding of these optimization techniques...

Also, using direct assembly code in C may be a way to disable these automatic optimizations...
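
As a rough illustration of the dependency idea described in this post, here is a minimal sketch in plain portable C (not SPU intrinsics). The function names `sum_serial` and `sum_interleaved` are made up for this example. The first version forms one long chain where each add must wait for the previous result; the second keeps four independent chains that a scheduler is free to overlap. Whether this actually helps on a given target depends on the instruction latencies and on the compiler.

```c
#include <stddef.h>

/* One accumulator: every add depends on the result of the previous
   add, so the adds form a single serial dependency chain. */
float sum_serial(const float *x, size_t n)
{
    float acc = 0.0f;
    size_t i;
    for (i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

/* Four accumulators: the four adds inside one iteration do not depend
   on each other, so they can be issued back to back while earlier
   results are still in flight. */
float sum_interleaved(const float *x, size_t n)
{
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }
    for (; i < n; i++)   /* leftover elements when n is not a multiple of 4 */
        a0 += x[i];
    return (a0 + a1) + (a2 + a3);
}
```

Note that the two versions add the floats in a different order, so the results can differ in the last bits for general input. That is one reason a compiler will usually not make this transformation on its own without being told that reordering floating-point math is acceptable.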

ldesnogu · Post by **ldesnogu** » Wed Aug 22, 2007 7:06 am

ps2devman is right: with intrinsics, the compiler can still schedule the code.
You should post your code on the IBM Cell forum; the xlc compiler team will surely do its best to beat gcc ;-)

demosuzki · Post by **demosuzki** » Mon Aug 27, 2007 9:17 pm

take a look at the spu_timing utility to examine the generated code.

off the top of my head the process is
in the link stage you add a -s to the linker options and then run the command spu_timing on the output <modulename>.s files.
(I'll check this... I can't right now)

the output is an ASCII file with the assembly and a nice graph of the cycle counts, coupled with the pipeline, in which the dependency stalls are shown.
it's then possible to rearrange (manually) your intrinsics to see if you can remove the stalls and make more efficient use of the pipeline.

my experience with some code I have optimised is that xlc was faster (30%) than gcc, but I guess it all depends on context.

/ds

ldesnogu · Post by **ldesnogu** » Mon Aug 27, 2007 10:13 pm

demosuzki wrote: off the top of my head the process is
in the link stage you add a -s to the linker options and then run the command spu_timing on the output <modulename>.s files.

No, passing -s to the linker instructs it to remove symbols (strip).

If you want to get an assembly file, replace -c with -S (capital S).
For instance, gcc -S foo.c will create foo.s.
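
To make the -c versus -S distinction concrete, here is a small sketch of what such a foo.c could contain. The function `scale_add` is a hypothetical example, not code from this thread. Compiling it with -S gives a foo.s whose loop body can be read instruction by instruction, which is the kind of file the spu_timing suggestion above refers to.

```c
/* foo.c (file name taken from the example above; the contents are
   only an illustration).
     gcc -S foo.c   writes the generated assembly to foo.s
     gcc -c foo.c   would write the object file foo.o instead */
#include <stddef.h>

/* y[i] = a * x[i] + y[i] for the first n elements. */
void scale_add(float *y, const float *x, float a, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Comparing the foo.s produced at different optimization levels, or by the two compilers, is a direct way to see how each one ordered the instructions.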

vi_vid · Post by **vi_vid** » Thu Sep 13, 2007 2:22 am

XLC has 6 (or 7???) optimization levels, GCC has 4.

to dump commented asm code in GCC, I usually use
--save-temps --verbose-asm
(the spellings in the GCC documentation are -save-temps and -fverbose-asm)