Misc Crashes

UsefulIdiot · Post by **UsefulIdiot** » Mon Mar 27, 2006 10:16 am

I have been having some problems with random crashes lately. By random, I mean that random builds of my game will just crash. What is strange is that removing or adding random printf's will make it go away for a few builds. If a build does not crash the first time, it will run fine; if it crashes once, it'll always crash.

Here's what psplink tells me when it crashes:

Code:

Exception - Bus error (data)
Thread ID - 0x04E92049
Th Name - user_main
Module ID - 0x04E92049
Mod Name - "FF3Engine"
EPC - 0x08903D68
Cause - 0x0000001C
Status - 0x20008613
BadVAddr - 0x00000000

I won't bother typing the rest, but if it helps solve the problem, just ask. Are there any tutorials on how to use gdb (along with the PSP)?
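
For readers decoding a dump like the one above by hand: on MIPS the exception code sits in bits 6..2 of the Cause register, so `0x0000001C` decodes to code 7, which is the data bus error psplink reports. The sketch below shows that arithmetic. The function names are made up for illustration, and the table covers only the common MIPS32 codes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* ExcCode occupies bits 6..2 of the MIPS Cause register. */
unsigned cause_exc_code(uint32_t cause)
{
    return (cause >> 2) & 0x1Fu;
}

/* Names for the common MIPS32 exception codes; others map to "other". */
const char *exc_code_name(unsigned code)
{
    switch (code) {
    case 0:  return "Interrupt";
    case 4:  return "Address error (load/fetch)";
    case 5:  return "Address error (store)";
    case 6:  return "Bus error (instruction)";
    case 7:  return "Bus error (data)";
    case 8:  return "Syscall";
    case 9:  return "Breakpoint";
    case 10: return "Reserved instruction";
    case 11: return "Coprocessor unusable";
    case 12: return "Overflow";
    default: return "other";
    }
}
```

Combined with `BadVAddr - 0x00000000`, a code of 7 is consistent with a load or store through a NULL pointer, though the dump alone does not say which pointer.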

dr_watson · Post by **dr_watson** » Mon Mar 27, 2006 4:51 pm

Most of the time when I got crashes like that was due to NULL pointers.

groepaz · Post by **groepaz** » Mon Mar 27, 2006 5:40 pm

Most of the time when I got crashes like that was due to NULL pointers.

...

BadVAddr - 0x00000000

o_O

hubevolution · Post by **hubevolution** » Mon Mar 27, 2006 7:09 pm

ehehe well that's a very NULL pointer :)

Brunni · Post by **Brunni** » Mon Mar 27, 2006 7:14 pm

If it just happens sometimes, then it's most likely a bug in your program, like a buffer overflow.
Sometimes it can also be a bug in GCC.

Saotome · Post by **Saotome** » Mon Mar 27, 2006 9:07 pm

...but without any code we will never know :P

turkeyman · Post by **turkeyman** » Fri Apr 14, 2006 11:50 pm

I've actually been experiencing the same thing.
I build and get a random crash in a bit of code that has always run just fine.
Then I start narrowing down the crash, isolate the function it's in, put in some logs to see why it's crashing, and it stops crashing.

It's happened to me in two places that wouldn't normally crash, and it went away when I put logs in the code between specific lines.

I'm a little suspicious of something in a recent revision of the toolchain. It's almost like GCC is scheduling the code badly.

I had another bug where an assert was firing because something had an invalid value, so I put logs around it to see what the value was. It was definitely the correct value, and the code around it that depended on it being correct was still working perfectly. It looked like a scheduling problem in the assert test, as if the value hadn't been set before the assert test ran.

:/

starman2049 · Post by **starman2049** » Sat Apr 15, 2006 2:17 am

I've been fighting this every day for two months. Every time I make a code change I have to add or remove unused sprintf lines (outside my main loop) so that it does not crash or show graphics corruption.

I'm pretty certain it has to do with alignment, because 90% of the time it just gives graphics corruption and only 10% of the time does it actually crash. I'm not using PSPLINK, so I don't get any crash dump info. I'm thinking more and more about converting to PSPLINK, but was holding off until after E3.

I got the SDK and toolchain on September 5, 2005 and have been too nervous about screwing up my dev environment to get a more recent version, so at least in my case I cannot blame a recent SDK or toolchain revision.

I've spent at least a week trying to fix this, but was never able to track it down. If anybody finds a solution that works for me I'll buy you a beer, heck I'll buy you a whole keg and that's no joke.

turkeyman · Post by **turkeyman** » Sat Apr 15, 2006 4:53 am

I don't know if that sounds like the same thing :/

I have had numerous problems with alignment too, but this is very different.
It comes out of nowhere, and if you wiggle the instruction order around it just goes away. Also, logging the values shows the wrong thing, when I know the right value is in the variable.

cheriff · Post by **cheriff** » Sat Apr 15, 2006 9:31 am

Random bugs that are there one minute and gone the next are always (in my experience) one of two things:
* Timing issues and/or race conditions. However, since we aren't running SMP (unless you're using the ME, in which case you're on your own! :) ) and not in a preemptive environment, we can pretty much skip this case, leaving us with:
* Cache issues. This could very well be the case. When reading back from the CPU you get the correct value, since the CPU is able to read the most current one from the cache. Kicking the buffer to the GU happens behind the cache, and perhaps there's a bad value in RAM causing a bus error when dereferenced? Adding random logging or moving code around also helps, as it changes timing and cache usage patterns and gives the data a chance to eventually be flushed back to main RAM.

I don't know if this is the case, but there may be just one instance where you forgot to write back the cache. It's what the symptoms seem to indicate and what I'd be looking for.
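
The cache scenario described above can be illustrated with a toy model that runs on any host. This is only a sketch: `ToyMemory` and its functions are invented for illustration and model a write-back cache as a shadow array with dirty flags, which is far simpler than the real hardware. On the PSP itself the write-back step would be a call such as `sceKernelDcacheWritebackAll()` from the PSPSDK before the buffer is handed to the GU.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { RAM_WORDS = 16 };

/* Toy model: 'ram' is what DMA hardware sees, 'cached' plus 'dirty'
   stand in for a write-back data cache in front of it. */
typedef struct {
    uint32_t ram[RAM_WORDS];
    uint32_t cached[RAM_WORDS];
    uint8_t  dirty[RAM_WORDS];
} ToyMemory;

void toy_init(ToyMemory *m)
{
    memset(m, 0, sizeof *m);
}

/* A CPU store lands in the cache only. */
void cpu_write(ToyMemory *m, unsigned i, uint32_t v)
{
    m->cached[i] = v;
    m->dirty[i] = 1;
}

/* A CPU load sees the newest value, cached or not. */
uint32_t cpu_read(const ToyMemory *m, unsigned i)
{
    return m->dirty[i] ? m->cached[i] : m->ram[i];
}

/* A DMA-style read (the GU, in the thread's terms) bypasses the cache. */
uint32_t dma_read(const ToyMemory *m, unsigned i)
{
    return m->ram[i];
}

/* Write every dirty word back to RAM. */
void cache_writeback_all(ToyMemory *m)
{
    for (unsigned i = 0; i < RAM_WORDS; ++i) {
        if (m->dirty[i]) {
            m->ram[i] = m->cached[i];
            m->dirty[i] = 0;
        }
    }
}
```

In this model, a value written with `cpu_write` reads back correctly through `cpu_read` while `dma_read` still returns the stale word, until `cache_writeback_all` runs. That matches the symptom of logging showing the right value while the hardware consumes the wrong one.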

Good luck!
- cheriff

turkeyman · Post by **turkeyman** » Sat Apr 15, 2006 11:10 am

It sounds like that could be the case for starman2049, but I don't think my case or UsefulIdiot's is likely to be the dcache.

Here's one of multiple examples (also remember they both used to work perfectly):

const char *MFStringCache_Add(MFStringCache *pCache, const char *pNewString)
{
	MFCALLSTACK;

	MFDebug_Assert(pCache, "NULL String cache!");
	MFDebug_Assert(pNewString, "Cannot add NULL string");

	// find the string
	char *pCurr = pCache->pMem;
	int newLength = MFString_Length(pNewString)+1;
	printf("%p %p, %s\n", pCurr, pCache, pNewString);

	while (pCurr[0] && pCurr < &pCache->pMem[pCache->size])
	.....
}

Here's the output when I run with the printf moved up one line (it dies on the 4th iteration):
host0:/> 0x08DBD18C 0x08DBD180, Material
0x08DBD18C 0x08DBD180, Arial
0x08DBD18C 0x08DBD180, diffusemap
0x616D6573 0x08DBD180, Arial <- notice the value of pCurr here
Hardware Exception!

You can see where I have placed the printf; this code now works.
With that printf where it is, all 3 values show correctly and the code works perfectly.
If I move that printf to the line before the strlen (still after pCurr=...), pCurr changes and is no longer valid (but pCache is still the same pointer), and remains that way until the while() line, where it crashes because it can't dereference it.

I've toyed with loads of dcache flushes and stuff; it doesn't seem to help.

:/

This is one of multiple similar examples. I can provide some others if you like.
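
One cheap check when a pointer suddenly holds a value like `0x616D6573` is to read its bytes as text. The sketch below does that in little-endian order, which is how the PSP stores words. `pointer_as_ascii` is a made-up helper name, and mapping non-printable bytes to '.' is an arbitrary choice.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Render the four bytes of a 32-bit value, lowest address first
   (little-endian order), as printable ASCII. Bytes outside the
   printable range become '.'. 'out' must hold at least 5 chars. */
void pointer_as_ascii(uint32_t value, char out[5])
{
    for (int i = 0; i < 4; ++i) {
        unsigned char b = (unsigned char)(value >> (8 * i));
        out[i] = (b >= 0x20 && b < 0x7F) ? (char)b : '.';
    }
    out[4] = '\0';
}
```

Fed the bad pCurr from the log, this yields "sema", which also appears inside "diffusemap", the string added on the previous iteration. That hints that string bytes ended up where the pointer was expected, whether through an overrun or something else; the log alone does not settle which.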

turkeyman · Post by **turkeyman** » Sat Apr 15, 2006 11:37 am

Oh, also worth noting: if I take the printf out altogether, it crashes exactly as in my example there, but if I compile using -O0, it works.

Huff... (another reason I started considering the possibility of it being GCC's scheduling).

turkeyman · Post by **turkeyman** » Sat Apr 15, 2006 11:40 am

In fact, -O0 and -O1 both work; -O2/-Os/-O3 all break it.
What optimisation techniques are introduced between -O1 and -O2?
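
As a partial answer to that question: per the GCC manual, -O2 adds instruction scheduling (`-fschedule-insns`, `-fschedule-insns2`) and `-fstrict-aliasing`, among other passes, on top of -O1. Strict aliasing is a common reason for code that "worked at -O1" to break, because reading an object through a pointer of an unrelated type becomes undefined behaviour the optimiser may exploit. Nothing in the thread shows that this is what happened here. The sketch below only contrasts the risky pattern with a safe one, and assumes a 32-bit IEEE-754 `float`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Risky pattern, shown for contrast only (not compiled in):
 *
 *     uint32_t bits = *(uint32_t *)&f;
 *
 * It reads a float object through a uint32_t lvalue, which violates the
 * aliasing rules and may be reordered or miscompiled at -O2 and above. */

/* Safe pattern: copy the bytes. Compilers typically turn this memcpy
   into a single register move. Assumes sizeof(float) == 4. */
uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

Building with `-O2 -fno-strict-aliasing` is one way to test whether aliasing is involved: if the crash disappears with that flag, pointer casts between unrelated types are worth auditing.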

Jim · Post by **Jim** » Sat Apr 15, 2006 5:48 pm

I suspect it's more likely that you have overwritten the stack. Optimisation often pushes variables from the stack into registers, and adding and removing lines of code often moves where things are on the stack too. Perhaps you have something like

char x[10];
sprintf(x, "DeadDeadDead");

or some other stack smashing code.
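
In the snippet above, `sprintf` writes 13 bytes (12 characters plus the terminator) into a 10-byte array. A bounded version might look like the following sketch; `format_checked` is a hypothetical helper, not something from the thread's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Copy 'text' into 'dst' without writing past dst_size bytes.
   Returns 1 if the whole string fit, 0 if it was truncated or
   snprintf reported an error. */
int format_checked(char *dst, size_t dst_size, const char *text)
{
    int n = snprintf(dst, dst_size, "%s", text);
    return n >= 0 && (size_t)n < dst_size;
}
```

With a 10-byte buffer, "DeadDeadDead" is cut to "DeadDeadD" and the function returns 0, so the caller can log the truncation instead of silently overwriting a neighbouring stack slot.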

Jim

weak · Post by **weak** » Sun Apr 16, 2006 1:00 am

I've experienced graphical errors with -O2 too (while -Os works perfectly).

chp told me that he made some fixes regarding this problem, and a toolchain update did indeed correct _most_ of the bugs.

But it seems like there are still some problems with -O2.

turkeyman · Post by **turkeyman** » Mon Apr 17, 2006 8:16 pm

Jim:

I don't think that's likely; I have made fairly thorough searches for buffer overflows.
Also, in my example, pCurr and pCache are the only variables on the stack. pCache is correct, pCurr is broken, and strlen() certainly doesn't write to the stack, so I can't see any place where it could be overwritten...

:/

weak:
He said '_most_'? Where was this announcement made?
Are there more known bugs, and do you know if anyone's sorting them out?
I know nothing about GCC internals, so I'm not much use there. :/

weak · Post by **weak** » Mon Apr 17, 2006 10:36 pm

turkeyman wrote: weak:
He said '_most_'? Where was this announcement made?
Are there more known bugs, and do you know if anyone's sorting them out?
I know nothing about GCC internals, so I'm not much use there. :/

There was no announcement or anything like that. I just asked chp if he knew about problems with -O2, and he told me that he had had some problems too and that he had already fixed some stuff.

the "_most_" is just personal experience. my rendering was really f*d up with -O2, and after the toolchain update it took me a few builds to even notice that there were still some minor bugs. huge difference, but still errors

Brunni · Post by **Brunni** » Wed Jun 14, 2006 10:43 pm

Sorry to bump this topic, but are people still getting problems like this? I downloaded the latest version of the toolchain but I'm still having those problems :(
I've looked at my code carefully and it seems to be okay (although it is so complicated that I can't be 100% sure). It also runs under Windows in debug mode without problems, so the chances that there's a stack overflow or anything like it are very small...
However, in some builds I get random crashes (the game freezes and the sound buffer loops), and in others temporary graphics corruption (rarer). If I change anything in the code, it won't happen again, so it's almost impossible to debug...

I really want to find where the problem could be, but I haven't got any tool for this. Can anyone tell me what I could do for debugging? I'm thinking of dumping the changes in RAM from one function to the next to see where a possible overflow could occur...
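
One low-tech tool for the kind of hunt described above is a guarded allocator: pad every heap block with known bytes and check them at convenient points (end of frame, before free). The following is a minimal sketch under stated assumptions. The `guarded_*` names, the 16-byte guard size and the fill value are all arbitrary choices, and it catches only writes that land in the guards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

enum { GUARD_BYTES = 16, HEADER_BYTES = 32, GUARD_FILL = 0xA5 };

/* Block layout: [size, padded to 16][front guard][user bytes][back guard] */
void *guarded_alloc(size_t size)
{
    unsigned char *base = malloc(HEADER_BYTES + size + GUARD_BYTES);
    if (!base)
        return NULL;
    memcpy(base, &size, sizeof size);                         /* remember user size */
    memset(base + HEADER_BYTES - GUARD_BYTES, GUARD_FILL, GUARD_BYTES);
    memset(base + HEADER_BYTES + size, GUARD_FILL, GUARD_BYTES);
    return base + HEADER_BYTES;
}

/* Returns 1 if both guards are intact, 0 if either was overwritten. */
int guarded_check(const void *user)
{
    const unsigned char *p = user;
    const unsigned char *base = p - HEADER_BYTES;
    size_t size;
    memcpy(&size, base, sizeof size);
    for (size_t i = 0; i < GUARD_BYTES; ++i) {
        if (base[HEADER_BYTES - GUARD_BYTES + i] != GUARD_FILL)
            return 0;
        if (p[size + i] != GUARD_FILL)
            return 0;
    }
    return 1;
}

void guarded_free(void *user)
{
    if (user)
        free((unsigned char *)user - HEADER_BYTES);
}
```

Calling `guarded_check` after each suspicious function narrows an overrun down to the call that first damages a guard, which is roughly the "dump changes in RAM between functions" idea with far less data to compare.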

Thanks in advance

jimparis · Post by **jimparis** » Sat Jun 17, 2006 3:34 pm

Brunni wrote:I really want to find where the problem could be, but I haven't got any tool for this. Can some people tell me what I could do for debugging?

If you can run it on Linux I'd suggest running it under valgrind's memcheck tool.

Brunni · Post by **Brunni** » Mon Jun 19, 2006 12:28 am

No, I can't :( But Visual Studio's debug mode usually checks for that quite well, I think ^^
I'm still getting those errors, it seems only with -O2, -Os or -O3, and very strange things happen. For example, my menu flickers at first (it should not be drawn yet), then should fade in, but freezes at 50%. And when these bugs happen, it always freezes (I have installed a debug exception handler and it doesn't fire in these cases). Usually when it's my fault, the exception handler fires.
That's why I'm wondering whether it could be a problem with GCC. But am I alone? Is this an official MIPS GCC or is it tweaked (so that a bug may have crept in)?
Thanks in advance

phytoporg · Post by **phytoporg** » Tue Aug 29, 2006 8:54 am

Sorry to revive an older thread, but I'm running into the same problems. Seems like everything is working properly, but when I remove arbitrary pieces of debug output, the program just crashes with nothing from the exception handler.

Toying with the gcc flags doesn't seem to make any difference. Anyone have any other suggestions? This is crazy frustrating. D:

Jim · Post by **Jim** » Tue Aug 29, 2006 5:33 pm

The answers are the same. Either you've
a) overwritten the end of a memory allocation (corrupted the heap),
b) overwritten the end of a local array (corrupted the stack),
c) not handled the PSP's cache correctly when addressing the video hardware, or
d) done something else.

Changing the optimisation flags often moves things around on the stack and re-orders the code to make things just slightly different, same as removing or adding the odd debug statement.

No code. Not much anyone can do to help.

Jim

phytoporg · Post by **phytoporg** » Wed Aug 30, 2006 1:34 am

I'd provide some source but I'm not entirely certain where the problem is. I'd like to try PSPLink's gdb capabilities to narrow things down but I'm using 2.6 firmware. I'm actually not even 100% sure where this thing's crashing since I can't make use of an exception handler.

I'm attempting to write a dynamic recompiling/threaded interpreting SNES emulator (I'm not really clear on the distinction). My current problem is that when I comment out the code to dump blocks from the instruction cache-- the emulator's, not the PSP's, obviously-- to the memstick, the program seems to crash upon attempting to execute the generated code. If I leave the logging in there, I don't get any problems unless I completely comment out all of the other printf stuff that's hanging around there.

I'll post what I think could be relevant code:

from compileBlock():

Code:

                      // ... some stuff ...

		/* Pass 2
		 * 	emit proper translated code
		 */
		skip = 0;
		//createDebugFile( "ms0:/exec.dump" );

		tempPtr = bankTable[ startPC >> 16 ] + ( startPC & 0xFFFF );
		*tempPtr = emitCode;

		/** Reset flag considerations for second pass **/
		if( tempFlags != P ) setAll( P );

		for(; j <= i; ++j ) {
			tempFlags = P | CFLAG | VFLAG | ZFLAG | NFLAG; /* Select flags later */
			cc += emitInstr( &emitCode, ( byte * )realPC + skip, tempFlags );

			switch( *( realPC + skip ) ) {
				case	SEP:
						P |= *( realPC + skip + 1 );
						if( P & MFLAG ) sepM();
						if( P & XFLAG ) sepX();
					break;
				case	REP:
						P &= ~( *( realPC + skip + 1 ) );
						if( !( P & MFLAG ) ) repM();
						if( !( P & XFLAG ) ) repX();
					break;
			}

			skip += sizeTable[ *( realPC + skip ) ];
		}

		emitUpdatePC(  &emitCode, skip );
		emitUpdateCycles( &emitCode, cc );

		emitReturn( &emitCode );

		//writeDebugFile( romCache, ( byte * )emitCode - ( byte * )romCache );
		//closeDebugFile();

                      // ... more stuff ...

And here's the code to either call the above or execute the code generated by the above (a good lot of the MIPS regs are statically allocated to represent 65c816 registers, please forgive the mess):

Code:

dynarec:
	la	A0, bankTable		# Get the value for bankTable[ bank ]
	srl	A1, PC, 16		# Get PC bank
	sll	A1, A1, 2		# word alignment
	addu	A0, A0, A1

	lw	V0, 0( A0 )

	bne	V0, ZERO, bankActive
	nop

	la	TEMPREG1, activeBanks	# bankTable[ bank ] = cacheEntries + 0xFFFF * activeBanks
	lw	TEMPREG2, 0( TEMPREG1 )	# if activeBanks < 2
	addi	A1, TEMPREG2, -2
	beq	A1, ZERO, reset
	nop

	li	TEMPREG3, 0xFFFF
	la	V0, cacheEntries
	mul	TEMPREG3, TEMPREG3, TEMPREG2

	sll	TEMPREG3, TEMPREG3, 2	# word alignment
	addu	V1, V0, TEMPREG3
	sw	V1, 0( A0 )

	addi	TEMPREG2, TEMPREG2, 1	# ++bankCount;
	sw	TEMPREG2, 0( TEMPREG1 )

	j	bankActive
	nop

reset:
	SAVEREGS	flushCache, ZERO, ZERO, ZERO
	sw	V0, 0( A0 )

bankActive:
	andi	TEMPREG1, PC, 0xFFFF
	sll	TEMPREG1, TEMPREG1, 2	# word alignment
	addu	TEMPREG1, TEMPREG1, V0

	lw	TEMPREG2, 0( TEMPREG1 )

	bne	TEMPREG2, ZERO, runCode
	nop

	SAVEREGS	compileBlock, PC, P, ZERO

	lw	TEMPREG2, 0( TEMPREG1 )

runCode:

	jal	TEMPREG2
	nop

	j	dynarec
	nop

I've considered downgrading to 1.5 or 1.0, but I'm not sure I want to risk a brick. If anyone has any suggestions for step-debugging or memory monitoring tools available to those of us with 2.6 firmware, that would be appreciated as well.
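
Read as C, the dispatcher above is a two-level lookup: the PC's bank selects a table, and the low 16 bits select a slot in it. The sketch below models only the slot arithmetic, with invented helper names. It also checks one detail worth a second look: `PC & 0xFFFF` takes 0x10000 distinct values, so a per-bank stride of 0xFFFF (as in the `li TEMPREG3, 0xFFFF` line) lets the last slot of one bank coincide with the first slot of the next. That is an observation about the posted code, not a confirmed cause of the crash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { SLOTS_PER_BANK = 0x10000 };  /* PC & 0xFFFF has 0x10000 possible values */

/* Bank number and in-bank offset of a 24-bit 65c816 program counter. */
unsigned pc_bank(uint32_t pc)   { return pc >> 16; }
unsigned pc_offset(uint32_t pc) { return pc & 0xFFFFu; }

/* Slot index inside one flat pool, for the bank allocated in position
   'bank_order' (0 for the first active bank, 1 for the second, ...). */
size_t slot_index(unsigned bank_order, uint32_t pc, size_t stride)
{
    return (size_t)bank_order * stride + pc_offset(pc);
}

/* Returns 1 if the last slot of the first bank reaches into the second. */
int adjacent_banks_overlap(size_t stride)
{
    size_t last_of_first   = slot_index(0, 0x00FFFFu, stride);
    size_t first_of_second = slot_index(1, 0x010000u, stride);
    return last_of_first >= first_of_second;
}
```

With a stride of `SLOTS_PER_BANK` the two banks are disjoint. With 0xFFFF they share exactly one slot, which would matter only if blocks are compiled at offset 0xFFFF of one bank and offset 0 of the next.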

ector · Post by **ector** » Wed Aug 30, 2006 8:28 pm

phytoporg, are you invalidating the instruction cache before trying to execute newly generated code? Doing this is essential.

phytoporg · Post by **phytoporg** » Fri Sep 01, 2006 3:35 am

Yeah, I was talking to another developer yesterday who suggested the same thing. It seems to have fixed the issue. I'm now also forcing a dcache writeback after the block is "emitted." Everything's in working order, now!
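
The fix described above has two halves: write the freshly emitted instructions back from the data cache, then invalidate the instruction cache so the CPU does not fetch stale lines. The sketch below uses the GCC/Clang builtin `__builtin___clear_cache`, which performs both steps for a range on targets that need it. That is a substitute so the example builds on a desktop compiler. On the PSP the equivalent would be the PSPSDK calls, for example `sceKernelDcacheWritebackAll()` followed by `sceKernelIcacheInvalidateAll()`. `publish_code` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy generated instruction words into the code buffer and make them
   visible to instruction fetch. Returns the number of bytes published. */
size_t publish_code(uint32_t *dst, const uint32_t *src, size_t word_count)
{
    size_t bytes = word_count * sizeof *src;
    memcpy(dst, src, bytes);

    /* Data-cache writeback plus instruction-cache invalidate for the
       range [dst, dst + bytes). This is a no-op on architectures with
       coherent instruction caches, such as x86. */
    __builtin___clear_cache((char *)dst, (char *)dst + bytes);
    return bytes;
}
```

The important design point is ordering: the synchronisation must happen after the last store to the block and before the first jump into it, which is why it belongs at the end of the emit step rather than in the dispatcher loop.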

Aion · Post by **Aion** » Mon Jan 08, 2007 10:01 am

I've been trying to release a new version of PspKanji for a month, and I'm having a similar problem which is stumping me at the moment :/.

The code works just fine under Windows/DirectX.

I've read about the possibility of stack corruption, but I am unable to pin point anything that could cause it.

On the PSP, it crashes at different moments, depending on code changes and whether I'm running the .pbp, the ELF with PSPLink, or under the GDB debugger in Eclipse. Compiler optimisation options also change the moment the crash occurs.

I've attached the source code of the project if anyone would be interested in helping me out on this. http://www.pqverdun.org/SvnPspKanji.rar

Steps :
1-Start the app
2-Press any key, resources will load
3-Press Triangle and select Kanjilist1.txt (confirm with X)
4-Press X to start Quiz
5-Press O to quit Quiz
6-Repeat 4,5 until the app crashes.
(Note: sometimes it crashes on loading the KanjiList, sometimes when trying to load the quiz for the first time, and other times after 3 quizzes...)

Possible result :

Code:

host0:/> Exception - Bus error (data)
Thread ID - 0x0463DB5B
Th Name   - user_main
Module ID - 0x04664B4D
Mod Name  - "PspKanji"
EPC       - 0x089201A0
Cause     - 0x1000001C
BadVAddr  - 0xE34C1B83
Status    - 0x20008613
zr:0x00000000 at:0x08A80000 v0:0x00000000 v1:0x00000004
a0:0x48BD9AB8 a1:0x00000064 a2:0x09F7FD90 a3:0x09F7FD88
t0:0x00000082 t1:0x00000000 t2:0x09F7FD50 t3:0x1E000102
t4:0x00000003 t5:0xC02E0001 t6:0x00000000 t7:0x00000001
s0:0x00000000 s1:0x403E0000 s2:0x00000001 s3:0x09F7FEE0
s4:0x00000014 s5:0x00000013 s6:0xDEADBEEF s7:0xDEADBEEF
t8:0x00000000 t9:0x1C788AFE k0:0x09F7FF00 k1:0x00000000
gp:0x0898BCB0 sp:0x09F7FD18 fp:0x09F7FD18 ra:0x0892213C

Any help would be greatly appreciated.

Thanks.

Aion · Post by **Aion** » Sat Jan 20, 2007 3:55 pm

When debugging in GDB/Eclipse, I find that inside an object, I call a method of the same class, but suddenly my this becomes 0.

I've found that not uncaching my pointers (adding 0x40000000) stops the crash. I really do not understand why at this point. And it always seems to be centered around the same piece of code.

Any pointers (no pun intended :) )?

Code:

//=============================================================================
// SET ALPHA ALL
//-----------------------------------------------------------------------------
// (FR) Identique a 'SetAlpha' + assigne alpha de tous les objets enfants de celui-ci.
// (EN) Identical to 'SetAlpha' + assign alpha of each child object of this one.
//=============================================================================
void LibGfx_CObjBase::SetAlphaAll( u8 au8_Alpha 	)
{

===>//(Pointer 'this' valid here, game eventually crashes if 0x40000000 is added when using the object)
===>SetAlpha( au8_Alpha ); 

	std::list<LibGfx_CObjBase*>::iterator iterObject;
	for( iterObject = mlp_ListChildObj.begin(); iterObject != mlp_ListChildObj.end(); ++iterObject)
		(*iterObject)->SetAlphaAll( au8_Alpha );
}

//-----------------------------------------------------------------------------
// (FR) Assigne la valeur Alpha de tous les vertex contenue dans les polygones
//		de cet objet.
// (EN) Set the alpha value of every vertex contained in this object's polygons
//=============================================================================
void LibGfx_CObjBase::SetAlpha( u8 au8_Alpha )
{
===>//Pointer 'this' not valid anymore (if Gdb is to be trusted)
	std::list<LibGfx_CPolyBase*>::iterator iterPoly;
	LibGfx_CPolyBase* pPoly;

	for( iterPoly = mlp_ListPoly.begin(); iterPoly != mlp_ListPoly.end(); ++iterPoly)	
	{
		pPoly = (LibGfx_CPolyBase*)UNCACHED(*iterPoly);
		pPoly->SetAlpha( au8_Alpha );
	}
}

TyRaNiD · Post by **TyRaNiD** » Sat Jan 20, 2007 5:35 pm

Well in general you probably shouldn't use uncached addresses except in very specific circumstances.

The only thing I can think of is that some data you are using is being written out at the cached address and read back at the uncached address. You cannot assume that the cache will have been flushed before you use the real data, and the CPU is too stupid (or I suppose too trusting) to check whether the uncached (cache-bypassing) address you are using is actually backed by cache. If it did, it would kind of defeat the point :P

If that makes any sense whatsoever :)

Aion · Post by **Aion** » Sun Jan 21, 2007 1:16 am

Yeah,

Once I noticed that I was using uncached data for more than the gu, I realized that I should stop doing so.

Anyway, I'm really really glad that it seems that I've found the root of the evil :P It was really driving me insane with the amount of wasted time trying to figure it out.

I'm not familiar enough with the underlying architecture to really understand why accessing the data directly instead of cached would create problems other than potential slowdown.

TyRaNiD · Post by **TyRaNiD** » Sun Jan 21, 2007 1:37 am

Well, the problem is simple: if data exists in the cache, it might not actually be in main memory, so when reading uncached you cannot be sure what you end up reading.
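
For reference, the "uncached" pointer in this exchange is the same physical memory seen through a second address window: setting bit 0x40000000 bypasses the data cache, as stated in Aion's post. The helpers below are an illustrative sketch with made-up names, working on plain 32-bit address values so they can be tried on any host. The `a0` register in the dump above, `0x48BD9AB8`, is an example of an address with that bit set.

```c
#include <assert.h>
#include <stdint.h>

#define UNCACHED_BIT 0x40000000u  /* per the thread: OR this in to bypass the dcache */

uint32_t to_uncached(uint32_t addr) { return addr | UNCACHED_BIT; }
uint32_t to_cached(uint32_t addr)   { return addr & ~UNCACHED_BIT; }
int is_uncached(uint32_t addr)      { return (addr & UNCACHED_BIT) != 0; }

/* Two addresses name the same memory if they differ only in the
   uncached bit. Mixing both views of one object is the hazard
   discussed above: a store through one view is not guaranteed to be
   visible through the other until the cache is written back. */
int same_memory(uint32_t a, uint32_t b)
{
    return to_cached(a) == to_cached(b);
}
```

A practical rule that follows from the explanation above: pick one view per buffer. Use the cached view for ordinary program data, and if a buffer must be read through the uncached view (or by the GU), write back the data cache after the last cached store to it.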