Kogna debug suggestions

Post by **SJHardy** » Thu Mar 28, 2024 5:04 pm

If only it was so easy as to be some of my code which was looping! The evidence seems to be pointing to a call via a bad function pointer or something similar which starts it executing random data (not code - the disasm knew that the PC was in one of my floating point data tables). I guess the difference between -g and not is exactly where it ends up executing. But something ends up turning off the timer interrupt. That's interesting, since I wouldn't expect that to happen easily by accident.

If the boot ROM has no business executing after boot, then it would be nice to be able to set a hardware breakpoint whenever it executes from that range of addresses, but I haven't yet found a way to set that sort of trap.

Unfortunately, once it starts executing data, it doesn't know the stack frame layout any more so no traceback. For now, I'm just poking around the stack looking for "my" addresses (0x2......) Also, B3 is the return address, but alas it is not anything recognizable.

Are there any clues in the SP location? Can I relate the SP back to the thread that was executing at the time? I know the info is probably all out there already, but is there a convenient memory layout table for the Kogna? E.g. where thread code and stacks go, the firmware, boot loader etc.

Post by **TomKerekes** » Thu Mar 28, 2024 9:55 pm

Hi Steve,

> If only it was so easy as to be some of my code which was looping! The evidence seems to be pointing to a call via a bad function pointer or something similar which starts it executing random data (not code - the disasm knew that the PC was in one of my floating point data tables). I guess the difference between -g and not is exactly where it ends up executing. But something ends up turning off the timer interrupt. That's interesting, since I wouldn't expect that to happen easily by accident.

Unfortunately it's hard to say how much garbage was executed before it finally found a loop to get stuck in.

Note that if the DSP had remained stuck in valid code with the interrupt turned off, I could see the root issue being something that disables interrupts. But if it is executing garbage, the root issue must be something else: turning off interrupts would stop communication, but would not by itself result in executing garbage.

> If the boot ROM has no business executing after boot, then it would be nice to be able to set a hardware breakpoint whenever it executes from that range of addresses, but I haven't yet found a way to set that sort of trap.

As I said in a previous post, I don't see support for the normal address masks for memory watchpoints. You might at least set a watchpoint for address 0.

We might try to enable the Memory Protection Unit, but it looks complicated and I can't find any example. Chapter 5 of the C6748 Technical Reference Manual (Google SPRUH79C) describes it. It looks like only External SDRAM, 0xC0000000 - 0xDFFFFFFF, can be protected. Up to 12 memory regions can be protected with 64KB granularity. So I'm not sure that helps.

> Are there any clues in the SP location? Can I relate the SP back to the thread that was executing at the time? I know the info is probably all out there already, but is there a convenient memory layout table for the Kogna? E.g. where thread code and stacks go, the firmware, boot loader etc.

Not really.

This defines the Global Thread Spaces:
#define USER_PROG_ADDRESS_KOGNA 0xC0080000 // where first user program is loaded
#define MAX_USER_PROG_SIZE_KOGNA 0x40000 // space between each thread/user program
#define MAX_USER_PROG_SIZE_KOGNA_7 (4 * 0x40000) // space between each Thread #7 user program

Note that User Code might call Kogna functions, which might in turn call library functions, which could be almost anywhere.
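
As a rough aid when scanning a post-mortem stack for return addresses, the defines above can be turned into a small address classifier. This is only a sketch under an assumption the defines suggest but do not state: Threads #1 to #6 sit back to back at MAX_USER_PROG_SIZE_KOGNA spacing, and Thread #7 follows them with the larger size. The function name is made up for illustration.

```c
#include <stdint.h>

#define USER_PROG_ADDRESS_KOGNA    0xC0080000u      /* where first user program is loaded */
#define MAX_USER_PROG_SIZE_KOGNA   0x40000u         /* space between each thread/user program */
#define MAX_USER_PROG_SIZE_KOGNA_7 (4u * 0x40000u)  /* space for Thread #7 */

/* Hypothetical helper: return the user thread (1..7) whose program space
 * contains addr, or 0 if addr lies outside user program space.
 * ASSUMPTION: threads 1..6 are contiguous and thread 7 follows thread 6. */
static int thread_from_code_addr(uint32_t addr)
{
    const uint32_t thread7_start = 6u * MAX_USER_PROG_SIZE_KOGNA;
    uint32_t off;

    if (addr < USER_PROG_ADDRESS_KOGNA)
        return 0;
    off = addr - USER_PROG_ADDRESS_KOGNA;
    if (off < thread7_start)
        return (int)(off / MAX_USER_PROG_SIZE_KOGNA) + 1;
    if (off < thread7_start + MAX_USER_PROG_SIZE_KOGNA_7)
        return 7;
    return 0;
}
```

A result of 0 does not mean the address is bad, since firmware and library code live outside these ranges.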

The Stacks are defined in internal memory in an array called "Stacks", with 2 KBytes per Thread, starting with User Thread 1. Stacks is defined as an array of doubles so that it is 8-byte aligned, so Stacks+0x100 would be the bottom of the Thread #1 stack.
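
To relate a captured SP back to a thread, the layout just described (0x100 doubles = 2 KBytes per thread) can be expressed as a lookup. This is a sketch, not firmware code: the helper name is hypothetical, the address of Stacks has to be found separately (for example from the map file), and it assumes the seven user thread stacks are contiguous.

```c
#include <stdint.h>

#define STACK_BYTES_PER_THREAD 2048u  /* 2 KBytes per thread, per the post */
#define N_USER_THREADS         7u     /* ASSUMPTION: stacks for threads 1..7 are contiguous */

/* Hypothetical helper: map a stack pointer to a user thread number (1..7),
 * or return 0 if it lies outside the Stacks array.
 * stacks_base is the address of the firmware's Stacks symbol. */
static int thread_from_sp(uint32_t stacks_base, uint32_t sp)
{
    uint32_t off;

    if (sp < stacks_base)
        return 0;
    off = sp - stacks_base;
    if (off >= N_USER_THREADS * STACK_BYTES_PER_THREAD)
        return 0;
    return (int)(off / STACK_BYTES_PER_THREAD) + 1;
}
```

An SP sitting exactly on a 2 KByte boundary (an empty stack) is attributed to the next thread up, so values near a boundary deserve a second look.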

One other thought would be to add a User Callback. This is called between each Thread Switch. The sequence is:

1. The 90us interrupt occurs.
2. The interrupt routine pushes the entire DSP state (64 registers plus several other registers) onto the current stack (the stack of the pre-empted Thread).
3. IRP (Interrupt Return Pointer) contains where the Thread was pre-empted.
4. UserCallback is called.
5. Servo/trajectory processing runs for all 16 axes.
6. The stack switches to the next Thread's stack, and IRP switches to the next Thread's return point.
7. The DSP state is popped from the stack.
8. Return from interrupt (to the next Thread).

So the idea would be to do some sanity checks in the callback: does IER have bit 6 set? Is IRP somewhere valid? I'm not sure how much this would help, since if IRP were pointing to some invalid place it would mean the code had already crashed, without much indication of why.
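
A minimal sketch of such a check is below. The register values are passed in as plain integers so the logic can be tried off-target; on the DSP they would be read from the IER and IRP control registers. The flag names, the struct, and the list of valid code ranges are all hypothetical, and the ranges would have to be filled in from the map file.

```c
#include <stddef.h>
#include <stdint.h>

#define IER_TIMER_BIT     (1u << 6)  /* IER bit 6, per the post */

#define SANITY_IER6_CLEAR 0x1u       /* timer interrupt enable bit is off */
#define SANITY_IRP_BAD    0x2u       /* IRP is outside every known code range */

struct code_range { uint32_t lo, hi; };  /* half-open range [lo, hi) */

/* Hypothetical check: returns 0 if nothing looks wrong, otherwise a mask
 * of SANITY_* flags. ranges/n describe where code is expected to live. */
static unsigned sanity_check(uint32_t ier, uint32_t irp,
                             const struct code_range *ranges, size_t n)
{
    unsigned flags = 0;
    int irp_ok = 0;
    size_t i;

    if ((ier & IER_TIMER_BIT) == 0)
        flags |= SANITY_IER6_CLEAR;

    for (i = 0; i < n; i++) {
        if (irp >= ranges[i].lo && irp < ranges[i].hi) {
            irp_ok = 1;
            break;
        }
    }
    if (!irp_ok)
        flags |= SANITY_IRP_BAD;

    return flags;
}
```

Recording the returned mask together with the offending IRP in a persistent variable would leave something to inspect after the crash.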

Post by **SJHardy** » Fri Mar 29, 2024 7:15 pm

I already have a user callback that is used to get accurate time stamps for certain input bit changes, such as touch probe measurements. I think it's a good idea to add a sanity check in there.

The only problem I can see is if IER[6] was turned off, then it would never get inside the user callback since that's the 90us periodic timeout. But it's certainly an option once I am closing in on a few possibilities. As of now, I think I'm making slight progress using a combination of CCS and my own function hook code + post mortem.

Here are a few [in]consistencies that I've noticed:

- The last thread executing is thread 7 (supervisor/housekeeping etc.)
- The second last thread executing is 2 (the axis homing routine.)
- Nothing is perfectly consistent, but thread 2 seems to prematurely end after it has modified chan soft limits. Since this board is not connected to a physical machine, motors etc., the homing state machine is mostly bypassed. This is a well-tested piece of code at least on the Kflop.
- I can't see any consistency in where thread 7 ends up, but it's just doing something perfectly routine.
- Interrupts get disabled and it finds some random piece of looping code to get stuck in. (The JTAG debugger seems to have a means of detecting this, and it seems to automatically break into the debugger and disasm where it's at. Maybe some sort of timeout?)

Homing routine is state-machine based, so all axes home at the same time, but we do have the ability to home one at a time, so I'm next going to try splitting it up with 1 sec delays between each, and see if any more clues are forthcoming. I did notice before that homing just XYZ allowed it to stagger on for a bit longer than homing XYZAB.

Post by **SJHardy** » Sat Mar 30, 2024 12:44 am

Q: how does one coax CCS into allowing breakpoints to be set by C program/line number? At present I can step through, and it highlights the C source line correctly, but breakpoints can only be set on known symbols, which forces me to create dumb little functions all over the place. If I try to set on a C source line, it complains that it doesn't know the source file (!)

More info: I inserted lots of calls to Delay_sec(0.4) in the homing thread (2), to give the supervisor (7) plenty of time to run through several cycles of its housekeeping. First axis to home (Z, axis 2) goes fine. Second axis (X, 0) runs up until it makes a call to Zero(i) where i is a local int containing the axis number, 0. I can single step over that call, but then the next time it calls Delay_sec(0.4), it never returns. This is the most consistent I've seen it. Previously, I was using Delay_sec(0.04) but that would make it fail at a later point. In theory, the supe housekeeping cycle takes 0.036, but maybe with all the debugging stuff enabled it's taking longer than expected.

Anyway, to get more detail I was stepping in assembler thru the Delay_sec() routine etc., and it actually returned, but while stepping through some more it entered then was returning from another Delay_sec() call and suddenly CCS locked up on me (but after going to a weird place after 0xC0080000 i.e. first thread, which is also waiting for things to finish at that point - maybe an interrupt is happening and it switches context back to that thread). I wasn't mentally acute enough to keep track of all 64 regs in my head so missed where B3 was getting set incorrectly. Until now, I'd never encountered a context switch while stepping in asm. But it seems CCS gets all bent out of shape if that happens.

Next thing to suspect is the compiler. Currently I'm using -g but also with -O2 --opt_for_space so it might be wise to turn all that stuff off. It's throwing off my debugging a bit.

Post by **TomKerekes** » Sat Mar 30, 2024 6:10 pm

> The only problem I can see is if IER[6] was turned off, then it would never get inside the user callback since that's the 90us periodic timeout.

Good point, which I hadn't thought through. But I'm not sure that is necessarily true, because of a quirk of this processor. Instructions are pipelined over 5 cycles, so if an interrupt occurs during those 5 cycles the interrupt will still happen: the pipeline will be flushed (turning off the interrupt), and then the interrupt routine will be entered with interrupts turned off. I know this can happen with global interrupts, but I'm not sure about the IER. It is unlikely to be an issue if some random code just turns off IER[6] once. But with code that frequently switches interrupts off momentarily, such as certain library functions, it is quite likely.

> Interrupts get disabled and it finds some random piece of looping code to get stuck in. (The JTAG debugger seems to have a means of detecting this, and it seems to automatically break into the debugger and disasm where it's at. Maybe some sort of timeout?)

I can't think of how a break that wasn't set would occur. I think software breakpoints are implemented as a special instruction code, so if that were somehow executed it might break.

> Q: how does one coax CCS into allowing breakpoints to be set by C program/line number? At present I can step through, and it highlights the C source line correctly, but breakpoints can only be set on known symbols, which forces me to create dumb little functions all over the place. If I try to set on a C source line, it complains that it doesn't know the source file (!)

I don't have a problem with this. I "Add Symbols" for my User Program, then open the User C Program as a file, then right-click on a line of C code and select Breakpoint - hardware breakpoint. The breakpoint is set, and when the C Program is loaded and executed it breaks. If the code is optimized, lines might be removed, reordered, and so on, which makes setting breakpoints difficult.

> Anyway, to get more detail I was stepping in assembler thru the Delay_sec() routine etc., and it actually returned, but while stepping through some more it entered then was returning from another Delay_sec() call and suddenly CCS locked up on me (but after going to a weird place after 0xC0080000 i.e. first thread, which is also waiting for things to finish at that point - maybe an interrupt is happening and it switches context back to that thread). I wasn't mentally acute enough to keep track of all 64 regs in my head so missed where B3 was getting set incorrectly. Until now, I'd never encountered a context switch while stepping in asm. But it seems CCS gets all bent out of shape if that happens.

I don't think interrupts are normally allowed when single stepping. You might set a breakpoint in your CallBack to check. Below is the C code and assembly for Delay_sec. I've also attached the C File timerCounter.c which contains Time_sec if that helps.

Note there is an array called SaveIRP which contains where each Thread was last pre-empted if that helps.

// Delay time in seconds

void Delay_sec(double sec)
{
	register double tf=Time_sec()+sec;
	
	while (Time_sec() < tf) ;
}

Delay_sec:
           STW     .D2T2   B3,*SP--(8)       ; |545| 
           STDW    .D2T1   A11:A10,*SP--     ; |545| 
           SUB     .D2     SP,16,SP          ; |545| 
           STDW    .D2T1   A5:A4,*+SP(8)     ; |545| 
           CALLP   .S2     Time_sec,B3
           LDDW    .D2T2   *+SP(8),B5:B4     ; |546| 
           NOP             4
           ADDDP   .L1X    B5:B4,A5:A4,A11:A10 ; |546| 
           NOP             6
           CALLP   .S2     Time_sec,B3
           CMPLTDP .S1     A5:A4,A11:A10,A0  ; |548| 
           NOP             1
   [!A0]   BNOP    .S1     $C$L31,5          ; |548| 
$C$L30:    
           CALLP   .S2     Time_sec,B3
           CMPLTDP .S1     A5:A4,A11:A10,A0  ; |548| 
           NOP             1
   [ A0]   BNOP    .S1     $C$L30,5          ; |548| 
$C$L31:    
           ADDK    .S2     16,SP             ; |549| 
           LDDW    .D2T1   *++SP,A11:A10
           NOP             4
           LDW     .D2T2   *++SP(8),B3       ; |549| 
           NOP             4
           RETNOP  .S2     B3,5

double Time_sec(void)
{
    unsigned long long x=TIMERLL0;
    unsigned int lsw = x & 0xffffffff;
    unsigned int msw = x >> 32;

    double result = msw * 4294967296.0 + lsw;
    result *= (1.0/CLOCKFREQ);

    return result;
}

> Next thing to suspect is the compiler. Currently I'm using -g but also with -O2 --opt_for_space so it might be wise to turn all that stuff off. It's throwing off my debugging a bit.

Yes, if you can see the problem without any optimization, things will be much simpler to debug.

Post by **SJHardy** » Tue Apr 02, 2024 7:00 pm

Hi Tom,

What is the resolution of TIMERLL0? I couldn't find doc. for it in the TI literature.

Until now, I didn't realize the resolution of Delay_sec() might be in the microsecond or better, but I'm thinking of replacing some ad-hoc delay loops in my code to consistently use the firmware function.

I have some SPI bit bang code which I think is giving some trouble. It needs some minimum sub-microsecond delays so that the SPI clock speed doesn't exceed 8MHz.

Post by **SJHardy** » Tue Apr 02, 2024 8:01 pm

I've finally had a bit of a breakthrough.

As described before, my homing procedure runs in thread 2, while the supervisor continues to run housekeeping in thread 7. But in addition, thread 1 is a stub program that launches the homing procedure (by executing thread 2 via the supe->ref_axes() call listed below) and then waits until thread 2 is done. The PC application is basically using the execution of thread 1 as a status indicator for the completion of homing.

If that all seems a bit roundabout, then I'd agree, but the whole algorithm evolved somewhat over several years and has accumulated a few inefficiencies.

Anyway, the thread 1 code is this (omitting some boilerplate etc for clarity):

#define NOWAIT      // Debugging: define to return after launch

void main(void)
{
    INIT_SUPE;

    // Stub invoker for homing thread.
    if (supe->ref_axes())
        return; // could not start
    #ifndef NOWAIT
    for (;;) {
        Delay_sec(0.5);
        if (supe->thd_function != THD_FUNC_REF_AXES)
            break;
    }
    #endif
}

If NOWAIT is defined, then it works! Otherwise, the normal case, it fails after a while.

It seems that concurrent calls to Delay_sec() are confusing the firmware somehow. It does seem to explain why single stepping with CCS unexpectedly ends up in thread 1's Delay_sec() instead of thread 4's. There is probably some corner case where context switching is not working correctly.

I can certainly change my code to be more straightforward, but you might want to look into the context switching to see if what I'm saying has any relation to reality. The chances are I'm doing something to break it, but it would be good to know where to look.

What do you think?

Post by **TomKerekes** » Wed Apr 03, 2024 1:30 am

Hi Steve,

> What is the resolution of TIMERLL0? I couldn't find doc. for it in the TI literature.

64-bit free running Timer 0 clocks at 228MHz. See CLOCKFREQ. Resolution ~ 4.4ns
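
For the bit-banged SPI case mentioned earlier, the tick arithmetic can be sketched as follows. An 8MHz clock has a 125ns period, so each half-period needs at least 62.5ns, which at 228MHz rounds up to 15 ticks (about 65.8ns). The helper name is hypothetical and the 228MHz figure is simply the rate quoted above; whether a polling loop can actually resolve so few ticks depends on loop overhead, which is not measured here.

```c
#include <stdint.h>

#define TIMER_HZ 228000000ull  /* Timer 0 rate quoted above; see CLOCKFREQ */

/* Hypothetical helper: smallest number of timer ticks that covers at
 * least `ps` picoseconds.
 * ticks = ceil(ps * 228e6 / 1e12) = ceil(ps * 228 / 1e6) */
static uint64_t ticks_for_ps(uint64_t ps)
{
    return (ps * (TIMER_HZ / 1000000ull) + 999999ull) / 1000000ull;
}
```

For example, ticks_for_ps(62500) gives the 15 ticks for one SPI half-period, and ticks_for_ps(1000000) gives 228 ticks for one microsecond.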

Is thd_function defined to be volatile?

> It seems that concurrent calls to Delay_sec() are confusing the firmware somehow.

You might put a breakpoint in Delay_sec to see what is wrong. Is the tf value reasonable?

I can't think of any reason why context switching would cause an issue with Time_sec(). Everything is on the stack, with no global references except the Timer. All registers should be saved/restored during a context switch. The timer is read with a 64-bit read (LDDW) so shouldn't be interruptible. Below is the assembly code. It's compiled with debug on and optimization off, so there is a lot of unnecessary loading/storing to the stack, but that shouldn't cause a problem.

;******************************************************************************
;* FUNCTION NAME: Time_sec                                                    *
;*                                                                            *
;*   Regs Modified     : A3,A4,A5,B4,B5,B6,B7,B8,SP                           *
;*   Regs Used         : A3,A4,A5,B3,B4,B5,B6,B7,B8,SP                        *
;*   Local Frame Size  : 0 Args + 28 Auto + 0 Save = 28 byte                  *
;******************************************************************************
Time_sec:
           ADDK    .S2     -32,SP            ; |109| 
           MVKL    .S1     0x1f0d010,A3
           MVKH    .S1     0x1f0d010,A3
           LDDW    .D1T1   *A3,A5:A4         ; |110| 
           NOP             4
           STDW    .D2T1   A5:A4,*+SP(8)     ; |110| 
           LDDW    .D2T2   *+SP(8),B5:B4     ; |111| 
           NOP             4
           AND     .L2     -1,B4,B4          ; |111| 
           STW     .D2T2   B4,*+SP(16)       ; |111| 
           LDDW    .D2T2   *+SP(8),B5:B4     ; |112| 
           NOP             4
           STW     .D2T2   B5,*+SP(20)       ; |112| 
           MV      .L2     B5,B4
           INTDPU  .L2     B4,B5:B4          ; |114| 
           ZERO    .L2     B7
           MVKH    .S2     0x41f00000,B7
           LDW     .D2T2   *+SP(16),B8       ; |114| 
           ZERO    .S2     B6                ; |114| 
           MPYDP   .M2     B7:B6,B5:B4,B5:B4 ; |114| 
           NOP             5
           INTDPU  .L1X    B8,A5:A4          ; |114| 
           NOP             5
           ADDDP   .L2X    A5:A4,B5:B4,B5:B4 ; |114| 
           NOP             6
           STDW    .D2T2   B5:B4,*+SP(24)    ; |114| 
           MVKL    .S1     0x3e32d66b,A5
           MVKL    .S1     0x5f1d1cb1,A4
           MVKH    .S1     0x3e32d66b,A5
           MVKH    .S1     0x5f1d1cb1,A4
           MPYDP   .M1X    A5:A4,B5:B4,A5:A4 ; |115| 
           NOP             9
           STDW    .D2T1   A5:A4,*+SP(24)    ; |115| 
           ADDK    .S2     32,SP             ; |118| 
           RETNOP  .S2     B3,5              ; |118|

Post by **SJHardy** » Wed Apr 03, 2024 3:58 am

> Is thd_function defined to be volatile?

No, but I don't think the compiler should be optimizing that out since it doesn't really know what Delay_sec() might do. I probably should declare the "supe" pointer as pointer to volatile, which would remove any question in this and other cases, but at present doing so causes a whole bunch of compiler warnings. I would think that any such errors on my part would cause the thread to loop endlessly, but should not end up with disabled interrupts.
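
One way to get a guaranteed fresh read without redeclaring the whole "supe" pointer (and triggering the warnings mentioned above) is to read just that one field through a pointer-to-volatile. The sketch below assumes thd_function is an int; the struct is a stand-in, since the real supervisor structure is not shown in the thread. Compilers generally treat an access through a volatile-qualified lvalue as a volatile access, but that is worth confirming in the generated assembly.

```c
/* Hypothetical stand-in for the real supervisor structure; only the
 * field name thd_function is taken from the post, its type is assumed. */
struct supe_state {
    int thd_function;
};

/* Read thd_function through a pointer-to-volatile so the compiler must
 * reload it from memory on every call, without changing the declaration
 * of the struct or of the pointer used elsewhere. */
static int read_thd_function(const struct supe_state *s)
{
    return *(const volatile int *)&s->thd_function;
}
```

The polling loop would then compare read_thd_function(supe) against THD_FUNC_REF_AXES instead of reading supe->thd_function directly.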

I just tried replacing the Delay_sec(0.5) with a WNTS and it failed in the same way.

After 3 weeks of looking at this, I'm fairly sure it's not my code doing anything wrong, especially since exactly the same code has been running for years on the kflop. I've meticulously searched for the usual suspects like uninitialized data, bad function pointers etc. and never found anything suspect. What I have found is that when 3 threads are all in a Delay_sec() or WNTS, then things go all pear shaped - but not immediately. I'd break it down into a simple example, but since we've moved towards a more specialized platform, it's not that easy.

Probably, where I'll go from here is to find all such cases where unnecessary stub threads are waiting, and change to something more efficient. It does the same thing with touch probing, and some other M codes, so that'll be fun.

Post by **TomKerekes** » Wed Apr 03, 2024 6:54 am

Hi Steve,

> No, but I don't think the compiler should be optimizing that out since it doesn't really know what Delay_sec() might do.

I'm not sure. The TI compiler can be crazy aggressive in some cases. Once I had a hard-to-find bug (regarding lathe Threading with KFLOP) where code would fill in a structure of parameters for the Threading and then set a flag to begin the Threading. I found that the servo interrupt would sometimes begin the Threading before all the parameters were set, even though the line of C code to set the flag was last. The compiler was optimizing and setting the parameters and flag in a re-ordered and parallel manner. This was very hard to find because it would only happen if the servo interrupt happened right in the middle of setting the structure and flag. I tried a number of things, such as declaring the flag volatile and setting the flag in a separate function, and the compiler still set it too early. This was obvious looking at the assembly code. Researching the C specification, I found the compiler has the freedom to do this: as long as everything that is specified in the C code gets performed, there is no guarantee on order. I finally created a special assembly routine called SetBarrier(int *p) to set the flag to 1. The optimizer doesn't seem to be able to see through this. Here is a good article.
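
For comparison, standard C11 offers a compiler-level fence for this publish-then-flag pattern. The sketch below is not the SetBarrier routine and assumes a toolchain that provides <stdatomic.h>, which the TI compiler in use here may not; all names are invented for illustration. A volatile flag alone only orders volatile accesses against each other, so the ordinary stores to the parameter block can still move past it.

```c
#include <stdatomic.h>

/* Hypothetical parameter block and start flag, loosely modeled on the
 * Threading example above; none of these names come from the firmware. */
struct thread_params {
    double pitch;
    double depth;
};

static struct thread_params params;
static atomic_int start_flag;

static void start_threading(double pitch, double depth)
{
    params.pitch = pitch;
    params.depth = depth;

    /* Compiler-only fence: the stores above may not be moved below the
     * flag store that follows. No hardware barrier is emitted. */
    atomic_signal_fence(memory_order_release);

    /* The interrupt routine polls this flag. */
    atomic_store_explicit(&start_flag, 1, memory_order_relaxed);
}
```

The interrupt side would load the flag first, issue atomic_signal_fence(memory_order_acquire), and only then read params. Where the header is unavailable, an out-of-line assembly routine such as the one described above serves the same purpose.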

On another occasion I recall trying to write a benchmark to sum an array of numbers. I'd fill the array with a count, then sum the array, then print the result. When I turned on optimization it would just print the result! Even though the summing was moved to a separate function.

I don't know if any of this is related to your problem but wanted to mention it. It isn't clear if you are testing with optimization turned off or not.

I did try a test with multiple Threads looping Delay_sec() and I don't see any issue. I just put variations of:

#pragma TI_COMPILER(3)
#include "KMotionDef.h"

void main() 
{
	for (;;)
	{
		Delay_sec(0.115);
		ch0->Dest++;
	}
}

into Threads #1, #2, and #7 and I don't see an issue. They all keep looping and no crashes. I also tried all 7 Threads for over an hour. First with all 0.5 sec delays, then mix of random times.

When you said Thread #1 "failed" without the NOWAIT define what did you mean? Does it hang in Thread #1's Delay_sec()? Crash? Shut off interrupts?

> I would think that any such errors on my part would cause the thread to loop endlessly, but should not end up with disabled interrupts.

I would agree.

It might be useful to know what INIT_SUPE and supe->ref_axes() actually do. Could you post the assembly code for that simple Thread #1 code?

I guess I still don't understand why you can't just remove chunks of code that don't have any effect on the failure until you eventually get to a minimal simple case that demonstrates the problem.

> Probably, where I'll go from here is to find all such cases where unnecessary stub threads are waiting, and change to something more efficient. It does the same thing with touch probing, and some other M codes, so that'll be fun.

It would be unfortunate if we need to resort to changes as some sort of workaround without a full understanding of the issue.

Sorry for no simple answer.

Dynomotion Forum
