Dynomotion

Group: DynoMotion Message: 11377 From: Hardy Family Date: 4/13/2015
Subject: Using assembler code
I have 9 kflop inputs that need to be monitored for state changes (index pulse on encoders, probe inputs etc.).  The index pulses in particular are fleeting, since the linear scales are 1um resolution, so if the axis is moving at 10mm/sec, then the pulse only lasts for 100us.  Currently, I can only sample at about 1ms intervals, which limits the axis speed to 1mm/sec or else I can miss a pulse.  I would like to increase the speed to 10mm/sec, which means that I would have to sample in the servo loop (90us).  Trouble is, TCC generates fairly slow code so I cannot get all the sampling done within 5us.
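
As a rough check of the timing figures above, here is a minimal C sketch of the arithmetic. The function names are made up for this illustration, and it assumes the index pulse is exactly one scale count wide.

```c
/* Width in microseconds of a pulse that spans one scale count. */
double pulse_width_us(double resolution_um, double speed_mm_per_s)
{
    /* 1 mm/s equals 1000 um/s, which is 0.001 um per microsecond. */
    double speed_um_per_us = speed_mm_per_s * 1000.0 / 1.0e6;
    return resolution_um / speed_um_per_us;
}

/* Highest speed in mm/s at which the pulse still lasts one full sample period. */
double max_speed_mm_per_s(double resolution_um, double sample_period_us)
{
    /* um per us is the same as m/s; multiply by 1000 for mm/s. */
    return resolution_um / sample_period_us * 1000.0;
}
```

With a 1um scale, a 1000us sample period gives a limit of about 1mm/sec, and a 90us period gives a limit of roughly 11mm/sec, which is why sampling in the servo loop would just cover the 10mm/sec target.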

So, I was thinking of writing some C67 assembler, mainly to take advantage of the delay slots and VLIW parallelism.  I see that tcc has some assembler support, but haven't yet actually tried it.  Does it work for C67?

The DSP actually has more regs than tcc uses (A16-31, B16-31).  I was wondering if those regs are available for threads and/or the servo loop task.  Or are they reserved for use by the BIOS?

I think a good, fairly simple, optimization that tcc could be made to do would be to support the "register" keyword.  (This would be a project for me).  Although it would probably not be worth doing unless those extra regs were available.

Regards,
SJH

Group: DynoMotion Message: 11379 From: Tom Kerekes Date: 4/13/2015
Subject: Re: Using assembler code
Hi SJH,

I think it would be a daunting task to develop an optimizing C compiler or write assembly code for this.

Usually the speed of TCC67 isn't important because it is just calling optimized routines anyway.

Consider using the TI Optimizing C compiler.  See:


Another optimization is that you can read 8 Input bits with one read of the FPGA instead of performing individual ReadBit Calls if that is what you are doing.
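
Once 8 input bits arrive in a single value, state changes on all of them can be found with a couple of bitwise operations. The sketch below is only an illustration: the function names are hypothetical, and the actual FPGA read is not shown in this thread, so the snapshots are simply passed in as parameters.

```c
#include <stdint.h>

/* Bits that were 0 in the previous snapshot and are 1 in the current one. */
uint8_t rising_edges(uint8_t prev, uint8_t cur)
{
    return (uint8_t)(~prev & cur);
}

/* Bits whose state differs between the two snapshots. */
uint8_t changed_bits(uint8_t prev, uint8_t cur)
{
    return (uint8_t)(prev ^ cur);
}
```

The caller would keep the last snapshot, take a new one each sample, and act only on the bits these functions report.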

Do you really need to monitor all 9 all the time?  Do you know what is causing your loop to take 1ms?

Regards
TK




Group: DynoMotion Message: 11380 From: Hardy Family Date: 4/13/2015
Subject: Re: Using assembler code
Thanks for the info.  I wasn't aware that there was a free TI compiler, so it certainly makes sense to look at that first!  It doesn't worry me too much that the TI compiler will be a lot slower to compile than TCC, since for our machine the critical "supervisor" code will be flashed in.

Also the thread on programming was interesting - I actually used some of those techniques to "link" to functions in the supervisor thread, except I use a big chunk at the end of the gather buffer instead of persist vars.

I wasn't really thinking of an optimizer, but just allowing "manual" optimization using the C "register" storage class.  That would not have been an excessive amount of work; I think the improvement could be quite dramatic.  Register ops are 5ns vs. 30ns for memory accesses.

The code already optimizes FPGA access, but it was the load and store of axis positions which was a bit of a killer.  The first code attempt was overly general; I subsequently cut it down to just what was needed, but I could do more.  For example, the index pulses only need to be found at init time, so it could obviously include a flag which enables/disables those particular inputs.  So yes, you are right that all 9 inputs don't need constant monitoring.  It was simpler to just monitor them all the time.  Well, as Einstein was reputed to say, things should be as simple as possible, but no simpler.
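
The enable/disable flag idea could look something like the following C sketch. Everything here is hypothetical: the bit assignment (six index pulses, three probe inputs), the struct, and the function names are invented for illustration and are not taken from the actual supervisor code.

```c
#include <stdint.h>

/* Hypothetical bit assignment: bits 0-5 are index pulses, bits 6-8 are probes. */
#define INDEX_INPUTS 0x03Fu
#define PROBE_INPUTS 0x1C0u

typedef struct {
    uint16_t enabled;  /* inputs currently being watched */
    uint16_t latched;  /* inputs seen active since the last clear */
} input_monitor;

/* Record any enabled input that is active in this sample. */
void monitor_sample(input_monitor *m, uint16_t inputs)
{
    m->latched |= (uint16_t)(inputs & m->enabled);
}

/* Once homing is done, stop watching the index pulses. */
void monitor_finish_init(input_monitor *m)
{
    m->enabled &= (uint16_t)~INDEX_INPUTS;
}
```

After `monitor_finish_init` the per-sample work touches only the probe bits, so the index inputs cost nothing during normal operation.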

The loop time of 1ms is because I am doing it in the main supervisor loop, which is continually doing a bunch of other stuff, such as handling the I2C bus master emulation, responding to interlocks and fault conditions, baby-sitting the spindle and so on.

Regards,
SJH






Group: DynoMotion Message: 11425 From: Hardy Family Date: 4/17/2015
Subject: Re: Using assembler code
The Texas Instruments compiler is basically working for me now.  I set up some Python code to automate the process.

One thing that I am curious about is what the "-ml3" option does with cl6x.  Initially, I had removed that option since it wasn't documented in the compiler manual, but without that option refs to symbols in the bss get relocated to the wrong place (even though there are explicit symbol defs in the linker commands file).

Regards,
SJH






Group: DynoMotion Message: 11429 From: Tom Kerekes Date: 4/18/2015
Subject: Re: Using assembler code
Hi SJH,

I think -ml3 has to do with the near/far option that has been deprecated in newer versions.  Near data and functions can be faster when grouped into a 16-bit page, so their constant address can be loaded with one instruction instead of two.  So the compiler just assumes it can load the address with one instruction.  At link time the linker puts in the actual address and for some reason doesn't complain if it's more than 16 bits.  KFLOP is compiled to use all 32-bit "far" addresses.
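
To see why a one-instruction load goes wrong for a 32-bit address, here is a small C sketch of the arithmetic only. It assumes a 16-bit sign-extended immediate load optionally followed by an upper-half load, in the style of the C6000 MVKL/MVKH pair; the function names are made up, and real near data access may instead be relative to a data-page pointer, so treat this as a simplified model of the description above.

```c
#include <stdint.h>

/* Value a register holds after loading only the low 16 bits, sign-extended. */
uint32_t load_low16_only(uint32_t addr)
{
    uint32_t lo = addr & 0xFFFFu;
    return (lo & 0x8000u) ? (lo | 0xFFFF0000u) : lo;
}

/* Value after a second instruction overwrites the upper 16 bits. */
uint32_t load_low_then_high(uint32_t addr)
{
    uint32_t reg = load_low16_only(addr);
    return (reg & 0x0000FFFFu) | (addr & 0xFFFF0000u);
}
```

For an address such as 0x10019208, the single load ends up at 0xFFFF9208 while the two-instruction form recovers the full address, which would match the symptom of bss references being relocated to the wrong place without -ml3.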

HTH
Regards
TK

Group: DynoMotion Message: 11530 From: Hardy Family Date: 5/14/2015
Subject: Re: Using assembler code
I have been doing some more work with cl6x.  It works for small programs, but for more complex programs the kflop will crash or give incorrect results.  I was wondering if you could provide some insight.

Here are the command lines being used to compile and link:

cl6x "/home/steve/DM6/DM6-SCL-Rev01/user5.c" -i"/home/steve/KMotionX/DSP_KFLOP" -i"/home/steve/DM6/DM6-SCL-Rev01" --output_file="/tmp/user5.c.o" -mv6700+ -ml3 -mu -O2

cl6x -z -c -o "/home/steve/DM6/DM6-SCL-Rev01/user5(1).out" -x -e _main "/tmp/user5.c.o" "/tmp/user5(1).out.cmd"

And the linker command file is:


-c
-heap 15700000
-stack 0x800
_BitDirShadow = 0x10019208;
_fast_fabs = 0x100019bc;
_EnableAxis = 0x800111d8;
_WaitNextTimeSlice = 0x10001180;
_SetBit = 0x10010d20;
_PauseThread = 0x800238b8;
_EnableAxisDest = 0x80010e18;
_Time_sec = 0x10012340;
_Zero = 0x80010b14;
_memset = 0x80025de0;
_ClearBit = 0x10010f28;
_ReadBit = 0x100111b4;
__divf = 0x1000e000;
_gather_buffer = 0x10017c10;
__divd = 0x1000e840;
_chan = 0x10016d98;
_SetBitDirection = 0x80022aa0;
_printf = 0x80023140;
_DefineCoordSystem6 = 0x80024048;
_persist = 0x10015c18;
_DisableAxis = 0x1000dad0;
MEMORY {
IRAM: o = 0x1001c000, l = 0x00004000
THREAD_MEM: o = 0x80050000, l = 0x00010000
SDRAM: o = 0x80100000, l = 0x00f00000
}
SECTIONS {
.placeholder: palign(8), fill = 0xaaaaaaaa {. += 4;} > THREAD_MEM
.text > THREAD_MEM
.far > THREAD_MEM
.const > THREAD_MEM
}


This code works fine with TCC, and it also works fine with cl6x with optimization turned off, but it crashes the kflop with optimization enabled (-O1 or -O2, with or without the software pipelining option -mu).

Testing with various modifications to the code seems to hint that some statically initialized data is not getting initialized to the correct values, or possibly not at the correct address.

Is there anything in the way that cl6x generates an ELF file that might be confusing the object loader on the kflop?  Does the kflop loader code fully implement the ELF format?

Is -mv6700+ the correct architecture?

If the --symdebug:dwarf option is added to the compiler command line, it runs out of space for the .far section:

"/tmp/user5(1).out.cmd", line 34: error: program will not fit into available
   memory.  run placement with alignment fails for section ".far" size 0x13c0 .
   Available memory ranges:
   THREAD_MEM   size: 0x10000      unused: 0x115c       max hole: 0x1158   

(Yes, the program is rather large, since it completely unrolls a fairly complex state machine).  So I was wondering if I can redirect the .far section to use SDRAM memory.  Is this area somehow "managed", or am I going to have to manually make sure that different programs don't clobber each other's SDRAM memory?

Finally, I got the -stack and -heap values from your original sample.  Not sure what to make of the heap 15.7M magic number, but I don't use malloc() so no drama.  But I am curious about the stack.  Where does the stack get allocated?  Is there a separate stack per thread?  Does it get taken out of the thread memory or IRAM?  Can I specify more than 2k?

Sorry about all the questions; I realize it is a bit beyond the call of duty, but we are anticipating adding some very cool features to our machine which will need quite a bit of grunt from the DSP.

Regards,
SJH





Group: DynoMotion Message: 11531 From: Tom Kerekes Date: 5/15/2015
Subject: Re: Using assembler code
Hi SJH,

Not sure how much I can help. 

I assume you are only running one User Thread which is the one you are compiling here?

Here is an example of compiler options for a module we want fully optimized:
[SuperFast.c] "C:\CCStudio_v3.1\C6000\cgtools\bin\cl6x" -k -q -al -as -o3 -fr"c:/KMotionSrc/DSP_KFLOP/Debug" -i"include" -i"mklib" -d"_DEBUG" -mu -ml3 -mv6710 -@"Debug.lkf" "SuperFast.c"


Here is an example of compiler options for a module that doesn't need to be optimized, with debug information:
[Print.c] "C:\CCStudio_v3.1\C6000\cgtools\bin\cl6x" -g -k -q -al -as -fr"c:/KMotionSrc/DSP_KFLOP/Debug" -i"include" -i"mklib" -d"_DEBUG" -mu -ml3 -mv6710 -@"Debug.lkf" "Print.c"


With optimized code the compiler is very aggressive.  Things are out of order and whatnot.  It is important to declare variables volatile and such to force the behavior you need.
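
A typical case where volatile matters is a polling loop on a location that changes outside the program's control. The following is a minimal sketch with a hypothetical function name; without the volatile qualifier, an optimizing compiler is free to read the location once and reuse the value on every pass through the loop.

```c
#include <stdint.h>

/* Poll a location until any bit in mask is set.
   Returns the number of polls that came up empty, or -1 on giving up. */
int poll_until_set(volatile const uint32_t *reg, uint32_t mask, int max_polls)
{
    int i;
    for (i = 0; i < max_polls; i++) {
        /* volatile forces a fresh read of *reg on every iteration. */
        if (*reg & mask) {
            return i;
        }
    }
    return -1;
}
```

The same applies to variables shared between threads or written by an interrupt: qualify the shared object itself, not just the pointer used in one place.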

You might want to place some of your small critical core routines into IRAM.  That usually makes a tremendous difference - often 10X.  The 256bit wide single cycle IRAM is what makes the DSP fast.  
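
With cl6x, one way to steer a single function into a different memory region is the CODE_SECTION pragma plus a matching linker line. The sketch below makes several assumptions: the section name ".iram_text" and the function are invented for illustration, and whether the IRAM range is actually free for user code is not confirmed in this thread.

```c
#include <stdint.h>

/* CODE_SECTION is a TI compiler pragma. The section name ".iram_text" is a
   made-up name for this sketch and would need a matching line such as
   ".iram_text: > IRAM" in the SECTIONS block of the linker command file. */
#ifdef __TI_COMPILER_VERSION__
#pragma CODE_SECTION(latch_inputs, ".iram_text")
#endif
uint32_t latch_inputs(uint32_t latched, uint32_t inputs, uint32_t enabled)
{
    /* Keep previously latched bits and add any enabled input that is active. */
    return latched | (inputs & enabled);
}
```

Only the small, time-critical routines would be tagged this way; the rest of the program stays in thread memory.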

Regarding the Stack and the Heap, those should really not even be there.  Those are all maintained and allocated separately by the main KFLOP Program.  There shouldn't even be any C Initialization generated.  Basically your generated code and data are just loaded into memory, KFLOP already has the Stack set up and so forth, and then KFLOP just jumps directly to your main function.

Note that KFLOP doesn't really even have a program loader.  The COFF loader is on the PC.  You have all the source code for that so you can step through it if you wish.  It parses the .out file and determines where data should go into memory and then passes simple blocks of data (in the form of Address/Data) to KFLOP.

If you had posted your link map we might have been able to see something.


See my comments below:



Group: DynoMotion Message: 11534 From: Hardy Family Date: 5/15/2015
Subject: Re: Using assembler code
Great help, thanks.  I'm just as confused if not more about ELF vs COFF.

I think I found the cause of the problem: the linker was set to "rom model" by default, but the kflop never processes the .cinit section to initialize data.  Changing the linker model to "ram" makes it just copy the .cinit section to wherever required, which seems to work.
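
The difference between the two models comes down to who applies the initial values. Under the ROM model, startup code that normally runs before main walks an initialization table at run time; since KFLOP jumps straight to main, that step never happens. The C sketch below models that table walk with a simplified, hypothetical record layout (the real .cinit format differs), just to show what is being skipped.

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for one initialization record: where the variable
   lives, how many bytes it has, and the initial bytes to copy there. */
typedef struct {
    void       *dest;
    size_t      size;
    const void *init;
} init_record;

/* What run-time startup code would do under the ROM model: walk the table
   and copy each initial value into place. A record with size 0 ends it. */
void apply_init_table(const init_record *table)
{
    for (; table->size != 0; table++) {
        memcpy(table->dest, table->init, table->size);
    }
}
```

If nothing ever runs the equivalent of `apply_init_table`, statically initialized variables keep whatever happened to be in memory, which fits the earlier observation that initialized data did not hold the correct values.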

Now my code runs like a rocket.  It is certainly useful that thread 7 has an extended (320k) area.

So the updated linker commands file is something like:


--entry_point _main
--output_file=/home/steve/DM6/DM6-SCL-Rev01/user5(1).out
--map_file=/home/steve/DM6/DM6-SCL-Rev01/user5(1).out.map
--ram_model
/tmp/user5.c.o
_BitDirShadow = 0x10019208;
<blah blah for other kflop.out symbols>
_DisableAxis = 0x1000dad0;
MEMORY {
IRAM: o = 0x1001c000, l = 0x00004000
THREAD_MEM: o = 0x80050000, l = 0x00010000
SDRAM: o = 0x80100000, l = 0x00f00000
}
SECTIONS {
.placeholder: palign(8), fill = 0xaaaaaaaa {. += 4;} > THREAD_MEM
.text: > THREAD_MEM
.far: > THREAD_MEM
.const: > THREAD_MEM
.cinit: > SDRAM
.switch: > THREAD_MEM
}


The map file will look something like:

MEMORY CONFIGURATION

         name            origin    length      used     unused   attr    fill
----------------------  --------  ---------  --------  --------  ----  --------
  IRAM                  1001c000   00004000  00000000  00004000  RWIX
  THREAD_MEM            80050000   00010000  00006070  00009f90  RWIX
  SDRAM                 80100000   00f00000  00000000  00f00000  RWIX


SECTION ALLOCATION MAP

 output                                  attributes/
section   page    origin      length       input sections
--------  ----  ----------  ----------   ----------------
.text      0    80050000    00004ea0    
                  80050000    00004ea0     user5.c.o (.text)

.far       0    80054ea0    00001018     UNINITIALIZED
                  80054ea0    00001018     user5.c.o (.far)

.const     0    80055eb8    0000019c    
                  80055eb8    0000019c     user5.c.o (.const:.string)

.switch    0    80056054    00000014    
                  80056054    00000014     user5.c.o (.switch:_get_input)

.placeholder
*          0    80056068    00000008    
                  80056068    00000008     --HOLE-- [fill = aaaaaaaa]

.cinit     0    80100000    00000434     COPY SECTION
                  80100000    0000042c     user5.c.o (.cinit)
                  8010042c    00000004     --HOLE--
                  80100430    00000004     --HOLE-- [fill = 0]


Regards,
SJH

