The past few days I've been involved in optimising video sampling-format conversion code on a DM642 DSP. The algorithm itself is simple enough, but when it comes to speeding things up there are always weird things you have to do to save a millisecond. The basic algorithm looks a bit like this.
for half the lines
{
    for all the u samples in the line
    {
        dst_u = avg(u_f1, u_f2)
        dst_v = avg(v_f1, v_f2)
    }
}
u, v being the chroma samples; f1, f2 being the two fields.
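In plain C the inner loop might look like the following naive scalar sketch. The names and the rounded average are my assumptions for illustration, not taken from the original code:

```c
#include <stdint.h>
#include <stddef.h>

/* Naive scalar version of the kernel above: average chroma samples
 * from two fields, one byte at a time. The "+ 1" gives a rounded
 * average, matching what the DSP's packed-average instruction does. */
static void avg_chroma_scalar(const uint8_t *f1, const uint8_t *f2,
                              uint8_t *dst, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++)
        dst[i] = (uint8_t)((f1[i] + f2[i] + 1) >> 1);
}
```

One call per half-line of U samples and one per half-line of V samples would cover the pseudocode above; the point of the rest of the post is how slow this byte-at-a-time version is.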
Texas Instruments has kindly created a processor intrinsics library for the DM642, or any C64x DSP. These intrinsics translate directly to assembly instructions, the advantage being that you can use them in your C/C++ code without having to resort to writing the whole thing in assembly. There are intrinsics for packed operations, such as taking the average of two sets of 4 consecutive bytes: the instruction operates on each byte individually, and the results appear in 4 consecutive bytes. This is a sort of SIMD operation. Usually the most taxing operation is reading memory, so streaming data into the kernel of the operation is vital. By using packed operations we save time by reading 4 bytes at a time. But there's also an instruction for 8-byte memory reads and writes, which makes things even more interesting.
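As a sketch of what such a packed average does, here is a portable C model of the C64x rounded byte average (on the DSP itself you would call the TI compiler's `_avgu4` intrinsic rather than this emulation, and it would become a single instruction):

```c
#include <stdint.h>

/* Portable model of the C64x AVGU4 operation: rounded average of
 * four unsigned bytes packed into each 32-bit word. On the DSP
 * this is one instruction via the _avgu4() intrinsic. */
static uint32_t avgu4(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t ba = (a >> (8 * i)) & 0xFF;
        uint32_t bb = (b >> (8 * i)) & 0xFF;
        r |= ((ba + bb + 1) >> 1) << (8 * i);
    }
    return r;
}

/* Usage: four field-1 chroma bytes averaged against four field-2
 * bytes in one call, e.g. avgu4(0x10203040, 0x12223242). */
```

With interleaved U/V data, one 32-bit word holds two U and two V samples, so a single packed average advances the loop four bytes at a time.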
Since the DSP registers are all 32 bits, when operating on 64 bits of data the compiler automatically pairs up registers to make a longer virtual register. The good thing about this is that after you read in the data, the operations can be done on each 32-bit half separately, eliminating the need for bit shifting and re-assembling before writing the 64-bit data back.
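Putting the two together, here is a portable sketch of the 8-bytes-at-a-time kernel. On the C64x the 64-bit accesses would compile to doubleword load/store instructions (reachable from C through the compiler's aligned 8-byte access intrinsics), with the two halves living in a register pair; here `memcpy` stands in for them, and `avgu4` is the same portable model as above:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Rounded per-byte average, modelling the C64x AVGU4 instruction. */
static uint32_t avgu4(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t ba = (a >> (8 * i)) & 0xFF;
        uint32_t bb = (b >> (8 * i)) & 0xFF;
        r |= ((ba + bb + 1) >> 1) << (8 * i);
    }
    return r;
}

/* Average 8 chroma bytes per iteration: one 64-bit read per field,
 * two independent 32-bit packed averages, one 64-bit write.
 * nbytes is assumed to be a multiple of 8. */
static void avg_line8(const uint8_t *f1, const uint8_t *f2,
                      uint8_t *dst, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i += 8) {
        uint64_t a, b, r;
        memcpy(&a, f1 + i, 8);              /* one 64-bit load */
        memcpy(&b, f2 + i, 8);
        uint32_t lo = avgu4((uint32_t)a, (uint32_t)b);
        uint32_t hi = avgu4((uint32_t)(a >> 32), (uint32_t)(b >> 32));
        r = ((uint64_t)hi << 32) | lo;
        memcpy(dst + i, &r, 8);             /* one 64-bit store */
    }
}
```

Note how the halves never need to be shifted back together on the DSP: each 32-bit average writes straight into its half of the register pair before the doubleword store.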
This is all fine and dandy and increases the parallelism in the generated code. It's always a good idea to generate an assembly listing with extra information and keep it after compilation, so that you can take a look at where you can optimise more.
When you read the assembly listing you can calculate the cycle time for the algorithm fairly accurately, since every instruction represents a cycle. When one instruction takes more than one cycle and there are no other instructions that can be scheduled for dispatch in the next cycle, the compiler inserts nop instructions. Parallel instructions are indicated by ||, and each block of parallel instructions can be dispatched in a single cycle! This makes it very easy to figure out what actually happens in the CPU. A big contrast from Intel CPUs, where the scheduling is speculative and happens inside the chip at runtime rather than at compile time.
When I finished optimising the code I found that the conversion takes something close to 8ms, which is a very long time considering the theoretical time expected is in the microsecond range.
I was even more intrigued to find that the same code runs in close to 2ms on the DM642 eval board. What makes it run faster on one board and slower on another? Now that's a real mystery!
The answer eluded me for some time, until it occurred to me that the memory I was accessing is SDRAM rather than ISRAM. The documented cycle times for memory operations assume ISRAM. Each time SDRAM is read or written there are extra wait states, which do not appear in your assembly listing. So why would one board run faster? Well, it came down to the fact that one board had faster SDRAM than the other. Mystery solved! Anyway, the work wasn't finished yet.
This meant I had to prefetch the data into ISRAM before doing the conversion. ISRAM being a precious, scarce resource, you can't keep a whole frame of video data in there. So the answer is to prefetch 2 lines, work on them, then store the resulting lines back to SDRAM. We tested the prefetching theory using the memcpy(...) routine, and as expected it sped things up, down to 1.3ms.
The next step is to use DMA for the prefetching. Since the DMA can continue in the background while we are busy doing the conversion, we can have two sets of buffers: one being filled by the prefetch while the other is being worked on, alternating between the sets each line.
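A minimal sketch of that ping-pong scheme follows. The `dma_start`/`dma_wait` names are hypothetical and are stubbed out with memcpy so the sketch runs anywhere; on the DM642 they would program an EDMA transfer and wait on its completion event, and the two line buffers would be placed in ISRAM:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define LINE 32  /* bytes per chroma line; illustrative size */

/* Hypothetical stand-ins for the EDMA driver, modelled with memcpy.
 * A real dma_start() would submit an EDMA transfer that proceeds in
 * the background, and dma_wait() would block until it completes. */
static void dma_start(uint8_t *dst, const uint8_t *src, size_t n)
{
    memcpy(dst, src, n);
}
static void dma_wait(void)
{
    /* transfer is already "complete" in this memcpy model */
}

/* Placeholder for the packed-average conversion kernel. */
static void convert_line(const uint8_t *src, uint8_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Ping-pong buffering: while line k is converted out of one buffer,
 * line k+1 is fetched into the other; the roles swap every line. */
static void process_frame(const uint8_t *frame, uint8_t *out, int nlines)
{
    static uint8_t buf[2][LINE];  /* would live in ISRAM on the DSP */
    int cur = 0;

    dma_start(buf[cur], frame, LINE);       /* prime the first line */
    for (int line = 0; line < nlines; line++) {
        dma_wait();                          /* current line is ready */
        if (line + 1 < nlines)               /* fetch next in background */
            dma_start(buf[cur ^ 1], frame + (size_t)(line + 1) * LINE, LINE);
        convert_line(buf[cur], out + (size_t)line * LINE, LINE);
        cur ^= 1;                            /* swap ping and pong */
    }
}
```

The key property is that the DMA for line k+1 overlaps the conversion of line k, so the SDRAM wait states are hidden behind useful work instead of stalling the kernel.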