The past few days I've been involved in optimising video sampling-format conversion code on a DM642 DSP. The algorithm itself is simple enough, but when it comes to speeding things up there are always weird things you have to do to save a millisecond. The basic algorithm looks a bit like this.
for half the lines
{
    for all the u samples in the line
    {
        dst_u = avg(u_f1, u_f2)
        dst_v = avg(v_f1, v_f2)
    }
}
u, v being the chroma samples; f1, f2 being the two fields.
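In plain C the inner loop might look like the following naive scalar sketch. The names and the rounded average are my assumptions for illustration, not taken from the original code:

```c
#include <stdint.h>
#include <stddef.h>

/* Naive scalar version of the kernel above: average chroma samples
 * from two fields, one byte at a time. The "+ 1" gives a rounded
 * average, matching what the DSP's packed-average instruction does. */
static void avg_chroma_scalar(const uint8_t *f1, const uint8_t *f2,
                              uint8_t *dst, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++)
        dst[i] = (uint8_t)((f1[i] + f2[i] + 1) >> 1);
}
```

One call per half-line of U samples and one per half-line of V samples would cover the pseudocode above; the point of the rest of the post is how slow this byte-at-a-time version is.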
Texas Instruments has kindly created a processor intrinsics library for the DM642, or any C64x DSP. These intrinsics translate directly to assembly instructions, the advantage being that you can use them in your C/C++ code without having to resort to writing the whole thing in assembly. There are intrinsics for packed operations, such as taking the average of two sets of 4 consecutive bytes: the instruction operates on each byte individually, and the results appear in 4 consecutive bytes. This is a sort of SIMD operation. Usually the most taxing operation is reading memory, so streaming data into the kernel of the operation is vital. By using packed operations we save time by reading 4 bytes at a time. But there's also an instruction for 8-byte memory reads and writes, which makes things even more interesting.
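As a sketch of what such a packed average does, here is a portable C model of the C64x rounded byte average (on the DSP itself you would call the TI compiler's `_avgu4` intrinsic rather than this emulation, and it would become a single instruction):

```c
#include <stdint.h>

/* Portable model of the C64x AVGU4 operation: rounded average of
 * four unsigned bytes packed into each 32-bit word. On the DSP
 * this is one instruction via the _avgu4() intrinsic. */
static uint32_t avgu4(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t ba = (a >> (8 * i)) & 0xFF;
        uint32_t bb = (b >> (8 * i)) & 0xFF;
        r |= ((ba + bb + 1) >> 1) << (8 * i);
    }
    return r;
}

/* Usage: four field-1 chroma bytes averaged against four field-2
 * bytes in one call, e.g. avgu4(0x10203040, 0x12223242). */
```

With interleaved U/V data, one 32-bit word holds two U and two V samples, so a single packed average advances the loop four bytes at a time.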
Since the DSP registers are all 32 bits, when operating on 64 bits of data the compiler automatically pairs up registers to make a longer virtual register. The good thing about this is that after you read in the data, the operations can be done on each 32-bit half separately, eliminating the need for bit shifting and re-assembling before writing the 64-bit data back.
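Putting the two together, here is a portable sketch of the 8-bytes-at-a-time kernel. On the C64x the 64-bit accesses would compile to doubleword load/store instructions (reachable from C through the compiler's aligned 8-byte access intrinsics), with the two halves living in a register pair; here `memcpy` stands in for them, and `avgu4` is the same portable model as above:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Rounded per-byte average, modelling the C64x AVGU4 instruction. */
static uint32_t avgu4(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t ba = (a >> (8 * i)) & 0xFF;
        uint32_t bb = (b >> (8 * i)) & 0xFF;
        r |= ((ba + bb + 1) >> 1) << (8 * i);
    }
    return r;
}

/* Average 8 chroma bytes per iteration: one 64-bit read per field,
 * two independent 32-bit packed averages, one 64-bit write.
 * nbytes is assumed to be a multiple of 8. */
static void avg_line8(const uint8_t *f1, const uint8_t *f2,
                      uint8_t *dst, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i += 8) {
        uint64_t a, b, r;
        memcpy(&a, f1 + i, 8);              /* one 64-bit load */
        memcpy(&b, f2 + i, 8);
        uint32_t lo = avgu4((uint32_t)a, (uint32_t)b);
        uint32_t hi = avgu4((uint32_t)(a >> 32), (uint32_t)(b >> 32));
        r = ((uint64_t)hi << 32) | lo;
        memcpy(dst + i, &r, 8);             /* one 64-bit store */
    }
}
```

Note how the halves never need to be shifted back together on the DSP: each 32-bit average writes straight into its half of the register pair before the doubleword store.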
This is all fine and dandy and increases the parallelism in the generated code. It's always a good idea to generate an assembly listing with extra information and keep it after compilation, so that you can take a look at where you can optimise more.
When you read the assembly listing you can calculate the cycle time for the algorithm fairly accurately, since every instruction represents a cycle. When one instruction takes more than one cycle and there are no other instructions that can be scheduled for dispatch in the next cycle, the compiler inserts nop instructions. Parallel instructions are indicated by ||, and each block of parallel instructions can be dispatched in a single cycle! This makes it very easy to figure out what actually happens in the CPU. A big contrast from Intel CPUs, where the scheduling is speculative and happens inside the chip at runtime rather than at compile time.
When I finished optimising the code I found that the conversion takes something close to 8ms, which is a very long time considering the theoretical time expected is in the microsecond range.
I was even more intrigued to find that the same code runs in close to 2ms on the DM642 eval board. What makes it run faster on one board and slower on another? Now that's a real mystery!
The answer eluded me for some time, until it occurred to me that the memory I was accessing is SDRAM rather than ISRAM. The documented cycle times for memory operations assume ISRAM. Each time SDRAM is read or written there are extra wait states, which do not appear in your assembly listing. So why would one board run faster? Well, it came down to the fact that one board had faster SDRAM than the other. Mystery solved! Anyway, the work wasn't finished yet.
This meant I had to prefetch the data into ISRAM before doing the conversion. ISRAM being a precious, scarce resource, you can't keep a whole frame of video data in there. So the answer is to prefetch 2 lines, work on them, then store the resulting lines back to SDRAM. We tested the prefetching theory using the memcpy(...) routine, and as expected it sped things up, down to 1.3ms.
The next step is to use DMA for the prefetching. Since the DMA can continue in the background while we are busy doing the conversion, we can have two sets of buffers: one being filled by the prefetch while the other is being worked on, alternating between the sets each line.
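A minimal sketch of that ping-pong scheme follows. The `dma_start`/`dma_wait` names are hypothetical and are stubbed out with memcpy so the sketch runs anywhere; on the DM642 they would program an EDMA transfer and wait on its completion event, and the two line buffers would be placed in ISRAM:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define LINE 32  /* bytes per chroma line; illustrative size */

/* Hypothetical stand-ins for the EDMA driver, modelled with memcpy.
 * A real dma_start() would submit an EDMA transfer that proceeds in
 * the background, and dma_wait() would block until it completes. */
static void dma_start(uint8_t *dst, const uint8_t *src, size_t n)
{
    memcpy(dst, src, n);
}
static void dma_wait(void)
{
    /* transfer is already "complete" in this memcpy model */
}

/* Placeholder for the packed-average conversion kernel. */
static void convert_line(const uint8_t *src, uint8_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Ping-pong buffering: while line k is converted out of one buffer,
 * line k+1 is fetched into the other; the roles swap every line. */
static void process_frame(const uint8_t *frame, uint8_t *out, int nlines)
{
    static uint8_t buf[2][LINE];  /* would live in ISRAM on the DSP */
    int cur = 0;

    dma_start(buf[cur], frame, LINE);       /* prime the first line */
    for (int line = 0; line < nlines; line++) {
        dma_wait();                          /* current line is ready */
        if (line + 1 < nlines)               /* fetch next in background */
            dma_start(buf[cur ^ 1], frame + (size_t)(line + 1) * LINE, LINE);
        convert_line(buf[cur], out + (size_t)line * LINE, LINE);
        cur ^= 1;                            /* swap ping and pong */
    }
}
```

The key property is that the DMA for line k+1 overlaps the conversion of line k, so the SDRAM wait states are hidden behind useful work instead of stalling the kernel.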