28 February 2008

Convolution and Decimation in the C Language

If you are working with DSP systems, or learning about them, you are going to want to know how to use filters to shape and select the signals you work with. For example, you might know that in order to only pass low frequency signals, you create an ideal low-pass filter (which is a square wave), and multiply that by the corresponding frequency spectrum of the input signal. Simple enough, but how does this actually translate to embedded systems?

You are most likely going to be doing convolution. You probably already know about the relationship between convolution in the time-domain, and multiplication in the frequency-domain. And if you are like me, the concept of convolution is much more abstract than multiplying, so the task of convolving two signals on a DSP seems daunting at first. It turns out it's not that bad, let's take a look at a quick example with two square waves. You can easily replace the two square waves with real life signals. Click the image to see the following code:


Let's look at the top-half of the code. The arrays signal_x, and filter_h correspond to the signal to be filtered and the filter respectively. And the array result_y will hold the result of the convolution. Note that the size of the results array will need to be large enough to hold the combination of the sizes of signal_x and filter_h.

Next the signal_x values are generated on the fly to be a large square wave, obviously this is specific to this example and signal_x will be different in the real world. Then the convolution is performed using the appropriate arrays. The 'for' loop will increment from 0 to the size of the input array minus the size of the filter array. This is because if you let it proceed further, the fetched value for signal_x within the accumulation loop will be trash. As you can see, the so-called accumulation 'for' loop simply multiplies signal_x and filter_h from the current location through to the end of the filter taps. After the accumulation the result is stored in result_y. Remember that normally you would 'flip' one of the signals before convolving them, but since the filter is symmetrical you can ignore that step. Figures of signal_x, filter_h, and the resulting convolution are shown below (using GNU Octave):



Note the 4th figure talks about decimation. The last half of the example code demonstrates another useful technique in DSP and that is reducing the sampling rate of some signal by using decimation. The most important thing to remember about decimation is that you need to filter it before hand to avoid aliasing. As you may have guessed, this code is really more related to decimation than convolution if you replace filter_h with some anti-aliasing filter. The code is identical except for two important points; one is that now you are incrementing the index by the decimation factor. It's that simple. But, you need to leave the accumulation loop incrementing by 1 though so you are filtering using useful information. The second important point isn't shown but it is that your results array, here result_y, can be reduced in size when you allocate it beforehand. It's a waste to filter a signal and store it in some array A with size X and then decimate it in another loop and save it in another array B with size X/2? Obviously this code is a great space saver, and can be done because by definition decimation is throwing away output data.

That's it! I hope this has helped you understand implementing convolution and decimation a little better.

26 February 2008

Whole System Timing

Sometimes when developing on DSPs we need to keep precise timing records between inputs and outputs. For example, the time between an incoming sinusoid, work performed on it, and output sinusoid might be expected to be 5ms in a requirement. But care should be taken to be clear on what the requirements mean by time between "input" and "output". It could be referring to the actual DSP or it could be referring to the entire system. Each component has a delay attributed to it. For example just because the signal is output from the DSP doesn't mean it won't go through filters, upconverters, digital to analog converters, splitters, amplifiers, etc. These could be in the form of FPGAs from Xilinx, other DSPs, or separate "black boxes". You need to ensure that the timing development reflects the entire system. Otherwise the system may be completely non-functional.

Cache Coherency and DMA Transfers

When performing DMA transfers on modern computers (or DSPs in my case), it is important to remember some important points in order to be successful. I will review the points I know are important to Texas Instruments DSP development. I'm sure they are applicable to other hardware as well.

One of the most important points is the following: DMA transfers do not automatically handle cache coherency. The first time I attempted to use DMA transfers in development I was moving data directly from one external memory to another external memory (this is also a bad idea, which I'll go over later). Upon completion when the destination data was read, it was corrupt, why? Because when you request a DMA transfer from a cacheable memory location in external memory, the L2 controller pulls the data directly from that external memory location. This is a problem if you've been writing to that location previously, because the most recent data will only be held in L2 cache and only old data will reside in external memory. You need to issue a cache writeback or "flush" before you issue a DMA request that reads from a cacheable memory location. Likewise, if you are writing to external memory you need to invalidate cache lines, otherwise old data will be retrieved from the cache.

To make things more efficient, you need to consider the size of your cache lines and how they relate to your data in external memory. This is important because cache operations are entire line operations, not partial lines. Don't forget about internal DMA requests here (requests issued in the background by the L2 memory controller). On a cache miss of an external memory location, an entire L2 line's worth of data is fetched via an internal DMA request. So you want to "align" your data to the cache lines so each cache operation is as quick as possible, ie it only performs one DMA transfer and not two. On TI DSPs this is done with #pragma DATA_ALIGN(buffer, section).

One more thing, like was noted before, moving data from one external memory to another using DMA is extremely inefficient. Remember that the memory bus speed is very slow compared to the DSP/CPU clock. So that means that not only are you extremely slowed by one bus (read), but two (write), because the data must pass through the L2 memory controller. If at all possible use internal RAM. On TI DSPs, the leftover L2 after cache configuration becomes IRAM (aka ISRAM). To use IRAM on the DSP you would create a new section in the linker file that points to IRAM, and then in your code you would make a buffer reside in that section by #pragma DATA_SECTION(buffer, section).