When performing DMA transfers on modern computers (or DSPs, in my case), there are a few points you need to keep in mind in order to be successful. I will review the ones I know matter for Texas Instruments DSP development. I expect most of them apply to other hardware as well.
One of the most important points is the following: DMA transfers do not automatically handle cache coherency. The first time I attempted to use DMA transfers in development, I was moving data directly from one external memory to another external memory (this is also a bad idea, which I'll go over later). When I read the destination data after the transfer completed, it was corrupt. Why? Because when you request a DMA transfer from a cacheable location in external memory, the data is pulled directly from that external memory location, not from the cache. This is a problem if you've been writing to that location previously, because the most recent data may be held only in L2 cache while external memory still holds the old data. You need to issue a cache writeback (or "flush") before you issue a DMA request that reads from a cacheable memory location. Likewise, if the DMA writes to a cacheable location in external memory, you need to invalidate the corresponding cache lines before you read the result; otherwise old data will be retrieved from the cache.
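To make the failure concrete, here is a toy model in plain C that runs on any host. It is only a sketch: a single cache line shadows a single line of external memory, and cpu_write, cpu_read, cache_writeback, cache_invalidate and dma_copy are hypothetical names I made up for the illustration, not TI's cache or DMA API. The 128-byte line size is also an assumption; check your device's documentation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LINE_BYTES 128u /* assumed cache line size, for illustration only */

/* Toy model: one cache line shadowing one line of external memory. */
typedef struct {
    unsigned char ext[LINE_BYTES];   /* what external memory really holds */
    unsigned char cache[LINE_BYTES]; /* the cached copy of that line */
    int valid;                       /* cache currently holds a copy */
    int dirty;                       /* cached copy is newer than ext */
} line_model;

void model_init(line_model *m, unsigned char fill)
{
    memset(m->ext, fill, LINE_BYTES);
    memset(m->cache, 0, LINE_BYTES);
    m->valid = 0;
    m->dirty = 0;
}

/* On a miss, fetch the whole line from external memory. */
static void fill_line(line_model *m)
{
    if (!m->valid) {
        memcpy(m->cache, m->ext, LINE_BYTES);
        m->valid = 1;
    }
}

/* CPU write: lands in the cache only, external memory keeps the old data. */
void cpu_write(line_model *m, size_t off, unsigned char v)
{
    assert(off < LINE_BYTES);
    fill_line(m);
    m->cache[off] = v;
    m->dirty = 1;
}

/* CPU read: served from the cache whenever the line is valid. */
unsigned char cpu_read(line_model *m, size_t off)
{
    assert(off < LINE_BYTES);
    fill_line(m);
    return m->cache[off];
}

/* Writeback: push dirty cached data out to external memory. */
void cache_writeback(line_model *m)
{
    if (m->valid && m->dirty) {
        memcpy(m->ext, m->cache, LINE_BYTES);
        m->dirty = 0;
    }
}

/* Invalidate: drop the cached copy so the next read refetches the line. */
void cache_invalidate(line_model *m)
{
    m->valid = 0;
    m->dirty = 0;
}

/* DMA: external memory to external memory, never looks at either cache. */
void dma_copy(line_model *dst, const line_model *src)
{
    memcpy(dst->ext, src->ext, LINE_BYTES);
}
```

In this model the safe order is cache_writeback on the source, then dma_copy, then cache_invalidate on the destination, and only then cpu_read. Skip the writeback and the DMA moves old data; skip the invalidate and the CPU keeps reading its old cached copy even though external memory is correct.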
To make things more efficient, you need to consider the size of your cache lines and how they relate to your data in external memory. This is important because cache operations always work on entire lines, never partial lines. Don't forget about internal DMA requests here (requests issued in the background by the L2 memory controller). On a cache miss of an external memory location, an entire L2 line's worth of data is fetched via an internal DMA request. So you want to align your data to the cache lines so that each cache operation is as quick as possible, i.e. data that fits in one line costs one line transfer and not two. On TI DSPs this is done with #pragma DATA_ALIGN(buffer, alignment), where the second argument is the alignment in bytes (a power of two), for example the cache line size.
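The cost of a misaligned buffer is easy to work out by hand, but a small helper makes it obvious. This is a sketch in portable C, not TI code: lines_spanned is a hypothetical name, and the 128-byte L2 line size is an assumption that you should check against your device's cache documentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define L2_LINE_BYTES 128u /* assumed L2 line size, check your device */

/*
 * Number of cache lines touched by a buffer of len bytes starting at addr.
 * line must be a power of two.
 */
size_t lines_spanned(uintptr_t addr, size_t len, size_t line)
{
    assert(line != 0 && (line & (line - 1)) == 0);
    if (len == 0) {
        return 0;
    }
    /* Round the first and last byte addresses down to their line starts. */
    uintptr_t first = addr & ~(uintptr_t)(line - 1);
    uintptr_t last = (addr + len - 1) & ~(uintptr_t)(line - 1);
    return (size_t)((last - first) / line) + 1;
}

/*
 * With the TI compiler, the alignment itself would be requested like this
 * (the pragma goes before the definition of the symbol):
 *
 *     #pragma DATA_ALIGN(buffer, 128)
 *     unsigned char buffer[128];
 */
```

For example, a 128-byte buffer at address 0x1000 sits in exactly one line, while the same buffer at 0x1040 straddles two, so every miss, writeback or invalidate on it does twice the work.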
One more thing: as noted earlier, moving data from one external memory to another using DMA is extremely inefficient. Remember that the memory bus is very slow compared to the DSP/CPU clock. That means you are slowed down not by one bus access (the read) but by two (the read and the write), because the data must pass through the L2 memory controller. If at all possible, use internal RAM. On TI DSPs, whatever L2 is left over after cache configuration becomes IRAM (aka ISRAM). To use IRAM on the DSP, you create a new section in the linker command file that points to IRAM, and then in your code you place a buffer in that section with #pragma DATA_SECTION(buffer, "section_name").
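Here is roughly what that looks like, as a sketch under a few assumptions. The section name .iram_data, the memory range name IRAM, the buffer size and the helper sum_via_iram are all made up for the example. The pragma is guarded so the file still builds with a host compiler, and memcpy stands in for the DMA transfer from external memory into IRAM.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define IRAM_BUF_BYTES 4096u /* hypothetical staging buffer size */

/*
 * Linker command file side (assumes a memory range named IRAM is already
 * defined in the MEMORY directive for your device):
 *
 *     SECTIONS
 *     {
 *         .iram_data > IRAM
 *     }
 */
#ifdef __TI_COMPILER_VERSION__
#pragma DATA_SECTION(iram_buf, ".iram_data")
#endif
unsigned char iram_buf[IRAM_BUF_BYTES];

/*
 * Process a large external buffer one chunk at a time through iram_buf.
 * Each chunk crosses the slow external bus once on the way in, and the
 * CPU then works on internal memory only. Here the "processing" is a sum.
 */
unsigned long sum_via_iram(const unsigned char *ext_src, size_t n)
{
    unsigned long sum = 0;
    size_t done = 0;

    assert(ext_src != NULL || n == 0);
    while (done < n) {
        size_t chunk = n - done;
        if (chunk > IRAM_BUF_BYTES) {
            chunk = IRAM_BUF_BYTES;
        }
        memcpy(iram_buf, ext_src + done, chunk); /* stand-in for a DMA into IRAM */
        for (size_t i = 0; i < chunk; i++) {
            sum += iram_buf[i];
        }
        done += chunk;
    }
    return sum;
}
```

On real hardware the memcpy would be replaced by a DMA request plus the cache maintenance described above for the external source buffer. The internal buffer itself is not cached in L2, which is a large part of why it is the better endpoint for a transfer.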