LeafLabs Garden » Topic: DMA memcpy mem2mem

http://forums.leaflabs.com/topic.php?id=1849

Vasudev on "DMA memcpy mem2mem" http://forums.leaflabs.com/topic.php?id=1849&page=2#post-105411 Tue, 27 May 2014 10:52:05 +0000
Thanks Manitou. Yes, but those functions were for embedded systems. I was looking for a simple dma_memcpy implementation for an x86 system from userspace. I searched a lot, but because of my lack of device driver experience I was unable to do a DMA memcpy. If you have done DMA memcpy on x86 systems, that would be very helpful. Thanks

manitou on "DMA memcpy mem2mem" http://forums.leaflabs.com/topic.php?id=1849&page=2#post-105401 Fri, 23 May 2014 16:25:55 +0000
The first entry in this thread has the functions I used for DMA memcpy()... Performance comparisons with teensy and Due are here: https://github.com/manitou48/DUEZoo/blob/master/mem2mem.txt

Vasudev on "DMA memcpy mem2mem" http://forums.leaflabs.com/topic.php?id=1849&page=2#post-105398 Thu, 22 May 2014 05:04:58 +0000
Hi, can you tell me how to implement dma_memcpy? Many thanks in advance.

gbulmer on "DMA memcpy mem2mem" http://forums.leaflabs.com/topic.php?id=1849&page=2#post-11319 Tue, 19 Jun 2012 07:57:36 +0000
manitou - this is a fascinating set of experiments you've come up with. Thank you for sharing.
<blockquote>
baseline: 1024 32-bit word
  DMA memcpy: 96us (microseconds)
  lib memcpy: 58us (total 154us)
dueling memcpy: gather micros() timestamp in DMA isr. Start DMA memcpy followed by lib memcpy ...
  DMA memcpy: 119us
  lib memcpy: 64us
</blockquote>
So lib memcpy is about 6us (about 10%) slower with DMA running, and DMA memcpy is about 25us (about 25%) slower, with lib memcpy running for 64us of its run time. Have you tried running two lib memcpy's, one straight after the other, so that DMA memcpy overlaps with lib memcpy for all/most of its run time? Are DMA memcpy and lib memcpy reading or writing the same memory? Is it feasible to run two DMA memcpy's on different DMA controllers concurrently?

manitou on "DMA memcpy mem2mem" http://forums.leaflabs.com/topic.php?id=1849&page=2#post-11202 Mon, 11 Jun 2012 07:29:16 +0000
> When do you start counting for the DMA memcpy?
Starting the clock right before dma_enable knocks off a few microseconds.

blackswords on "DMA memcpy mem2mem" http://forums.leaflabs.com/topic.php?id=1849#post-11200 Mon, 11 Jun 2012 02:27:40 +0000
manitou > When do you start counting for the DMA memcpy? If it's before memcpy32(), you lose some time initializing the DMA, so you measure not only the copy time but also the initialization time. It would be better to use a DMA init function, then a start function with just this inside it:
    dma_enable(DMAn, DMA_CHn); //enable it..
    while(!DMADONE);
    dma_disable(DMAn, DMA_CHn);

manitou on "DMA memcpy mem2mem" http://forums.leaflabs.com/topic.php?id=1849#post-11197 Sun, 10 Jun 2012 16:44:22 +0000
update
> DMA memcpy: 119us lib memcpy: 64us (total 183us)
Actually, the elapsed time for the concurrent copies was 119us (551 megabits/sec), which is faster than the serial baseline time of 154us. Though two serial lib memcpy's would take 116us.
manitou on "DMA memcpy mem2mem" http://forums.leaflabs.com/topic.php?id=1849#post-11195 Sun, 10 Jun 2012 10:29:24 +0000
> running the DMA memcpy at the same time as ordinary memcpy
baseline: 1024 32-bit word
  DMA memcpy: 96us (microseconds)
  lib memcpy: 58us (total 154us)
dueling memcpy: gather micros() timestamp in DMA isr. Start DMA memcpy followed by lib memcpy ...
  DMA memcpy: 119us
  lib memcpy: 64us (total 183us)
Though DMA memcpy started first, it completed after the lib memcpy. I used four vectors of 1024 32-bit words.

gbulmer on "DMA memcpy mem2mem" http://forums.leaflabs.com/topic.php?id=1849#post-11186 Sat, 09 Jun 2012 16:27:24 +0000
manitou - interesting. Have you tried running the DMA memcpy at the same time as ordinary memcpy? Do they slow each other down?

manitou on "DMA memcpy mem2mem" http://forums.leaflabs.com/topic.php?id=1849#post-11184 Sat, 09 Jun 2012 14:12:57 +0000
I took a look at the disassembled code for memcpy() in IDE 0.12, which uses newlib 1.17, I think. If it can, it uses an unrolled loop of 16 ldr.w/str.w word load/store pairs to move 64 bytes per loop iteration. That memcpy() takes about 59 microseconds to move 1024 32-bit words. I then built a version using the ARM memcpy.S from newlib 1.20, which uses an unrolled loop of ldrd/strd to move 64 bytes per loop iteration, and that took 53us. So newlib in IDE 0.12 is doing just fine.

mbolivar on "DMA memcpy mem2mem" http://forums.leaflabs.com/topic.php?id=1849#post-11101 Wed, 06 Jun 2012 03:40:47 +0000
<blockquote>
dma.c doesn't have a function for querying DMA_CNDTRx, so i hacked
    volatile uint32_t *dmacnt = (uint32_t *) 0x4002000C; ...
    while(*dmacnt);
</blockquote>
That's what register maps are for. Use <code>DMAx_BASE->CNDTRy</code>, e.g. <code>DMA1_BASE->CNDTR1</code>.
gbulmer on "DMA memcpy mem2mem" http://forums.leaflabs.com/topic.php?id=1849#post-11084 Tue, 05 Jun 2012 13:29:47 +0000
'my little "b" in mbs is megabits/second' - Mb is megabits, MB is megabytes. 'mbs'? I guessed wrong :-)
'No change' - that is surprising. 348 megabits/sec = 43.5MBytes/sec seems a bit of a weird number, and I'm surprised it isn't faster; assuming 4-byte transfers, that is under 11M reads and 11M writes per second. I wonder if the processor and DMA are contending for the bus matrix? The bus-architecture diagrams show the two DMA controllers as having separate connections to the bus matrix. Have you tried setting up both DMA controllers and using them simultaneously? The other thought is to 'wait for event' to get the processor off the bus, but I've never tried that.

manitou on "DMA memcpy mem2mem" http://forums.leaflabs.com/topic.php?id=1849#post-11083 Tue, 05 Jun 2012 13:13:07 +0000
Re: newlib
> There doesn't seem to be a specialised version of memcpy or memset for arm ...
The newlib-1.20 at redhat does have ARM assembler versions (memcpy.S), which use LDRD/STRD assuming suitable sizes/alignment.
<pre><code>1:
    .irp offset, #0, #8, #16, #24, #32, #40, #48, #56
    ldrd r4, r5, [r1, \offset]
    strd r4, r5, [r0, \offset]
    .endr
    add r0, r0, #64
    add r1, r1, #64
    subs r2, r2, #64
    bge 1b</code></pre>
That loop runs close to 940 megabits/sec at 72MHz, if I counted cycles correctly...

manitou on "DMA memcpy mem2mem" http://forums.leaflabs.com/topic.php?id=1849#post-11079 Tue, 05 Jun 2012 08:09:48 +0000
> while(!DMADONE); might have consumed significant memory bandwidth, and caused DMA to be slower
dma.c doesn't have a function for querying DMA_CNDTRx, so i hacked
    volatile uint32_t *dmacnt = (uint32_t *) 0x4002000C; ...
    while(*dmacnt);
No change, DMAs of 1000 4-byte words still take 93us (348 megabits/sec)

manitou on "DMA memcpy mem2mem" http://forums.leaflabs.com/topic.php?id=1849#post-11078 Tue, 05 Jun 2012 05:59:56 +0000
> Assuming 72MHz, and 4 bytes per clock cycle, would give a maximum of 288MBytes/second of bandwidth.
My little "b" in mbs is megabits/second, so 288MBs = 2304mbs
> while(!DMADONE); might have consumed significant memory bandwidth, and caused DMA to be slower than it should be.
Good thought. I'll check it out.
> Maple 'standard C library' is newlib
Linux kernels often have good hard-coded memcpy/memset. I think the fastest use LDM/STM. Here is a discussion of memcpy on Cortex A8: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13544.html
Thanks for the insights.