Implement Thumb-2 optimized memcpy/memset #67

jserv · 2014-02-06T20:32:29Z

The kernel/lib directory contains implementations of memcpy and memset, but they are too generic. We can use several ARM Cortex-M3/M4 specific features to optimize them:

- Thumb-2
  - apply 32-bit aligned data copy in the inner loop. This is not strictly necessary on Cortex-M3/M4, but it could help external memory access, depending on the memory controller.
- unaligned memory access
- PLD instruction to preload cache with memory source
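As an illustration of the "32-bit aligned data copy in the inner loop" idea, here is a minimal sketch in portable C. It is not the routine from kernel/lib or from any other project: the name `memcpy_word` and the `word_alias_t` typedef are hypothetical, and the `may_alias` attribute assumes a GCC or Clang toolchain. It copies bytes until the destination is word aligned, then moves one 32-bit word per iteration, and falls back to byte copies when source and destination have different alignment.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: a 32-bit word that may alias any object.
 * may_alias is a GCC/Clang extension (assumed toolchain). */
typedef uint32_t __attribute__((may_alias)) word_alias_t;

/* Hypothetical name, for illustration only. */
void *memcpy_word(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Use the word loop only when both pointers share the same
     * offset within a 4-byte word. */
    if ((((uintptr_t)d ^ (uintptr_t)s) & 3u) == 0) {
        /* Head: byte copy until the destination is 4-byte aligned. */
        while (len && ((uintptr_t)d & 3u)) {
            *d++ = *s++;
            len--;
        }

        /* Body: one aligned 32-bit word per iteration. */
        word_alias_t *dw = (word_alias_t *)d;
        const word_alias_t *sw = (const word_alias_t *)s;
        while (len >= 4) {
            *dw++ = *sw++;
            len -= 4;
        }
        d = (unsigned char *)dw;
        s = (const unsigned char *)sw;
    }

    /* Tail, or the whole buffer when the alignments differ. */
    while (len--)
        *d++ = *s++;

    return dst;
}
```

A hand-written Thumb-2 version would likely go further (for example with multi-register loads and stores), so this sketch is best treated as a baseline rather than as the optimized routine the issue asks for.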

jserv · 2014-02-06T21:10:13Z

Reference:

jserv · 2014-03-13T20:24:29Z

lk implements arm-m optimized memcpy and memset routines in git commit littlekernel/lk@33b94d9

gapry · 2014-07-29T03:31:48Z

@jserv The profile result:

[image: unalignment]
[image: alignment]

jserv · 2014-07-29T03:53:50Z

It looks so weird. Can you explain?

gapry · 2014-07-29T04:01:09Z

@jserv The implementation is in this branch:
https://github.com/gapry/f9-kernel/blob/benchmark_memcpy/benchmark/benchmark.c

My approach is to measure each case, aligned and unaligned, five times and take the average time. Assuming my approach is correct, the data imply that the unaligned case is faster than the aligned case after the optimization on the STM32F407.

jserv · 2014-07-29T04:13:32Z

@gapry To clarify the performance gain, please compare the optimized memcpy routines with a plain byte-oriented C version.

gapry · 2014-07-29T04:15:01Z

@jserv What does "plain byte-oriented" mean?

jserv · 2014-07-29T04:20:00Z

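The simplest, least efficient implementation of memcpy:

#include <stddef.h>

void *memcpy(void *dst, const void *src, size_t len)
{
    char *d = (char *)dst;              /* destination cursor */
    const char *s = (const char *)src;  /* source cursor */
    while (len--)
        *d++ = *s++;                    /* copy one byte per iteration */
    return dst;
}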

gapry · 2014-07-29T10:14:02Z

@jserv For now, I use DWT to measure the elapsed clock cycles. You can check the commit: https://github.com/gapry/f9-kernel/commit/33e58dfcb1105140365132269c596763531e9ede

and the complete implementation: https://github.com/gapry/f9-kernel/blob/benchmark_memcpy/benchmark/benchmark.c

The profile result:

[image: unalignment]
[image: alignment]
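For readers unfamiliar with DWT-based timing, the following is a minimal sketch of how the cycle counter is usually enabled and read on an ARMv7-M core. It is not taken from the linked commit. The function names (`cycle_counter_init`, `cycle_counter_read`, `cycles_elapsed`) are hypothetical, while the register addresses and bit positions follow the ARMv7-M architecture. The register accesses are only compiled for Cortex-M3/M4 targets, so the file also builds on a host.

```c
#include <stdint.h>

/* Core debug registers (ARMv7-M memory map). */
#define DEMCR              (*(volatile uint32_t *)0xE000EDFCu)
#define DWT_CTRL           (*(volatile uint32_t *)0xE0001000u)
#define DWT_CYCCNT         (*(volatile uint32_t *)0xE0001004u)
#define DEMCR_TRCENA       (1u << 24)  /* enable the DWT unit */
#define DWT_CTRL_CYCCNTENA (1u << 0)   /* enable the cycle counter */

#if defined(__ARM_ARCH_7M__) || defined(__ARM_ARCH_7EM__)
/* Hypothetical helper: turn on the DWT unit and start counting from 0. */
static inline void cycle_counter_init(void)
{
    DEMCR |= DEMCR_TRCENA;
    DWT_CYCCNT = 0;
    DWT_CTRL |= DWT_CTRL_CYCCNTENA;
}

/* Hypothetical helper: read the free-running 32-bit cycle counter. */
static inline uint32_t cycle_counter_read(void)
{
    return DWT_CYCCNT;
}
#endif

/* Elapsed cycles between two readings. Unsigned subtraction stays
 * correct across a single wraparound of the 32-bit counter. */
static inline uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}
```

On the target, a measurement would read the counter before and after the call under test and pass both readings to `cycles_elapsed`. Interrupts and flash wait states can still perturb individual readings, so a single sample should be treated with caution.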

jserv · 2014-07-29T12:49:33Z

@gapry I don't think your benchmark is valid, since it doesn't show the variance. There must be something wrong.
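One way to report the spread asked for here is to keep every per-run cycle count and summarize the samples, rather than printing only an average. The sketch below is an assumption about how that could look, not code from the benchmark branch: `struct bench_stats` and `bench_summarize` are hypothetical names, and it uses `double`, which is software floating point on a Cortex-M3.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of one benchmark case. */
struct bench_stats {
    double   mean;      /* arithmetic mean of the samples */
    double   variance;  /* sample variance (divides by n - 1) */
    uint32_t min;       /* fastest run, in cycles */
    uint32_t max;       /* slowest run, in cycles */
};

/* Hypothetical helper: summarize n cycle-count samples. */
static struct bench_stats bench_summarize(const uint32_t *cycles, size_t n)
{
    struct bench_stats st = { 0.0, 0.0, 0, 0 };
    if (n == 0)
        return st;

    double sum = 0.0;
    st.min = st.max = cycles[0];
    for (size_t i = 0; i < n; i++) {
        sum += (double)cycles[i];
        if (cycles[i] < st.min) st.min = cycles[i];
        if (cycles[i] > st.max) st.max = cycles[i];
    }
    st.mean = sum / (double)n;

    /* Variance needs at least two samples. */
    if (n > 1) {
        double sq = 0.0;
        for (size_t i = 0; i < n; i++) {
            double diff = (double)cycles[i] - st.mean;
            sq += diff * diff;
        }
        st.variance = sq / (double)(n - 1);
    }
    return st;
}
```

With five runs per case, as described earlier in the thread, printing mean, variance, min, and max for the aligned and unaligned cases side by side would make it easier to tell whether the difference between them is larger than the run-to-run noise.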

jserv assigned gapry Jul 17, 2014
