memset in a loop ... or not?
From: Jake
Date: 15-09-10 15:08
I have a workbuffer with values that need to be re-arranged,
so...initially...I did it like this:
for (i = (N - 1); i >= 0; i--)
{
    workbuffer[Q * i] = workbuffer[i];
    memset(&workbuffer[(Q * i) + 1], 0, (Q-1) * sizeof(int16));
}
but I was told not to use memset. I don't know exactly why I am not allowed
to use memset in a loop.
I guess it's not efficient enough? So I changed the code to this:
for (i = (N - 1); i >= 0; i--)
{
    workbuffer[Q * i] = workbuffer[i];
    for (j = 0; j < (Q - 1); j++)
    {
        workbuffer[(Q * i) + 1 + j] = 0;
    }
}
Let's say we have a 16-cell workbuffer B.
Four values have been stored in the first 4 cells of the workbuffer: B[0],
B[1], B[2] and B[3]; the remaining B[k] for k=4 to k=15 are undefined. The
code must re-arrange the 4 values so the workbuffer looks like this:
B[0],0,0,0,B[1],0,0,0,B[2],0,0,0,B[3],0,0,0
The code is for an interpolator, and in the above example N is the number of
samples in the workbuffer before re-arrangement. So N would be 4, and Q
would be an interpolation factor equal to 4.
Any suggestions for improvement?
Comments about not using memset in a loop are also welcomed.
Thank you.
Comment from: Bertel Brander
Date: 15-09-10 19:00

On 15-09-2010 16:07, Jake wrote:
> I have a workbuffer with values that needs to be re-arranged,
> so...initially...I did it like this:
>
> for (i = (N - 1); i >= 0; i--)
> {
> workbuffer[Q * i] = workbuffer[i];
> memset(&workbuffer[(Q * i) + 1], 0, (Q-1) * sizeof(int16));
> }
>
> but I was told not to use memset. I don't know exactly why I am not allowed
> to use memset in a loop.
The compiler has every chance to make memset at least as efficient
as anything else you can do, so if you need to set some memory
to something, go ahead and use memset.
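It is in general not a good idea to loop backwards from N to 0,
although this in-place expansion needs it: a forward loop would
overwrite workbuffer[1] to workbuffer[N-1] before they are read.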
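To make the comparison concrete, here is a minimal sketch of both variants from the question as standalone functions. The function names and parameters are made up for illustration, and it assumes the thread's int16 is a 16-bit signed type like int16_t and that the buffer holds at least q * n elements. Both walk from the highest index down, because the expansion happens in place.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: the thread's int16 behaves like int16_t. */

/* Zero-stuff in place with memset; buf must hold at least q * n elements. */
static void zero_stuff_memset(int16_t *buf, int n, int q)
{
    assert(n >= 0 && q >= 1);
    /* Highest index first: slot q*i is never below i, so no unread
       sample is overwritten. */
    for (int i = n - 1; i >= 0; i--) {
        buf[q * i] = buf[i];
        memset(&buf[q * i + 1], 0, (size_t)(q - 1) * sizeof buf[0]);
    }
}

/* Same result with an explicit inner loop instead of memset. */
static void zero_stuff_loop(int16_t *buf, int n, int q)
{
    assert(n >= 0 && q >= 1);
    for (int i = n - 1; i >= 0; i--) {
        buf[q * i] = buf[i];
        for (int j = 1; j < q; j++)
            buf[q * i + j] = 0;
    }
}
```

With n = 4 and q = 4 on a 16-cell buffer, both should produce the layout described in the question; which one is faster is a separate matter that has to be measured.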
> I guess it's not efficient enough? So I changed the code to this:
>
> for (i = (N - 1); i >= 0; i--)
> {
> workbuffer[Q * i] = workbuffer[i];
> for (j = 0; j < (Q - 1); j++)
> {
> workbuffer[(Q * i) + 1 + j] = 0;
> }
> }
>
> Let's say we have a 16 cell workbuffer B.
>
> Four values have been stored in the first 4 cells in the workbuffer: B[0],
> B[1], B[2] and B[3] the remaining B[k] for k=4 to k=15 are undefined. The
> code must re-arrange the 4 values so the workbuffer looks like this:
>
> B[0],0,0,0,B[1],0,0,0,B[2],0,0,0,B[3],0,0,0
>
> The code is for an interpolator and in the above example N is the number of
> samples in the workbuffer before re-arrangement. So N would be 4! And Q
> would be an interpolation factor equal to 4.
>
> Any suggestions for improvement?
For small blocks of memory, it can be a good idea to
"unroll" loops, so if Q in your case is small, it might
be better to:
for (i = N - 1; i >= 0; --i)  /* backwards: a forward in-place pass would overwrite unread samples */
{
    workbuffer[Q * i] = workbuffer[i];
    workbuffer[(Q * i) + 1] = 0;  /* Q == 4, so Q - 1 == 3 zeros per sample */
    workbuffer[(Q * i) + 2] = 0;
    workbuffer[(Q * i) + 3] = 0;
}
But as for any optimization, first check if you need
to do the optimization and then measure what is the
most efficient solution.
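A rough way to do that measurement is sketched below. The sizes, the repetition count, and all names are illustrative assumptions, not values from the question, and clock() only gives coarse CPU time, so treat the numbers as a first hint rather than a verdict.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

enum { N = 1024, Q = 4, REPS = 20000 };  /* illustrative sizes, not from the thread */

static volatile int16_t sink;  /* discourages the compiler from discarding the work */

static void stuff_memset(int16_t *buf)
{
    for (int i = N - 1; i >= 0; i--) {
        buf[Q * i] = buf[i];
        memset(&buf[Q * i + 1], 0, (Q - 1) * sizeof buf[0]);
    }
}

static void stuff_loop(int16_t *buf)
{
    for (int i = N - 1; i >= 0; i--) {
        buf[Q * i] = buf[i];
        for (int j = 1; j < Q; j++)
            buf[Q * i + j] = 0;
    }
}

/* CPU seconds spent on REPS runs of fn; buf must hold N * Q elements. */
static double time_variant(void (*fn)(int16_t *), int16_t *buf)
{
    assert(fn != NULL && buf != NULL);
    clock_t start = clock();
    for (int r = 0; r < REPS; r++) {
        for (int i = 0; i < N; i++)
            buf[i] = (int16_t)(i + r);   /* refill the N input samples */
        fn(buf);
        sink = buf[Q * (N - 1)];         /* read one result back */
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_variant(stuff_memset, buf) and time_variant(stuff_loop, buf) on a buffer of N * Q elements and printing both values gives a first comparison. The outcome depends on the compiler, its optimization flags, and the CPU.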
Comment from: Arne Vajhøj
Date: 16-09-10 02:31

On 15-09-2010 13:59, Bertel Brander wrote:
> On 15-09-2010 16:07, Jake wrote:
>> I guess it's not efficient enough? So I changed the code to this:
>>
>> for (i = (N - 1); i >= 0; i--)
>> {
>> workbuffer[Q * i] = workbuffer[i];
>> for (j = 0; j < (Q - 1); j++)
>> {
>> workbuffer[(Q * i) + 1 + j] = 0;
>> }
>> }
>>
>> Let's say we have a 16 cell workbuffer B.
>>
>> Four values have been stored in the first 4 cells in the workbuffer: B[0],
>> B[1], B[2] and B[3] the remaining B[k] for k=4 to k=15 are undefined. The
>> code must re-arrange the 4 values so the workbuffer looks like this:
>>
>> B[0],0,0,0,B[1],0,0,0,B[2],0,0,0,B[3],0,0,0
>>
>> The code is for an interpolator and in the above example N is the number of
>> samples in the workbuffer before re-arrangement. So N would be 4! And Q
>> would be an interpolation factor equal to 4.
>>
>> Any suggestions for improvement?
>
> For small blocks of memory, it can be a good idea to
> "unroll" loops, so if Q in your case is small, it might
> be better to:
>
> for (i = N - 1; i >= 0; --i)  /* backwards: a forward in-place pass would overwrite unread samples */
> {
>     workbuffer[Q * i] = workbuffer[i];
>     workbuffer[(Q * i) + 1] = 0;  /* Q == 4, so Q - 1 == 3 zeros per sample */
>     workbuffer[(Q * i) + 2] = 0;
>     workbuffer[(Q * i) + 3] = 0;
> }
I consider manual loop unrolling a thing of the
past (late '80s, early '90s).
Today I would expect the compiler to do that type
of optimization (possibly controlled by a compiler directive).
Arne
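As a sketch of what leaving it to the compiler can look like: if the interpolation factor is a compile-time constant, the inner loop has a fixed trip count and an optimizing compiler is free to unroll it. The macro Q = 4 and the function name are assumptions for illustration. Whether unrolling actually happens depends on the compiler and its settings (GCC, for instance, has an -funroll-loops option and a "#pragma GCC unroll" directive), so check the generated code if it matters.

```c
#include <assert.h>
#include <stdint.h>

#define Q 4  /* assumption: the interpolation factor is known at compile time */

/* Zero-stuff in place; buf must hold at least Q * n elements. */
static void zero_stuff_const_q(int16_t *buf, int n)
{
    assert(n >= 0);
    for (int i = n - 1; i >= 0; i--) {
        buf[Q * i] = buf[i];
        /* Fixed trip count of Q - 1: the compiler may unroll this itself. */
        for (int j = 1; j < Q; j++)
            buf[Q * i + j] = 0;
    }
}
```

The source stays a plain loop, so changing Q later does not require editing a hand-unrolled body.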
Comment from: Arne Vajhøj
Date: 16-09-10 02:29

On 15-09-2010 10:07, Jake wrote:
> I have a workbuffer with values that needs to be re-arranged,
> so...initially...I did it like this:
>
> for (i = (N - 1); i >= 0; i--)
> {
> workbuffer[Q * i] = workbuffer[i];
> memset(&workbuffer[(Q * i) + 1], 0, (Q-1) * sizeof(int16));
> }
>
> but I was told not to use memset. I don't know exactly why I am not allowed
> to use memset in a loop.
I think you should ask why.
> I guess it's not efficient enough? So I changed the code to this:
>
> for (i = (N - 1); i >= 0; i--)
> {
> workbuffer[Q * i] = workbuffer[i];
> for (j = 0; j < (Q - 1); j++)
> {
> workbuffer[(Q * i) + 1 + j] = 0;
> }
> }
>
> Let's say we have a 16 cell workbuffer B.
>
> Four values have been stored in the first 4 cells in the workbuffer: B[0],
> B[1], B[2] and B[3] the remaining B[k] for k=4 to k=15 are undefined. The
> code must re-arrange the 4 values so the workbuffer looks like this:
>
> B[0],0,0,0,B[1],0,0,0,B[2],0,0,0,B[3],0,0,0
>
> The code is for an interpolator and in the above example N is the number of
> samples in the workbuffer before re-arrangement. So N would be 4! And Q
> would be an interpolation factor equal to 4.
>
> Any suggestions for improvement?
I am skeptical about this being faster than memset.
I think it is safe to assume that the memset code has been
optimized - it is very unlikely to be less optimized than your loop.
memset may use a special instruction for the specific CPU
architecture instead of a loop.
The only drawback of memset I can think of is function call
overhead. But then many compilers allow inlining of that call.
Arne
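If the per-iteration call overhead is the worry, one alternative (not something proposed in the question, and only possible if a separate output buffer is available, which is an assumption here) is to clear the whole destination with a single memset and then copy the samples into every q-th slot. The function name and parameters are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Out-of-place zero-stuffing: dst must hold q * n elements and must not
   overlap src. */
static void zero_stuff_copy(int16_t *dst, const int16_t *src, int n, int q)
{
    assert(n >= 0 && q >= 1);
    /* One call clears the whole destination. */
    memset(dst, 0, (size_t)n * (size_t)q * sizeof dst[0]);
    /* Forward order is fine here because src and dst are separate buffers. */
    for (int i = 0; i < n; i++)
        dst[q * i] = src[i];
}
```

This writes every sample slot twice (once as zero, once with the value), so it is not automatically faster than the in-place versions; it trades a second buffer for a single memset call.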