Thursday, October 23, 2008

OSLEC code optimization

While working on my first CUDA app, I had a look at OSLEC, as I thought it would be a good candidate for parallel execution.
OSLEC is an open source echo canceller, written by David Rowe and used by Asterisk.
After profiling the code, I found two places that I thought could benefit from some optimization. These were two loops that OSLEC spends considerable time in.
The first looked like this; it is part of the FIR filter in OSLEC:

int i;
int offset1;
int offset2;

fir->history[fir->curr_pos] = sample;

offset2 = fir->curr_pos;
offset1 = fir->taps - offset2;
y = 0;
/* history[] is a circular buffer (write position fir->curr_pos), so the
   dot product is split into two runs to handle the wrap-around */
for (i = fir->taps - 1; i >= offset1; i--)
    y += fir->coeffs[i]*fir->history[i - offset1];
for ( ; i >= 0; i--)
    y += fir->coeffs[i]*fir->history[i + offset2];



By unrolling it, it becomes this:

int i;
int offset1;
int offset2;
int position;

fir->history[fir->curr_pos] = sample;

offset2 = fir->curr_pos;
offset1 = fir->taps - offset2;
y = 0;
// for (i = fir->taps - 1; i >= offset1; i--)
//     y += fir->coeffs[i]*fir->history[i - offset1];

/* first loop unrolled by hand, 16 taps per iteration */
i = fir->taps - 1;
while (i >= offset1)
{
    position = i - offset1;
    y += fir->coeffs[i]*fir->history[position];
    y += fir->coeffs[i-1]*fir->history[position-1];
    y += fir->coeffs[i-2]*fir->history[position-2];
    y += fir->coeffs[i-3]*fir->history[position-3];
    y += fir->coeffs[i-4]*fir->history[position-4];
    y += fir->coeffs[i-5]*fir->history[position-5];
    y += fir->coeffs[i-6]*fir->history[position-6];
    y += fir->coeffs[i-7]*fir->history[position-7];

    y += fir->coeffs[i-8]*fir->history[position-8];
    y += fir->coeffs[i-9]*fir->history[position-9];
    y += fir->coeffs[i-10]*fir->history[position-10];
    y += fir->coeffs[i-11]*fir->history[position-11];
    y += fir->coeffs[i-12]*fir->history[position-12];
    y += fir->coeffs[i-13]*fir->history[position-13];
    y += fir->coeffs[i-14]*fir->history[position-14];
    y += fir->coeffs[i-15]*fir->history[position-15];

    i -= 16;
}
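
One thing to watch with a straight 16-way unroll like this: it only matches the original loop exactly when the trip count (fir->curr_pos here) is a multiple of 16; otherwise the last pass reaches below offset1. A more defensive variant keeps the unrolled body for the bulk of the taps and falls back to a scalar remainder loop for whatever is left. Here is a minimal sketch of that idea for the non-wrapped half of the buffer; fir_sketch_t, fir_run_unwrapped and the 8-way unroll are illustrative, not OSLEC's actual code, and the wrapped half (the second loop above) would get the same treatment.

#include <stdint.h>

/* illustrative stand-in for OSLEC's FIR state */
typedef struct {
    int taps;
    int curr_pos;
    const int16_t *coeffs;
    const int16_t *history;
} fir_sketch_t;

/* dot product over the non-wrapped part of the history buffer,
   unrolled 8x, with a scalar remainder loop so any trip count works */
static int32_t fir_run_unwrapped(const fir_sketch_t *fir)
{
    int32_t y = 0;
    int offset2 = fir->curr_pos;
    int offset1 = fir->taps - offset2;
    int i = fir->taps - 1;

    /* bulk: 8 taps per iteration while at least 8 remain */
    while (i - 7 >= offset1) {
        int position = i - offset1;
        y += fir->coeffs[i]     * fir->history[position];
        y += fir->coeffs[i - 1] * fir->history[position - 1];
        y += fir->coeffs[i - 2] * fir->history[position - 2];
        y += fir->coeffs[i - 3] * fir->history[position - 3];
        y += fir->coeffs[i - 4] * fir->history[position - 4];
        y += fir->coeffs[i - 5] * fir->history[position - 5];
        y += fir->coeffs[i - 6] * fir->history[position - 6];
        y += fir->coeffs[i - 7] * fir->history[position - 7];
        i -= 8;
    }

    /* remainder: whatever is left when curr_pos is not a multiple of 8 */
    for ( ; i >= offset1; i--)
        y += fir->coeffs[i] * fir->history[i - offset1];

    return y;
}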



And guess what: after recompiling there was a 10% speed increase in OSLEC execution.
Then I moved to the second place where OSLEC spends a lot of time, the coefficient calculation.
Unrolling this loop gave roughly another 10% speed increase.
Roughly 20% speed increase in total; that's not bad :)
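
The coefficient update walks the filter taps in much the same down-counting way as the FIR loop above, so the same trick applies. A rough sketch of unrolling a tap-update loop of that general shape; the function name, the factor argument and the >>15 scaling below are placeholders, not OSLEC's exact code.

#include <stdint.h>

/* illustrative LMS-style tap update, unrolled 4x:
   each coefficient is nudged by (history sample * error factor) >> 15 */
static void taps_update_unrolled(int16_t *taps, const int16_t *history,
                                 int n_taps, int32_t factor)
{
    int i = n_taps - 1;

    /* bulk: update 4 taps per iteration */
    while (i >= 3) {
        taps[i]     += (int16_t) ((history[i]     * factor) >> 15);
        taps[i - 1] += (int16_t) ((history[i - 1] * factor) >> 15);
        taps[i - 2] += (int16_t) ((history[i - 2] * factor) >> 15);
        taps[i - 3] += (int16_t) ((history[i - 3] * factor) >> 15);
        i -= 4;
    }

    /* remainder, in case n_taps is not a multiple of 4 */
    for ( ; i >= 0; i--)
        taps[i] += (int16_t) ((history[i] * factor) >> 15);
}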

To verify that the increase was not an x86 "fluke", I added the patches to the OSLEC code in E-IPBX and recompiled the WARP version.
(WARP is a PowerPC-based PBX appliance from PIKA that we are working on.)

I kept the original speedtest program and built two more, adding each patch in turn.
Below are the results: first the unmodified speedtest, then with the first patch, and last with both patches.


root@e-ipbx:/$ speedtest

Testing OSLEC with 128 taps (16 ms tail)
CPU executes 527.63 MIPS
-------------------------

Method 1: gettimeofday() at start and end
601 ms for 10s of speech
31.71 MIPS
16.64 instances possible at 100% CPU load
Method 2: samples clock cycles at start and end
31.71 MIPS
16.64 instances possible at 100% CPU load
Method 3: samples clock cycles for each call, IIR average
cycles_worst 140815 cycles_last 3535 cycles_av: 4036
32.29 MIPS
16.34 instances possible at 100% CPU load

root@e-ipbx:/$ ./speedtest2

Testing OSLEC with 128 taps (16 ms tail)
CPU executes 528.91 MIPS
-------------------------

Method 1: gettimeofday() at start and end
497 ms for 10s of speech
26.29 MIPS
20.12 instances possible at 100% CPU load
Method 2: samples clock cycles at start and end
26.29 MIPS
20.12 instances possible at 100% CPU load
Method 3: samples clock cycles for each call, IIR average
cycles_worst 100679 cycles_last 3046 cycles_av: 3933
31.46 MIPS
16.81 instances possible at 100% CPU load

root@e-ipbx:/$ ./speedtest3

Testing OSLEC with 128 taps (16 ms tail)
CPU executes 527.67 MIPS
-------------------------

Method 1: gettimeofday() at start and end
485 ms for 10s of speech
25.59 MIPS
20.62 instances possible at 100% CPU load
Method 2: samples clock cycles at start and end
25.59 MIPS
20.62 instances possible at 100% CPU load
Method 3: samples clock cycles for each call, IIR average
cycles_worst 142386 cycles_last 2824 cycles_av: 2781
22.25 MIPS
23.72 instances possible at 100% CPU load
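
To put those numbers in perspective: 10 s of speech took 601 ms with the stock code versus 485 ms with both patches, which is where the jump in possible instances comes from (10000 / 601 ≈ 16.6 versus 10000 / 485 ≈ 20.6 echo canceller instances at 100% CPU load), i.e. about four extra channels on the same hardware.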



Not bad for a few lines of code :)
