Hi all, back again with some optimisations on MPlayer. I promised you this some time ago in the IDCT article, so I'd better get started.
YUV2RGB colorspace conversion is needed for displaying a TV/video stream via a GFX framebuffer. TV streams are usually encoded as Y (luminance), Cb (U, the blue-difference chrominance) and Cr (V, the red-difference chrominance). Because the human eye resolves brightness detail much better than color detail, the TV/video data is usually stored with a Y value for each pixel and shared Cb and Cr values for a group of pixels: one pair for every two horizontally adjacent pixels in YUV422 (YCbCr 4:2:2), or one pair for every 2x2 pixel block in YUV420 (YCbCr 4:2:0). This reduces the datastream while maintaining a good balance in colors and brightness.
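To put rough numbers on that reduction, here is a small sketch. The helper names and the 320x240 frame size are my own assumptions for illustration, not something taken from MPlayer or from the test clip.

```c
#include <stddef.h>

/* Bytes needed for one frame of 8 bit samples, given how many pixels
 * share one chroma pair horizontally (hsub) and vertically (vsub).
 * Hypothetical helper; assumes w and h are multiples of hsub and vsub. */
static size_t yuv_frame_bytes(size_t w, size_t h, size_t hsub, size_t vsub)
{
    size_t luma   = w * h;                        /* one Y per pixel */
    size_t chroma = 2 * (w / hsub) * (h / vsub);  /* Cb plane + Cr plane */
    return luma + chroma;
}

/* Packed 24 bit RGB: three bytes per pixel. */
static size_t rgb24_frame_bytes(size_t w, size_t h)
{
    return 3 * w * h;
}
```

For a 320x240 frame this gives 230400 bytes as RGB24, 153600 bytes with one chroma pair per two horizontal pixels (hsub=2, vsub=1), and 115200 bytes with one chroma pair per 2x2 block (hsub=2, vsub=2).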
http://www.fourcc.org/ has some nice info on pixel formats
Now the only SGI machine that can display YUV422 as a native format is, surprise surprise, the O2. Vegac has made a nifty new video output plugin for mplayer, called vo_crm.c, which uses YUV422 as output on the O2.
Mere mortals without an O2, like me, must still rely on a software colorspace converter, because we are still busy testing hardware colorspace conversion with extensions like SGIS_color_matrix or SGIX_pixel_texture (I posted about this in the graphics forum).
So how expensive is the colorspace conversion when running an AVI or MPEG?
Here comes ssrun again (the machine is my I2 HI+TRAM, 6.5.22m with MIPSpro 7.4.2):
ssrun -exp fpcsampx ./mplayer -vo gl2 -vf format=rgb24 -nosound ~/movies/courtyard.avi
and
prof -lines mplayer.fpcsampx.m<number>
Code:
-------------------------------------------------------------------------
SpeedShop profile listing generated Mon Aug 16 00:31:44 2004
prof -lines mplayer.fpcsampx.m16544
mplayer (n32): Target program
fpcsampx: Experiment name
pc,4,1000,0:cu: Marching orders
R10000 / R10010: CPU / FPU
1: Number of CPUs
195: Clock frequency (MHz.)
Experiment notes--
From file mplayer.fpcsampx.m16544:
Caliper point 0 at target begin, PID 16544
/usr/local/src/MPlayer-1.0pre5/mplayer -vo gl2 -vf format=rgb24 -nosound /usr/people/frank/movies/courtyard.avi
Caliper point 1 at exit(0)
-------------------------------------------------------------------------
Summary of statistical PC sampling data (fpcsampx)--
69696: Total samples
69.696: Accumulated time (secs.)
1.0: Time per sample (msecs.)
4: Sample bin width (bytes)
-------------------------------------------------------------------------
Function list, in descending order by time
-------------------------------------------------------------------------
[index] secs % cum.% samples function (dso: file, line)
[1] 22.542 32.3% 32.3% 22542 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 313)
[2] 7.901 11.3% 43.7% 7901 put_pixels8_c (mplayer: dsputil.c, 891)
[3] 7.273 10.4% 54.1% 7273 __glMgrWaitForDMAWrite (libGLcore.so: mgras_pxdma.c, 368)
[4] 6.737 9.7% 63.8% 6737 idctSparseColAdd (mplayer: simple_idct.c, 209)
[5] 5.066 7.3% 71.0% 5066 __ioctl (libc.so.1: stat.c, 32; compiled in ioctl.s)
[6] 2.532 3.6% 74.7% 2532 msmpeg4_decode_block (mplayer: msmpeg4.c, 1661)
[7] 1.920 2.8% 77.4% 1920 idctRowCondDC (mplayer: simple_idct.c, 104)
[8] 1.647 2.4% 79.8% 1647 put_pixels16_xy2_c (mplayer: dsputil.c, 891)
[9] 1.557 2.2% 82.0% 1557 simple_idct_put (mplayer: simple_idct.c, 313)
Whoa, a third of its time is spent in that routine!
So let's look at the code in libvo/yuv2rgb.c:
Code:
PROLOG(yuv2rgb_c_24_rgb, uint8_t)
RGB(0);
DST1RGB(0);
DST2RGB(0);
RGB(1);
DST2RGB(1);
DST1RGB(1);
RGB(2);
DST1RGB(2);
DST2RGB(2);
RGB(3);
DST2RGB(3);
DST1RGB(3);
EPILOG(24)
Gaack, macros. Let's write them out, so we can see the flow more clearly:
Code:
static int yuv2rgb_c_24_rgb(SwsContext *c, uint8_t* src[], int srcStride[], int srcSliceY,
int srcSliceH, uint8_t* dst[], int dstStride[]){
int y;
if(c->srcFormat == IMGFMT_422P){
srcStride[1] *= 2;
srcStride[2] *= 2;
}
for(y=0; y<srcSliceH; y+=2){
uint8_t *dst_1= (uint8_t*)(dst[0] + (y+srcSliceY )*dstStride[0]);
uint8_t *dst_2= (uint8_t*)(dst[0] + (y+srcSliceY+1)*dstStride[0]);
uint8_t *r, *g, *b;
uint8_t *py_1= src[0] + y*srcStride[0];
uint8_t *py_2= py_1 + srcStride[0];
uint8_t *pu= src[1] + (y>>1)*srcStride[1];
uint8_t *pv= src[2] + (y>>1)*srcStride[2];
unsigned int h_size= c->dstW>>3;
while (h_size--) {
int U, V, Y;
U = pu[0];
V = pv[0];
r = (char*)c->table_rV[V];
g = (char*)c->table_gU[U] + c->table_gV[V];
b = (char*)c->table_bU[U];
Y = py_1[0];
dst_1[0] = r[Y]; dst_1[1] = g[Y]; dst_1[2] = b[Y];
Y = py_1[1];
dst_1[3] = r[Y]; dst_1[4] = g[Y]; dst_1[5] = b[Y];
Y = py_2[0];
dst_2[0] = r[Y]; dst_2[1] = g[Y]; dst_2[2] = b[Y];
Y = py_2[1];
dst_2[3] = r[Y]; dst_2[4] = g[Y]; dst_2[5] = b[Y];
U = pu[1];
V = pv[1];
r = (char*)c->table_rV[V];
g = (char*)c->table_gU[U] + c->table_gV[V];
b = (char*)c->table_bU[U];
DST2RGB(1); /* <== didn't write them out, lazy bum me */
DST1RGB(1); /* <== didn't write them out, lazy bum me */
U = pu[2];
V = pv[2];
r = (char*)c->table_rV[V];
g = (char*)c->table_gU[U] + c->table_gV[V];
b = (char*)c->table_bU[U];
DST1RGB(2); /* <== still not written out */
DST2RGB(2);
U = pu[3];
V = pv[3];
r = (char*)c->table_rV[V];
g = (char*)c->table_gU[U] + c->table_gV[V];
b = (char*)c->table_bU[U];
DST2RGB(3); /* <== Are you still reading this? */
DST1RGB(3); /* <== Man, eat some hot dung strudel. (BloodhoundGang) */
pu += 4;
pv += 4;
py_1 += 8;
py_2 += 8;
dst_1 += 24;
dst_2 += 24;
}
}
return srcSliceH;
}
Fabulous: lots of array pointers, loads and stores from memory, and some redundant recalculations within a loop. Not exactly a lost cause, but it needs some care. The pointers dst_1 and dst_2 point to the RGB pixels of two adjacent scanlines.
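For readers wondering what r[Y], g[Y] and b[Y] actually do: the chroma values pick a pointer into a clamping table and the luma value indexes it, so the inner loop has no multiplies and no range checks. A minimal sketch of that idea follows. The table names, the fixed-point coefficients (the common full-range JFIF ones) and the table layout are assumptions for illustration; the real tables in yuv2rgb.c are built differently.

```c
#include <stdint.h>

#define CLIP_OFF 256                          /* headroom below 0 and above 255 */
static uint8_t  clip_tab[256 + 2 * CLIP_OFF]; /* clip_tab[CLIP_OFF + x] = clamp(x) */
static uint8_t *tab_rV[256];                  /* per-V pointer into clip_tab */
static uint8_t *tab_gU[256];                  /* per-U pointer into clip_tab */
static int      tab_gV[256];                  /* per-V offset added to tab_gU[U] */
static uint8_t *tab_bU[256];                  /* per-U pointer into clip_tab */

/* Build the tables once; all per-pixel arithmetic is folded in here. */
static void init_tables(void)
{
    int i;
    for (i = 0; i < 256 + 2 * CLIP_OFF; i++) {
        int x = i - CLIP_OFF;
        clip_tab[i] = (uint8_t)(x < 0 ? 0 : (x > 255 ? 255 : x));
    }
    for (i = 0; i < 256; i++) {
        int d = i - 128;                                         /* signed chroma */
        tab_rV[i] = clip_tab + CLIP_OFF + (91881 * d) / 65536;   /* +1.402 * Cr */
        tab_bU[i] = clip_tab + CLIP_OFF + (116130 * d) / 65536;  /* +1.772 * Cb */
        tab_gU[i] = clip_tab + CLIP_OFF - (22554 * d) / 65536;   /* -0.344 * Cb */
        tab_gV[i] = -(46802 * d) / 65536;                        /* -0.714 * Cr */
    }
}

/* Convert one pixel: three pointer setups per chroma pair, then one
 * table load per output byte, exactly the access pattern of the loop. */
static void yuv_to_rgb24(uint8_t Y, uint8_t U, uint8_t V, uint8_t rgb[3])
{
    const uint8_t *r = tab_rV[V];
    const uint8_t *g = tab_gU[U] + tab_gV[V];
    const uint8_t *b = tab_bU[U];
    rgb[0] = r[Y];
    rgb[1] = g[Y];
    rgb[2] = b[Y];
}
```

Note that r, g and b only change when U and V change, which is why the routine sets them up once per chroma pair and reuses them for all the luma samples that share that pair.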
The first and most obvious idea Vegac and I had was to combine the byte-wide dst_1 and dst_2 stores into 16, 32 or even 64 bit stores. To collect the table data for each store I had to define two additional variables as shift accumulators. So you get something like this for pu[0] and pv[0]:
Code:
uint32_t *dst_1= (uint32_t*)(dst[0] + (y+srcSliceY )*dstStride[0]);
uint32_t *dst_2= (uint32_t*)(dst[0] + (y+srcSliceY+1)*dstStride[0]);
uint32_t acc1,acc2;
.
.
.
U = pu[0];
V = pv[0];
r = (uint8_t*)c->table_rV[V];
g = (uint8_t*)c->table_gU[U] + c->table_gV[V];
b = (uint8_t*)c->table_bU[U];
Y1 = py_1[0];
Y2 = py_1[1];
acc1 = r[Y1];
acc1 = acc1 <<8;
acc1 += g[Y1];
acc1 = acc1 <<8;
acc1 += b[Y1];
acc1 = acc1 <<8;
acc1 += r[Y2];
dst_1[0]=acc1;
acc1 = g[Y2];
acc1 = acc1 <<8;
acc1 += b[Y2];
acc1 = acc1 <<8;
Y1 = py_2[0];
Y2 = py_2[1];
acc2 = r[Y1];
acc2 = acc2 <<8;
acc2 += g[Y1];
acc2 = acc2 <<8;
acc2 += b[Y1];
acc2 = acc2 <<8;
acc2 += r[Y2];
dst_2[0]=acc2;
acc2 = g[Y2];
acc2 = acc2 <<8;
acc2 += b[Y2];
acc2 = acc2 <<8;
.
.
.
pu += 4;
pv += 4;
py_1 += 8;
py_2 += 8;
dst_1 += 6;
dst_2 += 6;
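Stripped of the table lookups, the packing in the fragment above comes down to the following sketch. The function names are hypothetical, and store_be32() only exists to make the byte order explicit on any host; on a big-endian MIPS a plain 32 bit store gives the same memory layout.

```c
#include <stdint.h>

/* Pack four RGB pixels (12 bytes: R0 G0 B0 R1 G1 B1 ...) into three
 * 32 bit words: RGBR, GBRG, BRGB. */
static void pack4_rgb24(const uint8_t px[12], uint32_t w[3])
{
    int i;
    for (i = 0; i < 3; i++) {
        const uint8_t *p = px + 4 * i;
        w[i] = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
             | ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
    }
}

/* Write a word most significant byte first, i.e. what a native
 * 32 bit store does on a big-endian CPU. */
static void store_be32(uint8_t *dst, uint32_t w)
{
    dst[0] = (uint8_t)(w >> 24);
    dst[1] = (uint8_t)(w >> 16);
    dst[2] = (uint8_t)(w >> 8);
    dst[3] = (uint8_t)w;
}
```

Three stores instead of twelve per four pixels; the price is the extra shifts and adds needed to build each word.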
Going from 8 bit stores to 32 bit stores cuts the number of store instructions by a factor of 4. Look also at the acc1 and acc2 bitshifts and stores. Because dst_1 and dst_2 are now 32 bit pointers, and we need to process 8 pixels per line in this inner loop (24 dst bytes per line / 3 bytes per pixel), we store RGBR in dst_1[0] and dst_2[0], GBRG in dst_1[1] and dst_2[1], BRGB in dst_1[2] and dst_2[2], and so forth, which effectively stores 4 pixels in three 32-bit stores per line. Big-endian byte ordering, of course :) The speedshop output shows that we are on the right track:
Code:
66347: Total samples
.
[1] 19.651 29.6% 29.6% 19651 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 315)
[2] 8.014 12.1% 41.7% 8014 put_pixels8_c (mplayer: dsputil.c, 891)
[3] 7.272 11.0% 52.7% 7272 __glMgrWaitForDMAWrite (libGLcore.so: mgras_pxdma.c, 368)
That's 2.7 percentage points better, which translates to a speedup of about 3 seconds. Now what happens when we, well, Vegac, move the pointer setup out of the outer loop? Speedshop:
Code:
66563: Total samples
.
[1] 19.161 28.8% 28.8% 19161 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 315)
[2] 8.003 12.0% 40.8% 8003 put_pixels8_c (mplayer: dsputil.c, 891)
[3] 7.325 11.0% 51.8% 7325 __glMgrWaitForDMAWrite (libGLcore.so: mgras_pxdma.c, 368)
Very nice. Vegac has made some good progress with this. Can we improve on this by making dst_1 and dst_2 pointers to 64 bit words? Like this:
Code:
static int yuv2rgb_c_24_rgb(SwsContext *c, uint8_t* src[], int srcStride[], int srcSliceY,
int srcSliceH, uint8_t* dst[], int dstStride[]){
int y;
int U, V, Y1, Y2;
uint64_t acc1,acc2;
uint8_t *r, *g, *b;
if(c->srcFormat == IMGFMT_422P){
srcStride[1] *= 2;
srcStride[2] *= 2;
}
int puoff = srcStride[1]-(c->dstW>>1);
int pvoff = srcStride[2]-(c->dstW>>1);
int pyoff = (srcStride[0]<<1)-c->dstW;
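int dstoff = (dstStride[0]<<1)/sizeof(uint64_t)-3*(c->dstW>>3); /* skip to the next line pair, counted in 64 bit words; same value as the old sizeof(uint64_t*) form under -n32 */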
uint8_t *pu= src[1];
uint8_t *pv= src[2];
uint8_t *py_1= src[0];
uint8_t *py_2= py_1 + srcStride[0];
uint64_t *dst_1= (uint64_t*)(dst[0] + (srcSliceY*dstStride[0]));
uint64_t *dst_2= dst_1 + (dstStride[0]>>3);
unsigned int slice_size= srcSliceH>>1;
unsigned int h_size= c->dstW>>3;
while (slice_size--){
for(y=0;y<h_size;y++){
U = pu[0];
V = pv[0];
r = (uint8_t*)c->table_rV[V];
g = (uint8_t*)c->table_gU[U] + c->table_gV[V];
b = (uint8_t*)c->table_bU[U];
Y1 = py_1[0];
Y2 = py_1[1];
acc1 = r[Y1]<<24;
acc1 += g[Y1]<<16;
acc1 += b[Y1]<<8;
acc1 += r[Y2];
acc1<<=32;
acc1 += g[Y2]<<24;
acc1 += b[Y2]<<16;
Y1 = py_2[0];
Y2 = py_2[1];
acc2 = r[Y1]<<24;
acc2 += g[Y1]<<16;
acc2 += b[Y1]<<8;
acc2 += r[Y2];
acc2<<=32;
acc2 += g[Y2]<<24;
acc2 += b[Y2]<<16;
U = pu[1];
V = pv[1];
r = (uint8_t*)c->table_rV[V];
g = (uint8_t*)c->table_gU[U] + c->table_gV[V];
b = (uint8_t*)c->table_bU[U];
Y1 = py_1[2];
Y2 = py_1[3];
acc1 += r[Y1]<<8;
acc1 += g[Y1];
dst_1[0]=acc1;
acc1 = b[Y1]<<24;
acc1 += r[Y2]<<16;
acc1 += g[Y2]<<8;
acc1 += b[Y2];
acc1<<=32;
Y1 = py_2[2];
Y2 = py_2[3];
acc2 += r[Y1]<<8;
acc2 += g[Y1];
dst_2[0]=acc2;
acc2 = b[Y1]<<24;
acc2 += r[Y2]<<16;
acc2 += g[Y2]<<8;
acc2 += b[Y2];
acc2<<=32;
U = pu[2];
V = pv[2];
r = (uint8_t*)c->table_rV[V];
g = (uint8_t*)c->table_gU[U] + c->table_gV[V];
b = (uint8_t*)c->table_bU[U];
Y1 = py_1[4];
Y2 = py_1[5];
acc1 += r[Y1]<<24;
acc1 += g[Y1]<<16;
acc1 += b[Y1]<<8;
acc1 += r[Y2];
dst_1[1]=acc1;
acc1 = g[Y2]<<24;
acc1 += b[Y2]<<16;
Y1 = py_2[4];
Y2 = py_2[5];
acc2 += r[Y1]<<24;
acc2 += g[Y1]<<16;
acc2 += b[Y1]<<8;
acc2 += r[Y2];
dst_2[1]=acc2;
acc2 = g[Y2]<<24;
acc2 += b[Y2]<<16;
U = pu[3];
V = pv[3];
r = (uint8_t*)c->table_rV[V];
g = (uint8_t*)c->table_gU[U] + c->table_gV[V];
b = (uint8_t*)c->table_bU[U];
Y1 = py_1[6];
Y2 = py_1[7];
acc1 += r[Y1]<<8;
acc1 += g[Y1];
acc1<<=32;
acc1 += b[Y1]<<24;
acc1 += r[Y2]<<16;
acc1 += g[Y2]<<8;
acc1 += b[Y2];
dst_1[2]=acc1;
Y1 = py_2[6];
Y2 = py_2[7];
acc2 += r[Y1]<<8;
acc2 += g[Y1];
acc2<<=32;
acc2 += b[Y1]<<24;
acc2 += r[Y2]<<16;
acc2 += g[Y2]<<8;
acc2 += b[Y2];
dst_2[2]=acc2;
pu += 4;
pv += 4;
py_1 += 8;
py_2 += 8;
dst_1 += 3;
dst_2 += 3;
}
pu += puoff;
pv += pvoff;
py_1 += pyoff;
py_2 += pyoff;
dst_1 += dstoff;
dst_2 += dstoff;
}
return srcSliceH;
}
And speedshop tells us:
Code:
66722: Total samples
.
[1] 19.119 28.7% 28.7% 19119 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 315)
[2] 7.999 12.0% 40.6% 7999 put_pixels8_c (mplayer: dsputil.c, 891)
[3] 7.328 11.0% 51.6% 7328 __glMgrWaitForDMAWrite (libGLcore.so: mgras_pxdma.c, 368)
Damn! Only a 0.1 percentage point decrease.
BTW, notice I inserted an acc1<<=32; in there. Some folks might comment: "Why not use acc1 += b[Y1]<<56; and some more of those larger bitshifts, and get rid of that 32 bit shift of acc1?" Well, with mipspro that doesn't reliably shift beyond 32 bits. Most likely that is C's integer promotion rather than the compiler: b[Y1] is a uint8_t, so it is shifted as a 32 bit int unless it is cast to uint64_t first. Or maybe it's because I compiled with -n32 and not with -64... (MPlayer coredumps when compiled with -64 ;) )
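A likely explanation for the shift trouble, offered as an assumption and not verified against MIPSpro: in C a uint8_t operand is promoted to int before shifting, so b[Y1]<<56 shifts a 32 bit value by more than its width, which the language leaves undefined. For the same reason a byte of 128 or more shifted left by 24 lands in the int sign bit and may be sign-extended when it is added to a 64 bit accumulator. Casting to uint64_t before the shift avoids both. A sketch with a hypothetical helper name:

```c
#include <stdint.h>

/* Build one 64 bit word from eight bytes (for example R G B R G B R G),
 * most significant byte first. Each byte is widened to uint64_t BEFORE
 * it is shifted, so shift counts up to 56 are well defined and nothing
 * gets sign-extended. */
static uint64_t pack8_be(const uint8_t p[8])
{
    return ((uint64_t)p[0] << 56) | ((uint64_t)p[1] << 48)
         | ((uint64_t)p[2] << 40) | ((uint64_t)p[3] << 32)
         | ((uint64_t)p[4] << 24) | ((uint64_t)p[5] << 16)
         | ((uint64_t)p[6] << 8)  |  (uint64_t)p[7];
}
```

With the casts in place the intermediate acc1<<=32 steps are no longer needed, although whether that buys anything measurable here is another matter given how little the 64 bit stores helped.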
So are there any other things we need to know? Well, maybe some cache effects are important. The lookup tables are large and the cache can't hold the destination buffers all at once, so there must be some cache thrashing going on. Going back to the old code, let's look at the line info of the prof output:
Code:
Line list, in descending order by function-time and then line number
-------------------------------------------------------------------------
secs % cum.% samples function (dso: file, line)
0.056 0.1 0.1 56 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 313) PROLOG(yuv2rgb_c_24_rgb, uint8_t)
0.927 1.3 1.4 927 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 314) RGB(0);
2.444 3.5 4.9 2444 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 315) DST1RGB(0);
3.527 5.1 10.0 3527 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 316) DST2RGB(0);
0.510 0.7 10.7 510 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 318) RGB(1);
2.218 3.2 13.9 2218 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 319) DST2RGB(1);
3.405 4.9 18.8 3405 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 320) DST1RGB(1);
0.430 0.6 19.4 430 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 322) RGB(2);
2.244 3.2 22.6 2244 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 323) DST1RGB(2);
2.836 4.1 26.7 2836 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 324) DST2RGB(2);
0.387 0.6 27.2 387 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 326) RGB(3);
1.781 2.6 29.8 1781 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 327) DST2RGB(3);
1.777 2.5 32.3 1777 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 328) DST1RGB(3);
Lines 315, 316, 319, 320, 323, 324, 327 and 328 correspond to the DST1RGB() and DST2RGB() macros, so it looks like the stores are causing large delays. In the 64 bit version the stores also take the longest time:
Code:
0.012 0.0 0.0 12 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 315)
0.005 0.0 0.0 5 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 325)
0.002 0.0 0.0 2 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 326)
0.001 0.0 0.0 1 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 331)
0.014 0.0 0.1 14 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 338)
1.854 2.8 2.8 1854 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 340) for(y=0;y<h_size;y++){
0.144 0.2 3.0 144 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 342)
0.199 0.3 3.3 199 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 343)
0.060 0.1 3.4 60 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 344)
0.280 0.4 3.9 280 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 345)
0.239 0.4 4.2 239 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 346)
0.029 0.0 4.3 29 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 347)
0.058 0.1 4.3 58 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 348)
0.051 0.1 4.4 51 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 356) Y1 = py_2[0];
0.655 1.0 5.4 655 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 357) Y2 = py_2[1];
0.733 1.1 6.5 733 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 364) acc2 += b[Y2]<<16;
0.050 0.1 6.6 50 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 366) U = pu[1];
0.014 0.0 6.6 14 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 367) V = pv[1];
0.044 0.1 6.7 44 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 368) r = (uint8_t*)c->table_rV[V];
0.093 0.1 6.8 93 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 369) g = (uint8_t*)c->table_gU[U] + c->table_gV[V];
0.120 0.2 7.0 120 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 370) b = (uint8_t*)c->table_bU[U];
0.133 0.2 7.2 133 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 371) Y1 = py_1[2];
0.065 0.1 7.3 65 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 372) Y2 = py_1[3];
1.941 2.9 10.2 1941 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 375) dst_1[0]=acc1;
0.366 0.5 10.7 366 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 380) acc1<<=32;
0.107 0.2 10.9 107 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 381) Y1 = py_2[2];
0.226 0.3 11.2 226 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 382) Y2 = py_2[3];
0.221 0.3 11.6 221 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 383)
0.215 0.3 11.9 215 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 384)
1.505 2.3 14.1 1505 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 385) dst_2[0]=acc2;
0.424 0.6 14.8 424 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 390)
0.043 0.1 14.8 43 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 392)
0.232 0.3 15.2 232 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 393)
.
And so forth. Look at some of the code lines: the dst_1 and dst_2 stores are definitely costly. To understand a bit more we have to look at perfex and a different speedshop experiment. First the original code:
perfex -a -x -y ./mplayer -really-quiet -vo gl2 -vf format=rgb24 -nosound ~/movies/courtyard.avi
Code:
Based on 195 MHz IP28
MIPS R10000 CPU
CPU revision 2.x
Typical Minimum Maximum
Event Counter Name Counter Value Time (sec) Time (sec) Time (sec)
===================================================================================================================
16 Cycles...................................................... 11919539456 61.125843 61.125843 61.125843
0 Cycles...................................................... 11919465120 61.125462 61.125462 61.125462
2 Issued loads................................................ 3523560416 18.069541 18.069541 18.069541
26 Secondary data cache misses................................. 15575232 12.495330 12.495330 13.029680
18 Graduated loads............................................. 2311594736 11.854332 11.854332 11.854332
3 Issued stores............................................... 1779294832 9.124589 9.124589 9.124589
19 Graduated stores............................................ 1770233744 9.078122 9.078122 9.078122
7 Quadwords written back from scache.......................... 61243216 5.712893 5.226088 5.712893
25 Primary data cache misses................................... 93600272 4.224012 1.401604 4.224012
21 Graduated floating point instructions....................... 646951488 3.317700 1.658850 172.520397
6 Decoded branches............................................ 465257680 2.385937 2.385937 2.385937
22 Quadwords written back from primary data cache.............. 102250624 2.065987 1.578330 2.301950
10 Secondary instruction cache misses.......................... 1280336 1.027158 1.027158 1.071083
9 Primary instruction cache misses............................ 8263472 0.745408 0.247480 0.745408
24 Mispredicted branches....................................... 29519200 0.214960 0.083259 0.806858
23 TLB misses.................................................. 454256 0.111724 0.111724 0.111724
30 Store/prefetch exclusive to clean block in scache........... 144576 0.000741 0.000741 0.000741
4 Issued store conditionals................................... 53312 0.000273 0.000273 0.000273
20 Graduated store conditionals................................ 12304 0.000063 0.000063 0.000063
5 Failed store conditionals................................... 16 0.000000 0.000000 0.000000
1 Issued instructions......................................... 11858107248 0.000000 0.000000 60.810806
8 Correctable scache data array ECC errors.................... 0 0.000000 0.000000 0.000000
11 Instruction misprediction from scache way prediction table.. 1267216 0.000000 0.000000 0.006499
12 External interventions...................................... 0 0.000000 0.000000 0.000000
13 External invalidations...................................... 0 0.000000 0.000000 0.000000
14 Virtual coherency conditions................................ 0 0.000000 0.000000 0.000000
15 Graduated instructions...................................... 11144438752 0.000000 0.000000 57.150968
17 Graduated instructions...................................... 11285333280 0.000000 0.000000 57.873504
27 Data misprediction from scache way prediction table......... 2610640 0.000000 0.000000 0.013388
28 External intervention hits in scache........................ 0 0.000000 0.000000 0.000000
29 External invalidation hits in scache........................ 0 0.000000 0.000000 0.000000
31 Store/prefetch exclusive to shared block in scache.......... 0 0.000000 0.000000 0.000000
Statistics
=========================================================================================
Graduated instructions/cycle................................................ 0.934972
Graduated floating point instructions/cycle................................. 0.054277
Graduated loads & stores/cycle.............................................. 0.342449
Graduated loads & stores/floating point instruction......................... 6.309327
Mispredicted branches/Decoded branches...................................... 0.063447
Graduated loads/Issued loads................................................ 0.656039
Graduated stores/Issued stores.............................................. 0.994907
Data mispredict/Data scache hits............................................ 0.033459
Instruction mispredict/Instruction scache hits.............................. 0.181468
L1 Cache Line Reuse......................................................... 42.609152
L2 Cache Line Reuse......................................................... 5.009559
L1 Data Cache Hit Rate...................................................... 0.977069
L2 Data Cache Hit Rate...................................................... 0.833598
Time accessing memory/Total time............................................ 0.617800
L1--L2 bandwidth used (MB/s, average per process)........................... 75.765314
Memory bandwidth used (MB/s, average per process)........................... 48.645892
MFLOPS (average per process)................................................ 10.583927
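One way to read the table above: with -y, perfex turns each counter value into a time estimate using a typical cost per event. Working backwards from the secondary data cache miss row gives the cost that was assumed per miss. A small sketch (the helper names are mine, the numbers are copied from the listing):

```c
/* Cycles charged per event, given the estimated time for all events,
 * the event count and the CPU clock in MHz. */
static double cycles_per_event(double seconds, double events, double clock_mhz)
{
    return seconds * clock_mhz * 1e6 / events;
}

/* Fraction of the whole run attributed to one event type. */
static double share_of_run(double event_seconds, double total_seconds)
{
    return event_seconds / total_seconds;
}
```

With 12.495330 s, 15575232 misses and 195 MHz this comes to roughly 156 cycles per secondary cache miss, and 12.495330 / 61.125843 is about 20% of the run.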
Hm, about 12 seconds lost due to secondary cache misses. Next, the ssrun experiment which samples these cache misses:
ssrun -exp fdsc_hwc ./mplayer -vo gl2 -vf format=rgb24 -nosound ~/movies/courtyard.avi
Code:
-------------------------------------------------------------------------
SpeedShop profile listing generated Mon Aug 16 03:44:27 2004
prof -lines mplayer.fdsc_hwc.m16892
mplayer (n32): Target program
fdsc_hwc: Experiment name
hwc,26,29:cu: Marching orders
R10000 / R10010: CPU / FPU
1: Number of CPUs
195: Clock frequency (MHz.)
Experiment notes--
From file mplayer.fdsc_hwc.m16892:
Caliper point 0 at target begin, PID 16892
/usr/local/src/MPlayer-1.0pre5/mplayer -vo gl2 -vf format=rgb24 -nosound /usr/people/frank/movies/courtyard.avi
Caliper point 1 at exit(0)
-------------------------------------------------------------------------
Summary of perf. counter overflow PC sampling data (fdsc_hwc)--
530523: Total samples
Secondary cache D misses (26): Counter name (number)
29: Counter overflow value
15385167: Total counts
-------------------------------------------------------------------------
Function list, in descending order by counts
-------------------------------------------------------------------------
[index] counts % cum.% samples function (dso: file, line)
[1] 7606004 49.4% 49.4% 262276 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 313)
[2] 4538645 29.5% 78.9% 156505 put_pixels8_c (mplayer: dsputil.c, 891)
[3] 646439 4.2% 83.1% 22291 put_pixels16_xy2_c (mplayer: dsputil.c, 891)
[4] 512169 3.3% 86.5% 17661 put_pixels8_xy2_c (mplayer: dsputil.c, 891)
[5] 182294 1.2% 87.7% 6286 put_pixels16_x2_c (mplayer: dsputil.c, 891)
.
.
.
Line list, in descending order by function-time and then line number
-------------------------------------------------------------------------
counts % cum.% samples function (dso: file, line)
1566 0.0 0.0 54 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 313) PROLOG(yuv2rgb_c_24_rgb, uint8_t)
28507 0.2 0.2 983 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 314) RGB(0);
2118276 13.8 14.0 73044 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 315) DST1RGB(0);
478877 3.1 17.1 16513 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 316) DST2RGB(0);
3306 0.0 17.1 114 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 318) RGB(1);
2158731 14.0 31.1 74439 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 319) DST2RGB(1);
405797 2.6 33.8 13993 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 320) DST1RGB(1);
1740 0.0 33.8 60 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 322) RGB(2);
2003088 13.0 46.8 69072 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 323) DST1RGB(2);
.
Brrrr, almost 50% of all secondary data cache misses come from yuv2rgb! That routine indeed has some serious issues with the secondary cache. The dst macros are really sticking out. And how does the 64 bit version do?
Code:
15269863: Total counts
.
[1] 7603684 49.8% 49.8% 262196 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 315)
[2] 4496856 29.4% 79.2% 155064 put_pixels8_c (mplayer: dsputil.c, 891)
[3] 648556 4.2% 83.5% 22364 put_pixels16_xy2_c (mplayer: dsputil.c, 891)
[4] 488128 3.2% 86.7% 16832 put_pixels8_xy2_c (mplayer: dsputil.c, 891)
.
.
.
-------------------------------------------------------------------------
Line list, in descending order by function-time and then line number
-------------------------------------------------------------------------
counts % cum.% samples function (dso: file, line)
87 0.0 0.0 3 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 315)
367140 2.4 2.4 12660 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 340)
116 0.0 2.4 4 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 342)
21663 0.1 2.5 747 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 343)
2001 0.0 2.6 69 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 344)
2668 0.0 2.6 92 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 345)
957 0.0 2.6 33 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 346)
2552 0.0 2.6 88 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 347)
5133 0.0 2.6 177 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 348)
3828 0.0 2.7 132 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 356)
19140 0.1 2.8 660 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 357)
22794 0.1 2.9 786 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 364)
348 0.0 2.9 12 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 366)
58 0.0 2.9 2 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 367)
2001 0.0 3.0 69 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 368)
16211 0.1 3.1 559 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 369)
1334 0.0 3.1 46 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 370)
28855 0.2 3.3 995 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 371)
1392 0.0 3.3 48 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 372)
2201245 14.4 17.7 75905 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 375) dst_1[0]=acc1;
928 0.0 17.7 32 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 380)
145 0.0 17.7 5 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 381)
87 0.0 17.7 3 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 382)
58 0.0 17.7 2 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 383)
87 0.0 17.7 3 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 384)
524262 3.4 21.1 18078 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 385) dst_2[0]=acc2;
145 0.0 21.1 5 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 390)
290 0.0 21.1 10 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 392)
319 0.0 21.1 11 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 393)
754 0.0 21.1 26 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 394)
232 0.0 21.1 8 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 395)
551 0.0 21.1 19 yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 396)
So no improvement cache-wise: still the same cache penalties on the dst stores.
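A note on reading the fdsc_hwc listings: the experiment takes one PC sample every time the miss counter overflows, and the overflow value here is 29, so the counts column is simply samples times 29. A sketch of that arithmetic, with hypothetical helper names:

```c
/* Events represented by a number of overflow samples. */
static unsigned long samples_to_counts(unsigned long samples,
                                       unsigned long overflow_value)
{
    return samples * overflow_value;
}

/* Share of all counted events that landed in one function, in percent. */
static double percent_of_total(unsigned long counts, unsigned long total)
{
    return 100.0 * (double)counts / (double)total;
}
```

For the original code that is 262276 samples * 29 = 7606004 counts for yuv2rgb_c_24_rgb, out of 530523 * 29 = 15385167 in total, which is the 49.4% in the listing.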
Maybe the R10000 auto prefetch is not working well. This could be tested on an R12K machine like an Octane; it should have better prefetching.
Well, now you know. Lookup tables are costly with respect to cache. We still need some fast colorspace conversion, but I can safely say that we need lots of time before we can come up with a hardware accelerated alternative. Vegac's crm plugin is also handling output for Impact and VPRO, so I guess we're converging to a point where we can mail the mplayer guys our findings and get the first implementation work done on the plugins and the faster routines. On a side note, I also managed to accelerate the quicktime IDCT code, but the results were nowhere near as spectacular as with the mpeg IDCT. Oh well.
Hope you enjoyed my little post and learned a little bit from it. I will think on the cache stuff some more.
Cheers. (bedtime)