Nekochan Net

Posted: Sun Aug 15, 2004 7:53 pm
Joined: Thu Feb 20, 2003 7:57 am
Posts: 2062
Location: Voorburg, The Netherlands
Hi all, back again with some optimisations on MPlayer. I promised you this some time ago with the IDCT article, so better get started.
YUV2RGB colorspace conversion is needed for displaying a TV/video stream via a GFX framebuffer. TV streams are usually encoded as Y (luminance), Cb (U, the blue-difference chrominance) and Cr (V, the red-difference chrominance). Because the human eye can discern contrast much better than color, the TV/video data is usually compressed into a Y value for each pixel and one shared Cb/Cr pair for a small group of pixels: every two horizontally adjacent pixels in YUV422 (YCbCr 4:2:2), or every 2x2 pixel block in YUV420. The 2x2 case is the one the conversion routine below deals with, since it handles two scanlines per pass with the same chroma values. Either way this reduces the datastream while maintaining a good balance in colors and brightness.
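To make the conversion itself concrete, here is a minimal sketch of what has to happen for a single pixel. It assumes video-range BT.601 data and uses commonly quoted 16.16 fixed-point approximations of the coefficients (1.164, 1.596, 0.392, 0.813, 2.017). The function names are made up for illustration and this is not the code MPlayer runs, which uses lookup tables instead.

```c
#include <stdint.h>

/* Clamp an intermediate result to the 0..255 range of one color channel. */
static uint8_t clamp_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Hypothetical helper: convert one video-range BT.601 YUV triple to RGB.
 * The constants are the coefficients scaled by 65536. */
static void yuv_to_rgb(uint8_t y, uint8_t u, uint8_t v, uint8_t rgb[3])
{
    int c = 76309 * ((int)y - 16);   /* 1.164 * (Y - 16) */
    int d = (int)u - 128;            /* Cb offset        */
    int e = (int)v - 128;            /* Cr offset        */

    rgb[0] = clamp_u8((c + 104597 * e + 32768) / 65536);              /* R */
    rgb[1] = clamp_u8((c - 25675 * d - 53279 * e + 32768) / 65536);   /* G */
    rgb[2] = clamp_u8((c + 132201 * d + 32768) / 65536);              /* B */
}
```

Three multiplies, a few adds and three clamps per pixel add up quickly at video rates, which is why table lookups are attractive.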

http://www.fourcc.org/ has some nice info on pixel formats

Now the only SGI machine that can display YUV422 as a native format is, surprise surprise, the O2. Vegac has made a nifty new video output plugin for MPlayer, called vo_crm.c, which uses YUV422 as output for the O2.

Mere mortals without an O2, like me, still must rely on a software colorspace converter, because we are still busy testing hardware colorspace conversion with functions like SGIS_color_matrix or SGIX_pixel_texture (I posted about this in the graphics forum).

So how expensive is the colorspace conversion when playing an AVI or MPEG?
Here comes ssrun again: (machine is my I2 HI+TRAM 6.5.22m with MIPSpro 7.4.2)

ssrun -exp fpcsampx ./mplayer -vo gl2 -vf format=rgb24 -nosound ~/movies/courtyard.avi
and
prof -lines mplayer.fpcsampx.m<number>
Code:
-------------------------------------------------------------------------
SpeedShop profile listing generated Mon Aug 16 00:31:44 2004

   prof -lines mplayer.fpcsampx.m16544

                 mplayer (n32): Target program
                      fpcsampx: Experiment name
                pc,4,1000,0:cu: Marching orders
               R10000 / R10010: CPU / FPU
                             1: Number of CPUs
                           195: Clock frequency (MHz.)
  Experiment notes--
          From file mplayer.fpcsampx.m16544:
        Caliper point 0 at target begin, PID 16544
                        /usr/local/src/MPlayer-1.0pre5/mplayer -vo gl2 -vf format=rgb24 -nosound /usr/people/frank/movies/courtyard.avi
        Caliper point 1 at exit(0)
-------------------------------------------------------------------------
Summary of statistical PC sampling data (fpcsampx)--
                         69696: Total samples
                        69.696: Accumulated time (secs.)
                           1.0: Time per sample (msecs.)
                             4: Sample bin width (bytes)
-------------------------------------------------------------------------
Function list, in descending order by time
-------------------------------------------------------------------------
 [index]      secs    %    cum.%   samples  function (dso: file, line)

     [1]    22.542  32.3%  32.3%     22542  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 313)
     [2]     7.901  11.3%  43.7%      7901  put_pixels8_c (mplayer: dsputil.c, 891)
     [3]     7.273  10.4%  54.1%      7273  __glMgrWaitForDMAWrite (libGLcore.so: mgras_pxdma.c, 368)
     [4]     6.737   9.7%  63.8%      6737  idctSparseColAdd (mplayer: simple_idct.c, 209)
     [5]     5.066   7.3%  71.0%      5066  __ioctl (libc.so.1: stat.c, 32; compiled in ioctl.s)
     [6]     2.532   3.6%  74.7%      2532  msmpeg4_decode_block (mplayer: msmpeg4.c, 1661)
     [7]     1.920   2.8%  77.4%      1920  idctRowCondDC (mplayer: simple_idct.c, 104)
     [8]     1.647   2.4%  79.8%      1647  put_pixels16_xy2_c (mplayer: dsputil.c, 891)
     [9]     1.557   2.2%  82.0%      1557  simple_idct_put (mplayer: simple_idct.c, 313)
 

Whoa, 1/3rd of its time is spent in that routine!

So let's get the code from libvo/yuv2rgb.c:
Code:
PROLOG(yuv2rgb_c_24_rgb, uint8_t)
        RGB(0);
        DST1RGB(0);
        DST2RGB(0);

        RGB(1);
        DST2RGB(1);
        DST1RGB(1);

        RGB(2);
        DST1RGB(2);
        DST2RGB(2);

        RGB(3);
        DST2RGB(3);
        DST1RGB(3);
EPILOG(24)

Gaack, macros. Let's write them out, so we can see the flow more clearly:
Code:
static int yuv2rgb_c_24_rgb(SwsContext *c, uint8_t* src[], int srcStride[], int srcSliceY,
             int srcSliceH, uint8_t* dst[], int dstStride[]){
    int y;

    if(c->srcFormat == IMGFMT_422P){
        srcStride[1] *= 2;
        srcStride[2] *= 2;
    }
    for(y=0; y<srcSliceH; y+=2){
        uint8_t *dst_1= (uint8_t*)(dst[0] + (y+srcSliceY  )*dstStride[0]);
        uint8_t *dst_2= (uint8_t*)(dst[0] + (y+srcSliceY+1)*dstStride[0]);
        uint8_t *r, *g, *b;
        uint8_t *py_1= src[0] + y*srcStride[0];
        uint8_t *py_2= py_1 + srcStride[0];
        uint8_t *pu= src[1] + (y>>1)*srcStride[1];
        uint8_t *pv= src[2] + (y>>1)*srcStride[2];
        unsigned int h_size= c->dstW>>3;
        while (h_size--) {
            int U, V, Y;

            U = pu[0];
            V = pv[0];
            r = (uint8_t*)c->table_rV[V];
            g = (uint8_t*)c->table_gU[U] + c->table_gV[V];
            b = (uint8_t*)c->table_bU[U];
            Y = py_1[0];
            dst_1[0] = r[Y]; dst_1[1] = g[Y]; dst_1[2] = b[Y];
            Y = py_1[1];
            dst_1[3] = r[Y]; dst_1[4] = g[Y]; dst_1[5] = b[Y];
            Y = py_2[0];
            dst_2[0] = r[Y]; dst_2[1] = g[Y]; dst_2[2] = b[Y];
            Y = py_2[1];
            dst_2[3] = r[Y]; dst_2[4] = g[Y]; dst_2[5] = b[Y];

            U = pu[1];
            V = pv[1];
            r = (uint8_t*)c->table_rV[V];
            g = (uint8_t*)c->table_gU[U] + c->table_gV[V];
            b = (uint8_t*)c->table_bU[U];
            DST2RGB(1); /* didn't write them out, lazy bum me */
            DST1RGB(1); /* didn't write them out, lazy bum me */

            U = pu[2];
            V = pv[2];
            r = (uint8_t*)c->table_rV[V];
            g = (uint8_t*)c->table_gU[U] + c->table_gV[V];
            b = (uint8_t*)c->table_bU[U];
            DST1RGB(2); /* still not written out */
            DST2RGB(2);

            U = pu[3];
            V = pv[3];
            r = (uint8_t*)c->table_rV[V];
            g = (uint8_t*)c->table_gU[U] + c->table_gV[V];
            b = (uint8_t*)c->table_bU[U];
            DST2RGB(3); /* Are you still reading this? */
            DST1RGB(3); /* Man, eat some hot dung strudel. (BloodhoundGang) */
            pu += 4;
            pv += 4;
            py_1 += 8;
            py_2 += 8;
            dst_1 += 24;
            dst_2 += 24;
        }
    }
    return srcSliceH;
}

Fabulous: lots of array pointers, loads/stores from memory, and some redundant recalculations within a loop. Not exactly a lost cause, but it needs some care. The pointers dst_1 and dst_2 point at the RGB pixels of two adjacent scanlines.
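For those wondering why r, g and b are pointers that get indexed with Y: the trick is that the chroma value selects a starting point inside a pre-clamped luma table, so the whole conversion becomes "pick a pointer, then index it". Below is a simplified sketch of that idea for the red channel only. It is not MPlayer's actual table setup; luma_tab, table_rV_demo, init_demo_tables and red_from_tables are hypothetical names, and the fixed-point constants are the usual BT.601 approximations.

```c
#include <stdint.h>

#define PAD 232   /* headroom so Y + chroma offset always stays inside the table */

static uint8_t luma_tab[256 + 2 * PAD];     /* clamped 1.164 * (i - 16) */
static const uint8_t *table_rV_demo[256];   /* hypothetical stand-in for c->table_rV */

static uint8_t clamp_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

static void init_demo_tables(void)
{
    int i, v;

    /* One shared table that already contains the scaling and the clamping. */
    for (i = -PAD; i < 256 + PAD; i++)
        luma_tab[i + PAD] = clamp_u8((76309 * (i - 16) + 32768) / 65536);

    /* Per V value: how far the red contribution shifts us along that table. */
    for (v = 0; v < 256; v++) {
        int off = (104597 * (v - 128)) / 76309;
        table_rV_demo[v] = luma_tab + PAD + off;
    }
}

/* Same shape as the inner loop: fetch the pointer once, index it with Y. */
static uint8_t red_from_tables(uint8_t y, uint8_t v)
{
    const uint8_t *r = table_rV_demo[v];
    return r[y];
}
```

The price for getting rid of the multiplies is exactly what the profile shows: every pixel becomes a chain of dependent loads followed by byte stores.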
The first and most obvious idea Vegac and I had was combining the dst_1 and dst_2 byte stores into 16, 32 or even 64 bit stores. With all the array data prepared for the store, I had to define two additional variables as shift containers. So you get something like this, for pu[0] and pv[0]:
Code:
        uint32_t *dst_1= (uint32_t*)(dst[0] + (y+srcSliceY  )*dstStride[0]);
        uint32_t *dst_2= (uint32_t*)(dst[0] + (y+srcSliceY+1)*dstStride[0]);
        uint32_t acc1,acc2;
.
.
.
            U = pu[0];                             
            V = pv[0];                             
            r = (uint8_t*)c->table_rV[V];             
            g = (uint8_t*)c->table_gU[U] + c->table_gV[V];
            b = (uint8_t*)c->table_bU[U];
            Y1 = py_1[0];                                                 
            Y2 = py_1[1];                                               
            acc1 =  r[Y1];
            acc1 = acc1 <<8;
            acc1 += g[Y1];
            acc1 = acc1 <<8;
            acc1 += b[Y1];
            acc1 = acc1 <<8;
            acc1 += r[Y2];
            dst_1[0]=acc1;
            acc1 =  g[Y2];
            acc1 = acc1 <<8;
            acc1 += b[Y2];
            acc1 = acc1 <<8;
            Y1 = py_2[0];                                         
            Y2 = py_2[1];                                       
            acc2 =  r[Y1];
            acc2 = acc2 <<8;
            acc2 += g[Y1];
            acc2 = acc2 <<8;
            acc2 += b[Y1];
            acc2 = acc2 <<8;
            acc2 += r[Y2];
            dst_2[0]=acc2;
            acc2 =  g[Y2];
            acc2 = acc2 <<8;
            acc2 += b[Y2];
            acc2 = acc2 <<8;
.
.
.
            pu += 4;
            pv += 4;
            py_1 += 8;
            py_2 += 8;
            dst_1 += 6;
            dst_2 += 6;
      

Going from 8-bit stores to 32-bit stores cuts the number of writes to the cache by a factor of 4. Look also at the acc1 and acc2 bit shifts and stores. Because dst_1 and dst_2 are now 32-bit pointers, and the inner loop still processes 8 pixels per line (24 destination bytes per line / 3 bytes per pixel), we store RGBR in dst_1[0] and dst_2[0], GBRG in dst_1[1] and dst_2[1], BRGB in dst_1[2] and dst_2[2], and so forth, which effectively stores 4 pixels in three 32-bit stores per line. Big-endian byte ordering of course :) Looking at the speedshop output proves that we are on the right track:
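The RGBR / GBRG / BRGB layout is easier to see in isolation. Here is a small sketch, with a hypothetical helper name, that packs four 24-bit pixels into three 32-bit words the way the accumulators do. The words hold the bytes in big-endian order, so storing them directly only gives the right framebuffer layout on a big-endian machine like the MIPS boxes discussed here.

```c
#include <stdint.h>

/* Hypothetical helper: pack 4 RGB pixels (12 bytes) into 3 words.
 * out[0] = R0 G0 B0 R1, out[1] = G1 B1 R2 G2, out[2] = B2 R3 G3 B3,
 * with the first byte in the most significant position. */
static void pack_rgb24_x4(const uint8_t rgb[12], uint32_t out[3])
{
    int w;

    for (w = 0; w < 3; w++) {
        const uint8_t *p = rgb + 4 * w;
        out[w] = ((uint32_t)p[0] << 24) |
                 ((uint32_t)p[1] << 16) |
                 ((uint32_t)p[2] << 8)  |
                  (uint32_t)p[3];
    }
}
```

Twelve byte stores become three word stores per four pixels, which is where the factor of 4 comes from.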
Code:
                         66347: Total samples
.         
     [1]    19.651  29.6%  29.6%     19651  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 315)
     [2]     8.014  12.1%  41.7%      8014  put_pixels8_c (mplayer: dsputil.c, 891)
     [3]     7.272  11.0%  52.7%      7272  __glMgrWaitForDMAWrite (libGLcore.so: mgras_pxdma.c, 368)

That's 2.7 percentage points better, which translates to a speedup of about 3 seconds (22.5 s down to 19.7 s). Now what happens when we, well Vegac, move the pointer setup out of the outer loop? Speedshop:
Code:
                         66563: Total samples
.
     [1]    19.161  28.8%  28.8%     19161  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 315)
     [2]     8.003  12.0%  40.8%      8003  put_pixels8_c (mplayer: dsputil.c, 891)
     [3]     7.325  11.0%  51.8%      7325  __glMgrWaitForDMAWrite (libGLcore.so: mgras_pxdma.c, 368)

Very nice, another half second. Vegac has made some good progress with this. Can we improve on it by using 64-bit pointers for dst_1 and dst_2? Like this:
Code:
static int yuv2rgb_c_24_rgb(SwsContext *c, uint8_t* src[], int srcStride[], int srcSliceY,
             int srcSliceH, uint8_t* dst[], int dstStride[]){
   int y;
   int U, V, Y1, Y2;
   uint64_t acc1,acc2;
   uint8_t *r, *g, *b;

   if(c->srcFormat == IMGFMT_422P){
      srcStride[1] *= 2;
      srcStride[2] *= 2;
   }
   int puoff = srcStride[1]-(c->dstW>>1);
   int pvoff = srcStride[2]-(c->dstW>>1);
   int pyoff = (srcStride[0]<<1)-c->dstW;
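   /* dst_1/dst_2 skip two scanlines per pass, counted in 64-bit words. The old
      (dstStride[0]<<0)/sizeof(uint64_t*) only gave the same number because pointers are 4 bytes under -n32. */
   int dstoff = (dstStride[0]<<1)/(int)sizeof(uint64_t)-3*(c->dstW>>3);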
   uint8_t *pu= src[1];
   uint8_t *pv= src[2];
   uint8_t *py_1= src[0];
   uint8_t *py_2= py_1 + srcStride[0];
   uint64_t *dst_1= (uint64_t*)(dst[0] + (srcSliceY*dstStride[0]));
   uint64_t *dst_2= dst_1 + (dstStride[0]>>3);
   unsigned int slice_size= srcSliceH>>1;
   unsigned int h_size= c->dstW>>3;

   while (slice_size--){

      for(y=0;y<h_size;y++){

                        U = pu[0];
                        V = pv[0];
                        r = (uint8_t*)c->table_rV[V];
                        g = (uint8_t*)c->table_gU[U] + c->table_gV[V];
                        b = (uint8_t*)c->table_bU[U];
                        Y1 = py_1[0];
                        Y2 = py_1[1];
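                        /* cast before <<24: a bare uint8_t is promoted to a 32-bit int and the shift would land in its sign bit */
                        acc1 =  (uint64_t)r[Y1]<<24;
                        acc1 += g[Y1]<<16;
                        acc1 += b[Y1]<<8;
                        acc1 += r[Y2];
                        acc1<<=32;
                        acc1 += (uint64_t)g[Y2]<<24;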
                        acc1 += b[Y2]<<16;
                        Y1 = py_2[0];
                        Y2 = py_2[1];
                        acc2 =  (uint64_t)r[Y1]<<24;
                        acc2 += g[Y1]<<16;
                        acc2 += b[Y1]<<8;
                        acc2 += r[Y2];
                        acc2<<=32;
                        acc2 += (uint64_t)g[Y2]<<24;
                        acc2 += b[Y2]<<16;
             
                        U = pu[1];
                        V = pv[1];
                        r = (uint8_t*)c->table_rV[V];
                        g = (uint8_t*)c->table_gU[U] + c->table_gV[V];
                        b = (uint8_t*)c->table_bU[U];
                        Y1 = py_1[2];
                        Y2 = py_1[3];
                        acc1 += r[Y1]<<8;
                        acc1 += g[Y1];
                        dst_1[0]=acc1;
                        acc1 =  (uint64_t)b[Y1]<<24;
                        acc1 += r[Y2]<<16;
                        acc1 += g[Y2]<<8;
                        acc1 += b[Y2];
                        acc1<<=32;
                        Y1 = py_2[2];
                        Y2 = py_2[3];
                        acc2 += r[Y1]<<8;
                        acc2 += g[Y1];
                        dst_2[0]=acc2;
                        acc2 =  (uint64_t)b[Y1]<<24;
                        acc2 += r[Y2]<<16;
                        acc2 += g[Y2]<<8;
                        acc2 += b[Y2];
                        acc2<<=32;
              
                        U = pu[2];
                        V = pv[2];
                        r = (uint8_t*)c->table_rV[V];          
                        g = (uint8_t*)c->table_gU[U] + c->table_gV[V];
                        b = (uint8_t*)c->table_bU[U];
                        Y1 = py_1[4];
                        Y2 = py_1[5];
                        acc1 += (uint64_t)r[Y1]<<24;
                        acc1 += g[Y1]<<16;
                        acc1 += b[Y1]<<8;
                        acc1 += r[Y2];
                        dst_1[1]=acc1;
                        acc1 =  (uint64_t)g[Y2]<<24;
                        acc1 += b[Y2]<<16;
                        Y1 = py_2[4];                                         
                        Y2 = py_2[5];                                       
                        acc2 += (uint64_t)r[Y1]<<24;
                        acc2 += g[Y1]<<16;
                        acc2 += b[Y1]<<8;
                        acc2 += r[Y2];
                        dst_2[1]=acc2;
                        acc2 =  (uint64_t)g[Y2]<<24;
                        acc2 += b[Y2]<<16;
             
                        U = pu[3];                             
                        V = pv[3];                             
                        r = (uint8_t*)c->table_rV[V];          
                        g = (uint8_t*)c->table_gU[U] + c->table_gV[V];
                        b = (uint8_t*)c->table_bU[U];
                        Y1 = py_1[6];
                        Y2 = py_1[7];
                        acc1 += r[Y1]<<8;
                        acc1 += g[Y1];
                        acc1<<=32;
                        acc1 += (uint64_t)b[Y1]<<24;
                        acc1 += r[Y2]<<16;
                        acc1 += g[Y2]<<8;
                        acc1 += b[Y2];
                        dst_1[2]=acc1;
                        Y1 = py_2[6];                                         
                        Y2 = py_2[7];                                       
                        acc2 += r[Y1]<<8;
                        acc2 += g[Y1];
                        acc2<<=32;
                        acc2 += (uint64_t)b[Y1]<<24;
                        acc2 += r[Y2]<<16;
                        acc2 += g[Y2]<<8;
                        acc2 += b[Y2];
                        dst_2[2]=acc2;

         pu += 4;
         pv += 4;
         py_1 += 8;
         py_2 += 8;
         dst_1 += 3;
         dst_2 += 3;
      }

      pu += puoff;
      pv += pvoff;
      py_1 += pyoff;
      py_2 += pyoff;
      dst_1 += dstoff;
      dst_2 += dstoff;
   }
   return srcSliceH;
}

And speedshop tells us:
Code:
                         66722: Total samples
.
     [1]    19.119  28.7%  28.7%     19119  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 315)
     [2]     7.999  12.0%  40.6%      7999  put_pixels8_c (mplayer: dsputil.c, 891)
     [3]     7.328  11.0%  51.6%      7328  __glMgrWaitForDMAWrite (libGLcore.so: mgras_pxdma.c, 368)

Damn! Only a 0.1 percentage point decrease.

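BTW, notice I inserted an acc1<<=32; Some folks might comment "Why not use acc1 += b[Y1]<<56; and some more of those larger bitshifts and get rid of that 32 bit shift of acc1?" Well, I tried, and it doesn't reliably shift beyond 32 bits. I first suspected MIPSpro, or the fact that I compiled with -n32 and not with -64, but the more likely cause is plain C: b[Y1] is a uint8_t, which is promoted to a 32-bit int before the shift, so shifting it by 32 or more is undefined whatever the compiler or ABI. Casting the operand first, as in (uint64_t)b[Y1]<<56, should make the wide shifts work. ( MPlayer coredumps compiled with -64 ;) )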
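A minimal sketch of the wide-shift variant, assuming nothing more than standard C integer promotion rules. The helper name is made up. The point is only that the cast to uint64_t has to happen before the shift, not after it.

```c
#include <stdint.h>

/* Hypothetical helper: pack 8 bytes into one 64-bit word, first byte in the
 * most significant position. Each byte is widened to 64 bits BEFORE the
 * shift, so shift counts of 32..56 are well defined. Without the cast the
 * byte would be promoted to a 32-bit int and the wide shifts would not be. */
static uint64_t pack_bytes_be64(const uint8_t b[8])
{
    uint64_t acc = 0;
    int i;

    for (i = 0; i < 8; i++)
        acc |= (uint64_t)b[i] << (56 - 8 * i);
    return acc;
}
```

The same cast also matters for the <<24 terms: a byte value of 128 or more shifted by 24 ends up in the sign bit of the int, and on typical compilers that gets sign-extended when it is added to a 64-bit accumulator.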

So are there any other things we need to know? Well, maybe some cache stuff could be important. The lookup tables are large and the cache can't hold the destination buffers all at once, so there must be some cache clashing going on. Going back to the old code, let's look at the line info of the prof output:
Code:
Line list, in descending order by function-time and then line number
-------------------------------------------------------------------------
          secs     %   cum.%   samples  function (dso: file, line)

         0.056    0.1    0.1        56  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 313)  PROLOG(yuv2rgb_c_24_rgb, uint8_t)
         0.927    1.3    1.4       927  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 314)  RGB(0);
         2.444    3.5    4.9      2444  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 315)  DST1RGB(0);
         3.527    5.1   10.0      3527  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 316)  DST2RGB(0);
         0.510    0.7   10.7       510  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 318)  RGB(1);
         2.218    3.2   13.9      2218  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 319)  DST2RGB(1);
         3.405    4.9   18.8      3405  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 320)  DST1RGB(1);
         0.430    0.6   19.4       430  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 322)  RGB(2);
         2.244    3.2   22.6      2244  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 323)  DST1RGB(2);
         2.836    4.1   26.7      2836  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 324)  DST2RGB(2);
         0.387    0.6   27.2       387  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 326)  RGB(3);
         1.781    2.6   29.8      1781  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 327)  DST2RGB(3);
         1.777    2.5   32.3      1777  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 328)  DST1RGB(3);

Lines 315, 316, 319, 320, 323, 324, 327 and 328 correspond to the DST1RGB() and DST2RGB() macros, so it looks like the stores are causing large delays. In the 64 bit version the stores also take the longest time:
Code:
         0.012    0.0    0.0        12  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 315)
         0.005    0.0    0.0         5  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 325)
         0.002    0.0    0.0         2  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 326)
         0.001    0.0    0.0         1  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 331)
         0.014    0.0    0.1        14  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 338)
         1.854    2.8    2.8      1854  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 340)  for(y=0;y<h_size;y++){
         0.144    0.2    3.0       144  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 342)
         0.199    0.3    3.3       199  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 343)
         0.060    0.1    3.4        60  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 344)
         0.280    0.4    3.9       280  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 345)
         0.239    0.4    4.2       239  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 346)
         0.029    0.0    4.3        29  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 347)
         0.058    0.1    4.3        58  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 348)
         0.051    0.1    4.4        51  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 356)  Y1 = py_2[0];
         0.655    1.0    5.4       655  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 357)  Y2 = py_2[1];
         0.733    1.1    6.5       733  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 364)  acc2 += b[Y2]<<16;
         0.050    0.1    6.6        50  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 366)  U = pu[1];
         0.014    0.0    6.6        14  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 367)  V = pv[1];
         0.044    0.1    6.7        44  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 368)  r = (uint8_t*)c->table_rV[V];
         0.093    0.1    6.8        93  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 369)  g = (uint8_t*)c->table_gU[U] + c->table_gV[V];
         0.120    0.2    7.0       120  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 370)  b = (uint8_t*)c->table_bU[U];
         0.133    0.2    7.2       133  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 371)  Y1 = py_1[2];
         0.065    0.1    7.3        65  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 372)  Y2 = py_1[3];
         1.941    2.9   10.2      1941  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 375)  dst_1[0]=acc1;
         0.366    0.5   10.7       366  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 380)  acc1<<=32;
         0.107    0.2   10.9       107  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 381)  Y1 = py_2[2];
         0.226    0.3   11.2       226  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 382)  Y2 = py_2[3];
         0.221    0.3   11.6       221  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 383)
         0.215    0.3   11.9       215  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 384)
         1.505    2.3   14.1      1505  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 385)  dst_2[0]=acc2;
         0.424    0.6   14.8       424  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 390)
         0.043    0.1   14.8        43  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 392)
         0.232    0.3   15.2       232  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 393)
.   


And so forth. Look at some of the code lines. Definitely the dst_1 and dst_2 stores are costly. To understand a bit more we have to look at perfex and a different speedshop experiment. First the original code:
perfex -a -x -y ./mplayer -really-quiet -vo gl2 -vf format=rgb24 -nosound ~/movies/courtyard.avi
Code:
                                                                    Based on 195 MHz IP28
                                                                          MIPS R10000 CPU
                                                                        CPU revision 2.x
                                                                                  Typical      Minimum      Maximum
   Event Counter Name                                          Counter Value   Time (sec)   Time (sec)   Time (sec)
===================================================================================================================
16 Cycles......................................................  11919539456    61.125843    61.125843    61.125843
 0 Cycles......................................................  11919465120    61.125462    61.125462    61.125462
 2 Issued loads................................................   3523560416    18.069541    18.069541    18.069541
26 Secondary data cache misses.................................     15575232    12.495330    12.495330    13.029680
18 Graduated loads.............................................   2311594736    11.854332    11.854332    11.854332
 3 Issued stores...............................................   1779294832     9.124589     9.124589     9.124589
19 Graduated stores............................................   1770233744     9.078122     9.078122     9.078122
 7 Quadwords written back from scache..........................     61243216     5.712893     5.226088     5.712893
25 Primary data cache misses...................................     93600272     4.224012     1.401604     4.224012
21 Graduated floating point instructions.......................    646951488     3.317700     1.658850   172.520397
 6 Decoded branches............................................    465257680     2.385937     2.385937     2.385937
22 Quadwords written back from primary data cache..............    102250624     2.065987     1.578330     2.301950
10 Secondary instruction cache misses..........................      1280336     1.027158     1.027158     1.071083
 9 Primary instruction cache misses............................      8263472     0.745408     0.247480     0.745408
24 Mispredicted branches.......................................     29519200     0.214960     0.083259     0.806858
23 TLB misses..................................................       454256     0.111724     0.111724     0.111724
30 Store/prefetch exclusive to clean block in scache...........       144576     0.000741     0.000741     0.000741
 4 Issued store conditionals...................................        53312     0.000273     0.000273     0.000273
20 Graduated store conditionals................................        12304     0.000063     0.000063     0.000063
 5 Failed store conditionals...................................           16     0.000000     0.000000     0.000000
 1 Issued instructions.........................................  11858107248     0.000000     0.000000    60.810806
 8 Correctable scache data array ECC errors....................            0     0.000000     0.000000     0.000000
11 Instruction misprediction from scache way prediction table..      1267216     0.000000     0.000000     0.006499
12 External interventions......................................            0     0.000000     0.000000     0.000000
13 External invalidations......................................            0     0.000000     0.000000     0.000000
14 Virtual coherency conditions................................            0     0.000000     0.000000     0.000000
15 Graduated instructions......................................  11144438752     0.000000     0.000000    57.150968
17 Graduated instructions......................................  11285333280     0.000000     0.000000    57.873504
27 Data misprediction from scache way prediction table.........      2610640     0.000000     0.000000     0.013388
28 External intervention hits in scache........................            0     0.000000     0.000000     0.000000
29 External invalidation hits in scache........................            0     0.000000     0.000000     0.000000
31 Store/prefetch exclusive to shared block in scache..........            0     0.000000     0.000000     0.000000

Statistics
=========================================================================================
Graduated instructions/cycle................................................     0.934972
Graduated floating point instructions/cycle.................................     0.054277
Graduated loads & stores/cycle..............................................     0.342449
Graduated loads & stores/floating point instruction.........................     6.309327
Mispredicted branches/Decoded branches......................................     0.063447
Graduated loads/Issued loads................................................     0.656039
Graduated stores/Issued stores..............................................     0.994907
Data mispredict/Data scache hits............................................     0.033459
Instruction mispredict/Instruction scache hits..............................     0.181468
L1 Cache Line Reuse.........................................................    42.609152
L2 Cache Line Reuse.........................................................     5.009559
L1 Data Cache Hit Rate......................................................     0.977069
L2 Data Cache Hit Rate......................................................     0.833598
Time accessing memory/Total time............................................     0.617800
L1--L2 bandwidth used (MB/s, average per process)...........................    75.765314
Memory bandwidth used (MB/s, average per process)...........................    48.645892
MFLOPS (average per process)................................................    10.583927
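For anyone who wants to check the perfex numbers by hand, here is a small sketch of the arithmetic. The function names are made up, and the idea that the "typical time" column is simply the count multiplied by a fixed cost per event is my reading of the output, not something taken from the perfex documentation.

```c
#include <stdint.h>

/* Cycle count to wall time at a given clock, e.g. 195 MHz. */
static double cycles_to_seconds(uint64_t cycles, double mhz)
{
    return (double)cycles / (mhz * 1e6);
}

/* Implied cost of one event in cycles, from a "Time (sec)" column entry. */
static double cycles_per_event(double seconds, uint64_t events, double mhz)
{
    return seconds * mhz * 1e6 / (double)events;
}

/* Hit rate as used in the Statistics block: 1 - misses / accesses. */
static double hit_rate(uint64_t misses, uint64_t accesses)
{
    return 1.0 - (double)misses / (double)accesses;
}
```

With the values from the listing: 11919539456 cycles at 195 MHz is about 61.13 seconds, the 15575232 secondary data cache misses at 12.495 seconds work out to roughly 156 cycles each, and 1 - 15575232/93600272 gives the reported L2 data cache hit rate of about 0.834.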

Hm, about 12.5 seconds (typical) lost to secondary data cache misses. Next, the ssrun experiment which samples these cache misses:
ssrun -exp fdsc_hwc ./mplayer -vo gl2 -vf format=rgb24 -nosound ~/movies/courtyard.avi
Code:
-------------------------------------------------------------------------
SpeedShop profile listing generated Mon Aug 16 03:44:27 2004

   prof -lines mplayer.fdsc_hwc.m16892

                 mplayer (n32): Target program
                      fdsc_hwc: Experiment name
                  hwc,26,29:cu: Marching orders
               R10000 / R10010: CPU / FPU
                             1: Number of CPUs
                           195: Clock frequency (MHz.)
  Experiment notes--
          From file mplayer.fdsc_hwc.m16892:
        Caliper point 0 at target begin, PID 16892
                        /usr/local/src/MPlayer-1.0pre5/mplayer -vo gl2 -vf format=rgb24 -nosound /usr/people/frank/movies/courtyard.avi
        Caliper point 1 at exit(0)
-------------------------------------------------------------------------
Summary of perf. counter overflow PC sampling data (fdsc_hwc)--
                        530523: Total samples
 Secondary cache D misses (26): Counter name (number)
                            29: Counter overflow value
                      15385167: Total counts
-------------------------------------------------------------------------
Function list, in descending order by counts
-------------------------------------------------------------------------
 [index]        counts     %   cum.%   samples  function (dso: file, line)

     [1]       7606004  49.4%  49.4%    262276  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 313)
     [2]       4538645  29.5%  78.9%    156505  put_pixels8_c (mplayer: dsputil.c, 891)
     [3]        646439   4.2%  83.1%     22291  put_pixels16_xy2_c (mplayer: dsputil.c, 891)
     [4]        512169   3.3%  86.5%     17661  put_pixels8_xy2_c (mplayer: dsputil.c, 891)
     [5]        182294   1.2%  87.7%      6286  put_pixels16_x2_c (mplayer: dsputil.c, 891)
.
.
.
Line list, in descending order by function-time and then line number
-------------------------------------------------------------------------
        counts     %   cum.%   samples  function (dso: file, line)

          1566    0.0    0.0        54  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 313)  PROLOG(yuv2rgb_c_24_rgb, uint8_t)
         28507    0.2    0.2       983  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 314)  RGB(0);
       2118276   13.8   14.0     73044  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 315)  DST1RGB(0);
        478877    3.1   17.1     16513  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 316)  DST2RGB(0);
          3306    0.0   17.1       114  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 318)  RGB(1);
       2158731   14.0   31.1     74439  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 319)  DST2RGB(1);
        405797    2.6   33.8     13993  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 320)  DST1RGB(1);
          1740    0.0   33.8        60  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 322)  RGB(2);
       2003088   13.0   46.8     69072  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 323)  DST1RGB(2);
.
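A note on reading the counts column: in this experiment a sample is taken every time the hardware counter overflows, so the counts are just samples times the overflow value (29 here). A tiny sketch with made-up function names:

```c
#include <stdint.h>

/* Estimated number of events behind a given number of overflow samples. */
static uint64_t estimated_events(uint64_t samples, uint32_t overflow_value)
{
    return samples * (uint64_t)overflow_value;
}

/* Share of the total in tenths of a percent, e.g. 494 means 49.4%. */
static unsigned share_permille(uint64_t part, uint64_t total)
{
    return (unsigned)(part * 1000u / total);
}
```

For yuv2rgb_c_24_rgb that is 262276 samples times 29, or 7606004 misses, out of 15385167 in total.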

Brrrr, almost 50% of all secondary data cache misses come from yuv2rgb! That routine indeed has some serious issues with the secondary cache. The dst macros are really sticking out. And how is it with the 64 bit version?
Code:
                      15269863: Total counts
.
     [1]       7603684  49.8%  49.8%    262196  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 315)
     [2]       4496856  29.4%  79.2%    155064  put_pixels8_c (mplayer: dsputil.c, 891)
     [3]        648556   4.2%  83.5%     22364  put_pixels16_xy2_c (mplayer: dsputil.c, 891)
     [4]        488128   3.2%  86.7%     16832  put_pixels8_xy2_c (mplayer: dsputil.c, 891)
.
.
.
-------------------------------------------------------------------------
Line list, in descending order by function-time and then line number
-------------------------------------------------------------------------
        counts     %   cum.%   samples  function (dso: file, line)

            87    0.0    0.0         3  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 315)
        367140    2.4    2.4     12660  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 340)
           116    0.0    2.4         4  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 342)
         21663    0.1    2.5       747  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 343)
          2001    0.0    2.6        69  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 344)
          2668    0.0    2.6        92  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 345)
           957    0.0    2.6        33  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 346)
          2552    0.0    2.6        88  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 347)
          5133    0.0    2.6       177  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 348)
          3828    0.0    2.7       132  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 356)
         19140    0.1    2.8       660  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 357)
         22794    0.1    2.9       786  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 364)
           348    0.0    2.9        12  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 366)
            58    0.0    2.9         2  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 367)
          2001    0.0    3.0        69  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 368)
         16211    0.1    3.1       559  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 369)
          1334    0.0    3.1        46  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 370)
         28855    0.2    3.3       995  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 371)
          1392    0.0    3.3        48  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 372)
       2201245   14.4   17.7     75905  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 375)  dst_1[0]=acc1;
           928    0.0   17.7        32  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 380)
           145    0.0   17.7         5  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 381)
            87    0.0   17.7         3  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 382)
            58    0.0   17.7         2  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 383)
            87    0.0   17.7         3  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 384)
        524262    3.4   21.1     18078  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 385)  dst_2[0]=acc2;
           145    0.0   21.1         5  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 390)
           290    0.0   21.1        10  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 392)
           319    0.0   21.1        11  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 393)
           754    0.0   21.1        26  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 394)
           232    0.0   21.1         8  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 395)
           551    0.0   21.1        19  yuv2rgb_c_24_rgb (mplayer: yuv2rgb.c, 396)

So no improvement cache-wise: still the same cache penalties on the dst stores.
Maybe the R10000 automatic prefetch is not working well. This could be tested on an R12K machine like an Octane, which should have better prefetching.

Well, now you know. Lookup tables are costly with respect to cache. We still need a fast colorspace conversion, but I can safely say that it will take a lot of time before we come up with a hardware-accelerated alternative. Vegac's crm plugin also handles output for Impact and VPRO, so I guess we're converging on a point where we can mail the mplayer guys our findings and get the first implementation work done on the plugins and the faster routines. On a side note, I also managed to accelerate the QuickTime IDCT code, but the results were nowhere near as spectacular as with the MPEG IDCT. Oh well.
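
For comparison with the table-driven routine profiled above, here is a minimal sketch of a table-free conversion for a single pixel. It assumes BT.601 video-range input (Y in 16..235, chroma centred on 128) and the commonly used 8-bit fixed-point coefficients. None of it is taken from mplayer's yuv2rgb.c, and the names clamp_fixed8 and yuv_to_rgb24_pixel are made up for illustration. Whether trading table loads for multiplies actually wins on an R10000 would have to be measured with ssrun.

```c
#include <stdint.h>

/* Convert an 8.8 fixed-point value to 0..255, clamping out-of-range input.
 * Negative values are clamped before the shift, so no negative number is
 * ever right-shifted. */
uint8_t clamp_fixed8(int v)
{
    if (v < 0)
        return 0;
    v >>= 8;
    return (uint8_t)(v > 255 ? 255 : v);
}

/* Hypothetical helper: one YCbCr pixel (BT.601, video range) to RGB24.
 * rgb[0] = R, rgb[1] = G, rgb[2] = B. */
void yuv_to_rgb24_pixel(uint8_t y, uint8_t cb, uint8_t cr, uint8_t rgb[3])
{
    int c = 298 * ((int)y - 16) + 128;  /* scaled luma plus rounding term */
    int d = (int)cb - 128;              /* blue-difference chroma */
    int e = (int)cr - 128;              /* red-difference chroma */

    rgb[0] = clamp_fixed8(c + 409 * e);
    rgb[1] = clamp_fixed8(c - 100 * d - 208 * e);
    rgb[2] = clamp_fixed8(c + 516 * d);
}
```

The only memory traffic here is the source samples and the destination store, so the lookup tables drop out of the cache picture. The destination stores that dominate the profile are still there, though.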

Hope you enjoyed my little post and learned a little bit from it. I will think on the cache stuff some more.

Cheers. (bedtime)


Posted: Mon Aug 16, 2004 5:25 am

Joined: Thu Apr 10, 2003 5:33 pm
Posts: 146
Location: Sherbrooke, Quebec, Canada
I have not followed the previous thread about mplayer, and I'm not sure whether you have made any patches available yet. If so, can you post them to the mplayer-dev mailing list or tell me where to get them? I will test and commit these changes to CVS ASAP.


Posted: Mon Aug 16, 2004 9:14 am

Joined: Thu Jan 23, 2003 12:34 pm
Posts: 706
Unfortunately a number of the changes we've made will probably be rejected, because we've done SGI-specific optimizations, rewriting code to work well on SGIs in ways that will probably be slower on x86 and other architectures (though I can't be sure...)

I suppose the BIGGEST thing at this point would be vo_crm (I guess I should rename it vo_sgi since it DOES support a number of SGI chipsets with best-guess features by default).

DEX: I could put a copy of our yuv2rgb conversion in here anyway...this way we could still put this plugin into mplayer and those using it will use our yuv2rgb conversion, which should work better on all mips4 hardware...
---

Speaking of vo_crm, I LOVE DMBUFFERS...
Many thanks to Lewis's sample code he put in another thread, and to some of SGI's samples...

The O2 path is now almost fully optimized...at least the video out portion.
Video is taken from separate buffers (1 for Y, 1 for U, 1 for V) and put into a single linear UYVY buffer via the CPU. This is the current slow point and what I'm still experimenting with optimizing. From there it's drawn to a dmbuffer which does hardware colorspace conversion, then copied (by reference, finally this O2 shows some use) to a texture and drawn with a texture, allowing smoother scaling and better performance.
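
To make the packing step concrete, here is a rough sketch of the CPU copy from three planes into one UYVY buffer. It is not the vo_crm code: the function name pack_planar420_to_uyvy and its parameters are invented for illustration, and it assumes planar 4:2:0 input (one U and one V sample per 2x2 block of pixels) with even width and height.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: interleave planar Y, U and V into packed UYVY.
 * Each output group of 4 bytes (U, Y0, V, Y1) covers two pixels.
 * Strides are in bytes; width and height are assumed to be even. */
void pack_planar420_to_uyvy(const uint8_t *y, const uint8_t *u,
                            const uint8_t *v, int width, int height,
                            int y_stride, int c_stride,
                            uint8_t *dst, int dst_stride)
{
    int row, col;

    for (row = 0; row < height; row++) {
        const uint8_t *yp = y + (size_t)row * y_stride;
        /* Two luma rows share one chroma row in 4:2:0. */
        const uint8_t *up = u + (size_t)(row / 2) * c_stride;
        const uint8_t *vp = v + (size_t)(row / 2) * c_stride;
        uint8_t *out = dst + (size_t)row * dst_stride;

        for (col = 0; col < width; col += 2) {
            out[0] = up[col / 2];
            out[1] = yp[col];
            out[2] = vp[col / 2];
            out[3] = yp[col + 1];
            out += 4;
        }
    }
}
```

Every source byte is read once and every destination byte written once, so a loop like this is bound mostly by memory bandwidth, which fits with it being the slow point.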

The performance hit of using this over just drawing to screen with glDrawPixels is there (sample benchmark plays about 1 second slower using dmbuffers)...but the second we go fullscreen it all changes. glDrawPixels doesn't scale too well to arbitrary sizes...and the time it takes to run the benchmark doubles, whereas the time to run it fullscreen using dmbuffers only goes up by about 1 second!

---

In other news, I've found the bug in the multithreading setup I had before...so that will be re-enabled and hopefully it will give decent speedups. If nothing else it should allow the machine to begin decoding the next frame while waiting for the drawing to finish up...the biggest gains will most likely be on multiprocessor machines, so dual proc Octanes should be gaining something here... I'll add more posts when I get to experiment, but I only have a single proc in my Octane (and a slow one at that...175MHz R10K, and no tram) so my Octane/mgras testing is limited. At this point I fully believe O2s to be the best SGIs for video watching, at least from a price standpoint...unless you have a 400MHz+ Octane, preferably dual, with tram, preferably vpro...but I don't :)


Posted: Mon Aug 16, 2004 9:33 am
Moderator

Joined: Fri May 09, 2003 6:10 am
Posts: 2931
Location: Maryland, USA
you could bracket the new code with #define SGI_OPT or somesuch that could be detected and set by the autoconfig script.
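
Something along these lines, as a minimal sketch. SGI_OPT is just the name suggested above, the assumption being that the configure script passes -DSGI_OPT on IRIX builds. The function yuv2rgb_variant is a made-up placeholder, not an mplayer symbol.

```c
/* Hypothetical example: the SGI-tuned path is compiled only when the
 * build system defines SGI_OPT, so other platforms keep the stock code. */
const char *yuv2rgb_variant(void)
{
#ifdef SGI_OPT
    /* the MIPS-tuned conversion routine would be selected here */
    return "sgi-tuned";
#else
    /* unchanged upstream routine for every other architecture */
    return "generic";
#endif
}
```

That way the x86 build never sees the rewritten code at all, which should take care of the concern about slowing down other architectures.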


Posted: Mon Aug 16, 2004 10:46 am

Joined: Thu Aug 21, 2003 11:47 am
Posts: 560
Location: Southern PA
squeen wrote:
you could bracket the new code with #define SGI_OPT or somesuch that could be detected and set by the autoconfig script.


In an unrelated project of mine (fixing up a different free software package to build with vendor compilers when available), I need to make changes based not on which platform the code is being built on, but on which compiler is used. Mostly the changes need to be made to the generated source files, but one or two of them should be made to the source code as well.


Posted: Mon Aug 16, 2004 11:01 am

Joined: Tue Oct 21, 2003 2:07 am
Posts: 4226
Location: Rosario / Santa Fe / República Argentina
vegac wrote:

Speaking of vo_crm, I LOVE DMBUFFERS...
Many thanks to Lewis's sample code he put in another thread...



...Could anyone point me to that thread? I'd be interested to see his dmBuffers implementation. I'm looking to optimize some operations between glCopyPixels and dmBuffers in my current project.

Thanks in advance! ;)

_________________
Oh!, let me write that!

https://www.facebook.com/GeekTronix
https://geekli.st/GeekTronixShop
https://www.rebelmouse.com/GeekTronixShop/
http://twitter.com/GeekTronixShop
http://www.youtube.com/GeekTronixStream


Posted: Mon Aug 16, 2004 3:10 pm

Joined: Thu Jan 23, 2003 12:34 pm
Posts: 706
Check the IDCT thread on I think the second page (or ending of first page).

A few notes: for it to be fast to copy from dmbuffers to textures, the dmbuffer has to be in RGBA format, as does the texture, and you have to copy with glTexSubImage2dEXT, copying the entire pbuffer (with associated dmbuffer) to the texture...

I will (shortly) put my video-out plugin up somewhere you can grab it - need to fix a few bugs first (figure you don't want those). Right now it's effectively doing memory->texture conversion of YCrCb data to RGBA data...and does so pretty quickly :)


Posted: Mon Aug 16, 2004 4:31 pm

Joined: Tue Oct 21, 2003 2:07 am
Posts: 4226
Location: Rosario / Santa Fe / República Argentina
vegac wrote:
Check the IDCT thread on I think the second page (or ending of first page).


Hi Vegac; I'll check it soon!

vegac wrote:
A few notes: for it to be fast to copy from dmbuffers to textures, the dmbuffer has to be in RGBA format, as does the texture, and you have to copy with glTexSubImage2dEXT, copying the entire pbuffer (with associated dmbuffer) to the texture...


Yes; I'm already using RGBA colorspace matrices, but the work in my engine is currently done manually with my own routines, using: glDrawPixels, glReadPixels, glTexImage2D, glXAssociateDMPbufferSGIX, glCopyTexSubImage2D... (and GL_READ_BUFFER)

(I edited the previous list, since there was an error in my description)

Now that you mention it... I'm not sure why I'm taking on all this extra work... I can't remember why, but I've never tried glTexSubImage2dEXT (used on main memory) ...it seems I could be saving the processor time spent arranging the 2D matrices... :roll:

vegac wrote:
I will (shortly) put my video-out plugin up somewhere you can grab it - need to fix a few bugs first (figure you don't want those).


I'm an ecological guy: I can offer my hospitality to these bugs if they are not too bad! :D

vegac wrote:
Right now it's effectively doing memory->texture conversion of YCrCb data to RGBA data...and does so pretty quickly :)


Cool! I'm not even at that stage yet; all my work is currently RGBA-based, since I'm still working with *.BMP / *.TGA sequences in my NLE editor, and even when I tried some RGB/YUV conversions, I quickly switched to my DigitalMedia C++ wrappers, which currently keep a good part of all these processes hidden...

But since I've been working on both projects for more than a year, I think a new point of view and/or some help can't kill me! :lol:

Thanks in advance! ;)



Posted: Tue Aug 17, 2004 3:14 pm

Joined: Thu Nov 27, 2003 1:30 pm
Posts: 547
Location: london
vegac wrote:
In other news, I've found the bug with my multithreading setup I had before...so that will be reenabled and hopefully it will give decent speedups. If nothing else it should allow the machine to begin decoding the next frame while waiting for the drawing to finish up...there will most likely be the biggest gains on multiprocessor machines, so dual proc Octanes should be gaining something here...

Cool! Did ya' get it to work with slices properly?

Quote:
At this point I fully believe O2s to be the best SGIs for video watching, at least from a price standpoint...unless you have a 400MHz+ Octane, preferably dual, with tram, preferably vpro...but I don't :)

I have to disagree... I have a 400Mhz O2 and a dual 300Mhz Octane SSE, and the Octane absolutely kicks the O2's butt, even with one CPU disabled! The faster memory more than makes up for the YUV magic, it seems. Obviously it can't scale stuff, but I just switch resolutions.

Dexter, I gotta say, I looked at that yuv2rgb.c stuff a while back and it made no sense to me whatever. So good going :)


Posted: Tue Aug 17, 2004 4:54 pm

Joined: Wed Feb 19, 2003 2:54 pm
Posts: 976
I can now watch divx on my 600Mhz O2, THANKS!

I'm really really hoping to use the video_out plugin soon!


Thanks


Posted: Tue Aug 17, 2004 5:17 pm

Joined: Thu Jan 23, 2003 12:34 pm
Posts: 706
A little bit of an update - today I added multithreading to vo_crm (the SGI-specific video out plugin I've been working on) and have begun adding SGIX_pixel_texture support...which might help accelerate YUV2RGB conversion on MGras and VPro hardware more than SGI_color_matrix does. In theory it should be much faster...but theory is a funny thing sometimes :)

Multithreading proved pretty useful, and while not fully finished it's showing promise. Videos that would previously drop frames now run full speed on Schleusel's dual 300MHz Octane w/ VPro. As of now there's only 1 video he's unable to watch without dropped frames (I think that's what he told me) so progress is definitely being made, and in another few days I may have to change my previous statement, and say that a dual R12K Octane with tram will be the preferred DivX box for Irix users :)

Unfortunately my multithreading and mgras/vpro optimizations are slower going, because my Octane only has SI graphics (without tram...there goes mgras optimizations) and a single (slow) 175mhz r10k...but Dex and Sch have been great about testing the plugin every revision I do and helping try to optimize it all...

And lewis:
Ya it's working properly with slices, and texturing is fine as well as glDrawPixels, so with tram scaling is fine :) But as of now, software yuv2rgb conversion (on higher res videos) is still faster than color_matrix conversion on VPro/CRM at least - probably also MGras. Hopefully SGI(S/X)_pixel_texture will be faster :)


Posted: Tue Aug 17, 2004 5:50 pm

Joined: Wed Feb 19, 2003 2:54 pm
Posts: 976
vegac: any progress on the video_out plugin?


Posted: Sat Aug 21, 2004 2:39 pm

Joined: Thu Nov 27, 2003 1:30 pm
Posts: 547
Location: london
vegac wrote:
But as of now, software yuv2rgb conversion (on higher res videos) is still faster than color_matrix conversion on VPro/CRM at least - probably also MGras.

Bummer. But the software conversion is never going to be multithreaded, whereas the color_matrix can be, right?


Posted: Sun Aug 22, 2004 3:22 pm

Joined: Thu Jan 23, 2003 12:34 pm
Posts: 706
So very much has happened lately...

vo_crm is now known as vo_sgi, as it supports a wide variety of machines beyond O2s...

Colorspace conversion can be done by a number of routines:
1) YCrCb hardware conversion (O2 only),
2) SGI_color_matrix (O2, MGRAS, VPro, and also Indy?)
3) SGIX_pixel_texture (MGRAS)
4) SGIS_pixel_texture (VPro)
5) Software conversion

Different hardware has different settings that work best; with the help of Dexter and Schleusel we're trying to nail down which settings are best for which hardware...

All of the above conversion routines are now in the vo, instead of relying on external code (as software conversion used to). What this means is that they can all be run multithreaded. Of course, on a single CPU machine this is slower, but for those of you with multi proc Octanes (or say, an Onyx?) - lucky you. Yes dexter/schleusel, software conversion is in the VO now :) And it appears (in my tests) to be faster than using it via -vf format=RGB24 like we used to!

When multithreading is enabled, the code splits up in 3 threads
1) Main mplayer thread, mostly running the codec
2) YV12->YUV/YUVA/RGB/UYVY unpacking
3) Rendering (glDrawPixels, drawing via textures, etc.)

While I doubt all 3 keep busy, multithreading has shown big speedups for Schleusel on his dual proc Octane. No way of knowing if it shows gains beyond 2 cpus, but who knows, a few people out there have Onyx's that might want to give it a go. It should also be noted that multithreading is a bit...unstable right now, so don't try resizing the window once the movie is playing :)


Posted: Sun Aug 22, 2004 6:35 pm

Joined: Mon May 31, 2004 10:27 pm
Posts: 101
Location: QLD, Australia
Where can we mere mortals download these state-of-the-art binaries to try out?

Cheers
barefoot

