Nekochan Net

All times are UTC - 8 hours


Posted: Mon Nov 22, 2004 5:59 pm

Joined: Fri Sep 03, 2004 10:45 pm
Posts: 108
Location: SF Bay Area, CA
This code runs at about 1/100th of the speed I think it should, on an Onyx2 running IRIX 6.5.26 (with the MIPSpro 7.3 compilers).

Any ideas why? Very, very strange. (Note: I am not a programmer, sorry :D)

Code:
colour 5# vi NUC_LoopTest.cpp
#include "stdio.h"
#include "stdlib.h"
#include "string.h"

int pixel[640][512];
float gain[640][512];
float offset[640][512];
int N=0;

int i,j,k;

int main(int argc, char* argv[])
{
        N=1000;
        for (j=0;j<640; j++) {
                for (k=0;k<512;k++)  {
                        pixel[j][k] = 312;
                        gain[j][k] = (float)2.345;
                        offset[j][k] = (float)25.456;
                }
        printf(" %d structures initialized\n",N);
        }
        printf(" %d loops started\n",N);
        for (i=0; i<N; i++)  {
                printf(" %d loops completed\n",i);
                for (j=0;j<640; j++) {
                        for (k=0;k<512;k++)  {
                                pixel[j][k] = (int)(gain[j][k]*(float)pixel[j][k] + offset[j][k]);
                        }
                }
        }
        printf("loops done");
        return (0);
}

_________________
My first Indy is still my favourite SGI.
CDglobal Networks: http://www.cdglobal.net/


Posted: Tue Nov 23, 2004 12:06 am
Moderator

Joined: Thu Feb 20, 2003 6:57 am
Posts: 2062
Location: Voorburg, The Netherlands
I have an idea. I noticed that after bringing N down to 10 and stripping all the printf calls (extremely bad for loop optimisation!), it really flies:
Code:
irene /tmp> cc -n32 -Ofast=ip28 -r10000 -mips4 -o loop loop.c
irene /tmp> time loop
pixel 0,0  195fe0 pixel xmax,ymax 195fe0
0.127u 0.061s 0:00.27 66.6% 0+0k 0+0io 0pf+0w

I added those pixel printf calls at the end because otherwise the compiler decides that nothing gets done with the result, so it might optimise the whole loop by throwing it away. Heheh.
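For reference, here is a minimal sketch of what such a stripped-down test might look like. It is a reconstruction, not the actual loop.c used for the timings above: the names init_arrays, run_passes and report are made up for illustration, and the hex print format is only inferred from the output shown. The point is that the loops contain no printf, and two values are printed once afterwards so the compiler has to keep the computation.

```c
#include <stdio.h>

#define ROWS 640
#define COLS 512

static int   pixel[ROWS][COLS];
static float gain[ROWS][COLS];
static float offset[ROWS][COLS];

/* Fill the arrays with the same constants the original test uses. */
static void init_arrays(void)
{
    int j, k;
    for (j = 0; j < ROWS; j++) {
        for (k = 0; k < COLS; k++) {
            pixel[j][k]  = 312;
            gain[j][k]   = 2.345f;
            offset[j][k] = 25.456f;
        }
    }
}

/* Apply the gain/offset correction n times, with no I/O inside the loops. */
static void run_passes(int n)
{
    int i, j, k;
    for (i = 0; i < n; i++)
        for (j = 0; j < ROWS; j++)
            for (k = 0; k < COLS; k++)
                pixel[j][k] = (int)(gain[j][k] * (float)pixel[j][k] + offset[j][k]);
}

/* Print two results after the loops so the computed values are really used. */
static void report(void)
{
    printf("pixel 0,0  %x pixel xmax,ymax %x\n",
           (unsigned)pixel[0][0], (unsigned)pixel[ROWS - 1][COLS - 1]);
}
```

A main that calls init_arrays(), run_passes(10) and report() in that order would have the same shape as the run above. Keep n small: as in the original code, the values grow on every pass.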

Good. Now increase N in steps of 2 and watch what happens:
Code:
N=10
pixel 0,0  195fe0 pixel xmax,ymax 195fe0
0.127u 0.061s 0:00.27 66.6% 0+0k 0+0io 0pf+0w

N=12
pixel 0,0  8b894f pixel xmax,ymax 8b894f
0.143u 0.061s 0:00.29 68.9% 0+0k 0+0io 0pf+0w

N=14
pixel 0,0  2ff50b4 pixel xmax,ymax 2ff50b4
0.160u 0.064s 0:00.29 75.8% 0+0k 0+0io 0pf+0w

N=16
pixel 0,0  107b7cc0 pixel xmax,ymax 107b7cc0
0.177u 0.064s 0:00.37 62.1% 0+0k 0+0io 0pf+0w

N=18
pixel 0,0  5aa31180 pixel xmax,ymax 5aa31180
0.193u 0.065s 0:00.45 55.5% 0+0k 0+0io 0pf+0w

N=20
pixel 0,0  7fffffff pixel xmax,ymax 7fffffff
1.189u 3.528s 0:06.67 70.4% 0+0k 0+0io 0pf+0w


Kablam! Integer overflow. That will cost you dearly. Now, if you change the pixel array to float instead of int and remove the casts, it performs much better:
Code:
N=20
pixel 0,0  8367358464.000000 pixel xmax,ymax 8367358464.000000
0.083u 0.063s 0:00.20 70.0% 0+0k 0+0io 0pf+0w

You still have to do something about your results going to infinity by applying some sort of clamping to the pixel array, because you only want a restricted range of colors to display, right?
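A minimal sketch of such clamping, under assumptions: the thread never says what range the pixels should stay in, so PIXEL_MIN and PIXEL_MAX below are placeholders (a 12-bit range), and clamp_to_int and correct_pixel are hypothetical helper names. The range check is done on the float value before the cast, so the cast never sees a number that does not fit in an int.

```c
/* Assumed output range; the thread does not say what the real limits are. */
#define PIXEL_MIN 0
#define PIXEL_MAX 4095

/* Convert v to int, limited to [lo, hi]. */
static int clamp_to_int(float v, int lo, int hi)
{
    if (!(v > (float)lo))   /* at or below the range, or NaN */
        return lo;
    if (v >= (float)hi)     /* at or above the range, including +infinity */
        return hi;
    return (int)v;          /* safe: lo < v < hi */
}

/* One gain/offset correction with the result clamped. */
static int correct_pixel(int p, float g, float o)
{
    return clamp_to_int(g * (float)p + o, PIXEL_MIN, PIXEL_MAX);
}
```

With this in the inner loop, repeated passes would saturate at PIXEL_MAX instead of running off to 7fffffff.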

:P This advice of course doesn't come cheap :P I'll expect that Onyx2 on my doorstep tomorrow, okay :P


Posted: Tue Nov 23, 2004 11:34 pm

Joined: Fri Sep 03, 2004 10:45 pm
Posts: 108
Location: SF Bay Area, CA
Dexter: She's in the mail, hope your mailbox can hold 400 lbs! ;P

Here's what we're running now, and it is running about as expected. Actually a bit better than expected :D A single R10k-250 is posting a 104 sec execution time, which equals the P4 2.4 GHz system (running Windows, though). I didn't expect to do that well even on mult*add.

Here are our compile options; hopefully they're not optimizing out all the math :)

Code:
cc NUC_LoopTest.cpp -o NUC_LoopTest-OLD -n32 -Ofast=ip27 -r10000 -mips4


Code:
#include "stdio.h"
#include "stdlib.h"
#include "string.h"

int pixel[640][512];
int pixelout[640][512];
float gain[640][512];
float offset[640][512];
long N=0;

long i,j,k;

int main(int argc, char* argv[])
{
        N=10000;
        // fill the input arrays with constant test data
        for (j=0;j<640; j++) {
                for (k=0;k<512;k++)  {
                        pixel[j][k] = 312;
                        gain[j][k] = (float)2.345;
                        offset[j][k] = (float)25.456;
                }
        }
        // report once, after the fill; N is a long, so print it with %ld
        printf(" %ld structures initialized\n",N);
        printf(" %ld loops started\n",N);
        for (i=0; i<N; i++)  {
//              printf(" %ld loops completed\n",i);
                for (j=0;j<640; j++) {
                        for (k=0;k<512;k++)  {
                                pixelout[j][k] = (int)(gain[j][k]*(float)pixel[j][k] + offset[j][k]);
                        }
                }
        }
        printf("loops done\n");
        // use a result so the compiler cannot simply discard the loops
        printf("pixelout %ld\n",pixelout[1][1]*i);

        return (0);
}
// EOF


time ./NUC_LoopTest returns:
Code:
104.083u 0.122s 1:44.91 99.3% 0+0k 0+0io 0pf+0w


Posted: Tue Nov 23, 2004 11:37 pm

Joined: Fri Sep 03, 2004 10:45 pm
Posts: 108
Location: SF Bay Area, CA
Hrm, something is wrong with the system time. Looks like we're still optimizing out the loop, maybe?
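One rough way to judge whether the loops are really being executed is to turn the time output into cycles per pixel update. The sketch below only does that arithmetic. It assumes the R10k-250 runs at 250 MHz and that essentially all of the 104.083 s of user time is spent in the main loops; cycles_per_pixel is a made-up helper name.

```c
/* Estimate CPU cycles spent per pixel update from a measured user time.
   user_seconds: the "u" figure reported by time
   passes:       number of outer iterations (N)
   rows, cols:   array dimensions
   clock_hz:     CPU clock rate in Hz (an assumption, not measured) */
static double cycles_per_pixel(double user_seconds, long passes,
                               long rows, long cols, double clock_hz)
{
    double updates = (double)passes * (double)rows * (double)cols;
    return user_seconds * clock_hz / updates;
}
```

Calling cycles_per_pixel(104.083, 10000, 640, 512, 250.0e6) gives a little under 8 cycles per update. That looks like real work being done on every pass rather than a discarded loop, and a small system time is what a compute-bound program that makes almost no system calls would normally show.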


Posted: Wed Nov 24, 2004 6:37 am

Joined: Tue Jan 13, 2004 10:57 am
Posts: 136
Location: Enschede, The Netherlands
artherd wrote:
Hrm, something is wrong with the system time. Looks like we're still optimizing out the loop, maybe?


Just a few tricks:

1. Use register long i,j,k;
2. Use two temporary floats for (float)2.345 and (float)25.456 instead (preferably declared as consts) and refer to those variables.
3. Count from MAX down to 0 instead (by default, CPUs can test a register against 0 in one clock cycle).
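
The three tricks can be sketched as follows. This is only one possible reading of them, not tested on MIPSpro: the names GAIN_VALUE, OFFSET_VALUE, fill_arrays and run_passes are invented for the example, and the constants are applied to the fill step because that is where the (float) literals appear in the posted code.

```c
#define ROWS 640
#define COLS 512

static int   pixel[ROWS][COLS];
static int   pixelout[ROWS][COLS];
static float gain[ROWS][COLS];
static float offset[ROWS][COLS];

/* Trick 2: named constants instead of repeated (float) casts of literals. */
static const float GAIN_VALUE   = 2.345f;
static const float OFFSET_VALUE = 25.456f;

static void fill_arrays(void)
{
    register long j, k;                    /* trick 1: register hint */
    for (j = ROWS - 1; j >= 0; j--) {      /* trick 3: count down to 0 */
        for (k = COLS - 1; k >= 0; k--) {
            pixel[j][k]  = 312;
            gain[j][k]   = GAIN_VALUE;
            offset[j][k] = OFFSET_VALUE;
        }
    }
}

static void run_passes(long n)
{
    register long i, j, k;                 /* trick 1 again */
    for (i = n; i > 0; i--)                /* trick 3: compare against 0 */
        for (j = ROWS - 1; j >= 0; j--)
            for (k = COLS - 1; k >= 0; k--)
                pixelout[j][k] = (int)(gain[j][k] * (float)pixel[j][k] + offset[j][k]);
}
```

Note that register is only a hint and an optimising compiler is free to ignore it, and counting down also walks the arrays backwards through memory, so whether any of this helps would have to be measured.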


Erik

