C-Ray FP/CPU Benchmark Test Results

by bjornl » Thu Sep 20, 2007 10:10 am

Moderator Note: This topic was split from bjornl's O200 Gigachannel rack. 4x270MHz R12K. 3Gb. FC etc topic in the SGI:hinv Forum. <recondas>

mapesdhs wrote:
> The best CPU for O200 is the dual-R12K/360 module with 4MB L2, so best config would be quad-360.

Darn

mapesdhs wrote:
> Nice system btw! Would be interesting to know how well it runs the C-Ray test.

I would be happy to try and run the test, if I knew what it was and where to find it. I searched the net and only found references to Cray, but maybe that is what it means.

/Bjorn

by mapesdhs » Thu Sep 20, 2007 1:16 pm

bjornl wrote:
> Darn

Ach, don't feel bad! :D Besides, if I had a quad-360 setup, I wouldn't keep it. Given the
value of such a system, one could sell it and have more than enough to get a very
nice quad-700 Tezro instead. 8)


> I would be happy to try and run the test, if I knew what it was and where to find it. ...
> I searched the net and only found references to Cray, but maybe that is what it means.

Nothing to do with Cray (just goes to show how dumb the search engines are).

Written by John Tsiombikas (one of the people creating the game Theseis), C-ray is a
small ray tracing benchmark, using a dataset that fits entirely in L2 cache, so it's purely
a fp CPU test. I'm taking over the maintenance of the results page and will begin adding
multi-CPU/multi-core results soon. Here's the test archive (198K):

http://www.futuretech.blinkenlights.nl/ ... 1.1.tar.gz

and here's the current results table:

http://www.futuretech.blinkenlights.nl/c-ray.html

The table doesn't yet include multi-CPU/core results (first need to work out a good
way of making it clear which results are single-CPU/core and which are not) but here
is a MIPS summary which includes some multi-CPU examples I've collated, along
with the C-Ray author's dual-core PentiumD result:

Code: Select all

                               Time
                               (ms)

Dual-core PentiumD 3GHz         692
Octane2  Dual-R14K/600 (2MB)    839
Onyx     Quad-R10K/195 (2MB)   1307
Fuel     R14K/700 (4MB)        1425
Octane2  R14K/600 (2MB)        1660
Fuel     R14K/500 (2MB)        1998
Octane2  R12K/400 (2MB)        2517
O2       R12K/400 (2MB)        2556
O2       R12K/270 (2MB)        3793
O2       R10K/250 (1MB)        4158
Indigo2  R10K/195 (1MB)        5274
Onyx     Quad-R4K/150 (1MB)    5749
O2       R5200SC/300 (1MB)     6393
O2       R5000SC/200 (1MB)     9751


The newer IRIX systems perform pretty well for this test, given their low clock, age and lack
of instruction extensions that are available in x86 CPUs. Indeed, the main table shows
R10K/R12K MIPS systems outperforming single-core PCs with much higher clock speeds
quite easily. However, the newer dual-core x86 options change things a lot.

So, I would expect your quad-270 O200 to be a smidgen slower than the dual-600 Octane2.

Cheers! :)

Ian.

PS. As an x86 scaling efficiency comparison, note that John's dual-core PentiumD 3GHz does
the test in 1365ms when running just 1 thread, ie. using only a single core. The scaling is almost
identical to the percentage speed increase for multi-CPU SGIs, eg. single vs. dual-600 Octane2.
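
As a rough sketch of the arithmetic behind that comparison, using only the times quoted above (the function names are made up for illustration):

```python
# Speedup and parallel efficiency from the C-Ray 'scene' times quoted above.
# The helper names are illustrative; they are not part of the benchmark.

def speedup(single_ms, multi_ms):
    """How many times faster the multi-CPU/core run is."""
    return single_ms / multi_ms

def efficiency(single_ms, multi_ms, n_cpus):
    """Fraction of ideal linear scaling achieved (1.0 = perfect)."""
    return speedup(single_ms, multi_ms) / n_cpus

# (single-CPU/core time in ms, dual time in ms, number of CPUs/cores)
results = {
    "PentiumD 3GHz, 1 vs 2 cores": (1365, 692, 2),
    "Octane2 R14K/600, 1 vs 2 CPUs": (1660, 839, 2),
}

for name, (single, multi, n) in results.items():
    print(f"{name}: {speedup(single, multi):.2f}x, "
          f"{efficiency(single, multi, n):.1%} of linear")
```

Both cases come out at roughly 1.97x, i.e. close to 99% of linear scaling.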

Re: O200 Gigachannel rack. 4x270MHz R12K. 3Gb. FC etc

by jan-jaap » Fri Sep 21, 2007 7:49 am

Fuel, 600MHz R14k (4MB L2)
MIPSpro 7.4.4m, CFLAGS changed to IP35:

Code: Select all

asterix 123% cat scene | ./c-ray-mt > foo.ppm
c-ray-mt v1.1
Rendering took: 1 seconds (1641 milliseconds)

GCC 4.2.1, using GNU binutils 2.18 (GCC does NOT have a model for the R10k family of CPUs!).

Code: Select all

asterix 172% cat scene | ./c-ray-mt > foo.ppm
c-ray-mt v1.1
Rendering took: 2 seconds (2379 milliseconds)

GCC loses ...

Pentium4 (HT), 2.8GHz, Linux

GCC 3.4.6
2 threads:

Code: Select all

Rendering took: 1 seconds (1048 milliseconds)

1 thread:

Code: Select all

Rendering took: 1 seconds (1430 milliseconds)


GCC 4.1.2
2 threads:

Code: Select all

Rendering took: 1 seconds (1114 milliseconds)

1 thread:

Code: Select all

Rendering took: 1 seconds (1575 milliseconds)

GCC 4.x is slower than GCC 3.x ...

Open64 4.0, a.k.a. MIPSpro ported to IA64 (by SGI), then to x64 (by PathScale):
2 threads:

Code: Select all

Rendering took: 1 seconds (1282 milliseconds)

1 thread:

Code: Select all

Rendering took: 1 seconds (1837 milliseconds)

Open64 loses ...

To be honest: Open64 is a compiler targeted at x64, that happens to do x86 as well. Once I've got Linux running on my new quad core box I'll see what that brings ...

Oh, and last but not least: I had a quick look at the source code and the math in there is extremely trivial. It's really dubious to use this as a benchmark, especially cross platform. It's a bit like benchmarking the reference BLAS implementation from netlib.org across platforms, when any target optimized BLAS implementation beats the reference implementation by a factor of at least 10:1.
:PI: :Indigo: :Indigo: :Indy: :Indy: :Indy: :Indigo2: :Indigo2: :Indigo2IMP: :Octane: :Octane2: :O2: :O2+: Image :Fuel: :Tezro: :4D70G: :Skywriter: :PWRSeries: :Crimson: :ChallengeL: :Onyx: :O200: :Onyx2: :O3x02L:
To accentuate the special identity of the IRIS 4D/70, Silicon Graphics' designers selected a new color palette. The machine's coating blends dark grey, raspberry and beige colors into a pleasing harmony. (IRIS 4D/70 Superworkstation Technical Report)

by mapesdhs » Wed Sep 26, 2007 3:05 am

jan-jaap writes:
> Fuel, 600MHz R14k (4MB L2)
> Rendering took: 1 seconds (1641 milliseconds)

Yep, sounds spot on.


> Pentium4 (HT), 2.8GHz, Linux

Thanks! But more details of the system please, and you should try running with more than 2 threads (trust me).
8 or 16 might be optimal.


> GCC 4.x is slower than GCC 3.x ...

What did they do?


> To be honest: Open64 is a compiler targeted at x64, that happens to do x86 as well. Once I've got
> Linux running on my new quad core box I'll see what that brings ...

If you try it on a quad-core, test with 4, 8, 12, 16, etc. threads. Might find 64 is optimal, or some off-number either side like 62 or 68.


> Oh, and last but not least: I had a quick look at the source code and the math in there is extremely trivial. It's
> really dubious to use this as a benchmark, especially cross platform. ...

Any test can be used as a benchmark, if only of what the test itself is doing. Btw, the code was not written for SGIs
first, if that's what you're wondering about. The orig program was written for Linux/gcc/x86.

Ian.

by jan-jaap » Wed Sep 26, 2007 4:34 am

mapesdhs wrote:
> jan-jaap writes:
> > Fuel, 600MHz R14k (4MB L2)
> > Rendering took: 1 seconds (1641 milliseconds)
>
> Yep, sounds spot on.
>
> > Pentium4 (HT), 2.8GHz, Linux
>
> Thanks! But more details of the system please,

Pentium 4 CPU, 2800MHz, 512kB L2 cache, 1 core, hyper threading
Generic ASUS P4 motherboard with Intel 965 chipset
1 GB DDR RAM (speed unknown, might be DDR2 even)
Running Debian 4.0 'Etch' (kernel 2.6.18)

> and you should try running with more than 2 threads (trust me).
> 8 or 16 might be optimal.

Right. For 6 ... 16 threads I get times around 990ms, using GCC 3.4.6
Edit: With the Intel compiler (v9.0), using "-O3 -ipo -xN", I get times as low as 707ms from the same system!

> > GCC 4.x is slower than GCC 3.x ...
>
> What did they do?

GCC4 marks the beginning of a new optimization framework (tree-SSA). It's not finished, and of course the old stuff is still there. I also noticed that using SSE2 as math engine instead of the old x87 stuff increases the runtime ...

The ATLAS numerical people also put out a warning that GCC3 generated ATLAS was faster than GCC4 versions.

Same is true on the Fuel: GCC 3.4.6: 2247ms, GCC 4.2.1: 2379ms.

> If you try it on a quad-core, test with 4, 8, 12, 16, etc. threads. Might find 64 is optimal, or some off-number either side like 62 or 68.


Here it goes:
Intel Core2 Quad (Q6600) CPU
4 * 2.4GHz cores
2 * 4MB L2 cache
ASUS P5WDG2WS Pro workstation motherboard
2GB DDR2 RAM
Knoppix 5.0 CD, kernel 2.6.17

With 16 threads: ~ 250ms, with 64 threads ~ 235 ms, with 128 threads: ~ 275 ms
Did I mention this thing is fast :D ?
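
For anyone wanting to script a sweep like this, here is a minimal sketch of the bookkeeping. It assumes the timing line has the exact form shown in the runs earlier in the thread ("Rendering took: ... (N milliseconds)"); the helper names are hypothetical, and how the thread count is passed to c-ray-mt is deliberately left out.

```python
import re

# Matches the timing line printed by c-ray-mt in the runs quoted in this thread.
RENDER_RE = re.compile(r"Rendering took: \d+ seconds \((\d+) milliseconds\)")

def parse_ms(output):
    """Extract the render time in milliseconds from captured program output."""
    match = RENDER_RE.search(output)
    if match is None:
        raise ValueError("no 'Rendering took' line found")
    return int(match.group(1))

def best_thread_count(timings):
    """Return the thread count with the lowest time from {threads: ms}."""
    return min(timings, key=timings.get)

# Sample output copied from the Fuel run earlier in the thread.
sample = "c-ray-mt v1.1\nRendering took: 1 seconds (1641 milliseconds)\n"
print(parse_ms(sample))

# Approximate Q6600 'scene' times reported above (GCC build).
q6600 = {16: 250, 64: 235, 128: 275}
print(best_thread_count(q6600))
```

Since the times vary a little from run to run, repeating each thread count a few times and keeping the minimum would probably give steadier numbers.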

If I think about it, I'll also give it a spin on the SGI 2100 (8*400MHz w. 8MB L2)
Last edited by jan-jaap on Wed Sep 26, 2007 5:24 am, edited 1 time in total.

by mapesdhs » Wed Sep 26, 2007 5:14 am

jan-jaap writes:
> Pentium 4 CPU, 2800MHz, 512kB L2 cache, 1 core, hyper threading

Thanks!! Hey, can you run these tests for the sphfract file as well? (much tougher)

What name should I credit the results to? And what host name should I use for the P4/2.8GHz result?


> Right. For 6 ... 16 threads I get times around 990ms, using GCC 3.4.6

I've put in 990ms for the result.


> Intel Core2 Quad (Q6600) CPU

Awesome! 8) That'll be at the top of the table I expect when I add the multi-core results.


> 4 * 2.4GHz cores

Ach, not thought of overclocking it to 4GHz? ;)


> ASUS P5WDG2WS Pro workstation motherboard

I'm intrigued! What kind of I/O does this have? Any PCIX at all?


> With 16 threads: ~ 250ms, with 64 threads ~ 235 ms, with 128 threads: ~ 275 ms

I'll put it in as 235ms for 64 threads.


> Did I mention this thing is fast :D ?

:D:D


> If I think about it, I'll also give it a spin on the SGI 2100 (8*400MHz w. 8MB L2)

That would be interesting! I expect you'll find that the optimum no. of threads is somewhere
around the 80 mark. Might be different for the sphfract file. Rough guess, it'll probably give
around the 350ms mark.

Ian.

by kramlq » Wed Sep 26, 2007 6:08 am

jan-jaap wrote:
> I also noticed that using SSE2 as math engine instead of the old x87 stuff increases the runtime ...

SSE has an additional 8 large registers (or 16 if in Intel64 mode) that potentially have to be saved (to memory) across a thread context switch. Presumably every thread in your process is contending to use the SSE engine, so the kernel can't benefit much from lazy state switching when context switching between them, and the end result is that lots of saving and restoring occurs. Just a theory - I might be wrong, as I know little about this benchmark.

by mapesdhs » Wed Sep 26, 2007 6:20 am

kramlq writes:
> switch. Presumably every thread in your process is contending to use the SSE engine, so the kernel can't benefit

Speaking of which jan-jaap, have you tried the test with HT turned off? I found a lot of tasks were faster
with HT off.

Ian.

by jan-jaap » Wed Sep 26, 2007 6:55 am

Some more data, using the Intel compiler (v9.0)

The P4 @ 2.8GHz:
'scene' : 707ms (small changes per run, not much difference between 6 ... 16 threads)
'sphfract': 13386ms (6 threads), 13324 (8 threads), 13360ms (16 threads)

The quad core (Q6600, 4*2400MHz), this time running Debian 4.0 for AMD64, ICC 9.0 with '-fast' compiler flags, generating 64bit code with full interprocedural optimization:

'scene': 161ms (small changes per run, not much difference between 64 or 128 threads)
'sphfract': 3231ms, again pretty much the same for 64 or 128 threads. Times go up for 256 threads.

@Ian: The P5WDG2-WS board has the Intel 975X chipset with dual PCIe 16x slots, dual GBit ethernet (logically on the PCIe bus, not PCI!), and two PCI-X slots behind a PCIe => PCI-X bridge (a bit like Origin has XIO => PCI64 bridges). One of the purposes of this system is driver development for high speed Firewire adapters (multiple 800Mb/s 1394b interfaces on a single PCI-X board). The motherboard was carefully selected for this task, and I can tell you it's a bit of a bandwidth monster :)

Hostnames: the P4 is 'dapdev17.daptechnology.com' and the quadcore is dapdev18.daptechnology.com. Feel free to credit J.J.vanderHeijden <at> gmail.com

by jan-jaap » Wed Sep 26, 2007 8:19 am

mapesdhs wrote:
> kramlq writes:
> > switch. Presumably every thread in your process is contending to use the SSE engine, so the kernel can't benefit
>
> Speaking of which jan-jaap, have you tried the test with HT turned off? I found a lot of tasks were faster
> with HT off.
>
> Ian.

Huh, now you made me reboot the thing ;)

With HT disabled in the BIOS, I get ~ 945ms for 'scene' and ~ 18830ms for 'sphfract'. It seems the Intel compilers get some use out of HT. The times slowly climb for nthreads > 4 which makes sense since more threads competing for the same resources mean more context switches.
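
To put a number on that, here is a small sketch comparing the HT-off times with the HT-on times from the previous post. It assumes both sets of runs used the same Intel-compiled binary, and it takes the best reported sphfract time with HT on; the function name is illustrative only.

```python
# Gain from hyper-threading on the P4 2.8GHz, from the times reported above.
# Assumes HT-on and HT-off runs used the same Intel compiler build.

def ht_gain(ht_off_ms, ht_on_ms):
    """Ratio of HT-off time to HT-on time (values above 1.0 mean HT helps)."""
    return ht_off_ms / ht_on_ms

scene_gain = ht_gain(945, 707)           # 'scene', approximate times
sphfract_gain = ht_gain(18830, 13324)    # 'sphfract', best HT-on time (8 threads)

print(f"scene:    {scene_gain:.2f}x faster with HT on")
print(f"sphfract: {sphfract_gain:.2f}x faster with HT on")
```

So for this workload HT is worth somewhere around a third in throughput, a little more on the heavier scene.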

The Fuel completes 'sphfract' in ~ 47202ms. Again, with large numbers of threads the times slowly rise, but for nthreads < 8 this is in the 4th decimal so irrelevant. But (unlike the PCs), times don't (initially at least) improve with more threads either, and that's what I would expect as well.

by mapesdhs » Wed Sep 26, 2007 9:01 am

jan-jaap writes:
> With HT disabled in the BIOS, I get ~ 945ms for 'scene' and ~ 18830ms for 'sphfract'. It seems the Intel compilers
> get some use out of HT. The times slowly climb for nthreads > 4 which makes sense since more threads competing
> for the same resources mean more context switches.

Thanks for the new info!! Btw, wouldn't it be fastest for single-core (HT off) with just 1 thread?


> The Fuel completes 'sphfract' in ~ 47202ms. ...

Try it with 1 thread, that should give the best result.


> Again, with large numbers of threads the times slowly rise, but for nthreads < 8 this is in the 4th decimal so irrelevant. But
> (unlike the PCs), times don't (initially at least) improve with more threads either, and that's what I would expect as well.

Multiple threads only help for systems with multiple CPUs/cores or HT tech (AFAIK).

Ian.

by jan-jaap » Wed Sep 26, 2007 10:24 am

mapesdhs wrote:
> > The Fuel completes 'sphfract' in ~ 47202ms. ...
>
> Try it with 1 thread, that should give the best result.


Did that -- the numbers are essentially the same for the c-ray-f (single threaded) or the c-ray-mt with 1, 2, 4 threads. I guess we agree that under normal circumstances times shouldn't improve if multiple threads are competing for the same CPU. I guess it takes a good scheduler for the times to not get worse with multiple threads :)

And now the SGI 2100. It's this baby: viewtopic.php?f=14&t=8475 except it now sports a Seagate 15K3 harddisk.

One weird thing: with Nthreads < Ncpus, the program initially runs on 1 CPU for a second or two, and then starts to utilize the rest. With 8 threads or more, it takes all CPUs right away. weird.

scene:

Code: Select all

N      time (milliseconds)
---------------------------
1      2461
2      2464
4      2138
8      1301
16     522
32     441
64     381
128    338
256    336
512    345
1024   388 (*)

(*): Message appears:

Code: Select all

more threads than scanlines specified, reducing number of threads to 600
:)
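
The behaviour behind that message can be sketched as a simple clamp. This only illustrates what the message implies (no more threads than scanlines); it is not taken from the c-ray source, and the function name is invented.

```python
# Sketch of the clamp implied by the message above: the renderer will not
# use more threads than there are scanlines in the output image.
# Not taken from the c-ray source; names are illustrative.

def effective_threads(requested, scanlines):
    """Number of threads actually used for an image with 'scanlines' rows."""
    return min(requested, scanlines)

print(effective_threads(1024, 600))  # request above the scanline count
print(effective_threads(512, 600))   # request below it is left alone
```

With 600 scanlines, the 1024-thread run is effectively a 600-thread run, which fits the time continuing the slow climb seen at 512.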

sphfract

Code: Select all

N      time (milliseconds)
---------------------------
1      70707
2      40841
4      22512
8      11617
16     9683
32     9140
64     9269
128    9106
256    9208
512    9125

For 32 threads or more, all CPUs are pegged and times don't improve anymore.
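
A quick sketch of what the sphfract table means in terms of scaling, taking the 1-thread time as the baseline and the machine's 8 CPUs as the ideal. The numbers are copied from the table above; the variable names are just for illustration.

```python
# Scaling of 'sphfract' on the 8-CPU SGI 2100, from the table above.
N_CPUS = 8

# threads -> time in milliseconds
times = {1: 70707, 2: 40841, 4: 22512, 8: 11617, 16: 9683,
         32: 9140, 64: 9269, 128: 9106, 256: 9208, 512: 9125}

baseline = times[1]
best_n = min(times, key=times.get)        # thread count with the lowest time
best_speedup = baseline / times[best_n]   # relative to the 1-thread run
best_efficiency = best_speedup / N_CPUS   # fraction of ideal 8x scaling

for n, ms in times.items():
    print(f"{n:4d} threads: {baseline / ms:5.2f}x")

print(f"best: {best_n} threads, {best_speedup:.2f}x, "
      f"{best_efficiency:.0%} of linear")
```

Note that 8 threads only reach about 6.1x, and it takes several threads per CPU to get close to the full 8x.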

Big iron rules 8-)

by mapesdhs » Wed Sep 26, 2007 11:24 am

jan-jaap writes:
> And now the SGI 2100. It's this baby: viewtopic.php?f=14&t=8475 except it now sports
> a Seagate 15K3 harddisk.

Nice!!


> One weird thing: with Nthreads < Ncpus, the program initially runs on 1 CPU for a second or two, and then starts
> to utilize the rest. With 8 threads or more, it takes all CPUs right away. weird.

No idea why that should be. :D


> 128 338
> 256 336

Cool stuff!! I'll just use the faster result, even though the difference is small. It's actually slightly
quicker than I was expecting, quite good. Linear increase over a dual-600 Octane2 would have
given 316, so 336 is sweet.

What's the compiler version, etc. for the SGI 2100?


> more threads than scanlines specified, reducing number of threads to 600

Yes, that's because the default output resolution is 800x600. If you used the -s option to create an image
with more lines, then one could use more threads, though whether that would be faster is another matter.


> sphfract
> 128 9106

Cool! 8) Interesting that for a more complex render, the speed advantage of the quad-core increases.

Mind you, in an odd sort of way it all scales rather nicely. For this particular test, having 8 x R12K/400s
is a bit like having a single imaginary 3200MHz MIPS CPU, all arch things being the same. If one
thinks of the quad-core 2.4GHz as being not unlike the speed one would see from a single 9.6GHz
1-core, then the fact that the quad-core is 3X faster than the SGI 2100 for a more complex render
shows just how good SGI's chip design was way back. Clock for clock today, nothing has really improved
in many ways with x86.
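
The "imaginary single CPU" comparison can be sketched numerically from the sphfract times posted above. Summing clock speeds across CPUs is a crude simplification, and the two systems used different compilers (ICC with '-fast' on the quad-core, unknown on the SGI 2100), so treat the result as a ballpark only.

```python
# Clock-for-clock comparison on 'sphfract', using the times posted above.
# "Aggregate MHz" simply sums the clocks of all CPUs/cores, which is a
# crude simplification made only for this illustration.

sgi_mhz = 8 * 400      # SGI 2100, 8 x R12K/400
quad_mhz = 4 * 2400    # Core2 Quad Q6600

sgi_ms = 9106          # best SGI 2100 sphfract time (128 threads)
quad_ms = 3231         # Q6600 sphfract time (ICC 9.0, 64-bit)

wall_clock_ratio = sgi_ms / quad_ms   # how much faster the quad-core finishes

# Work per render expressed as MHz * ms; lower means fewer clock
# cycles spent on the same image.
sgi_cost = sgi_mhz * sgi_ms
quad_cost = quad_mhz * quad_ms
per_clock_ratio = quad_cost / sgi_cost

print(f"quad-core is {wall_clock_ratio:.2f}x faster in wall-clock time")
print(f"quad-core spends {per_clock_ratio:.2f}x the aggregate clock cycles")
```

On these figures the quad-core is about 2.8x faster overall while spending roughly 6% more aggregate clock cycles on the render.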

Must run this thing on my 24-CPU Onyx...

Ian.

by qumefox » Wed Sep 26, 2007 12:20 pm

I have a 16p R12K/400MHz Origin 2400 (well, currently 14 400s and 2 300s, because one 400MHz CPU module is on its way to be exchanged after arriving faulty). Would there be much point in me giving this a run?
:O2000R:, :Fuel:, :320: :1600SW:, :320: :1600SW:, :O2: :1600SW:, :Indigo2IMP:

by mapesdhs » Wed Sep 26, 2007 1:04 pm

qumefox wrote:
> I have a 16p r12k-400mhz (well, currently 14 400's and 2 300's, due to one 400mhz cpu module on its way to get exchanged because it was bad when I received it) origin 2400. Would there be much point in me giving this a run?


Absolutely! 8) It might beat the quad-core. ;D

Well, not for the small test (doesn't last long enough), but a moderate chance it would for the sphfract test.

Ian.

