This page contains results gleaned from the Contract Monitor output for the ScaLAPACK runs Celso did in September 2001, using an updated ScaLAPACK model for predicting iteration duration.
The plots show
In addition, some notes about experiment parameters and measured durations are included.
N = 12000; NB=64; Processes=8;
| machine | opus16 | opus13 | opus14 | opus15 | torc1 | torc7 | torc4 | torc6 |
| mem(MB) | 240 | 220 | 229 | 220 | 452 | 479 | 446 | 458 |
| speed | 270 | 270 | 270 | 270 | 330 | 330 | 330 | 330 |
| load | 0.88 | 0.97 | 1.00 | 0.86 | 0.35 | 1.12 | 0.38 | 1.35 |
Fine grid latency matrix :
| -1.00 | 0.22 | 0.39 | 0.22 | 81.52 | 81.52 | 81.52 | 81.52 |
| 0.22 | -1.00 | 0.22 | 0.22 | 81.52 | 81.52 | 81.52 | 81.52 |
| 0.39 | 0.22 | -1.00 | 0.22 | 81.52 | 81.52 | 81.52 | 81.52 |
| 0.22 | 0.22 | 0.22 | -1.00 | 81.52 | 81.52 | 81.52 | 81.52 |
| 81.52 | 81.52 | 81.52 | 81.52 | -1.00 | 27.53 | 0.32 | 1.22 |
| 81.52 | 81.52 | 81.52 | 81.52 | 27.53 | -1.00 | 0.31 | 0.30 |
| 81.52 | 81.52 | 81.52 | 81.52 | 0.32 | 0.31 | -1.00 | 0.31 |
| 81.52 | 81.52 | 81.52 | 81.52 | 1.22 | 0.30 | 0.31 | -1.00 |
Fine grid Bandwidth matrix :
| -1.00 | 249.54 | 244.88 | 245.22 | 4.39 | 4.39 | 4.39 | 4.39 |
| 249.54 | -1.00 | 242.84 | 238.42 | 4.39 | 4.39 | 4.39 | 4.39 |
| 244.88 | 242.84 | -1.00 | 239.51 | 4.39 | 4.39 | 4.39 | 4.39 |
| 245.22 | 238.42 | 239.51 | -1.00 | 4.39 | 4.39 | 4.39 | 4.39 |
| 4.39 | 4.39 | 4.39 | 4.39 | -1.00 | 83.04 | 82.14 | 60.12 |
| 4.39 | 4.39 | 4.39 | 4.39 | 83.04 | -1.00 | 81.93 | 81.45 |
| 4.39 | 4.39 | 4.39 | 4.39 | 82.14 | 81.93 | -1.00 | 81.23 |
| 4.39 | 4.39 | 4.39 | 4.39 | 60.12 | 81.45 | 81.23 | -1.00 |
Data shown is for process 0.

In the following zoom view, the peaks occur at iterations 5, 13, 21, etc.
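The peak positions quoted above are spaced 8 iterations apart. As a rough illustration of how such peaks could be picked out of an iteration-duration series, here is a minimal Python sketch. The duration values are synthetic, the names `find_peaks` and `spacings` and the threshold `factor` are hypothetical, and treating the list index as the iteration number is an assumption; none of it is taken from the Contract Monitor itself.

```python
from statistics import median

def find_peaks(durations, factor):
    """Return the indices whose duration exceeds factor * median duration."""
    baseline = median(durations)
    return [i for i, d in enumerate(durations) if d > factor * baseline]

def spacings(indices):
    """Return the gaps between consecutive peak indices."""
    return [b - a for a, b in zip(indices, indices[1:])]

# Synthetic series (NOT measured data): a flat baseline with peaks placed
# at the iterations named in the text. List index = iteration number (assumed).
durations = [10.0] * 24
for iteration in (5, 13, 21):
    durations[iteration] = 25.0

peaks = find_peaks(durations, factor=2.0)  # factor is a hypothetical threshold
print("peaks at iterations:", peaks)
print("spacing between peaks:", spacings(peaks))
```

A median baseline is used here only because it is insensitive to a few large peaks; any other baseline would work for the same illustration.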
This is the same problem size as Experiment 1, but here only 7 processors were selected for the run.
N = 12000; NB=64; Processes=7;
| machine | opus14 | opus13 | opus16 | opus15 | torc4 | torc6 | torc7 |
| mem(MB) | 215 | 214 | 227 | 215 | 233 | 479 | 479 |
| speed | 270 | 270 | 270 | 270 | 330 | 330 | 330 |
| load | 1.00 | 0.99 | 1.00 | 0.99 | 1.00 | 1.04 | 0.87 |
Fine grid latency matrix :
| -1.00 | 0.24 | 0.29 | 0.26 | 83.78 | 83.78 | 83.78 |
| 0.24 | -1.00 | 0.24 | 0.23 | 83.78 | 83.78 | 83.78 |
| 0.29 | 0.24 | -1.00 | 0.23 | 83.78 | 83.78 | 83.78 |
| 0.26 | 0.23 | 0.23 | -1.00 | 83.78 | 83.78 | 83.78 |
| 83.78 | 83.78 | 83.78 | 83.78 | -1.00 | 0.31 | 0.31 |
| 83.78 | 83.78 | 83.78 | 83.78 | 0.31 | -1.00 | 0.31 |
| 83.78 | 83.78 | 83.78 | 83.78 | 0.31 | 0.31 | -1.00 |
Fine grid Bandwidth matrix :
| -1.00 | 248.83 | 247.31 | 246.38 | 2.83 | 2.83 | 2.83 |
| 248.83 | -1.00 | 244.54 | 240.94 | 2.83 | 2.83 | 2.83 |
| 247.31 | 244.54 | -1.00 | 247.54 | 2.83 | 2.83 | 2.83 |
| 246.38 | 240.94 | 247.54 | -1.00 | 2.83 | 2.83 | 2.83 |
| 2.83 | 2.83 | 2.83 | 2.83 | -1.00 | 81.96 | 56.47 |
| 2.83 | 2.83 | 2.83 | 2.83 | 81.96 | -1.00 | 50.90 |
| 2.83 | 2.83 | 2.83 | 2.83 | 56.47 | 50.90 | -1.00 |
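The matrices above can be read programmatically. The following Python sketch parses the Experiment 2 latency matrix and groups machines whose mutual latency is low. It rests on several assumptions not stated on this page: that rows and columns follow the order of the machine table, that -1.00 on the diagonal is a placeholder for "no measurement", and that a cutoff of 10.0 (in whatever unit the matrix uses) separates nearby machines from distant ones. The helper names are hypothetical.

```python
# Experiment 2 latency matrix, copied from the page.
LATENCY_ROWS = """\
| -1.00 | 0.24 | 0.29 | 0.26 | 83.78 | 83.78 | 83.78 |
| 0.24 | -1.00 | 0.24 | 0.23 | 83.78 | 83.78 | 83.78 |
| 0.29 | 0.24 | -1.00 | 0.23 | 83.78 | 83.78 | 83.78 |
| 0.26 | 0.23 | 0.23 | -1.00 | 83.78 | 83.78 | 83.78 |
| 83.78 | 83.78 | 83.78 | 83.78 | -1.00 | 0.31 | 0.31 |
| 83.78 | 83.78 | 83.78 | 83.78 | 0.31 | -1.00 | 0.31 |
| 83.78 | 83.78 | 83.78 | 83.78 | 0.31 | 0.31 | -1.00 |
"""

# Assumption: matrix order matches the machine table for Experiment 2.
HOSTS = ["opus14", "opus13", "opus16", "opus15", "torc4", "torc6", "torc7"]

def parse_matrix(text):
    """Parse pipe-delimited rows into a list of lists of floats."""
    rows = []
    for line in text.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        rows.append([float(c) for c in cells])
    return rows

def group_by_latency(hosts, latency, threshold):
    """Greedy grouping: a host joins the first group whose first member
    is closer than `threshold`; otherwise it starts a new group."""
    groups = []
    for i in range(len(hosts)):
        for group in groups:
            if latency[i][group[0]] < threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return [[hosts[i] for i in group] for group in groups]

latency = parse_matrix(LATENCY_ROWS)
groups = group_by_latency(HOSTS, latency, threshold=10.0)  # hypothetical cutoff
print(groups)
```

With this cutoff the sketch separates the opus machines from the torc machines, which matches the block structure visible in both matrices.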
All data plotted is for process 0.



For Experiment 2, we also collected the raw sensor output for all processes, which records each iteration's duration and the timestamp at which the measurement was made. That information is plotted here:




The following plot shows the ratio of the measured to the predicted values for 3 different "metrics of performance" which might be used as the basis for contract validation.
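As a hedged illustration of the ratio described above, the Python sketch below divides measured values by predicted values per iteration and flags iterations whose ratio exceeds a limit. This page does not name the three metrics, so the sketch uses a single made-up series; the numbers, the `upper_limit` of 1.5, and the function names are all hypothetical and are not the Contract Monitor's actual validation rule.

```python
def measured_to_predicted_ratios(measured, predicted):
    """Return measured[i] / predicted[i] for every iteration i."""
    if len(measured) != len(predicted):
        raise ValueError("measured and predicted must have the same length")
    return [m / p for m, p in zip(measured, predicted)]

def violations(ratios, upper_limit):
    """Return the iteration indices whose ratio exceeds upper_limit."""
    return [i for i, r in enumerate(ratios) if r > upper_limit]

# Made-up values for illustration only (NOT data from these experiments).
measured = [10.0, 12.0, 30.0, 11.0]
predicted = [10.0, 10.0, 10.0, 10.0]

ratios = measured_to_predicted_ratios(measured, predicted)
flagged = violations(ratios, upper_limit=1.5)  # hypothetical tolerance
print("ratios:", ratios)
print("iterations over the limit:", flagged)
```

A ratio near 1 means the model predicted the measurement well; how far above 1 a contract should tolerate is a policy choice this sketch leaves open.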

This is the same problem size as Experiments 1 and 2. Here 8 systems at UIUC and UCSD were selected.
N = 12000; NB=64; Processes=8;
| machine | opus16 | opus14 | opus13 | opus15 | dralion | mystere | quidam | soleil |
| mem(MB) | 225 | 212 | 214 | 213 | 215 | 210 | 224 | 183 |
| speed | 270 | 270 | 270 | 270 | 270 | 240 | 240 | 240 |
| load | 1.00 | 1.00 | 0.84 | 0.99 | 1.00 | 1.00 | 0.64 | 0.71 |
Fine grid latency matrix :
| -1.00 | 0.25 | 0.25 | 0.30 | 134.94 | 134.94 | 134.94 | 134.94 |
| 0.25 | -1.00 | 0.56 | 0.35 | 134.94 | 134.94 | 134.94 | 134.94 |
| 0.25 | 0.56 | -1.00 | 0.23 | 134.94 | 134.94 | 134.94 | 134.94 |
| 0.30 | 0.35 | 0.23 | -1.00 | 134.94 | 134.94 | 134.94 | 134.94 |
| 134.94 | 134.94 | 134.94 | 134.94 | -1.00 | 0.23 | 31.41 | 0.38 |
| 134.94 | 134.94 | 134.94 | 134.94 | 0.23 | -1.00 | 0.24 | 0.36 |
| 134.94 | 134.94 | 134.94 | 134.94 | 31.41 | 0.24 | -1.00 | 0.23 |
| 134.94 | 134.94 | 134.94 | 134.94 | 0.38 | 0.36 | 0.23 | -1.00 |
Fine grid Bandwidth matrix :
| -1.00 | 253.16 | 251.58 | 244.65 | 5.89 | 5.89 | 5.89 | 5.89 |
| 253.16 | -1.00 | 246.61 | 246.96 | 5.89 | 5.89 | 5.89 | 5.89 |
| 251.58 | 246.61 | -1.00 | 239.40 | 5.89 | 5.89 | 5.89 | 5.89 |
| 244.65 | 246.96 | 239.40 | -1.00 | 5.89 | 5.89 | 5.89 | 5.89 |
| 5.89 | 5.89 | 5.89 | 5.89 | -1.00 | 70.90 | 56.22 | 38.10 |
| 5.89 | 5.89 | 5.89 | 5.89 | 70.90 | -1.00 | 74.06 | 83.91 |
| 5.89 | 5.89 | 5.89 | 5.89 | 56.22 | 74.06 | -1.00 | 71.08 |
| 5.89 | 5.89 | 5.89 | 5.89 | 38.10 | 83.91 | 71.08 | -1.00 |
All data plotted is for process 0:



For Experiment 3, we also collected the raw sensor output for all processes, which records each iteration's duration and the timestamp at which the measurement was made. That information is plotted here:




This is the same problem size as Experiments 1, 2, and 3. Here 7 systems at UIUC and UTK were selected. There was an extremely high network load on the UTK systems during this run. An additional computational load was introduced on Processor X about 70 iterations into the run.
N = 12000; NB=64; Processes=7;
| machine | opus15 | opus14 | opus16 | torc4 | torc6 | torc5 | torc7 |
| mem(MB) | 225 | 225 | 226 | 486 | 486 | 486 | 487 |
| speed | 270 | 270 | 270 | 330 | 330 | 330 | 330 |
| load | 1.00 | 1.00 | 1.00 | 1.00 | 1.56 | 1.17 | 1.11 |
Fine grid latency matrix :
| -1.00 | 0.29 | 0.29 | 193.71 | 193.71 | 193.71 | 193.71 |
| 0.29 | -1.00 | 0.22 | 193.71 | 193.71 | 193.71 | 193.71 |
| 0.29 | 0.22 | -1.00 | 193.71 | 193.71 | 193.71 | 193.71 |
| 193.71 | 193.71 | 193.71 | -1.00 | 0.32 | 0.31 | 0.28 |
| 193.71 | 193.71 | 193.71 | 0.32 | -1.00 | 0.49 | 0.30 |
| 193.71 | 193.71 | 193.71 | 0.31 | 0.49 | -1.00 | 0.31 |
| 193.71 | 193.71 | 193.71 | 0.28 | 0.30 | 0.31 | -1.00 |
Fine grid Bandwidth matrix :
| -1.00 | 258.52 | 242.39 | 0.73 | 0.73 | 0.73 | 0.73 |
| 258.52 | -1.00 | 252.43 | 0.73 | 0.73 | 0.73 | 0.73 |
| 242.39 | 252.43 | -1.00 | 0.73 | 0.73 | 0.73 | 0.73 |
| 0.73 | 0.73 | 0.73 | -1.00 | 82.01 | 72.05 | 49.08 |
| 0.73 | 0.73 | 0.73 | 82.01 | -1.00 | 58.88 | 51.89 |
| 0.73 | 0.73 | 0.73 | 72.05 | 58.88 | -1.00 | 57.30 |
| 0.73 | 0.73 | 0.73 | 49.08 | 51.89 | 57.30 | -1.00 |
All data plotted is for process 0:


For Experiment 4, we also collected the raw sensor output for all processes, which records each iteration's duration and the timestamp at which the measurement was made. That information is plotted here:


The following plot shows the ratio of the measured to the predicted values for 3 different "metrics of performance" which might be used as the basis for contract validation.

This material is based upon work supported by the National Science Foundation under Grant No. 9975020.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Last modified: Tuesday, November 20, 2001 01:08 PM