GrADS:   Mail posted to grads-demo related to contract predictions, etc.

 

 

Date: Fri, 2 Feb 2001 17:19:46 -0500 (EST)
From: Antoine Petitet <petitet@cs.utk.edu>
To: grads-demo@cs.rice.edu
cc: Antoine Petitet <petitet@cs.utk.edu>
Subject: ScaLAPACK demo clarifications

Folks,


< Ruth here deleted things related to how to run demo, select systems, etc >


As far as the performance modelling goes, I believe it is quite accurate, or so I convinced myself when this part of the code was developed for the UIUC colleagues. I did a minimal set of tests, and it is always possible that I screwed it up. However, since I saw the runs I performed before the last meeting, there seems to be a slight problem that people are chasing me about. First, it is likely that the flops/cycle specified in the filtering file and used by the performance model is an overestimate of what someone can actually measure. Second, unless experiments are conducted on a collection of "quiet" machines, or say uniformly loaded for a given period of time, it is going to be difficult to say much about how the performance model compares with the measured performance; that is because the machine info such as latency/bandwidth/speed/availCPU is not refreshed during the run.


Finally, the ScaLAPACK library LU factorization routine does not use lookahead, and the performance model currently used in the GrADS demo reflects this fact. Now, the discrepancies currently seen between the measurements and the predictions are certainly a problem that needs to be addressed. However, I am not yet convinced that these problems are attributable to the model itself. I would recommend first checking the machine parameters that are actually used by the model. There is certainly a lot to say about the latency and bandwidth measurements obtained by NWS, and similarly using the peak flops/cycle ratio of each architecture is also debatable, even if the model already scales this number by 0.75 to match the performance of ATLAS on each Pentium box. That number is also scaled a second time by the availble_cpu retrieved during the NWS query. Again, that is arguable, since the fact that, say, 60% of the CPU power is available does not mean that the kernel will give it entirely to the (ScaLAPACK/GrADS) application.
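
The double scaling described above can be sketched as follows. This is only an illustration, not the code used in the GrADS demo: the function names, the clock_hz parameter and the example numbers are assumptions. Only the 0.75 factor and the scaling by the available CPU fraction come from the message, and the 2n^3/3 term is the standard leading-order flop count for LU, with communication cost ignored.

```python
def effective_flop_rate(clock_hz, peak_flops_per_cycle, available_cpu,
                        kernel_efficiency=0.75):
    """Estimate the flop rate a node can sustain (hypothetical helper).

    peak_flops_per_cycle : architectural peak, as listed in a machine file
    kernel_efficiency    : first scaling, 0.75 per the message above
    available_cpu        : second scaling, fraction in [0, 1] from a load query
    """
    if not 0.0 <= available_cpu <= 1.0:
        raise ValueError("available_cpu must be a fraction between 0 and 1")
    return clock_hz * peak_flops_per_cycle * kernel_efficiency * available_cpu


def lu_compute_seconds(n, flop_rate):
    """Compute-only time for an n x n LU factorization (2n^3/3 flops)."""
    return (2.0 * n ** 3 / 3.0) / flop_rate


if __name__ == "__main__":
    # Assumed example: 500 MHz node, 1 flop/cycle peak, 60% of the CPU available.
    rate = effective_flop_rate(500e6, 1.0, 0.60)
    print(f"effective rate: {rate / 1e6:.1f} Mflop/s")
    print(f"LU compute time for n=3000: {lu_compute_seconds(3000, rate):.1f} s")
```

With these assumed inputs the estimate is about 225 Mflop/s and roughly 80 seconds of pure computation for n = 3000. Both scaling factors enter multiplicatively, so an optimistic peak or an optimistic available-CPU figure skews the prediction proportionally.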


As far as using HPL instead of ScaLAPACK because this software features lookahead natively, I am not sure this is the project's purpose. The way I understand it, you are all interested in looking at how libraries could be instrumented to fit the Grid requirements. Moreover, lookahead is a very nice idea, but difficult to express in a portable manner using, say, the MPI interface; and that is probably one of the reasons it was included in the ScaLAPACK routine initially. Overlapping computation and communication within a node is/has been popular but truly difficult to make happen and visualize in practice, but I would certainly be happy to be convinced otherwise. Besides the Intel Paragon and maybe Myrinet, I don't seem to recall any machine able to do this at the user level.


Cheers,
Antoine
***********************************************************************
Antoine Petitet Innovative Computing Lab Computer Science Dept
University of Tennessee 1122 Volunteer Blvd Knoxville TN, 37996-3450
Phone: 865-974-8298 petitet@cs.utk.edu <http://www.cs.utk.edu/~petitet>

 


 

Date: Sat, 3 Feb 2001 02:43:10 -0500 (EST)
From: Antoine Petitet <petitet@cs.utk.edu>
To: grads-demo@cs.rice.edu
Subject: Re: ScaLAPACK demo clarifications

Folks,

There was a slight typo in the message I sent earlier, so let's correct it now: The ScaLAPACK LU factorization never featured lookahead. We use a split-ring row broadcast by default, but the update phase of the next process column is not split into two phases. In other words, the panel broadcast in the process rows is pipelined, but there is no lookahead.
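
To make the distinction concrete, here is a toy sketch of the order of operations in a blocked right-looking LU, with and without a depth-one lookahead. It is neither ScaLAPACK nor HPL code; the function name and the step labels are made up for illustration, and it only lists the steps in issue order without modelling any timing or actual communication.

```python
def lu_schedule(num_panels, lookahead=False):
    """List the steps of a blocked right-looking LU in issue order.

    ("factor", k)    : factor column panel k
    ("broadcast", k) : start sending panel k along the process rows
    ("update", k, j) : update column panel j with panel k
    """
    steps = []
    if num_panels <= 0:
        return steps
    steps += [("factor", 0), ("broadcast", 0)]
    for k in range(num_panels - 1):
        nxt = k + 1
        if lookahead:
            # Split the update in two phases: update the next panel first,
            # factor and send it right away, then finish the other columns.
            steps.append(("update", k, nxt))
            steps += [("factor", nxt), ("broadcast", nxt)]
            for j in range(nxt + 1, num_panels):
                steps.append(("update", k, j))
        else:
            # No lookahead: finish the whole trailing update before the
            # next panel is factored and broadcast.
            for j in range(nxt, num_panels):
                steps.append(("update", k, j))
            steps += [("factor", nxt), ("broadcast", nxt)]
    return steps


if __name__ == "__main__":
    for flag in (False, True):
        print("lookahead" if flag else "no lookahead")
        for step in lu_schedule(3, lookahead=flag):
            print("   ", step)
```

In the lookahead ordering, the broadcast of panel k+1 is issued before the remaining updates with panel k, which is what would give the communication a chance to overlap with computation. Whether that overlap actually happens depends on the communication library and the interconnect.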

For the record and the interested reader, the oldest paper that describes such a look-ahead or send-ahead technique is, I believe: Communication Complexity of the Gaussian Elimination Algorithm on Multiprocessors, Y. Saad, Linear Algebra and Its Applications, Vol. 77, pp. 315-340, 1986. Do not hesitate to correct me if I am wrong. I will certainly be interested to know if an older paper exists. The point is that the absence of lookahead in this ScaLAPACK routine is partially explained by the difficulty of expressing it in a portable and efficient way using standard communication libraries. The efficiency of lookahead also depends on the connectivity of your interconnect. All in all, and from the point of view of a highly portable software library designer, there are good arguments to nuke lookahead from your "to have" list.

 Hope this helps,

Antoine



Department of Computer Science
University of Illinois at Urbana-Champaign

Last modified: Tuesday, February 06, 2001 01:12 PM