
Date: Fri, 2 Feb 2001 17:19:46 -0500 (EST)
From: Antoine Petitet <petitet@cs.utk.edu>
To: grads-demo@cs.rice.edu
cc: Antoine Petitet <petitet@cs.utk.edu>
Subject: ScaLAPACK demo clarifications
Folks,
< Ruth here deleted things related to how to run demo, select systems, etc >
As far as the performance modelling goes, I believe it is quite accurate, or so I convinced
myself when this part of the code was developed for the UIUC colleagues. I did a minimal
set of tests, and it is always possible that I screwed it up. However, since I saw the
runs I performed before the last meeting, there seems to be a slight problem that people
are chasing me with. First, it is likely that the flops / cycle specified in the filtering
file and used by the performance model is an overestimate of what someone can actually
measure. Second, unless experiments are conducted on a collection of "quiet"
machines, or say uniformly loaded for a given period of time, it is going to be difficult
to say much about the performance model and the measured performance; and that's because
the machine info such as latency/bandwidth/speed/availCPU is not refreshed during the run.
Finally, the ScaLAPACK library LU factorization routine does not use lookahead, and the
performance model currently used in the GrADS demo reflects this fact. Now, the
discrepancies currently seen between the measurements and the prediction are certainly a
problem that needs to be addressed. However, I am surely not convinced yet that these
problems are attributable to the model itself. I would recommend first checking the
parameters of the machine that are actually used by the model. There is certainly a lot to
say about the latency and bandwidth measurements obtained by NWS, and similarly using the
peak flops/cycle ratio of each architecture is also debatable even if the model already
scales this number by 0.75 to match the performance of ATLAS on each Pentium box. That
number is also scaled a second time by the availble_cpu retrieved during the NWS query.
Again, that is arguable, since the fact that, say, 60% of the cpu power is available does
not mean that the kernel will give it entirely to the (ScaLAPACK/GrADS) application.
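The two scalings described above can be written out as a small calculation. This is only a sketch, not the GrADS model code: the function name, the parameter names, the multiplication by clock rate to obtain flop/s, and the 500 MHz / 1 flop per cycle figures are assumptions made for illustration. Only the 0.75 factor, the scaling by available CPU, and the 60% example come from the message.

```python
def effective_flop_rate(clock_hz, peak_flops_per_cycle, avail_cpu,
                        kernel_efficiency=0.75):
    """Estimate the flop/s a model of this kind would assume for one machine.

    kernel_efficiency is the 0.75 factor mentioned in the message;
    avail_cpu is the fraction of CPU reported as available (0.0 to 1.0).
    Converting flops/cycle to flop/s via clock_hz is an assumption.
    """
    if not 0.0 <= avail_cpu <= 1.0:
        raise ValueError("avail_cpu must be a fraction between 0 and 1")
    scaled_flops_per_cycle = peak_flops_per_cycle * kernel_efficiency * avail_cpu
    return clock_hz * scaled_flops_per_cycle


# Hypothetical machine: 500 MHz, 1 flop/cycle peak, 60% of the CPU available.
rate = effective_flop_rate(500e6, 1.0, 0.60)
print(f"{rate / 1e6:.0f} Mflop/s assumed by the model")
```

With these made-up inputs the model would assume less than half of the nominal peak, so an error in either scaling factor shifts the prediction proportionally. That is why the machine parameters are worth checking before the model itself.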
As far as using HPL instead of ScaLAPACK because this software features lookahead
natively, I am not sure this is the project's purpose. The way I understand it, you
are all interested in looking at how libraries could be instrumented to fit the Grid
requirements. Moreover, lookahead is a very nice idea, but difficult (in a portable
manner) to express using say the MPI interface; and that is probably one of the reasons
it was included in the ScaLAPACK routine initially. Overlapping computation and
communication within a node is/has been popular but truly difficult to make happen and
visualize in practice, but I would certainly be happy to be convinced otherwise. Besides
the Intel Paragon and maybe Myrinet, I don't seem to recall any machine able to do this at
the user level.
Cheers,
Antoine
***********************************************************************
Antoine Petitet Innovative Computing Lab Computer Science Dept
University of Tennessee 1122 Volunteer Blvd Knoxville TN, 37996-3450
Phone: 865-974-8298 petitet@cs.utk.edu <http://www.cs.utk.edu/~petitet>
Date: Sat, 3 Feb 2001 02:43:10 -0500 (EST)
From: Antoine Petitet <petitet@cs.utk.edu>
To: grads-demo@cs.rice.edu
Subject: Re: ScaLAPACK demo clarifications
Folks,
There was a slight typo in the message I sent earlier, so let's correct it now: The ScaLAPACK LU factorization never featured lookahead. We use a split-ring row broadcast by default, but the update phase of the next process column is not split into 2 phases. In other words, the panel broadcast in the process rows is pipelined, but there is no lookahead.
For the record and the interested reader, the oldest paper that describes such a look-ahead or send-ahead technique is, I believe: Communication Complexity of the Gaussian Elimination Algorithm on Multiprocessors, Y. Saad, Linear Algebra and Its Applications, Vol. 77, pp. 315-340, 1986. Do not hesitate to correct me if I am wrong. I will certainly be interested to know if an older paper exists. The point is that the absence of lookahead in this ScaLAPACK routine is partially explained by the difficulty of expressing it in a portable and efficient way using standard communication libraries. The efficiency of lookahead also depends on the connectivity of your interconnect. All in all, from the point of view of a designer of highly portable software libraries, there are good arguments to nuke lookahead from your "to have" list.
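The difference between a plain panel broadcast and lookahead can be sketched as an ordering of steps. This is a hypothetical illustration and neither ScaLAPACK nor HPL code: `lu_schedule` and its step labels are invented here, and the lookahead variant assumes a depth of one panel. The sketch only lists the order in which a blocked right-looking LU would factor panels, broadcast them, and update the trailing columns.

```python
def lu_schedule(num_panels, lookahead=False):
    """List the order of panel operations in a blocked right-looking LU.

    Illustrative only: "update k->a..b" means panel k is applied to
    panels a through b of the trailing matrix.
    """
    last = num_panels - 1
    steps = []

    def update(src, first, end):
        steps.append(f"update {src}->{first}..{end}")

    if not lookahead:
        # Whole trailing update finishes before the next panel is factored.
        for k in range(num_panels):
            steps.append(f"factor {k}")
            if k < last:
                steps.append(f"bcast {k}")
                update(k, k + 1, last)
        return steps

    # Depth-one lookahead: the update is split into 2 phases so the next
    # panel can be factored and sent while the rest is still being updated.
    if num_panels > 0:
        steps.append("factor 0")
    if num_panels > 1:
        steps.append("bcast 0")
    for k in range(last):
        update(k, k + 1, k + 1)          # phase 1: next panel only
        steps.append(f"factor {k + 1}")
        if k + 2 <= last:
            steps.append(f"bcast {k + 1}")  # would overlap with phase 2
            update(k, k + 2, last)       # phase 2: remaining panels
    return steps


for step in lu_schedule(3, lookahead=True):
    print(step)
```

In the lookahead ordering the broadcast of panel k+1 is issued before the rest of the update with panel k. Whether it actually overlaps with that computation depends on the communication library and the interconnect, which is the portability concern raised above.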
Hope this helps,
Antoine
Department of Computer Science
University of Illinois at Urbana-Champaign
Last modified: Tuesday, February 06, 2001 01:12 PM