
PETSc mpich and lam "benchmarks"



Greetings,

I'm running PETSc examples with mpich and lam to test relative performance,
using the really cool (IMHO :-) /etc/alternatives system to select between the
PETSc/MPI implementations.  (To build the lam version, do "debian/rules
PETSC_MPI=lam binary".)  This is on a set of four 600 MHz LX164 alphas running
woody (testing) on 2.2.18pre21, no big iron but just enough to see some
scaling behavior.

My favorite benchmark is snes/examples/tutorials/ex9 which is 2-D
cavity-driven flow with heat transfer (and optional buoyancy).  I run it with
options -nox -mx 100 -my 100 -snes_monitor, IOW no X display of final results,
100x100 grid, give residuals at each Newton iteration.  For lam, I used
PETSc's mpirun.lam wrapper which calls:

     /usr/bin/mpirun_lam -w -c $np -s n0 $progname -- $options
     /usr/bin/lamclean
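
Concretely, each timing boiled down to something like this (ex9 built in the
tutorials directory, np=4 shown):

     # mpich flavour
     time mpirun -np 4 ./ex9 -nox -mx 100 -my 100 -snes_monitor

     # lam flavour, i.e. what mpirun.lam expands to for np=4
     time /usr/bin/mpirun_lam -w -c 4 -s n0 ./ex9 -- \
          -nox -mx 100 -my 100 -snes_monitor
     /usr/bin/lamclean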

The results (smallest wall clock time of three tries, seconds):

                              # Procs    mpich    lam
                              "0"        88.26    88.53
                              1          89.26    90.31
                              2          53.89    56.15
                              4          22.65    22.71

("0" means it was run without mpirun.)  So for all practical purposes, mpich
and lam are indistinguishable on this small cluster of alphas.

Eray, your qualitative tests show lam beating mpich by quite a bit; do you
have numbers to support this?  You mentioned you modified PETSc to work with
lam; how different are your mods from the current "PETSC_MPI=lam"
implementation in 2.0.29-3?  If you did something better, I'd love to put it
in the package.

I find the 2-processor results a bit odd.  They're considerably more than half
the 1-processor time (a speedup of only about 1.66 for mpich, 89.26/53.89),
yet whatever overhead is involved in parallel setup seems to vanish when 4
processors are run, where the speedup is nearly ideal (89.26/22.65 = 3.94).
So more processors scale almost linearly, while fewer are less efficient?

One interesting result is that for this application, the matrix-free
Newton-Krylov solution method (runtime option -snes_mf) results in much longer
times.  For grids larger than about 25x25, it doesn't converge at all, and for
25x25 on one processor, it takes nearly twice as long as the standard
"coloring" automatic Jacobian calculation method (1.45 secs vs. 0.78).
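
For the curious, the comparison was just the same run with and without that
one option, on a single processor (so no mpirun involved); the timings are
the 0.78 and 1.45 seconds quoted above:

     # coloring Jacobian (the standard method mentioned above)
     time ./ex9 -nox -mx 25 -my 25 -snes_monitor

     # matrix-free Newton-Krylov; stops converging much beyond 25x25
     time ./ex9 -nox -mx 25 -my 25 -snes_monitor -snes_mf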

Coming soon: Debian contrib packages for the Compaq alpha compilers, modeled
after Joey Hess' i386 realplayer RPM installer package.  Then you can
"debian/rules PETSC_ARCH=linux_alpha_dec binary" to build PETSc with ccc/fort,
which should accelerate these calculations quite a bit.

Zeen,

-Adam P.

GPG fingerprint: D54D 1AEE B11C CE9B A02B  C5DD 526F 01E8 564E E4B6

             Welcome to the best software in the world today cafe!


