
MPI CPU load distribution not happening

13 years 7 months ago #5883 by JMB
Hello,

I have set up a 2-PC cluster (ubuntu34 & ubuntu35) using a standard CAELinux 2011 install with the mpich2 package added (I believe it is properly configured, since the 'Pi' test program [cpi.c] works fine on this 2-PC cluster). I am using mumps01a.* as a test case. It seems that with mpi_nbcpu=2 (ncpus=1 & mpi_nbnoeud=1) I always see two cores being used on ubuntu34, regardless of whether I submit the job via ASTK from ubuntu34 or ubuntu35. I do not see the job being run on any of the cores of ubuntu35! Can anybody clue me in as to why? Thanks.

The hostfile I am using for the jobs is:
[code:1]ubuntu34:4
ubuntu35:4[/code:1]
Reversing the order in the $HOME/mpi_hostfile makes no difference. I have tried various things that have not solved this problem (see the quick check sketched after this list), such as:
- Increasing mpi_nbcpu beyond two: the job does not run at all.
- Using ubuntu34:1 and ubuntu34:1 in the mpi_hostfile does not help.
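
For what it's worth, a quick way to see where the MPI ranks actually land, independent of Code-Aster, is to launch a trivial command through the same hostfile. The exact flag depends on the MPI flavour; the line below assumes MPICH2's Hydra mpiexec (OpenMPI would use mpirun --hostfile instead, and the older MPD launcher differs again):
[code:1]# hedged sketch: each rank should print the name of the host it runs on
mpiexec -f $HOME/mpi_hostfile -n 2 hostname[/code:1]
If both lines print ubuntu34, the ranks are not being spread across the hosts at the MPI level at all.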

I am baffled...

Regards, JMB

PS: This is a re-post of www.code-aster.org/forum2/viewtopic.php?id=16032, with the hope that somebody can help out, test it, or suggest solution(s).

Post edited by: JMB, at: 2011/11/09 16:55
13 years 7 months ago #5888 by Joël Cugnoni
Replied by Joël Cugnoni on topic Re:MPI CPU load distribution not happening
Hi JMB,

For the sake of completeness, I post my answer here as well.

By the way, I am currently working on a set of scripts/GUI to automate cluster deployment either locally or on Amazon EC2...

I will inform you when it works.

Answer:

To use MPI in CAELinux 2011, you don't need (and should not install) MPICH2; Code-Aster 11.0 is already compiled against the OpenMPI libraries (and having several MPI libraries installed on the same system may create configuration problems).
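
As a quick sanity check (optional, not part of the recipe below), you can confirm which MPI implementation the shell picks up:

which mpirun
mpirun --version

With OpenMPI, the version line typically reports "mpirun (Open MPI) x.y.z"; anything else suggests another MPI stack is shadowing it on the PATH.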

Personally, this is the way I proceed, starting from 2 PCs with a fresh install of CAELinux 2011 (even when using LiveDVD/LiveUSB mode).
So here is a small "How To" for you and others:

1) Set up the network for interconnection: I use Network Manager to set static IP addresses.
Set the hostnames:

on machine 1: sudo hostname caepc1
on machine 2: sudo hostname caepc2
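
Note that "sudo hostname" only changes the name for the running session. As an extra step not in the original recipe, on Ubuntu-based systems the name can also be written to /etc/hostname so it survives a reboot:

echo caepc1 | sudo tee /etc/hostname    (on machine 1; use caepc2 on machine 2)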


2) Edit /etc/hosts on both machines to define the host/IP relationships:

sudo nano /etc/hosts

Add lines like these after the 127.0.1.1 xxxx line:

192.168.0.1 caepc1
192.168.0.2 caepc2

3) Edit your host configuration directly in /opt/aster110/etc/codeaster/aster-mpihosts,

for example (using OpenMPI hostfile syntax):

caepc1 slots=1
caepc2 slots=1

4) Optional: if you have more than 8 GB RAM per node or more than 16 cores in the cluster, also edit /opt/aster110/etc/codeaster/asrun to tune "interactif_memmax" (the max memory per node) and "interactif_mpi_nbpmax" (the number of cores in the cluster).
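
As an illustration only (the values below are hypothetical, and the units should be checked against the comments in the asrun file itself; the "key : value" format follows the shared_tmp line shown in step 7), the entries might look like:

interactif_memmax : 16384
interactif_mpi_nbpmax : 8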

(Optional) passwords: if using LiveDVD/LiveUSB mode, you need to set a password for the default user caelinux.
So on each node, run "passwd" in a terminal (the default password is empty) to set a new password.

5) SSH setup: you need passwordless SSH login between the two hosts.
On the first node, run:
scp /home/caelinux/.ssh/id* caepc2:/home/caelinux/.ssh/
scp /home/caelinux/.ssh/authorized* caepc2:/home/caelinux/.ssh/
ssh-keyscan caepc1 >> /home/caelinux/.ssh/known_hosts
ssh-keyscan caepc2 >> /home/caelinux/.ssh/known_hosts
scp /home/caelinux/.ssh/known_hosts caepc2:/home/caelinux/.ssh/
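
The scp lines above assume an SSH key pair already exists in /home/caelinux/.ssh on the first node (the CAELinux image may ship one). If not, one can be created first; this is an extra step not in the original recipe:

ssh-keygen -t rsa -N "" -f /home/caelinux/.ssh/id_rsa
cat /home/caelinux/.ssh/id_rsa.pub >> /home/caelinux/.ssh/authorized_keys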

6) Set up a shared temp directory with NFS.
On node 1:

sudo mkdir /srv/shared_tmp
sudo chmod a+rwx /srv/shared_tmp
sudo nano /etc/exports

then add the following line and save:

/srv/shared_tmp *(rw,async)

then

sudo exportfs -a

Now create the mount point and mount the shared folder; run this on all nodes:

sudo mkdir /mnt/shared_tmp
sudo chmod a+rwx /mnt/shared_tmp
sudo mount -t nfs -o rw,rsize=8192,wsize=8192 caepc1:/srv/shared_tmp /mnt/shared_tmp
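
Optionally (an extra not in the original steps), an equivalent entry can be added to /etc/fstab on each compute node so the mount survives a reboot:

caepc1:/srv/shared_tmp  /mnt/shared_tmp  nfs  rw,rsize=8192,wsize=8192  0  0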

7) Set up the Aster config to use this shared temp directory:
nano /opt/aster110/etc/codeaster/asrun

edit the line with "shared_tmp" as follows:

shared_tmp : /mnt/shared_tmp

then save

8) Open ASTK, go to the server settings and refresh; create your job and, in Options, select ncpus=1 (no OpenMP), mpi_nbcpu = total number of cores to use (mpi_nbnoeud * cores_per_host) and mpi_nbnoeud = number of compute nodes.
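
For the two-host example above (caepc1 and caepc2 with one slot each), that works out to:

ncpus = 1
mpi_nbcpu = 2    (2 nodes x 1 core per node)
mpi_nbnoeud = 2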

And finally it should run on several nodes!!

Actually, the hard point is that you NEED a shared tmp folder to run jobs on a cluster.
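
A simple informal check (not from the original recipe) that the tmp folder is really shared: create a file on one node and look for it on the other:

touch /mnt/shared_tmp/test_from_caepc1    (run on caepc1)
ls /mnt/shared_tmp/                       (run on caepc2; the file should appear)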

Joël Cugnoni - a.k.a admin
www.caelinux.com
13 years 7 months ago #5889 by JMB
Hello jcugnoni,

Thank you for the detailed reply! It was most useful and now it works as expected!!!

Previously I had ensured that steps 1-7 you posted were in place. My mistake was to install mpich2 and mpich2-doc, along with the attendant changes they forced upon /opt/aster110/etc/codeaster/asrun and so on. Once I removed the two packages and backtracked all the changes I had made to make mpich2 work, the parallelism of Code_Aster MPI worked "almost out of the box". I say almost because one does have to ensure that the prerequisite steps 1-7 are in place; then step 8 worked as you had stated.

Thanks for the wonderful job on CAELinux 2011 and the above "How To". I am VERY grateful for it...

Regards, JMB

PS: Duplicated here for completeness, too! I look forward to those scripts...
13 years 1 day ago #6356 by florante
Replied by florante on topic Re:MPI CPU load distribution not happening

Administrator wrote: [full "How To" answer quoted — see the reply above]


Hi Admin,

Can I run code_saturne on cluster as well?

I have several pc's here just running idle. I want to install caelinux on all of this machines as virtual machines. Mostly, I will be running code_saturne.

Thanks

Florante
13 years 1 day ago #6357 by Claus
Replied by Claus on topic Re:MPI CPU load distribution not happening
CS works really well with MPI on a single multi-core workstation, so I would imagine there would be little trouble in distributing it across a few remote nodes.

/C

Code_Aster release : STA11.4 on OpenSUSE 12.3 64 bits - EDF/Intel version
13 years 22 hours ago #6359 by florante
Replied by florante on topic Re:MPI CPU load distribution not happening

claws wrote: CS works really well with MPI on a single multi-core workstation, so I would imagine there would be little trouble in distributing it across a few remote nodes.

/C



Thanks for the insight, Claws.

I just noticed an option in the CS wizard, on the computer selection page under "Prepare batch calculation": you can use either a workstation or a cluster with a PBS queue system. Might it be that CS is designed for cluster computing as well?

I would like to try this option, since it took about 10 hours for one of my simulations to finish on a single Dell T7400 (quad-core Xeon with 4 GB of memory). I have another spare T7400 and a couple of T3400s. If I could utilize all of these machines, I am thinking I could cut the simulation time by more than 50%, and ideally get it under 3 hours.

Can anybody point me in the right direction?

thanks

Florante