MPI CPU load distribution not happening
13 years 7 months ago #5883
by JMB
MPI CPU load distribution not happening was created by JMB
Hello,
I have set up a 2-PC cluster (ubuntu34 & ubuntu35) using a standard CAELinux 2011 install with the mpich2 package added (properly configured, I believe, since the 'Pi' test program [cpi.c] works fine on this 2-PC cluster). I am using mumps01a.* as a test case. It seems that with mpi_nbcpu=2 (ncpus=1 & mpi_nbnoeud=1), I always see two cores being used on ubuntu34, regardless of whether I submit the job via ASTK on ubuntu34 or ubuntu35. I do not see the job run on any of the cores of ubuntu35! Can anybody clue me in as to why? Thanks.
The hostfile I am using for the jobs is:
[code:1]ubuntu34:4
ubuntu35:4[/code:1]
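A quick way to see where MPI actually places processes (assuming MPICH2's Hydra mpiexec is on the PATH) is to launch plain hostname through the same hostfile:
[code:1]mpiexec -f ~/mpi_hostfile -n 4 hostname[/code:1]
Each rank prints the name of the host it runs on, so the output shows immediately whether ubuntu35 receives any processes.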
Reversing the order in the $HOME/mpi_hostfile makes no difference. I have tried various things that have not solved this problem, such as:
- Increasing mpi_nbcpu beyond two, which does not run.
- Using ubuntu34:1 and ubuntu35:1 in the mpi_hostfile, which does not help.
I am baffled...
Regards, JMB
PS: This is a re-post of www.code-aster.org/forum2/viewtopic.php?id=16032, with the hope that somebody can help out, test it, or suggest solution(s).
Post edited by: JMB, at: 2011/11/09 16:55
13 years 7 months ago #5888
by Joël Cugnoni
Joël Cugnoni - a.k.a admin
www.caelinux.com
Replied by Joël Cugnoni on topic Re:MPI CPU load distribution not happening
Hi JMB,
For the sake of completeness, I post my answer here as well.
By the way, I am currently working on a set of scripts/GUI to automate cluster deployment, either locally or on Amazon EC2...
I will inform you when it works.
Answer:
To use MPI in CAELinux 2011, you don't need (and should not install) MPICH2; Code-Aster 11.0 is already compiled against the OpenMPI libraries (and having several MPI libraries installed on the system may create configuration problems).
Personally, this is the way I proceed, starting from 2 PCs with a fresh install of CAELinux 2011 (even when using LiveDVD/LiveUSB mode).
So here is a small "How To" for you and others:
1) Set up the network interconnection: I use Network Manager to set up static IP addresses.
Set the hostnames:
on machine 1: sudo hostname caepc1
on machine 2: sudo hostname caepc2
2) Edit /etc/hosts on both machines to define the host/IP mappings:
sudo nano /etc/hosts
and add lines like these after the 127.0.1.1 xxxx entry:
192.168.0.1 caepc1
192.168.0.2 caepc2
3) Edit your configuration settings directly in /opt/aster110/etc/codeaster/aster-mpihosts,
for example (using OpenMPI syntax):
caepc1 slots=1
caepc2 slots=1
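If, say, you want Aster to use all four cores of two quad-core hosts, the same file would read:
[code:1]caepc1 slots=4
caepc2 slots=4[/code:1]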
4) Optional: if you have more than 8 GB RAM per node or more than 16 cores in the cluster, also edit /opt/aster110/etc/codeaster/asrun to tune "interactif_memmax" (max memory per node) and "interactif_mpi_nbpmax" (number of cores in the cluster).
(Optional) passwords: if using LiveDVD/LiveUSB mode, you need to set a password for the default user caelinux.
So on each node, run "passwd" in a terminal (the default password is empty) to set a new password.
5) SSH setup: you need passwordless SSH login between the two hosts.
On the first node, run:
scp /home/caelinux/.ssh/id* caepc2:/home/caelinux/.ssh/
scp /home/caelinux/.ssh/authorized* caepc2:/home/caelinux/.ssh/
ssh-keyscan caepc1 >> /home/caelinux/.ssh/known_hosts
ssh-keyscan caepc2 >> /home/caelinux/.ssh/known_hosts
scp /home/caelinux/.ssh/known_hosts caepc2:/home/caelinux/.ssh/
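Note: this assumes a key pair already exists on the first node; if not, generate one first with an empty passphrase and authorize it locally (standard OpenSSH commands, paths assuming the default caelinux user):
[code:1]ssh-keygen -t rsa -N "" -f /home/caelinux/.ssh/id_rsa
cat /home/caelinux/.ssh/id_rsa.pub >> /home/caelinux/.ssh/authorized_keys[/code:1]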
6) Set up a shared temp directory with NFS.
On node 1:
sudo mkdir /srv/shared_tmp
sudo chmod a+rwx /srv/shared_tmp
sudo nano /etc/exports
then add the following line and save:
/srv/shared_tmp *(rw,async)
then
sudo exportfs -a
Now create the mount point and mount the shared folder; run this on all nodes:
sudo mkdir /mnt/shared_tmp
sudo chmod a+rwx /mnt/shared_tmp
sudo mount -t nfs -o rw,rsize=8192,wsize=8192 caepc1:/srv/shared_tmp /mnt/shared_tmp
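If the nodes are installed on disk (not running in live mode) and you want the mount to survive reboots, an equivalent /etc/fstab entry on each compute node would look like this (standard NFS fstab syntax, same options as the mount command above):
[code:1]caepc1:/srv/shared_tmp  /mnt/shared_tmp  nfs  rw,rsize=8192,wsize=8192  0  0[/code:1]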
7) Set up the Aster config to use this shared temp directory:
nano /opt/aster110/etc/codeaster/asrun
and edit the line with "shared_tmp" as follows:
shared_tmp : /mnt/shared_tmp
then save.
8) Open ASTK, go to the server configuration and refresh; create your job, and in Options set ncpus=1 (no OpenMP), mpi_nbcpu = total number of cores to use (mpi_nbnoeud × cores per host), and mpi_nbnoeud = number of compute nodes.
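For example, with two quad-core hosts declared with slots=4 each (as in the example under step 3), that would be ncpus=1, mpi_nbcpu=8 (2 nodes × 4 cores per host) and mpi_nbnoeud=2.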
And finally it should run on several nodes!!
Actually, the hard point is that you NEED a shared tmp folder to run jobs on a cluster.
13 years 7 months ago #5889
by JMB
Replied by JMB on topic Re:MPI CPU load distribution not happening
Hello jcugnoni,
Thank you for the detailed reply! It was most useful and now it works as expected!!!
Previously I had ensured that steps 1 ~ 7 of your post were in place. My mistake was to install mpich2 and mpich2-doc, with the attendant changes they forced upon /opt/aster110/etc/codeaster/asrun and so on. Once I removed the two packages and backtracked all the changes I had made to get mpich2 working, the MPI parallelism of Code-Aster works "almost out of the box". I say almost because one does have to ensure that the prerequisite steps 1 ~ 7 are in place. Then step 8 worked as you had stated.
Thanks for the wonderful job on CAELinux 2011 and the above "How To". I am VERY grateful for it...
Regards, JMB
PS: Duplicated here for completeness, too! I look forward to those scripts...
13 years 1 day ago #6356
by florante
Replied by florante on topic Re:MPI CPU load distribution not happening
Hi Admin,
Can I run code_saturne on a cluster as well?
I have several PCs here just sitting idle. I want to install CAELinux on all of these machines as virtual machines. Mostly, I will be running code_saturne.
Thanks
Florante
13 years 1 day ago #6357
by Claus
Code_Aster release : STA11.4 on OpenSUSE 12.3 64 bits - EDF/Intel version
Replied by Claus on topic Re:MPI CPU load distribution not happening
CS works really well with MPI on a single multi-core workstation, so I would imagine there would be little trouble in distributing it across a few remote nodes.
/C
13 years 22 hours ago #6359
by florante
Replied by florante on topic Re:MPI CPU load distribution not happening
Claus wrote: CS works really well with MPI on a single multi-core workstation, so I would imagine there would be little trouble in distributing it across a few remote nodes.
/C
Thanks for the insight, Claus.
I just noticed an option in the CS wizard, in the computer selection under "Prepare batch calculation": you can use either a workstation or a cluster with a PBS queue system. Could it be that CS is designed for cluster computing as well?
I would like to try this option, since it took about 10 hours for one of my simulations to finish on a single Dell T7400 (quad-core Xeon with 4 GB memory). I have another spare T7400 and a couple of T3400s. If I could utilize all of these machines, I am thinking I could cut the simulation time by more than 50%, ideally to under 3 hours.
Can anybody point me in the right direction?
Thanks
Florante