OpenMPI Error

by _제이빈_ 2020. 6. 5.

1. System Environment

 

OS Version : Linux / CentOS 6.7

MPI : OpenMPI-3.1.6

 

2. Error

 

Running [$ mpirun -np 2 ./model.x] in a terminal printed the output below.

 

------------------------------------------------------------------------- 
Failed to create a completion queue (CQ): 

Hostname: head 
Requested CQE: 16384 
Error:    Cannot allocate memory 

Check the CQE attribute. 
-------------------------------------------------------------------------- 
-------------------------------------------------------------------------- 
Open MPI has detected that there are UD-capable Verbs devices on your 
system, but none of them were able to be setup properly.  This may 
indicate a problem on this system. 

You job will continue, but Open MPI will ignore the "ud" oob component 
in this run. 

Hostname: head 
-------------------------------------------------------------------------- 
-------------------------------------------------------------------------- 
No OpenFabrics connection schemes reported that they were able to be 
used on a specific port.  As such, the openib BTL (OpenFabrics 
support) will be disabled for this port. 

  Local host:           head 
  Local device:         mlx4_0 
  Local port:           1 
  CPCs attempted:       rdmacm, udcm 
-------------------------------------------------------------------------- 

 

After that, the model does still run...

 

 

3. Solution

 

I am not sure of the exact cause, but a search suggested that it is a max locked memory problem. The [ulimit -a] command prints the following output.

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 256637
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 4096
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 256637
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

As you can see, max locked memory is limited to 64 kbytes.
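
Since limits.conf distinguishes soft and hard limits, it can help to look at both values for the current shell rather than only the single number in [ulimit -a]. The following is a minimal sketch; the function name show_memlock is made up for illustration.

```shell
# Print the soft and hard locked-memory limits of the current shell.
# "show_memlock" is a hypothetical helper name, not part of Open MPI.
show_memlock() {
    echo "soft: $(ulimit -S -l)"
    echo "hard: $(ulimit -H -l)"
}

show_memlock
```

If the hard limit is already unlimited and only the soft limit is 64, the current shell can raise it itself with [ulimit -l unlimited]; if both are 64, a new login session is needed.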

 

The usual advice is to open the file with [sudo vi /etc/security/limits.conf] and add one of the following lines, but in my case all of them were already declared.

user1    hard   memlock           unlimited
user1    soft   memlock           unlimited
*        hard   memlock           unlimited
*        soft   memlock           unlimited
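
To confirm which memlock lines are actually active, the entries can be listed without opening an editor. This is a rough sketch: list_memlock_entries is a hypothetical helper, and it assumes the system also reads drop-in files from /etc/security/limits.d/, which is where pam_limits looks in addition to limits.conf.

```shell
# Print active (non-comment) memlock entries from the given limits files.
# Files that do not exist or cannot be read are skipped silently.
# "list_memlock_entries" is a hypothetical helper name.
list_memlock_entries() {
    for f in "$@"; do
        [ -r "$f" ] || continue
        grep -E '^[[:space:]]*[^#[:space:]].*[[:space:]]memlock[[:space:]]' "$f"
    done
    return 0
}

list_memlock_entries /etc/security/limits.conf /etc/security/limits.d/*.conf
```

A line that is commented out with # is not printed, so an empty result would mean that no memlock setting is in effect.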

Oddly enough, after I entered the root account and then returned to my own account, max locked memory had changed to unlimited.

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 256637
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 4096
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

I do not know why memlock is not lifted for my account, but until I find the exact cause I will use the following workaround:

"Enter the root account, then return to my own account."
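
The workaround can be written out roughly as below. This is only a sketch under assumptions: the post does not record the exact commands, so the round trip is assumed to be [su -] followed by [su - user1], with user1 being the placeholder account from limits.conf. One possible explanation is that limits.conf is applied when a new session is opened, so the fresh login shell picks up values that the original shell never received. The guard function require_unlimited_memlock is hypothetical.

```shell
# Assumed interactive steps (shown as comments because they prompt for a
# password and replace the current shell):
#
#   su -            # enter the root account
#   su - user1      # return to my account as a fresh login shell
#   ulimit -l       # expected to report "unlimited" now
#
# Hypothetical guard: succeed only when the given limit is "unlimited".
require_unlimited_memlock() {
    if [ "$1" = "unlimited" ]; then
        return 0
    fi
    echo "max locked memory is $1 kbytes; open a new login shell first" >&2
    return 1
}

# Usage (launches the job only if the check passes):
#   require_unlimited_memlock "$(ulimit -l)" && mpirun -np 2 ./model.x
```

With the guard in front of mpirun, a shell that still has the 64 kbytes limit stops with a short message instead of printing the long OpenFabrics warnings.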

 
