Details
-
Bug
-
Resolution: Won't Fix
-
Minor
-
None
-
Lustre 2.10.6, Lustre 2.12.1
-
None
-
Client:
hostname: ibmpower9
NID: 192.168.177.202@o2ib177
kernel: 4.14.0-115.2.2.el7a.ppc64le
Linux: Red Hat Enterprise Linux Server release 7.5 (Maipo)
Architecture: ppc64le
Byte Order: Little Endian
Model: 2.1 (pvr 004e 1201)
Model name: POWER9, altivec supported
Lustre (custom rebuild from git distro on this host):
kmod-lustre-client-2.12.1-1.el7.ppc64le
lustre-client-2.12.1-1.el7.ppc64le
Router:
hostname: newtevnfs
NIDs: 192.168.176.28@o2ib
192.168.177.28@o2ib177
kernel: 2.6.32-696.1.1.el6.x86_64
Linux: Scientific Linux Fermi release 6.10 (Ramsey)
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Model name: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
Lustre version (custom rebuild from source rpm on this host):
lustre-client-2.10.6-1.el6.x86_64
kmod-lustre-client-2.10.6-1.el6.x86_64
Server(s): tevlfsa (MDS), tevlfs1-6 (OSS)
tevlfsa 192.168.176.140@o2ib
tevlfs1 192.168.176.141@o2ib
...
tevlfs5 192.168.176.145@o2ib
tevlfs6 192.168.176.146@o2ib
kernel: 3.10.0-862.6.3.el7.x86_64
Linux: Scientific Linux release 7.4 (Nitrogen)
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Model name: Intel(R) Xeon(R) CPU E5420 @ 2.50GHz
Lustre version (custom rebuild from source rpm on this host tevlfs6):
lustre-2.10.6-1.el7.x86_64
zfs-0.7.9-1.el7.x86_64
There are 6 OSTs, one OST per OSS.
-
3
Description
I'm trying to install and configure a Lustre client on an IBM POWER9 (ppc64le) machine to mount an existing Lustre file system through a router. The POWER9 host is similar to an ORNL Summit worker node, and we are going to use it for debugging software before running it on leadership facilities, so this case may be of interest to others.
At first there were issues connecting to LNet (lctl ping); these were resolved by explicitly setting options in /etc/modprobe.d/ko2iblnd.conf as per LU-3322:
map_on_demand=16 - on ibmpower9
map_on_demand=256 - on all x86_64 servers and the router.
Lustre was restarted and modules were reloaded after these changes.
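For illustration, the per-architecture values above can be turned into the corresponding modprobe options line with a small shell sketch. The function name map_on_demand_for_arch is hypothetical, and the sketch assumes the file carries only this one option; any other settings that ko2iblnd.conf held on these hosts are not recorded in this report.

```shell
#!/bin/sh
# Sketch only: map an architecture name to the map_on_demand value
# reported above (16 on the ppc64le client, 256 on x86_64 nodes).
# map_on_demand_for_arch is a hypothetical helper name.
map_on_demand_for_arch() {
    case "$1" in
        ppc64le) echo 16 ;;
        x86_64)  echo 256 ;;
        *)       return 1 ;;   # no value given in this report
    esac
}

arch=$(uname -m)
if value=$(map_on_demand_for_arch "$arch"); then
    # Print the line in modprobe.d "options <module> <param>=<value>" form.
    echo "options ko2iblnd map_on_demand=$value"
else
    echo "no map_on_demand value recorded in this report for $arch" >&2
fi
```

The printed line would go into /etc/modprobe.d/ko2iblnd.conf, followed by a Lustre restart and module reload as described above.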
Now I can mount Lustre on the POWER9 client and read a SINGLE file with dd. This works for files located on any of the six OSTs, but only when doing one read transfer at a time. I did not try writes.
When I start two transfers in parallel, or start one transfer and then start another 10 seconds later, I get an LNET error when the transfer of the second file starts. I can kill -9 the dd processes, but not always all of them; sometimes one of the processes cannot be killed with signal 9. Even after all I/O processes ("dd") are killed on the client, the router and servers continue to report errors in their logs, and I still observe I/O on both OSTs where the files reside. I cannot unmount Lustre on the POWER9 client or remove the modules.
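The failing pattern described above can be sketched as a small shell function. The function name parallel_read, the bs=1M block size, and the example paths are assumptions; the original dd command lines are not given in this report.

```shell
#!/bin/sh
# Sketch only: start two dd reads, the second one DELAY seconds after
# the first, then wait for both and report their exit codes.
# parallel_read is a hypothetical helper name.
parallel_read() {
    file1=$1
    file2=$2
    delay=${3:-10}          # default: 10 s between the two starts

    dd if="$file1" of=/dev/null bs=1M 2>/dev/null &
    pid1=$!
    sleep "$delay"
    dd if="$file2" of=/dev/null bs=1M 2>/dev/null &
    pid2=$!

    wait "$pid1"; rc1=$?
    wait "$pid2"; rc2=$?
    echo "dd1=$rc1 dd2=$rc2"
    [ "$rc1" -eq 0 ] && [ "$rc2" -eq 0 ]
}

# Example call (hypothetical paths on the Lustre mount):
#   parallel_read /mnt/lustre/file_on_ost0 /mnt/lustre/file_on_ost1 10
# A delay of 0 starts both reads at essentially the same time.
```

On an affected client the wait calls may never return, since the report notes that a dd process can become unkillable.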
"lctl net unconfigure" reports "LNET busy". I have to reboot the POWER9 client. Only after the client reboot do the errors stop being reported on the servers and router.
Attachments
Issue Links
- is related to
-
LU-5718 RDMA too fragmented with router
- Resolved
-
LU-3322 ko2iblnd support for different map_on_demand and peer_credits between systems
- Resolved
-
LU-7650 ko2iblnd map_on_demand can't negotitate when page sizes are different between nodes.
- Resolved
-
LU-6387 Add Power8 support to Lustre
- Resolved
-
LU-10300 Can the Lustre 2.10.x clients support 64K kernel page?
- Resolved