Details
Type: Bug
Resolution: Duplicate
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.4.0
Labels: None
Environment:
Lustre servers:
CentOS 6.2 with Linux version 2.6.32-220.4.2.el6_lustre.x86_64 (jenkins@client 31.lab.whamcloud.com) (gcc version 4.4.6 20110731 (Red Hat 4.4.6-3) (GCC) ) #1 SMP Wed Mar 14 13:03:47 PDT 2012
Build Version: 2.2.0-RC2--PRISTINE-2.6.32-220.4.2.el6_lustre.x86_64
HW: 2xIntel(R) Xeon(R) CPU E5645 @ 2.40GHz, RAM 48GB
Lustre clients:
CentOS 6.4 with Linux version 2.6.32-358.6.2.el6.x86_64 (mockbuild@c6b8.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Thu May 16 20:59:36 UTC 2013
Build Version: 2.4.0-RC2--CHANGED-2.6.32-358.6.2.el6.x86_64
HW: Login nodes are of type 12-core AMD Opteron 6174
Compute nodes are a mix of:
12-core AMD Opteron 6174
quad-core AMD Opteron 8380
quad-core AMD Opteron 8384
8-core Intel Xeon E7-8837
Severity: 3
Rank (Obsolete): 10466
Description
Dear support,
Our customer is experiencing an eviction problem on the login/client nodes of their Lustre cluster.
The evictions seem too frequent, and since there appears to be no way to recover other than rebooting the node, this heavily interrupts the users' work.
The cause of the evictions seems to be that the login nodes may be temporarily stalled by various applications for fractions of a second, and consequently the OSS does not see the client for minutes (according to the logs).
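To illustrate where we are looking, the eviction messages and timeout settings can be inspected roughly as follows (standard syslog path and stock lctl parameters assumed; this is not output from the attached logs):

    # On an OSS: when and why clients were evicted
    grep -i "evicting" /var/log/messages
    # Request timeout tuning currently in effect
    lctl get_param timeout at_min at_max
    # On a client: the import state history shows disconnect/eviction transitions
    lctl get_param osc.*.state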
The customer attempted to solve the issue by booting the login nodes with the kernel options "notsc" and "clocksource=hpet", to no avail.
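For reference, on CentOS 6 these options were appended to the kernel line in /boot/grub/grub.conf, roughly as below (the entry is abridged; root= and the other arguments are omitted):

    # /boot/grub/grub.conf on a login node (abridged)
    kernel /vmlinuz-2.6.32-358.6.2.el6.x86_64 ro root=... notsc clocksource=hpet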
They also mounted Lustre over TCP instead of InfiniBand, which did not help either.
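The TCP test was configured in the usual way, roughly as below (the interface names, MGS address, and filesystem name are placeholders, not the customer's actual values):

    # /etc/modprobe.d/lustre.conf -- original InfiniBand configuration:
    #   options lnet networks=o2ib0(ib0)
    # TCP configuration used for the test:
    options lnet networks=tcp0(eth0)

    # Remount using the MGS TCP NID (address and fsname are placeholders):
    # mount -t lustre 10.201.0.1@tcp0:/lustre /mnt/lustre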
Infrastructure info:
2 nodes for MDS/MGS in a pacemaker/corosync cluster
8 nodes for OSS, in 2-node cluster configurations with pacemaker/corosync and one DotHill storage controller per pair
~1000 clients
We attach the messages log files covering several days.
For example, we found about 280 clients evicted in 5 days (counted roughly as sketched below); is that normal?
For clarity, the login nodes are named brutus[1-4], with IB IPs 10.201.32.31-34 and Ethernet IPs 10.201.0.31-34.
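The figure of ~280 above was obtained roughly as follows (the log path and message pattern are the usual ones; the exact wording may differ between Lustre versions):

    # Total eviction events in the attached logs
    grep -c "evicting" /var/log/messages
    # Breakdown by client NID, e.g. to spot the brutus login nodes
    grep "evicting" /var/log/messages | grep -oE "[0-9.]+@(o2ib|tcp)[0-9]*" | sort | uniq -c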
Is it possible that this issue arises from the different Lustre versions on the clients and the servers?
Many thanks in advance for your help.