Details
Type: Bug
Resolution: Duplicate
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.4.0
Labels: None
Environment:
Lustre servers:
CentOS 6.2 with Linux version 2.6.32-220.4.2.el6_lustre.x86_64 (jenkins@client 31.lab.whamcloud.com) (gcc version 4.4.6 20110731 (Red Hat 4.4.6-3) (GCC) ) #1 SMP Wed Mar 14 13:03:47 PDT 2012
Build Version: 2.2.0-RC2--PRISTINE-2.6.32-220.4.2.el6_lustre.x86_64
HW: 2xIntel(R) Xeon(R) CPU E5645 @ 2.40GHz, RAM 48GB
Lustre clients:
CentOS 6.4 with Linux version 2.6.32-358.6.2.el6.x86_64 (mockbuild@c6b8.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Thu May 16 20:59:36 UTC 2013
Build Version: 2.4.0-RC2--CHANGED-2.6.32-358.6.2.el6.x86_64
HW: Login nodes are of type 12-core AMD Opteron 6174
Compute nodes are a mix of:
12-core AMD Opteron 6174
quad-core AMD Opteron 8380
quad-core AMD Opteron 8384
8-core Intel Xeon E7-8837
Severity: 3
Rank (Obsolete): 10466
Description
Dear support,
Our customer is experiencing an eviction problem on the login/client nodes of their Lustre cluster.
The evictions seem too frequent, and since there appears to be no way to recover other than rebooting the node, this heavily interrupts the users' work.
The cause of the evictions seems to be that the login nodes may be temporarily stalled by various applications for fractions of a second, and consequently the OSS does not see the client for minutes (according to the logs).
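To illustrate where we are looking, the eviction messages and timeout settings can be inspected roughly as follows (standard syslog path and stock lctl parameters assumed; this is not output from the attached logs):

    # On an OSS: when and why clients were evicted
    grep -i "evicting" /var/log/messages
    # Request timeout tuning currently in effect
    lctl get_param timeout at_min at_max
    # On a client: the import state history shows disconnect/eviction transitions
    lctl get_param osc.*.state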
The customer attempted to solve the issue by booting the login nodes with the kernel options "notsc" and "clocksource=hpet", to no avail.
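For reference, on CentOS 6 these options were appended to the kernel line in /boot/grub/grub.conf, roughly as below (the entry is abridged; root= and the other arguments are omitted):

    # /boot/grub/grub.conf on a login node (abridged)
    kernel /vmlinuz-2.6.32-358.6.2.el6.x86_64 ro root=... notsc clocksource=hpet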
They also mounted Lustre over TCP instead of InfiniBand, which did not help either.
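The TCP test was configured in the usual way, roughly as below (the interface names, MGS address, and filesystem name are placeholders, not the customer's actual values):

    # /etc/modprobe.d/lustre.conf -- original InfiniBand configuration:
    #   options lnet networks=o2ib0(ib0)
    # TCP configuration used for the test:
    options lnet networks=tcp0(eth0)

    # Remount using the MGS TCP NID (address and fsname are placeholders):
    # mount -t lustre 10.201.0.1@tcp0:/lustre /mnt/lustre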
Infrastructure info:
2 nodes for MDS/MGS in a pacemaker/corosync cluster
8 nodes for OSS, in 2-node cluster configurations with pacemaker/corosync and one DotHill storage controller per pair
~1000 clients
We attach the messages log files covering several days.
For example, we found about 280 clients evicted in 5 days (counted roughly as sketched below); is that normal?
For clarity, the login nodes are named brutus[1-4], with IB IPs 10.201.32.31-34 and Ethernet IPs 10.201.0.31-34.
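The figure of ~280 above was obtained roughly as follows (the log path and message pattern are the usual ones; the exact wording may differ between Lustre versions):

    # Total eviction events in the attached logs
    grep -c "evicting" /var/log/messages
    # Breakdown by client NID, e.g. to spot the brutus login nodes
    grep "evicting" /var/log/messages | grep -oE "[0-9.]+@(o2ib|tcp)[0-9]*" | sort | uniq -c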
Is it possible that this issue arises from the different Lustre versions on the clients and the servers?
Many thanks in advance for your help.