Type: Bug
Resolution: Fixed
Priority: Blocker
Fix Version/s: Lustre 2.6.0, Lustre 2.5.3
Affects Version/s: None
Labels: None
Environment: Lustre 2.4.0-21chaos
Severity: 3
Rank (Obsolete): 12505

The timeout system appears to be horribly broken in Lustre 2.4.0-21chaos (see github.com/chaos/lustre). On MDS nodes, we frequently see almost all of the mdt threads stuck waiting in ldlm_completion_ast(). The kernel prints warnings on the console that these threads have been sleeping for more than 1200 seconds, despite an at_max of 600 seconds.

It gets worse than that: sometimes we see clients evicted by an mdt only after 9000+ seconds. Obviously, that isn't acceptable.

The practical effect of these poorly handled timeouts is file systems that go unresponsive for hours (if not days) at a time.

We need to work out a plan to fix the timeouts in Lustre.

sysrq-t.catalyst141.client.txt (826 kB, 10/May/14 12:46 AM)
sysrq-t.cider-mds1.txt (1.58 MB, 10/May/14 12:46 AM)

is related to

LU-4942 lock callback timeout is not per-export (Resolved)
LU-5497 Many MDS service threads blocked in ldlm_completion_ast() (Closed)

is related to

LU-4786 Apparent denial of service from client to mdt (Resolved)
LU-4570 Metadata slowdowns on production filesystem at ORNL (Closed)

Assignee: Oleg Drokin
Reporter: Christopher Morrone (Inactive)
Votes: 0
Watchers: 22
Created: 03/Feb/14 9:58 PM
Updated: 19/Aug/14 9:58 PM
Resolved: 23/Jun/14 9:27 PM
