Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: Lustre 2.8.0, Lustre 2.9.0
Affects Version/s: Lustre 2.5.1
Labels:
- patch
Environment:
Servers and clients both run Lustre 2.5.1

Severity:
3
Rank (Obsolete):
9223372036854775807

As part of acceptance testing, an LNet router node was deliberately crashed (via sysrq-trigger). Following the crash, a set of OST nodes started reporting problems: hung threads, timeouts, and clients continually losing their connection and reconnecting. The affected nodes are stuck in the ptlrpc_abort_bulk() function.

The logs show messages like this:

Feb 25 21:17:15 somenode kernel: LustreError: 32278:0:(service.c:3214:ptlrpc_svcpt_health_check()) ost_io: unhealthy - request has been waiting 16940s
Feb 25 21:17:15 somenode kernel: LustreError: 32278:0:(service.c:3214:ptlrpc_svcpt_health_check()) Skipped 5 previous similar messages
Feb 25 21:20:31 somenode kernel: Lustre: 106418:0:(niobuf.c:282:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc ffff8802e3970000
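
To gauge how long requests have been stuck on each node, the health-check messages above can be pulled out of syslog with a short script. This is only a rough sketch: it assumes the message format is exactly as in the excerpt above, and the names `UNHEALTHY_RE` and `find_unhealthy` are made up for illustration, not part of any Lustre tooling.

```python
import re

# Matches the ptlrpc_svcpt_health_check() "unhealthy" message as it appears
# in the excerpt above (format assumed from that excerpt only).
UNHEALTHY_RE = re.compile(
    r"ptlrpc_svcpt_health_check\(\)\)\s+(?P<service>\S+): unhealthy - "
    r"request has been waiting (?P<seconds>\d+)s"
)


def find_unhealthy(lines):
    """Return a list of (service, seconds_waiting) for each matching line."""
    results = []
    for line in lines:
        match = UNHEALTHY_RE.search(line)
        if match:
            results.append((match.group("service"), int(match.group("seconds"))))
    return results


if __name__ == "__main__":
    sample = [
        "Feb 25 21:17:15 somenode kernel: LustreError: 32278:0:"
        "(service.c:3214:ptlrpc_svcpt_health_check()) ost_io: unhealthy - "
        "request has been waiting 16940s",
        "Feb 25 21:17:15 somenode kernel: LustreError: 32278:0:"
        "(service.c:3214:ptlrpc_svcpt_health_check()) Skipped 5 previous "
        "similar messages",
    ]
    for service, seconds in find_unhealthy(sample):
        print(service, seconds)
```

"Skipped N previous similar messages" lines are ignored, so the result undercounts the real number of occurrences.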

The system uses a fine-grained routing configuration, and the router that was killed is one of the 5 routers in this set. The expectation is minimal disruption for clients: the other 4 routers are still functioning, so clients should detect that one is down and send traffic to the remaining 4.

is related to

LU-6573 multiple tests: client evicted, Input/output error

Resolved

LU-6808 Interop 2.5.3<->master sanity test_224c: Bulk IO write error

Resolved

Assignee: Emoly Liu

Reporter: Artem Blagodarenko (Inactive)

Votes: 1

Watchers: 9

Created: 08/Apr/15 9:03 AM

Updated: 22/Jun/22 8:40 PM

Resolved: 01/May/15 11:32 AM
