[LU-6441] OST problems following router node crash, inactive threads, clients continuously reconnecting Created: 08/Apr/15  Updated: 22/Jun/22  Resolved: 01/May/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.1
Fix Version/s: Lustre 2.8.0, Lustre 2.9.0

Type: Bug Priority: Major
Reporter: Artem Blagodarenko (Inactive) Assignee: Emoly Liu
Resolution: Fixed Votes: 1
Labels: patch
Environment:

Servers and clients both run Lustre 2.5.1


Issue Links:
Related
is related to LU-6573 multiple tests: client evicted, Input... Resolved
is related to LU-6808 Interop 2.5.3<->master sanity test_22... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

As part of acceptance testing, an LNet router node was deliberately crashed (via sysrq-trigger). Following the crash, a set of OST nodes started reporting problems: hung threads, timeouts, and clients continually losing their connections and reconnecting. The service threads are stuck in the ptlrpc_abort_bulk() function.

The logs show messages like this:

Feb 25 21:17:15 somenode kernel: LustreError: 32278:0:(service.c:3214:ptlrpc_svcpt_health_check()) ost_io: unhealthy - request has been waiting 16940s
Feb 25 21:17:15 somenode kernel: LustreError: 32278:0:(service.c:3214:ptlrpc_svcpt_health_check()) Skipped 5 previous similar messages
Feb 25 21:20:31 somenode kernel: Lustre: 106418:0:(niobuf.c:282:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc ffff8802e3970000

The system uses a fine-grained routing configuration, and it was one of the five routers in this set that was killed. The expectation is minimal disruption for clients: since the other four routers are still functioning, clients should detect that one is down and send their traffic through the remaining four.



 Comments   
Comment by Gerrit Updater [ 08/Apr/15 ]

Artem Blagodarenko (artem_blagodarenko@xyratex.com) uploaded a new patch: http://review.whamcloud.com/14399
Subject: LU-6441 ptlrpc: ptlrpc_bulk_abort unlink all entries in bd_mds
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0feb07fa9852dfeb8fa68ea80587cb7f11a7ab04

Comment by Artem Blagodarenko (Inactive) [ 21/Apr/15 ]

This bug happens when 4MB I/O is enabled. We have already seen it on two clusters. I believe this patch is important for anyone who is going to use 4MB I/O.

Comment by Ian Costello [ 21/Apr/15 ]

Agreed regarding 4MB RPCs. It also doesn't require an LNet router crash/panic; it can easily be reproduced with a similar trigger, such as pulling the IB cables (or those of whatever network you are using) on the clients while they are doing I/O to the filesystem.

I can confirm that patching the server with the above patch resolves the problem. I did this on site at ANU/NCI on the available test kit: I was able to reproduce the issue, then patched and installed the server, then spent a day and a half trying to reproduce it again without success (whereas I could hit it in two out of three attempts on an unpatched server).

Comment by Gerrit Updater [ 01/May/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14399/
Subject: LU-6441 ptlrpc: ptlrpc_bulk_abort unlink all entries in bd_mds
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0a6470219a8602d7a56fe1c5171dba4a42244738

Comment by Peter Jones [ 01/May/15 ]

Landed for 2.8

Comment by Gerrit Updater [ 09/Sep/16 ]

Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/22403
Subject: LU-6441 ptlrpc: fix the problem of the patch
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6bfc8faac211e5c2dd5a369ac23438565d3d16c0

Comment by Peter Jones [ 09/Sep/16 ]

Jinshan

For any changes needed to the original patch, it would be better to open a new ticket and link it to this one.

Peter

Comment by Gerrit Updater [ 13/Sep/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22403/
Subject: LU-6441 ptlrpc: fix sanity 224c for different RPC sizes
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6cde14a5df781ae29da88f98a2559eb4342fe1f3

Generated at Sat Feb 10 02:00:15 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.