[LU-6441] OST problems following router node crash, inactive threads, clients continuously reconnecting Created: 08/Apr/15 Updated: 22/Jun/22 Resolved: 01/May/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.1 |
| Fix Version/s: | Lustre 2.8.0, Lustre 2.9.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Artem Blagodarenko (Inactive) | Assignee: | Emoly Liu |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | patch | ||
| Environment: |
Servers run Lustre 2.5.1; clients run Lustre 2.5.1 |
||
| Severity: | 3 | ||||||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||||||
| Description |
|
As part of acceptance testing, an LNet router node was deliberately crashed (via sysrq-trigger). Following the crash, a set of OST nodes started reporting problems: hung threads, timeouts, clients continually losing connection and reconnecting, etc. The affected nodes are stuck in the ptlrpc_abort_bulk() function. The logs show messages like this:
Feb 25 21:17:15 somenode kernel: LustreError: 32278:0:(service.c:3214:ptlrpc_svcpt_health_check()) ost_io: unhealthy - request has been waiting 16940s
Feb 25 21:17:15 somenode kernel: LustreError: 32278:0:(service.c:3214:ptlrpc_svcpt_health_check()) Skipped 5 previous similar messages
Feb 25 21:20:31 somenode kernel: Lustre: 106418:0:(niobuf.c:282:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc ffff8802e3970000
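For triage, health-check messages in the format shown above can be scanned for long-stalled services. The following is only a minimal sketch: the function name `stalled_services` and the 600-second threshold are illustrative assumptions, not part of Lustre, and the pattern is based solely on the message format in this excerpt.

```python
import re

# Matches the ptlrpc_svcpt_health_check() "unhealthy" message format from the
# excerpt above. Other Lustre versions may format this line differently.
HEALTH_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"LustreError:\s+\d+:\d+:\(service\.c:\d+:ptlrpc_svcpt_health_check\(\)\)\s+"
    r"(?P<service>\S+):\s+unhealthy - request has been waiting\s+(?P<secs>\d+)s"
)


def stalled_services(lines, threshold_s=600):
    """Return (host, service, seconds) for each unhealthy-service line whose
    reported wait is at least threshold_s (a hypothetical cutoff)."""
    hits = []
    for line in lines:
        m = HEALTH_RE.match(line.strip())
        if m and int(m.group("secs")) >= threshold_s:
            hits.append((m.group("host"), m.group("service"), int(m.group("secs"))))
    return hits


# The three log lines quoted in this ticket.
sample = [
    "Feb 25 21:17:15 somenode kernel: LustreError: 32278:0:(service.c:3214:ptlrpc_svcpt_health_check()) ost_io: unhealthy - request has been waiting 16940s",
    "Feb 25 21:17:15 somenode kernel: LustreError: 32278:0:(service.c:3214:ptlrpc_svcpt_health_check()) Skipped 5 previous similar messages",
    "Feb 25 21:20:31 somenode kernel: Lustre: 106418:0:(niobuf.c:282:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc ffff8802e3970000",
]

if __name__ == "__main__":
    for host, service, secs in stalled_services(sample):
        print(f"{host}: {service} request waiting {secs}s")
```

Only the first of the three quoted lines matches; the "Skipped ..." and ptlrpc_abort_bulk() lines are ignored because they do not carry a wait time.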
The system uses a fine-grained routing configuration, and the router that was killed was one of the 5 routers in this set. The expectation is minimal disruption for clients: the other 4 routers are still functioning, so clients should detect that one is down and send traffic to the other 4 nodes. |
| Comments |
| Comment by Gerrit Updater [ 08/Apr/15 ] |
|
Artem Blagodarenko (artem_blagodarenko@xyratex.com) uploaded a new patch: http://review.whamcloud.com/14399 |
| Comment by Artem Blagodarenko (Inactive) [ 21/Apr/15 ] |
|
This bug occurs when 4MB I/O is enabled. We have already seen it on two clusters. I believe this patch is important for anyone who is going to use 4MB I/O. |
| Comment by Ian Costello [ 21/Apr/15 ] |
|
Agreed re 4MB RPC. It also doesn't require an LNET router crash/panic; it can easily be reproduced with a similar trigger, such as pulling IB cables (or whatever network you are using) on the clients while the clients are doing I/O to the filesystem. I can confirm that patching the server with the above patch resolves the problem. I have done this on site at ANU/NCI on the available test kit, i.e. I was able to reproduce the problem, patch the server and install, and then spent a day and a half trying to reproduce it (when I could hit this 2/3 attempts on a server without the patch). |
| Comment by Gerrit Updater [ 01/May/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14399/ |
| Comment by Peter Jones [ 01/May/15 ] |
|
Landed for 2.8 |
| Comment by Gerrit Updater [ 09/Sep/16 ] |
|
Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/22403 |
| Comment by Peter Jones [ 09/Sep/16 ] |
|
Jinshan,
It would be better to open a new ticket for any changes needed to the original patch and link it to this one.
Peter |
| Comment by Gerrit Updater [ 13/Sep/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22403/ |