Details
-
Bug
-
Resolution: Cannot Reproduce
-
Major
-
None
-
Lustre 2.3.0
-
None
-
SWL Hyperion LLNL
-
3
-
10169
Description
Running SWL, MDS gradually goes into a wedged, clients get -EBUSY, MDS nevers clears stuck RPC. Rebooted MDS to recover.
Typical client:
Sep 15 15:23:34 hyperion770 kernel: Lustre: 8865:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1347747046/real 1347747046] req@ffff880176204800 x1413191623533468/t0(0) o101->lustre-MDT0000-mdc-ffff880339d11800@192.168.127.6@o2ib1:12/10 lens 592/1136 e 3 to 1 dl 1347747806 ref 2 fl Rpc:XP/0/ffffffff rc 0/-1
Sep 15 15:23:34 hyperion770 kernel: Lustre: lustre-MDT0000-mdc-ffff880339d11800: Connection to lustre-MDT0000 (at 192.168.127.6@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
Sep 15 15:23:42 hyperion770 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.127.6@o2ib1. The mds_connect operation failed with -16
Sep 15 15:31:12 hyperion770 kernel: LustreError: Skipped 5 previous similar messages
,,,,,duplicate
Sep 15 15:35:47 hyperion770 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.127.6@o2ib1. The mds_connect operation failed with -16
Sep 15 15:35:48 hyperion770 kernel: LustreError: Skipped 10 previous similar messages
Sep 15 15:44:15 hyperion770 kernel: Lustre: 3342:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1347749019/real 1347749043] req@ffff88014f8e2000 x1413191623538306/t0(0) o38->lustre-MDT0000-mdc-ffff880339d11800@192.168.127.6@o2ib1:12/10 lens 400/544 e 0 to 1 dl 1347749069 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
-------
MDS:
Sep 15 15:21:09 hyperion-rst6 kernel: req@ffff880254359050 x1413191623533468/t0(0) o101->d17e0f27-22a5-38fb-14c0-313655de63cd@192.168.117.51@o2ib1:0/0 lens 592/1152 e 3 to 0 dl 1347747674 ref 2 fl Interpret:/0/0 rc 0/0
Sep 15 15:21:09 hyperion-rst6 kernel: Lustre: 6988:0:(service.c:1260:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message
Sep 15 15:21:21 hyperion-rst6 kernel: Lustre: 6935:0:(service.c:1260:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-23), not sending early reply
Sep 15 15:21:21 hyperion-rst6 kernel: req@ffff88016d1c6050 x1413191619837173/t0(0) o101->821224b2-a0a9-0330-9862-198101015e30@192.168.116.111@o2ib1:0/0 lens 592/1152 e 3 to 0 dl 1347747685 ref 2 fl Interpret:/0/0 rc 0/0
Sep 15 15:21:50 hyperion-rst6 kernel: Lustre: 7165:0:(service.c:1260:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-23), not sending early reply
Sep 15 15:21:50 hyperion-rst6 kernel: req@ffff880268a60050 x1413191612196075/t0(0) o35->3c2d3508-87e2-4c0e-a013-11a2bf99635a@192.168.116.97@o2ib1:0/0 lens 392/4104 e 3 to 0 dl 1347747715 ref 2 fl Interpret:/0/0 rc 0/0
Sep 15 15:22:38 hyperion-rst6 kernel: Lustre: 7036:0:(service.c:1260:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-23), not sending early reply
Sep 15 15:22:38 hyperion-rst6 kernel: req@ffff8801152fe450 x1413191612772893/t0(0) o101->e8d4ba14-4c25-89b1-b0eb-af89de332d73@192.168.116.83@o2ib1:0/0 lens 592/1152 e 3 to 0 dl 1347747763 ref 2 fl Interpret:/0/0 rc 0/0
Sep 15 15:22:38 hyperion-rst6 kernel: Lustre: 7036:0:(service.c:1260:ptlrpc_at_send_early_reply()) Skipped 2 previous similar messages
Sep 15 15:23:05 hyperion-rst6 kernel: Lustre: lustre-MDT0000: Client 5dfb5850-e495-0463-45aa-51d17e91b47a (at 192.168.116.85@o2ib1) reconnecting
Sep 15 15:23:05 hyperion-rst6 kernel: Lustre: lustre-MDT0000: Client 5dfb5850-e495-0463-45aa-51d17e91b47a (at 192.168.116.85@o2ib1) refused reconnection, still busy with 1 active RPCs
Sep 15 15:23:08 hyperion-rst6 kernel: Lustre: lustre-MDT0000: Client a688d9d6-ffb7-c06d-f4d4-541f43b6f1c5 (at 192.168.116.84@o2ib1) reconnecting
Sep 15 15:23:08 hyperion-rst6 kernel: Lustre: lustre-MDT0000: Client a688d9d6-ffb7-c06d-f4d4-541f43b6f1c5 (at 192.168.116.84@o2ib1) refused reconnection, still busy with 2 active RPCs
Sep 15 15:23:21 hyperion-rst6 kernel: Lustre: lustre-MDT0000: Client 21e2654d-06e7-726f-ced3-f0c96e204093 (at 192.168.116.133@o2ib1) reconnecting
Sep 15 15:23:21 hyperion-rst6 kernel: Lustre: lustre-MDT0000: Client 21e2654d-06e7-726f-ced3-f0c96e204093 (at 192.168.116.133@o2ib1) refused reconnection, still busy with 1 active RPCs
-----------
Took a log on MDS with debug= -1 while this was happening, attached.
Attachments
Issue Links
- is related to
-
LU-1934 still busy with active RPCs for days
- Resolved