[LU-1949] SWL - mds wedges 'still busy with 1 RPC' Created: 16/Sep/12  Updated: 02/Jul/15  Resolved: 02/Jul/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Cliff White (Inactive) Assignee: Zhenyu Xu
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

SWL Hyperion LLNL


Attachments: File rst6.busy.dk.gz    
Issue Links:
Related
is related to LU-1934 still busy with active RPCs for days Resolved
Severity: 3
Rank (Obsolete): 10169

 Description   

Running SWL, the MDS gradually becomes wedged: clients get -EBUSY and the MDS never clears the stuck RPC. Rebooted the MDS to recover.
Typical client:

Sep 15 15:23:34 hyperion770 kernel: Lustre: 8865:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1347747046/real 1347747046] req@ffff880176204800 x1413191623533468/t0(0) o101->lustre-MDT0000-mdc-ffff880339d11800@192.168.127.6@o2ib1:12/10 lens 592/1136 e 3 to 1 dl 1347747806 ref 2 fl Rpc:XP/0/ffffffff rc 0/-1
Sep 15 15:23:34 hyperion770 kernel: Lustre: lustre-MDT0000-mdc-ffff880339d11800: Connection to lustre-MDT0000 (at 192.168.127.6@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
Sep 15 15:23:42 hyperion770 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.127.6@o2ib1. The mds_connect operation failed with -16
Sep 15 15:31:12 hyperion770 kernel: LustreError: Skipped 5 previous similar messages
[... duplicate messages elided ...]
Sep 15 15:35:47 hyperion770 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.127.6@o2ib1. The mds_connect operation failed with -16
Sep 15 15:35:48 hyperion770 kernel: LustreError: Skipped 10 previous similar messages
Sep 15 15:44:15 hyperion770 kernel: Lustre: 3342:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1347749019/real 1347749043] req@ffff88014f8e2000 x1413191623538306/t0(0) o38->lustre-MDT0000-mdc-ffff880339d11800@192.168.127.6@o2ib1:12/10 lens 400/544 e 0 to 1 dl 1347749069 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
-------
MDS:

Sep 15 15:21:09 hyperion-rst6 kernel: req@ffff880254359050 x1413191623533468/t0(0) o101->d17e0f27-22a5-38fb-14c0-313655de63cd@192.168.117.51@o2ib1:0/0 lens 592/1152 e 3 to 0 dl 1347747674 ref 2 fl Interpret:/0/0 rc 0/0
Sep 15 15:21:09 hyperion-rst6 kernel: Lustre: 6988:0:(service.c:1260:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message
Sep 15 15:21:21 hyperion-rst6 kernel: Lustre: 6935:0:(service.c:1260:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-23), not sending early reply
Sep 15 15:21:21 hyperion-rst6 kernel: req@ffff88016d1c6050 x1413191619837173/t0(0) o101->821224b2-a0a9-0330-9862-198101015e30@192.168.116.111@o2ib1:0/0 lens 592/1152 e 3 to 0 dl 1347747685 ref 2 fl Interpret:/0/0 rc 0/0
Sep 15 15:21:50 hyperion-rst6 kernel: Lustre: 7165:0:(service.c:1260:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-23), not sending early reply
Sep 15 15:21:50 hyperion-rst6 kernel: req@ffff880268a60050 x1413191612196075/t0(0) o35->3c2d3508-87e2-4c0e-a013-11a2bf99635a@192.168.116.97@o2ib1:0/0 lens 392/4104 e 3 to 0 dl 1347747715 ref 2 fl Interpret:/0/0 rc 0/0
Sep 15 15:22:38 hyperion-rst6 kernel: Lustre: 7036:0:(service.c:1260:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-23), not sending early reply
Sep 15 15:22:38 hyperion-rst6 kernel: req@ffff8801152fe450 x1413191612772893/t0(0) o101->e8d4ba14-4c25-89b1-b0eb-af89de332d73@192.168.116.83@o2ib1:0/0 lens 592/1152 e 3 to 0 dl 1347747763 ref 2 fl Interpret:/0/0 rc 0/0
Sep 15 15:22:38 hyperion-rst6 kernel: Lustre: 7036:0:(service.c:1260:ptlrpc_at_send_early_reply()) Skipped 2 previous similar messages
Sep 15 15:23:05 hyperion-rst6 kernel: Lustre: lustre-MDT0000: Client 5dfb5850-e495-0463-45aa-51d17e91b47a (at 192.168.116.85@o2ib1) reconnecting
Sep 15 15:23:05 hyperion-rst6 kernel: Lustre: lustre-MDT0000: Client 5dfb5850-e495-0463-45aa-51d17e91b47a (at 192.168.116.85@o2ib1) refused reconnection, still busy with 1 active RPCs
Sep 15 15:23:08 hyperion-rst6 kernel: Lustre: lustre-MDT0000: Client a688d9d6-ffb7-c06d-f4d4-541f43b6f1c5 (at 192.168.116.84@o2ib1) reconnecting
Sep 15 15:23:08 hyperion-rst6 kernel: Lustre: lustre-MDT0000: Client a688d9d6-ffb7-c06d-f4d4-541f43b6f1c5 (at 192.168.116.84@o2ib1) refused reconnection, still busy with 2 active RPCs
Sep 15 15:23:21 hyperion-rst6 kernel: Lustre: lustre-MDT0000: Client 21e2654d-06e7-726f-ced3-f0c96e204093 (at 192.168.116.133@o2ib1) reconnecting
Sep 15 15:23:21 hyperion-rst6 kernel: Lustre: lustre-MDT0000: Client 21e2654d-06e7-726f-ced3-f0c96e204093 (at 192.168.116.133@o2ib1) refused reconnection, still busy with 1 active RPCs

-----------
Took a log on the MDS with debug=-1 while this was happening; attached.



 Comments   
Comment by Peter Jones [ 16/Sep/12 ]

Bobijam

Could you please look into this one?

Thanks

Peter

Comment by Zhenyu Xu [ 17/Sep/12 ]

I think the MDS log starts a little too late for this scenario; I cannot tell why the MDS was stuck on the RPC. Could you grab MDS logs while it is handling the about-to-time-out request? In this case, capture what the MDS did with the client request "req@ffff880176204800 x1413191623533468/t0(0) o101->lustre-MDT0000-mdc-ffff880339d11800@192.168.127.6@o2ib1".

Comment by Peter Jones [ 21/Sep/12 ]

Dropping priority as unable to reproduce

Comment by nasf (Inactive) [ 21/Sep/12 ]

There are some unfinished RPCs on the export, which prevented the client from reconnecting, but I cannot find the related RPC processing in the Lustre debug log. A "ps" listing showing what the RPCs were, or stack traces showing what the RPC service threads were doing, would be very helpful.

In any case, this does not appear to be a duplicate of LU-1976.

Comment by Cliff White (Inactive) [ 29/Sep/12 ]

vmcore is at ~cliffw/lu1948/erofs on brent.

Comment by Andreas Dilger [ 02/Jul/15 ]

Closing old bug.

Generated at Sat Feb 10 01:21:05 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.