Lustre / LU-1949

SWL - mds wedges 'still busy with 1 RPC'

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • None
    • Affects Version/s: Lustre 2.3.0
    • None
    • Environment: SWL Hyperion LLNL
    • 3
    • 10169

    Description

      Running SWL, the MDS gradually goes into a wedged state, clients get -EBUSY, and the MDS never clears the stuck RPC. Rebooted the MDS to recover.
      Typical client:

      Sep 15 15:23:34 hyperion770 kernel: Lustre: 8865:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1347747046/real 1347747046] req@ffff880176204800 x1413191623533468/t0(0) o101->lustre-MDT0000-mdc-ffff880339d11800@192.168.127.6@o2ib1:12/10 lens 592/1136 e 3 to 1 dl 1347747806 ref 2 fl Rpc:XP/0/ffffffff rc 0/-1
      Sep 15 15:23:34 hyperion770 kernel: Lustre: lustre-MDT0000-mdc-ffff880339d11800: Connection to lustre-MDT0000 (at 192.168.127.6@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
      Sep 15 15:23:42 hyperion770 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.127.6@o2ib1. The mds_connect operation failed with -16
      Sep 15 15:31:12 hyperion770 kernel: LustreError: Skipped 5 previous similar messages
      [... duplicate messages omitted ...]
      Sep 15 15:35:47 hyperion770 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.127.6@o2ib1. The mds_connect operation failed with -16
      Sep 15 15:35:48 hyperion770 kernel: LustreError: Skipped 10 previous similar messages
      Sep 15 15:44:15 hyperion770 kernel: Lustre: 3342:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1347749019/real 1347749043] req@ffff88014f8e2000 x1413191623538306/t0(0) o38->lustre-MDT0000-mdc-ffff880339d11800@192.168.127.6@o2ib1:12/10 lens 400/544 e 0 to 1 dl 1347749069 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      -------
      MDS:

      Sep 15 15:21:09 hyperion-rst6 kernel: req@ffff880254359050 x1413191623533468/t0(0) o101->d17e0f27-22a5-38fb-14c0-313655de63cd@192.168.117.51@o2ib1:0/0 lens 592/1152 e 3 to 0 dl 1347747674 ref 2 fl Interpret:/0/0 rc 0/0
      Sep 15 15:21:09 hyperion-rst6 kernel: Lustre: 6988:0:(service.c:1260:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message
      Sep 15 15:21:21 hyperion-rst6 kernel: Lustre: 6935:0:(service.c:1260:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-23), not sending early reply
      Sep 15 15:21:21 hyperion-rst6 kernel: req@ffff88016d1c6050 x1413191619837173/t0(0) o101->821224b2-a0a9-0330-9862-198101015e30@192.168.116.111@o2ib1:0/0 lens 592/1152 e 3 to 0 dl 1347747685 ref 2 fl Interpret:/0/0 rc 0/0
      Sep 15 15:21:50 hyperion-rst6 kernel: Lustre: 7165:0:(service.c:1260:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-23), not sending early reply
      Sep 15 15:21:50 hyperion-rst6 kernel: req@ffff880268a60050 x1413191612196075/t0(0) o35->3c2d3508-87e2-4c0e-a013-11a2bf99635a@192.168.116.97@o2ib1:0/0 lens 392/4104 e 3 to 0 dl 1347747715 ref 2 fl Interpret:/0/0 rc 0/0
      Sep 15 15:22:38 hyperion-rst6 kernel: Lustre: 7036:0:(service.c:1260:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-23), not sending early reply
      Sep 15 15:22:38 hyperion-rst6 kernel: req@ffff8801152fe450 x1413191612772893/t0(0) o101->e8d4ba14-4c25-89b1-b0eb-af89de332d73@192.168.116.83@o2ib1:0/0 lens 592/1152 e 3 to 0 dl 1347747763 ref 2 fl Interpret:/0/0 rc 0/0
      Sep 15 15:22:38 hyperion-rst6 kernel: Lustre: 7036:0:(service.c:1260:ptlrpc_at_send_early_reply()) Skipped 2 previous similar messages
      Sep 15 15:23:05 hyperion-rst6 kernel: Lustre: lustre-MDT0000: Client 5dfb5850-e495-0463-45aa-51d17e91b47a (at 192.168.116.85@o2ib1) reconnecting
      Sep 15 15:23:05 hyperion-rst6 kernel: Lustre: lustre-MDT0000: Client 5dfb5850-e495-0463-45aa-51d17e91b47a (at 192.168.116.85@o2ib1) refused reconnection, still busy with 1 active RPCs
      Sep 15 15:23:08 hyperion-rst6 kernel: Lustre: lustre-MDT0000: Client a688d9d6-ffb7-c06d-f4d4-541f43b6f1c5 (at 192.168.116.84@o2ib1) reconnecting
      Sep 15 15:23:08 hyperion-rst6 kernel: Lustre: lustre-MDT0000: Client a688d9d6-ffb7-c06d-f4d4-541f43b6f1c5 (at 192.168.116.84@o2ib1) refused reconnection, still busy with 2 active RPCs
      Sep 15 15:23:21 hyperion-rst6 kernel: Lustre: lustre-MDT0000: Client 21e2654d-06e7-726f-ced3-f0c96e204093 (at 192.168.116.133@o2ib1) reconnecting
      Sep 15 15:23:21 hyperion-rst6 kernel: Lustre: lustre-MDT0000: Client 21e2654d-06e7-726f-ced3-f0c96e204093 (at 192.168.116.133@o2ib1) refused reconnection, still busy with 1 active RPCs

      -----------
      Took a log on the MDS with debug=-1 while this was happening; attached.
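
      For reference, the usual way to capture such a log is to raise the Lustre debug mask to -1 (log everything) and later dump the kernel debug buffer with lctl dk. The sketch below only illustrates that procedure, assuming it is run as root with lctl in PATH; the exact commands, buffer size, and dump path used on Hyperion are not recorded in this ticket.

      import subprocess
      import time

      def capture_full_debug_log(dump_path="/tmp/mds-debug.log", window_s=60):
          # Enable every debug subsystem and mask ("-1" means log everything).
          subprocess.run(["lctl", "set_param", "debug=-1"], check=True)
          try:
              # Wait while the problem window elapses and the buffer fills.
              time.sleep(window_s)
          finally:
              # Dump (and clear) the kernel debug buffer to a plain-text file.
              subprocess.run(["lctl", "dk", dump_path], check=True)

      if __name__ == "__main__":
          capture_full_debug_log()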

      Attachments

        Issue Links

          Activity

            adilger Andreas Dilger made changes -
            Resolution New: Cannot Reproduce [ 5 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]

            adilger Andreas Dilger added a comment - Closing old bug.

            cliffw Cliff White (Inactive) added a comment - vmcore is at ~cliffw/lu1948/erofs on brent.

            yong.fan nasf (Inactive) added a comment -

            There are some unfinished RPCs on the export which prevented the client from reconnecting, but I cannot find the related RPC processing in the lustre-debug log. A "ps" log showing what the RPCs were, or stack traces showing what the RPC service threads were doing, would be much more helpful.

            Anyway, this does not seem to be a duplicate of LU-1976.
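
            For context on the "still busy with N active RPCs" messages: the server refuses a client's reconnect attempt with -EBUSY (-16, the same error the clients report for mds_connect) as long as that client's export still has RPCs being processed. The snippet below is a toy Python model of that behaviour for illustration only; it is not the actual server code, and all names in it are invented.

            EBUSY = 16  # the -16 the clients see when mds_connect is refused

            class Export:
                """Toy stand-in for the per-client export state on the MDS."""
                def __init__(self, client_uuid):
                    self.client_uuid = client_uuid
                    self.active_rpcs = 0  # RPCs from this client still being handled

                def rpc_start(self):
                    self.active_rpcs += 1

                def rpc_done(self):
                    self.active_rpcs -= 1

            def handle_reconnect(export):
                """Refuse the reconnect while earlier RPCs are still outstanding."""
                if export.active_rpcs > 0:
                    print("refused reconnection, still busy with "
                          "%d active RPCs" % export.active_rpcs)
                    return -EBUSY  # client side: "mds_connect operation failed with -16"
                return 0           # otherwise the reconnect would be accepted

            if __name__ == "__main__":
                exp = Export("5dfb5850-e495-0463-45aa-51d17e91b47a")
                exp.rpc_start()                          # an RPC that never completes
                assert handle_reconnect(exp) == -EBUSY   # so every reconnect is refused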
            pjones Peter Jones made changes -
            Priority Original: Blocker [ 1 ] New: Major [ 3 ]
            pjones Peter Jones added a comment -

            Dropping priority as unable to reproduce

            bobijam Zhenyu Xu made changes -
            Link Original: This issue duplicates LU-1934 [ LU-1934 ]
            bobijam Zhenyu Xu made changes -
            Link New: This issue duplicates LU-1934 [ LU-1934 ]
            bobijam Zhenyu Xu made changes -
            Link New: This issue is related to LU-1934 [ LU-1934 ]
            bobijam Zhenyu Xu added a comment -

            I think the MDS log was captured a little too late for this scenario; I cannot find out why the MDS was stuck on the RPC. Can you try to grab MDS logs while it is handling the request that is about to time out? In this case, capture what the MDS has done with the client request "req@ffff880176204800 x1413191623533468/t0(0) o101->lustre-MDT0000-mdc-ffff880339d11800@192.168.127.6@o2ib1".

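            One way to follow up on this would be to dump the MDS debug buffer while the request is still being handled and then pull out every line that mentions its xid. A rough sketch follows; the xid below is the stuck o101 request from the client console messages quoted in the description, and the log path is just an example.

            import sys

            def lines_for_xid(log_path, xid):
                """Yield each line of a dumped Lustre debug log that mentions the xid."""
                with open(log_path, errors="replace") as log:
                    for line in log:
                        if xid in line:
                            yield line.rstrip()

            if __name__ == "__main__":
                path = sys.argv[1] if len(sys.argv) > 1 else "/tmp/mds-debug.log"
                for line in lines_for_xid(path, "x1413191623533468"):
                    print(line)
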

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved: