
LU-2621: Single client timeout hangs MDS - related to LU-793

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • None
    • Fix Version/s: Lustre 2.4.0
    • Environment: Hyperion/RHEL6
    • Severity: 3
    • 6132

    Description

      Running mdtest, file-per-process. A single client times out a request, then the MDS enters the 'waiting on 1 RPC' state, and all clients eventually get -EBUSY. This bug is to show the sequence as I'm currently seeing it on Hyperion.
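      For context, a minimal sketch of the timeout mechanism behind the messages below (illustrative only, not the actual ptlrpc source; all names are invented): each request carries a deadline that the server can push forward by sending early replies, and once that deadline passes with neither a real nor an early reply, the client expires the request and attempts to reconnect.

      #include <stdbool.h>
      #include <time.h>

      struct sketch_request {
              time_t sent;       /* when the request went out */
              time_t deadline;   /* sent + timeout; early replies push this forward */
              bool   replied;    /* set when the real reply arrives */
      };

      /* True once the deadline has passed with no reply and no extension. */
      static bool sketch_request_expired(const struct sketch_request *req, time_t now)
      {
              return !req->replied && now > req->deadline;
      }

      int main(void)
      {
              /* Timestamps echoed from the client log line below:
               * "sent 1358283526 ... dl 1358284286" (Unix seconds). */
              struct sketch_request req = { 1358283526, 1358284286, false };

              return sketch_request_expired(&req, time(NULL)) ? 0 : 1;
      }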
      The first client error:

      2013-01-15 13:11:26 Lustre: 13481:0:(client.c:1836:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1358283526/real 1358283526]  req@ffff88030ca02400 x1424249167482022/t0(0) o101->lustre-MDT0000-mdc-ffff88033c3dac00@192.168.127.6@o2ib1:12/10 lens 592/1136 e 3 to 1 dl 1358284286 ref 2 fl Rpc:XP/0/ffffffff rc 0/-1
      2013-01-15 13:11:26 Lustre: 13481:0:(client.c:1836:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      2013-01-15 13:11:26 Lustre: lustre-MDT0000-mdc-ffff88033c3dac00: Connection to lustre-MDT0000 (at 192.168.127.6@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
      2013-01-15 13:11:26 LustreError: 11-0: an error occurred while communicating with 192.168.127.6@o2ib1. The mds_connect operation failed with -16
      

      The MDS log:

      Jan 15 13:09:09 hyperion-rst6 kernel: Lustre: 10570:0:(service.c:1290:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-23), not sending early reply
      Jan 15 13:09:09 hyperion-rst6 kernel: req@ffff88012a324850 x1424249167482022/t0(0) o101->ad6c708f-3715-63c3-9874-6577bf49a8f7@192.168.117.84@o2ib1:0/0 lens 592/1152 e 3 to 0 dl 1358284154 ref 2 fl Interpret:/0/0 rc 0/0
      Jan 15 13:09:09 hyperion-rst6 kernel: Lustre: 10570:0:(service.c:1290:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-23), not sending early reply
      Jan 15 13:09:09 hyperion-rst6 kernel: req@ffff880161327850 x1424249159092332/t0(0) o101->7ae936ba-6abb-4279-4d8d-6075df2b44ca@192.168.116.112@o2ib1:0/0 lens 592/1152 e 3 to 0 dl 1358284154 ref 2 fl Interpret:/0/0 rc 0/0
      Jan 15 13:09:10 hyperion-rst6 kernel: Lustre: 7251:0:(service.c:1290:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-23), not sending early reply
      Jan 15 13:09:10 hyperion-rst6 kernel: req@ffff88015b7d1050 x1424249191598931/t0(0) o35->67291746-09fb-9e08-cd08-b1a1abb10ad0@192.168.119.118@o2ib1:0/0 lens 392/2024 e 3 to 0 dl 1358284155 ref 2 fl Interpret:/0/0 rc 0/0
      Jan 15 13:11:26 hyperion-rst6 kernel: Lustre: lustre-MDT0000: Client ad6c708f-3715-63c3-9874-6577bf49a8f7 (at 192.168.117.84@o2ib1) reconnecting
      Jan 15 13:11:26 hyperion-rst6 kernel: Lustre: lustre-MDT0000: Client 7ae936ba-6abb-4279-4d8d-6075df2b44ca (at 192.168.116.112@o2ib1) refused reconnection, still busy with 1 active RPCs
      

      Requires a restart of the MDS to clear.
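
      For context, a minimal sketch of the reconnect gate implied by the "refused reconnection, still busy with 1 active RPCs" message above (illustrative only, not the actual Lustre server code; names are invented): once the MDS can no longer extend a request's deadline via an early reply and the client gives up and reconnects, the server rejects the reconnect with -EBUSY (-16, matching the client-side "mds_connect operation failed with -16") for as long as it still counts an RPC in flight on that client's export, so a single stuck service thread keeps the client locked out until the MDS is restarted.

      #include <errno.h>
      #include <stdio.h>

      struct sketch_export {
              const char *client_uuid;
              int         active_rpcs;   /* RPCs currently being processed for this client */
      };

      /* Decide whether a reconnecting client may be re-admitted. */
      static int sketch_handle_reconnect(const struct sketch_export *exp)
      {
              if (exp->active_rpcs > 0) {
                      printf("Client %s refused reconnection, still busy with %d active RPCs\n",
                             exp->client_uuid, exp->active_rpcs);
                      return -EBUSY;   /* client side reports "failed with -16" */
              }
              return 0;
      }

      int main(void)
      {
              /* One RPC is stuck in a service thread, so every reconnect fails. */
              struct sketch_export exp = { "7ae936ba-6abb-4279-4d8d-6075df2b44ca", 1 };

              return sketch_handle_reconnect(&exp) == -EBUSY ? 0 : 1;
      }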

      Attachments

        Issue Links

          Activity

            [LU-2621] Single client timeout hangs MDS - related to LU-793

            jlevi Jodi Levi (Inactive) added a comment - Please reopen if this happens again.

            jlevi Jodi Levi (Inactive) added a comment - Cliff, have you seen this one again yet?

            cliffw Cliff White (Inactive) added a comment - I'd rather leave it open for a bit, until I can repeat a full SWL test w/ldiskfs. This week has been mostly ZFS work.

            jlevi Jodi Levi (Inactive) added a comment - Cliff, have you not seen this since 1/22? If so, can this ticket be closed?

            cliffw Cliff White (Inactive) added a comment - Yes, the load was mdtest. Sadly, the lustre-log was not retained - that bit should now be corrected. I will get stack traces if/when the problem repeats.
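            For reference, an mdtest run of the shape described in this ticket (file-per-process, one unique working directory per task) would look something like the command below; the process count and per-task file count are placeholders, since the exact Hyperion parameters are not recorded here:

            mpirun -np 64 mdtest -F -u -n 1000 -i 3 -d /mnt/lustre/mdtest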

            adilger Andreas Dilger added a comment - I see this is mdtest, one directory per client, so there should be no contention in the filesystem or DLM between the client threads at all, so it is totally unexpected that some MDS thread should become stuck. Can you please also get stack traces from the MDS in this case?

            adilger Andreas Dilger added a comment - Cliff, are there any debug logs dumped from this case? Also, is this the full MDS log from when the thread first gets stuck? Can you please describe the test load when the MDS thread first gets stuck?

            People

              Assignee: tappro Mikhail Pershin
              Reporter: cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 3

              Dates

                Created:
                Updated:
                Resolved: