Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-2621

SIngle client timeout hangs MDS -related to LU-793

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • None
    • Lustre 2.4.0
    • Hyperion/RHEL6
    • 3
    • 6132

    Description

      Running mdtest, file-per-process. A single client times out a request, then the MDS enter the 'waiting on 1 RPC' state, all clients eventually get -EBUSY. This bug is to show the sequence as I'm currently seeing it on Hyperion.
      The first client error:

      2013-01-15 13:11:26 Lustre: 13481:0:(client.c:1836:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1358283526/real 1358283526]  req@ffff88030ca02400 x1424249167482022/t0(0) o101->lustre-MDT0000-mdc-ffff88033c3dac00@192.168.127.6@o2ib1:12/10 lens 592/1136 e 3 to 1 dl 1358284286 ref 2 fl Rpc:XP/0/ffffffff rc 0/-1
      2013-01-15 13:11:26 Lustre: 13481:0:(client.c:1836:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      2013-01-15 13:11:26 Lustre: lustre-MDT0000-mdc-ffff88033c3dac00: Connection to lustre-MDT0000 (at 192.168.127.6@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
      2013-01-15 13:11:26 LustreError: 11-0: an error occurred while communicating with 192.168.127.6@o2ib1. The mds_connect operation failed with -16
      

      The MDS log

      Jan 15 13:09:09 hyperion-rst6 kernel: Lustre: 10570:0:(service.c:1290:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-23), not sending early reply
      Jan 15 13:09:09 hyperion-rst6 kernel: req@ffff88012a324850 x1424249167482022/t0(0) o101->ad6c708f-3715-63c3-9874-6577bf49a8f7@192.168.117.84@o2ib1:0/0 lens 592/1152 e 3 to 0 dl 1358284154 ref 2 fl Interpret:/0/0 rc 0/0
      Jan 15 13:09:09 hyperion-rst6 kernel: Lustre: 10570:0:(service.c:1290:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-23), not sending early reply
      Jan 15 13:09:09 hyperion-rst6 kernel: req@ffff880161327850 x1424249159092332/t0(0) o101->7ae936ba-6abb-4279-4d8d-6075df2b44ca@192.168.116.112@o2ib1:0/0 lens 592/1152 e 3 to 0 dl 1358284154 ref 2 fl Interpret:/0/0 rc 0/0
      Jan 15 13:09:10 hyperion-rst6 kernel: Lustre: 7251:0:(service.c:1290:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-23), not sending early reply
      Jan 15 13:09:10 hyperion-rst6 kernel: req@ffff88015b7d1050 x1424249191598931/t0(0) o35->67291746-09fb-9e08-cd08-b1a1abb10ad0@192.168.119.118@o2ib1:0/0 lens 392/2024 e 3 to 0 dl 1358284155 ref 2 fl Interpret:/0/0 rc 0/0
      Jan 15 13:11:26 hyperion-rst6 kernel: Lustre: lustre-MDT0000: Client ad6c708f-3715-63c3-9874-6577bf49a8f7 (at 192.168.117.84@o2ib1) reconnecting
      Jan 15 13:11:26 hyperion-rst6 kernel: Lustre: lustre-MDT0000: Client 7ae936ba-6abb-4279-4d8d-6075df2b44ca (at 192.168.116.112@o2ib1) refused reconnection, still busy with 1 active RPCs
      

      Requires restart of MDS to clear.

      Attachments

        Issue Links

          Activity

            People

              tappro Mikhail Pershin
              cliffw Cliff White (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: