Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-6441

OST problems following router node crash, inactive threads, clients continuously reconnecting

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.8.0, Lustre 2.9.0
    • Lustre 2.5.1
    • Servers runs 2.5.1, clients run Lustre 2.5.1
    • 3
    • 9223372036854775807

    Description

      As part of acceptance testing, an lnet router node was deliberately crashed (via sysrq-trigger). Following the crash, a set of OSTS nodes started reporting problems, hung threads, timeouts, clients continually losing connection, reconnecting, etc. Nodes are held on ptlrpc_abort_bulk() function.

      Logs shows messages like this:

      Feb 25 21:17:15 somenode kernel: LustreError: 32278:0:(service.c:3214:ptlrpc_svcpt_health_check()) ost_io: unhealthy - request has been waiting 16940s
      Feb 25 21:17:15 somenode kernel: LustreError: 32278:0:(service.c:3214:ptlrpc_svcpt_health_check()) Skipped 5 previous similar messages
      Feb 25 21:20:31 somenode kernel: Lustre: 106418:0:(niobuf.c:282:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc ffff8802e3970000
      

      The system uses a fine-grained routing configuration. It was one of the 5 routers in this set that was killed. The expectation is minimal disruption for clients, the other 4 routers are still functioning, clients should detect that one is down and send traffic to the other 4 nodes.

      Attachments

        Issue Links

          Activity

            People

              emoly.liu Emoly Liu
              artem_blagodarenko Artem Blagodarenko (Inactive)
              Votes:
              1 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: