
OST problems following router node crash, inactive threads, clients continuously reconnecting

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.8.0, Lustre 2.9.0
    • Affects Version/s: Lustre 2.5.1
    • Environment: Servers run Lustre 2.5.1, clients run Lustre 2.5.1
    • Severity: 3

    Description

      As part of acceptance testing, an LNet router node was deliberately crashed (via sysrq-trigger). Following the crash, a set of OST nodes started reporting problems: hung threads, timeouts, and clients continually losing their connection and reconnecting. The affected nodes are stuck in the ptlrpc_abort_bulk() function.

      The logs show messages like this:

      Feb 25 21:17:15 somenode kernel: LustreError: 32278:0:(service.c:3214:ptlrpc_svcpt_health_check()) ost_io: unhealthy - request has been waiting 16940s
      Feb 25 21:17:15 somenode kernel: LustreError: 32278:0:(service.c:3214:ptlrpc_svcpt_health_check()) Skipped 5 previous similar messages
      Feb 25 21:20:31 somenode kernel: Lustre: 106418:0:(niobuf.c:282:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc ffff8802e3970000
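When triaging logs like the ones above, it can help to pull the service name and wait time out of the "unhealthy" lines. The following is a minimal sketch, not part of Lustre: the regular expression is an assumption based only on the message format shown in this excerpt, and `parse_unhealthy` is a hypothetical helper name.

```python
import re

# Assumed pattern, derived only from the "unhealthy" line quoted in this ticket.
UNHEALTHY_RE = re.compile(
    r"ptlrpc_svcpt_health_check\(\)\)\s+(?P<service>\w+): unhealthy - "
    r"request has been waiting (?P<seconds>\d+)s"
)


def parse_unhealthy(line):
    """Return (service, waiting_seconds), or None if the line does not match."""
    m = UNHEALTHY_RE.search(line)
    if m is None:
        return None
    return m.group("service"), int(m.group("seconds"))


line = (
    "Feb 25 21:17:15 somenode kernel: LustreError: 32278:0:"
    "(service.c:3214:ptlrpc_svcpt_health_check()) ost_io: unhealthy - "
    "request has been waiting 16940s"
)
result = parse_unhealthy(line)
print(result)
```

For the first log line this yields the service `ost_io` and a wait of 16940 seconds, which is roughly 4.7 hours. Lines such as "Skipped 5 previous similar messages" do not match and return `None`.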
      

      The system uses a fine-grained routing configuration. The router that was killed was one of the 5 routers in this set. The expectation is minimal disruption for clients: the other 4 routers are still functioning, so clients should detect that one router is down and send traffic through the remaining 4.
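The expected behaviour described above can be illustrated with a small model. This is only a sketch of the expectation, not LNet's actual router selection logic: the router names, the `alive` map, and the `pick_routers` helper are all hypothetical, and plain round-robin is an assumption made for simplicity.

```python
from itertools import cycle


def pick_routers(routers, alive, n_messages):
    """Spread n_messages round-robin over the routers currently marked alive."""
    usable = [r for r in routers if alive.get(r, False)]
    if not usable:
        raise RuntimeError("no usable router")
    rr = cycle(usable)
    return [next(rr) for _ in range(n_messages)]


# Hypothetical set of 5 routers, one of which was deliberately crashed.
routers = ["rtr1", "rtr2", "rtr3", "rtr4", "rtr5"]
alive = {r: True for r in routers}
alive["rtr3"] = False

assignment = pick_routers(routers, alive, 8)
print(assignment)
```

In this model no message is assigned to the crashed router once it is marked down, and the load is shared by the other 4. The problem reported here is that the real system did not recover this cleanly.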

          Activity

            [LU-6441] OST problems following router node crash, inactive threads, clients continuously reconnecting
            cfaber Colin Faber made changes -
            Link New: This issue duplicates DDN-2885 [ DDN-2885 ]
            adilger Andreas Dilger made changes -
            Link New: This issue is blocking DELL-394 [ DELL-394 ]
            jgmitter Joseph Gmitter (Inactive) made changes -
            Fix Version/s New: Lustre 2.9.0 [ 11891 ]

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22403/
            Subject: LU-6441 ptlrpc: fix sanity 224c for different RPC sizes
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 6cde14a5df781ae29da88f98a2559eb4342fe1f3
            pjones Peter Jones added a comment -

            Jinshan

            It would be better to open a new ticket, linked to this one, for any changes needed to the original patch.

            Peter


            gerrit Gerrit Updater added a comment -

            Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/22403
            Subject: LU-6441 ptlrpc: fix the problem of the patch
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 6bfc8faac211e5c2dd5a369ac23438565d3d16c0
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-6808 [ LU-6808 ]
            pjones Peter Jones made changes -
            Link Original: This issue is related to LDEV-44 [ LDEV-44 ]
            pjones Peter Jones made changes -
            Link New: This issue is related to LDEV-45 [ LDEV-45 ]
            pjones Peter Jones made changes -
            Link Original: This issue is related to LDEV-36 [ LDEV-36 ]

            People

              emoly.liu Emoly Liu
              artem_blagodarenko Artem Blagodarenko (Inactive)
              Votes: 1
              Watchers: 9
