[LU-6441] OST problems following router node crash, inactive threads, clients continuously reconnecting

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.8.0, Lustre 2.9.0
    • Affects Version/s: Lustre 2.5.1
    • Environment: Servers run Lustre 2.5.1, clients run Lustre 2.5.1
    • Severity: 3

    Description

      As part of acceptance testing, an LNet router node was deliberately crashed (via sysrq-trigger). Following the crash, a set of OST nodes started reporting problems: hung threads, timeouts, and clients continually losing their connection and reconnecting. The nodes are stuck in the ptlrpc_abort_bulk() function.

      The logs show messages like this:

      Feb 25 21:17:15 somenode kernel: LustreError: 32278:0:(service.c:3214:ptlrpc_svcpt_health_check()) ost_io: unhealthy - request has been waiting 16940s
      Feb 25 21:17:15 somenode kernel: LustreError: 32278:0:(service.c:3214:ptlrpc_svcpt_health_check()) Skipped 5 previous similar messages
      Feb 25 21:20:31 somenode kernel: Lustre: 106418:0:(niobuf.c:282:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc ffff8802e3970000
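
A minimal sketch of how lines like these could be scanned for stuck services when triaging a larger syslog. The regular expression is written only against the sample lines quoted above, and the function name `find_stuck_services` is hypothetical; other Lustre versions may format these messages differently.

```python
import re

# Matches the ptlrpc_svcpt_health_check() "unhealthy" lines quoted above.
# Assumption: the message layout is exactly as in the sample log lines.
UNHEALTHY_RE = re.compile(
    r"ptlrpc_svcpt_health_check\(\)\)\s+(?P<service>\S+): unhealthy - "
    r"request has been waiting (?P<seconds>\d+)s"
)


def find_stuck_services(lines):
    """Return (service, seconds_waiting) for every 'unhealthy' line."""
    hits = []
    for line in lines:
        match = UNHEALTHY_RE.search(line)
        if match:
            hits.append((match.group("service"), int(match.group("seconds"))))
    return hits


# The three log lines from the description, verbatim.
SAMPLE = [
    "Feb 25 21:17:15 somenode kernel: LustreError: 32278:0:(service.c:3214:ptlrpc_svcpt_health_check()) ost_io: unhealthy - request has been waiting 16940s",
    "Feb 25 21:17:15 somenode kernel: LustreError: 32278:0:(service.c:3214:ptlrpc_svcpt_health_check()) Skipped 5 previous similar messages",
    "Feb 25 21:20:31 somenode kernel: Lustre: 106418:0:(niobuf.c:282:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc ffff8802e3970000",
]

if __name__ == "__main__":
    for service, seconds in find_stuck_services(SAMPLE):
        print(f"{service}: waiting {seconds}s ({seconds / 3600:.1f}h)")
```

Only the first sample line matches; the "Skipped" and ptlrpc_abort_bulk() lines are ignored by design.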
      

      The system uses a fine-grained routing configuration. One of the 5 routers in this set was killed. The expectation is minimal disruption for clients: the other 4 routers are still functioning, so clients should detect that one is down and send traffic to the remaining 4.
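
The expected behaviour can be illustrated with a toy model. This is only a sketch of the expectation stated above, not how LNet selects routers; the class `RouterPool`, its methods, and the router names are hypothetical.

```python
from itertools import cycle


class RouterPool:
    """Toy model: round-robin over routers that are currently marked up."""

    def __init__(self, routers):
        self.alive = {router: True for router in routers}
        self._order = cycle(routers)
        self._count = len(routers)

    def mark_down(self, router):
        """Record that a router has been detected as dead."""
        self.alive[router] = False

    def next_router(self):
        """Return the next live router, skipping any marked down."""
        for _ in range(self._count):
            candidate = next(self._order)
            if self.alive[candidate]:
                return candidate
        raise RuntimeError("no live routers")


if __name__ == "__main__":
    # Hypothetical names for the 5 routers in the set.
    pool = RouterPool(["rtr1", "rtr2", "rtr3", "rtr4", "rtr5"])
    pool.mark_down("rtr3")  # the deliberately crashed router
    print([pool.next_router() for _ in range(8)])
```

In this model traffic keeps flowing through the 4 surviving routers once the dead one is marked down, which is the behaviour the acceptance test was expecting.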

          Activity

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22403/
            Subject: LU-6441 ptlrpc: fix sanity 224c for different RPC sizes
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 6cde14a5df781ae29da88f98a2559eb4342fe1f3

            pjones Peter Jones added a comment -

            Jinshan

            It would be better to open a new ticket and link to this one with any changes needed to the original patch

            Peter


            gerrit Gerrit Updater added a comment -

            Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/22403
            Subject: LU-6441 ptlrpc: fix the problem of the patch
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 6bfc8faac211e5c2dd5a369ac23438565d3d16c0

            pjones Peter Jones added a comment -

            Landed for 2.8


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14399/
            Subject: LU-6441 ptlrpc: ptlrpc_bulk_abort unlink all entries in bd_mds
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 0a6470219a8602d7a56fe1c5171dba4a42244738

            icostelloddn Ian Costello added a comment -

            Agreed regarding 4MB RPCs. It also doesn't require an LNET router crash/panic: it is easy to reproduce with a similar trigger, such as pulling IB cables (or whatever network you are using) on the clients while they are doing I/O to the filesystem.

            I can confirm that patching the server with the above patch resolves the problem. I did this on site at ANU/NCI on the available test kit: I was able to reproduce the problem, patched and installed the server, then spent a day and a half trying to reproduce it again (on a server without the patch I could hit it in 2 out of 3 attempts).

            artem_blagodarenko Artem Blagodarenko (Inactive) added a comment -

            This bug occurs when 4MB I/O is enabled. We have already seen it on two clusters. I believe this patch is important for anyone who is going to use 4MB I/O.
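
To illustrate why the transfer size could matter, here is a toy model based only on the patch subject ("unlink all entries in bd_mds"). It is not the Lustre code. The 1MB per-entry limit, the class `BulkDesc`, and both abort functions are assumptions made for illustration; the thread does not state the exact shape of the original bug.

```python
from dataclasses import dataclass, field

# Assumption for illustration: one descriptor entry covers at most 1MB.
CHUNK_BYTES = 1 << 20


@dataclass
class BulkDesc:
    """Toy stand-in for a bulk descriptor with one entry per chunk."""

    nbytes: int
    linked: list = field(default_factory=list)

    def __post_init__(self):
        chunks = -(-self.nbytes // CHUNK_BYTES)  # ceiling division
        self.linked = [True] * chunks

    def still_active(self):
        """True while any entry remains linked (the abort would keep waiting)."""
        return any(self.linked)


def abort_partial(desc, count=1):
    """Hypothetical faulty abort: unlinks only the first `count` entries."""
    for i in range(min(count, len(desc.linked))):
        desc.linked[i] = False


def abort_all(desc):
    """Abort in the spirit of the patch subject: unlink every entry."""
    for i in range(len(desc.linked)):
        desc.linked[i] = False


if __name__ == "__main__":
    small, large = BulkDesc(1 << 20), BulkDesc(4 << 20)
    abort_partial(small)
    abort_partial(large)
    print("1MB still active:", small.still_active())
    print("4MB still active:", large.still_active())
    abort_all(large)
    print("4MB after abort_all:", large.still_active())
```

In this model a single-entry transfer is unaffected by a partial unlink, while a multi-entry transfer stays active until every entry is unlinked, which is consistent with the reports that the problem shows up with 4MB I/O.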

            gerrit Gerrit Updater added a comment -

            Artem Blagodarenko (artem_blagodarenko@xyratex.com) uploaded a new patch: http://review.whamcloud.com/14399
            Subject: LU-6441 ptlrpc: ptlrpc_bulk_abort unlink all entries in bd_mds
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 0feb07fa9852dfeb8fa68ea80587cb7f11a7ab04

            People

              Assignee: emoly.liu Emoly Liu
              Reporter: artem_blagodarenko Artem Blagodarenko (Inactive)