
sanity-lfsck test_26a: only 3 of 4 MDTs are in completed

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.16.0
    • Lustre 2.16.0
    • None
    • 3

    Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/e80cc085-ac08-4f47-b354-22551a7da132

      test_26a failed with the following error:

      (7) only 3 of 4 MDTs are in completed
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-master-patchless/840 - 4.18.0-425.10.1.el8_7.x86_64
      servers: https://build.whamcloud.com/job/lustre-master-patchless/840 - 4.18.0-425.10.1.el8_7.x86_64


      First started on 2023-12-20 for full runs, may be related to recent patch landing.

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity-lfsck test_26a - (7) only 3 of 4 MDTs are in completed

          Activity

            gerrit Gerrit Updater added a comment - - edited

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53545
            Subject: LU-17385 revert: LU-16826 tests: lfsck to repair a dangling remote entry
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: fb6c848ef816ecb17f02ac461c2352ced320c593

            adilger Andreas Dilger added a comment -

            This is failing 22/62 runs since the LU-16826 test case landed. I don't see anything obvious in the test logs, like an MDT reconnecting in test_26/test_27 after it was stopped/started in test_23d, so I added some more debugging to see why this is failing.

            I'll also push a revert of the patch that added test_23d and confirm that this stops the problem from being hit, and we'll have it ready if there is no quick solution.

            gerrit Gerrit Updater added a comment -

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53544
            Subject: LU-17385 tests: add sanity-lfsck/24 debugging
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 94f62d0d5bea764b3b0287662384a524283dd419

            adilger Andreas Dilger added a comment - - edited

            It looks like this test failure was introduced by patch https://review.whamcloud.com/50998 "LU-16826 tests: lfsck to repair a dangling remote entry" landing on 2023-12-20 which added sanity-lfsck.sh test_23d, but used:

            Test-Parameters: trivial testlist=sanity-lfsck ... env=ONLY=23d
            

            so it is likely leaving the filesystem in a bad state after test_23d finished and this causes test_24 and test_26a to also fail.
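
            The concern above is that a patch validated with env=ONLY=23d exercises the new subtest in isolation, so the subtests that normally run after it (test_24, test_26a) were not run against whatever state test_23d leaves behind. As a rough illustration only, the sketch below extracts the ONLY list from a Test-Parameters line. The function name only_filter is hypothetical, and the parsing rules (whitespace-separated tokens, comma-separated assignments inside env=, no quoted values) are assumptions, not the behaviour of the real test framework.

```python
from typing import List


def only_filter(line: str) -> List[str]:
    """Return the subtests named by env=ONLY=... in a Test-Parameters line.

    Returns an empty list when the line is not a Test-Parameters line or
    does not restrict the run with ONLY. Quoted, space-separated ONLY
    values are not handled by this simplified sketch.
    """
    prefix = "Test-Parameters:"
    if not line.startswith(prefix):
        return []
    for token in line[len(prefix):].split():
        if not token.startswith("env="):
            continue
        # Assumed format: env=VAR1=val1,VAR2=val2
        for assignment in token[len("env="):].split(","):
            key, _, value = assignment.partition("=")
            if key == "ONLY":
                return value.split()
    return []


# The line quoted in the comment above, including its elided "..." token.
LINE = "Test-Parameters: trivial testlist=sanity-lfsck ... env=ONLY=23d"
restricted_to = only_filter(LINE)
if restricted_to:
    print("pre-landing run covered only:", ", ".join(restricted_to))
```

            A check like this could flag patches that add a subtest but never run the rest of the suite after it; whether that is worth enforcing is a judgment call for the project.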


            gerrit Gerrit Updater added a comment -

            "Hongchao Zhang <hongchao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53530
            Subject: EX-8860 lfsck: debug patch
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 17251801b1cf5516132edebd6677e2f34fcbc61c

            hongchao.zhang Hongchao Zhang added a comment - - edited

            The MDT failed to process one LFSCK_NOTIFY request, but strangely other LFSCK_NOTIFY requests succeeded:

            00000100:00100000:1.0:1703075647.451663:0:1223156:0:(service.c:2333:ptlrpc_server_handle_request()) Handling RPC req@00000000a373e95a pname:cluuid+ref:pid:xid:nid:opc:job mdt_out00_003:lustre-MDT0000-mdtlov_UUID+5:1744723:x1785766154025664:12345-10.240.26.106@tcp:1101:lctl.0
            00000100:00100000:0.0:1703075647.451664:0:1215944:0:(nrs_fifo.c:179:nrs_fifo_req_get()) NRS start fifo request from 12345-10.240.26.106@tcp, seq: 1159
            00000100:00100000:0.0:1703075647.451667:0:1215944:0:(service.c:2333:ptlrpc_server_handle_request()) Handling RPC req@00000000617d7544 pname:cluuid+ref:pid:xid:nid:opc:job mdt_out00_001:lustre-MDT0000-mdtlov_UUID+5:1744723:x1785766154025792:12345-10.240.26.106@tcp:1101:lctl.0
            00000100:00100000:1.0:1703075647.451692:0:1223156:0:(service.c:2382:ptlrpc_server_handle_request()) Handled RPC req@00000000a373e95a pname:cluuid+ref:pid:xid:nid:opc:job mdt_out00_003:lustre-MDT0000-mdtlov_UUID+5:1744723:x1785766154025664:12345-10.240.26.106@tcp:1101:lctl.0 Request processed in 29us (98us total) trans 0 rc -95/-95
            00000100:00100000:1.0:1703075647.451695:0:1223156:0:(nrs_fifo.c:241:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.240.26.106@tcp, seq: 1158
            00100000:10000000:0.0:1703075647.451784:0:1215944:0:(lfsck_lib.c:2707:lfsck_load_one_trace_file()) lustre-MDT0003-osd: unlink lfsck sub trace file lfsck_namespace_01: rc = 0
            

            The request (xid = x1785766154025664) failed immediately with rc -95 (-EOPNOTSUPP on Linux), but the similar request (xid = x1785766154025792) succeeded.

            I will create a debug patch to collect the logs.
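
            To make the comparison above easier to repeat on a larger debug log, here is a minimal sketch that pulls the xid, opcode and return code out of "Handled RPC" lines in the format shown in the excerpt. The names HandledRpc, parse_handled and describe are hypothetical helpers, the regular expression is derived only from the lines quoted here, and the errno table contains just the one Linux value relevant to this failure.

```python
import re
from typing import NamedTuple, Optional

# Linux errno value seen in the excerpt; extend as needed.
ERRNO_NAMES = {95: "EOPNOTSUPP"}


class HandledRpc(NamedTuple):
    thread: str      # service thread name, e.g. mdt_out00_003
    xid: str         # request xid, e.g. x1785766154025664
    opc: int         # RPC opcode
    rc: int          # return code reported by the handler
    service_us: int  # "Request processed in N us"


HANDLED_RE = re.compile(
    r"Handled RPC \S+ \S+ (?P<desc>\S+) "
    r"Request processed in (?P<us>\d+)us \(\d+us total\) "
    r"trans \d+ rc (?P<rc>-?\d+)/-?\d+"
)


def parse_handled(line: str) -> Optional[HandledRpc]:
    """Parse one 'Handled RPC' debug line; return None for any other line."""
    m = HANDLED_RE.search(line)
    if m is None:
        return None
    # Fields follow the header "pname:cluuid+ref:pid:xid:nid:opc:job".
    parts = m.group("desc").split(":")
    if len(parts) != 7:  # e.g. a NID containing ':' is not handled here
        return None
    pname, _cluuid, _pid, xid, _nid, opc, _job = parts
    return HandledRpc(pname, xid, int(opc), int(m.group("rc")), int(m.group("us")))


def describe(rpc: HandledRpc) -> str:
    """Render a one-line summary, naming the errno when it is known."""
    if rpc.rc == 0:
        return f"{rpc.xid} opc={rpc.opc} ok"
    name = ERRNO_NAMES.get(-rpc.rc, f"errno {-rpc.rc}")
    return f"{rpc.xid} opc={rpc.opc} rc={rpc.rc} ({name})"


# The failing line from the excerpt above, copied verbatim.
SAMPLE = (
    "00000100:00100000:1.0:1703075647.451692:0:1223156:0:"
    "(service.c:2382:ptlrpc_server_handle_request()) Handled RPC "
    "req@00000000a373e95a pname:cluuid+ref:pid:xid:nid:opc:job "
    "mdt_out00_003:lustre-MDT0000-mdtlov_UUID+5:1744723:x1785766154025664:"
    "12345-10.240.26.106@tcp:1101:lctl.0 "
    "Request processed in 29us (98us total) trans 0 rc -95/-95"
)

rpc = parse_handled(SAMPLE)
if rpc is not None:
    print(describe(rpc))
```

            Running this over a full log and filtering on a non-zero rc would list every request that failed the same way. Note that the excerpt contains no "Handled RPC" line for xid x1785766154025792, so its success cannot be confirmed from these lines alone.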

            pjones Peter Jones added a comment -

            Hongchao

            This seems to have started failing only very recently. Can you identify which change introduced this issue?

            Thanks

            Peter


            People

              zam Alexander Zarochentsev
              maloo Maloo