
[LU-9376] Recovery bug exposed during sanity 103b test

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.11.0
    • Affects Version/s: Lustre 2.10.0
    • Labels: None
    • Environment: The error occurs on the MDS server, running RHEL 7.3 with ldiskfs, while running sanity test 103b
    • Severity: 3

    Description

      After the recent merges I began to see failures during testing in sanity test 103a. With this test I see the following error on the MDS node:

      [ 2272.188019] LDISKFS-fs (dm-0): Mount option "noacl" will be removed by 3.5
      Contact linux-ldiskfs@vger.kernel.org if you think we should keep it.
      
      [ 2272.216119] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: noacl,user_xattr,errors=remount-ro,no_mbcache,nodelalloc
      [ 2272.627509] Lustre: *** cfs_fail_loc=15b, val=0***
      [ 2272.634026] Lustre: Skipped 123 previous similar messages
      [ 2272.641089] LustreError: 18834:0:(llog_cat.c:396:llog_cat_current_log()) lustre-OST0000-osc-MDT0000: next log does not exist!
      [ 2272.654143] LustreError: 18834:0:(llog_cat.c:396:llog_cat_current_log()) Skipped 62 previous similar messages
      [ 2272.665796] LustreError: 18834:0:(osp_sync.c:1439:osp_sync_init()) lustre-OST0000-osc-MDT0000: can't initialize llog: rc = -5
      [ 2272.679028] LustreError: 18834:0:(obd_config.c:574:class_setup()) setup lustre-OST0000-osc-MDT0000 failed (-5)
      [ 2272.690791] LustreError: 18834:0:(obd_config.c:1709:class_config_llog_handler()) MGC10.37.248.196@o2ib1: cfg command failed: rc = -5
      [ 2272.706189] Lustre: cmd=cf003 0:lustre-OST0000-osc-MDT0000 1:lustre-OST0000_UUID 2:10.37.248.198@o2ib1
      
      [ 2272.721032] LustreError: 18834:0:(llog.c:616:llog_process_thread()) Local llog found corrupted
      [ 2272.744649] Lustre: lustre-MDT0000: Imperative Recovery not enabled, recovery window 60-180
      [ 2273.451269] Lustre: DEBUG MARKER: ninja34: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 6
      [ 2274.271991] Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 1 client reconnects
      [ 2274.303786] Lustre: lustre-MDT0000: Recovery over after 0:01, of 1 clients 1 recovered and 0 were evicted.
      [ 2274.894275] Lustre: Failing over lustre-MDT0000
      [ 2275.277776] Lustre: server umount lustre-MDT0000 complete
      [ 2277.131077] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
      [ 2277.430391] Lustre: *** cfs_fail_loc=15b, val=0***
      [ 2277.436972] Lustre: Skipped 370 previous similar messages
      [ 2277.503475] LustreError: 20275:0:(genops.c:334:class_newdev()) Device lustre-OST0000-osc-MDT0000 already exists at 7, won't add
      [ 2277.518507] LustreError: 20275:0:(obd_config.c:366:class_attach()) Cannot create device lustre-OST0000-osc-MDT0000 of type osp : -17
      [ 2277.533986] LustreError: 20275:0:(obd_config.c:1709:class_config_llog_handler()) MGC10.37.248.196@o2ib1: cfg command failed: rc = -17
      [ 2277.549656] Lustre: cmd=cf001 0:lustre-OST0000-osc-MDT0000 1:osp 2:lustre-MDT0000-mdtlov_UUID
      
      [ 2277.564109] LustreError: 15c-8: MGC10.37.248.196@o2ib1: The configuration from log 'lustre-MDT0000' failed (-17). This may be the result of.
      [ 2277.564113] LustreError: 20223:0:(obd_mount_server.c:1351:server_start_targets()) failed to start server lustre-MDT0000: -17
      [ 2277.564207] LustreError: 20223:0:(obd_mount_server.c:1840:server_fill_super()) Unable to start targets: -17
      [ 2277.564233] Lustre: Failing over lustre-MDT0000
      [ 2277.895953] Lustre: server umount lustre-MDT0000 complete
      [ 2277.903356] LustreError: 20223:0:(obd_mount.c:1502:lustre_fill_super()) Unable to mount (-17)
      

    Activity

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27109/
            Subject: LU-9376 tests: remove sanity.sh test_103b
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 30faa618bbc8775595bf25803d06410fe0e67fd6

            simmonsja James A Simmons added a comment - - edited

            The patch for LU-9725 resolves this bug. A patch that removes this obsolete test still exists, so don't close this ticket once LU-9725 lands.


            simmonsja James A Simmons added a comment -

            Yes, my MGS is a separate node from the MDS.


            gerrit Gerrit Updater added a comment -

            Emoly Liu (emoly.liu@intel.com) uploaded a new patch: https://review.whamcloud.com/27109
            Subject: LU-9376 tests: remove sanity.sh test_103b
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 7b3ebd654d07b5a7cf9d2951a75d80fc9967fb30


            adilger Andreas Dilger added a comment -

            On a related note, the "Mount option "noacl" will be removed by 3.5" message refers to kernel 3.5, and we are already at kernel 4.12, so this "noacl" test_103b should probably just be removed completely, and the mention of the noacl option removed from mount.lustre.8.

            James, one thought about why you are seeing this spurious cfs_fail_loc=15b: is your MGS on a separate node, or just on a separate device from the MDS? I'm wondering if for some reason one of the nodes is not being caught by the reset_fail_loc call to nodes="$nodes $(facets_nodes $(get_facets))".
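
            A quick way to check that theory is to print the node list the framework will actually reset, then read fail_loc on the MGS directly. This is only a sketch: it assumes test-framework.sh is in the current directory and "mgs-node" stands in for the real MGS hostname.

            . ./test-framework.sh      # provides nodes_list, comma_list, do_node, ...
            init_test_env $@

            # The set of nodes reset_fail_loc will touch; the MGS should appear here.
            echo "reset targets: $(comma_list $(nodes_list))"

            # If the MGS hostname is missing above, its fail_loc is never cleared.
            do_node mgs-node "lctl get_param fail_loc"    # expect fail_loc=0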


            simmonsja James A Simmons added a comment -

            I just do a llmount.sh and then run sanity.sh to reproduce this problem. The setup is a single client with an ldiskfs back end: one MGS, one MDS, and one OSS server. Only the OSS server has 2 disks. I did git bisect it and found the problem has existed for a long time. The only reason I didn't see it before is that I recently changed my test bed configuration and split the MGS and MDS onto two different servers. I am collecting and looking at debug logs and will have more info shortly.
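
            For reference, the sequence described above is roughly the following (a sketch; it assumes you run from lustre/tests with a local config describing the split MGS/MDS setup):

            cd lustre/tests
            sh llmount.sh     # format and mount: one client, separate MGS and MDS, one OSS
            sh sanity.sh      # full run; 103b only fails after the earlier tests have run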

            emoly.liu Emoly Liu added a comment -

            James,
            Can you show me the command you use to run this test? We already use reset_fail_loc() to set fail_loc=0 at the end of each test case on all client nodes and active server nodes:

            reset_fail_loc () {
                    echo -n "Resetting fail_loc on all nodes..."
                    do_nodes $(comma_list $(nodes_list)) "lctl set_param -n fail_loc=0 \
                        fail_val=0 2>/dev/null" || true
                    echo done.
            }
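
            For comparison, the same reset can be done by hand outside the framework (a sketch; the hostnames are placeholders, not from this ticket):

            # Manually clear the fail points on every node in the setup.
            for node in client1 mds1 mgs1 oss1; do
                    ssh "$node" "lctl set_param -n fail_loc=0 fail_val=0" || true
            done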
            
            pjones Peter Jones added a comment -

            Emoly

            Could you please look into this issue?

            Thanks

            Peter

            simmonsja James A Simmons added a comment - - edited

            The log I posted is from running sanity.sh from start to finish. I have discovered that running 103b standalone will pass, so some leftover state is causing this strange failure. The stdout from the MDS for the full sanity run is posted in the description; this is why you see sanity 63 leftovers.
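
            One way to confirm the leftover-state theory (a sketch, assuming sanity.sh's standard ONLY environment variable):

            ONLY=103b sh sanity.sh    # passes when run in isolation
            sh sanity.sh              # full run fails, with the stale fail_loc=0x15b visible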


            adilger Andreas Dilger added a comment -

            I see in your log snippet:

            [ 2277.430391] Lustre: *** cfs_fail_loc=15b, val=0***
            

            but this shouldn't be happening during sanity test_105a. That is left over from sanity test_60e, which is very strange. Are you sure these are the right logs from test_103a? Could you please also include the stdout output from the test?
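
            One way to capture both the stdout and the MDS-side logs for the failing run (a sketch; output paths are placeholders):

            lctl set_param debug=-1                      # on the MDS: enable all debug flags
            sh sanity.sh 2>&1 | tee /tmp/sanity.stdout   # on the client: keep the stdout
            lctl dk > /tmp/sanity.debug                  # on the MDS: dump the debug buffer
            dmesg > /tmp/sanity.dmesg                    # console messages like those above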


            People

              Assignee: emoly.liu Emoly Liu
              Reporter: simmonsja James A Simmons
              Votes: 1
              Watchers: 7
