[LU-9376] Recovery bug exposed during sanity 103b test

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Lustre 2.11.0
    • Lustre 2.10.0
    • Labels: None
    • Environment: The error occurs on the MDS server running RHEL 7.3 with ldiskfs while running sanity test 103b
    • Severity: 3

    Description

      After the recent merges I began to see failures in sanity test 103b during my testing. While running the test, I see the following error on the MDS node:

      [ 2272.188019] LDISKFS-fs (dm-0): Mount option "noacl" will be removed by 3.5
      Contact linux-ldiskfs@vger.kernel.org if you think we should keep it.
      
      [ 2272.216119] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: noacl,user_xattr,errors=remount-ro,no_mbcache,nodelalloc
      [ 2272.627509] Lustre: *** cfs_fail_loc=15b, val=0***
      [ 2272.634026] Lustre: Skipped 123 previous similar messages
      [ 2272.641089] LustreError: 18834:0:(llog_cat.c:396:llog_cat_current_log()) lustre-OST0000-osc-MDT0000: next log does not exist!
      [ 2272.654143] LustreError: 18834:0:(llog_cat.c:396:llog_cat_current_log()) Skipped 62 previous similar messages
      [ 2272.665796] LustreError: 18834:0:(osp_sync.c:1439:osp_sync_init()) lustre-OST0000-osc-MDT0000: can't initialize llog: rc = -5
      [ 2272.679028] LustreError: 18834:0:(obd_config.c:574:class_setup()) setup lustre-OST0000-osc-MDT0000 failed (-5)
      [ 2272.690791] LustreError: 18834:0:(obd_config.c:1709:class_config_llog_handler()) MGC10.37.248.196@o2ib1: cfg command failed: rc = -5
      [ 2272.706189] Lustre: cmd=cf003 0:lustre-OST0000-osc-MDT0000 1:lustre-OST0000_UUID 2:10.37.248.198@o2ib1
      
      [ 2272.721032] LustreError: 18834:0:(llog.c:616:llog_process_thread()) Local llog found corrupted
      [ 2272.744649] Lustre: lustre-MDT0000: Imperative Recovery not enabled, recovery window 60-180
      [ 2273.451269] Lustre: DEBUG MARKER: ninja34: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 6
      [ 2274.271991] Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 1 client reconnects
      [ 2274.303786] Lustre: lustre-MDT0000: Recovery over after 0:01, of 1 clients 1 recovered and 0 were evicted.
      [ 2274.894275] Lustre: Failing over lustre-MDT0000
      [ 2275.277776] Lustre: server umount lustre-MDT0000 complete
      [ 2277.131077] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
      [ 2277.430391] Lustre: *** cfs_fail_loc=15b, val=0***
      [ 2277.436972] Lustre: Skipped 370 previous similar messages
      [ 2277.503475] LustreError: 20275:0:(genops.c:334:class_newdev()) Device lustre-OST0000-osc-MDT0000 already exists at 7, won't add
      [ 2277.518507] LustreError: 20275:0:(obd_config.c:366:class_attach()) Cannot create device lustre-OST0000-osc-MDT0000 of type osp : -17
      [ 2277.533986] LustreError: 20275:0:(obd_config.c:1709:class_config_llog_handler()) MGC10.37.248.196@o2ib1: cfg command failed: rc = -17
      [ 2277.549656] Lustre: cmd=cf001 0:lustre-OST0000-osc-MDT0000 1:osp 2:lustre-MDT0000-mdtlov_UUID
      
      [ 2277.564109] LustreError: 15c-8: MGC10.37.248.196@o2ib1: The configuration from log 'lustre-MDT0000' failed (-17). This may be the result of.
      [ 2277.564113] LustreError: 20223:0:(obd_mount_server.c:1351:server_start_targets()) failed to start server lustre-MDT0000: -17
      [ 2277.564207] LustreError: 20223:0:(obd_mount_server.c:1840:server_fill_super()) Unable to start targets: -17
      [ 2277.564233] Lustre: Failing over lustre-MDT0000
      [ 2277.895953] Lustre: server umount lustre-MDT0000 complete
      [ 2277.903356] LustreError: 20223:0:(obd_mount.c:1502:lustre_fill_super()) Unable to mount (-17)
      

      Activity

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27109/
            Subject: LU-9376 tests: remove sanity.sh test_103b
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 30faa618bbc8775595bf25803d06410fe0e67fd6

            simmonsja James A Simmons added a comment - edited

            The patch for LU-9725 resolves this bug. A patch still exists that removes this obsolete test, so don't close this ticket when LU-9725 lands.

            simmonsja James A Simmons added a comment -

            Yes, my MGS is a separate node from the MDS.

            gerrit Gerrit Updater added a comment -

            Emoly Liu (emoly.liu@intel.com) uploaded a new patch: https://review.whamcloud.com/27109
            Subject: LU-9376 tests: remove sanity.sh test_103b
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 7b3ebd654d07b5a7cf9d2951a75d80fc9967fb30
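
            For anyone who wants to try the change locally before it lands, a Gerrit change like this can usually be fetched directly. This is only a sketch: it assumes Gerrit's standard refs/changes layout and the anonymous HTTP URL for the fs/lustre-release project, and it targets patch set 1 as listed above.

            # Fetch patch set 1 of change 27109 (refs/changes/<last two digits of
            # change>/<change number>/<patch set>) and check it out for testing.
            git fetch https://review.whamcloud.com/fs/lustre-release \
                    refs/changes/09/27109/1 && git checkout FETCH_HEAD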

            adilger Andreas Dilger added a comment -

            On a related note, the kernel warns that mount option "noacl" will be removed by 3.5, and we are already at kernel 4.12, so this "noacl" test_103b should probably just be removed completely, and the mention of the noacl option removed from mount.lustre.8.

            James, one thought about why you are seeing this spurious cfs_fail_loc=15b: is your MGS on a separate node, or just on a separate device from the MDS? I'm wondering if for some reason one of the nodes is not being caught by the reset_fail_loc call to nodes="$nodes $(facets_nodes $(get_facets))".
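
            A minimal sketch of how such a missed node could be spotted, assuming the fail_loc value can be read back with lctl get_param and that MGS_NODE/MDS_NODE/OSS_NODE are placeholder hostnames (this is not part of sanity.sh):

            # Hypothetical check: confirm fail_loc was actually cleared on every
            # server node, including a standalone MGS.
            for node in $MGS_NODE $MDS_NODE $OSS_NODE; do
                    val=$(ssh "$node" "lctl get_param -n fail_loc 2>/dev/null")
                    echo "$node: fail_loc=${val:-unknown}"
                    [ "${val:-0}" = "0" ] || echo "WARNING: fail_loc still set on $node"
            done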

            simmonsja James A Simmons added a comment -

            I just run llmount.sh and then sanity.sh to reproduce this problem. The setup is a single client with an ldiskfs backend, one MGS, one MDS, and one OSS server; only the OSS server has two disks. I did a git bisect and found that the problem has existed for a long time. The only reason I didn't see it before is that I recently changed my test bed configuration and split the MGS and MDS onto two different servers. I'm collecting and looking at debug logs and will have more info shortly.
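
            A rough sketch of that reproduction, assuming a built lustre-release tree and a node configuration already described by the usual cfg file (the exact config file name and hostnames are site-specific):

            # Illustrative reproduction of the failure described above.
            cd lustre-release/lustre/tests
            sh llmount.sh                  # format and mount the test filesystem
            ONLY=103b sh sanity.sh         # run just the noacl test that trips the bug
            # then check the MDS console/dmesg for the llog and -17 errors shown above
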
            emoly.liu Emoly Liu added a comment -

            James,
            Can you show me the command you use to run this test? We already call reset_fail_loc() at the end of each test case to set fail_loc=0 on all client nodes and active server nodes:

            reset_fail_loc () {
                    echo -n "Resetting fail_loc on all nodes..."
                    do_nodes $(comma_list $(nodes_list)) "lctl set_param -n fail_loc=0 \
                        fail_val=0 2>/dev/null" || true
                    echo done.
            }
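
            For illustration, assuming the node list expands to the single client plus the MGS, MDS, and OSS facets, the do_nodes call above is roughly equivalent to running the following by hand (hostnames are placeholders, and pdsh is assumed to be available):

            # Rough manual equivalent of reset_fail_loc() across all test nodes.
            pdsh -w client1,mgs1,mds1,oss1 \
                    "lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null"
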
            pjones Peter Jones added a comment -

            Emoly,

            Could you please look into this issue?

            Thanks,
            Peter

            People

              Assignee: emoly.liu Emoly Liu
              Reporter: simmonsja James A Simmons
              Votes: 1
              Watchers: 7
