[LU-9376] Recovery bug exposed during sanity 103b test Created: 20/Apr/17  Updated: 26/Aug/17  Resolved: 13/Aug/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0
Fix Version/s: Lustre 2.11.0

Type: Bug Priority: Critical
Reporter: James A Simmons Assignee: Emoly Liu
Resolution: Fixed Votes: 1
Labels: None
Environment:

The error occurs on the MDS server running RHEL7.3 with ldiskfs while running the sanity test 103b


Attachments: dump.log.gz
Issue Links:
Related
is related to LU-9725 Mount commands don't return for targe... Resolved
Severity: 3

 Description   

After the recent merges I began to see failures in my sanity runs for test 103a. When the test runs, I see the following error on the MDS node:

[ 2272.188019] LDISKFS-fs (dm-0): Mount option "noacl" will be removed by 3.5
Contact linux-ldiskfs@vger.kernel.org if you think we should keep it.

[ 2272.216119] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: noacl,user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[ 2272.627509] Lustre: *** cfs_fail_loc=15b, val=0***
[ 2272.634026] Lustre: Skipped 123 previous similar messages
[ 2272.641089] LustreError: 18834:0:(llog_cat.c:396:llog_cat_current_log()) lustre-OST0000-osc-MDT0000: next log does not exist!
[ 2272.654143] LustreError: 18834:0:(llog_cat.c:396:llog_cat_current_log()) Skipped 62 previous similar messages
[ 2272.665796] LustreError: 18834:0:(osp_sync.c:1439:osp_sync_init()) lustre-OST0000-osc-MDT0000: can't initialize llog: rc = -5
[ 2272.679028] LustreError: 18834:0:(obd_config.c:574:class_setup()) setup lustre-OST0000-osc-MDT0000 failed (-5)
[ 2272.690791] LustreError: 18834:0:(obd_config.c:1709:class_config_llog_handler()) MGC10.37.248.196@o2ib1: cfg command failed: rc = -5
[ 2272.706189] Lustre: cmd=cf003 0:lustre-OST0000-osc-MDT0000 1:lustre-OST0000_UUID 2:10.37.248.198@o2ib1

[ 2272.721032] LustreError: 18834:0:(llog.c:616:llog_process_thread()) Local llog found corrupted
[ 2272.744649] Lustre: lustre-MDT0000: Imperative Recovery not enabled, recovery window 60-180
[ 2273.451269] Lustre: DEBUG MARKER: ninja34: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 6
[ 2274.271991] Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 1 client reconnects
[ 2274.303786] Lustre: lustre-MDT0000: Recovery over after 0:01, of 1 clients 1 recovered and 0 were evicted.
[ 2274.894275] Lustre: Failing over lustre-MDT0000
[ 2275.277776] Lustre: server umount lustre-MDT0000 complete
[ 2277.131077] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[ 2277.430391] Lustre: *** cfs_fail_loc=15b, val=0***
[ 2277.436972] Lustre: Skipped 370 previous similar messages
[ 2277.503475] LustreError: 20275:0:(genops.c:334:class_newdev()) Device lustre-OST0000-osc-MDT0000 already exists at 7, won't add
[ 2277.518507] LustreError: 20275:0:(obd_config.c:366:class_attach()) Cannot create device lustre-OST0000-osc-MDT0000 of type osp : -17
[ 2277.533986] LustreError: 20275:0:(obd_config.c:1709:class_config_llog_handler()) MGC10.37.248.196@o2ib1: cfg command failed: rc = -17
[ 2277.549656] Lustre: cmd=cf001 0:lustre-OST0000-osc-MDT0000 1:osp 2:lustre-MDT0000-mdtlov_UUID

[ 2277.564109] LustreError: 15c-8: MGC10.37.248.196@o2ib1: The configuration from log 'lustre-MDT0000' failed (-17). This may be the result of.
[ 2277.564113] LustreError: 20223:0:(obd_mount_server.c:1351:server_start_targets()) failed to start server lustre-MDT0000: -17
[ 2277.564207] LustreError: 20223:0:(obd_mount_server.c:1840:server_fill_super()) Unable to start targets: -17
[ 2277.564233] Lustre: Failing over lustre-MDT0000
[ 2277.895953] Lustre: server umount lustre-MDT0000 complete
[ 2277.903356] LustreError: 20223:0:(obd_mount.c:1502:lustre_fill_super()) Unable to mount (-17)


 Comments   
Comment by James A Simmons [ 20/Apr/17 ]

I have attached to this ticket an lctl dump from the MDS, taken when the bug occurred.
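
For reference, a dump like this is usually captured with lctl's debug_kernel command; the following is only a sketch of the typical procedure, since the exact debug mask and file names used for dump.log.gz are not recorded in the ticket:

lctl set_param debug=-1           # assumed mask: enable all debug subsystems
lctl debug_kernel /tmp/dump.log   # write the kernel debug buffer to a file
gzip /tmp/dump.log                # compress before attaching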

Comment by Andreas Dilger [ 21/Apr/17 ]

I see in your log snippet:

[ 2277.430391] Lustre: *** cfs_fail_loc=15b, val=0***

but this shouldn't be happening during sanity test_105a. That is left over from sanity test_60e, which is very strange. Are you sure these are the right logs from test_103a? Could you please also include the stdout from the test?

Comment by James A Simmons [ 24/Apr/17 ]

The log I posted is from running sanity.sh from start to finish. I have discovered that running 103b standalone will pass, so some leftover state is causing this strange failure. The stdout from the MDS for the full sanity run is posted in the description; this is why you see leftovers from sanity 63.
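
For context, a single sanity subtest is normally run standalone via the test framework's ONLY variable; a sketch of the two invocations being compared, assuming a stock test setup:

ONLY=103b sh sanity.sh   # run test_103b by itself: passes
sh sanity.sh             # full run from the start: test_103b fails on leftover state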

Comment by Peter Jones [ 25/Apr/17 ]

Emoly

Could you please look into this issue?

Thanks

Peter

Comment by Emoly Liu [ 27/Apr/17 ]

James,
Can you show me the command you use to run this test? We already call reset_fail_loc() to set fail_loc=0 at the end of each test case, on all client nodes and active server nodes:

reset_fail_loc () {
        # clear any fault-injection points a test may have left set
        echo -n "Resetting fail_loc on all nodes..."
        do_nodes $(comma_list $(nodes_list)) "lctl set_param -n fail_loc=0 \
            fail_val=0 2>/dev/null" || true
        echo done.
}
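
For illustration, the pattern an individual test follows around this helper; a minimal sketch, where run_test_body is a hypothetical stand-in for the actual test logic and the facet name is an assumption (0x15b is the fail_loc value seen in the log above):

do_facet mds1 "lctl set_param fail_loc=0x15b"   # inject the fault on the MDS for this test
run_test_body                                   # hypothetical: the body of the test itself
reset_fail_loc                                  # afterwards, clear fail_loc/fail_val on all nodes
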
Comment by James A Simmons [ 03/May/17 ]

I just run llmount.sh and then sanity.sh to reproduce this problem. The setup is a single client with an ldiskfs back end: one MGS, one MDS, and one OSS server, and only the OSS has two disks. I bisected it with git and found the problem has existed for a long time; the only reason I didn't see it before is that I recently changed my test bed configuration, splitting the MGS and MDS onto two different servers. I am collecting and looking at debug logs and will have more info shortly.
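
A condensed sketch of this reproduction, assuming the standard llmount.sh/sanity.sh workflow:

sh llmount.sh   # format and mount the test filesystem (single client, separate MGS and MDS, one OSS)
sh sanity.sh    # run the full suite start to finish; the mount failure above appears on the MDS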

Comment by Andreas Dilger [ 12/May/17 ]

On a related note, the kernel warns that the mount option "noacl" will be removed by 3.5, and we are already at kernel 4.12, so this "noacl" test_103b should probably just be removed completely, and the mention of the noacl option removed from the mount.lustre(8) man page.
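
For reference, this is the option that shows up in the ldiskfs mount line at the top of the ticket; a sketch of the deprecated usage, with the device and mount point as placeholders (the test sets it through the framework rather than a manual mount):

mount -t lustre -o noacl /dev/dm-0 /mnt/mdt   # deprecated: triggers the "will be removed by 3.5" warning above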

James, one thought about why you are seeing this spurious cfs_fail_loc=15b: is your MGS on a separate node, or just on a separate device from the MDS? I'm wondering if for some reason one of the nodes is not being caught by the reset_fail_loc call to nodes="$nodes $(facets_nodes $(get_facets))".
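
One quick way to check whether the reset actually reaches every node, including a standalone MGS, is to query fail_loc everywhere after a test; a sketch using the same framework helpers as reset_fail_loc above:

do_nodes $(comma_list $(nodes_list)) "lctl get_param -n fail_loc"   # should print 0 on every node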

Comment by Gerrit Updater [ 15/May/17 ]

Emoly Liu (emoly.liu@intel.com) uploaded a new patch: https://review.whamcloud.com/27109
Subject: LU-9376 tests: remove sanity.sh test_103b
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7b3ebd654d07b5a7cf9d2951a75d80fc9967fb30

Comment by James A Simmons [ 15/May/17 ]

Yes, my MGS is a separate node from the MDS.

Comment by James A Simmons [ 05/Aug/17 ]

The patch for LU-9725 resolves this bug, but the patch that removes this obsolete test is still outstanding, so don't close this ticket once LU-9725 lands.

Comment by Gerrit Updater [ 13/Aug/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27109/
Subject: LU-9376 tests: remove sanity.sh test_103b
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 30faa618bbc8775595bf25803d06410fe0e67fd6
