[LU-9376] Recovery bug exposed during sanity 103b test Created: 20/Apr/17 Updated: 26/Aug/17 Resolved: 13/Aug/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0 |
| Fix Version/s: | Lustre 2.11.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | James A Simmons | Assignee: | Emoly Liu |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None | ||
| Environment: |
The error occurs on the MDS server running RHEL7.3 with ldiskfs while running the sanity test 103b |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
After the recent merges I began to see failures during testing in my sanity run for test 103a. With the test I see the following error on the MDS node:

[ 2272.188019] LDISKFS-fs (dm-0): Mount option "noacl" will be removed by 3.5 Contact linux-ldiskfs@vger.kernel.org if you think we should keep it.
[ 2272.216119] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: noacl,user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[ 2272.627509] Lustre: *** cfs_fail_loc=15b, val=0***
[ 2272.634026] Lustre: Skipped 123 previous similar messages
[ 2272.641089] LustreError: 18834:0:(llog_cat.c:396:llog_cat_current_log()) lustre-OST0000-osc-MDT0000: next log does not exist!
[ 2272.654143] LustreError: 18834:0:(llog_cat.c:396:llog_cat_current_log()) Skipped 62 previous similar messages
[ 2272.665796] LustreError: 18834:0:(osp_sync.c:1439:osp_sync_init()) lustre-OST0000-osc-MDT0000: can't initialize llog: rc = -5
[ 2272.679028] LustreError: 18834:0:(obd_config.c:574:class_setup()) setup lustre-OST0000-osc-MDT0000 failed (-5)
[ 2272.690791] LustreError: 18834:0:(obd_config.c:1709:class_config_llog_handler()) MGC10.37.248.196@o2ib1: cfg command failed: rc = -5
[ 2272.706189] Lustre: cmd=cf003 0:lustre-OST0000-osc-MDT0000 1:lustre-OST0000_UUID 2:10.37.248.198@o2ib1
[ 2272.721032] LustreError: 18834:0:(llog.c:616:llog_process_thread()) Local llog found corrupted
[ 2272.744649] Lustre: lustre-MDT0000: Imperative Recovery not enabled, recovery window 60-180
[ 2273.451269] Lustre: DEBUG MARKER: ninja34: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 6
[ 2274.271991] Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 1 client reconnects
[ 2274.303786] Lustre: lustre-MDT0000: Recovery over after 0:01, of 1 clients 1 recovered and 0 were evicted.
[ 2274.894275] Lustre: Failing over lustre-MDT0000
[ 2275.277776] Lustre: server umount lustre-MDT0000 complete
[ 2277.131077] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[ 2277.430391] Lustre: *** cfs_fail_loc=15b, val=0***
[ 2277.436972] Lustre: Skipped 370 previous similar messages
[ 2277.503475] LustreError: 20275:0:(genops.c:334:class_newdev()) Device lustre-OST0000-osc-MDT0000 already exists at 7, won't add
[ 2277.518507] LustreError: 20275:0:(obd_config.c:366:class_attach()) Cannot create device lustre-OST0000-osc-MDT0000 of type osp : -17
[ 2277.533986] LustreError: 20275:0:(obd_config.c:1709:class_config_llog_handler()) MGC10.37.248.196@o2ib1: cfg command failed: rc = -17
[ 2277.549656] Lustre: cmd=cf001 0:lustre-OST0000-osc-MDT0000 1:osp 2:lustre-MDT0000-mdtlov_UUID
[ 2277.564109] LustreError: 15c-8: MGC10.37.248.196@o2ib1: The configuration from log 'lustre-MDT0000' failed (-17). This may be the result of.
[ 2277.564113] LustreError: 20223:0:(obd_mount_server.c:1351:server_start_targets()) failed to start server lustre-MDT0000: -17
[ 2277.564207] LustreError: 20223:0:(obd_mount_server.c:1840:server_fill_super()) Unable to start targets: -17
[ 2277.564233] Lustre: Failing over lustre-MDT0000
[ 2277.895953] Lustre: server umount lustre-MDT0000 complete
[ 2277.903356] LustreError: 20223:0:(obd_mount.c:1502:lustre_fill_super()) Unable to mount (-17) |
| Comments |
| Comment by James A Simmons [ 20/Apr/17 ] |
|
I have attached a lctl dump from the MDS to this ticket from when the bug occurred. |
| Comment by Andreas Dilger [ 21/Apr/17 ] |
|
I see in your log snippet: [ 2277.430391] Lustre: *** cfs_fail_loc=15b, val=0*** but this shouldn't be happening during sanity test_105a. That is left over from sanity test_60e, which is very strange. Are you sure these are the right logs from test_103a? Could you please also include the stdout output from the test? |
| Comment by James A Simmons [ 24/Apr/17 ] |
|
The log I posted is from running sanity.sh from start to finish. I have discovered that running 103b standalone will pass, so some leftover state is causing this strange failure. The stdout from the MDS for the full sanity run is posted in the description. This is why you see sanity 63 leftovers. |
| Comment by Peter Jones [ 25/Apr/17 ] |
|
Emoly Could you please look into this issue? Thanks Peter |
| Comment by Emoly Liu [ 27/Apr/17 ] |
|
James,

reset_fail_loc () {
    echo -n "Resetting fail_loc on all nodes..."
    do_nodes $(comma_list $(nodes_list)) "lctl set_param -n fail_loc=0 \
        fail_val=0 2>/dev/null" || true
    echo done.
}
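For context, the helpers that reset_fail_loc relies on (comma_list, nodes_list, do_nodes) come from the Lustre test framework. The sketch below uses simplified stand-ins for those helpers — these are illustrative assumptions, not the real lustre/tests/test-framework.sh code — to show how the node list driving the reset is built, and why a node missing from nodes_list would keep a stale fail_loc set:

```shell
#!/bin/sh
# Simplified stand-ins for the test-framework helpers referenced above.
# NOT the real implementations; for illustration only.

# comma_list: join whitespace-separated node names into a deduplicated,
# comma-separated list (the form do_nodes expects).
comma_list() {
    echo "$@" | tr -s ' ' '\n' | sort -u | paste -sd, -
}

# nodes_list normally returns the client and server nodes; hard-coded here.
# Note: a standalone MGS node is absent from this list.
nodes_list() {
    echo "client1 mds1 oss1"
}

comma_list $(nodes_list)
# → client1,mds1,oss1
# A fail_loc reset driven by this list would never reach a separate MGS node,
# leaving its fail_loc (e.g. 0x15b) set for later tests.
```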
|
| Comment by James A Simmons [ 03/May/17 ] |
|
I just do a llmount.sh and then run sanity.sh to reproduce this problem. The setup is a single client and an ldiskfs back end with one MGS, one MDS and one OSS server. Only the OSS server has 2 disks. I did git bisect it and found the problem has existed for a long time. The only reason I didn't see it before was that I recently changed my test bed configuration and split the MGS and MDS onto two different servers. I am collecting and looking at debug logs and will have more info shortly. |
| Comment by Andreas Dilger [ 12/May/17 ] |
|
On a related note, Mount option "noacl" will be removed by 3.5, and we are already at kernel 4.12, so this "noacl" test_103b should probably just be removed completely, along with the mention of the noacl option in mount.lustre.8.

James, one thought about why you are seeing this spurious cfs_fail_loc=15b: is your MGS on a separate node, or just a separate device from the MDS? I'm wondering if for some reason one of the nodes is not being caught by the reset_fail_loc call to nodes="$nodes $(facets_nodes $(get_facets))". |
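The fix direction suggested above — extending the reset's node list with the hosts of all facets so a standalone MGS is included — can be sketched as follows. The helper bodies here are simplified stand-ins (hypothetical host names, hard-coded facet-to-host mapping), not the real test-framework code:

```shell
#!/bin/sh
# Illustrative stand-ins only; NOT real lustre/tests/test-framework.sh code.

nodes_list() { echo "client1 mds1 oss1"; }      # base list: no MGS host

get_facets() { echo "mgs mds1 ost1"; }          # configured facets

# facets_nodes: map each facet to the host it runs on; here the mgs facet
# lives on its own node "mgs1" (the configuration James describes).
facets_nodes() {
    for f in "$@"; do
        case $f in
            mgs)  echo mgs1 ;;
            mds1) echo mds1 ;;
            ost1) echo oss1 ;;
        esac
    done
}

comma_list() { echo "$@" | tr -s ' ' '\n' | sort -u | paste -sd, -; }

# Extend the node list with every facet's host before resetting fail_loc:
nodes="$(nodes_list) $(facets_nodes $(get_facets))"
comma_list $nodes
# → client1,mds1,mgs1,oss1   (the MGS host is now included)
```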
| Comment by Gerrit Updater [ 15/May/17 ] |
|
Emoly Liu (emoly.liu@intel.com) uploaded a new patch: https://review.whamcloud.com/27109 |
| Comment by James A Simmons [ 15/May/17 ] |
|
Yes my MGS is a separate node from the MDS. |
| Comment by James A Simmons [ 05/Aug/17 ] |
|
The patch for |
| Comment by Gerrit Updater [ 13/Aug/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27109/ |