[LU-12102] sanity-scrub test_7: (8) Expected 'scanning' on mds1 Created: 25/Mar/19 Updated: 21/Feb/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/8b3b3134-48aa-11e9-b98a-52540065bddc

test_7 failed with the following error:

Update not seen after 6s: wanted 'scanning' got 'completed'
(8) Expected 'scanning' on mds1

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV |
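As a reference for what the failure message implies, here is a minimal sketch of the kind of status poll involved, assuming the OI scrub state is read from the osd-*.*.oi_scrub parameter (consistent with the lctl commands quoted in the comments below); the device name, field parsing, and 6-second window are illustrative assumptions, not the test framework's actual code:

# Sketch: poll the MDT OI scrub status until it reports the expected state
# or the window expires.  Device name and timeout are assumptions.
expected="scanning"
deadline=$((SECONDS + 6))
while [ $SECONDS -lt $deadline ]; do
    # "status:" is one of the fields printed by the oi_scrub parameter
    status=$(lctl get_param -n osd-*.lustre-MDT0000.oi_scrub 2>/dev/null |
             awk '/^status/ { print $2 }')
    [ "$status" = "$expected" ] && exit 0
    sleep 1
done
echo "Update not seen after 6s: wanted '$expected' got '$status'"
exit 8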
| Comments |
| Comment by Patrick Farrell (Inactive) [ 25/Mar/19 ] |
|
Some notes.

It looks like we have trouble reconnecting the MDTs, and that causes this message (from dmesg on MDS1):

[16447.097886] Lustre: lustre-MDT0000: hit invalid OI mapping for [0x200001b74:0x2:0x0] during recovering, that may because auto scrub is disabled on related MDT, and will cause recovery failure. Please enable auto scrub and retry the recovery.

This makes the OI scrub startup weird, and the subsequent messages in the log are different from the success case.

We're having trouble re-establishing the connection here:

00000100:02020000:0.0:1552798330.508523:0:17624:0:(client.c:1279:ptlrpc_check_status()) 11-0: lustre-MDT0000-osp-MDT0002: operation mds_connect to node 0@lo failed: rc = -114
00000100:00080000:0.0:1552798330.511162:0:17624:0:(import.c:1338:ptlrpc_connect_interpret()) recovery of lustre-MDT0000_UUID on 10.9.6.9@tcp failed (-114)
00000100:02020000:0.0:1552798330.512119:0:17624:0:(client.c:1279:ptlrpc_check_status()) 11-0: lustre-MDT0001-osp-MDT0002: operation mds_connect to node 10.9.6.10@tcp failed: rc = -114
00000100:00080000:0.0:1552798330.512123:0:17624:0:(import.c:1338:ptlrpc_connect_interpret()) recovery of lustre-MDT0001_UUID on 10.9.6.10@tcp failed (-114)

This is before we turn on OI scrub:

00000001:02000400:0.0:1552798342.891701:0:29330:0:(debug.c:511:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl set_param -n osd-*.*.auto_scrub=1

But we're still seeing errors after that:

00000100:00100000:0.0:1552798343.266169:0:26241:0:(service.c:2198:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt_out00_000:lustre-MDT0001-mdtlov_UUID+5:14884:x1628209949343232:12345-10.9.6.10@tcp:1000 Request processed in 3455us (3476us total) trans 0 rc -115/-115

That's an OUT_UPDATE request.

(Posting this to get it out, more to check...) |
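For context, -114 is -EALREADY and -115 is -EINPROGRESS, i.e. the connect and update requests are being retried while the peer is still setting up. Below is a hedged sketch of commands one might run on the MDS nodes to confirm whether the inter-MDT connections and recovery had actually finished at that point; mdt.*.recovery_status is a standard parameter, while the osp import path is an assumption about how the inter-MDT (osp-on-MDT) devices appear on this setup:

# Recovery state of each local MDT (status: COMPLETE / RECOVERING / ...)
lctl get_param mdt.*.recovery_status
# Import state of the inter-MDT connections that returned -114 above
# (path assumed; adjust to the device names shown by 'lctl dl')
lctl get_param osp.*-osp-MDT*.import 2>/dev/null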
| Comment by Patrick Farrell (Inactive) [ 25/Mar/19 ] |
|
I dug in some more and didn't find any smoking guns, so I'm going to drop this for now. Perhaps this is helpful for someone else, or just something to revisit later.

The cause of the failure definitely seems to be the MDSes not being fully connected, leading to this error:

[16447.097886] Lustre: lustre-MDT0000: hit invalid OI mapping for [0x200001b74:0x2:0x0] during recovering, that may because auto scrub is disabled on related MDT, and will cause recovery failure. Please enable auto scrub and retry the recovery.

This suggests we may need a better check for startup: the situation does appear to unwind itself, but only after we start the OI scrub. |
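As an illustration of the kind of "better check for startup" mentioned above (not a proposed patch), one could wait for every local MDT to report recovery COMPLETE before the test starts asserting on the OI scrub state; the timeout and parameter handling here are assumptions rather than the test framework's actual helpers:

# Sketch: block until all local MDTs report recovery COMPLETE, or time out.
deadline=$((SECONDS + 300))
while [ $SECONDS -lt $deadline ]; do
    states=$(lctl get_param -n mdt.*.recovery_status 2>/dev/null |
             awk '/^status:/ { print $2 }')
    # proceed only once at least one MDT is present and none is still recovering
    if [ -n "$states" ] && ! echo "$states" | grep -qv COMPLETE; then
        echo "all local MDTs report recovery COMPLETE"
        exit 0
    fi
    sleep 5
done
echo "timed out waiting for MDT recovery to complete"
exit 1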