[LU-13813] conf-sanity test_112: can't put import for mdc.lustre-MDT0000-mdc-*.mds_server_uuid into FULL state after 1475 sec, have DISCONN Created: 22/Jul/20 Updated: 27/Mar/23 Resolved: 19/Aug/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0, Lustre 2.15.0 |
| Fix Version/s: | Lustre 2.16.0, Lustre 2.15.2 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Jian Yu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||||||||||
| Severity: | 3 | ||||||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||||||
| Description |
|
This issue was created by maloo for vik <vlapa@whamcloud.com> This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/c03460d0-489d-4fbb-84e5-d7df9a59d9dc test_112 failed with the following error: import is not in FULL state Investigation showed that MDS was not connected during start_mds test step in _wait_import_state() can\'t put import for mdc.lustre-MDT0000-mdc-*.mds_server_uuid into FULL state after 1475 sec, have DISCONN debug log from vm2 contains messages about import state changes. 00000100:00080000:1.0:1594563480.058266:0:25586:0:(import.c:86:import_set_state_nolock()) ffff9a285a1e5000 lustre-MDT0000_UUID: changing import state from DISCONN to CONNECTING 00000100:00080000:1.0:1594563480.058358:0:25586:0:(pinger.c:242:ptlrpc_pinger_process_import()) 2d768d9f-aa8c-42e0-a75c-fa93c720357c->lustre-MDT0000_UUID: level DISCONN/3 force 0 force_next 0 deactive 0 pingable 1 suppress 0 00000100:00080000:1.0:1594563480.058361:0:25586:0:(import.c:86:import_set_state_nolock()) ffff9a287aae0000 lustre-MDT0000_UUID: changing import state from DISCONN to CONNECTING 00000100:00080000:1.0:1594563480.059863:0:10210:0:(import.c:86:import_set_state_nolock()) ffff9a285a1e5000 lustre-MDT0000_UUID: changing import state from CONNECTING to DISCONN 00000100:00080000:1.0:1594563480.059906:0:10210:0:(import.c:86:import_set_state_nolock()) ffff9a287aae0000 lustre-MDT0000_UUID: changing import state from CONNECTING to DISCONN 00000001:02000400:0.0:1594563481.248258:0:24754:0:(debug.c:712:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark rpc : @@@@@@ FAIL: can\'t put import for mdc.lustre-MDT0000-mdc-*.mds_server_uuid into FULL state after 1475 sec, have DISCONN 00000001:02000400:1.0:1594563481.249585:0:24755:0:(debug.c:712:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark rpc : @@@@@@ FAIL: can\'t put import for mdc.lustre-MDT0000-mdc-*.mds_server_uuid into FULL state after 1475 sec, have DISCONN Started lustre-MDT0000 CMD: onyx-42vm1.onyx.whamcloud.com,onyx-42vm2 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/openmpi/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/bin::/sbin:/bin:/usr/sbin: NAME=autotest_config bash rpc.sh wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid onyx-42vm1: == rpc test complete, duration -o sec ================================================================ 13:53:10 (1594561990) onyx-42vm2: == rpc test complete, duration -o sec ================================================================ 13:53:10 (1594561990) onyx-42vm1: CMD: onyx-42vm4 /usr/sbin/lctl get_param -n version 2>/dev/null onyx-42vm2: CMD: onyx-42vm4 /usr/sbin/lctl get_param -n version 2>/dev/null onyx-42vm1: CMD: onyx-42vm4 /usr/sbin/lctl get_param -n version 2>/dev/null onyx-42vm2: CMD: onyx-42vm4 /usr/sbin/lctl get_param -n version 2>/dev/null onyx-42vm1: CMD: onyx-42vm3 /usr/sbin/lctl get_param -n version 2>/dev/null onyx-42vm2: CMD: onyx-42vm3 /usr/sbin/lctl get_param -n version 2>/dev/null onyx-42vm1: CMD: onyx-42vm1.onyx.whamcloud.com /usr/sbin/lctl get_param -n version 2>/dev/null onyx-42vm2: CMD: onyx-42vm2.onyx.whamcloud.com /usr/sbin/lctl get_param -n version 2>/dev/null onyx-42vm1: onyx-42vm1.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid onyx-42vm2: onyx-42vm2.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid onyx-42vm2: CMD: onyx-42vm2.onyx.whamcloud.com lctl get_param -n at_max onyx-42vm2: rpc : @@@@@@ FAIL: can't put import for mdc.lustre-MDT0000-mdc-*.mds_server_uuid into FULL state after 1475 sec, have DISCONN onyx-42vm2: Trace dump: onyx-42vm2: = /usr/lib64/lustre/tests/test-framework.sh:6216:error() onyx-42vm2: = /usr/lib64/lustre/tests/test-framework.sh:7395:_wait_import_state() onyx-42vm2: = /usr/lib64/lustre/tests/test-framework.sh:7417:wait_import_state() onyx-42vm2: = /usr/lib64/lustre/tests/test-framework.sh:7426:wait_import_state_mount() onyx-42vm2: = rpc.sh:18:main() VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV |
| Comments |
| Comment by Andreas Dilger [ 06/Feb/21 ] |
|
Fixed via |
| Comment by James Nunez (Inactive) [ 18/Feb/21 ] |
|
It looks like we’re seeing this issue or something similar for 2.14.0-RC3 ldiskfs non-DNE testing; https://testing.whamcloud.com/test_sets/14efbbbc-a9fd-4ed7-999c-71e1f38af9a6 . If this is a new issue, I can close this ticket and open a new one. The client error messages look the same, but on the MDS (vm4), we see the DISCONN to CONNECTING state changes: 00000100:00080000:0.0:1613560151.829979:0:3006202:0:(import.c:86:import_set_state_nolock()) 0000000059d48be4 lustre-OST0000_UUID: changing import state from CONNECTING to DISCONN 00000100:00000001:1.0:1613560151.829981:0:3128009:0:(pinger.c:158:ptlrpc_ping()) Process leaving (rc=0 : 0 : 0) 00000020:00000001:1.0:1613560151.829983:0:3128009:0:(lprocfs_status.c:1609:lprocfs_read_helper()) Process leaving (rc=276728730 : 276728730 : 107e8b9a) 00000020:00000001:1.0:1613560151.829984:0:3128009:0:(lprocfs_status.c:1609:lprocfs_read_helper()) Process leaving (rc=18446744073561016351 : -148535265 : fffffffff725881f) 00000100:00080000:0.0:1613560151.829984:0:3006202:0:(import.c:1440:ptlrpc_connect_interpret()) recovery of lustre-OST0000_UUID on 10.2.8.108@tcp failed (-110) 00000100:00000040:1.0:1613560151.829985:0:3128009:0:(pinger.c:322:ptlrpc_pinger_main()) next wakeup in 5 (51652) 00000100:00000001:0.0:1613560151.829985:0:3006202:0:(import.c:1446:ptlrpc_connect_interpret()) Process leaving (rc=18446744073709551506 : -110 : ffffffffffffff92) 00000100:00080000:1.0:1613560151.829986:0:3128009:0:(pinger.c:249:ptlrpc_pinger_process_import()) lustre-MDT0000-mdtlov_UUID->lustre-OST0000_UUID: level DISCONN/3 force 1 force_next 0 deactive 0 pingable 0 suppress 1 00000100:00000040:0.0:1613560151.829986:0:3006202:0:(lustre_net.h:2470:ptlrpc_rqphase_move()) @@@ move request phase from Interpret to Complete req@00000000e533b5b3 x1691936139685568/t0(0) o8->lustre-OST0000-osc-MDT0000@10.2.8.108@tcp:28/4 lens 520/544 e 0 to 1 dl 1613560151 ref 1 fl Interpret:EXNQU/0/ffffffff rc -110/-1 job:'kworker/u4:1.0' 00000100:00000001:1.0:1613560151.829988:0:3128009:0:(recover.c:56:ptlrpc_initiate_recovery()) Process entered 00000100:00080000:1.0:1613560151.829989:0:3128009:0:(recover.c:58:ptlrpc_initiate_recovery()) lustre-OST0000_UUID: starting recovery 00000100:00000001:1.0:1613560151.829990:0:3128009:0:(import.c:697:ptlrpc_connect_import_locked()) Process entered 00000100:00080000:1.0:1613560151.829990:0:3128009:0:(import.c:86:import_set_state_nolock()) 0000000059d48be4 lustre-OST0000_UUID: changing import state from DISCONN to CONNECTING |
| Comment by Andreas Dilger [ 19/Mar/22 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/7fb8c261-d3c5-4f43-af97-6cb1e10724ad |
| Comment by Andreas Dilger [ 30/Jun/22 ] |
|
It seems all PASS cases are with DNE mode, and all failed for single MDS. Should we just add an ((MDSCOUNT >= 2)) check? I don't know why this is failing only for single-MDS tests and passing for DNE, but at least we wouldn't have continuous failures. |
| Comment by Jian Yu [ 22/Jul/22 ] |
|
The failure can be reproduced by running test 112 after test 111. It passed in review test sessions because test 111 was skipped with SLOW=no. |
| Comment by Gerrit Updater [ 23/Jul/22 ] |
|
"Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48021 |
| Comment by Gerrit Updater [ 23/Jul/22 ] |
|
"Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48022 |
| Comment by Gerrit Updater [ 19/Aug/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/48022/ |
| Comment by Peter Jones [ 19/Aug/22 ] |
|
Landed for 2.16 |
| Comment by Gerrit Updater [ 20/Aug/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/48021/ |