[LU-13813] conf-sanity test_112: can't put import for mdc.lustre-MDT0000-mdc-*.mds_server_uuid into FULL state after 1475 sec, have DISCONN Created: 22/Jul/20  Updated: 27/Mar/23  Resolved: 19/Aug/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0, Lustre 2.15.0
Fix Version/s: Lustre 2.16.0, Lustre 2.15.2

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Jian Yu
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-12857 recovery-mds-scale test_failover_ost ... Resolved
is related to LU-13221 conf-sanity test_112: FAIL: MDS start... Reopened
is related to LU-12818 replay-single test_70b and other test... Resolved
is related to LU-13184 conf-sanity test_112: problem creatin... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for vik <vlapa@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/c03460d0-489d-4fbb-84e5-d7df9a59d9dc

test_112 failed with the following error:

import is not in FULL state

Investigation showed that the MDS was not connected during the start_mds test step; _wait_import_state() failed with:

can't put import for mdc.lustre-MDT0000-mdc-*.mds_server_uuid into FULL state after 1475 sec, have DISCONN
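For reference, the check that times out here amounts to polling the import state through lctl until it reads FULL. Below is a minimal sketch of that loop, paraphrasing (not reproducing) _wait_import_state() from test-framework.sh; the helper name and the parsing of the mds_server_uuid output are assumptions:

wait_import_full() {
        local param=$1          # e.g. mdc.lustre-MDT0000-mdc-*.mds_server_uuid
        local expected=${2:-FULL}
        local deadline=$(( SECONDS + ${3:-300} ))
        local state=""

        while (( SECONDS < deadline )); do
                # "lctl get_param -n" prints only the value; the import state
                # is assumed to appear as a field of that value
                state=$(lctl get_param -n "$param" 2>/dev/null)
                [[ $state == *"$expected"* ]] && return 0
                sleep 1
        done
        echo "can't put import for $param into $expected state, have $state" >&2
        return 1
}

Invoked with the parameter from this failure, such a loop would time out the same way while the import stays in DISCONN.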

The debug log from vm2 contains messages showing the import cycling between states:
MDT0000_UUID: changing import state from DISCONN to CONNECTING, and then
MDT0000_UUID: changing import state from CONNECTING to DISCONN

00000100:00080000:1.0:1594563480.058266:0:25586:0:(import.c:86:import_set_state_nolock()) ffff9a285a1e5000 lustre-MDT0000_UUID: changing import state from DISCONN to CONNECTING
00000100:00080000:1.0:1594563480.058358:0:25586:0:(pinger.c:242:ptlrpc_pinger_process_import()) 2d768d9f-aa8c-42e0-a75c-fa93c720357c->lustre-MDT0000_UUID: level DISCONN/3 force 0 force_next 0 deactive 0 pingable 1 suppress 0
00000100:00080000:1.0:1594563480.058361:0:25586:0:(import.c:86:import_set_state_nolock()) ffff9a287aae0000 lustre-MDT0000_UUID: changing import state from DISCONN to CONNECTING
00000100:00080000:1.0:1594563480.059863:0:10210:0:(import.c:86:import_set_state_nolock()) ffff9a285a1e5000 lustre-MDT0000_UUID: changing import state from CONNECTING to DISCONN
00000100:00080000:1.0:1594563480.059906:0:10210:0:(import.c:86:import_set_state_nolock()) ffff9a287aae0000 lustre-MDT0000_UUID: changing import state from CONNECTING to DISCONN
00000001:02000400:0.0:1594563481.248258:0:24754:0:(debug.c:712:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark  rpc : @@@@@@ FAIL: can't put import for mdc.lustre-MDT0000-mdc-*.mds_server_uuid into FULL state after 1475 sec, have DISCONN
00000001:02000400:1.0:1594563481.249585:0:24755:0:(debug.c:712:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark  rpc : @@@@@@ FAIL: can't put import for mdc.lustre-MDT0000-mdc-*.mds_server_uuid into FULL state after 1475 sec, have DISCONN
Started lustre-MDT0000
CMD: onyx-42vm1.onyx.whamcloud.com,onyx-42vm2 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/openmpi/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/bin::/sbin:/bin:/usr/sbin: NAME=autotest_config bash rpc.sh wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid 
onyx-42vm1: == rpc test complete, duration -o sec ================================================================ 13:53:10 (1594561990)
onyx-42vm2: == rpc test complete, duration -o sec ================================================================ 13:53:10 (1594561990)
onyx-42vm1: CMD: onyx-42vm4 /usr/sbin/lctl get_param -n version 2>/dev/null
onyx-42vm2: CMD: onyx-42vm4 /usr/sbin/lctl get_param -n version 2>/dev/null
onyx-42vm1: CMD: onyx-42vm4 /usr/sbin/lctl get_param -n version 2>/dev/null
onyx-42vm2: CMD: onyx-42vm4 /usr/sbin/lctl get_param -n version 2>/dev/null
onyx-42vm1: CMD: onyx-42vm3 /usr/sbin/lctl get_param -n version 2>/dev/null
onyx-42vm2: CMD: onyx-42vm3 /usr/sbin/lctl get_param -n version 2>/dev/null
onyx-42vm1: CMD: onyx-42vm1.onyx.whamcloud.com /usr/sbin/lctl get_param -n version 2>/dev/null
onyx-42vm2: CMD: onyx-42vm2.onyx.whamcloud.com /usr/sbin/lctl get_param -n version 2>/dev/null
onyx-42vm1: onyx-42vm1.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
onyx-42vm2: onyx-42vm2.onyx.whamcloud.com: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
onyx-42vm2: CMD: onyx-42vm2.onyx.whamcloud.com lctl get_param -n at_max
onyx-42vm2:  rpc : @@@@@@ FAIL: can't put import for mdc.lustre-MDT0000-mdc-*.mds_server_uuid into FULL state after 1475 sec, have DISCONN 
onyx-42vm2:   Trace dump:
onyx-42vm2:   = /usr/lib64/lustre/tests/test-framework.sh:6216:error()
onyx-42vm2:   = /usr/lib64/lustre/tests/test-framework.sh:7395:_wait_import_state()
onyx-42vm2:   = /usr/lib64/lustre/tests/test-framework.sh:7417:wait_import_state()
onyx-42vm2:   = /usr/lib64/lustre/tests/test-framework.sh:7426:wait_import_state_mount()
onyx-42vm2:   = rpc.sh:18:main()

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
conf-sanity test_112 - import is not in FULL state



 Comments   
Comment by Andreas Dilger [ 06/Feb/21 ]

Fixed via LU-13184.

Comment by James Nunez (Inactive) [ 18/Feb/21 ]

It looks like we're seeing this issue, or something similar, in 2.14.0-RC3 ldiskfs non-DNE testing: https://testing.whamcloud.com/test_sets/14efbbbc-a9fd-4ed7-999c-71e1f38af9a6. If this is a new issue, I can close this ticket and open a new one.

The client error messages look the same, but on the MDS (vm4) we see the lustre-OST0000 import cycling between CONNECTING and DISCONN, with recovery failing with -110 (-ETIMEDOUT):

00000100:00080000:0.0:1613560151.829979:0:3006202:0:(import.c:86:import_set_state_nolock()) 0000000059d48be4 lustre-OST0000_UUID: changing import state from CONNECTING to DISCONN
00000100:00000001:1.0:1613560151.829981:0:3128009:0:(pinger.c:158:ptlrpc_ping()) Process leaving (rc=0 : 0 : 0)
00000020:00000001:1.0:1613560151.829983:0:3128009:0:(lprocfs_status.c:1609:lprocfs_read_helper()) Process leaving (rc=276728730 : 276728730 : 107e8b9a)
00000020:00000001:1.0:1613560151.829984:0:3128009:0:(lprocfs_status.c:1609:lprocfs_read_helper()) Process leaving (rc=18446744073561016351 : -148535265 : fffffffff725881f)
00000100:00080000:0.0:1613560151.829984:0:3006202:0:(import.c:1440:ptlrpc_connect_interpret()) recovery of lustre-OST0000_UUID on 10.2.8.108@tcp failed (-110)
00000100:00000040:1.0:1613560151.829985:0:3128009:0:(pinger.c:322:ptlrpc_pinger_main()) next wakeup in 5 (51652)
00000100:00000001:0.0:1613560151.829985:0:3006202:0:(import.c:1446:ptlrpc_connect_interpret()) Process leaving (rc=18446744073709551506 : -110 : ffffffffffffff92)
00000100:00080000:1.0:1613560151.829986:0:3128009:0:(pinger.c:249:ptlrpc_pinger_process_import()) lustre-MDT0000-mdtlov_UUID->lustre-OST0000_UUID: level DISCONN/3 force 1 force_next 0 deactive 0 pingable 0 suppress 1
00000100:00000040:0.0:1613560151.829986:0:3006202:0:(lustre_net.h:2470:ptlrpc_rqphase_move()) @@@ move request phase from Interpret to Complete  req@00000000e533b5b3 x1691936139685568/t0(0) o8->lustre-OST0000-osc-MDT0000@10.2.8.108@tcp:28/4 lens 520/544 e 0 to 1 dl 1613560151 ref 1 fl Interpret:EXNQU/0/ffffffff rc -110/-1 job:'kworker/u4:1.0'
00000100:00000001:1.0:1613560151.829988:0:3128009:0:(recover.c:56:ptlrpc_initiate_recovery()) Process entered
00000100:00080000:1.0:1613560151.829989:0:3128009:0:(recover.c:58:ptlrpc_initiate_recovery()) lustre-OST0000_UUID: starting recovery
00000100:00000001:1.0:1613560151.829990:0:3128009:0:(import.c:697:ptlrpc_connect_import_locked()) Process entered
00000100:00080000:1.0:1613560151.829990:0:3128009:0:(import.c:86:import_set_state_nolock()) 0000000059d48be4 lustre-OST0000_UUID: changing import state from DISCONN to CONNECTING
Comment by Andreas Dilger [ 19/Mar/22 ]

+1 on master: https://testing.whamcloud.com/test_sets/7fb8c261-d3c5-4f43-af97-6cb1e10724ad

Comment by Andreas Dilger [ 30/Jun/22 ]

It seems all PASS cases are with DNE mode, and all failures are with a single MDS.

Should we just add an ((MDSCOUNT >= 2)) check? I don't know why this is failing only for single-MDS tests and passing for DNE, but at least we wouldn't have continuous failures.
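A hypothetical sketch of that guard, following the skip() convention conf-sanity.sh tests already use (this is the workaround being proposed here, not the fix that eventually landed):

test_112() {
        # proposed workaround: run only under DNE until the single-MDS
        # failure mode is understood
        (( MDSCOUNT >= 2 )) || skip "fails on single-MDS setups (LU-13813)"

        # ... existing test_112 body unchanged ...
}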

Comment by Jian Yu [ 22/Jul/22 ]

The failure can be reproduced by running test 112 after test 111. It passed in review test sessions because test 111 was skipped with SLOW=no.
I'm working on a patch to fix the test script.
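For context, stack_trap is the test-framework.sh helper that registers a cleanup command to run when the test returns (EXIT by default), so cleanup still happens if the test errors out partway through. A hedged sketch of the pattern such a fix enforces; the helper names below are hypothetical, not copied from the actual patch:

test_111() {
        # change global state that following tests depend on being restored
        setup_custom_mdt_format         # hypothetical setup helper

        # register the undo via stack_trap right after the state change,
        # rather than calling it manually at the end of the test: if any
        # later step calls error(), the trap still runs on return and
        # test 112 does not start against leftover state
        stack_trap restore_default_format EXIT  # hypothetical cleanup helper

        # ... rest of the test ...
}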
 

Comment by Gerrit Updater [ 23/Jul/22 ]

"Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48021
Subject: LU-13813 tests: fix stack_trap in conf-sanity test 110/111
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: f6ad9782eed05eaa9b2af29e70080652e0d03e10

Comment by Gerrit Updater [ 23/Jul/22 ]

"Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48022
Subject: LU-13813 tests: fix stack_trap in conf-sanity test 110/111
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 77c40a5864758698bf92f74ff419b00491d402c8

Comment by Gerrit Updater [ 19/Aug/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/48022/
Subject: LU-13813 tests: fix stack_trap in conf-sanity test 110/111
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0109cee2610b8dfeaaca25c3eb1e805e033c593d

Comment by Peter Jones [ 19/Aug/22 ]

Landed for 2.16

Comment by Gerrit Updater [ 20/Aug/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/48021/
Subject: LU-13813 tests: fix stack_trap in conf-sanity test 110/111
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 0331d908abcb026fb450241ac8566d1a2bf425bc
