[LU-13179] conf-sanity test 93 fails with "mds2: import is not in FULL state after 40" Created: 31/Jan/20  Updated: 31/Jan/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.4
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None
Environment:

DNE/ZFS


Severity: 3
Rank (Obsolete): 9223372036854775807

Description

conf-sanity test_93 fails with "mds2: import is not in FULL state after 40". Looking at the console log for MDS 2/4 (vm5) for the failure at https://testing.whamcloud.com/test_sets/1e5610e6-43ab-11ea-8072-52540065bddc, we see that

trevis-42vm5: trevis-42vm5.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0001.ost_server_uuid 40
CMD: trevis-42vm4 zfs get -H -o value lustre:svname lustre-mdt3/mdt3
Starting mds3:   lustre-mdt3/mdt3 /mnt/lustre-mds3
CMD: trevis-42vm4 mkdir -p /mnt/lustre-mds3; mount -t lustre   lustre-mdt3/mdt3 /mnt/lustre-mds3
CMD: trevis-42vm5 zfs get -H -o value lustre:svname lustre-mdt4/mdt4
Starting mds4:   lustre-mdt4/mdt4 /mnt/lustre-mds4
CMD: trevis-42vm5 mkdir -p /mnt/lustre-mds4; mount -t lustre   lustre-mdt4/mdt4 /mnt/lustre-mds4
trevis-42vm5:  rpc : @@@@@@ FAIL: can't put import for os[cp].lustre-OST0000-osc-MDT0001.ost_server_uuid into FULL state after 40 sec, have  
trevis-42vm5:   Trace dump:
trevis-42vm5:   = /usr/lib64/lustre/tests/test-framework.sh:5900:error()
trevis-42vm5:   = /usr/lib64/lustre/tests/test-framework.sh:7027:_wait_import_state()
trevis-42vm5:   = /usr/lib64/lustre/tests/test-framework.sh:7049:wait_import_state()
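
For reference, the parameter that _wait_import_state() polls can be read by hand on the failing MDS with lctl; the os[cp] glob matches both the osc and osp parameter trees, since the MDT-side OST import may be registered under either name. The trailing blank after "have" in the failure message suggests the parameter returned no state at all. Below is a minimal sketch of the polling loop, using the same parameter name as in the log; it is a simplification, not the real test-framework code:

# Read the connection state directly; this normally prints something
# like "lustre-OST0000_UUID FULL":
lctl get_param -n 'os[cp].lustre-OST0000-osc-MDT0001.ost_server_uuid'

# Simplified approximation of _wait_import_state(): poll once per
# second until the state reads FULL or the 40-second budget runs out.
i=0; state=""
while [ $i -lt 40 ]; do
    state=$(lctl get_param -n 'os[cp].lustre-OST0000-osc-MDT0001.ost_server_uuid' \
        2>/dev/null | awk '{print $2}' | uniq)
    [ "$state" = "FULL" ] && break
    sleep 1; i=$((i + 1))
done
echo "state after ${i}s: '$state'"  # empty here would match the "have  " above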

In all of the console logs, we just get confirmation that the mds2 import does not reach FULL state within the allotted time limit.

Looking at the console log for MDS 1/3 (vm4), we see

[83050.530881] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-42vm5.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0001.ost_server_uuid 40
[83050.717460] Lustre: DEBUG MARKER: zfs get -H -o value lustre:svname lustre-mdt3/mdt3
[83050.753501] Lustre: DEBUG MARKER: trevis-42vm5.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0001.ost_server_uuid 40
[83051.092507] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds3; mount -t lustre   lustre-mdt3/mdt3 /mnt/lustre-mds3
[83064.412617] LustreError: 3861:0:(fail.c:129:__cfs_fail_timeout_set()) cfs_fail_timeout id 90e sleeping for 10000ms
[83064.414535] LustreError: 3861:0:(fail.c:129:__cfs_fail_timeout_set()) Skipped 76 previous similar messages
[83074.420718] LustreError: 3861:0:(fail.c:133:__cfs_fail_timeout_set()) cfs_fail_timeout id 90e awake
[83074.422537] LustreError: 3861:0:(fail.c:133:__cfs_fail_timeout_set()) Skipped 76 previous similar messages
[83089.482356] Lustre: lustre-MDT0002: Imperative Recovery not enabled, recovery window 60-180
[83089.484385] Lustre: Skipped 6 previous similar messages
[83090.262090] Lustre: cli-ctl-lustre-MDT0002: Allocated super-sequence [0x0000000280000400-0x00000002c0000400]:2:mdt]
[83091.385126] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  rpc : @@@@@@ FAIL: can\'t put import for os[cp].lustre-OST0000-osc-MDT0001.ost_server_uuid into FULL state after 40 sec, have  
[83091.610883] Lustre: DEBUG MARKER: rpc : @@@@@@ FAIL: can't put import for os[cp].lustre-OST0000-osc-MDT0001.ost_server_uuid into FULL state after 40 sec, have
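
The cfs_fail_timeout lines above show fail-injection point 0x90e, presumably armed by the test through fail_loc, sleeping 10 seconds at a time (with 76 earlier hits skipped), and those delays overlap the 40-second wait window. As a hedged sketch, the injection can be inspected and cleared on the affected node with the standard tunables:

# Show which fail point is armed and its delay value:
lctl get_param fail_loc fail_val

# Disarm the injection (tests normally reset these during cleanup):
lctl set_param fail_loc=0 fail_val=0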

Looking at the console log for the OSS (vm3), we see

[83053.780999] Lustre: DEBUG MARKER: == rpc test complete, duration -o sec ================================================================ 00:08:41 (1580342921)
[83054.152945] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-42vm5.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0001.ost_server_uuid 40
[83054.372472] Lustre: DEBUG MARKER: trevis-42vm5.trevis.whamcloud.com: executing wait_import_state FULL os[cp].lustre-OST0000-osc-MDT0001.ost_server_uuid 40
[83092.153168] Lustre: cli-lustre-OST0000-super: Allocated super-sequence [0x0000000240000400-0x0000000280000400]:0:ost]
[83092.155157] Lustre: Skipped 2 previous similar messages
[83094.995539] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  rpc : @@@@@@ FAIL: can\'t put import for os[cp].lustre-OST0000-osc-MDT0001.ost_server_uuid into FULL state after 40 sec, have  
[83095.232157] Lustre: DEBUG MARKER: rpc : @@@@@@ FAIL: can't put import for os[cp].lustre-OST0000-osc-MDT0001.ost_server_uuid into FULL state after 40 sec, have

In the console log for client1 (vm1), we see

[83271.701328] Lustre: 17215:0:(lmv_obd.c:269:lmv_init_ea_size()) lustre-clilmv-ffff97f8f9748800: NULL export for 1
[83292.919916] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  conf-sanity test_93: @@@@@@ FAIL: mds2: import is not in FULL state after 40 
[83293.136689] Lustre: DEBUG MARKER: conf-sanity test_93: @@@@@@ FAIL: mds2: import is not in FULL state after 40

This is the first time we've seen this issue in branch testing: 29 Jan 2020 for 2.12.3.109 DNE/ZFS.

In the past year, we've seen a similar issue twice when running testing for patches, but in those cases the patch being tested may have caused the failure.

