[LU-12753] sanity test_300a: FAIL: 1:stripe_count is 1, expect 2 Created: 11/Sep/19  Updated: 20/Sep/19  Resolved: 20/Sep/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.13.0

Type: Bug Priority: Minor
Reporter: Andreas Dilger Assignee: Andreas Dilger
Resolution: Fixed Votes: 0
Labels: None
Environment:

olegtest


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Seeing regular failures from Oleg's test environment:

Initial testing failed:

    sanity2@ldiskfs+DNE Failure(6653s)
    - 300a(1:stripe_count is 1, expect 2)
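For reference, the check that trips in test_300a is the stripe-count verification on a newly created striped directory. Roughly, it does something like the following (a simplified approximation; the actual test lives in lustre/tests/sanity.sh, and variables such as $LFS, $DIR and $tdir follow the usual test-framework conventions):

    # create a directory striped across 2 MDTs, then verify the count
    $LFS setdirstripe -i 0 -c 2 $DIR/$tdir/striped_dir ||
        error "set striped dir error"
    stripe_count=$($LFS getdirstripe -c $DIR/$tdir/striped_dir)
    [ $stripe_count -eq 2 ] ||
        error "1:stripe_count is $stripe_count, expect 2"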

It looks like the MDS is still recovering from the shutdown in test_278 when test_300a starts, so the striped directory isn't getting a stripe created on the still-recovering MDT:

[ 5346.863984] Lustre: DEBUG MARKER: == sanity test 278: Race starting MDS between MDTs stop/start ======================================== 13:15:41 (1568222141)
[ 5347.936754] Lustre: Failing over lustre-MDT0001
[ 5347.942260] Lustre: Skipped 11 previous similar messages
[ 5347.946645] LustreError: 11-0: lustre-MDT0000-osp-MDT0001: operation mds_disconnect to node 192.168.203.134@tcp failed: rc = -19
[ 5347.949716] LustreError: 16840:0:(osp_dev.c:485:osp_disconnect()) lustre-MDT0000-osp-MDT0001: can't disconnect: rc = -19
[ 5347.954701] LustreError: 16840:0:(lod_dev.c:267:lod_sub_process_config()) lustre-MDT0001-mdtlov: error cleaning up LOD index 0: cmd 0xcf031 : rc = -19
[ 5348.090212] LustreError: 16840:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 60c sleeping
[ 5348.901928] Lustre: server umount lustre-MDT0000 complete
[ 5348.905261] Lustre: Skipped 10 previous similar messages
[ 5351.974023] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[ 5352.048229] LustreError: 17417:0:(libcfs_fail.h:174:cfs_race()) cfs_fail_race id 60c waking
[ 5352.049878] LustreError: 16840:0:(libcfs_fail.h:172:cfs_race()) cfs_fail_race id 60c awake: rc=0
[ 5352.050393] Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 192.168.203.134@tcp) was lost; in progress operations using this service will wait for recovery to complete
[ 5352.050394] Lustre: Skipped 1 previous similar message
[ 5352.051035] LustreError: 166-1: MGC192.168.203.134@tcp: Connection to MGS (at 192.168.203.134@tcp) was lost; in progress operations using this service will fail
[ 5352.052430] Lustre: Evicted from MGS (at 192.168.203.134@tcp) after server handle changed from 0xb4695c9561748b31 to 0xb4695c9561c72f3e
[ 5354.166243] Lustre: lustre-MDT0000: Imperative Recovery not enabled, recovery window 60-180
[ 5355.558035] Lustre: DEBUG MARKER: oleg334-server.virtnet: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 8
[ 5359.121998] LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[ 5359.290980] LustreError: 11-0: lustre-MDT0000-osp-MDT0001: operation mds_connect to node 192.168.203.134@tcp failed: rc = -114
[ 5359.294022] LustreError: Skipped 1 previous similar message
[ 5360.745846] Lustre: DEBUG MARKER: oleg334-server.virtnet: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 8
[ 5363.943859] Lustre: DEBUG MARKER: == sanity test 300a: basic striped dir sanity test =================================================== 13:15:58 (1568222158)
[ 5364.351961] Lustre: 18717:0:(ldlm_lib.c:1855:extend_recovery_timer()) lustre-MDT0001: extended recovery timer reached hard limit: 180, extend: 1
[ 5364.355319] Lustre: 18717:0:(ldlm_lib.c:1855:extend_recovery_timer()) Skipped 2 previous similar messages
[ 5365.197447] Lustre: lustre-MDT0000: Recovery over after 0:08, of 2 clients 2 recovered and 0 were evicted.
[ 5365.202538] Lustre: Skipped 2 previous similar messages
[ 5365.217245] Lustre: lustre-OST0001: deleting orphan objects from 0x0:53071 to 0x0:53089
[ 5365.217254] Lustre: lustre-OST0000: deleting orphan objects from 0x0:53209 to 0x0:53281
[ 5365.246825] Lustre: 18717:0:(ldlm_lib.c:1855:extend_recovery_timer()) lustre-MDT0001: extended recovery timer reached hard limit: 180, extend: 1
[ 5365.268393] Lustre: lustre-OST0000: deleting orphan objects from 0x280000400:8777 to 0x280000400:8897
[ 5365.268407] Lustre: lustre-OST0001: deleting orphan objects from 0x2c0000400:8931 to 0x2c0000400:8993
[ 5365.795157] Lustre: DEBUG MARKER: sanity test_300a: @@@@@@ FAIL: 1:stripe_count is 1, expect 2

It seems likely that we need something like "wait_recovery_complete mds2" at the end of test_278() instead of letting the recovery bleed into the next test.
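
For illustration, the end of test_278() with the suggested wait added might look like this (a minimal sketch only; start_mdt and wait_recovery_complete are existing test-framework.sh helpers, and the exact form of the landed change is in the patch referenced in the comments below):

    # restart mds2 as before, but don't return until its recovery is done,
    # so the next test doesn't see MDT0001 still in recovery
    start_mdt 2 || error "start mds2 failed"
    wait_recovery_complete mds2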



 Comments   
Comment by Gerrit Updater [ 11/Sep/19 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36167
Subject: LU-12753 tests: wait for mds2 recovery in sanity 278
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 31946d9c959db6ac389a164480058ff54f0fe45d

Comment by Gerrit Updater [ 20/Sep/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36167/
Subject: LU-12753 tests: wait for mds2 recovery in sanity 278
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ee809615178d2fdbf6f2004ec871d04c2cfbca7e

Comment by Peter Jones [ 20/Sep/19 ]

Landed for 2.13
