[LU-10600] sanity test 66 hangs when run after tests 64d and 65k Created: 02/Feb/18  Updated: 27/Feb/18  Resolved: 27/Feb/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: Lustre 2.11.0

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: James Nunez (Inactive)
Resolution: Fixed Votes: 0
Labels: test_script_improvements, tests

Severity: 3
Rank (Obsolete): 9223372036854775807

Description

To reproduce this hang, run the following:

# ./llmount.sh
#  ./auster -v -k sanity --only "64d 65k 66"

The above commands set up Lustre with one combined MGS/MDS and two OSTs on loopback devices on a single node. The auster command runs only sanity tests 64d, 65k, and 66.

The output from the run is:

== sanity test 64d: check grant limit exceed ========================================================= 19:10:43 (1517512243)
debug=super ioctl neterror warning dlmtrace error emerg ha rpctrace vfstrace config console lfsck
dd: error writing '/mnt/lustre/f64d.sanity': No space left on device
278+0 records in
277+0 records out
290684928 bytes (291 MB) copied, 13.1767 s, 22.1 MB/s
/usr/lib64/lustre/tests/sanity.sh: line 6121: kill: (25805) - No such process
debug=trace inode super ext2 malloc cache info ioctl neterror net warning buffs other dentry nettrace page dlmtrace error emerg ha rpctrace vfstrace reada mmap config console quota sec lfsck hsm snapshot layout
Resetting fail_loc on all nodes...done.
PASS 64d (15s)


== sanity test 65k: validate manual striping works properly with deactivated OSCs ==================== 19:10:58 (1517512258)
Check OST status: 
lustre-OST0000-osc-MDT0000 is active
lustre-OST0001-osc-MDT0000 is active
total: 1000 open/close in 1.49 seconds: 672.84 ops/second
Deactivate:  lustre-OST0000-osc-MDT0000
/usr/bin/lfs setstripe -i 0 -c 1 /mnt/lustre/d65k.sanity/0
/usr/bin/lfs setstripe -i 1 -c 1 /mnt/lustre/d65k.sanity/1
 - unlinked 0 (time 1517512260 ; total 0 ; last 0)
total: 1000 unlinks in 2 seconds: 500.000000 unlinks/second
lustre-OST0000-osc-MDT0000 is Activate
trevis-58vm6.trevis.hpdd.intel.com: executing wait_import_state FULL osc.*OST*-osc-MDT0000.ost_server_uuid 40
osc.*OST*-osc-MDT0000.ost_server_uuid in FULL state after 0 sec
total: 1000 open/close in 1.62 seconds: 615.67 ops/second
Deactivate:  lustre-OST0001-osc-MDT0000
/usr/bin/lfs setstripe -i 0 -c 1 /mnt/lustre/d65k.sanity/0
lfs setstripe: error on ioctl 0x4008669a for '/mnt/lustre/d65k.sanity/0' (3): No space left on device
 sanity test_65k: @@@@@@ FAIL: setstripe 0 should succeed 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:5130:error()
  = /usr/lib64/lustre/tests/sanity.sh:6310:test_65k()
  = /usr/lib64/lustre/tests/test-framework.sh:5406:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:5445:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:5244:run_test()
  = /usr/lib64/lustre/tests/sanity.sh:6326:main()
Dumping lctl log to /tmp/test_logs/2018-02-01/191040/sanity.test_65k.*.1517512264.log
Dumping logs only on local client.
Resetting fail_loc on all nodes...done.
FAIL 65k (6s)


== sanity test 66: update inode blocks count on client =============================================== 19:11:04 (1517512264)

Looking at the test output, we see that test 64d fills an OST and does not remove the file it created. Test 65k then deactivates one OST at a time; once it has deactivated the empty OST and tries to create a file on the full one, the ‘lfs setstripe’ command fails. Test 65k exits as soon as the ‘lfs setstripe’ command fails and leaves one OST (the empty one) deactivated.
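
The same failure can be provoked by hand once one OST is full, since the MDS has to allocate the new file's object on the requested OST and the only other OST is deactivated. Roughly (device names taken from the log above; /mnt/lustre/foo is just an example path):

# lfs df /mnt/lustre
# lctl --device %lustre-OST0001-osc-MDT0000 deactivate
# lfs setstripe -i 0 -c 1 /mnt/lustre/foo
# lctl --device %lustre-OST0001-osc-MDT0000 activate

With OST0000 full from 64d and OST0001 deactivated, the setstripe should fail with the same "No space left on device" error shown in the 65k output above.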

The last thing I see in dmesg from test 66 is:

 [797721.192593] Lustre: DEBUG MARKER: == sanity test 66: update inode blocks count on client =============================================== 23:33:24 (1517614404)
[797773.923382] Lustre: lustre-OST0001: haven't heard from client lustre-MDT0000-mdtlov_UUID (at 0@lo) in 53 seconds. I think it's dead, and I am evicting it. exp ffff88000c752c00, cur 1517614457 expire 1517614427 last 1517614404
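
The eviction fits test 65k having bailed out with lustre-OST0001-osc-MDT0000 still deactivated on the MDS: the deactivated import stops pinging, OST0001 eventually evicts it, and with OST0000 still full from 64d, test 66 presumably has no OST it can write to. The leftover state can be checked, and cleared by hand, with something like:

# lctl dl | grep osc-MDT0000
# lctl --device %lustre-OST0001-osc-MDT0000 activate

A deactivated device should show up as IN rather than UP in the lctl dl listing.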

We need to remove the file created by test 64d when it completes and, in test 65k, reactivate the deactivated OST and clean up the test files on failure.
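
Something along these lines would cover both cases (a sketch only, using the existing sanity.sh/test-framework.sh helpers; the loop variable names and exact placement are illustrative, and the actual patch may differ):

# test_64d: remove the file that filled up the OST before returning
rm -f $DIR/$tfile
wait_delete_completed

# test_65k: reactivate the OSC before reporting the error, so a failure
# does not leave an OST deactivated for the tests that follow
$LFS setstripe -i $INDEX -c 1 $DIR/$tdir/$INDEX || {
	do_facet $SINGLEMDS $LCTL --device %$OSC activate
	error "setstripe $INDEX should succeed"
}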

I’ll upload a patch for this.



Comments
Comment by Gerrit Updater [ 02/Feb/18 ]

James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/31159
Subject: LU-10600 tests: clean up sanity tests 64d and 65k
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: cb666ba8f3c5f3591abcf81180b00dc8cbab8755

Comment by Gerrit Updater [ 27/Feb/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31159/
Subject: LU-10600 tests: clean up sanity tests 64d and 65k
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1f50ba91a2a58890437924d8f72a27aecc2463d3

Comment by Peter Jones [ 27/Feb/18 ]

Landed for 2.11
