Details
- Type: Bug
- Resolution: Fixed
- Priority: Minor
- Affects Version: Lustre 2.11.0
Description
To reproduce this hang, run the following:
# ./llmount.sh
# ./auster -v -k sanity --only "64d 65k 66"
The above commands will set up Lustre with one combined MGS/MDS and two OSTs backed by loopback devices on a single node. The auster command will run sanity tests 64d, 65k, and 66 only.
The output from the run is:
== sanity test 64d: check grant limit exceed ========================================================= 19:10:43 (1517512243)
debug=super ioctl neterror warning dlmtrace error emerg ha rpctrace vfstrace config console lfsck
dd: error writing '/mnt/lustre/f64d.sanity': No space left on device
278+0 records in
277+0 records out
290684928 bytes (291 MB) copied, 13.1767 s, 22.1 MB/s
/usr/lib64/lustre/tests/sanity.sh: line 6121: kill: (25805) - No such process
debug=trace inode super ext2 malloc cache info ioctl neterror net warning buffs other dentry nettrace page dlmtrace error emerg ha rpctrace vfstrace reada mmap config console quota sec lfsck hsm snapshot layout
Resetting fail_loc on all nodes...done.
PASS 64d (15s)

== sanity test 65k: validate manual striping works properly with deactivated OSCs ==================== 19:10:58 (1517512258)
Check OST status:
lustre-OST0000-osc-MDT0000 is active
lustre-OST0001-osc-MDT0000 is active
total: 1000 open/close in 1.49 seconds: 672.84 ops/second
Deactivate: lustre-OST0000-osc-MDT0000
/usr/bin/lfs setstripe -i 0 -c 1 /mnt/lustre/d65k.sanity/0
/usr/bin/lfs setstripe -i 1 -c 1 /mnt/lustre/d65k.sanity/1
- unlinked 0 (time 1517512260 ; total 0 ; last 0)
total: 1000 unlinks in 2 seconds: 500.000000 unlinks/second
lustre-OST0000-osc-MDT0000 is Activate
trevis-58vm6.trevis.hpdd.intel.com: executing wait_import_state FULL osc.*OST*-osc-MDT0000.ost_server_uuid 40
osc.*OST*-osc-MDT0000.ost_server_uuid in FULL state after 0 sec
total: 1000 open/close in 1.62 seconds: 615.67 ops/second
Deactivate: lustre-OST0001-osc-MDT0000
/usr/bin/lfs setstripe -i 0 -c 1 /mnt/lustre/d65k.sanity/0
lfs setstripe: error on ioctl 0x4008669a for '/mnt/lustre/d65k.sanity/0' (3): No space left on device
sanity test_65k: @@@@@@ FAIL: setstripe 0 should succeed
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:5130:error()
= /usr/lib64/lustre/tests/sanity.sh:6310:test_65k()
= /usr/lib64/lustre/tests/test-framework.sh:5406:run_one()
= /usr/lib64/lustre/tests/test-framework.sh:5445:run_one_logged()
= /usr/lib64/lustre/tests/test-framework.sh:5244:run_test()
= /usr/lib64/lustre/tests/sanity.sh:6326:main()
Dumping lctl log to /tmp/test_logs/2018-02-01/191040/sanity.test_65k.*.1517512264.log
Dumping logs only on local client.
Resetting fail_loc on all nodes...done.
FAIL 65k (6s)

== sanity test 66: update inode blocks count on client =============================================== 19:11:04 (1517512264)
Looking at the test output, we see that test 64d fills an OST and does not remove the file. Test 65k then deactivates one OST at a time; when it has deactivated the empty OST and tries to use the full one, the 'lfs setstripe' command fails. Test 65k exits when the 'lfs setstripe' command fails, leaving one OST deactivated.
The last thing I see in dmesg from test 66 is:
[797721.192593] Lustre: DEBUG MARKER: == sanity test 66: update inode blocks count on client =============================================== 23:33:24 (1517614404)
[797773.923382] Lustre: lustre-OST0001: haven't heard from client lustre-MDT0000-mdtlov_UUID (at 0@lo) in 53 seconds. I think it's dead, and I am evicting it. exp ffff88000c752c00, cur 1517614457 expire 1517614427 last 1517614404
We need to clean up the file from test 64d, and in test 65k we need to reactivate the OST and clean up the test files on failure.
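As a rough illustration of that cleanup (this is a hypothetical sketch, not the actual patch: the function names are invented, and `do_facet`/`$LCTL`/`$DIR` are assumed to behave like the corresponding test-framework.sh helpers and variables):

```shell
#!/bin/sh
# Hypothetical cleanup sketch (names are assumptions, not the real patch).

# Test 64d: remove the file that filled the OST, so later tests
# start with free space on both OSTs.
cleanup_64d() {
	rm -f "$DIR/f64d.sanity"
}

# Test 65k: run this even when the test fails, so the deactivated
# OSC is brought back and the test files are removed.
cleanup_65k() {
	osc=$1	# e.g. lustre-OST0000-osc-MDT0000
	do_facet mds1 "$LCTL --device %$osc activate"
	rm -rf "$DIR/d65k.sanity"
}
```

The key point is that the 65k cleanup must be registered to run on the test's failure path, not only on success, so an 'lfs setstripe' error no longer leaves an OSC deactivated for the tests that follow.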
I’ll upload a patch for this.