[LU-9096] sanity test_253: File creation failed after rm Created: 09/Feb/17  Updated: 22/Jan/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0, Lustre 2.10.0, Lustre 2.14.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: always_except, zfs

Issue Links:
Duplicate
is duplicated by LU-9450 precreate logic badness between lod_s... Closed
is duplicated by LU-9454 sanity test_255a: f255a.sanity failed Closed
Related
is related to LU-7754 DNE3: osd-zfs gets into a livelock if... Open
is related to LU-15553 replay-vbr test 12a fails with 'test_... Open
is related to LU-10070 PFL self-extending file layout Resolved
is related to LU-9457 sanity test_253() must restore osp.*.... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/b06b5a90-e8f0-11e6-935d-5254006e85c2.

The sub-test test_253 failed with the following error:

CMD: trevis-43vm3 /usr/sbin/lctl get_param -n osp.lustre-OST0000-osc-MDT0000.prealloc_status
prealloc_status -28
dd: failed to open '/mnt/lustre/d253.sanity/2': No space left on device
5+0 records in
5+0 records out
5242880 bytes (5.2 MB) copied, 0.0534031 s, 98.2 MB/s
CMD: trevis-43vm3 /usr/sbin/lctl set_param -n osd*.*MD*.force_sync 1
CMD: trevis-43vm3 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
:
:
CMD: trevis-43vm3 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
Delete is not completed in 28 seconds
Waiting for local destroys to complete
CMD: trevis-43vm3 lctl get_param -n lov.*.qos_maxage
 sanity test_253: @@@@@@ FAIL: File creation failed after rm

The MDS console log shows that there are no precreated objects for this OST:

20:44:10:[11571.982460] LustreError: 22503:0:(lod_qos.c:1273:lod_alloc_specific()) can't lstripe objid [0x20000234c:0x13b8:0x0]: have 0 want 1

Info required for matching: sanity 253



Comments
Comment by Andreas Dilger [ 10/Feb/17 ]
20:45:07:[ 4933.488092] Lustre: DEBUG MARKER: == sanity test 253: Check object allocation limit ==================================================== 20:39:14 (1484858354)
20:45:07:[ 5040.092189] INFO: task txg_sync:20227 blocked for more than 120 seconds.
20:45:07:[ 5040.096367] txg_sync        D 0000000000000001     0 20227      2 0x00000080
20:45:07:[ 5040.106578] Call Trace:
20:45:07:[ 5040.108189]  [<ffffffff8168b9e9>] schedule+0x29/0x70
20:45:07:[ 5040.110193]  [<ffffffff81689439>] schedule_timeout+0x239/0x2d0
20:45:07:[ 5040.112031]  [<ffffffff810c4ff2>] ? default_wake_function+0x12/0x20
20:45:07:[ 5040.114129]  [<ffffffff810ba238>] ? __wake_up_common+0x58/0x90
20:45:07:[ 5040.115970]  [<ffffffff81060c1f>] ? kvm_clock_get_cycles+0x1f/0x30
20:45:07:[ 5040.118039]  [<ffffffff8168af8e>] io_schedule_timeout+0xae/0x130
20:45:07:[ 5040.119922]  [<ffffffff810b1416>] ? prepare_to_wait_exclusive+0x56/0x90
20:45:07:[ 5040.122039]  [<ffffffff8168b028>] io_schedule+0x18/0x20
20:45:07:[ 5040.123883]  [<ffffffffa0662617>] cv_wait_common+0xa7/0x130 [spl]
20:45:07:[ 5040.125886]  [<ffffffff810b1720>] ? wake_up_atomic_t+0x30/0x30
20:45:07:[ 5040.127688]  [<ffffffffa06626f8>] __cv_wait_io+0x18/0x20 [spl]
20:45:07:[ 5040.129716]  [<ffffffffa07bc51b>] zio_wait+0x10b/0x1f0 [zfs]
20:45:07:[ 5040.131471]  [<ffffffffa0745cdf>] dsl_pool_sync+0xbf/0x440 [zfs]
20:45:07:[ 5040.133451]  [<ffffffffa0760868>] spa_sync+0x388/0xb50 [zfs]
20:45:07:[ 5040.135159]  [<ffffffff810b174b>] ? autoremove_wake_function+0x2b/0x40
20:45:07:[ 5040.137112]  [<ffffffff81689b92>] ? mutex_lock+0x12/0x2f
20:45:07:[ 5040.138815]  [<ffffffffa07724e5>] txg_sync_thread+0x3c5/0x620 [zfs]
20:45:07:[ 5040.142523]  [<ffffffffa065d851>] thread_generic_wrapper+0x71/0x80 [spl]
20:45:07:[ 5040.146352]  [<ffffffff810b064f>] kthread+0xcf/0xe0
Comment by nasf (Inactive) [ 21/Mar/17 ]

+1 on master:
https://testing.hpdd.intel.com/test_sets/7029362e-0e0b-11e7-9053-5254006e85c2

Comment by Shuichi Ihara (Inactive) [ 22/Mar/17 ]

+1 on master
https://testing.hpdd.intel.com/sub_tests/987b39ae-0eee-11e7-b5b2-5254006e85c2

Comment by Patrick Farrell (Inactive) [ 28/Mar/17 ]

+1 on master:
https://testing.hpdd.intel.com/test_sessions/c0ddafc8-f0b9-4bd4-b3d8-bee81935e304

Comment by Bob Glossman (Inactive) [ 31/Mar/17 ]

another on master:
https://testing.hpdd.intel.com/test_sets/abcc08e0-15cd-11e7-9073-5254006e85c2

Comment by Bob Glossman (Inactive) [ 03/Apr/17 ]

more on master:
https://testing.hpdd.intel.com/test_sets/8bf82420-1769-11e7-9073-5254006e85c2
https://testing.hpdd.intel.com/test_sets/485d3906-1973-11e7-b742-5254006e85c2

Comment by Emoly Liu [ 14/Apr/17 ]

+1 on master:
https://testing.hpdd.intel.com/test_sets/fdc0b50c-2067-11e7-9073-5254006e85c2

Comment by Peter Jones [ 18/Apr/17 ]

Nathaniel

It seems like this test failure only impacts ZFS runs

Peter

Comment by Andreas Dilger [ 18/Apr/17 ]

Alex, might your patch https://review.whamcloud.com/18341 be related to this problem?

Comment by Steve Guminski (Inactive) [ 18/Apr/17 ]

Another on master:

https://testing.hpdd.intel.com/test_sessions/10710b40-d31b-41be-8e9a-07f9c3aec583

Comment by Alex Zhuravlev [ 24/Apr/17 ]

Andreas, that patch prevents indefinite blocking, which isn't the case here (the tests completed)?

Comment by nasf (Inactive) [ 24/Apr/17 ]

+1 on master:
https://testing.hpdd.intel.com/test_sets/af75a44a-28e4-11e7-9073-5254006e85c2

Comment by Sebastien Buisson (Inactive) [ 26/Apr/17 ]

+1 on master:
https://testing.hpdd.intel.com/test_sets/cb756bb2-2a97-11e7-9de9-5254006e85c2

Comment by Bruno Faccini (Inactive) [ 26/Apr/17 ]

+1 on master: https://testing.hpdd.intel.com/test_sets/34f1e788-2aab-11e7-8920-5254006e85c2

Comment by Alex Zhuravlev [ 05/May/17 ]

updated statfs: rc = 0, state, ffree 57923, bavail 52864, avail 206, low 1
updated statfs: rc = 0, state, ffree 57923, bavail 52864, avail 206, low 1
updated statfs: rc = 0, state enospc, ffree 57923, bavail 39552, avail 154, low 209
updated statfs: rc = 0, state enospc, ffree 18345, bavail 44512, avail 173, low 209
updated statfs: rc = 0, state enospc, ffree 18345, bavail 44512, avail 173, low 209
updated statfs: rc = 0, state enospc, ffree 18345, bavail 44512, avail 173, low 209
updated statfs: rc = 0, state enospc, ffree 18345, bavail 44512, avail 173, low 209
updated statfs: rc = 0, state enospc, ffree 18345, bavail 42848, avail 167, low 209
updated statfs: rc = 0, state enospc, ffree 18345, bavail 42848, avail 167, low 209
updated statfs: rc = 0, state enospc, ffree 57908, bavail 52864, avail 206, low 209
updated statfs: rc = 0, state enospc, ffree 57908, bavail 52864, avail 206, low 209

sanity/253 sets low to 209:

local blocks=$($LFS df $MOUNT | grep $ost_name | awk '{ print $4 }')
local new_lwm=$((blocks/1024-10))

osp.lustre-OST0000-osc-MDT0000.reserved_mb_low=209
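
For reference, the avail column in the log above is MB derived from bavail, which (judging by the numbers, an assumption rather than anything stated in the log) is in 4KB blocks. Even the best sample stays below the 209 MB low watermark, so the OSP never leaves the enospc state:

# sketch: convert the logged bavail (assumed 4KB blocks) to MB
echo $((52864 * 4 / 1024))   # 206 MB, the best sample -- still below 209
echo $((39552 * 4 / 1024))   # 154 MB, the worst sample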

Comment by Alex Zhuravlev [ 05/May/17 ]

@@ -14318,7 +14318,7 @@ test_253() {
 	local blocks=$($LFS df $MOUNT | grep $ost_name | awk '{ print $4 }')
 	echo "OST still has $((blocks/1024)) mbytes free"
 
-	local new_lwm=$((blocks/1024-10))
+	local new_lwm=$((blocks/1024-20))
 	do_facet $SINGLEMDS $LCTL set_param \
 		osp.$mdtosc_proc1.reserved_mb_high=$((new_lwm+5))
 	do_facet $SINGLEMDS $LCTL set_param \
solves the problem. I'm wondering where the 10 comes from.

Comment by John Hammond [ 05/May/17 ]

> solves the problem.. I'm wondering where 10 comes from.

I think it matches the 10 used a few lines before:
dd if=/dev/zero of=$DIR/$tdir/0 bs=1M count=10

Comment by John Hammond [ 05/May/17 ]

In the interim it would be good to dump all the osp params when this fails (rather than just prealloc_status).
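
A minimal sketch of such a dump, reusing the $SINGLEMDS facet and $mdtosc_proc1 device name that test_253 already defines (where exactly to hook this into the failure path is left open):

# on failure, capture every osp tunable and stat for the affected
# device, not just prealloc_status
do_facet $SINGLEMDS $LCTL get_param osp.$mdtosc_proc1.*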

Comment by Nathaniel Clark [ 09/May/17 ]

bzzz,

Instead of hard-coding a constant, would it make sense to use test-framework.sh::fs_log_size() to ensure there's enough room for the fs log?
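
A rough sketch of that suggestion, assuming fs_log_size() returns a size in KB (illustrative arithmetic, not a tested patch):

# subtract both the 10MB of test data and the estimated fs log size
# (rounded up to whole MB) instead of a second hard-coded constant
local logsz=$(fs_log_size $SINGLEMDS)
local new_lwm=$((blocks / 1024 - 10 - (logsz + 1023) / 1024))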

Comment by Gerrit Updater [ 09/May/17 ]

James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/27013
Subject: LU-9096 test: add sanity 253 to ALWAYS_EXCEPT
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d62f40a8af0fa7d4041517e81a0183bb051c1785

Comment by James Nunez (Inactive) [ 09/May/17 ]

Since sanity 253 is failing intermittently and the correctness of the test is in question, I've uploaded a patch to add it to the ALWAYS_EXCEPT list so that it will not be run.

Patch https://review.whamcloud.com/27013
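
For context, a sketch of the mechanism (an illustrative line, not the exact hunk from the patch): sanity.sh skips any subtest whose number appears in ALWAYS_EXCEPT, so the patch only needs to append 253:

# in sanity.sh: subtests listed here are skipped on every run
ALWAYS_EXCEPT="$SANITY_EXCEPT 253"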

Comment by Gerrit Updater [ 12/May/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27013/
Subject: LU-9096 test: add sanity 253 to ALWAYS_EXCEPT
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: cdc7b3bb16537fc513a64b4f20824b9832efcdf9

Comment by Peter Jones [ 12/May/17 ]

The test is no longer being run, so I will remove the fix version from this ticket; it can be reintroduced once the test has been given a clean bill of health.

Comment by Andreas Dilger [ 03/Dec/20 ]

For some reason this test is now running again and failing intermittently (about 4x per month).

Comment by Andreas Dilger [ 03/Dec/20 ]

It looks like this test was re-enabled by patch https://review.whamcloud.com/33778 "LU-10070 tests: New test-framework functionality". Since it is currently not failing very often, I'm not in a hurry to turn it off again, especially if it is adding coverage for SEL.

Comment by Sergey Cheremencev [ 29/Jun/23 ]

+1 on master https://testing.whamcloud.com/test_sessions/c590176e-852a-4080-abb9-e601015fbed9

Comment by Andreas Dilger [ 17/Nov/23 ]

Patch https://review.whamcloud.com/51861 "LU-15553 test: mkdir_on_mdt0 in sanity" is adding this subtest to the always_except list again.
