[LU-10751] sanity test 27o fails with 'able to create /mnt/lustre/d27o.sanity/f27o.sanity' Created: 01/Mar/18  Updated: 23/Sep/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: dne

Issue Links:
Duplicate
is duplicated by LU-13245 sanity test_27o: sanity test_27o: @@... Open
is duplicated by LU-14166 sanity test_27o: able to create /mnt/... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity test_27o is failing because it is able to create a file after it exhausts all precreations on all OSTs. The error message for this failure is

'able to create /mnt/lustre/d27o.sanity/f27o.sanity'

For each OST, the test, in exhaust_all_precreations(), collects the osc.*.prealloc_last_id and osc.*.prealloc_next_id values and creates (last_id - next_id + 2) files to exhaust all file precreations. Looking at the suite_log for the failure at https://testing.hpdd.intel.com/test_sets/dabd9962-0d65-11e8-bd00-52540065bddc, we can see this working. For example, for OST1

 
OSTIDX=1 MDTIDX=3
CMD: trevis-15vm5 lctl get_param -n osc.lustre-OST0001-osc-MDT0003.prealloc_last_id
CMD: trevis-15vm5 lctl get_param -n osc.lustre-OST0001-osc-MDT0003.prealloc_next_id
CMD: trevis-15vm5 lctl get_param osc.*OST*-osc-MDT0003.prealloc*
…
osc.lustre-OST0001-osc-MDT0003.prealloc_last_id=97
osc.lustre-OST0001-osc-MDT0003.prealloc_last_seq=0x380000401
osc.lustre-OST0001-osc-MDT0003.prealloc_next_id=69
osc.lustre-OST0001-osc-MDT0003.prealloc_next_seq=0x380000401
osc.lustre-OST0001-osc-MDT0003.prealloc_reserved=0
osc.lustre-OST0001-osc-MDT0003.prealloc_status=-28
…
striped dir -i3 -c2 /mnt/lustre/d27o.sanity/lustre-OST0001
CMD: trevis-15vm3 lctl set_param fail_val=-1 fail_loc=0x215
fail_val=-1
fail_loc=0x215
Creating to objid 97 on ost lustre-OST0001...
open(/mnt/lustre/d27o.sanity/lustre-OST0001/f71) error: No space left on device
total: 2 open/close in 0.01 seconds: 210.13 ops/second

So, OST1 is “full”, and we see this for all OSTs except for one:

OSTIDX=0 MDTIDX=3
CMD: trevis-15vm5 lctl get_param -n osc.lustre-OST0000-osc-MDT0003.prealloc_last_id
CMD: trevis-15vm5 lctl get_param -n osc.lustre-OST0000-osc-MDT0003.prealloc_next_id
CMD: trevis-15vm5 lctl get_param osc.*OST*-osc-MDT0003.prealloc*
osc.lustre-OST0000-osc-MDT0003.prealloc_last_id=129
osc.lustre-OST0000-osc-MDT0003.prealloc_last_seq=0x300000401
osc.lustre-OST0000-osc-MDT0003.prealloc_next_id=85
osc.lustre-OST0000-osc-MDT0003.prealloc_next_seq=0x300000401
osc.lustre-OST0000-osc-MDT0003.prealloc_reserved=0
osc.lustre-OST0000-osc-MDT0003.prealloc_status=0
…
striped dir -i3 -c2 /mnt/lustre/d27o.sanity/lustre-OST0000
CMD: trevis-15vm3 lctl set_param fail_val=-1 fail_loc=0x215
fail_val=-1
fail_loc=0x215
Creating to objid 129 on ost lustre-OST0000...
total: 46 open/close in 0.07 seconds: 615.37 ops/second

We don’t see OST0 fill up or return “No space left on device”. Unfortunately, we see the same thing for sanity test 27o when it passes.
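For reference, the per-OST exhaustion step described above amounts to roughly the following sequence (a minimal sketch pieced together from the description and the CMD lines in the logs, not the verbatim exhaust_precreations() code from sanity.sh; the host placement, indices and paths are illustrative):

# Exhaust precreated objects for one OST of one MDT (sketch; values are examples).
OSTIDX=1
MDTIDX=3
OSC=osc.lustre-OST000${OSTIDX}-osc-MDT000${MDTIDX}
TESTDIR=/mnt/lustre/d27o.sanity/lustre-OST000${OSTIDX}

# Read the precreation window on the MDS hosting MDT000${MDTIDX} (trevis-15vm5 above).
last_id=$(lctl get_param -n ${OSC}.prealloc_last_id)
next_id=$(lctl get_param -n ${OSC}.prealloc_next_id)

# Make a single-stripe directory pinned to this OST.
mkdir -p ${TESTDIR}
lfs setstripe -i ${OSTIDX} -c 1 ${TESTDIR}

# On the OSS, block further precreation (fail_loc=0x215, fail_val=-1 as in the log),
# then consume everything already precreated: (last_id - next_id + 2) creates.
lctl set_param fail_val=-1 fail_loc=0x215
createmany -o ${TESTDIR}/f ${next_id} $((last_id - next_id + 2))

The "Creating to objid ..." and "total: N open/close ..." lines in the excerpts above are the output of this createmany step.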

Although this might be expected due to the fail_loc, in the dmesg log for MDS1/3 we see

[ 1181.147030] Lustre: DEBUG MARKER: == sanity test 27o: create file with all full OSTs (should error) ==================================== 20:59:07 (1518123547)
[ 1181.756807] Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
[ 1194.103935] Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
[ 1205.580992] LustreError: 28563:0:(lod_qos.c:1352:lod_alloc_specific()) can't lstripe objid [0x2000013a2:0xf4be:0x0]: have 0 want 1
[ 1206.377129] Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
[ 1217.863538] LustreError: 30940:0:(lod_qos.c:1352:lod_alloc_specific()) can't lstripe objid [0x2000013a2:0xf4bf:0x0]: have 0 want 1
[ 1218.669304] Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
[ 1230.138926] LustreError: 28565:0:(lod_qos.c:1352:lod_alloc_specific()) can't lstripe objid [0x2000013a2:0xf4c0:0x0]: have 0 want 1
[ 1230.931360] Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
[ 1242.447227] LustreError: 32075:0:(lod_qos.c:1352:lod_alloc_specific()) can't lstripe objid [0x2000013a2:0xf4c1:0x0]: have 0 want 1
[ 1243.258795] Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
[ 1254.768848] LustreError: 32075:0:(lod_qos.c:1352:lod_alloc_specific()) can't lstripe objid [0x2000013a2:0xf4c2:0x0]: have 0 want 1
[ 1255.579674] Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
[ 1267.067337] LustreError: 30940:0:(lod_qos.c:1352:lod_alloc_specific()) can't lstripe objid [0x2000013a2:0xf4c3:0x0]: have 0 want 1
[ 1267.873867] Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
[ 1280.214333] Lustre: DEBUG MARKER: lctl get_param -n lov.*.qos_maxage
[ 1290.618993] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_27o: @@@@@@ FAIL: able to create \/mnt\/lustre\/d27o.sanity\/f27o.sanity 

sanity test 27o started failing with this error message on 2018-01-25 and, so far, has only failed in DNE testing.

Logs for failures are at
https://testing.hpdd.intel.com/test_sets/959d7148-1c58-11e8-a10a-52540065bddc



 Comments   
Comment by John Hammond [ 23/Sep/21 ]

I don't think this test is testing anything other than whether the test function exhaust_all_precreations() can reliably exhaust all precreations.
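If the goal is just to make that reliability explicit, one option (purely a sketch under the parameter paths shown in the logs above, not an existing patch) would be for test_27o to verify that every OSC of the MDT in use reports prealloc_status=-28 (-ENOSPC) before attempting the create that is expected to fail, e.g. on the MDS:

# Hypothetical pre-check: confirm precreations really are exhausted on every
# OST before running the create that must fail.
for st in $(lctl get_param -n osc.*OST*-osc-MDT0003.prealloc_status); do
    [ "$st" = "-28" ] || { echo "precreations not exhausted (status=$st)"; exit 1; }
done

In the failing run above, OST0 still shows prealloc_status=0, so a check like this would have flagged the incomplete exhaustion instead of letting the final create succeed.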
