[LU-10350] ost-pools test 1n fails with 'failed to write to /mnt/lustre/d1n.ost-pools/file: 1' Created: 07/Dec/17  Updated: 14/Jun/22  Resolved: 14/Jun/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.12.0, Lustre 2.10.3, Lustre 2.10.4, Lustre 2.10.5, Lustre 2.10.6, Lustre 2.12.1, Lustre 2.12.6
Fix Version/s: Lustre 2.12.7, Lustre 2.15.0

Type: Bug Priority: Critical
Reporter: James Nunez (Inactive) Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-9277 ost-pools test_19: createmany /mnt/lu... Resolved
Related
is related to LU-10396 ost-pools test_23b: dd did not fail w... Open
is related to LU-10353 parallel-scale* tests fail with ‘No s... Open
is related to LU-10689 parallel-scale-nfsv3 test_connectatho... Open
is related to LU-8264 lfs setstripe without -p pool_name do... Resolved
is related to LU-2113 ENOSPC sometimes incorrectly reported... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

ost-pools tests 1n, 11, 15, 16, 19, and 22 all fail when trying to create, open, or write files, with the following error message:

File too large

For example, from the test_log of test_1n

== ost-pools test 1n: Pool with a 15 char pool name works well ======================================= 10:03:28 (1512554608)
CMD: trevis-8vm4 lctl pool_new lustre.testpool1234567
trevis-8vm4: Pool lustre.testpool1234567 created
CMD: trevis-8vm4 lctl get_param -n lod.lustre-MDT0000-mdtlov.pools.testpool1234567 				2>/dev/null || echo foo
CMD: trevis-8vm4 lctl get_param -n lod.lustre-MDT0000-mdtlov.pools.testpool1234567 				2>/dev/null || echo foo
CMD: trevis-8vm1.trevis.hpdd.intel.com lctl get_param -n lov.lustre-*.pools.testpool1234567 		2>/dev/null || echo foo
CMD: trevis-8vm1.trevis.hpdd.intel.com lctl get_param -n lov.lustre-*.pools.testpool1234567 		2>/dev/null || echo foo
CMD: trevis-8vm4 lctl pool_add lustre.testpool1234567 OST0000
trevis-8vm4: OST lustre-OST0000_UUID added to pool lustre.testpool1234567
CMD: trevis-8vm4 lctl get_param -n lod.lustre-MDT0000-mdtlov.pools.testpool1234567 |
				sort -u | tr '\n' ' ' 
CMD: trevis-8vm4 lctl get_param -n lod.lustre-MDT0000-mdtlov.pools.testpool1234567 |
				sort -u | tr '\n' ' ' 
CMD: trevis-8vm1.trevis.hpdd.intel.com lctl get_param -n lov.lustre-*.pools.testpool1234567 |
		sort -u | tr '\n' ' ' 
CMD: trevis-8vm1.trevis.hpdd.intel.com lctl get_param -n lov.lustre-*.pools.testpool1234567 |
		sort -u | tr '\n' ' ' 
dd: failed to open '/mnt/lustre/d1n.ost-pools/file': File too large
 ost-pools test_1n: @@@@@@ FAIL: failed to write to /mnt/lustre/d1n.ost-pools/file: 1 

In the dmesg log for the MDS (vm4), we can see a failure

[18753.542095] Lustre: DEBUG MARKER: == ost-pools test 1n: Pool with a 15 char pool name works well ======================================= 13:37:10 (1512567430)
[18753.714379] Lustre: DEBUG MARKER: lctl pool_new lustre.testpool1234567
[18758.015205] Lustre: DEBUG MARKER: lctl get_param -n lod.lustre-MDT0000-mdtlov.pools.testpool1234567 				2>/dev/null || echo foo
[18758.331296] Lustre: DEBUG MARKER: lctl get_param -n lod.lustre-MDT0000-mdtlov.pools.testpool1234567 				2>/dev/null || echo foo
[18760.686719] Lustre: DEBUG MARKER: lctl pool_add lustre.testpool1234567 OST0000
[18766.993199] Lustre: DEBUG MARKER: lctl get_param -n lod.lustre-MDT0000-mdtlov.pools.testpool1234567 |
				sort -u | tr '\n' ' ' 
[18767.303867] Lustre: DEBUG MARKER: lctl get_param -n lod.lustre-MDT0000-mdtlov.pools.testpool1234567 |
				sort -u | tr '\n' ' ' 
[18768.515291] LustreError: 3750:0:(lod_qos.c:1350:lod_alloc_specific()) can't lstripe objid [0x200029443:0xdaad:0x0]: have 1 want 7
[18768.704524] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  ost-pools test_1n: @@@@@@ FAIL: failed to write to \/mnt\/lustre\/d1n.ost-pools\/file: 1 
[18768.896290] Lustre: DEBUG MARKER: ost-pools test_1n: @@@@@@ FAIL: failed to write to /mnt/lustre/d1n.ost-pools/file: 1
[18769.103049] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /home/autotest/autotest/logs/test_logs/2017-12-05/lustre-master-el7-x86_64--full--1_1_1__3676___6c155f47-820d-447d-893f-15b24418827f/ost-pools.test_1n.debug_log.$(hostname -s).1512567446.log;
         dmesg > /home/autotest/autotest/lo

and similar failures for the other tests. Note: there are 7 OSTs and 1 MDS for the following test suite:
https://testing.hpdd.intel.com/test_sets/fdd54642-dae4-11e7-8027-52540065bddc

These ost-pools tests started failing with the ‘File too large’ error on September 27, 2017 with 2.10.52.113.

Note: So far we are only seeing these failures during 'full' test sessions and not in review-* test sessions.

Logs for some of the other instances of this failure are at:
https://testing.hpdd.intel.com/test_sets/da2df238-db44-11e7-9c63-52540065bddc
https://testing.hpdd.intel.com/test_sets/4fc12420-daa0-11e7-9c63-52540065bddc
https://testing.hpdd.intel.com/test_sets/307880b4-da7c-11e7-9c63-52540065bddc
https://testing.hpdd.intel.com/test_sets/0e1cd21c-da73-11e7-8027-52540065bddc
https://testing.hpdd.intel.com/test_sets/c1f5d0c8-dadb-11e7-9c63-52540065bddc



 Comments   
Comment by Andreas Dilger [ 07/Dec/17 ]

The file create appears to be failing because a 7-stripe file was requested, but only 1 stripe could be created. We need at least 3/4 of the requested stripe count to consider the create successful.
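
As a rough shell illustration of that success threshold (the check itself lives in lod_alloc_specific() in lod_qos.c; the 'want' and 'have' values below are taken from the failure message):

# Sketch of the "at least 3/4 of the requested stripe count" rule;
# with have=1 and want=7 the create fails with -EFBIG (File too large).
want=7   # requested stripe count
have=1   # stripes actually allocated
if (( have * 4 >= want * 3 )); then
        echo "create succeeds with $have of $want stripes"
else
        echo "create fails with -EFBIG (File too large)"
fi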

The first thing to check is whether the debug log on the MDS has enough info to see why the MDS isn't able to create the requested stripes. It might be that leftovers from the previous tests have exhausted inodes on the OSTs.

Separately, it would be useful to make a debugging patch to enable full debugging for test_1n, to print lfs df and lfs df -i before the test is run, along with do_nodes $(comma_list $(mdts_nodes)) lctl get_param osp.*.prealloc_*_id to dump the OST object preallocation state before and after the test failure.
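
A minimal sketch of those commands, assuming the usual test-framework.sh helpers and that $MOUNT is the client mountpoint:

# Dump free space, free inodes, and the OST object preallocation state
# before and after the failing test.
lfs df $MOUNT
lfs df -i $MOUNT
do_nodes $(comma_list $(mdts_nodes)) lctl get_param osp.*.prealloc_*_id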

Comment by Gerrit Updater [ 07/Dec/17 ]

James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/30440
Subject: LU-10350 tests: get inode count for ost-pools test 1n
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2a01764becd66379e735a0db308bda2bac84b951

Comment by James Nunez (Inactive) [ 09/Dec/17 ]

ost-pools failed with the debug patch; https://testing.hpdd.intel.com/test_sets/4df8a486-dc82-11e7-9840-52540065bddc.

For test_1n, we print the free space and free inodes at the beginning of the test and on error. There's enough of both. prealloc_last_id and prealloc_next_id are also printed. Here's what we see in the client test_log:

== ost-pools test 1n: Pool with a 15 char pool name works well ======================================= 17:13:48 (1512753228)
CMD: trevis-33vm4 /usr/sbin/lctl get_param -n debug
CMD: trevis-33vm1.trevis.hpdd.intel.com,trevis-33vm2,trevis-33vm3,trevis-33vm4 /usr/sbin/lctl set_param debug_mb=150
debug_mb=150
debug_mb=150
debug_mb=150
debug_mb=150
CMD: trevis-33vm1.trevis.hpdd.intel.com,trevis-33vm2,trevis-33vm3,trevis-33vm4 /usr/sbin/lctl set_param debug=-1;
debug=-1
debug=-1
debug=-1
debug=-1
UUID                   1K-blocks        Used   Available Use% Mounted on
lustre-MDT0000_UUID      1165900       10980     1051724   1% /mnt/lustre[MDT:0]
lustre-OST0000_UUID     13745592       52880    12957832   0% /mnt/lustre[OST:0]
lustre-OST0001_UUID     13745592       44108    12966604   0% /mnt/lustre[OST:1]
lustre-OST0002_UUID     13745592       48732    12961980   0% /mnt/lustre[OST:2]
lustre-OST0003_UUID     13745592       46088    12964624   0% /mnt/lustre[OST:3]
lustre-OST0004_UUID     13745592       63636    12947076   0% /mnt/lustre[OST:4]
lustre-OST0005_UUID     13745592       45744    12964968   0% /mnt/lustre[OST:5]
lustre-OST0006_UUID     13745592       46824    12963888   0% /mnt/lustre[OST:6]

filesystem_summary:     96219144      348012    90726972   0% /mnt/lustre

UUID                      Inodes       IUsed       IFree IUse% Mounted on
lustre-MDT0000_UUID       838864         551      838313   0% /mnt/lustre[MDT:0]
lustre-OST0000_UUID       211200         293      210907   0% /mnt/lustre[OST:0]
lustre-OST0001_UUID       211200         291      210909   0% /mnt/lustre[OST:1]
lustre-OST0002_UUID       211200         285      210915   0% /mnt/lustre[OST:2]
lustre-OST0003_UUID       211200         284      210916   0% /mnt/lustre[OST:3]
lustre-OST0004_UUID       211200         294      210906   0% /mnt/lustre[OST:4]
lustre-OST0005_UUID       211200         291      210909   0% /mnt/lustre[OST:5]
lustre-OST0006_UUID       211200         292      210908   0% /mnt/lustre[OST:6]

filesystem_summary:       838864         551      838313   0% /mnt/lustre

CMD: trevis-33vm4 lctl get_param osp.*.prealloc_*_id
osp.lustre-OST0000-osc-MDT0000.prealloc_last_id=58697
osp.lustre-OST0000-osc-MDT0000.prealloc_next_id=58666
osp.lustre-OST0001-osc-MDT0000.prealloc_last_id=24385
osp.lustre-OST0001-osc-MDT0000.prealloc_next_id=24354
osp.lustre-OST0002-osc-MDT0000.prealloc_last_id=24353
osp.lustre-OST0002-osc-MDT0000.prealloc_next_id=24322
osp.lustre-OST0003-osc-MDT0000.prealloc_last_id=24321
osp.lustre-OST0003-osc-MDT0000.prealloc_next_id=24290
osp.lustre-OST0004-osc-MDT0000.prealloc_last_id=24321
osp.lustre-OST0004-osc-MDT0000.prealloc_next_id=24290
osp.lustre-OST0005-osc-MDT0000.prealloc_last_id=24321
osp.lustre-OST0005-osc-MDT0000.prealloc_next_id=24290
osp.lustre-OST0006-osc-MDT0000.prealloc_last_id=24289
osp.lustre-OST0006-osc-MDT0000.prealloc_next_id=24258
CMD: trevis-33vm4 lctl pool_new lustre.testpool1234567
trevis-33vm4: Pool lustre.testpool1234567 created
CMD: trevis-33vm4 lctl get_param -n lod.lustre-MDT0000-mdtlov.pools.testpool1234567 				2>/dev/null || echo foo
CMD: trevis-33vm4 lctl get_param -n lod.lustre-MDT0000-mdtlov.pools.testpool1234567 				2>/dev/null || echo foo
CMD: trevis-33vm1.trevis.hpdd.intel.com lctl get_param -n lov.lustre-*.pools.testpool1234567 		2>/dev/null || echo foo
CMD: trevis-33vm1.trevis.hpdd.intel.com lctl get_param -n lov.lustre-*.pools.testpool1234567 		2>/dev/null || echo foo
CMD: trevis-33vm4 lctl pool_add lustre.testpool1234567 OST0000
trevis-33vm4: OST lustre-OST0000_UUID added to pool lustre.testpool1234567
CMD: trevis-33vm4 lctl get_param -n lod.lustre-MDT0000-mdtlov.pools.testpool1234567 |
				sort -u | tr '\n' ' ' 
CMD: trevis-33vm4 lctl get_param -n lod.lustre-MDT0000-mdtlov.pools.testpool1234567 |
				sort -u | tr '\n' ' ' 
CMD: trevis-33vm1.trevis.hpdd.intel.com lctl get_param -n lov.lustre-*.pools.testpool1234567 |
		sort -u | tr '\n' ' ' 
CMD: trevis-33vm1.trevis.hpdd.intel.com lctl get_param -n lov.lustre-*.pools.testpool1234567 |
		sort -u | tr '\n' ' ' 
dd: failed to open '/mnt/lustre/d1n.ost-pools/file': File too large
UUID                   1K-blocks        Used   Available Use% Mounted on
lustre-MDT0000_UUID      1165900       10984     1051720   1% /mnt/lustre[MDT:0]
lustre-OST0000_UUID     13745592       52880    12957832   0% /mnt/lustre[OST:0]
lustre-OST0001_UUID     13745592       44108    12966604   0% /mnt/lustre[OST:1]
lustre-OST0002_UUID     13745592       48732    12961980   0% /mnt/lustre[OST:2]
lustre-OST0003_UUID     13745592       46088    12964624   0% /mnt/lustre[OST:3]
lustre-OST0004_UUID     13745592       63636    12947076   0% /mnt/lustre[OST:4]
lustre-OST0005_UUID     13745592       45744    12964968   0% /mnt/lustre[OST:5]
lustre-OST0006_UUID     13745592       46824    12963888   0% /mnt/lustre[OST:6]

filesystem_summary:     96219144      348012    90726972   0% /mnt/lustre

UUID                      Inodes       IUsed       IFree IUse% Mounted on
lustre-MDT0000_UUID       838864         552      838312   0% /mnt/lustre[MDT:0]
lustre-OST0000_UUID       211200         293      210907   0% /mnt/lustre[OST:0]
lustre-OST0001_UUID       211200         291      210909   0% /mnt/lustre[OST:1]
lustre-OST0002_UUID       211200         285      210915   0% /mnt/lustre[OST:2]
lustre-OST0003_UUID       211200         284      210916   0% /mnt/lustre[OST:3]
lustre-OST0004_UUID       211200         294      210906   0% /mnt/lustre[OST:4]
lustre-OST0005_UUID       211200         291      210909   0% /mnt/lustre[OST:5]
lustre-OST0006_UUID       211200         292      210908   0% /mnt/lustre[OST:6]

filesystem_summary:       838864         552      838312   0% /mnt/lustre

CMD: trevis-33vm4 lctl get_param osp.*.prealloc_*_id
osp.lustre-OST0000-osc-MDT0000.prealloc_last_id=58697
osp.lustre-OST0000-osc-MDT0000.prealloc_next_id=58666
osp.lustre-OST0001-osc-MDT0000.prealloc_last_id=24385
osp.lustre-OST0001-osc-MDT0000.prealloc_next_id=24354
osp.lustre-OST0002-osc-MDT0000.prealloc_last_id=24353
osp.lustre-OST0002-osc-MDT0000.prealloc_next_id=24322
osp.lustre-OST0003-osc-MDT0000.prealloc_last_id=24321
osp.lustre-OST0003-osc-MDT0000.prealloc_next_id=24290
osp.lustre-OST0004-osc-MDT0000.prealloc_last_id=24321
osp.lustre-OST0004-osc-MDT0000.prealloc_next_id=24290
osp.lustre-OST0005-osc-MDT0000.prealloc_last_id=24321
osp.lustre-OST0005-osc-MDT0000.prealloc_next_id=24290
osp.lustre-OST0006-osc-MDT0000.prealloc_last_id=24289
osp.lustre-OST0006-osc-MDT0000.prealloc_next_id=24258
CMD: trevis-33vm1.trevis.hpdd.intel.com,trevis-33vm2,trevis-33vm3,trevis-33vm4 /usr/sbin/lctl set_param debug_mb=4
debug_mb=4
debug_mb=4
debug_mb=4
debug_mb=4
Comment by Andreas Dilger [ 11/Dec/17 ]

Looking at the most recent logs, I'm wondering if there is some problem adding the OST(s) to the pool, which would cause an error when creating a file in a pool with no OSTs. I've added some more debugging to James' patch.

The debug logs have the -EFBIG = -27 error:
https://testing.hpdd.intel.com/test_logs/4e4d4982-dc82-11e7-9840-52540065bddc/show_text

00000004:00000001:0.0:1512753249.949213:0:30195:0:(lod_object.c:4453:lod_declare_striped_create()) Process entered
00020000:00000001:0.0:1512753249.949221:0:30195:0:(lod_qos.c:2253:lod_prepare_create()) Process entered
00020000:00001000:0.0:1512753249.949225:0:30195:0:(lod_qos.c:2298:lod_prepare_create()) 0 [0, 0)
00020000:00000001:0.0:1512753249.949226:0:30195:0:(lod_qos.c:2065:lod_qos_prep_create()) Process entered
00020000:00000001:0.0:1512753249.949227:0:30195:0:(lod_qos.c:270:lod_qos_statfs_update()) Process entered
00020000:00000001:0.0:1512753249.949229:0:30195:0:(lod_qos.c:195:lod_statfs_and_check()) Process entered
00000004:00001000:0.0:1512753249.949232:0:30195:0:(osp_dev.c:774:osp_statfs()) lustre-OST0000-osc-MDT0000: 3436398 blocks, 3423178 free, 3239390 avail, 211200 files, 210907 free files
00000004:00001000:0.0:1512753249.949237:0:30195:0:(osp_dev.c:774:osp_statfs()) lustre-OST0001-osc-MDT0000: 3436398 blocks, 3425371 free, 3241583 avail, 211200 files, 210909 free files
00000004:00001000:0.0:1512753249.949242:0:30195:0:(osp_dev.c:774:osp_statfs()) lustre-OST0002-osc-MDT0000: 3436398 blocks, 3424215 free, 3240427 avail, 211200 files, 210915 free files
00000004:00001000:0.0:1512753249.949245:0:30195:0:(osp_dev.c:774:osp_statfs()) lustre-OST0003-osc-MDT0000: 3436398 blocks, 3424876 free, 3241088 avail, 211200 files, 210916 free files
00000004:00001000:0.0:1512753249.949249:0:30195:0:(osp_dev.c:774:osp_statfs()) lustre-OST0004-osc-MDT0000: 3436398 blocks, 3420489 free, 3236701 avail, 211200 files, 210906 free files
00000004:00001000:0.0:1512753249.949252:0:30195:0:(osp_dev.c:774:osp_statfs()) lustre-OST0005-osc-MDT0000: 3436398 blocks, 3424962 free, 3241174 avail, 211200 files, 210909 free files
00000004:00001000:0.0:1512753249.949256:0:30195:0:(osp_dev.c:774:osp_statfs()) lustre-OST0006-osc-MDT0000: 3436398 blocks, 3424692 free, 3240904 avail, 211200 files, 210908 free files
00020000:00000001:0.0:1512753249.949258:0:30195:0:(lod_qos.c:296:lod_qos_statfs_update()) Process leaving
00020000:00001000:0.0:1512753249.949260:0:30195:0:(lod_qos.c:2101:lod_qos_prep_create()) tgt_count 7 stripe_count 7
00020000:00000001:0.0:1512753249.949260:0:30195:0:(lod_qos.c:1237:lod_alloc_specific()) Process entered
:
:
00020000:00020000:0.0:1512753249.949299:0:30195:0:(lod_qos.c:1350:lod_alloc_specific()) can't lstripe objid [0x2000599b1:0x2:0x0]: have 1 want 7
00020000:00000001:0.0:1512753249.953090:0:30195:0:(lod_qos.c:1359:lod_alloc_specific()) Process leaving (rc=18446744073709551589 : -27 : ffffffffffffffe5)
00020000:00000001:0.0:1512753249.953100:0:30195:0:(lod_qos.c:2157:lod_qos_prep_create()) Process leaving (rc=18446744073709551589 : -27 : ffffffffffffffe5)
00020000:00000001:0.0:1512753249.953101:0:30195:0:(lod_qos.c:2306:lod_prepare_create()) Process leaving (rc=18446744073709551589 : -27 : ffffffffffffffe5)
00000004:00000001:0.0:1512753249.953105:0:30195:0:(lod_object.c:4462:lod_declare_striped_create()) Process leaving via out (rc=18446744073709551589 : -27 : 0xffffffffffffffe5)
00000004:00000001:0.0:1512753249.953111:0:30195:0:(lod_object.c:4603:lod_declare_create()) Process leaving (rc=18446744073709551589 : -27 : ffffffffffffffe5)
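
As an aside, the huge rc printed in those lines is just -27 shown as an unsigned 64-bit value; bash arithmetic, which wraps at 64 bits, gives the same decoding:

echo $(( 18446744073709551589 ))   # prints -27, i.e. -EFBIG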
Comment by Andreas Dilger [ 11/Dec/17 ]

It looks like the problem is that there is only a single OST added to the pool:

CMD: trevis-35vm8 lctl pool_add lustre.testpool1234567 OST0000
trevis-35vm8: OST lustre-OST0000_UUID added to pool lustre.testpool1234567
Pools from lustre:
lustre.testpool1234567
Pool: lustre.testpool1234567
lustre-OST0000_UUID
dd: failed to open '/mnt/lustre/d1n.ost-pools/file': File too large
# lfs df -p
UUID                   1K-blocks        Used   Available Use% Mounted on
lustre-MDT0000_UUID      1165900       10752     1051952   1% /mnt/lustre[MDT:0]
lustre-OST0000_UUID     13745592       43056    12967656   0% /mnt/lustre[OST:0]

filesystem_summary:     13745592       43056    12967656   0% /mnt/lustre
Comment by Andreas Dilger [ 11/Dec/17 ]

More correctly, the problem appears to be that the filesystem default stripe count is 7, but there is only a single OST in the pool, which causes the test failure. So it doesn't look like the problem is in ost-pools.sh itself, but rather that some previous test is changing the default stripe count.
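
A quick way to check for that kind of state leaking between tests (hypothetical sketch; $MOUNT is the client mountpoint and the pool name matches the one used by test_1n):

# Compare the inherited default layout with the pool contents; a default
# stripe count of 7 (or -1 with 7 OSTs) against a 1-OST pool reproduces
# the 'have 1 want 7' failure above.
lfs getstripe -d $MOUNT
lctl pool_list lustre.testpool1234567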

Comment by James Nunez (Inactive) [ 11/Dec/17 ]

I ran ost-pools on my test system and it completed with no failures. I then ran sanity-pfl followed by ost-pools, and ost-pools test 1n failed with the 'File too large' error.

If you run sanity-pfl test 10 and then run ost-pools test 1n, you can trigger the error. On my system, before running sanity-pfl, the layout of the mount point looks like:

[root@trevis-58vm8 tests]# lfs getstripe /lustre/scratch/
/lustre/scratch/
stripe_count:  1 stripe_size:   1048576 pattern:        stripe_offset: -1

After running sanity-pfl test 10, we see that the pattern is now raid0

# lfs getstripe /lustre/scratch/
/lustre/scratch/
stripe_count:  1 stripe_size:   1048576 pattern:       raid0 stripe_offset: 0
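
So a minimal reproducer, using the same auster invocation style shown later in this comment thread, would be:

# Run from lustre/tests; the second run is expected to fail in test 1n
# with 'File too large'.
NAME=ncli ./auster -v sanity-pfl --only 10
NAME=ncli ./auster -v ost-pools --only 1n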
Comment by Andreas Dilger [ 12/Dec/17 ]

It would be useful to add a call to lfs getstripe -d $MOUNT and lfs getstripe -d $DIR to see what the default striping is at the end of sanity-pfl. It doesn't make sense that it would be 1 locally but 7 in autotest. Maybe that is a difference between your local test configuration and the autotest full config?

Comment by Andreas Dilger [ 12/Dec/17 ]

It does indeed seem that the addition of sanity-pfl to the full test list is the source of this problem - it was added to the autotest repo on Sept. 25th, just before the problems were first seen on Sept. 27th.

commit 4213c2cc5caad5abc9d4ac328f57df2836cdc605
Author:     colmstea <charlie.olmstead@intel.com>
AuthorDate: Mon Sep 25 09:51:54 2017 -0600
Commit:     Charlie Olmstead <charlie.olmstead@intel.com>
CommitDate: Mon Sep 25 15:53:54 2017 +0000

    ATM-675 - add sanity-pfl to autotest full test group
    
    added sanity-pfl to the full test group
    
    Change-Id: I50c0d197301c77687d9df7b20117990ac20a6394
    Reviewed-on: https://review.whamcloud.com/29192
Comment by James Nunez (Inactive) [ 12/Dec/17 ]

When I create a file system, the mount point pattern is blank and I, as root, can’t set the pattern on the mount point to raid0 or mdt:

# lfs getstripe /lustre/scratch/
/lustre/scratch/
stripe_count:  1 stripe_size:   1048576 pattern:        stripe_offset: -1

# lfs setstripe -L raid0 /lustre/scratch/
# lfs getstripe /lustre/scratch/
/lustre/scratch/
stripe_count:  1 stripe_size:   1048576 pattern:        stripe_offset: -1

# lfs setstripe -L mdt /lustre/scratch/
# lfs getstripe /lustre/scratch/
/lustre/scratch/
stripe_count:  1 stripe_size:   1048576 pattern:        stripe_offset: -1

Yet, sanity-pfl test_10 does change the pattern on the mount point to the default 'raid0' (and this answers Andreas' question about what the default striping is after sanity-pfl):

# lfs getstripe /lustre/scratch/
/lustre/scratch/
stripe_count:  1 stripe_size:   1048576 pattern:        stripe_offset: -1

# NAME=ncli ./auster -k -v sanity-pfl --only 10
Started at Tue Dec 12 16:15:25 UTC 2017
…
PASS 10 (3s)
== sanity-pfl test complete, duration 14 sec ========================================================= 16:15:46 (1513095346)
sanity-pfl returned 0
Finished at Tue Dec 12 16:15:46 UTC 2017 in 21s
./auster: completed with rc 0

# lfs getstripe /lustre/scratch/
/lustre/scratch/
stripe_count:  1 stripe_size:   1048576 pattern:       raid0 stripe_offset: 0

and I can set the mount point pattern back to ‘blank’

# lfs getstripe /lustre/scratch/
/lustre/scratch/
stripe_count:  1 stripe_size:   1048576 pattern:       raid0 stripe_offset: 0

# lfs setstripe -d /lustre/scratch/
# lfs getstripe /lustre/scratch/
/lustre/scratch/
stripe_count:  1 stripe_size:   1048576 pattern:        stripe_offset: -1

sanity-pfl test 10 gets the layout of the mount point using get_layout_param()/parse_layout_param(), but these functions don't take the directory's pattern into account, meaning they don't capture the file/dir pattern (the --layout parameter). When the saved layout is applied again without a pattern, it defaults to raid0. A sketch of a pattern-aware helper is below.
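
This is illustrative only, not the actual patch (the real change is https://review.whamcloud.com/30636); the helper name is hypothetical:

# Hypothetical helper: also capture the 'pattern:' field of the default
# layout so it can be restored later with 'lfs setstripe -L <pattern>'.
get_layout_pattern() {
        lfs getstripe -d "$1" | awk '{
                for (i = 1; i < NF; i++)
                        if ($i == "pattern:" && $(i+1) != "stripe_offset:")
                                print $(i+1)
        }'
}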

We really want the mount point pattern to remain the same before and after sanity-pfl. Do we want to allow the user to set the pattern on the mount point?

Comment by Joseph Gmitter (Inactive) [ 12/Dec/17 ]

Hi Lai,

Can you please look into this one?

Thanks.
Joe

Comment by Andreas Dilger [ 12/Dec/17 ]

It isn't clear if we want to allow only the pattern to be set on the mountpoint, since a raw "mdt" layout on the root is mostly useless unless the filesystem has only MDTs, no OSTs (we can cross that bridge when we get to it, there will be other fixes needed as well). Instead, it makes sense to set a PFL layout with mdt as the first component.
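
For example, with Data-on-MDT that would look something like this (the path and component sizes are illustrative):

# Illustrative PFL layout: first 1MB of each file on the MDT, the
# remainder striped across OSTs.
lfs setstripe -E 1M -L mdt -E -1 -c 1 /mnt/lustre/domdir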

What is strange/broken in ost-pools test_1n is that the test uses create_dir to set the stripe count to -1 (as it always has) in a pool with only 1 OST (as there always has been), but this now fails when trying to create 7 stripes on the file. It should limit the stripe count to the number of OSTs in the pool; see the sketch below.
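
A userspace illustration of that clamp (the eventual kernel fix is in lod_qos.c, https://review.whamcloud.com/43882; the values here are hypothetical):

# Never request more stripes than the pool can provide: clamp the
# desired count to the number of OSTs in the pool.
pool_osts=$(lctl pool_list lustre.testpool1234567 | grep -c OST)
want=7
stripe_count=$(( want < pool_osts ? want : pool_osts ))
lfs setstripe -c $stripe_count -p testpool1234567 /mnt/lustre/d1n.ost-pools/file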

Comment by Gerrit Updater [ 21/Dec/17 ]

James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/30636
Subject: LU-10350 tests: make parsing routines pattern aware
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 36e0f170359a231b8b75e1c49deb8595f61ddb84

Comment by James Nunez (Inactive) [ 21/Dec/17 ]

The patch at https://review.whamcloud.com/30636 only modifies the parsing routines that sanity-pfl test 10 uses. When sanity-pfl test_10 is run, this patch should restore all of the original parameters on the mount point and thus stop several test failures, including most (all?) of the recent/new ost-pools.sh test failures.

This patch does not address the OST pools issues that Andreas has commented on in this ticket.

Comment by Gerrit Updater [ 14/Jan/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30636/
Subject: LU-10350 tests: make parsing routines pattern aware
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 503a78bde8a59e176356a02b2d078332e3201575

Comment by Peter Jones [ 14/Jan/18 ]

Landed for 2.11

Comment by James Nunez (Inactive) [ 15/Feb/18 ]

Reopening this issue because we are seeing it, or something closely related, in recent full testing. One example of a recent failure is at:
https://testing.hpdd.intel.com/test_sets/b29773fa-10e3-11e8-bd00-52540065bddc

Comment by Sarah Liu [ 21/Feb/18 ]

+1 on master, tag-2.10.58
https://testing.hpdd.intel.com/test_sets/8d2359e6-1132-11e8-a6ad-52540065bddc

Comment by Minh Diep [ 12/Mar/18 ]

+1 on b2_10

https://testing.hpdd.intel.com/test_sets/7d4a2422-23da-11e8-8d2f-52540065bddc

Comment by Gerrit Updater [ 18/Apr/18 ]

James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/32048
Subject: LU-10350 tests: make parsing routines pattern aware
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 94a3d12c3ce299a519809ee5c4f36e941c202fa2

Comment by Gerrit Updater [ 03/May/18 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/32048/
Subject: LU-10350 tests: make parsing routines pattern aware
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 563b643c089cc651d12e82e4af84e1ff8c643b6b

Comment by Sarah Liu [ 18/May/18 ]

Still hit this on 2.10.4 EL7 server with EL6.9 client

https://testing.hpdd.intel.com/test_sets/010bc5d2-599a-11e8-b9d3-52540065bddc

Comment by Sarah Liu [ 06/Aug/18 ]

hit this again on 2.10.5 ldiskfs DNE
https://testing.whamcloud.com/test_sets/e80a3ce8-994b-11e8-b0aa-52540065bddc

Comment by James Nunez (Inactive) [ 12/Dec/18 ]

We're seeing parallel-scale-nfsv3 and parallel-scale-nfsv4 test_compilebench fail with ‘IOError: [Errno 27] File too large’ and

[102528.920205] LustreError: 26259:0:(lod_qos.c:1438:lod_alloc_specific()) can't lstripe objid [0x200022ac9:0x8e8b:0x0]: have 7 want 8

in the MDS dmesg. It looks like this is the same issue as reported here.

Logs are at (all use zfs):
https://testing.whamcloud.com/test_sets/2f3b07b8-fd9d-11e8-b837-52540065bddc
https://testing.whamcloud.com/test_sets/2252d69e-f752-11e8-b67f-52540065bddc
https://testing.whamcloud.com/test_sets/776270a4-f518-11e8-86c0-52540065bddc
https://testing.whamcloud.com/test_sets/77290d46-f518-11e8-86c0-52540065bddc

Comment by Gerrit Updater [ 31/May/21 ]

Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/43882
Subject: LU-10350 lod: adjust stripe count to available ost count
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 725e920db06d8f61a3a107231539f44fca8638e4

Comment by Gerrit Updater [ 10/Jun/21 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43976
Subject: LU-10350 lod: adjust stripe count to available ost count
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 1a8f62ed7fe82fa7b2f9a76b9bc9f7a7f621d2ef

Comment by Gerrit Updater [ 11/Jun/21 ]

Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/43976/
Subject: LU-10350 lod: adjust stripe count to available ost count
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 670d78952901183012ae08f2b5e9374d6e293bcf

Comment by Gerrit Updater [ 14/Jun/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43882/
Subject: LU-10350 lod: adjust stripe count to available ost count
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f430ec079bf882744729d7aabc2021dfd26aba0c
