[LU-10350] ost-pools test 1n fails with 'failed to write to /mnt/lustre/d1n.ost-pools/file: 1' Created: 07/Dec/17 Updated: 14/Jun/22 Resolved: 14/Jun/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0, Lustre 2.12.0, Lustre 2.10.3, Lustre 2.10.4, Lustre 2.10.5, Lustre 2.10.6, Lustre 2.12.1, Lustre 2.12.6 |
| Fix Version/s: | Lustre 2.12.7, Lustre 2.15.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | James Nunez (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
ost-pools tests 1n, 11, 15, 16, 19 and 22 all fail trying to create/open or write files with the error message "File too large". For example, from the test_log of test_1n:

== ost-pools test 1n: Pool with a 15 char pool name works well ======================================= 10:03:28 (1512554608)
CMD: trevis-8vm4 lctl pool_new lustre.testpool1234567
trevis-8vm4: Pool lustre.testpool1234567 created
CMD: trevis-8vm4 lctl get_param -n lod.lustre-MDT0000-mdtlov.pools.testpool1234567 2>/dev/null || echo foo
CMD: trevis-8vm4 lctl get_param -n lod.lustre-MDT0000-mdtlov.pools.testpool1234567 2>/dev/null || echo foo
CMD: trevis-8vm1.trevis.hpdd.intel.com lctl get_param -n lov.lustre-*.pools.testpool1234567 2>/dev/null || echo foo
CMD: trevis-8vm1.trevis.hpdd.intel.com lctl get_param -n lov.lustre-*.pools.testpool1234567 2>/dev/null || echo foo
CMD: trevis-8vm4 lctl pool_add lustre.testpool1234567 OST0000
trevis-8vm4: OST lustre-OST0000_UUID added to pool lustre.testpool1234567
CMD: trevis-8vm4 lctl get_param -n lod.lustre-MDT0000-mdtlov.pools.testpool1234567 | sort -u | tr '\n' ' '
CMD: trevis-8vm4 lctl get_param -n lod.lustre-MDT0000-mdtlov.pools.testpool1234567 | sort -u | tr '\n' ' '
CMD: trevis-8vm1.trevis.hpdd.intel.com lctl get_param -n lov.lustre-*.pools.testpool1234567 | sort -u | tr '\n' ' '
CMD: trevis-8vm1.trevis.hpdd.intel.com lctl get_param -n lov.lustre-*.pools.testpool1234567 | sort -u | tr '\n' ' '
dd: failed to open '/mnt/lustre/d1n.ost-pools/file': File too large
 ost-pools test_1n: @@@@@@ FAIL: failed to write to /mnt/lustre/d1n.ost-pools/file: 1

In the dmesg log for the MDS (vm4), we can see a failure:

[18753.542095] Lustre: DEBUG MARKER: == ost-pools test 1n: Pool with a 15 char pool name works well ======================================= 13:37:10 (1512567430)
[18753.714379] Lustre: DEBUG MARKER: lctl pool_new lustre.testpool1234567
[18758.015205] Lustre: DEBUG MARKER: lctl get_param -n lod.lustre-MDT0000-mdtlov.pools.testpool1234567 2>/dev/null || echo foo
[18758.331296] Lustre: DEBUG MARKER: lctl get_param -n lod.lustre-MDT0000-mdtlov.pools.testpool1234567 2>/dev/null || echo foo
[18760.686719] Lustre: DEBUG MARKER: lctl pool_add lustre.testpool1234567 OST0000
[18766.993199] Lustre: DEBUG MARKER: lctl get_param -n lod.lustre-MDT0000-mdtlov.pools.testpool1234567 |
sort -u | tr '\n' ' '
[18767.303867] Lustre: DEBUG MARKER: lctl get_param -n lod.lustre-MDT0000-mdtlov.pools.testpool1234567 |
sort -u | tr '\n' ' '
[18768.515291] LustreError: 3750:0:(lod_qos.c:1350:lod_alloc_specific()) can't lstripe objid [0x200029443:0xdaad:0x0]: have 1 want 7
[18768.704524] Lustre: DEBUG MARKER: /usr/sbin/lctl mark ost-pools test_1n: @@@@@@ FAIL: failed to write to \/mnt\/lustre\/d1n.ost-pools\/file: 1
[18768.896290] Lustre: DEBUG MARKER: ost-pools test_1n: @@@@@@ FAIL: failed to write to /mnt/lustre/d1n.ost-pools/file: 1
[18769.103049] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /home/autotest/autotest/logs/test_logs/2017-12-05/lustre-master-el7-x86_64--full--1_1_1__3676___6c155f47-820d-447d-893f-15b24418827f/ost-pools.test_1n.debug_log.$(hostname -s).1512567446.log;
dmesg > /home/autotest/autotest/lo
We see similar failures for the other tests. Note: there are 7 OSTs and 1 MDS for this test suite. These ost-pools tests started failing with the 'File too large' error on September 27, 2017 with 2.10.52.113. Note: so far we are only seeing these failures during 'full' test sessions and not in review-* test sessions. Logs for some of the other instances of this failure are at: |
| Comments |
| Comment by Andreas Dilger [ 07/Dec/17 ] |
|
The file create appears to be failing because a 7-stripe file was requested, but only 1 stripe could be created. We need at least 3/4 of the requested stripe count to consider the create successful. The first thing to check is whether the debug log on the MDS has enough info to see why the MDS isn't able to create the requested stripes. It might be some leftovers from the previous tests that have exhausted inodes on the OSTs? Separately, it would be useful to make a debugging patch that enables full debugging for test_1a, prints lfs df and lfs df -i before the test is run, and runs do_nodes $(comma_list $(mdts_nodes)) lctl get_param osp.*.prealloc_*_id to dump the OST object preallocation state before and after the test failure.
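A minimal sketch of those debugging additions, assuming the standard test-framework helpers and variables (do_nodes, comma_list, mdts_nodes, $LFS, $LCTL, $MOUNT); the helper name is illustrative only:

dump_pool_alloc_state() {       # hypothetical helper for ost-pools.sh
        $LFS df $MOUNT          # free space per MDT/OST
        $LFS df -i $MOUNT       # free inodes per MDT/OST
        # OST object preallocation state as seen by every MDS
        do_nodes $(comma_list $(mdts_nodes)) \
                $LCTL get_param osp.*.prealloc_*_id
}

# call once before the test body and again from the failure path
dump_pool_alloc_state
|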
| Comment by Gerrit Updater [ 07/Dec/17 ] |
|
James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/30440 |
| Comment by James Nunez (Inactive) [ 09/Dec/17 ] |
|
ost-pools failed with the debug patch; https://testing.hpdd.intel.com/test_sets/4df8a486-dc82-11e7-9840-52540065bddc. For test_1n, we print the free space and free inodes at the beginning of the test and on error. There's enough of both. prealloc_last_id and prealloc_next_id are also printed. Here's what we see in the client test_log:

== ost-pools test 1n: Pool with a 15 char pool name works well ======================================= 17:13:48 (1512753228)
CMD: trevis-33vm4 /usr/sbin/lctl get_param -n debug
CMD: trevis-33vm1.trevis.hpdd.intel.com,trevis-33vm2,trevis-33vm3,trevis-33vm4 /usr/sbin/lctl set_param debug_mb=150
debug_mb=150
debug_mb=150
debug_mb=150
debug_mb=150
CMD: trevis-33vm1.trevis.hpdd.intel.com,trevis-33vm2,trevis-33vm3,trevis-33vm4 /usr/sbin/lctl set_param debug=-1;
debug=-1
debug=-1
debug=-1
debug=-1
UUID 1K-blocks Used Available Use% Mounted on
lustre-MDT0000_UUID 1165900 10980 1051724 1% /mnt/lustre[MDT:0]
lustre-OST0000_UUID 13745592 52880 12957832 0% /mnt/lustre[OST:0]
lustre-OST0001_UUID 13745592 44108 12966604 0% /mnt/lustre[OST:1]
lustre-OST0002_UUID 13745592 48732 12961980 0% /mnt/lustre[OST:2]
lustre-OST0003_UUID 13745592 46088 12964624 0% /mnt/lustre[OST:3]
lustre-OST0004_UUID 13745592 63636 12947076 0% /mnt/lustre[OST:4]
lustre-OST0005_UUID 13745592 45744 12964968 0% /mnt/lustre[OST:5]
lustre-OST0006_UUID 13745592 46824 12963888 0% /mnt/lustre[OST:6]
filesystem_summary: 96219144 348012 90726972 0% /mnt/lustre
UUID Inodes IUsed IFree IUse% Mounted on
lustre-MDT0000_UUID 838864 551 838313 0% /mnt/lustre[MDT:0]
lustre-OST0000_UUID 211200 293 210907 0% /mnt/lustre[OST:0]
lustre-OST0001_UUID 211200 291 210909 0% /mnt/lustre[OST:1]
lustre-OST0002_UUID 211200 285 210915 0% /mnt/lustre[OST:2]
lustre-OST0003_UUID 211200 284 210916 0% /mnt/lustre[OST:3]
lustre-OST0004_UUID 211200 294 210906 0% /mnt/lustre[OST:4]
lustre-OST0005_UUID 211200 291 210909 0% /mnt/lustre[OST:5]
lustre-OST0006_UUID 211200 292 210908 0% /mnt/lustre[OST:6]
filesystem_summary: 838864 551 838313 0% /mnt/lustre
CMD: trevis-33vm4 lctl get_param osp.*.prealloc_*_id
osp.lustre-OST0000-osc-MDT0000.prealloc_last_id=58697
osp.lustre-OST0000-osc-MDT0000.prealloc_next_id=58666
osp.lustre-OST0001-osc-MDT0000.prealloc_last_id=24385
osp.lustre-OST0001-osc-MDT0000.prealloc_next_id=24354
osp.lustre-OST0002-osc-MDT0000.prealloc_last_id=24353
osp.lustre-OST0002-osc-MDT0000.prealloc_next_id=24322
osp.lustre-OST0003-osc-MDT0000.prealloc_last_id=24321
osp.lustre-OST0003-osc-MDT0000.prealloc_next_id=24290
osp.lustre-OST0004-osc-MDT0000.prealloc_last_id=24321
osp.lustre-OST0004-osc-MDT0000.prealloc_next_id=24290
osp.lustre-OST0005-osc-MDT0000.prealloc_last_id=24321
osp.lustre-OST0005-osc-MDT0000.prealloc_next_id=24290
osp.lustre-OST0006-osc-MDT0000.prealloc_last_id=24289
osp.lustre-OST0006-osc-MDT0000.prealloc_next_id=24258
CMD: trevis-33vm4 lctl pool_new lustre.testpool1234567
trevis-33vm4: Pool lustre.testpool1234567 created
CMD: trevis-33vm4 lctl get_param -n lod.lustre-MDT0000-mdtlov.pools.testpool1234567 2>/dev/null || echo foo
CMD: trevis-33vm4 lctl get_param -n lod.lustre-MDT0000-mdtlov.pools.testpool1234567 2>/dev/null || echo foo
CMD: trevis-33vm1.trevis.hpdd.intel.com lctl get_param -n lov.lustre-*.pools.testpool1234567 2>/dev/null || echo foo
CMD: trevis-33vm1.trevis.hpdd.intel.com lctl get_param -n lov.lustre-*.pools.testpool1234567 2>/dev/null || echo foo
CMD: trevis-33vm4 lctl pool_add lustre.testpool1234567 OST0000
trevis-33vm4: OST lustre-OST0000_UUID added to pool lustre.testpool1234567
CMD: trevis-33vm4 lctl get_param -n lod.lustre-MDT0000-mdtlov.pools.testpool1234567 | sort -u | tr '\n' ' '
CMD: trevis-33vm4 lctl get_param -n lod.lustre-MDT0000-mdtlov.pools.testpool1234567 | sort -u | tr '\n' ' '
CMD: trevis-33vm1.trevis.hpdd.intel.com lctl get_param -n lov.lustre-*.pools.testpool1234567 | sort -u | tr '\n' ' '
CMD: trevis-33vm1.trevis.hpdd.intel.com lctl get_param -n lov.lustre-*.pools.testpool1234567 | sort -u | tr '\n' ' '
dd: failed to open '/mnt/lustre/d1n.ost-pools/file': File too large
UUID 1K-blocks Used Available Use% Mounted on
lustre-MDT0000_UUID 1165900 10984 1051720 1% /mnt/lustre[MDT:0]
lustre-OST0000_UUID 13745592 52880 12957832 0% /mnt/lustre[OST:0]
lustre-OST0001_UUID 13745592 44108 12966604 0% /mnt/lustre[OST:1]
lustre-OST0002_UUID 13745592 48732 12961980 0% /mnt/lustre[OST:2]
lustre-OST0003_UUID 13745592 46088 12964624 0% /mnt/lustre[OST:3]
lustre-OST0004_UUID 13745592 63636 12947076 0% /mnt/lustre[OST:4]
lustre-OST0005_UUID 13745592 45744 12964968 0% /mnt/lustre[OST:5]
lustre-OST0006_UUID 13745592 46824 12963888 0% /mnt/lustre[OST:6]
filesystem_summary: 96219144 348012 90726972 0% /mnt/lustre
UUID Inodes IUsed IFree IUse% Mounted on
lustre-MDT0000_UUID 838864 552 838312 0% /mnt/lustre[MDT:0]
lustre-OST0000_UUID 211200 293 210907 0% /mnt/lustre[OST:0]
lustre-OST0001_UUID 211200 291 210909 0% /mnt/lustre[OST:1]
lustre-OST0002_UUID 211200 285 210915 0% /mnt/lustre[OST:2]
lustre-OST0003_UUID 211200 284 210916 0% /mnt/lustre[OST:3]
lustre-OST0004_UUID 211200 294 210906 0% /mnt/lustre[OST:4]
lustre-OST0005_UUID 211200 291 210909 0% /mnt/lustre[OST:5]
lustre-OST0006_UUID 211200 292 210908 0% /mnt/lustre[OST:6]
filesystem_summary: 838864 552 838312 0% /mnt/lustre
CMD: trevis-33vm4 lctl get_param osp.*.prealloc_*_id
osp.lustre-OST0000-osc-MDT0000.prealloc_last_id=58697
osp.lustre-OST0000-osc-MDT0000.prealloc_next_id=58666
osp.lustre-OST0001-osc-MDT0000.prealloc_last_id=24385
osp.lustre-OST0001-osc-MDT0000.prealloc_next_id=24354
osp.lustre-OST0002-osc-MDT0000.prealloc_last_id=24353
osp.lustre-OST0002-osc-MDT0000.prealloc_next_id=24322
osp.lustre-OST0003-osc-MDT0000.prealloc_last_id=24321
osp.lustre-OST0003-osc-MDT0000.prealloc_next_id=24290
osp.lustre-OST0004-osc-MDT0000.prealloc_last_id=24321
osp.lustre-OST0004-osc-MDT0000.prealloc_next_id=24290
osp.lustre-OST0005-osc-MDT0000.prealloc_last_id=24321
osp.lustre-OST0005-osc-MDT0000.prealloc_next_id=24290
osp.lustre-OST0006-osc-MDT0000.prealloc_last_id=24289
osp.lustre-OST0006-osc-MDT0000.prealloc_next_id=24258
CMD: trevis-33vm1.trevis.hpdd.intel.com,trevis-33vm2,trevis-33vm3,trevis-33vm4 /usr/sbin/lctl set_param debug_mb=4
debug_mb=4
debug_mb=4
debug_mb=4
debug_mb=4
|
| Comment by Andreas Dilger [ 11/Dec/17 ] |
|
Looking at the most recent logs, I'm wondering if there is some problem adding the OST(s) to the pool, which causes an error creating a file in a pool with no OSTs? I've added some more debugging to James' patch. The debug logs have the -EFBIG = -27 error:

0000004:00000001:0.0:1512753249.949213:0:30195:0:(lod_object.c:4453:lod_declare_striped_create()) Process entered
00020000:00000001:0.0:1512753249.949221:0:30195:0:(lod_qos.c:2253:lod_prepare_create()) Process entered
00020000:00001000:0.0:1512753249.949225:0:30195:0:(lod_qos.c:2298:lod_prepare_create()) 0 [0, 0)
00020000:00000001:0.0:1512753249.949226:0:30195:0:(lod_qos.c:2065:lod_qos_prep_create()) Process entered
00020000:00000001:0.0:1512753249.949227:0:30195:0:(lod_qos.c:270:lod_qos_statfs_update()) Process entered
00020000:00000001:0.0:1512753249.949229:0:30195:0:(lod_qos.c:195:lod_statfs_and_check()) Process entered
00000004:00001000:0.0:1512753249.949232:0:30195:0:(osp_dev.c:774:osp_statfs()) lustre-OST0000-osc-MDT0000: 3436398 blocks, 3423178 free, 3239390 avail, 211200 files, 210907 free files
00000004:00001000:0.0:1512753249.949237:0:30195:0:(osp_dev.c:774:osp_statfs()) lustre-OST0001-osc-MDT0000: 3436398 blocks, 3425371 free, 3241583 avail, 211200 files, 210909 free files
00000004:00001000:0.0:1512753249.949242:0:30195:0:(osp_dev.c:774:osp_statfs()) lustre-OST0002-osc-MDT0000: 3436398 blocks, 3424215 free, 3240427 avail, 211200 files, 210915 free files
00000004:00001000:0.0:1512753249.949245:0:30195:0:(osp_dev.c:774:osp_statfs()) lustre-OST0003-osc-MDT0000: 3436398 blocks, 3424876 free, 3241088 avail, 211200 files, 210916 free files
00000004:00001000:0.0:1512753249.949249:0:30195:0:(osp_dev.c:774:osp_statfs()) lustre-OST0004-osc-MDT0000: 3436398 blocks, 3420489 free, 3236701 avail, 211200 files, 210906 free files
00000004:00001000:0.0:1512753249.949252:0:30195:0:(osp_dev.c:774:osp_statfs()) lustre-OST0005-osc-MDT0000: 3436398 blocks, 3424962 free, 3241174 avail, 211200 files, 210909 free files
00000004:00001000:0.0:1512753249.949256:0:30195:0:(osp_dev.c:774:osp_statfs()) lustre-OST0006-osc-MDT0000: 3436398 blocks, 3424692 free, 3240904 avail, 211200 files, 210908 free files
00020000:00000001:0.0:1512753249.949258:0:30195:0:(lod_qos.c:296:lod_qos_statfs_update()) Process leaving
00020000:00001000:0.0:1512753249.949260:0:30195:0:(lod_qos.c:2101:lod_qos_prep_create()) tgt_count 7 stripe_count 7
00020000:00000001:0.0:1512753249.949260:0:30195:0:(lod_qos.c:1237:lod_alloc_specific()) Process entered
:
:
00020000:00020000:0.0:1512753249.949299:0:30195:0:(lod_qos.c:1350:lod_alloc_specific()) can't lstripe objid [0x2000599b1:0x2:0x0]: have 1 want 7
00020000:00000001:0.0:1512753249.953090:0:30195:0:(lod_qos.c:1359:lod_alloc_specific()) Process leaving (rc=18446744073709551589 : -27 : ffffffffffffffe5)
00020000:00000001:0.0:1512753249.953100:0:30195:0:(lod_qos.c:2157:lod_qos_prep_create()) Process leaving (rc=18446744073709551589 : -27 : ffffffffffffffe5)
00020000:00000001:0.0:1512753249.953101:0:30195:0:(lod_qos.c:2306:lod_prepare_create()) Process leaving (rc=18446744073709551589 : -27 : ffffffffffffffe5)
00000004:00000001:0.0:1512753249.953105:0:30195:0:(lod_object.c:4462:lod_declare_striped_create()) Process leaving via out (rc=18446744073709551589 : -27 : 0xffffffffffffffe5)
00000004:00000001:0.0:1512753249.953111:0:30195:0:(lod_object.c:4603:lod_declare_create()) Process leaving (rc=18446744073709551589 : -27 : ffffffffffffffe5)
|
| Comment by Andreas Dilger [ 11/Dec/17 ] |
|
It looks like the problem is that there is only a single OST added to the pool:

CMD: trevis-35vm8 lctl pool_add lustre.testpool1234567 OST0000
trevis-35vm8: OST lustre-OST0000_UUID added to pool lustre.testpool1234567
Pools from lustre:
lustre.testpool1234567
Pool: lustre.testpool1234567
lustre-OST0000_UUID
dd: failed to open '/mnt/lustre/d1n.ost-pools/file': File too large
# lfs df -p
UUID 1K-blocks Used Available Use% Mounted on
lustre-MDT0000_UUID 1165900 10752 1051952 1% /mnt/lustre[MDT:0]
lustre-OST0000_UUID 13745592 43056 12967656 0% /mnt/lustre[OST:0]
filesystem_summary: 13745592 43056 12967656 0% /mnt/lustre
|
| Comment by Andreas Dilger [ 11/Dec/17 ] |
|
More correctly, the problem appears to be that the filesystem default stripe count is 7, but there is only a single OST in the pool, which causes the test failure. So it doesn't look like the problem is in ost-pools.sh itself, but some previous test is changing the default stripe count. |
| Comment by James Nunez (Inactive) [ 11/Dec/17 ] |
|
I ran ost-pools on my test system and it completed with no failures. I then ran sanity-pfl followed by ost-pools, and ost-pools test 1n failed with the 'File too large' error. If you run sanity-pfl test 10 and then run ost-pools test 1n, you can trigger the error.

On my system, before running sanity-pfl, the layout of the mount point looks like:

[root@trevis-58vm8 tests]# lfs getstripe /lustre/scratch/
/lustre/scratch/
stripe_count: 1 stripe_size: 1048576 pattern: stripe_offset: -1

After running sanity-pfl test 10, we see that the pattern is now raid0:

# lfs getstripe /lustre/scratch/
/lustre/scratch/
stripe_count: 1 stripe_size: 1048576 pattern: raid0 stripe_offset: 0
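The reproducer condenses to the following sketch, assuming auster is run from the lustre/tests directory of the installed test suite and the filesystem is mounted on /mnt/lustre:

# run sanity-pfl test 10 first, then ost-pools test 1n
NAME=ncli ./auster -k -v sanity-pfl --only 10
NAME=ncli ./auster -k -v ost-pools --only 1n    # fails with "File too large"

# the mount point default layout now shows pattern raid0 / stripe_offset 0
lfs getstripe -d /mnt/lustre
|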
| Comment by Andreas Dilger [ 12/Dec/17 ] |
|
It would be useful to add a call to lfs getstripe -d $MOUNT and lfs getstripe -d $DIR to see what the default striping is at the end of sanity-pfl. It doesn't make sense that it would be 1 on your system but 7 in autotest. Maybe that is a difference between your local test configuration and the autotest full config?
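A sketch of that check, using the standard $MOUNT and $DIR test-framework variables named above; it could be added at the end of sanity-pfl (or its cleanup path) purely for debugging:

# dump the default striping left behind on the mount point and the test dir
lfs getstripe -d $MOUNT
lfs getstripe -d $DIR
|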
| Comment by Andreas Dilger [ 12/Dec/17 ] |
|
It does indeed seem that the addition of sanity-pfl to the full test list is the source of this problem - it was added to the autotest repo on Sept. 25th, just before the problems were first seen on Sept. 27th.

commit 4213c2cc5caad5abc9d4ac328f57df2836cdc605
Author: colmstea <charlie.olmstead@intel.com>
AuthorDate: Mon Sep 25 09:51:54 2017 -0600
Commit: Charlie Olmstead <charlie.olmstead@intel.com>
CommitDate: Mon Sep 25 15:53:54 2017 +0000
ATM-675 - add sanity-pfl to autotest full test group
added sanity-pfl to the full test group
Change-Id: I50c0d197301c77687d9df7b20117990ac20a6394
Reviewed-on: https://review.whamcloud.com/29192
|
| Comment by James Nunez (Inactive) [ 12/Dec/17 ] |
|
When I create a file system, the mount point pattern is blank and I, as root, can't set the pattern on the mount point to raid0 or mdt:

# lfs getstripe /lustre/scratch/
/lustre/scratch/
stripe_count: 1 stripe_size: 1048576 pattern: stripe_offset: -1
# lfs setstripe -L raid0 /lustre/scratch/
# lfs getstripe /lustre/scratch/
/lustre/scratch/
stripe_count: 1 stripe_size: 1048576 pattern: stripe_offset: -1
# lfs setstripe -L mdt /lustre/scratch/
# lfs getstripe /lustre/scratch/
/lustre/scratch/
stripe_count: 1 stripe_size: 1048576 pattern: stripe_offset: -1

Yet, sanity-pfl test_10 does change the pattern on the mount point to the default 'raid0' (and this answers Andreas' question about what the default striping is after sanity-pfl):

# lfs getstripe /lustre/scratch/
/lustre/scratch/
stripe_count: 1 stripe_size: 1048576 pattern: stripe_offset: -1
# NAME=ncli ./auster -k -v sanity-pfl --only 10
Started at Tue Dec 12 16:15:25 UTC 2017
…
PASS 10 (3s)
== sanity-pfl test complete, duration 14 sec ========================================================= 16:15:46 (1513095346)
sanity-pfl returned 0
Finished at Tue Dec 12 16:15:46 UTC 2017 in 21s
./auster: completed with rc 0
# lfs getstripe /lustre/scratch/
/lustre/scratch/
stripe_count: 1 stripe_size: 1048576 pattern: raid0 stripe_offset: 0

and I can set the mount point pattern back to 'blank':

# lfs getstripe /lustre/scratch/
/lustre/scratch/
stripe_count: 1 stripe_size: 1048576 pattern: raid0 stripe_offset: 0
# lfs setstripe -d /lustre/scratch/
# lfs getstripe /lustre/scratch/
/lustre/scratch/
stripe_count: 1 stripe_size: 1048576 pattern: stripe_offset: -1

sanity-pfl test 10 gets the layout of the mount point using get_layout_param()/parse_layout_param(), but these functions don't take the directory's pattern into account, meaning they don't capture the file/dir pattern (--layout parameter). If the pattern isn't specified, it defaults to the default pattern, which is raid0. We really want the mount point pattern to remain the same before and after sanity-pfl. Do we want to allow the user to set the pattern on the mount point?
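A minimal sketch of the save/restore idea, for illustration only (the flow and variable names are hypothetical; the real change belongs in the layout parsing helpers mentioned above):

# remember whether $MOUNT had an explicit default layout before the test
before=$(lfs getstripe -d $MOUNT)

# ... sanity-pfl test_10 body, which temporarily restripes the mount point ...

if ! echo "$before" | grep -q "pattern:[[:space:]]*raid0"; then
        # the mount point originally had a blank pattern; "lfs setstripe -d"
        # deletes the default layout, restoring the blank state shown above
        lfs setstripe -d $MOUNT
fi
|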
| Comment by Joseph Gmitter (Inactive) [ 12/Dec/17 ] |
|
Hi Lai, Can you please look into this one? Thanks. |
| Comment by Andreas Dilger [ 12/Dec/17 ] |
|
It isn't clear if we want to allow only the pattern to be set on the mountpoint, since a raw "mdt" layout on the root is mostly useless unless the filesystem has only MDTs and no OSTs (we can cross that bridge when we get to it; there will be other fixes needed as well). Instead, it makes sense to set a PFL layout with mdt as the first component. What is strange/broken in ost-pools test_1n is that the test is using create_dir to set the stripe count to -1 (as it always has) in a pool with only 1 OST (as it always has been), but this is now failing when trying to create 7 stripes on the file. It should limit the stripe count to the number of OSTs in the pool.
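For illustration, a sketch of the expected behaviour from the test's point of view (the commands mirror what test_1n does, with the pool commands run on the MGS as the test does; the clamping itself would happen in the MDS LOD allocation code, not in these commands):

POOL=testpool1234567
DIR=/mnt/lustre/d1n.ost-pools

lctl pool_new lustre.$POOL
lctl pool_add lustre.$POOL OST0000            # the pool contains a single OST

mkdir -p $DIR
lfs setstripe -p $POOL -c -1 $DIR             # -1 = stripe over all OSTs in the pool
dd if=/dev/zero of=$DIR/file bs=1M count=1    # expected: a 1-stripe file, not EFBIG

lfs getstripe $DIR/file                       # stripe count should be clamped to 1
|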
| Comment by Gerrit Updater [ 21/Dec/17 ] |
|
James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/30636 |
| Comment by James Nunez (Inactive) [ 21/Dec/17 ] |
|
The patch at https://review.whamcloud.com/30636 only modifies the parsing routines that sanity-pfl test 10 uses. When sanity-pfl test_10 is run, this patch should restore all original parameters on the mount point and, thus, stop several test failures, including most (all?) recent/new ost-pools.sh test failures. This patch does not address the OST pools issues that Andreas has commented on in this ticket. |
| Comment by Gerrit Updater [ 14/Jan/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30636/ |
| Comment by Peter Jones [ 14/Jan/18 ] |
|
Landed for 2.11 |
| Comment by James Nunez (Inactive) [ 15/Feb/18 ] |
|
Reopening this issue because we are seeing it, or something closely related, in recent full testing. One example of a recent failure is at: |
| Comment by Sarah Liu [ 21/Feb/18 ] |
|
+1 on master, tag-2.10.58 |
| Comment by Minh Diep [ 12/Mar/18 ] |
|
+1 on b2_10 https://testing.hpdd.intel.com/test_sets/7d4a2422-23da-11e8-8d2f-52540065bddc |
| Comment by Gerrit Updater [ 18/Apr/18 ] |
|
James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/32048 |
| Comment by Gerrit Updater [ 03/May/18 ] |
|
John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/32048/ |
| Comment by Sarah Liu [ 18/May/18 ] |
|
Still hit this on 2.10.4 EL7 server with EL6.9 client https://testing.hpdd.intel.com/test_sets/010bc5d2-599a-11e8-b9d3-52540065bddc |
| Comment by Sarah Liu [ 06/Aug/18 ] |
|
hit this again on 2.10.5 ldiskfs DNE |
| Comment by James Nunez (Inactive) [ 12/Dec/18 ] |
|
We're seeing parallel-scale-nfsv3 and parallel-scale-nfsv4 test_compilebench fail with 'IOError: [Errno 27] File too large' and the following in the MDS dmesg:

[102528.920205] LustreError: 26259:0:(lod_qos.c:1438:lod_alloc_specific()) can't lstripe objid [0x200022ac9:0x8e8b:0x0]: have 7 want 8

It looks like this is the same issue as reported here. Logs are at (all use zfs): |
| Comment by Gerrit Updater [ 31/May/21 ] |
|
Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/43882 |
| Comment by Gerrit Updater [ 10/Jun/21 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43976 |
| Comment by Gerrit Updater [ 11/Jun/21 ] |
|
Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/43976/ |
| Comment by Gerrit Updater [ 14/Jun/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43882/ |