[LU-16872] sanity: test_27M Error: '(5) stripe count , should be 8 for append' Created: 05/Jun/23 Updated: 16/Jan/24 Resolved: 19/Aug/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | Thomas Bertschinger |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Frank Sehr <fsehr@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/46d1b101-8e4f-4415-bb49-ee39963275fe

Test session details:
== sanity test 27M: test O_APPEND striping =============== 03:40:09 (1685936409) |
| Comments |
| Comment by Thomas Bertschinger [ 20/Jun/23 ] |
|
Andreas - I'm away on vacation through June 25 so if this bug needs a quick resolution, you may want to have someone else look into it. Otherwise I'll continue to work on this when I get back next week. |
| Comment by Arshad Hussain [ 29/Jun/23 ] |
|
+1 on master (https://testing.whamcloud.com/sub_tests/399cef35-a321-43da-944a-84f8ce67c9f1) |
| Comment by Thomas Bertschinger [ 29/Jun/23 ] |
|
Here's a status update since the bug is still causing failures. I'm still looking at this and have not found the root cause yet, but I have a hypothesis.

The origin of the EINVAL response to the open() call is lod_alloc_ost_list(). From the debug logs, it seems it must be one of these:

    if (!test_bit(ost_idx, m->lod_ost_bitmap)) {
            rc = -EINVAL;
            break;
    }
    ...
    if (lod_qos_is_tgt_used(env, ost_idx, stripe_count) &&
        !(lod_comp->llc_pattern & LOV_PATTERN_OVERSTRIPING)) {
            rc = -EINVAL;
            break;
    }

However, I am fairly confident that lod_alloc_ost_list() should NOT be getting called in the append case at all, because this function appears to be for files with specifically set stripes, e.g., it gets called if a file inherits custom striping from its parent directory. Therefore, I believe the actual problem is that lod_comp->llc_ostlist.op_array and lod_comp->llc_ostlist.op_count are incorrectly non-zero when the failure occurs.

My current hypothesis for the cause is this: the struct lu_tgt_pool inside struct lod_layout_component (where the op_array and op_count fields are) is preceded by char *llc_pool, so it's possible that the char * buffer is being overflowed and putting garbage values into the array and count fields. So currently I'm looking at where llc_pool is set and whether any of these spots could plausibly have an overflow. |
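To make the two candidate failure points easier to reason about, here is a minimal, self-contained C sketch of that kind of validation. It is not the Lustre code: `toy_check_ost_list()`, `toy_ost_present()` and `TOY_OVERSTRIPING` are hypothetical stand-ins, the OST bitmap is modeled as a single 64-bit word, and "target already used" is assumed to mean "index appears earlier in the same list".

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for LOV_PATTERN_OVERSTRIPING. */
#define TOY_OVERSTRIPING 0x1u

/* Toy stand-in for test_bit() on the OST bitmap: bit i set means OST i exists. */
static bool toy_ost_present(uint64_t ost_bitmap, unsigned int ost_idx)
{
	return ost_idx < 64 && ((ost_bitmap >> ost_idx) & 1u);
}

/*
 * Validate an explicit OST list.
 * Returns 0 if every index exists and no index repeats (unless the
 * overstriping flag is set), otherwise -EINVAL.
 */
static int toy_check_ost_list(uint64_t ost_bitmap,
			      const unsigned int *ost_list,
			      unsigned int count, unsigned int pattern)
{
	uint64_t used = 0;
	unsigned int i;

	for (i = 0; i < count; i++) {
		unsigned int idx = ost_list[i];

		/* First check: the requested OST must exist. */
		if (!toy_ost_present(ost_bitmap, idx))
			return -EINVAL;

		/* Second check: reusing an OST needs overstriping. */
		if (((used >> idx) & 1u) && !(pattern & TOY_OVERSTRIPING))
			return -EINVAL;

		used |= (uint64_t)1 << idx;
	}
	return 0;
}
```

With eight OSTs present (bitmap 0xFF), the list {1, 3} passes, while a list containing an out-of-range index such as 9, or a repeated index without the overstriping flag, yields -EINVAL. A list filled with garbage values would be expected to trip one of the two checks in the same way.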
| Comment by Thomas Bertschinger [ 29/Jun/23 ] |
|
Oops, I wasn't thinking straight this morning: overflowing llc_pool wouldn't affect op_array and op_count, since the buffer it points to wouldn't be adjacent to these fields. So I'm still looking for what could cause op_array and op_count to have bad values. (I still think that's the most likely explanation for the issue.) |
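The point about adjacency can be illustrated with a small C sketch. The struct names below are simplified, hypothetical stand-ins (the real struct lod_layout_component has more fields); the only property modeled is that a `char *` member stores a pointer inline, while the characters live in a separately allocated buffer.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for struct lu_tgt_pool (assumed field names). */
struct toy_tgt_pool {
	uint32_t *op_array;
	unsigned int op_count;
};

/* Simplified stand-in for struct lod_layout_component. */
struct toy_layout_component {
	char *llc_pool;			/* pointer only; chars live elsewhere */
	struct toy_tgt_pool llc_ostlist;
};

/* True if the pool-name buffer lies outside the struct's own storage. */
static bool toy_pool_buffer_is_separate(const struct toy_layout_component *comp)
{
	uintptr_t start = (uintptr_t)comp;
	uintptr_t end = start + sizeof(*comp);
	uintptr_t buf = (uintptr_t)comp->llc_pool;

	return buf < start || buf >= end;
}

/* Allocate a pool name and report whether it is separate (1), or -1 on OOM. */
static int toy_demo(void)
{
	struct toy_layout_component comp = {0};
	int separate;

	comp.llc_pool = malloc(16);
	if (comp.llc_pool == NULL)
		return -1;
	strcpy(comp.llc_pool, "pool_a");

	separate = toy_pool_buffer_is_separate(&comp);
	free(comp.llc_pool);
	return separate;
}
```

Because the name buffer is a separate allocation, writing past its end would damage whatever happens to follow that allocation, not the op_array and op_count fields that sit next to the pointer itself.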
| Comment by Gerrit Updater [ 04/Jul/23 ] |
|
"Thomas Bertschinger <bertschinger@lanl.gov>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51559 |
| Comment by Thomas Bertschinger [ 04/Jul/23 ] |
|
How to reproduce:

# setup:
$ lctl set_param mdd.*.append_stripe_count=-1
$ lfs setstripe -o 1,3 /mnt/lustre

# touch enough files with the default striping so that every MDT kernel thread probably has the defaults stored in its memory
$ for i in {0..100}; do touch /mnt/lustre/x$i; done

# now an append should return EINVAL as long as it gets handled by a kernel thread that previously did a create with default stripes
$ echo 1 >> /mnt/lustre/g
-bash: /mnt/lustre/g: Invalid argument

A closely related problem occurs when an append_pool is set. In that case the create succeeds, but the append file is created with the default stripes rather than the pool.

I haven't identified which patch caused (or uncovered) the issue yet; I didn't see anything obvious in the patches merged shortly before the first test failure occurred. So I'll attempt a git bisect to try to find what caused this and will update if I get that answer. |
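The reproducer suggests stale per-thread state: a thread that earlier handled a create with the directory default (OST list 1,3) still holds that list when it later handles an append. The following C sketch is a toy model of that pattern only. All names (`toy_thread_info`, `toy_handle_create`, `toy_handle_append`) are hypothetical, the count-mismatch test is merely a stand-in for whichever check actually fails in lod_alloc_ost_list(), and the `reset_first` flag is an illustration of the general remedy rather than a description of the merged patches.

```c
#include <errno.h>
#include <string.h>

#define TOY_MAX_OSTS 8	/* assumed number of OSTs in the test setup */

struct toy_layout {
	int stripe_count;
	unsigned int ost_count;			/* non-zero: explicit OST list */
	unsigned int ost_list[TOY_MAX_OSTS];
};

/* Per-thread scratch state that survives between requests (hypothetical). */
struct toy_thread_info {
	struct toy_layout cached;
};

/* A create that inherits the directory default set by `lfs setstripe -o 1,3`. */
static void toy_handle_create(struct toy_thread_info *ti)
{
	ti->cached.stripe_count = 2;
	ti->cached.ost_count = 2;
	ti->cached.ost_list[0] = 1;
	ti->cached.ost_list[1] = 3;
}

/*
 * An O_APPEND create: append_stripe_count == -1 means "all OSTs".
 * If reset_first is zero, a stale OST list from an earlier request is
 * still visible and sends the request down the explicit-list path.
 */
static int toy_handle_append(struct toy_thread_info *ti,
			     int append_stripe_count, int reset_first)
{
	if (reset_first)
		memset(&ti->cached, 0, sizeof(ti->cached));

	ti->cached.stripe_count = append_stripe_count == -1 ?
				  TOY_MAX_OSTS : append_stripe_count;

	if (ti->cached.ost_count != 0) {
		/* Stand-in failure: the stale list cannot satisfy the request. */
		if (ti->cached.ost_count != (unsigned int)ti->cached.stripe_count)
			return -EINVAL;
	}
	return 0;
}
```

In this model a fresh thread handles the append successfully, a thread that previously ran `toy_handle_create()` returns -EINVAL, and clearing the cached layout before the append makes it succeed again. That matches the observation that the failure depends on which kernel thread services the request.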
| Comment by Gerrit Updater [ 07/Jul/23 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51602 |
| Comment by Gerrit Updater [ 19/Aug/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51559/ |
| Comment by Gerrit Updater [ 19/Aug/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51602/ |
| Comment by Peter Jones [ 19/Aug/23 ] |
|
Landed for 2.16 |