[LU-12040] File lost during recovery Created: 04/Mar/19 Updated: 21/Jul/20 Resolved: 10/Sep/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0, Lustre 2.11.0, Lustre 2.12.0, Lustre 2.13.0 |
| Fix Version/s: | Lustre 2.13.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Alexey Lyashkov | Assignee: | Alexey Lyashkov |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch |
| Severity: | 3 |
| Description |
|
In 2011, Johann introduced a wire protocol change:
commit f90abfdc961debae069804307dcbc883b50c137c
Author: Johann Lombardi <johann@whamcloud.com>
Date: Thu Dec 15 01:00:00 2011 +0100
LU-169 lov: add generation number to LOV EA
This commit replaced the unused 'stripe_offset' field in the server reply with the layout generation. The failure scenario: a client creates a file in a directory that has a pool assigned, and the server then fails. The client resends the open+create call, but on replay it fails silently with EINVAL in lod_verify_v1v3, because '0' is not one of the LOD pool's indexes. |
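To make the failure mode concrete, here is a minimal C sketch of one 16-bit slot carrying two meanings. Every name in it (`sketch_lov_ea`, `sketch_pool`, `sketch_verify`, `SKETCH_OFFSET_ANY`) is hypothetical and simplified; it is not the real Lustre structure or the real `lod_verify_v1v3`. The replay-aware variant shows only one conceivable mitigation and is not claimed to match the patch that landed.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the LOV EA header; not the real layout. */
struct sketch_lov_ea {
	uint16_t stripe_count;
	/*
	 * One slot, two meanings:
	 *   client request: requested starting OST index (stripe offset)
	 *   server reply:   layout generation (the LU-169 change)
	 */
	uint16_t offset_or_gen;
};

/* Assumed sentinel meaning "no preferred starting OST". */
#define SKETCH_OFFSET_ANY 0xffff

/* Hypothetical pool: a list of the OST indexes that belong to it. */
struct sketch_pool {
	const uint16_t *ost_idx;
	size_t nr;
};

static bool sketch_pool_has(const struct sketch_pool *pool, uint16_t idx)
{
	for (size_t i = 0; i < pool->nr; i++)
		if (pool->ost_idx[i] == idx)
			return true;
	return false;
}

/*
 * Naive check that always reads the slot as a stripe offset. A replayed
 * request built from the server reply carries generation 0 in the slot,
 * which is then looked up in the pool as if it were OST index 0.
 */
static int sketch_verify(const struct sketch_lov_ea *ea,
			 const struct sketch_pool *pool)
{
	if (ea->offset_or_gen != SKETCH_OFFSET_ANY &&
	    !sketch_pool_has(pool, ea->offset_or_gen))
		return -EINVAL;
	return 0;
}

/*
 * One conceivable mitigation (an assumption for illustration only):
 * on replay, do not trust the slot as an offset at all.
 */
static int sketch_verify_replay_aware(const struct sketch_lov_ea *ea,
				      const struct sketch_pool *pool,
				      bool is_replay)
{
	struct sketch_lov_ea tmp = *ea;

	if (is_replay)
		tmp.offset_or_gen = SKETCH_OFFSET_ANY;
	return sketch_verify(&tmp, pool);
}
```

With a pool holding OST indexes 1 and 2, a replayed EA whose slot contains generation 0 is rejected by `sketch_verify` with `-EINVAL`, while `sketch_verify_replay_aware` accepts it when `is_replay` is true.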
| Comments |
| Comment by Gerrit Updater [ 04/Mar/19 ] |
|
Vladimir Saveliev (c17830@cray.com) uploaded a new patch: https://review.whamcloud.com/34369 |
| Comment by Gerrit Updater [ 04/Mar/19 ] |
|
Vladimir Saveliev (c17830@cray.com) uploaded a new patch: https://review.whamcloud.com/34371 |
| Comment by Vladimir Saveliev [ 04/Mar/19 ] |
|
https://review.whamcloud.com/#/c/34370/ and https://review.whamcloud.com/34371 are two possible solutions for the problem. https://review.whamcloud.com/34369 is a test illustrating the issue. |
| Comment by Andreas Dilger [ 04/Mar/19 ] |
|
I would say that patch: commit 89693927f0b065d44fdc496f6b49539118570104
LU-8998 lod: accomodate to composite layout
Modify the LOD to make it support the composite layout:
:
:
- Object allocation code is adjusted to not only check the used
OSTs in this round of allocation, but also the used OSTs in the existing layout components..
Reviewed-on: https://review.whamcloud.com/24823
is what triggered this to be a problem in newer releases, which was landed as v2_9_55_0-14-g8969392. |
| Comment by Cory Spitz [ 24/Apr/19 ] |
|
Is https://review.whamcloud.com/34371 ready to land? It has Code-Review +1 from Andreas Dilger and Mike Pershin, and Verified +1 from Jenkins and Maloo. |
| Comment by Andreas Dilger [ 24/Apr/19 ] |
|
This patch and the prerequisite patch are in the master-next branch and should probably land by next week, depending on how integration testing goes. |
| Comment by Gerrit Updater [ 30/Apr/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34371/ |
| Comment by Peter Jones [ 30/Apr/19 ] |
|
Landed for 2.13 |
| Comment by Gerrit Updater [ 21/May/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34919 |
| Comment by Andreas Dilger [ 19/Jul/19 ] |
|
We're seeing continuous failures on replay-single test_134 when this patch is backported to b2_12. Shadow, do you know if there are any other patches that this one depends on to work? |
| Comment by Alexey Lyashkov [ 29/Jul/19 ] |
|
Andreas, I think something is wrong with the test environment. |
| Comment by Patrick Farrell (Inactive) [ 29/Jul/19 ] |
|
That error with stripe offset actually seems really likely to be related to this patch, which is changing how the stripe offset is handled...? |
| Comment by Alexey Lyashkov [ 29/Jul/19 ] |
|
Patrick, I'm sorry, but this patch does nothing to "how stripe offset is handled" in the create path. This patch affects the replay code path.

[ 8981.540879] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 notransno
[ 8981.873832] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 readonly
[ 8982.658245] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
[ 8982.825716] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
[ 8982.834142] LustreError: 17948:0:(lod_qos.c:1358:lod_alloc_specific()) Start index 0 not found in pool 'pool_134'
[ 8983.012511] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true
[ 8983.340608] Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds1
[ 8986.891724] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.8.7@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.

+
+ mkdir $DIR/$tdir
+ $LFS setstripe -p pool_134 $DIR/$tdir
+
+ replay_barrier mds1
+
+ touch $DIR/$tdir/$tfile <<< create
+
+ fail mds1

As I said before, if someone can reproduce this with full debug info and attach it to the ticket, I can help with understanding the problem, but in my view something is going wrong when creating an object with a pool. |
| Comment by Peter Jones [ 10/Sep/19 ] |
|
This is fixed on master so this ticket should be marked RESOLVED. We should track any efforts to address this issue on 2.12.x separately. |