[LU-10911] FLR2: Erasure coding Created: 13/Apr/18 Updated: 28/Jun/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | New Feature | Priority: | Minor |
| Reporter: | Andreas Dilger | Assignee: | Zhenyu Xu |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | FLR2 |
| Attachments: |
| Issue Links: |
| Sub-Tasks: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
| Comments |
| Comment by Nathan Rutman [ 24/Feb/20 ] |
|
Is this still planned for 2.14? Any progress? This ticket doesn't seem to get updated; am I looking in the wrong place? |
| Comment by Andreas Dilger [ 24/Feb/20 ] |
|
The plan is still to get this into 2.14. There are patches in Gerrit that could probably be refreshed. As always, review of the patches would be welcome. |
| Comment by Andreas Dilger [ 24/Feb/20 ] |
|
The patches are in Gerrit under the sub-tasks linked above. LU-12186 thru |
| Comment by Alexey Lyashkov [ 03/Mar/20 ] |
|
Andreas, can the correct Gerrit links be provided in the tickets? |
| Comment by Gerrit Updater [ 25/Mar/20 ] |
|
[ignore this, patch pushed under wrong ticket #] |
| Comment by Gerrit Updater [ 25/Mar/20 ] |
|
[ignore this, patch pushed under wrong ticket #] |
| Comment by Andreas Dilger [ 23/Apr/20 ] |
|
Bobijam, That will ensure that the EC code is included as part of the 2.14 release, and gives us more time to improve the build system, fix EC bugs, etc. We would want to have the #ifdef ISAL_ENABLE checks for this code anyway, so that Lustre can still build if ISA-L is not available/usable on some systems. We shouldn't leave it disabled for a long time, because untested code is going to break quickly, but the 2.14 feature landing window is supposed to close on April 30 (already 2 weeks late), and I think there are still changes that need to be finished before this feature is ready. Those can still be worked on after the feature is landed to master, before the 2.14 final release. |
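A minimal sketch of the kind of ISAL_ENABLE guard discussed above, assuming the build system defines ISAL_ENABLE only when ISA-L is usable. Only the macro name comes from this ticket. ec_sketch_available(), ec_sketch_encode(), sketch_update_parity() and the -EOPNOTSUPP return value are hypothetical names chosen for illustration and are not taken from the Lustre patches.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#ifdef ISAL_ENABLE
/* ISA-L is available: the real encoder would live in a separate,
 * ISA-L-backed source file (hypothetical, not shown here). */
int ec_sketch_encode(unsigned char **data, int k,
		     unsigned char **parity, int p, size_t len);

static inline int ec_sketch_available(void)
{
	return 1;
}
#else
/* ISA-L is missing: keep the same interface so callers still compile,
 * but report that erasure coding is not supported. */
static inline int ec_sketch_available(void)
{
	return 0;
}

static inline int ec_sketch_encode(unsigned char **data, int k,
				   unsigned char **parity, int p, size_t len)
{
	(void)data; (void)k; (void)parity; (void)p; (void)len;
	return -EOPNOTSUPP;
}
#endif

/* Hypothetical caller: skips parity entirely when EC is compiled out,
 * so files keep their pre-EC behaviour. */
static inline int sketch_update_parity(unsigned char **data, int k,
				       unsigned char **parity, int p,
				       size_t len)
{
	assert(k >= 0 && p >= 0);
	if (!ec_sketch_available())
		return 0;	/* no parity is written */
	return ec_sketch_encode(data, k, parity, p, len);
}
```

With this shape the callers contain no #ifdef of their own; only the small wrapper layer changes between builds with and without ISA-L.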
| Comment by Zhenyu Xu [ 23/Apr/20 ] |
|
Yes, great insight. ISAL_ENABLE could be used to protect pre-EC file behavior and smooth the transition. |
| Comment by Alexey Lyashkov [ 23/Apr/20 ] |
|
Andreas, I'm confused. Are you OK with landing untested/buggy code? |
| Comment by Alexey Lyashkov [ 09/Jul/20 ] |
|
Can someone provide a better HLD than the one attached? That document covers only some userspace tools and some common structure changes. It does not describe parity calculation at all, especially the case where a rewrite does not cover whole data stripes and old data must be read to calculate the parity. It has no failure scenarios and no recovery handling, although recovery looks very complex in this case. It does not describe how the design plans to avoid rewriting parity with old data when two parity updates are in flight (a CR lock permits this). It also poorly describes the lock protection for parity between nodes when two nodes write in parallel to half of the data stripes each. Can the design document be updated to address these questions? |
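To make the partial-rewrite concern above concrete, here is a minimal sketch that assumes a single XOR parity chunk (RAID-5 style). This is a deliberate simplification and not the scheme in the FLR2 patches, which this ticket does not describe. The function names are hypothetical. The point is that updating parity for one rewritten chunk requires the old contents of that chunk, since the new parity is old parity XOR old data XOR new data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Full-stripe write: parity is the XOR of all k data chunks. */
void xor_parity_full(const uint8_t *const *data, size_t k, size_t len,
		     uint8_t *parity)
{
	assert(data != NULL && parity != NULL);
	for (size_t i = 0; i < len; i++) {
		uint8_t p = 0;

		for (size_t d = 0; d < k; d++)
			p ^= data[d][i];
		parity[i] = p;
	}
}

/* Partial rewrite of one chunk: the old data must be read first, because
 * the update removes its contribution before adding the new one. */
void xor_parity_update(uint8_t *parity, const uint8_t *old_data,
		       const uint8_t *new_data, size_t len)
{
	assert(parity != NULL && old_data != NULL && new_data != NULL);
	for (size_t i = 0; i < len; i++)
		parity[i] ^= old_data[i] ^ new_data[i];
}
```

If two writers each read the same old parity and apply their own update, the later write silently discards the earlier one, which is why the locking and in-flight update questions above matter.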
| Comment by James A Simmons [ 17/Mar/21 ] |
|
Just an update. We have moved the flr branch to the latest master and have been running normal sanity tests. Currently we are fixing the various bugs we encounter. |
| Comment by James A Simmons [ 22/Apr/21 ] |
|
I just rebased to the latest master and I get a build error with the latest code, due to the landing of lov_foreach_io_layout() and because lov_io_fault_store() uses lov_io_layout_at(). Both functions have changed to handle both LCT_DATA and LCT_CODE types. The question is whether it is safe to just pass LCT_DATA in both cases, or do we need to examine every component to see which LCT_* type we have? |
| Comment by Zhenyu Xu [ 23/Apr/21 ] |
|
I think it's OK to just pass LCT_DATA in both cases. Parity code pages won't be cached after EC IO, since they are ephemeral and later EC IO could use other parity components. |
| Comment by James A Simmons [ 04/May/21 ] |
|
In my testing I'm seeing: kernel: Lustre: DEBUG MARKER: == sanity test 130g: FIEMAP (overstripe file) ================================================ |
| Comment by Andreas Dilger [ 04/May/21 ] |
That is probably introduced by patches from There is a prototype patch in |
| Comment by James A Simmons [ 28/Jun/23 ] |
|
An outside party has contacted our group at ORNL so we pushed the current prototype for early review with them. This project is at the beta code stage. |
| Comment by Alexey Lyashkov [ 28/Jun/23 ] |
|
James, can you add some comments about recovery with FLR2? How is it planned to find which stripe is good and which is outdated and needs to be reconstructed?