[LU-14] live replacement of OST Created: 16/Nov/10  Updated: 24/Sep/15  Resolved: 23/Dec/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: Lustre 2.6.0, Lustre 2.4.2, Lustre 2.5.1

Type: Improvement Priority: Critical
Reporter: Lai Siyao Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-4246 Test failure on test suite conf-sanit... Closed
is related to LU-1267 LFSCK II: MDT-OST consistency check/r... Resolved
is related to LU-3458 OST not able to register at MGS with ... Open
is related to LU-2018 Questions about using lfsck Resolved
is related to LU-3668 ldiskfs_check_descriptors: Block bitm... Resolved
is related to LU-3575 'mkfs.lustre --writeconf' not working... Resolved
is related to LU-5722 memory allocation deadlock under lu_c... Resolved
is related to LU-4204 typo in new conf-sanity subtest Resolved
is related to LU-266 Need a better, automated way to recov... Resolved
Bugzilla ID: 24128
Rank (Obsolete): 7701

 Description   

Hot replace (a command-level sketch follows this list):
1 - Disable the OST on the MDT (lctl deactivate)
2 - Empty the OST
3 - Back up the magic files (last_rcvd, LAST_ID, CONFIGS/*)
4 - Deactivate the OST on all clients as well
5 - Unmount the OST
6 - Replace the device and reformat it using the same index
7 - Restore the backed-up magic files
8 - Restart the OST
9 - Reactivate the OST everywhere
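
A rough command-level sketch of the steps above, assuming an ldiskfs OST; the
fsname "lustre", OST index 5, mount point /mnt/ost5, and devices /dev/old_ost
and /dev/new_ost are placeholders, and the lctl device number must be looked
up per system:

# 1/4 - on the MDS, deactivate the OSC/OSP device for the OST; on every
#       client, stop new I/O to it
lctl dl | grep OST0005
lctl --device <devno> deactivate
lctl set_param osc.lustre-OST0005-*.active=0     # on each client

# 2 - drain the OST (e.g. lfs find --ost lustre-OST0005 ... plus lfs_migrate)

# 3/5 - unmount the OST and back up the magic files via an ldiskfs mount
umount /mnt/ost5
mount -t ldiskfs /dev/old_ost /mnt/ost5
cp -a /mnt/ost5/last_rcvd /mnt/ost5/CONFIGS /root/ost5-backup/
cp -a /mnt/ost5/O/0/LAST_ID /root/ost5-backup/
umount /mnt/ost5

# 6 - reformat the replacement device with the same index
mkfs.lustre --ost --fsname=lustre --index=5 --mgsnode=mgs@tcp0 /dev/new_ost

# 7 - restore the backed-up files onto the new device
mount -t ldiskfs /dev/new_ost /mnt/ost5
cp -a /root/ost5-backup/last_rcvd /root/ost5-backup/CONFIGS /mnt/ost5/
mkdir -p /mnt/ost5/O/0
cp -a /root/ost5-backup/LAST_ID /mnt/ost5/O/0/
umount /mnt/ost5

# 8/9 - restart the OST and reactivate it on the MDS and clients
mount -t lustre /dev/new_ost /mnt/ost5
lctl --device <devno> activate                   # on the MDS
lctl set_param osc.lustre-OST0005-*.active=1     # on each client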

It should be possible to have a new OST gracefully replace an old one, if that is what the administrator wants. Some "special" action would need to be taken on the OST and/or MDT to ensure that this is really what the admin intended, rather than e.g. accidentally inserting some other OST with the same index and corrupting the filesystem because of duplicate object IDs, or losing access to the existing objects on the "real" OST at that index.

  • the new OST should start allocating objects at the LAST_ID of the old
    OST, so that there is no risk of confusion between objects
  • the MDT keeps the old LAST_ID in its lov_objid file and sends it to the
    OST at connection time (both values can be inspected by hand; see the
    sketch after this list), so this part is not a problem
  • currently the new OST will refuse to allow the MDT to connect, because it
    detects that the old LAST_ID value from the MDT is inconsistent with its
    own value
  • it would be relatively straightforward to have the OST detect that the
    local LAST_ID value was "new" and use the MDT value instead
  • the danger is if the LAST_ID file was lost for some reason (e.g. corruption
    causes e2fsck to erase it). In that case, the OST startup code should be
    smart enough to regenerate LAST_ID by walking the object directories,
    which would also avoid the need to do this in e2fsck/lfsck (which can only
    run offline)
  • in cases where the on-disk LAST_ID is much lower than the MDT-supplied
    value, the OST should simply skip precreation of all the intermediate
    objects and start using the new MDT value
  • the only other thing is to avoid the case where a "new" OST is accidentally
    assigned the same index when that isn't what is wanted. There needs to be
    some way to "prime" the new OST (that is NOT the default for a newly
    formatted OST), or conversely tell the MDT that it should signal the new
    OST to take the place of the old one, so that there are no mistakes
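
For context on the LAST_ID / lov_objid comparison above, both values can be
read by hand from unmounted ldiskfs targets. A minimal sketch, assuming object
sequence 0 and placeholder device names:

# OST side: last object ID recorded for sequence 0
debugfs -c -R 'dump /O/0/LAST_ID /tmp/LAST_ID' /dev/ost_dev
od -Ax -td8 /tmp/LAST_ID

# MDT side: per-OST last object IDs recorded in lov_objid
debugfs -c -R 'dump /lov_objid /tmp/lov_objid' /dev/mdt_dev
od -Ax -td8 /tmp/lov_objid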


 Comments   
Comment by Lai Siyao [ 19/Nov/10 ]

Did some tests; about 30% of the code is finished.

Comment by bschubert [ 19/Nov/10 ]

I just noticed this here while it is still easy to browse through all the open issues.

Just for your information, the offline approach: https://bugzilla.lustre.org/show_bug.cgi?id=22734

Comment by Lai Siyao [ 19/Nov/10 ]

Thanks for pointing this out; it explains a lot of details about LAST_ID recovery!

Comment by Lai Siyao [ 05/Dec/10 ]

Code is ready, and in inspection.

Comment by Andreas Dilger [ 09/Oct/12 ]

It probably makes sense for Fan Yong to implement this as part of the LFSCK project, so that an OST can recover from some common forms of corruption.

The existing patch is at http://review.whamcloud.com/141, but needs to be refreshed.

Comment by nasf (Inactive) [ 29/Dec/12 ]

It will be considered in LFSCK phase II.

Comment by Andreas Dilger [ 07/May/13 ]

In discussions during LU-2886 patch http://review.whamcloud.com/6199 inspection, it was proposed to improve the on-disk format of the LAST_ID file:

struct last_id_ondisk {
        __u64 lio_next_oid;  /* next object ID to allocate in this sequence */
        __u32 lio_magic;     /* magic value identifying a valid LAST_ID file */
        __u32 lio_cksum;     /* checksum used to validate the record */
};

and ofd_seq_load() (maybe rename this to ofd_seq_last_oid_read()?), ofd_seq_last_oid_write(), and ll_recover_lost_found_objs.c should be updated to handle both an old 8-byte LAST_ID file and this new 16-byte format. If the on-disk LAST_ID file is corrupted (bad lio_magic, bad lio_cksum, lio_next_oid > OBIF_MAX_OID for fid_seq != 0, or lio_next_oid > IDIF_MAX_OID for fid_seq == 0), it would be treated the same as if it were missing, and the LAST_ID recovery code should traverse the object directories for that group and rebuild the LAST_ID file.

This would avoid the case where the LAST_ID file has some random garbage in it and causes an inconsistency between the MDT's and OST's understanding of what the next valid OID is.

Comment by Andreas Dilger [ 07/Jun/13 ]

The one missing part of this process is being able to use a newly formatted OST in place of an old one with the same index if the last_rcvd and mountdata files are not accessible. The last_rcvd file will be recreated at mount time with default parameters (normally fine), but mkfs.lustre always creates the mountdata file with the LDD_F_VIRGIN flag set. It should be possible to add a --replace option to mkfs.lustre so that the MGS doesn't refuse to let the OST connect because its index is already in use.
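
A hedged sketch of what such a mkfs.lustre invocation could look like once a
--replace option exists; the fsname, index, MGS NID, and device name are
placeholders:

mkfs.lustre --ost --replace --reformat \
    --fsname=lustre --index=5 \
    --mgsnode=mgs@tcp0 /dev/new_ost

Here --reformat is only needed because the device already contains an old
filesystem; --replace would tell the MGS that reusing the existing index is
intentional rather than a virgin registration.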

Comment by Andreas Dilger [ 23/Aug/13 ]

I've pushed http://review.whamcloud.com/7443 for "mkfs.lustre --replace", and for having the OST precreate only recent objects when the MDT lov_objid value is much larger than the OST LAST_ID. This replaces the old patch at http://review.whamcloud.com/141.

Comment by Peter Jones [ 26/Sep/13 ]

So is there still further work to complete for this ticket, or does the recent landing mean that it can be closed?

Comment by nasf (Inactive) [ 07/Oct/13 ]

We still need the patch for rebuilding LAST_ID file:
http://review.whamcloud.com/6997

Comment by Bob Glossman (Inactive) [ 04/Nov/13 ]

backport to b2_4: http://review.whamcloud.com/8159

Comment by Jian Yu [ 22/Nov/13 ]

Patch landed on Lustre b2_4 branch.

Comment by Jian Yu [ 26/Nov/13 ]

backport to b2_4: http://review.whamcloud.com/8159

The newly added conf-sanity test 69 introduced a regression failure in interop testing:
https://maloo.whamcloud.com/test_sets/4c4bf322-54af-11e3-9029-52540035b04c

The patch also introduced regression failures in conf-sanity tests 72 and 73:
https://maloo.whamcloud.com/test_sets/4c4bf322-54af-11e3-9029-52540035b04c
https://maloo.whamcloud.com/test_sets/1ee70c90-4d17-11e3-9c23-52540035b04c
https://maloo.whamcloud.com/test_sets/acaa49a4-4c5c-11e3-826a-52540035b04c
https://maloo.whamcloud.com/test_sets/e3329b42-473a-11e3-89d8-52540035b04c
https://maloo.whamcloud.com/test_sets/61470bfa-45f4-11e3-810a-52540035b04c

Before Lustre b2_4 build #57 (which contains the patch), conf-sanity tests 72 and 73 always passed on the Lustre b2_4 branch.

Comment by Jian Yu [ 26/Nov/13 ]

The newly added conf-sanity test 69 also introduced a regression failure in ZFS testing:
https://maloo.whamcloud.com/test_sets/57656c1e-5605-11e3-8e94-52540035b04c

Comment by Jian Yu [ 27/Nov/13 ]

Patches adding Lustre version checks to conf-sanity test 69:
master branch: http://review.whamcloud.com/8411
b2_5 branch: http://review.whamcloud.com/8413
b2_4 branch: http://review.whamcloud.com/8412

Comment by Andreas Dilger [ 13/Dec/13 ]

Patch http://review.whamcloud.com/6997 implements LAST_ID rebuilding after corruption, and also handles the case where the MDT and OST are out of sync about the LAST_ID value.

Comment by Peter Jones [ 23/Dec/13 ]

Closing, as the remaining work is tracked under LU-1267.
