Lustre / LU-16475

Reusing OST indexes after lctl del_ost


Details

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0

    Description

      I have been investigating the possibility of reusing OST indexes after lctl del_ost, and I wanted to describe the currently known issues and ideas for improvement in a ticket to get some feedback.

      We have been using lctl del_ost in production (backported to 2.12) on two different systems, and it worked great, as long as one doesn't intend to reuse the indexes of the deleted OSTs.

      lctl del_ost removes an OST's llog entries on the MGS in CONFIGS/fsname-MDT* and CONFIGS/fsname-client. The MGS propagates those changes to the MDTs and clients. However, as long as the MDTs and clients are not restarted, they keep in-memory references to the deleted OSTs.
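
      For reference, the llog records that del_ost rewrites can be inspected directly on the MGS with lctl's llog commands. A minimal sketch, assuming the fsname "newfir" used throughout this ticket:

```shell
# List the configuration llogs held by the MGS
mgs# lctl --device MGS llog_catlist

# Dump the client and MDT config logs; after del_ost, the records
# for the deleted OST should no longer be active in these logs
mgs# lctl --device MGS llog_print newfir-client
mgs# lctl --device MGS llog_print newfir-MDT0000
```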

      Let's test after removing an OST as follows:

       # lctl conf_param newfir-OST0000.osc.active=0 # deactivate the OST
       # lctl --device MGS del_ost --target newfir-OST0000 # remove the OST from the config
      

      Using the following command on the MDS, we can see that the deleted OST (here OST0000) is still referenced:

      mds# lctl get_param osc.*OST*.prealloc_status
      osc.newfir-OST0000-osc-MDT0000.prealloc_status=-108
      osc.newfir-OST0001-osc-MDT0000.prealloc_status=0
      osc.newfir-OST0002-osc-MDT0000.prealloc_status=0
      osc.newfir-OST0003-osc-MDT0000.prealloc_status=0
      osc.newfir-OST0004-osc-MDT0000.prealloc_status=0
      osc.newfir-OST0005-osc-MDT0000.prealloc_status=0
      osc.newfir-OST0006-osc-MDT0000.prealloc_status=0
      osc.newfir-OST0007-osc-MDT0000.prealloc_status=0
      

      On the client, the deleted OST shows up as an "inactive device" in lfs df -v:

      [root@fir-rbh03 ~]# lfs df -v /newfir
      UUID 1K-blocks Used Available Use% Mounted on
      newfir-MDT0000_UUID 9056940 5548 8233744 1% /newfir[MDT:0]
      OST0000 : inactive device
      newfir-OST0001_UUID 148751801588 240244 147251465344 1% /newfir[OST:1] f
      newfir-OST0002_UUID 148751801588 175304 147251530284 1% /newfir[OST:2] f
      newfir-OST0003_UUID 148751801588 292072 147251413516 1% /newfir[OST:3] f
      newfir-OST0004_UUID 148751801588 299544 147251406044 1% /newfir[OST:4] f
      newfir-OST0005_UUID 148751801588 323452 147251382136 1% /newfir[OST:5] f
      newfir-OST0006_UUID 148751801588 226332 147251479256 1% /newfir[OST:6] f
      newfir-OST0007_UUID 148751801588 274664 147251430924 1% /newfir[OST:7] f
      
      filesystem_summary: 1041262611116 1831612 1030760107504 1% /newfir
      

      Ideally, all references to the OST should be removed after lctl del_ost, so that we could reuse the OST index as if it had never been used before. But that seems like quite a big endeavour.

      Now, if we remount the MDTs and clients, things are much better: there is no trace of the deleted OST in memory anymore, and in theory we should be able to reuse the OST index in that case. However, I found a problem with the current implementation of lctl del_ost: it keeps a configuration file for the deleted OST under CONFIGS/fsname-OST0000 on the MGS. This should probably be fixed (I will try to submit a patch for it). Indeed, if we try to start a fresh OST with the same index after del_ost and a full restart of all targets, we still get the following error:

      Jan 13 15:10:55 fir-io1-s1 kernel: LustreError: 141-4: The config log for newfir-OST0000 already exists, yet the server claims it never registered. It may have been reformatted, or the index changed. writeconf the MDT to regenerate all logs.
      Jan 13 15:10:55 fir-io1-s1 kernel: LustreError: 55854:0:(mgs_llog.c:4351:mgs_write_log_target()) Can't write logs for newfir-OST0000 (-114)
      Jan 13 15:10:55 fir-io1-s1 kernel: LustreError: 55854:0:(mgs_handler.c:526:mgs_target_reg()) Failed to write newfir-OST0000 log (-114)
      

      It comes from mgs_write_log_ost() in mgs/mgs_llog.c:

      /* If the ost log already exists, that means that someone reformatted
         the ost and it called target_add again. */
      if (!mgs_log_is_empty(env, mgs, mti->mti_svname)) {
              LCONSOLE_ERROR_MSG(0x141, "The config log for %s already "
                                 "exists, yet the server claims it never "
                                 "registered. It may have been reformatted, "
                                 "or the index changed. writeconf the MDT to "
                                 "regenerate all logs.\n", mti->mti_svname);
              RETURN(-EALREADY);
      }
      

      By manually removing CONFIGS/newfir-OST0000 from the MGS after del_ost, this error goes away, and then mounting a freshly formatted OST with the same index seems to work.
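
      For completeness, here is the kind of manual cleanup I mean, sketched for an ldiskfs-backed MGS (the device path and mount points below are placeholders; a ZFS-backed MGT would need the equivalent steps for that backend):

```shell
# With the MGS stopped, mount the MGT backing device as ldiskfs
# and delete the leftover config log for the removed OST.
mgs# umount /mnt/mgt                        # stop the MGS
mgs# mount -t ldiskfs /dev/mgt_device /mnt/mgt
mgs# rm /mnt/mgt/CONFIGS/newfir-OST0000
mgs# umount /mnt/mgt
mgs# mount -t lustre /dev/mgt_device /mnt/mgt   # restart the MGS
```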

      The last thing I haven't heavily tested is the LAST_ID issue. In my tests, working with only a few files, this doesn't seem to be an issue (it does not trigger an lfsck layout check to try to repair it). But I wonder if that could be a problem in production, when LAST_ID is very high, since it doesn't seem to be reset to 0 in osc.newfir-OST0000-osc-MDT0000.prealloc_last_id when registering the replacement OST. Is there a way to ensure it is reset on del_ost? (Where is it stored? Perhaps this is also something we can clean up on del_ost?)

      Just to clarify, note that we have never used mkfs.lustre --replace here, as we actually want the new OST to register with the MGS and MDTs.
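
      To summarize, the full sequence I have been testing looks roughly like this (a sketch, not a validated procedure; NIDs, device paths and mount points are placeholders):

```shell
# 1. Deactivate the OST and remove it from the configuration
mds# lctl conf_param newfir-OST0000.osc.active=0
mgs# lctl --device MGS del_ost --target newfir-OST0000

# 2. Remount the MDTs and all clients so that no in-memory
#    references to the deleted OST remain

# 3. Remove the stale CONFIGS/newfir-OST0000 file on the MGS
#    (manual workaround until del_ost cleans it up itself)

# 4. Format and mount a fresh OST reusing index 0
oss# mkfs.lustre --ost --fsname=newfir --index=0 \
       --mgsnode=10.0.0.1@tcp /dev/new_ost_device
oss# mount -t lustre /dev/new_ost_device /mnt/ost0
```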

People

    Assignee: sthiell Stephane Thiell
    Reporter: sthiell Stephane Thiell