Lustre / LU-19519

PFL: bogus lmm_stripe_offset on uninitialized component blocking lfs migrate

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Medium
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.15.6
    • Labels: None
    • Environment: 2.15.4+ el7.9 servers, 2.15.6+ el9.5 clients
    • Severity: 4

    Description

      Hello! Weird issue I wanted to report.

      This is on our large Oak filesystem, with OST indexes 192-751. Indexes 0-191 were in use at some point but have since been removed.

      To remove additional OSTs (192-271), I recently migrated 200M+ files to other OSTs using lfs migrate, but ended up with ~50 errors at the end. In every case, the old file has an uninitialized PFL component with a "deprecated" lmm_stripe_offset, i.e. one pointing at an OST index that no longer exists. I'm actually not sure why an lmm_stripe_offset is set for those components at all; it should be -1. It could have been a mistake at some point.

      Anyway, this is what such a file looks like:

      # lfs getstripe /oak/stanford/groups/henderj/stfan/code/kaldi/egs/brain2text/s5/exp/exp_archive/tri1_nl_1000_tg_3000_ni_35/log/acc.28.6.log
      /oak/stanford/groups/henderj/stfan/code/kaldi/egs/brain2text/s5/exp/exp_archive/tri1_nl_1000_tg_3000_ni_35/log/acc.28.6.log
        lcm_layout_gen:    4
        lcm_mirror_count:  1
        lcm_entry_count:   2
          lcme_id:             1
          lcme_mirror_id:      0
          lcme_flags:          init
          lcme_extent.e_start: 0
          lcme_extent.e_end:   2199023255552
            lmm_stripe_count:  1
            lmm_stripe_size:   1048576
            lmm_pattern:       raid0
            lmm_layout_gen:    0
            lmm_stripe_offset: 199
            lmm_objects:
            - 0: { l_ost_idx: 199, l_fid: [0x34c0000402:0x64d639:0x0] }
      
          lcme_id:             2
          lcme_mirror_id:      0
          lcme_flags:          0
          lcme_extent.e_start: 2199023255552
          lcme_extent.e_end:   EOF
            lmm_stripe_count:  8
            lmm_stripe_size:   1048576
            lmm_pattern:       raid0
            lmm_layout_gen:    0
            lmm_stripe_offset: 165
      

      The second component is not initialized (lcme_flags is 0), but its lmm_stripe_offset is set to 165, an OST index that no longer exists.
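
A minimal sketch of how such files could be flagged from `lfs getstripe` output, assuming the plain-text layout shown above. The function name `stale_components` and the `live_osts` argument are hypothetical, and the comma-separated reading of `lcme_flags` is an assumption; this only parses text and does not call any Lustre API.

```python
import re

# Excerpt of the getstripe output quoted in this report.
SAMPLE = """\
  lcm_entry_count:   2
    lcme_id:             1
    lcme_flags:          init
      lmm_stripe_count:  1
      lmm_stripe_offset: 199
      lmm_objects:
      - 0: { l_ost_idx: 199, l_fid: [0x34c0000402:0x64d639:0x0] }

    lcme_id:             2
    lcme_flags:          0
      lmm_stripe_count:  8
      lmm_stripe_offset: 165
"""

KEY_VALUE = re.compile(r"\s*(\w[\w.]*):\s*(\S+)")


def stale_components(getstripe_text, live_osts):
    """Return (lcme_id, lmm_stripe_offset) pairs for components that are
    not initialized and whose stripe offset is not a live OST index."""
    stale = []
    comp_id, flags = None, ""
    for line in getstripe_text.splitlines():
        match = KEY_VALUE.match(line)
        if not match:
            continue  # path line, blank line, or lmm_objects entry
        key, value = match.groups()
        if key == "lcme_id":
            comp_id, flags = int(value), ""
        elif key == "lcme_flags":
            flags = value
        elif key == "lmm_stripe_offset" and comp_id is not None:
            offset = int(value)
            initialized = "init" in flags.split(",")
            # -1 means "no specific start OST", which is what is expected here.
            if not initialized and offset != -1 and offset not in live_osts:
                stale.append((comp_id, offset))
    return stale


# OST indexes 192-751 are the ones that exist on this filesystem.
print(stale_components(SAMPLE, range(192, 752)))  # -> [(2, 165)]
```

Component 1 is skipped because it is initialized, even though it references an OST that is being drained; only component 2 matches the condition described above.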

      That condition seems to block every type of migration I tried:

      # lfs migrate -c 1 -i 500 /oak/stanford/groups/henderj/stfan/code/kaldi/egs/brain2text/s5/exp/exp_archive/tri1_nl_1000_tg_3000_ni_35/log/acc.28.6.log
      lfs migrate: cannot get group lock: Invalid argument (22)
      error: lfs migrate: /oak/stanford/groups/henderj/stfan/code/kaldi/egs/brain2text/s5/exp/exp_archive/tri1_nl_1000_tg_3000_ni_35/log/acc.28.6.log: cannot get group lock: Invalid argument
      
      # lfs migrate -D /oak/stanford/groups/henderj/stfan/code/kaldi/egs/brain2text/s5/exp/exp_archive/tri1_nl_1000_tg_3000_ni_35/log/acc.28.6.log
      lfs migrate: cannot get group lock: Invalid argument (22)
      error: lfs migrate: /oak/stanford/groups/henderj/stfan/code/kaldi/egs/brain2text/s5/exp/exp_archive/tri1_nl_1000_tg_3000_ni_35/log/acc.28.6.log: cannot get group lock: Invalid argument
      
      # lfs migrate -E 256G -c 1 -S 1M  -E 16T -c 16 -S 1M    -E -1 -c 128 -S 1M /oak/stanford/groups/henderj/stfan/code/kaldi/egs/brain2text/s5/exp/exp_archive/tri1_nl_1000_tg_3000_ni_35/log/acc.28.6.log
      lfs migrate: cannot get group lock: Invalid argument (22)
      error: lfs migrate: /oak/stanford/groups/henderj/stfan/code/kaldi/egs/brain2text/s5/exp/exp_archive/tri1_nl_1000_tg_3000_ni_35/log/acc.28.6.log: cannot get group lock: Invalid argument
      

      The MDT shows this:

      Oct 24 10:05:41 oak-md1-s1 kernel: LustreError: 20257:0:(lod_qos.c:1266:lod_ost_alloc_specific()) Start index 165 not found in pool ''
      Oct 24 10:34:02 oak-md1-s1 kernel: LustreError: 10325:0:(lod_qos.c:1266:lod_ost_alloc_specific()) Start index 165 not found in pool ''
      Oct 24 10:43:19 oak-md1-s1 kernel: LustreError: 10261:0:(lod_qos.c:1266:lod_ost_alloc_specific()) Start index 165 not found in pool ''
      Oct 24 10:58:20 oak-md1-s1 kernel: LustreError: 22729:0:(lod_qos.c:1266:lod_ost_alloc_specific()) Start index 165 not found in pool ''
      Oct 24 10:58:34 oak-md1-s1 kernel: LustreError: 10322:0:(lod_qos.c:1266:lod_ost_alloc_specific()) Start index 165 not found in pool ''
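
To check whether every failure points at the same stale index, the MDT messages can be tallied. This is a rough sketch that assumes the exact message format shown above; `tally_start_index_errors` is a hypothetical helper name, not part of Lustre.

```python
import re
from collections import Counter

# Matches the lod_ost_alloc_specific() message quoted above and captures
# the start index and the (possibly empty) pool name.
PATTERN = re.compile(
    r"lod_ost_alloc_specific\(\)\) Start index (\d+) not found in pool '([^']*)'"
)


def tally_start_index_errors(log_lines):
    """Count occurrences of each (start index, pool name) pair."""
    counts = Counter()
    for line in log_lines:
        match = PATTERN.search(line)
        if match:
            counts[(int(match.group(1)), match.group(2))] += 1
    return counts


LOG = [
    "Oct 24 10:05:41 oak-md1-s1 kernel: LustreError: 20257:0:(lod_qos.c:1266:lod_ost_alloc_specific()) Start index 165 not found in pool ''",
    "Oct 24 10:34:02 oak-md1-s1 kernel: LustreError: 10325:0:(lod_qos.c:1266:lod_ost_alloc_specific()) Start index 165 not found in pool ''",
]
print(tally_start_index_errors(LOG))  # -> Counter({(165, ''): 2})
```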
      

      We're reporting this because we don't think the original lmm_stripe_offset on an uninitialized component should block the migration.

      Thanks!
      Stéphane

          People

            Assignee: paf0186 Patrick Farrell
            Reporter: sthiell Stephane Thiell