Details
Type: Bug
Resolution: Unresolved
Priority: Medium
Fix Version/s: None
Affects Version/s: Lustre 2.15.6
Labels: None
Environment: 2.15.4+ el7.9 servers, 2.15.6+ el9.5 clients
Severity: 4
Rank (Obsolete): 9223372036854775807
Description
Hello! Weird issue I wanted to report.
This is on our large Oak filesystem, with OST indexes 192-751. Previous indexes 0-191 were in use at some point but have since been removed.
To remove additional OSTs (192-271), I recently migrated 200M+ files to other OSTs using lfs migrate, but I ended up with ~50 errors at the end. The error is always the same: those old files have an uninitialized PFL component with a "deprecated" lmm_stripe_offset. I'm actually not sure why an lmm_stripe_offset is set for those components. It should be -1. It could have been a mistake at some point.
Anyway, this is what such a file looks like:
# lfs getstripe /oak/stanford/groups/henderj/stfan/code/kaldi/egs/brain2text/s5/exp/exp_archive/tri1_nl_1000_tg_3000_ni_35/log/acc.28.6.log
/oak/stanford/groups/henderj/stfan/code/kaldi/egs/brain2text/s5/exp/exp_archive/tri1_nl_1000_tg_3000_ni_35/log/acc.28.6.log
lcm_layout_gen: 4
lcm_mirror_count: 1
lcm_entry_count: 2
lcme_id: 1
lcme_mirror_id: 0
lcme_flags: init
lcme_extent.e_start: 0
lcme_extent.e_end: 2199023255552
lmm_stripe_count: 1
lmm_stripe_size: 1048576
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 199
lmm_objects:
- 0: { l_ost_idx: 199, l_fid: [0x34c0000402:0x64d639:0x0] }
lcme_id: 2
lcme_mirror_id: 0
lcme_flags: 0
lcme_extent.e_start: 2199023255552
lcme_extent.e_end: EOF
lmm_stripe_count: 8
lmm_stripe_size: 1048576
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 165
The second component is not initialized (lcme_flags is 0), but its lmm_stripe_offset is set to 165, an OST index that no longer exists.
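To locate other files in the same state, the lfs getstripe text output can be scanned for uninitialized components whose lmm_stripe_offset points at an OST index that is no longer present. The following is only a rough sketch: the function name stale_offsets, the sample text and the valid index set are made up for illustration, and it assumes the output keeps the "key: value" shape shown above.

```python
import re

def stale_offsets(getstripe_text, valid_osts):
    """Return (lcme_id, offset) pairs for components that are not
    initialized and whose lmm_stripe_offset is not a valid OST index."""
    stale = []
    comp_id, flags = None, None
    for line in getstripe_text.splitlines():
        # Only simple "key: value" lines are of interest here.
        m = re.match(r"\s*(\w[\w.]*):\s*(\S+)\s*$", line)
        if not m:
            continue
        key, value = m.groups()
        if key == "lcme_id":
            comp_id, flags = int(value), None   # a new component starts
        elif key == "lcme_flags":
            flags = value
        elif key == "lmm_stripe_offset" and comp_id is not None:
            offset = int(value)
            initialized = flags is not None and "init" in flags.split(",")
            if not initialized and offset != -1 and offset not in valid_osts:
                stale.append((comp_id, offset))
    return stale

# Shortened sample based on the output above.
SAMPLE = """\
lcme_id: 1
lcme_flags: init
lmm_stripe_offset: 199
lmm_objects:
- 0: { l_ost_idx: 199, l_fid: [0x34c0000402:0x64d639:0x0] }
lcme_id: 2
lcme_flags: 0
lmm_stripe_offset: 165
"""

# OST indexes 192-751 exist on this filesystem, per the description.
print(stale_offsets(SAMPLE, set(range(192, 752))))
```

Initialized components are skipped on purpose, since their objects are already allocated; only a start index left on an uninitialized component is reported.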
That condition seems to be blocking any migration type I tried:
# lfs migrate -c 1 -i 500 /oak/stanford/groups/henderj/stfan/code/kaldi/egs/brain2text/s5/exp/exp_archive/tri1_nl_1000_tg_3000_ni_35/log/acc.28.6.log
lfs migrate: cannot get group lock: Invalid argument (22)
error: lfs migrate: /oak/stanford/groups/henderj/stfan/code/kaldi/egs/brain2text/s5/exp/exp_archive/tri1_nl_1000_tg_3000_ni_35/log/acc.28.6.log: cannot get group lock: Invalid argument
# lfs migrate -D /oak/stanford/groups/henderj/stfan/code/kaldi/egs/brain2text/s5/exp/exp_archive/tri1_nl_1000_tg_3000_ni_35/log/acc.28.6.log
lfs migrate: cannot get group lock: Invalid argument (22)
error: lfs migrate: /oak/stanford/groups/henderj/stfan/code/kaldi/egs/brain2text/s5/exp/exp_archive/tri1_nl_1000_tg_3000_ni_35/log/acc.28.6.log: cannot get group lock: Invalid argument
# lfs migrate -E 256G -c 1 -S 1M -E 16T -c 16 -S 1M -E -1 -c 128 -S 1M /oak/stanford/groups/henderj/stfan/code/kaldi/egs/brain2text/s5/exp/exp_archive/tri1_nl_1000_tg_3000_ni_35/log/acc.28.6.log
lfs migrate: cannot get group lock: Invalid argument (22)
error: lfs migrate: /oak/stanford/groups/henderj/stfan/code/kaldi/egs/brain2text/s5/exp/exp_archive/tri1_nl_1000_tg_3000_ni_35/log/acc.28.6.log: cannot get group lock: Invalid argument
The MDT shows this:
Oct 24 10:05:41 oak-md1-s1 kernel: LustreError: 20257:0:(lod_qos.c:1266:lod_ost_alloc_specific()) Start index 165 not found in pool ''
Oct 24 10:34:02 oak-md1-s1 kernel: LustreError: 10325:0:(lod_qos.c:1266:lod_ost_alloc_specific()) Start index 165 not found in pool ''
Oct 24 10:43:19 oak-md1-s1 kernel: LustreError: 10261:0:(lod_qos.c:1266:lod_ost_alloc_specific()) Start index 165 not found in pool ''
Oct 24 10:58:20 oak-md1-s1 kernel: LustreError: 22729:0:(lod_qos.c:1266:lod_ost_alloc_specific()) Start index 165 not found in pool ''
Oct 24 10:58:34 oak-md1-s1 kernel: LustreError: 10322:0:(lod_qos.c:1266:lod_ost_alloc_specific()) Start index 165 not found in pool ''
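With ~50 affected files, it may help to summarize which start indexes the MDT is rejecting. The snippet below is a small sketch that tallies them from the kernel log; the helper name missing_start_indexes is hypothetical, and the regular expression assumes the exact message format shown above.

```python
import re
from collections import Counter

# Matches the lod_ost_alloc_specific() message format seen in the MDT log.
PATTERN = re.compile(
    r"lod_ost_alloc_specific\(\)\) Start index (\d+) not found in pool '([^']*)'"
)

def missing_start_indexes(log_text):
    """Count occurrences of each (start index, pool name) pair."""
    return Counter((int(idx), pool) for idx, pool in PATTERN.findall(log_text))

LOG = (
    "Oct 24 10:05:41 oak-md1-s1 kernel: LustreError: 20257:0:"
    "(lod_qos.c:1266:lod_ost_alloc_specific()) Start index 165 not found in pool ''\n"
    "Oct 24 10:34:02 oak-md1-s1 kernel: LustreError: 10325:0:"
    "(lod_qos.c:1266:lod_ost_alloc_specific()) Start index 165 not found in pool ''\n"
)

for (index, pool), count in missing_start_indexes(LOG).items():
    print(f"start index {index} (pool {pool!r}): {count} occurrence(s)")
```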
We're reporting this because we don't think the original lmm_stripe_offset on an uninitialized component should block the migration.
Thanks!
Stéphane