[LU-13730] Check need to mirror Extend on a WRITE_PENDING (wp) FLR file Created: 30/Jun/20  Updated: 31/Aug/23  Resolved: 20/Mar/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: New Feature Priority: Minor
Reporter: Qian Yingjin Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-7073 racer with OST object migration hangs... Resolved
is related to LU-14512 prohibit extend file with stale mirror Closed
Rank (Obsolete): 9223372036854775807

 Description   

The following test script causes a kernel panic:

        local file=$DIR/$tfile

        $LFS mirror create -N -S 4M -c 2 -N -S 1M -c -1 $file ||
                error "create mirrored file $file failed"
        echo -n pccro_as_mirror_layout > $file
        echo "FLR layout before PCC-RO attach '$file':"
        $LFS getstripe -v $file

        $LFS mirror extend -N -S 8M -c -1 $file ||
                error "mirror extend $file failed"
        echo -e "\nFLR layout after extend a mirror:"
        $LFS getstripe -v $file

        echo -e "\nWrite '$file' after mirror extend:"
        echo -n write_after_mirror_extend > $file
        $LFS getstripe -v $file

The output for the test script:

/mnt/lustre/f21i.sanity-pcc
composite_header:
lcm_magic: 0x0BD60BD0
lcm_size: 288
lcm_flags: wp
lcm_layout_gen: 3
lcm_mirror_count: 2
lcm_entry_count: 2
components:
- lcme_id: 65537
lcme_mirror_id: 1
lcme_flags: init
lcme_extent.e_start: 0
lcme_extent.e_end: EOF
lcme_offset: 128
lcme_size: 80
sub_layout:
lmm_magic: 0x0BD10BD0
lmm_seq: 0x200000401
lmm_object_id: 0x3
lmm_fid: [0x200000401:0x3:0x0]
lmm_stripe_count: 2
lmm_stripe_size: 4194304
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 0
lmm_objects:
- 0: { l_ost_idx: 0, l_fid: [0x100000000:0x3:0x0] }
- 1: { l_ost_idx: 1, l_fid: [0x100010000:0x2:0x0] }
- lcme_id: 131074
lcme_mirror_id: 2
lcme_flags: init,stale
lcme_extent.e_start: 0
lcme_extent.e_end: EOF
lcme_offset: 208
lcme_size: 80
sub_layout:
lmm_magic: 0x0BD10BD0
lmm_seq: 0x200000401
lmm_object_id: 0x3
lmm_fid: [0x200000401:0x3:0x0]
lmm_stripe_count: 2
lmm_stripe_size: 1048576
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 1
lmm_objects:
- 0: { l_ost_idx: 1, l_fid: [0x100010000:0x3:0x0] }
- 1: { l_ost_idx: 0, l_fid: [0x100000000:0x4:0x0] }
FLR layout after extend a mirror:
/mnt/lustre/f21i.sanity-pcc
composite_header:
lcm_magic: 0x0BD60BD0
lcm_size: 416
lcm_flags: wp
lcm_layout_gen: 4
lcm_mirror_count: 3
lcm_entry_count: 3
components:
- lcme_id: 65537
lcme_mirror_id: 1
lcme_flags: init
lcme_extent.e_start: 0
lcme_extent.e_end: EOF
lcme_offset: 176
lcme_size: 80
sub_layout:
lmm_magic: 0x0BD10BD0
lmm_seq: 0x200000401
lmm_object_id: 0x3
lmm_fid: [0x200000401:0x3:0x0]
lmm_stripe_count: 2
lmm_stripe_size: 4194304
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 0
lmm_objects:
- 0: { l_ost_idx: 0, l_fid: [0x100000000:0x3:0x0] }
- 1: { l_ost_idx: 1, l_fid: [0x100010000:0x2:0x0] }
- lcme_id: 131074
lcme_mirror_id: 2
lcme_flags: init,stale
lcme_extent.e_start: 0
lcme_extent.e_end: EOF
lcme_offset: 256
lcme_size: 80
sub_layout:
lmm_magic: 0x0BD10BD0
lmm_seq: 0x200000401
lmm_object_id: 0x3
lmm_fid: [0x200000401:0x3:0x0]
lmm_stripe_count: 2
lmm_stripe_size: 1048576
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 1
lmm_objects:
- 0: { l_ost_idx: 1, l_fid: [0x100010000:0x3:0x0] }
- 1: { l_ost_idx: 0, l_fid: [0x100000000:0x4:0x0] }
- lcme_id: 196609
lcme_mirror_id: 3
lcme_flags: init
lcme_extent.e_start: 0
lcme_extent.e_end: EOF
lcme_offset: 336
lcme_size: 80
sub_layout:
lmm_magic: 0x0BD10BD0
lmm_seq: 0x200000401
lmm_object_id: 0x3
lmm_fid: [0x200000401:0x3:0x0]
lmm_stripe_count: 2
lmm_stripe_size: 8388608
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 0
lmm_objects:
- 0: { l_ost_idx: 0, l_fid: [0x100000000:0x5:0x0] }
- 1: { l_ost_idx: 1, l_fid: [0x100010000:0x4:0x0] }
Write '/mnt/lustre/f21i.sanity-pcc' after mirror extend:

The panic crash dump:

[27530.518341] LustreError: 70249:0:(lod_object.c:7733:lod_declare_update_write_pending()) ASSERTION( primary < 0 ) failed: [0x200000401:0x3:0x0] has multiple primary: 3 / 1
[27530.519329] LustreError: 70249:0:(lod_object.c:7733:lod_declare_update_write_pending()) LBUG
[27530.519834] Pid: 70249, comm: mdt00_003 3.10.0-957.12.2.el7_lustre.2.12.55_47_gf6497eb.x86_64 #1 SMP Mon Jul 1 20:06:03 CST 2019
[27530.519835] Call Trace:
[27530.519842] [<ffffffffc06e762c>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[27530.519848] [<ffffffffc06e794c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[27530.519851] [<ffffffffc13a9c89>] lod_declare_update_write_pending+0x969/0xa40 [lod]
[27530.519860] [<ffffffffc13aa5b8>] lod_declare_layout_change+0x858/0xe20 [lod]
[27530.519864] [<ffffffffc1244bf3>] mdd_declare_layout_change+0x63/0x130 [mdd]
[27530.519870] [<ffffffffc124e419>] mdd_layout_change+0x9d9/0x18e0 [mdd]

It seems that FLR can only extend a mirror onto an FLR file that is in the LCM_FL_RDONLY state.

When the file is in the LCM_FL_WRITE_PENDING state, the first write after the mirror extend finds more than one primary mirror (the original primary and the newly added, non-stale mirror), which triggers the assertion and the panic.
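
To make the failing condition concrete, here is a minimal shell sketch that lists the mirrors with no stale component from `lfs getstripe -v` output. It assumes, as the assertion message and the flags above suggest, that every non-stale mirror counts as a primary candidate. The helper name nonstale_mirrors is hypothetical and is not part of the Lustre test framework.

```shell
# Hypothetical helper (not part of Lustre): read `lfs getstripe -v` output
# on stdin and print the IDs of mirrors that have no stale component.
# Assumption: a mirror is a primary candidate when none of its components
# carries the "stale" flag.
nonstale_mirrors() {
	awk '
		# Remember which mirror the following lcme_flags line belongs to.
		/lcme_mirror_id:/ { id = $NF; if (!(id in stale)) stale[id] = 0 }
		# Mark the mirror stale if any of its components is stale.
		/lcme_flags:/     { if ($NF ~ /stale/) stale[id] = 1 }
		END { for (m in stale) if (!stale[m]) print m }
	' | sort -n
}

# Usage against a real file (needs a Lustre client, so left commented out):
# $LFS getstripe -v $file | nonstale_mirrors
```

Fed the "after extend" layout above, this reports mirrors 1 and 3, the same pair named in the assertion message. A write-pending file is expected to have exactly one such mirror.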

Two possible solutions are:

  1. Check whether the file is in the LCM_FL_RDONLY state before extending a mirror onto it. If it is not, fail immediately with an error code.
  2. In lod_declare_update_write_pending(), when more than one candidate for primary is found, pick one as the primary and mark the other in-sync mirrors stale.
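
A rough sketch of what option 1 could look like at the test-script level, rather than in the MDS code: read the FLR state from `lfs getstripe -v` and refuse to extend unless it is `ro`. This assumes `lfs` prints `ro` for LCM_FL_RDONLY in the same way it prints `wp` in the output above. Both function names are hypothetical.

```shell
# Hypothetical helper: print the FLR state token (the value of the first
# lcm_flags line) from `lfs getstripe -v` output supplied on stdin.
flr_state_from_getstripe() {
	awk '/lcm_flags:/ { print $2; exit }'
}

# Hypothetical guard for option 1: succeed only when the state is "ro".
# Assumption: "ro" is how lfs prints LCM_FL_RDONLY.
flr_can_extend() {
	[ "$(flr_state_from_getstripe)" = "ro" ]
}

# Usage in a test script (needs a Lustre client, so left commented out):
# $LFS getstripe -v $file | flr_can_extend ||
#	error "refusing mirror extend: $file is not in ro state"
```

A check like this in a script is racy, because the state can change between the check and the extend. It only illustrates the condition that the server side would have to enforce.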


 Comments   
Comment by Andreas Dilger [ 29/Jan/21 ]

Hit this several times during racer with file migrate (patch https://review.whamcloud.com/13669 "LU-7073 tests: Add file migration to racer"):
https://testing-archive.whamcloud.com/gerrit-janitor/13869/testresults/racer-special1-ldiskfs-DNE-centos7_x86_64-centos7_x86_64/
https://testing-archive.whamcloud.com/gerrit-janitor/13869/testresults/racer-special2-ldiskfs-centos7_x86_64-centos7_x86_64/
https://testing-archive.whamcloud.com/gerrit-janitor/13869/testresults/racer-special7-ldiskfs-DNE-centos7_x86_64-centos7_x86_64/
https://testing-archive.whamcloud.com/gerrit-janitor/13869/testresults/racer-special7-ldiskfs-centos7_x86_64-centos7_x86_64/
https://testing-archive.whamcloud.com/gerrit-janitor/13869/testresults/racer-special9-ldiskfs-DNE-centos7_x86_64-centos7_x86_64/
https://testing-archive.whamcloud.com/gerrit-janitor/13869/testresults/racer-special10-ldiskfs-DNE-centos7_x86_64-centos7_x86_64/
https://testing-archive.whamcloud.com/gerrit-janitor/13869/testresults/racer-special1-zfs-centos7_x86_64-centos7_x86_64/
https://testing-archive.whamcloud.com/gerrit-janitor/13869/testresults/racer-special4-zfs-DNE-centos7_x86_64-centos7_x86_64/
https://testing-archive.whamcloud.com/gerrit-janitor/13869/testresults/racer-special8-zfs-centos7_x86_64-centos7_x86_64/
https://testing-archive.whamcloud.com/gerrit-janitor/13869/testresults/racer-special10-zfs-DNE-centos7_x86_64-centos7_x86_64/

It definitely seems like a bug that is easily triggered by a race between file migrate or mirror (not sure which) and some other operation on the file.

Comment by Gerrit Updater [ 29/Jan/21 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41368
Subject: LU-13730 tests: add file mirroring to racer
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9784b3dd8855b98807a689527881bb47aaa98d9f

Comment by Gerrit Updater [ 11/Mar/21 ]

Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/42003
Subject: LU-13730 lod: don't confuse stale with primary flag
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2ac23c725e95a953c2dfb025743af7ce469eb0b0

Comment by Gerrit Updater [ 13/Mar/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/42003/
Subject: LU-13730 lod: don't confuse stale with primary flag
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 571f3cf1115973d0fdaf6d5244bfeee230b52989

Comment by Peter Jones [ 20/Mar/21 ]

Landed for 2.15

Comment by Andreas Dilger [ 29/Aug/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/52168
Subject: LU-13170 tests: enable mirror extend in racer
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d0beef90e6ec7f2f3b920b6894ebba7365f5bdff

Comment by Gerrit Updater [ 31/Aug/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/41368/
Subject: LU-13730 tests: add file mirroring to racer
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 324aa79eb5a907be207ea6d41e3efcb4980a6f54

Generated at Sat Feb 10 03:03:43 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.