[LU-13794] changing comp-flags using lfs setstripe could get stuck when ost is down Created: 17/Jul/20  Updated: 17/Jul/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Dongyang Li Assignee: Zhenyu Xu
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We are testing an FLR setup and found that lfs setstripe can get stuck.

After creating the FLR layout and running lfs mirror resync:

[root@bss022 test_file_replica]# lfs getstripe -v test-file1 
test-file1
composite_header:
  lcm_magic:         0x0BD60BD0
  lcm_size:          272
  lcm_flags:         ro
  lcm_layout_gen:    11
  lcm_mirror_count:  2
  lcm_entry_count:   2
components:
  - lcme_id:             65537
    lcme_mirror_id:      1
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   EOF
    lcme_offset:         128
    lcme_size:           72
    sub_layout:
      lmm_magic:         0x0BD30BD0
      lmm_seq:           0xa80000d30
      lmm_object_id:     0x6
      lmm_fid:           [0xa80000d30:0x6:0x0]
      lmm_stripe_count:  1
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 9
      lmm_pool:          primary
      lmm_objects:
      - 0: { l_ost_idx: 9, l_fid: [0x440000419:0x17e2:0x0] }
  - lcme_id:             131074
    lcme_mirror_id:      2
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   EOF
    lcme_offset:         200
    lcme_size:           72
    sub_layout:
      lmm_magic:         0x0BD30BD0
      lmm_seq:           0xa80000d30
      lmm_object_id:     0x6
      lmm_fid:           [0xa80000d30:0x6:0x0]
      lmm_stripe_count:  1
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 13
      lmm_pool:          secondary
      lmm_objects:
      - 0: { l_ost_idx: 13, l_fid: [0x2800000408:0x1802:0x0] }
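
The lcme_id values in the output above (65537 and 131074) are easier to read once split into their parts. The sketch below assumes that, for mirrored files, the mirror ID sits in bits 16-30 of lcme_id and the low 16 bits are a per-file component sequence number. That assumption is consistent with the lcme_mirror_id values shown in this report, but the authoritative layout is in the Lustre headers. The helper name split_lcme_id is made up for illustration.

```python
# Assumption (not stated in this report): the mirror ID occupies bits 16-30
# of lcme_id and the low 16 bits are a component sequence number.
MIRROR_ID_MASK = 0x7FFF0000
MIRROR_ID_SHIFT = 16
COMP_SEQ_MASK = 0xFFFF


def split_lcme_id(lcme_id):
    """Return (mirror_id, component_sequence) for a composite entry ID."""
    mirror_id = (lcme_id & MIRROR_ID_MASK) >> MIRROR_ID_SHIFT
    comp_seq = lcme_id & COMP_SEQ_MASK
    return mirror_id, comp_seq


# The two component IDs from the getstripe output in this report.
for lcme_id in (65537, 131074):
    mirror_id, comp_seq = split_lcme_id(lcme_id)
    print(f"lcme_id={lcme_id} (0x{lcme_id:x}) mirror={mirror_id} seq={comp_seq}")
```

With these inputs the decoded mirror IDs are 1 and 2, matching lcme_mirror_id in the listing.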

After taking ost9 offline, we are still able to read the file.

However, to be able to write to the file, we need to set the prefer flag on the other component:

lfs setstripe --comp-set -I 131074 --comp-flags=prefer test-file1
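
The choice of component 131074 in the command above can be sketched as a small script: read the lfs getstripe -v text, map each lcme_id to the OST indexes of its objects, and keep the components that do not touch an offline OST. This is only an illustration built on a trimmed excerpt of the output in this report. The function names are hypothetical, and it treats each component on its own, whereas a real mirror may span several components.

```python
import re

# Trimmed excerpt of the `lfs getstripe -v` output from this report.
GETSTRIPE_OUTPUT = """\
  - lcme_id:             65537
    lcme_mirror_id:      1
      lmm_objects:
      - 0: { l_ost_idx: 9, l_fid: [0x440000419:0x17e2:0x0] }
  - lcme_id:             131074
    lcme_mirror_id:      2
      lmm_objects:
      - 0: { l_ost_idx: 13, l_fid: [0x2800000408:0x1802:0x0] }
"""

# Matches either the start of a component or one of its object entries.
TOKEN_RE = re.compile(r"\blcme_id:\s*(\d+)|\bl_ost_idx:\s*(\d+)")


def components_by_ost(text):
    """Map each lcme_id to the set of OST indexes holding its objects."""
    comps = {}
    current = None
    for match in TOKEN_RE.finditer(text):
        if match.group(1) is not None:
            # A new component begins; later l_ost_idx entries belong to it.
            current = int(match.group(1))
            comps[current] = set()
        elif current is not None:
            comps[current].add(int(match.group(2)))
    return comps


def usable_components(text, offline_osts):
    """Return the lcme_ids whose objects avoid every offline OST."""
    return [
        lcme_id
        for lcme_id, osts in components_by_ost(text).items()
        if osts.isdisjoint(offline_osts)
    ]


# With ost9 offline, only the component on ost13 remains a candidate.
print(usable_components(GETSTRIPE_OUTPUT, {9}))
```

For the layout in this report and ost9 offline, the only remaining candidate is 131074, which is the ID passed to -I above.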

However, the lfs setstripe command gets stuck because lfs tries to flush the client cache to ost9, which is offline.

[<ffffffffc1433f94>] osc_io_data_version_end+0x34/0x190 [osc]
[<ffffffffc0fc4ee0>] cl_io_end+0x60/0x150 [obdclass]
[<ffffffffc0e0a0bb>] lov_io_end_wrapper+0xdb/0xe0 [lov]
[<ffffffffc0e0ad38>] lov_io_data_version_end+0x78/0x1d0 [lov]
[<ffffffffc0fc4ee0>] cl_io_end+0x60/0x150 [obdclass]
[<ffffffffc0fc779a>] cl_io_loop+0xda/0x1c0 [obdclass]
[<ffffffffc1513bcb>] ll_ioc_data_version+0x20b/0x340 [lustre]
[<ffffffffc15283e0>] ll_file_ioctl+0x19d0/0x49f0 [lustre]
[<ffffffffb665d9e0>] do_vfs_ioctl+0x3a0/0x5a0
[<ffffffffb665dc81>] SyS_ioctl+0xa1/0xc0
[<ffffffffb6b8cede>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff


 Comments   
Comment by Gerrit Updater [ 17/Jul/20 ]

Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/39411
Subject: LU-13794 util: changing comp-flags get stuck when OST is down
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c426989c5b8deae0d0ecfad0bd380bbbf98dea17

Comment by Andreas Dilger [ 17/Jul/20 ]

I don't think the user should have to set the preferred mirror when writing to an FLR file with a failed OST. Definitely the MDS should automatically pick a mirror that is not missing objects to avoid this problem.

In some cases, there may be a race condition where an OST goes offline right after the MDS selected it for a mirror, but I don't think that applies here. If a user noticed the problem and had time to run "lfs setstripe", then the MDS has had plenty of time to detect the problem itself and skip the mirror with that OST.
