Details
Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.14.0
Labels: None
Severity: 3
Rank (Obsolete): 9223372036854775807
Description
When lfs mirror resync attempts to resync a file, it hangs in uninterruptible 'I' state. A server thread goes into an infinite loop.
lfs getstripe on the file being resynced:
# lfs getstripe -v file
composite_header:
lcm_magic: 0x0BD60BD0
lcm_size: 4056
lcm_flags: wp
lcm_layout_gen: 1589327495
lcm_mirror_count: 3
lcm_entry_count: 8
components:
- lcme_id: 1
lcme_mirror_id: 0
lcme_flags: init,stale,prefer
lcme_extent.e_start: 0
lcme_extent.e_end: 1073741824
lcme_offset: 416
lcme_size: 72
sub_layout:
lmm_magic: 0x0BD30BD0
lmm_seq: 0xb0000fafa
lmm_object_id: 0x851e
lmm_fid: [0xb0000fafa:0x851e:0x0]
lmm_stripe_count: 1
lmm_stripe_size: 1048576
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 28
lmm_pool: ddn_ssd
lmm_objects:
- 0: { l_ost_idx: 28, l_fid: [0x1640000412:0x48899:0x0] }
- lcme_id: 2
lcme_mirror_id: 0
lcme_flags: init
lcme_extent.e_start: 1073741824
lcme_extent.e_end: 4294967296
lcme_offset: 488
lcme_size: 144
sub_layout:
lmm_magic: 0x0BD30BD0
lmm_seq: 0xb0000fafa
lmm_object_id: 0x851e
lmm_fid: [0xb0000fafa:0x851e:0x0]
lmm_stripe_count: 4
lmm_stripe_size: 1048576
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 42
lmm_pool: ddn_ssd
lmm_objects:
- 0: { l_ost_idx: 42, l_fid: [0x68000040f:0x49ff0:0x0] }
- 1: { l_ost_idx: 41, l_fid: [0x8c000040e:0x4a64b:0x0] }
- 2: { l_ost_idx: 44, l_fid: [0x980000402:0x49f4c:0x0] }
- 3: { l_ost_idx: 47, l_fid: [0x9c0000411:0x4ae3d:0x0] }
- lcme_id: 3
lcme_mirror_id: 0
lcme_flags: init
lcme_extent.e_start: 4294967296
lcme_extent.e_end: 17179869184
lcme_offset: 632
lcme_size: 240
sub_layout:
lmm_magic: 0x0BD30BD0
lmm_seq: 0xb0000fafa
lmm_object_id: 0x851e
lmm_fid: [0xb0000fafa:0x851e:0x0]
lmm_stripe_count: 8
lmm_stripe_size: 1048576
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 18
lmm_pool: ddn_ssd
lmm_objects:
- 0: { l_ost_idx: 18, l_fid: [0x17c0000412:0xede98:0x0] }
- 1: { l_ost_idx: 67, l_fid: [0xcc000040e:0x4a6b5:0x0] }
- 2: { l_ost_idx: 5, l_fid: [0x1540000412:0x4a4a7:0x0] }
- 3: { l_ost_idx: 66, l_fid: [0x110000040e:0x4ac7d:0x0] }
- 4: { l_ost_idx: 64, l_fid: [0x1680000412:0x4a52f:0x0] }
- 5: { l_ost_idx: 17, l_fid: [0x1840000407:0xf3d83:0x0] }
- 6: { l_ost_idx: 31, l_fid: [0x1280000413:0x4ae67:0x0] }
- 7: { l_ost_idx: 65, l_fid: [0x1a80000407:0x4abf3:0x0] }
- lcme_id: 4
lcme_mirror_id: 0
lcme_flags: init
lcme_extent.e_start: 17179869184
lcme_extent.e_end: 137438953472
lcme_offset: 872
lcme_size: 432
sub_layout:
lmm_magic: 0x0BD30BD0
lmm_seq: 0xb0000fafa
lmm_object_id: 0x851e
lmm_fid: [0xb0000fafa:0x851e:0x0]
lmm_stripe_count: 16
lmm_stripe_size: 1048576
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 18
lmm_pool: ddn_ssd
lmm_objects:
- 0: { l_ost_idx: 18, l_fid: [0x17c0000412:0xedea0:0x0] }
- 1: { l_ost_idx: 35, l_fid: [0x18c0000407:0x4b5ee:0x0] }
- 2: { l_ost_idx: 77, l_fid: [0x1dc0000403:0x4a436:0x0] }
- 3: { l_ost_idx: 57, l_fid: [0x144000040f:0x4b0ca:0x0] }
- 4: { l_ost_idx: 81, l_fid: [0xfc0000409:0x4af46:0x0] }
- 5: { l_ost_idx: 46, l_fid: [0x1180000412:0x4ab66:0x0] }
- 6: { l_ost_idx: 6, l_fid: [0x16c0000412:0x4b01d:0x0] }
- 7: { l_ost_idx: 29, l_fid: [0x1380000413:0x4a815:0x0] }
- 8: { l_ost_idx: 68, l_fid: [0xd0000040e:0x4aee9:0x0] }
- 9: { l_ost_idx: 54, l_fid: [0x3c0000411:0x49eeb:0x0] }
- 10: { l_ost_idx: 55, l_fid: [0x940000401:0x4ac74:0x0] }
- 11: { l_ost_idx: 76, l_fid: [0x480000406:0x4a9c1:0x0] }
- 12: { l_ost_idx: 7, l_fid: [0x1340000413:0x49e91:0x0] }
- 13: { l_ost_idx: 5, l_fid: [0x1540000412:0x4a4b1:0x0] }
- 14: { l_ost_idx: 31, l_fid: [0x1280000413:0x4ae6e:0x0] }
- 15: { l_ost_idx: 56, l_fid: [0x1480000410:0x4b1a3:0x0] }
- lcme_id: 5
lcme_mirror_id: 0
lcme_flags: init
lcme_extent.e_start: 137438953472
lcme_extent.e_end: 549755813888
lcme_offset: 1304
lcme_size: 1320
sub_layout:
lmm_magic: 0x0BD30BD0
lmm_seq: 0xb0000fafa
lmm_object_id: 0x851e
lmm_fid: [0xb0000fafa:0x851e:0x0]
lmm_stripe_count: 53
lmm_stripe_size: 1048576
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 71
lmm_pool: ddn_ssd
lmm_objects:
- 0: { l_ost_idx: 71, l_fid: [0x1800000412:0x49e39:0x0] }
- 1: { l_ost_idx: 54, l_fid: [0x3c0000411:0x49ef2:0x0] }
- 2: { l_ost_idx: 56, l_fid: [0x1480000410:0x4b1a5:0x0] }
- 3: { l_ost_idx: 79, l_fid: [0xc8000040e:0x49e03:0x0] }
- 4: { l_ost_idx: 6, l_fid: [0x16c0000412:0x4b027:0x0] }
- 5: { l_ost_idx: 31, l_fid: [0x1280000413:0x4ae77:0x0] }
- 6: { l_ost_idx: 20, l_fid: [0x1300000413:0xf0137:0x0] }
- 7: { l_ost_idx: 21, l_fid: [0x1240000413:0xfc154:0x0] }
- 8: { l_ost_idx: 58, l_fid: [0x1400000411:0x4add5:0x0] }
- 9: { l_ost_idx: 41, l_fid: [0x8c000040e:0x4a664:0x0] }
- 10: { l_ost_idx: 4, l_fid: [0x1200000413:0x49647:0x0] }
- 11: { l_ost_idx: 67, l_fid: [0xcc000040e:0x4a6c9:0x0] }
- 12: { l_ost_idx: 80, l_fid: [0xf8000040f:0x4a881:0x0] }
- 13: { l_ost_idx: 22, l_fid: [0xdc000040e:0xf8dd5:0x0] }
- 14: { l_ost_idx: 16, l_fid: [0xf0000040f:0xf64b7:0x0] }
- 15: { l_ost_idx: 23, l_fid: [0x1980000407:0xf9a46:0x0] }
- 16: { l_ost_idx: 76, l_fid: [0x480000406:0x4a9d3:0x0] }
- 17: { l_ost_idx: 17, l_fid: [0x1840000407:0xf3d98:0x0] }
- 18: { l_ost_idx: 82, l_fid: [0x780000406:0x4ac13:0x0] }
- 19: { l_ost_idx: 32, l_fid: [0xe4000040e:0x4b2c9:0x0] }
- 20: { l_ost_idx: 68, l_fid: [0xd0000040e:0x4aef9:0x0] }
- 21: { l_ost_idx: 83, l_fid: [0xa40000408:0x4a5c2:0x0] }
- 22: { l_ost_idx: 44, l_fid: [0x980000402:0x49f6c:0x0] }
- 23: { l_ost_idx: 7, l_fid: [0x1340000413:0x49ea4:0x0] }
- 24: { l_ost_idx: 66, l_fid: [0x110000040e:0x4ac99:0x0] }
- 25: { l_ost_idx: 29, l_fid: [0x1380000413:0x4a82d:0x0] }
- 26: { l_ost_idx: 70, l_fid: [0x1740000411:0x4a7ec:0x0] }
- 27: { l_ost_idx: 42, l_fid: [0x68000040f:0x4a014:0x0] }
- 28: { l_ost_idx: 57, l_fid: [0x144000040f:0x4b0e4:0x0] }
- 29: { l_ost_idx: 45, l_fid: [0x300000414:0x4b324:0x0] }
- 30: { l_ost_idx: 64, l_fid: [0x1680000412:0x4a54f:0x0] }
- 31: { l_ost_idx: 65, l_fid: [0x1a80000407:0x4ac0f:0x0] }
- 32: { l_ost_idx: 53, l_fid: [0x13c0000411:0x4a931:0x0] }
- 33: { l_ost_idx: 69, l_fid: [0x1580000412:0x4b18a:0x0] }
- 34: { l_ost_idx: 30, l_fid: [0xe8000040f:0x4a644:0x0] }
- 35: { l_ost_idx: 33, l_fid: [0x1900000407:0x4a435:0x0] }
- 36: { l_ost_idx: 47, l_fid: [0x9c0000411:0x4ae60:0x0] }
- 37: { l_ost_idx: 18, l_fid: [0x17c0000412:0xedec1:0x0] }
- 38: { l_ost_idx: 55, l_fid: [0x940000401:0x4ac97:0x0] }
- 39: { l_ost_idx: 43, l_fid: [0x14c000040f:0x4a10e:0x0] }
- 40: { l_ost_idx: 8, l_fid: [0xf4000040f:0x4a208:0x0] }
- 41: { l_ost_idx: 52, l_fid: [0x400000413:0x48b6f:0x0] }
- 42: { l_ost_idx: 19, l_fid: [0x12c0000413:0xeaedf:0x0] }
- 43: { l_ost_idx: 9, l_fid: [0xec000040f:0x4b15b:0x0] }
- 44: { l_ost_idx: 78, l_fid: [0xe0000040e:0x4b116:0x0] }
- 45: { l_ost_idx: 40, l_fid: [0x640000414:0x4920c:0x0] }
- 46: { l_ost_idx: 28, l_fid: [0x1640000412:0x49dd1:0x0] }
- 47: { l_ost_idx: 77, l_fid: [0x1dc0000403:0x4a456:0x0] }
- 48: { l_ost_idx: 81, l_fid: [0xfc0000409:0x4af70:0x0] }
- 49: { l_ost_idx: 34, l_fid: [0x1600000412:0x4b1dd:0x0] }
- 50: { l_ost_idx: 46, l_fid: [0x1180000412:0x4ab8a:0x0] }
- 51: { l_ost_idx: 35, l_fid: [0x18c0000407:0x4b610:0x0] }
- 52: { l_ost_idx: 59, l_fid: [0x1500000410:0x49c79:0x0] }
- lcme_id: 6
lcme_mirror_id: 0
lcme_flags: init
lcme_extent.e_start: 549755813888
lcme_extent.e_end: EOF
lcme_offset: 2624
lcme_size: 696
sub_layout:
lmm_magic: 0x0BD30BD0
lmm_seq: 0xb0000fafa
lmm_object_id: 0x851e
lmm_fid: [0xb0000fafa:0x851e:0x0]
lmm_stripe_count: 27
lmm_stripe_size: 1048576
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 36
lmm_pool: ddn_hdd
lmm_objects:
- 0: { l_ost_idx: 36, l_fid: [0x100000040f:0x33dc6a:0x0] }
- 1: { l_ost_idx: 38, l_fid: [0x10c000040e:0x33ddf7:0x0] }
- 2: { l_ost_idx: 51, l_fid: [0x900000406:0x33e7cf:0x0] }
- 3: { l_ost_idx: 25, l_fid: [0x1d80000403:0x340ae7:0x0] }
- 4: { l_ost_idx: 0, l_fid: [0x1940000407:0x340972:0x0] }
- 5: { l_ost_idx: 3, l_fid: [0x1b40000407:0x33e13b:0x0] }
- 6: { l_ost_idx: 62, l_fid: [0x1b80000407:0x33d8b9:0x0] }
- 7: { l_ost_idx: 13, l_fid: [0x1c80000403:0x340899:0x0] }
- 8: { l_ost_idx: 14, l_fid: [0x1a00000407:0x33de5c:0x0] }
- 9: { l_ost_idx: 2, l_fid: [0x1ac0000407:0x340614:0x0] }
- 10: { l_ost_idx: 75, l_fid: [0x1780000419:0x33e6ae:0x0] }
- 11: { l_ost_idx: 15, l_fid: [0x1d40000403:0x33db62:0x0] }
- 12: { l_ost_idx: 50, l_fid: [0x4c000040d:0x33e1ef:0x0] }
- 13: { l_ost_idx: 26, l_fid: [0x1a40000407:0x33fb7f:0x0] }
- 14: { l_ost_idx: 48, l_fid: [0x50000040f:0x34074c:0x0] }
- 15: { l_ost_idx: 39, l_fid: [0x104000040f:0x33f7a9:0x0] }
- 16: { l_ost_idx: 1, l_fid: [0x1c40000407:0x33e142:0x0] }
- 17: { l_ost_idx: 12, l_fid: [0x1880000407:0x33ed31:0x0] }
- 18: { l_ost_idx: 73, l_fid: [0x1bc0000407:0x33e393:0x0] }
- 19: { l_ost_idx: 24, l_fid: [0x19c0000407:0x34015b:0x0] }
- 20: { l_ost_idx: 27, l_fid: [0x1cc0000403:0x33dc28:0x0] }
- 21: { l_ost_idx: 60, l_fid: [0x1c00000407:0x34092d:0x0] }
- 22: { l_ost_idx: 63, l_fid: [0x1b00000407:0x33de28:0x0] }
- 23: { l_ost_idx: 74, l_fid: [0x15c0000412:0x33ee07:0x0] }
- 24: { l_ost_idx: 72, l_fid: [0x1700000412:0x33ee5b:0x0] }
- 25: { l_ost_idx: 49, l_fid: [0xa00000411:0x33dea2:0x0] }
- 26: { l_ost_idx: 61, l_fid: [0x1d00000403:0x33e0e4:0x0] }
- lcme_id: 65537
lcme_mirror_id: 1
lcme_flags: init
lcme_extent.e_start: 0
lcme_extent.e_end: 1073741824
lcme_offset: 3320
lcme_size: 56
sub_layout:
lmm_magic: 0x0BD10BD0
lmm_seq: 0xb0000fafa
lmm_object_id: 0x851e
lmm_fid: [0xb0000fafa:0x851e:0x0]
lmm_stripe_count: 1
lmm_stripe_size: 1048576
lmm_pattern: raid0
lmm_layout_gen: 1
lmm_stripe_offset: 28
lmm_objects:
- 0: { l_ost_idx: 28, l_fid: [0x1640000412:0x48899:0x0] }
- lcme_id: 131073
lcme_mirror_id: 2
lcme_flags: init,stale
lcme_extent.e_start: 0
lcme_extent.e_end: EOF
lcme_offset: 3376
lcme_size: 680
sub_layout:
lmm_magic: 0x0BD10BD0
lmm_seq: 0xb0000fafa
lmm_object_id: 0x851e
lmm_fid: [0xb0000fafa:0x851e:0x0]
lmm_stripe_count: 27
lmm_stripe_size: 1048576
lmm_pattern: raid0
lmm_layout_gen: 27
lmm_stripe_offset: 24
lmm_objects:
- 0: { l_ost_idx: 24, l_fid: [0x19c0000407:0x356fcf:0x0] }
- 1: { l_ost_idx: 63, l_fid: [0x1b00000407:0x358557:0x0] }
- 2: { l_ost_idx: 24, l_fid: [0x19c0000407:0x3587f1:0x0] }
- 3: { l_ost_idx: 26, l_fid: [0x1a40000407:0x356a98:0x0] }
- 4: { l_ost_idx: 24, l_fid: [0x19c0000407:0x357156:0x0] }
- 5: { l_ost_idx: 26, l_fid: [0x1a40000407:0x357243:0x0] }
- 6: { l_ost_idx: 24, l_fid: [0x19c0000407:0x35a886:0x0] }
- 7: { l_ost_idx: 24, l_fid: [0x19c0000407:0x3572d9:0x0] }
- 8: { l_ost_idx: 24, l_fid: [0x19c0000407:0x35a1dd:0x0] }
- 9: { l_ost_idx: 24, l_fid: [0x19c0000407:0x357b95:0x0] }
- 10: { l_ost_idx: 24, l_fid: [0x19c0000407:0x358988:0x0] }
- 11: { l_ost_idx: 24, l_fid: [0x19c0000407:0x356823:0x0] }
- 12: { l_ost_idx: 26, l_fid: [0x1a40000407:0x3570ae:0x0] }
- 13: { l_ost_idx: 24, l_fid: [0x19c0000407:0x357e40:0x0] }
- 14: { l_ost_idx: 24, l_fid: [0x19c0000407:0x35911c:0x0] }
- 15: { l_ost_idx: 26, l_fid: [0x1a40000407:0x357aa4:0x0] }
- 16: { l_ost_idx: 24, l_fid: [0x19c0000407:0x357fd9:0x0] }
- 17: { l_ost_idx: 24, l_fid: [0x19c0000407:0x3555b2:0x0] }
- 18: { l_ost_idx: 24, l_fid: [0x19c0000407:0x356c52:0x0] }
- 19: { l_ost_idx: 24, l_fid: [0x19c0000407:0x3575e8:0x0] }
- 20: { l_ost_idx: 24, l_fid: [0x19c0000407:0x35a9de:0x0] }
- 21: { l_ost_idx: 24, l_fid: [0x19c0000407:0x359b44:0x0] }
- 22: { l_ost_idx: 24, l_fid: [0x19c0000407:0x355d17:0x0] }
- 23: { l_ost_idx: 24, l_fid: [0x19c0000407:0x35777c:0x0] }
- 24: { l_ost_idx: 24, l_fid: [0x19c0000407:0x352bd3:0x0] }
- 25: { l_ost_idx: 24, l_fid: [0x19c0000407:0x358b26:0x0] }
- 26: { l_ost_idx: 24, l_fid: [0x19c0000407:0x358fc0:0x0] }
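The last component in the output above (lcme_id: 131073) lists the same l_ost_idx several times. A minimal Python sketch for spotting such repeats in saved `lfs getstripe -v` text follows. The function name `repeated_osts` is hypothetical, and the regular expressions assume the line format shown in this ticket; other Lustre versions may print these fields differently.

```python
import re
from collections import Counter, defaultdict

# Patterns follow the `lfs getstripe -v` text quoted in this ticket
# (assumption: the field names and "key: value" layout are unchanged).
COMP_RE = re.compile(r"lcme_id:\s*(\d+)")
OST_RE = re.compile(r"l_ost_idx:\s*(\d+)")


def repeated_osts(getstripe_text):
    """Return {lcme_id: {ost_idx: count}} for OSTs listed more than once
    within a single component."""
    per_comp = defaultdict(Counter)
    current = None
    for line in getstripe_text.splitlines():
        m = COMP_RE.search(line)
        if m:
            # A new component starts; following objects belong to it.
            current = int(m.group(1))
            continue
        m = OST_RE.search(line)
        if m and current is not None:
            per_comp[current][int(m.group(1))] += 1
    result = {}
    for comp, counts in per_comp.items():
        dups = {ost: n for ost, n in counts.items() if n > 1}
        if dups:
            result[comp] = dups
    return result


# First three objects of component 131073, copied from the output above.
sample = """\
- lcme_id: 131073
lmm_objects:
- 0: { l_ost_idx: 24, l_fid: [0x19c0000407:0x356fcf:0x0] }
- 1: { l_ost_idx: 63, l_fid: [0x1b00000407:0x358557:0x0] }
- 2: { l_ost_idx: 24, l_fid: [0x19c0000407:0x3587f1:0x0] }
"""
print(repeated_osts(sample))  # {131073: {24: 2}}
```

Running it over the full output would report every component in which an OST index repeats; whether a repeat is actually invalid depends on the layout rules, which this sketch does not model.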
Backtrace of the client process. It waits on RPC response:
cat /proc/960708/stack
[<0>] ptlrpc_set_wait+0x5cf/0x740 [ptlrpc]
[<0>] ptlrpc_queue_wait+0x84/0x230 [ptlrpc]
[<0>] ldlm_cli_enqueue+0x496/0x9e0 [ptlrpc]
[<0>] mdc_enqueue_base+0x20c/0xb40 [mdc]
[<0>] mdc_intent_lock+0x269/0x580 [mdc]
[<0>] lmv_intent_lookup+0x280/0xa00 [lmv]
[<0>] lmv_intent_lock+0x311/0x390 [lmv]
[<0>] ll_intent_lock+0x204/0x880 [lustre]
[<0>] ll_layout_intent+0x15b/0x2a0 [lustre]
[<0>] ll_layout_write_intent+0x56/0x100 [lustre]
[<0>] vvp_io_fini+0x1ef/0x860 [lustre]
[<0>] vvp_io_setattr_fini+0x40/0x70 [lustre]
[<0>] cl_io_fini+0x77/0x230 [obdclass]
[<0>] cl_setattr_ost+0x156/0x380 [lustre]
[<0>] ll_setattr_raw+0xa3f/0xf90 [lustre]
[<0>] notify_change+0x303/0x510
[<0>] do_truncate+0x80/0xe0
[<0>] do_ftruncate+0xfb/0x150
[<0>] __x64_sys_ftruncate+0x38/0x70
[<0>] do_syscall_64+0x5c/0xe0
[<0>] entry_SYSCALL_64_after_hwframe+0x78/0x80
The debug log on the MDS. It loops indefinitely, trying to update the attributes of two objects. The original layout version stays the same in every iteration, while the target version is incremented with each attempt. The master layout gen is also incremented at a high rate:
00002000:00000002:17.0:1773393788.292592:0:20118:0:(ofd_objects.c:636:ofd_object_ff_update()) scratch-OST0012:[0xb0000fafa:0x851e:0x0]:[0x17c0000412:0xede98:0x0] layout version 0x5abb04c6 -> 0x628d944d, oa_valid 0x1008001
00002000:00000002:16.0:1773393788.292592:0:36642:0:(ofd_objects.c:636:ofd_object_ff_update()) scratch-OST0012:[0xb0000fafa:0x851e:0x0]:[0x17c0000412:0xedea0:0x0] layout version 0x5abb04c6 -> 0x628d944d, oa_valid 0x1008001
00080000:00000002:17.0:1773393788.292596:0:20118:0:(osd_handler.c:5161:osd_xattr_set()) [0x17c0000412:0xede98:0x0] set xattr 'trusted.fid' with size 52
00080000:00000002:16.0:1773393788.292596:0:36642:0:(osd_handler.c:5161:osd_xattr_set()) [0x17c0000412:0xedea0:0x0] set xattr 'trusted.fid' with size 52
00002000:00000002:16.0:1773393788.308046:0:18626:0:(ofd_objects.c:636:ofd_object_ff_update()) scratch-OST0012:[0xb0000fafa:0x851e:0x0]:[0x17c0000412:0xedea0:0x0] layout version 0x5abb04c6 -> 0x628d944e, oa_valid 0x1008001
00080000:00000002:16.0:1773393788.308050:0:18626:0:(osd_handler.c:5161:osd_xattr_set()) [0x17c0000412:0xedea0:0x0] set xattr 'trusted.fid' with size 52
00002000:00000002:16.0:1773393788.308427:0:18626:0:(ofd_objects.c:636:ofd_object_ff_update()) scratch-OST0012:[0xb0000fafa:0x851e:0x0]:[0x17c0000412:0xede98:0x0] layout version 0x5abb04c6 -> 0x628d944e, oa_valid 0x1008001
00080000:00000002:16.0:1773393788.308430:0:18626:0:(osd_handler.c:5161:osd_xattr_set()) [0x17c0000412:0xede98:0x0] set xattr 'trusted.fid' with size 52
00002000:00000002:17.0:1773393788.330578:0:36642:0:(ofd_objects.c:636:ofd_object_ff_update()) scratch-OST0012:[0xb0000fafa:0x851e:0x0]:[0x17c0000412:0xedea0:0x0] layout version 0x5abb04c6 -> 0x628d944f, oa_valid 0x1008001
00080000:00000002:17.0:1773393788.330581:0:36642:0:(osd_handler.c:5161:osd_xattr_set()) [0x17c0000412:0xedea0:0x0] set xattr 'trusted.fid' with size 52
00002000:00000002:17.0:1773393788.331272:0:36642:0:(ofd_objects.c:636:ofd_object_ff_update()) scratch-OST0012:[0xb0000fafa:0x851e:0x0]:[0x17c0000412:0xede98:0x0] layout version 0x5abb04c6 -> 0x628d944f, oa_valid 0x1008001
00080000:00000002:17.0:1773393788.331276:0:36642:0:(osd_handler.c:5161:osd_xattr_set()) [0x17c0000412:0xede98:0x0] set xattr 'trusted.fid' with size 52
00002000:00000002:17.0:1773393788.348367:0:36629:0:(ofd_objects.c:636:ofd_object_ff_update()) scratch-OST0012:[0xb0000fafa:0x851e:0x0]:[0x17c0000412:0xedea0:0x0] layout version 0x5abb04c6 -> 0x628d9450, oa_valid 0x1008001
00080000:00000002:17.0:1773393788.348371:0:36629:0:(osd_handler.c:5161:osd_xattr_set()) [0x17c0000412:0xedea0:0x0] set xattr 'trusted.fid' with size 52
00002000:00000002:17.0:1773393788.348636:0:36629:0:(ofd_objects.c:636:ofd_object_ff_update()) scratch-OST0012:[0xb0000fafa:0x851e:0x0]:[0x17c0000412:0xede98:0x0] layout version 0x5abb04c6 -> 0x628d9450, oa_valid 0x1008001
00080000:00000002:17.0:1773393788.348639:0:36629:0:(osd_handler.c:5161:osd_xattr_set()) [0x17c0000412:0xede98:0x0] set xattr 'trusted.fid' with size 52
00002000:00000002:17.0:1773393788.366355:0:20716:0:(ofd_objects.c:636:ofd_object_ff_update()) scratch-OST0012:[0xb0000fafa:0x851e:0x0]:[0x17c0000412:0xedea0:0x0] layout version 0x5abb04c6 -> 0x628d9451, oa_valid 0x1008001
00080000:00000002:17.0:1773393788.366358:0:20716:0:(osd_handler.c:5161:osd_xattr_set()) [0x17c0000412:0xedea0:0x0] set xattr 'trusted.fid' with size 52
00002000:00000002:16.0:1773393788.366534:0:20118:0:(ofd_objects.c:636:ofd_object_ff_update()) scratch-OST0012:[0xb0000fafa:0x851e:0x0]:[0x17c0000412:0xede98:0x0] layout version 0x5abb04c6 -> 0x628d9451, oa_valid 0x1008001
00080000:00000002:16.0:1773393788.366537:0:20118:0:(osd_handler.c:5161:osd_xattr_set()) [0x17c0000412:0xede98:0x0] set xattr 'trusted.fid' with size 52
00002000:00000002:16.0:1773393788.386779:0:18628:0:(ofd_objects.c:636:ofd_object_ff_update()) scratch-OST0012:[0xb0000fafa:0x851e:0x0]:[0x17c0000412:0xedea0:0x0] layout version 0x5abb04c6 -> 0x628d9452, oa_valid 0x1008001
00080000:00000002:16.0:1773393788.386783:0:18628:0:(osd_handler.c:5161:osd_xattr_set()) [0x17c0000412:0xedea0:0x0] set xattr 'trusted.fid' with size 52
00002000:00000002:16.0:1773393788.387026:0:18628:0:(ofd_objects.c:636:ofd_object_ff_update()) scratch-OST0012:[0xb0000fafa:0x851e:0x0]:[0x17c0000412:0xede98:0x0] layout version 0x5abb04c6 -> 0x628d9452, oa_valid 0x1008001
00080000:00000002:16.0:1773393788.387030:0:18628:0:(osd_handler.c:5161:osd_xattr_set()) [0x17c0000412:0xede98:0x0] set xattr 'trusted.fid' with size 52
00000004:80000000:10.0:1773393788.395610:0:76351:0:(lod_object.c:1241:lod_obj_stripe_attr_set_cb()) [0x17c0000412:0xede98:0x0]: set layout version: 1653445715, comp_idx: 2
00000004:80000000:10.0:1773393788.395900:0:76351:0:(lod_object.c:1241:lod_obj_stripe_attr_set_cb()) [0x17c0000412:0xedea0:0x0]: set layout version: 1653445715, comp_idx: 3
00002000:00000002:16.0:1773393788.402885:0:36635:0:(ofd_objects.c:636:ofd_object_ff_update()) scratch-OST0012:[0xb0000fafa:0x851e:0x0]:[0x17c0000412:0xede98:0x0] layout version 0x5abb04c6 -> 0x628d9453, oa_valid 0x1008001
00080000:00000002:16.0:1773393788.402889:0:36635:0:(osd_handler.c:5161:osd_xattr_set()) [0x17c0000412:0xede98:0x0] set xattr 'trusted.fid' with size 52
00002000:00000002:17.0:1773393788.403066:0:18626:0:(ofd_objects.c:636:ofd_object_ff_update()) scratch-OST0012:[0xb0000fafa:0x851e:0x0]:[0x17c0000412:0xedea0:0x0] layout version 0x5abb04c6 -> 0x628d9453, oa_valid 0x1008001
00080000:00000002:17.0:1773393788.403069:0:18626:0:(osd_handler.c:5161:osd_xattr_set()) [0x17c0000412:0xedea0:0x0] set xattr 'trusted.fid' with size 52
00000004:80000000:10.0:1773393788.409239:0:76351:0:(lod_object.c:1241:lod_obj_stripe_attr_set_cb()) [0x17c0000412:0xede98:0x0]: set layout version: 1653445716, comp_idx: 2
00000004:80000000:10.0:1773393788.409538:0:76351:0:(lod_object.c:1241:lod_obj_stripe_attr_set_cb()) [0x17c0000412:0xedea0:0x0]: set layout version: 1653445716, comp_idx: 3
00002000:00000002:16.0:1773393788.414976:0:36629:0:(ofd_objects.c:636:ofd_object_ff_update()) scratch-OST0012:[0xb0000fafa:0x851e:0x0]:[0x17c0000412:0xedea0:0x0] layout version 0x5abb04c6 -> 0x628d9454, oa_valid 0x1008001
00080000:00000002:16.0:1773393788.414980:0:36629:0:(osd_handler.c:5161:osd_xattr_set()) [0x17c0000412:0xedea0:0x0] set xattr 'trusted.fid' with size 52
00002000:00000002:16.0:1773393788.415724:0:20716:0:(ofd_objects.c:636:ofd_object_ff_update()) scratch-OST0012:[0xb0000fafa:0x851e:0x0]:[0x17c0000412:0xede98:0x0] layout version 0x5abb04c6 -> 0x628d9454, oa_valid 0x1008001
00080000:00000002:16.0:1773393788.415728:0:20716:0:(osd_handler.c:5161:osd_xattr_set()) [0x17c0000412:0xede98:0x0] set xattr 'trusted.fid' with size 52
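The ofd_object_ff_update() messages print the layout version in hex, while the lod_obj_stripe_attr_set_cb() messages print it in decimal; for example, 0x628d9453 is 1653445715. A small Python sketch for putting both on one scale is shown here. The helper name `target_versions` is hypothetical, and the regular expressions only assume the two message formats that appear in the excerpt above.

```python
import re

# Message formats as seen in the log excerpt in this ticket (assumption:
# other Lustre versions may word these debug messages differently).
OFD_RE = re.compile(r"layout version (0x[0-9a-f]+) -> (0x[0-9a-f]+)")
LOD_RE = re.compile(r"set layout version: (\d+), comp_idx: (\d+)")


def target_versions(log_lines):
    """Return the sorted distinct target layout versions, as integers,
    taken from both the hex (ofd) and decimal (lod) messages."""
    seen = set()
    for line in log_lines:
        m = OFD_RE.search(line)
        if m:
            seen.add(int(m.group(2), 16))  # "-> 0x..." is the new version
            continue
        m = LOD_RE.search(line)
        if m:
            seen.add(int(m.group(1)))
    return sorted(seen)


# Message tails copied from the log above (prefixes trimmed for brevity).
lines = [
    "layout version 0x5abb04c6 -> 0x628d9452, oa_valid 0x1008001",
    "set layout version: 1653445715, comp_idx: 2",
    "layout version 0x5abb04c6 -> 0x628d9453, oa_valid 0x1008001",
    "set layout version: 1653445716, comp_idx: 3",
]
print(target_versions(lines))  # [1653445714, 1653445715, 1653445716]
```

Consecutive integers in the result correspond to the one-step increments per iteration described above.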
The following is a working theory that has not yet been confirmed.
1) The component with lcme_id: 131073, lcme_mirror_id: 2 has multiple objects on the same OST (index 24), which makes the layout look corrupted.
2) In the log, we see two distinct OST objects, [0x17c0000412:0xede98:0x0] (comp_idx 2) and [0x17c0000412:0xedea0:0x0] (comp_idx 3), both reported by scratch-OST0012.
The loop is occurring in the lod_obj_stripe_attr_set_cb() callback. This function is responsible for propagating the layout version from the MDS to the OSTs so that the OSTs know which version of the file layout is current.
3) The lfs mirror resync process is trying to update the layout version (the incrementing values 0x628d944d, 0x628d944e, ... in the log).
4) The MDS is sending layout update RPCs to the OSTs. The MDS sets the layout version on comp_idx 2 and comp_idx 3 to the same value X.
However, if these components logically overlap or incorrectly point to the same object ID, the completion of one update is triggering a "stale" or "changed" notification that forces the MDS to increment the layout version again to maintain consistency.
5) As a result, the layout version is incremented every few milliseconds, and the MDS is stuck in the callback loop trying to catch up to a version number that it keeps moving forward itself.
The lod_obj_stripe_attr_set_cb() function handles the completion of attribute sets. It is likely receiving completions for the same OST index multiple times for a single logical operation. If Stripe 0 and Stripe 2 are both on OST 24 and the MDS tries to update the layout version, the OST might accept one update and reject the other, or the completion of one might trigger a "stale layout" notification for the other, causing the MDS to increment the version and retry indefinitely.
The proposed fix
I do not know how the file's layout got into the corrupted state or how to prevent the corruption. The purpose of this ticket is to analyze what happens when Lustre processes such corrupted layouts and to prevent the fatal condition by gracefully rejecting the operation and reporting the error.
The Lustre code is likely skipping a crucial validation step. Possible checks:
- Strict XATTR size check: verify that the size of the extended attribute read from disk is exactly equal to the expected size of a layout with N stripes.
- FID uniqueness check: verify that the FIDs within every mirror are unique.
- Layout version retry limit: if the layout version is incremented more than a reasonable number of times (e.g., 100) within a single transaction, the code should abort with an error rather than keep spinning indefinitely.
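The three checks can be modeled outside the kernel to make the intended behavior concrete. The following is a Python sketch only, not Lustre code: all function names are hypothetical, and the size constants are an assumption inferred from the getstripe output in this ticket, where every lcme_size equals a fixed header (32 bytes for lmm_magic 0x0BD10BD0, 48 bytes for 0x0BD30BD0) plus 24 bytes per stripe.

```python
# Assumed sizes, consistent with the lcme_size values in this ticket:
# e.g. 48 + 24 * 53 = 1320 and 32 + 24 * 27 = 680.
HEADER_SIZE = {0x0BD10BD0: 32, 0x0BD30BD0: 48}
PER_STRIPE_SIZE = 24


def size_ok(magic, stripe_count, xattr_size):
    """Strict size check: the stored size must match the stripe count exactly."""
    if magic not in HEADER_SIZE:
        return False
    return xattr_size == HEADER_SIZE[magic] + PER_STRIPE_SIZE * stripe_count


def fids_unique(fids):
    """FID uniqueness check for the objects of one mirror."""
    return len(set(fids)) == len(fids)


def update_with_limit(try_update, max_retries=100):
    """Call try_update() until it returns True, giving up after max_retries.

    Returns the number of attempts used; raises RuntimeError instead of
    spinning forever when the limit is reached.
    """
    for attempt in range(1, max_retries + 1):
        if try_update():
            return attempt
    raise RuntimeError(f"layout version update did not settle after {max_retries} attempts")


# Values taken from the getstripe output in this ticket.
print(size_ok(0x0BD30BD0, 53, 1320))  # True
print(size_ok(0x0BD10BD0, 27, 680))   # True
print(fids_unique(["0x19c0000407:0x356fcf:0x0", "0x19c0000407:0x356fcf:0x0"]))  # False
```

In the kernel, the equivalent of `update_with_limit` would return an error code to the caller so that the resync fails cleanly; the limit of 100 is the example threshold from the list above, not a tuned value.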
Issue Links
- is related to LU-20011 lfs mirror: Invalid layout: Incomplete mirror - must go to EOF (Open)