[LU-14719] "lfs migrate -m" creates broken agent inodes when target MDT full Created: 28/May/21 Updated: 06/Feb/24 Resolved: 17/Feb/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.3 |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Andreas Dilger | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | migration_improvements |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
mds02 kernel: LustreError: 8471:0:(osd_handler.c:3892:osd_create_local_agent_inode()) lfs1-MDT0001: [0x200026991:0x560d:0x0] add dot dotdot error: rc = -28
mds02 kernel: LustreError: 8471:0:(out_lib.c:1190:out_tx_index_delete_undo()) lfs1-MDT0001-osd: Oops, can not rollback index_delete yet: rc = -524
mds02 kernel: LustreError: 8471:0:(out_handler.c:915:out_tx_end()) lfs1-MDT0001-osd: undo for lustre/target/out_handler.c:454: rc = -524
mds02 kernel: LustreError: 22380:0:(osd_handler.c:3892:osd_create_local_agent_inode()) lfs1-MDT0001: [0x200026991:0x5613:0x0] add dot dotdot error: rc = -28
mds02 kernel: LustreError: 22380:0:(osd_handler.c:3892:osd_create_local_agent_inode()) Skipped 2 previous similar messages
mds02 kernel: LustreError: 22380:0:(out_lib.c:1190:out_tx_index_delete_undo()) lfs1-MDT0001-osd: Oops, can not rollback index_delete yet: rc = -524
mds02 kernel: LustreError: 22380:0:(out_lib.c:1190:out_tx_index_delete_undo()) Skipped 5 previous similar messages
mds02 kernel: LustreError: 22380:0:(out_handler.c:915:out_tx_end()) lfs1-MDT0001-osd: undo for lustre/target/out_handler.c:454: rc = -524

When e2fsck is run on the filesystem (LU-14710) it reports that the "." and ".." entries are corrupted, and can also report that the HTree index is corrupted:

Directory entry for '.' in ... (1032783) is big.
Split? no

Second entry '3.3.0' (inode=538027 fid=[0x380020941:0x4c38:0x0]) in directory inode 1032783 should be '..'
Fix? no

This should be handled better by the MDS:
|
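For readers triaging similar console output, the rc values in the messages quoted in the Description can be decoded with a small helper. This is only a sketch: the names `RC_NAMES`, `LINE_RE` and `decode` are illustrative and not part of Lustre, and the regular expression assumes the `LustreError: pid:0:(file:line:func()) message: rc = N` layout seen in this ticket. The errno meanings are the standard Linux ones (524 is the kernel-internal ENOTSUPP, which Python's `errno` module does not define).

```python
import re

# Return codes that appear in the console logs of this ticket.
RC_NAMES = {
    28: "ENOSPC (no space left on device)",
    107: "ENOTCONN (transport endpoint is not connected)",
    524: "ENOTSUPP (operation is not supported, kernel-internal)",
}

# Assumed layout: "LustreError: <pid>:<n>:(<file>:<line>:<func>()) <msg>: rc = <rc>"
LINE_RE = re.compile(
    r"LustreError: \d+:\d+:\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\)"
    r" (?P<msg>.*?): rc = (?P<rc>-?\d+)"
)


def decode(line):
    """Return (function, rc, meaning) for a log line, or None if it has no rc."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    rc = int(m.group("rc"))
    return m.group("func"), rc, RC_NAMES.get(-rc, "unknown")


if __name__ == "__main__":
    sample = (
        "mds02 kernel: LustreError: 8471:0:(osd_handler.c:3892:"
        "osd_create_local_agent_inode()) lfs1-MDT0001: "
        "[0x200026991:0x560d:0x0] add dot dotdot error: rc = -28"
    )
    print(decode(sample))
```

Read this way, the first message says the agent inode could not get its "." and ".." entries because the target ran out of space, and the following messages say the index_delete undo is not supported, which is consistent with the corruption e2fsck reports.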
| Comments |
| Comment by Gerrit Updater [ 12/Apr/22 ] |
|
"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47039 |
| Comment by Gerrit Updater [ 12/Apr/22 ] |
|
"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47040 |
| Comment by Gerrit Updater [ 25/Apr/22 ] |
|
"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47128 |
| Comment by Gerrit Updater [ 01/Sep/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47040/ |
| Comment by Gerrit Updater [ 17/Sep/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47128/ |
| Comment by Gerrit Updater [ 25/Oct/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/47039/ |
| Comment by Gerrit Updater [ 23/Nov/22 ] |
|
|
| Comment by Gerrit Updater [ 23/Nov/22 ] |
|
|
| Comment by Gerrit Updater [ 23/Nov/22 ] |
|
|
| Comment by Andreas Dilger [ 23/Nov/22 ] |
|
Lai, it looks like patch https://review.whamcloud.com/47867 " Could you please look at fixing this. This patch predates the abort_recov_mdt patch, so that can't be the cause of the test_111g failure.

According to the console logs, the client is dropped from recovery, and the "rm -rf $DIR/$tdir/striped_dir" command is not replayed from the client:

[ 108.039795] LustreError: 13372:0:(ldlm_lockd.c:2569:ldlm_cancel_handler()) ldlm_cancel from 0@lo arrived at 1669164818 with bad export cookie 14558261718912942160
[ 108.039842] LustreError: 11-0: lustre-MDT0000-lwp-MDT0001: operation mds_disconnect to node 0@lo failed: rc = -107
[ 134.093090] LustreError: 1851:0:(client.c:3253:ptlrpc_replay_interpret()) @@@ status -107, old was 0 req@ffff88010b421300 x1750246084321408/t4294967302(4294967302) o36->lustre-MDT0001-mdc-ffff8800d749c800@192.168.203.164@tcp:12/10 lens 496/456 e 1 to 0 dl 1669164853 ref 2 fl Interpret:RQU/4/0 rc -107/-107 job:'rm.0'
[ 134.108916] Lustre: lustre-MDT0001-mdc-ffff8800d749c800: Connection restored to (at 192.168.203.164@tcp)

Full debug logs are available, though this is also a 100% failure and can hopefully also be reproduced locally. |
| Comment by Lai Siyao [ 25/Nov/22 ] |
|
I can't reproduce this on my local test system. Does it fail only in Janitor testing? I will enable full debug and check the logs. |
| Comment by Andreas Dilger [ 25/Nov/22 ] |
|
I think it only affects Janitor testing, since patch https://review.whamcloud.com/49216 passed normal autotest. However, it fails 100% of the time in Janitor testing. Note that test_111g already enables full debug logging, so you should be able to see it in the latest test results: |
| Comment by Gerrit Updater [ 25/Nov/22 ] |
|
"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49249 |
| Comment by Lai Siyao [ 26/Nov/22 ] |
|
The failure in replay-single 111g is lod_trans_space_check() |
| Comment by Gerrit Updater [ 17/Feb/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49249/ |
| Comment by Peter Jones [ 17/Feb/23 ] |
|
Landed for 2.16 |