[LU-14719] "lfs migrate -m" creates broken agent inodes when target MDT full
Created: 28/May/21  Updated: 06/Feb/24  Resolved: 17/Feb/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.3
Fix Version/s: Lustre 2.16.0

Type: Bug
Priority: Major
Reporter: Andreas Dilger
Assignee: Lai Siyao
Resolution: Fixed
Votes: 0
Labels: migration_improvements

Issue Links:
Related
is related to LU-14710 check_dot() does not handle dirdata/F... Open
is related to LU-15868 LFSCK fix inconsistencies in director... Resolved
is related to LU-15001 improve recovery of interrupted direc... Open
is related to LU-16467 lod_trans_space_check() fails with -2... Open
is related to LU-13832 "lfs migrate -m" leads to inconsisten... Resolved
is related to LU-14211 DNE3: mechanism to interrupt and resu... Open
is related to LU-11776 add "lfs find" support for directory ... Resolved
is related to LU-15990 "lfs find" to scan for directory hash... Resolved
Severity: 3

 Description   
mds02 kernel: LustreError: 8471:0:(osd_handler.c:3892:osd_create_local_agent_inode()) lfs1-MDT0001: [0x200026991:0x560d:0x0] add dot dotdot error: rc = -28
mds02 kernel: LustreError: 8471:0:(out_lib.c:1190:out_tx_index_delete_undo()) lfs1-MDT0001-osd: Oops, can not rollback index_delete yet: rc = -524
mds02 kernel: LustreError: 8471:0:(out_handler.c:915:out_tx_end()) lfs1-MDT0001-osd: undo for lustre/target/out_handler.c:454: rc = -524
mds02 kernel: LustreError: 22380:0:(osd_handler.c:3892:osd_create_local_agent_inode()) lfs1-MDT0001: [0x200026991:0x5613:0x0] add dot dotdot error: rc = -28
mds02 kernel: LustreError: 22380:0:(osd_handler.c:3892:osd_create_local_agent_inode()) Skipped 2 previous similar messages
mds02 kernel: LustreError: 22380:0:(out_lib.c:1190:out_tx_index_delete_undo()) lfs1-MDT0001-osd: Oops, can not rollback index_delete yet: rc = -524
mds02 kernel: LustreError: 22380:0:(out_lib.c:1190:out_tx_index_delete_undo()) Skipped 5 previous similar messages
mds02 kernel: LustreError: 22380:0:(out_handler.c:915:out_tx_end()) lfs1-MDT0001-osd: undo for lustre/target/out_handler.c:454: rc = -524
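
The rc values in the console messages above are negated Linux kernel errno values. The sketch below is a small illustrative helper, not part of Lustre: the function name `decode_rc` and the lookup table are assumptions made for this example. It extracts the `rc = N` value from a log line and maps it to a symbolic name. The table is written out by hand because 524 (ENOTSUPP) is a kernel-internal code that userspace errno tables do not define.

```python
import re

# Linux errno values that appear in this ticket. 524 (ENOTSUPP) is
# kernel-internal and has no userspace errno name, so it is listed by hand.
KERNEL_ERRNO = {
    28: "ENOSPC",     # no space left on device
    107: "ENOTCONN",  # transport endpoint is not connected
    524: "ENOTSUPP",  # operation is not supported (kernel-internal)
}

RC_PATTERN = re.compile(r"rc = (-?\d+)")


def decode_rc(log_line):
    """Return (rc, name) for the first 'rc = N' in a log line, or None."""
    match = RC_PATTERN.search(log_line)
    if match is None:
        return None
    rc = int(match.group(1))
    return rc, KERNEL_ERRNO.get(abs(rc), "unknown")


if __name__ == "__main__":
    line = ("LustreError: 8471:0:(osd_handler.c:3892:"
            "osd_create_local_agent_inode()) lfs1-MDT0001: "
            "[0x200026991:0x560d:0x0] add dot dotdot error: rc = -28")
    print(decode_rc(line))
```

Read this way, the "add dot dotdot" failure is ENOSPC on the target MDT, and the -524 that follows reflects that rollback of index_delete is not supported, as the message itself states.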

When e2fsck is run on the filesystem (LU-14710) it reports that the "." and ".." entries are corrupted, and can also report that the HTree index is corrupted:

Directory entry for '.' in ... (1032783) is big.
Split? no
Second entry '3.3.0' (inode=538027 fid=[0x380020941:0x4c38:0x0]) in directory inode 1032783 should be '..'
Fix? no

This should be handled better by the MDS:

  • check that the target MDT has enough free space before starting the migration
  • if ENOSPC is returned before the directory is migrated, clean up the unused agent inodes
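
The two proposals above can be sketched as follows. This is a simplified Python model, not Lustre code: `MdtUsage`, `has_room`, `migrate_directory`, the reserve parameters, and the one-block-and-one-inode cost per entry are all hypothetical choices made for illustration. It shows a pre-flight check that refuses to start when the target is short of blocks or inodes, and a cleanup path that removes agent inodes already created if ENOSPC is hit partway through.

```python
import errno
from dataclasses import dataclass, field


@dataclass
class MdtUsage:
    """Hypothetical snapshot of a target MDT's free resources."""
    free_blocks: int
    free_inodes: int
    agent_inodes: list = field(default_factory=list)


def has_room(mdt, need_blocks, need_inodes, reserve_blocks=0, reserve_inodes=0):
    """Pre-flight check: enough free blocks and inodes above a reserve."""
    return (mdt.free_blocks - reserve_blocks >= need_blocks and
            mdt.free_inodes - reserve_inodes >= need_inodes)


def create_agent_inode(mdt, name):
    """Model creating one agent inode; raises ENOSPC when resources run out."""
    if mdt.free_inodes < 1 or mdt.free_blocks < 1:
        raise OSError(errno.ENOSPC, "no space on target MDT")
    mdt.free_inodes -= 1
    mdt.free_blocks -= 1
    mdt.agent_inodes.append(name)


def migrate_directory(mdt, entries, check_first=True):
    """Create agent inodes for all entries, or leave the target unchanged."""
    # Assumption: each entry costs one block and one inode on the target.
    if check_first and not has_room(mdt, len(entries), len(entries)):
        return -errno.ENOSPC
    created = []
    try:
        for name in entries:
            create_agent_inode(mdt, name)
            created.append(name)
    except OSError as exc:
        # Undo the agent inodes created before the failure.
        for name in created:
            mdt.agent_inodes.remove(name)
            mdt.free_inodes += 1
            mdt.free_blocks += 1
        return -exc.errno
    return 0
```

With `check_first=False` the model still leaves no stray agent inodes behind, which is the behaviour the second bullet asks for when the pre-flight estimate turns out to be wrong.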


 Comments   
Comment by Gerrit Updater [ 12/Apr/22 ]

"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47039
Subject: LU-14719 lod: distributed transaction check space
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 633dec326173f72cb3b451ec43e08119a553e5df

Comment by Gerrit Updater [ 12/Apr/22 ]

"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47040
Subject: LU-14719 utils: dir migration stop on error
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5eff26f75e239fb9764cd9947d7f08deab001bbe

Comment by Gerrit Updater [ 25/Apr/22 ]

"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47128
Subject: LU-14719 osp: add inode watermark
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e34b90897c1f47ec692b34c11b4d2482366fa90c

Comment by Gerrit Updater [ 01/Sep/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47040/
Subject: LU-14719 utils: dir migration stop on error
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9ca348e8769d2c613082eeaeaf2775e22625e970

Comment by Gerrit Updater [ 17/Sep/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47128/
Subject: LU-14719 osp: add inode watermark
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 336eb696299e1c9731bd1443f05e5d814314ed36

Comment by Gerrit Updater [ 25/Oct/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/47039/
Subject: LU-14719 lod: distributed transaction check space
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6aee406c84b6b8fddf08b560acfcdf7c13c97e63

Comment by Gerrit Updater [ 23/Nov/22 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49216
Subject: LU-14719 tests: fix replay-single/111g version check
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4065ec7afbed4f3aaf94a1c4a0026701267d2963

Comment by Gerrit Updater [ 23/Nov/22 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49217
Subject: LU-14719 tests: find replay-single/111g breakage
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2c47c196635aef7e911b1717332d38a321b7c1ac

Comment by Gerrit Updater [ 23/Nov/22 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49218
Subject: LU-14719 tests: find replay-single/111g breakage
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 842c76da8ea4049c3b846ed1dd7a2b40e1bd877b

Comment by Andreas Dilger [ 23/Nov/22 ]

Lai, it looks like patch https://review.whamcloud.com/47867 "LU-14719 lod: distributed transaction check space" broke replay-single.sh test_111g. We kind of knew that based on Oleg's comments on an earlier version of the patch, but the broken version check caused the test to be permanently skipped.

Could you please look at fixing this? This patch predates the abort_recov_mdt patch, so that can't be the cause of the test_111g failure.

According to the console logs, the client is dropped from recovery, and the "rm -rf $DIR/$tdir/striped_dir" command is not replayed from the client:

[  108.039795] LustreError: 13372:0:(ldlm_lockd.c:2569:ldlm_cancel_handler()) ldlm_cancel from 0@lo arrived at 1669164818 with bad export cookie 14558261718912942160
[  108.039842] LustreError: 11-0: lustre-MDT0000-lwp-MDT0001: operation mds_disconnect to node 0@lo failed: rc = -107
[  134.093090] LustreError: 1851:0:(client.c:3253:ptlrpc_replay_interpret()) @@@ status -107, old was 0  req@ffff88010b421300 x1750246084321408/t4294967302(4294967302) o36->lustre-MDT0001-mdc-ffff8800d749c800@192.168.203.164@tcp:12/10 lens 496/456 e 1 to 0 dl 1669164853 ref 2 fl Interpret:RQU/4/0 rc -107/-107 job:'rm.0'
[  134.108916] Lustre: lustre-MDT0001-mdc-ffff8800d749c800: Connection restored to  (at 192.168.203.164@tcp)

Full debug logs are available, though this is also a 100% failure and can hopefully also be reproduced locally.
https://testing.whamcloud.com/gerrit-janitor/26658/testresults/replay-single-special2-ldiskfs-DNE-centos7_x86_64-centos7_x86_64/

Comment by Lai Siyao [ 25/Nov/22 ]

I can't reproduce this on my local test system. Does it fail only in Janitor testing? I will enable full debug and check the logs.

Comment by Andreas Dilger [ 25/Nov/22 ]

I think it only affects Janitor testing, since the patch https://review.whamcloud.com/49216 passed normal autotest. However, it fails 100% of the time in Janitor testing.

Note that test_111g already enables full debug logging so you should be able to see it in the latest test results:
https://testing.whamcloud.com/gerrit-janitor/26658/testresults/replay-single-ldiskfs-DNE-centos7_x86_64-centos7_x86_64/

Comment by Gerrit Updater [ 25/Nov/22 ]

"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49249
Subject: LU-14719 lod: skip space check error
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f2becc7a5b653e8da37a178c1757d5757d5c1e36

Comment by Lai Siyao [ 26/Nov/22 ]

The replay-single test_111g failure occurs because lod_trans_space_check() -> dt_statfs() -> osp_statfs() returns -ENOTCONN during replay. I'll push a patch so that lod_trans_space_check() fails only when space is low and skips other failures, which will be detected during operation processing.
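
The behaviour described in this comment, failing only on genuinely low space and skipping other statfs errors, can be sketched as below. This is an illustrative Python model and not the actual lod_trans_space_check() implementation: the callables standing in for statfs, the `min_free_blocks` threshold, and the return convention are assumptions for the example.

```python
import errno


def trans_space_check(targets, min_free_blocks=1):
    """Return -ENOSPC if any reachable target is low on space, else 0.

    `targets` is a list of zero-argument callables that either return the
    number of free blocks or raise OSError (a stand-in for a statfs call).
    """
    for statfs in targets:
        try:
            free_blocks = statfs()
        except OSError:
            # For example ENOTCONN during replay: skip it here. If the
            # problem persists, the operation itself will report it.
            continue
        if free_blocks < min_free_blocks:
            return -errno.ENOSPC
    return 0


def disconnected():
    """Model a target whose statfs fails because it is not connected."""
    raise OSError(errno.ENOTCONN, "target not connected")
```

In this model a disconnected target no longer aborts the transaction up front, while a reachable target that reports too little free space still does.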

Comment by Gerrit Updater [ 17/Feb/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49249/
Subject: LU-14719 lod: ignore space check error in recovery
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e44489f2f29a2e50883f9bbdec491b65ca92a692

Comment by Peter Jones [ 17/Feb/23 ]

Landed for 2.16
