[LU-14399] mount MDT takes very long with hsm enable Created: 03/Feb/21 Updated: 14/Feb/22 Resolved: 18/Jan/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Minh Diep | Assignee: | Sergey Cheremencev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: | https://testing.whamcloud.com/test_sets/20fcc789-47a3-44a2-a17e-d6f39fbb3ec0 |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We observed that when mounting an MDT with HSM enabled, the mount command takes minutes, compared to seconds before. We saw this in the log:

[53618.238941] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds2; mount -t lustre -o localrecov /dev/mapper/mds2_flakey /mnt/lustre-mds2
[53618.624098] LDISKFS-fs (dm-6): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[53720.390690] Lustre: 1722736:0:(mdt_coordinator.c:1114:mdt_hsm_cdt_start()) lustre-MDT0001: trying to init HSM before MDD
[53720.392834] LustreError: 1722736:0:(mdt_coordinator.c:1125:mdt_hsm_cdt_start()) lustre-MDT0001: cannot take the layout locks needed for registered restore: -2
[53720.398049] LustreError: 1722741:0:(mdt_coordinator.c:1090:mdt_hsm_cdt_start()) lustre-MDT0001: Coordinator already started or stopping
[53720.400681] Lustre: lustre-MDT0001: Imperative Recovery not enabled, recovery window 60-180
[53720.424872] Lustre: lustre-MDT0001: in recovery but waiting for the first client to connect
[53720.953893] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check
[53722.067555] Lustre: DEBUG MARKER:

Seems related to |
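For reference, the roughly 102-second gap between the ldiskfs mount at 53618 and the coordinator messages at 53720 matches the default obd_timeout of 100 seconds: mdt_hsm_cdt_start() polls for MDD configuration on the mount path. A minimal sketch of that pattern, simplified from the mdt_coordinator.c code quoted in full in the comments below:

	/*
	 * Simplified from mdt_hsm_cdt_start() (see the snippet quoted in
	 * the comments below): the mount path sleeps one second per
	 * iteration, up to obd_timeout seconds, waiting for MDD to set
	 * MDT_FL_CFGLOG before HSM can be initialized.
	 */
	unsigned int i = 0;

	while (!test_bit(MDT_FL_CFGLOG, &mdt->mdt_state) && i < obd_timeout) {
		schedule_timeout_interruptible(cfs_time_seconds(1));
		i++;
	}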
| Comments |
| Comment by Gerrit Updater [ 03/Feb/21 ] |
|
John L. Hammond (jhammond@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41409 |
| Comment by Sergey Cheremencev [ 04/Feb/21 ] |
	/* wait until MDD initialize hsm actions llog */
	while (!test_bit(MDT_FL_CFGLOG, &mdt->mdt_state) && i < obd_timeout) {
		schedule_timeout_interruptible(cfs_time_seconds(1));
		i++;
	}
	if (!test_bit(MDT_FL_CFGLOG, &mdt->mdt_state))
		CWARN("%s: trying to init HSM before MDD\n", mdt_obd_name(mdt));
|
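Since the loop above wakes only once per second and caps at obd_timeout iterations, the mount path can stall for the full timeout. For comparison, a sketch of an event-driven alternative, assuming a hypothetical wait_queue_head_t (mdt_cfglog_waitq) added to struct mdt_device and woken wherever MDT_FL_CFGLOG is set; this is an illustration only, not the landed fix:

	/*
	 * Illustrative alternative, not the landed fix: mdt_cfglog_waitq
	 * is a hypothetical wait queue that the MDD configuration path
	 * would wake after setting MDT_FL_CFGLOG, so the waiter returns
	 * as soon as MDD is ready instead of polling once per second.
	 */
	long rc;

	rc = wait_event_interruptible_timeout(mdt->mdt_cfglog_waitq,
			test_bit(MDT_FL_CFGLOG, &mdt->mdt_state),
			cfs_time_seconds(obd_timeout));
	if (rc == 0) /* timed out: MDD still not configured */
		CWARN("%s: trying to init HSM before MDD\n", mdt_obd_name(mdt));

The fix that eventually landed takes a different approach; see the cdt_start_pending_restore() discussion later in this ticket.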
| Comment by Gerrit Updater [ 04/Feb/21 ] |
|
John L. Hammond (jhammond@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41415 |
| Comment by Minh Diep [ 05/Feb/21 ] |
|
Here is the sequence that I followed to hit this bug:
|
| Comment by Sergey Cheremencev [ 08/Feb/21 ] |
Thank you. Now the problem is clear. I'll provide a fix in a few days. |
| Comment by Gerrit Updater [ 08/Feb/21 ] |
|
Sergey Cheremencev (sergey.cheremencev@hpe.com) uploaded a new patch: https://review.whamcloud.com/41445 |
| Comment by Gerrit Updater [ 11/Mar/21 ] |
|
Sergey Cheremencev (sergey.cheremencev@hpe.com) uploaded a new patch: https://review.whamcloud.com/42005 |
| Comment by Gerrit Updater [ 18/Jan/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/41445/ |
| Comment by Peter Jones [ 18/Jan/22 ] |
|
Landed for 2.15 |
| Comment by Alex Zhuravlev [ 20/Jan/22 ] |
|
the patch that just landed fails every run on my setup:
|
| Comment by Sergey Cheremencev [ 20/Jan/22 ] |
|
Hello Alex, what setup did you use? It doesn't fail on my local VM on the latest master (0feec5a3). |
| Comment by Alex Zhuravlev [ 20/Jan/22 ] |
|
well, essentially just a local VM:

FSTYPE=ldiskfs MDSCOUNT=2 MDSSIZE=300000 OSTSIZE=400000 OSTCOUNT=2 LOGDIR=/tmp/ltest-logs REFORMAT=yes HONOR_EXCEPT=y bash conf-sanity.sh |
| Comment by Sergey Cheremencev [ 21/Jan/22 ] |
|
Alex, it still doesn't fail on my local VM. |
| Comment by Alex Zhuravlev [ 22/Jan/22 ] |
[ 8878.042110] Lustre: Found index 0 for lustre-MDT0000, updating log
[ 8878.044766] Lustre: Modifying parameter lustre-MDT0000.mdt.identity_upcall in log lustre-MDT0000
[ 8898.880071] Lustre: 497204:0:(mdt_coordinator.c:1145:mdt_hsm_cdt_start()) lustre-MDT0000: trying to init HSM before MDD
[ 8898.888824] LustreError: 497204:0:(mdt_coordinator.c:1156:mdt_hsm_cdt_start()) lustre-MDT0000: cannot take the layout locks needed for registered restore: -2
[ 8899.150565] Lustre: DEBUG MARKER: conf-sanity test_132: @@@@@@ FAIL: Can not take the layout lock

I guess "trying to init HSM before MDD" is a hint? |
| Comment by Sergey Cheremencev [ 25/Jan/22 ] |
IMO, it hints that your build doesn't include this code:

static void cdt_start_pending_restore(struct mdt_device *mdt,
				      struct coordinator *cdt)
{
	struct mdt_thread_info *cdt_mti;
	unsigned int i = 0;
	int rc;

	/* wait until MDD initialize hsm actions llog */
	while (!test_bit(MDT_FL_CFGLOG, &mdt->mdt_state) && i < obd_timeout) {
		schedule_timeout_interruptible(cfs_time_seconds(1));
		i++;
	}
	if (!test_bit(MDT_FL_CFGLOG, &mdt->mdt_state))
		CWARN("%s: trying to init HSM before MDD\n", mdt_obd_name(mdt));

The "trying to init HSM before MDD" message should be printed by cdt_start_pending_restore(), while in your case it comes from mdt_hsm_cdt_start(). |
| Comment by Alex Zhuravlev [ 26/Jan/22 ] |
|
well, I'm running the master branch, so it clearly includes that code. Doubling the timeout:

- while (!test_bit(MDT_FL_CFGLOG, &mdt->mdt_state) && i < obd_timeout) {
+ while (!test_bit(MDT_FL_CFGLOG, &mdt->mdt_state) && i < obd_timeout * 2) {

not sure how good it is...