[LU-14536] kiblnd does resend for IB_CM_REJ_INVALID_SERVICE_ID Created: 19/Mar/21 Updated: 05/Apr/22 Resolved: 15/Apr/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.12.9, Lustre 2.15.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Dongyang Li | Assignee: | Dongyang Li |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 | ||||
| Rank (Obsolete): | 9223372036854775807 | ||||
| Description |
|
When connecting to a host which is not up, each discovery attempts retry_count (see kiblnd_check_reconnect) * lnet_retry_count (for resend) connections. For each OST processed while mounting the MDT we run attach, add_conn (if the OST has a failover node) and add_osc, so discovery runs 3 times per OST. Mounting the MDT when the other nodes are not up can therefore take very long, making the customer think the mount is stuck. |
| Comments |
| Comment by Gerrit Updater [ 19/Mar/21 ] |
|
Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/42109 |
| Comment by Dongyang Li [ 19/Mar/21 ] |
|
Even without resend we still retry 5 times for each discovery, and we run discovery once for each OST in the config llog. I'm wondering whether we should retry at all if there is no listener. |
| Comment by Gerrit Updater [ 19/Mar/21 ] |
|
Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/42111 |
| Comment by Dongyang Li [ 26/Mar/21 ] |
|
I managed to get access to the site experiencing the issue and collected some numbers. With all the servers up, mounting the targets on mds1:

[root@lmds-vm1 o2iblnd]# cat mount.sh
modprobe lnet
modprobe lustre
modprobe libcfs
modprobe ksocklnd
modprobe obdclass
modprobe ptlrpc
modprobe ldiskfs
modprobe osd_ldiskfs
modprobe ko2iblnd
vgchange -ay vg_mdt0000_lustrefs --config 'activation{volume_list=["vg_mdt0000_lustrefs"]}'
vgchange -ay vg_mgs --config 'activation{volume_list=["vg_mgs"]}'
mount -t lustre -o max_sectors_kb=0 /dev/mapper/vg_mgs-mgs /lustre/mgs
mount -t lustre -o max_sectors_kb=0 /dev/mapper/vg_mdt0000_lustrefs-mdt0000 /lustre/lustrefs/mdt0000
mount -t lustre -o max_sectors_kb=0 /dev/ddn/lustrefs_ost0000 /lustre/lustrefs/ost0000
mount -t lustre -o max_sectors_kb=0 /dev/ddn/lustrefs_ost0001 /lustre/lustrefs/ost0001
mount -t lustre -o max_sectors_kb=0 /dev/ddn/lustrefs_ost0400 /lustre/lustrefs/ost0400
mount -t lustre -o max_sectors_kb=0 /dev/ddn/lustrefs_ost0401 /lustre/lustrefs/ost0401

Mounting mdt0000 took about 30 minutes:

[12863.422884] LNet: HW NUMA nodes: 1, HW CPU cores: 20, npartitions: 10
[12863.424484] alg: No test for adler32 (adler32-zlib)
[12864.250078] Lustre: Lustre: Build Version: 2.12.6
[12864.356817] LNet: Using FastReg for registration
[12864.396832] LNet: Added LNI 10.149.10.21@o2ib [8/640/0/180]
[12864.434556] LNet: Added LNI 10.149.11.21@o2ib [8/640/0/180]
[12864.895181] LDISKFS-fs (dm-7): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[12866.435582] Lustre: MGS: Connection restored to 8529a39c-6bcb-c902-92ad-9af110ac39df (at 0@lo)
[12866.819906] LDISKFS-fs (dm-6): mounted filesystem with ordered data mode. Opts: acl,user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[14652.389489] Lustre: lustrefs-MDT0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
[14652.395640] Lustre: lustrefs-MDT0000: in recovery but waiting for the first client to connect
[14675.527074] Lustre: lustrefs-MDT0000: Connection restored to 10.149.10.21@o2ib (at 0@lo)

I don't know how much time the OSTs on the host would have taken; the mount script was terminated while it was still working on mdt0000.

With patch 42109:

[root@lmds-vm1 o2iblnd]# time mount.sh
real 0m15.763s
user 0m0.483s
sys 0m6.796s

With patch 42109 + 42111:

[root@lmds-vm1 o2iblnd]# time mount.sh
real 0m8.166s
user 0m0.453s
sys 0m6.703s |
| Comment by Gerrit Updater [ 15/Apr/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/42109/ |
| Comment by Gerrit Updater [ 15/Apr/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/42111/ |
| Comment by Peter Jones [ 15/Apr/21 ] |
|
Landed for 2.15 |
| Comment by Gerrit Updater [ 10/Nov/21 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45510 |
| Comment by Gerrit Updater [ 10/Nov/21 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45511 |
| Comment by Gerrit Updater [ 20/Dec/21 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45895 |
| Comment by Gerrit Updater [ 20/Dec/21 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45896 |
| Comment by Gerrit Updater [ 30/Jan/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45895/ |
| Comment by Gerrit Updater [ 30/Jan/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45896/ |
| Comment by Malcolm Haak (Inactive) [ 05/Apr/22 ] |
|
The b2_14 backport causes servers to panic with a null dereference at MDT mount. Can we get this looked into, please? |
| Comment by Kim Sebo [ 05/Apr/22 ] |
|
|