[LU-14536] kiblnd does resend for IB_CM_REJ_INVALID_SERVICE_ID Created: 19/Mar/21  Updated: 05/Apr/22  Resolved: 15/Apr/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.12.9, Lustre 2.15.0

Type: Bug Priority: Minor
Reporter: Dongyang Li Assignee: Dongyang Li
Resolution: Fixed Votes: 0
Labels: None

Attachments: PNG File lustre.png    
Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

When connecting to a host which is not up, each discovery is attempted retry_count (see kiblnd_check_reconnect) * lnet_retry_count (for resend) times.

For each OST, mounting the MDT processes attach, add_conn (if the OST has a failover node) and add_osc, so discovery runs 3 times per OST.

As a result, mounting the MDT while other nodes are not up can take a very long time, making customers think the mount is stuck.
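To make the multiplication above concrete, here is a rough back-of-the-envelope sketch. The function and parameter names are hypothetical, and the resend factor of 3 is an assumed value for illustration, not a setting taken from the affected site; the 415 OSTs and 5 connection retries are the figures quoted in the comments on this ticket.

```python
def estimate_connect_attempts(num_osts, discoveries_per_ost,
                              reconnect_retries, sends_per_discovery):
    """Rough upper bound on connection attempts made while mounting the MDT.

    All names here are hypothetical. sends_per_discovery stands for the
    LNet-level resend factor (lnet_retry_count in the description); pass 1
    to model the case where a failed message is not resent.
    """
    per_discovery = reconnect_retries * sends_per_discovery
    return num_osts * discoveries_per_ost * per_discovery


# Illustrative values only: the resend factor of 3 is an assumption.
with_resend = estimate_connect_attempts(415, 3, 5, 3)
without_resend = estimate_connect_attempts(415, 3, 5, 1)
print(with_resend, without_resend)  # 18675 6225
```

Each attempt against a host that is down also has to wait for a reject or a timeout, so the total mount time scales with these counts.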



 Comments   
Comment by Gerrit Updater [ 19/Mar/21 ]

Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/42109
Subject: LU-14536 o2iblnd: don't resend if there's no listener
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 29f79fe50652ac048949df4b8e4a2eafa235ebbf

Comment by Dongyang Li [ 19/Mar/21 ]

Even without resend we still retry 5 times for each discovery, and we try discovery once for each OST from the conf llog.

I'm wondering whether we should retry at all if there's no listener.
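The idea can be sketched as a small decision function. This is purely illustrative and is not the o2iblnd code: the function name, the string-based reject reasons and the default limit of 5 (taken from the retry count mentioned above) are assumptions made for the example.

```python
# Reject reason named in this ticket's title; treated here as "no listener".
NO_LISTENER = "IB_CM_REJ_INVALID_SERVICE_ID"


def should_reconnect(reject_reason, attempts_so_far, max_attempts=5):
    """Hypothetical policy: give up immediately when nothing is listening.

    For any other reject reason, keep retrying until max_attempts is reached.
    """
    if reject_reason == NO_LISTENER:
        return False
    return attempts_so_far < max_attempts


print(should_reconnect(NO_LISTENER, 0))             # False
print(should_reconnect("IB_CM_REJ_STALE_CONN", 1))  # True
```

With a policy like this, a peer with no listener costs one attempt per discovery instead of the full retry budget.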

Comment by Gerrit Updater [ 19/Mar/21 ]

Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/42111
Subject: LU-14536 obi2lnd: don't try to reconnect if there's no listener
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e164c7533c393be60ab35472743868eec6452129

Comment by Dongyang Li [ 26/Mar/21 ]

I managed to get access to the site experiencing the issue and got some numbers.
Note that the site has 415 OSTs; tested with the latest 2.12.

when all the servers are up, mounting the targets on mds1:

[root@lmds-vm1 o2iblnd]# cat mount.sh
modprobe lnet
modprobe lustre
modprobe libcfs
modprobe ksocklnd
modprobe obdclass
modprobe ptlrpc
modprobe ldiskfs
modprobe osd_ldiskfs
modprobe ko2iblnd
vgchange -ay vg_mdt0000_lustrefs --config 'activation{volume_list=["vg_mdt0000_lustrefs"]}'
vgchange -ay vg_mgs --config 'activation{volume_list=["vg_mgs"]}'
mount -t lustre -o max_sectors_kb=0 /dev/mapper/vg_mgs-mgs /lustre/mgs
mount -t lustre -o max_sectors_kb=0 /dev/mapper/vg_mdt0000_lustrefs-mdt0000 /lustre/lustrefs/mdt0000
mount -t lustre -o max_sectors_kb=0 /dev/ddn/lustrefs_ost0000 /lustre/lustrefs/ost0000
mount -t lustre -o max_sectors_kb=0 /dev/ddn/lustrefs_ost0001 /lustre/lustrefs/ost0001
mount -t lustre -o max_sectors_kb=0 /dev/ddn/lustrefs_ost0400 /lustre/lustrefs/ost0400
mount -t lustre -o max_sectors_kb=0 /dev/ddn/lustrefs_ost0401 /lustre/lustrefs/ost0401

Mounting mdt0000 took about 30 minutes:

[12863.422884] LNet: HW NUMA nodes: 1, HW CPU cores: 20, npartitions: 10
[12863.424484] alg: No test for adler32 (adler32-zlib)
[12864.250078] Lustre: Lustre: Build Version: 2.12.6
[12864.356817] LNet: Using FastReg for registration
[12864.396832] LNet: Added LNI 10.149.10.21@o2ib [8/640/0/180]
[12864.434556] LNet: Added LNI 10.149.11.21@o2ib [8/640/0/180]
[12864.895181] LDISKFS-fs (dm-7): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[12866.435582] Lustre: MGS: Connection restored to 8529a39c-6bcb-c902-92ad-9af110ac39df (at 0@lo)
[12866.819906] LDISKFS-fs (dm-6): mounted filesystem with ordered data mode. Opts: acl,user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[14652.389489] Lustre: lustrefs-MDT0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
[14652.395640] Lustre: lustrefs-MDT0000: in recovery but waiting for the first client to connect
[14675.527074] Lustre: lustrefs-MDT0000: Connection restored to 10.149.10.21@o2ib (at 0@lo)
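The 30 minute figure can be read off the kernel timestamps in the log above. A minimal sketch, assuming the usual "[seconds.microseconds]" dmesg prefix; the helper name largest_gap is hypothetical, and the sample lines are copied from the log.

```python
import re

# Matches the "[12866.819906]" style prefix at the start of a dmesg line.
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")


def largest_gap(log_lines):
    """Return (gap_seconds, start, end) for the biggest jump between
    consecutive timestamped lines. Lines without a timestamp are skipped."""
    stamps = []
    for line in log_lines:
        match = TIMESTAMP.match(line)
        if match:
            stamps.append(float(match.group(1)))
    if len(stamps) < 2:
        raise ValueError("need at least two timestamped lines")
    return max((b - a, a, b) for a, b in zip(stamps, stamps[1:]))


sample = [
    "[12866.435582] Lustre: MGS: Connection restored to 8529a39c-6bcb-c902-92ad-9af110ac39df (at 0@lo)",
    "[12866.819906] LDISKFS-fs (dm-6): mounted filesystem with ordered data mode.",
    "[14652.389489] Lustre: lustrefs-MDT0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900",
]
gap, start, end = largest_gap(sample)
print(f"{gap / 60:.1f} minutes")  # 29.8 minutes
```

The gap falls between the ldiskfs mount of the MDT device and the first MDT0000 message, which is where the repeated discovery attempts happen.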

I don't know how long the OSTs on the host would have taken, because the mount script was terminated while it was still working on mdt0000.

with patch 42109:

[root@lmds-vm1 o2iblnd]# time mount.sh
real	0m15.763s
user	0m0.483s
sys	0m6.796s

with patch 42109 + 42111:

[root@lmds-vm1 o2iblnd]# time mount.sh
real	0m8.166s
user	0m0.453s
sys	0m6.703s
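
For comparing the two patched runs, the "real" lines can be converted to seconds. A small sketch, with a hypothetical helper name, assuming the bash `time` format shown above:

```python
import re

# bash `time` prints e.g. "real<TAB>0m15.763s"
REAL_TIME = re.compile(r"^real\s+(\d+)m([\d.]+)s$")


def real_seconds(time_output):
    """Extract the wall-clock ("real") time in seconds from `time` output."""
    for line in time_output.splitlines():
        match = REAL_TIME.match(line.strip())
        if match:
            return int(match.group(1)) * 60 + float(match.group(2))
    raise ValueError("no 'real' line found")


patch_42109 = real_seconds("real\t0m15.763s\nuser\t0m0.483s\nsys\t0m6.796s")
both_patches = real_seconds("real\t0m8.166s\nuser\t0m0.453s\nsys\t0m6.703s")
print(f"{patch_42109 / both_patches:.2f}x")  # 1.93x
```

Both patched runs cover the whole mount script, while the unpatched run was terminated while still mounting mdt0000, so the roughly 30 minutes measured there is only a lower bound for the unpatched case.
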
Comment by Gerrit Updater [ 15/Apr/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/42109/
Subject: LU-14536 o2iblnd: don't resend if there's no listener
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0ab06eb9d865a47ea3e09880a41a9e8f0a78b6a6

Comment by Gerrit Updater [ 15/Apr/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/42111/
Subject: LU-14536 obi2lnd: don't try to reconnect if there's no listener
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 67ba3ce23d32266eabd5f8c56fa78d65920455e8

Comment by Peter Jones [ 15/Apr/21 ]

Landed for 2.15

Comment by Gerrit Updater [ 10/Nov/21 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45510
Subject: LU-14536 o2iblnd: don't resend if there's no listener
Project: fs/lustre-release
Branch: b2_14
Current Patch Set: 1
Commit: 92acd551b2c97e2800181541e501815219bfc753

Comment by Gerrit Updater [ 10/Nov/21 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45511
Subject: LU-14536 obi2lnd: don't try to reconnect if there's no listener
Project: fs/lustre-release
Branch: b2_14
Current Patch Set: 1
Commit: 33fe975fa96d9dbacda1e14c2147dfb276da12dd

Comment by Gerrit Updater [ 20/Dec/21 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45895
Subject: LU-14536 o2iblnd: don't resend if there's no listener
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 9fe25415e52a73bde7e53871403266a5b5db859a

Comment by Gerrit Updater [ 20/Dec/21 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45896
Subject: LU-14536 obi2lnd: don't try to reconnect if there's no listener
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 6ec51a85d49231e50694f0503406e548efbd6f17

Comment by Gerrit Updater [ 30/Jan/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45895/
Subject: LU-14536 o2iblnd: don't resend if there's no listener
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: da6e6953305be165798772d820ec59a0a209b604

Comment by Gerrit Updater [ 30/Jan/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45896/
Subject: LU-14536 obi2lnd: don't try to reconnect if there's no listener
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 3bd965a786afb04317acd1d8eb1708e594a1fc91

Comment by Malcolm Haak (Inactive) [ 05/Apr/22 ]

The b2_14 backport causes servers to panic with a NULL dereference error at MDT mount. Can we get this looked into, please?

Comment by Kim Sebo [ 05/Apr/22 ]
