Details
- Improvement
- Resolution: Fixed
- Minor
Description
Provide kernel API for adding peer and peer NI
Implement LNetAddPeer() and LNetAddPeerNI() APIs to allow other
kernel modules to add peer and peer NIs to LNet.
Peers created via these APIs are not marked as having been configured
by DLC. As such, they can be overwritten by discovery.
Attachments
Activity
> but the logs that are provided so far do not show the problem (AFAICT).
I'm fine with opening another ticket either way. But I'm not sure why you say the logs don't show the problem: the logs I pasted above show that in one case the config log processing took 30 seconds, while in the good case (with the patch reverted) it took less than a second. Isn't that a problem?
bzzz I think this ticket should be closed and a new ticket opened for this issue. I am not saying there is no problem, but the logs that are provided so far do not show the problem (AFAICT).
Well, there's still the fact that reverting that patch fixed the problem. I guess Guarang can confirm.
Yes, the NIDs being discovered are different, but my point was that in both cases you have three blocking discovery calls for each "MGS", and those calls contribute the same amount of time to the failed mount with or without the patch.
+1 on reviving LU-14668.
Hi Chris,
I think there's a difference in the examples you posted:
The "With LU-14661" trace shows the client trying to discover 1.1.1.1@tcp, then 1.1.1.2@tcp, then 2.2.2.1@tcp and finally 2.2.2.2@tcp.
The "Without" trace shows the client trying to discover 1.1.1.1@tcp, then 2.2.2.1@tcp, and that's it.
I thought that this difference in behaviour was responsible for the difference in the time it took for the mount to fail, because in the "with" case more NIDs were being discovered sequentially. But your example demonstrates that it is not so: it looks like the same number of discovery attempts is made per peer regardless of the destination NID.
Either way, like I said before, I don't see anything wrong with LU-14661. I'm planning to revisit LU-14668, though, to see if we can get it to work, as it would speed things up versus discovering NIDs one at a time, especially when some NIDs are unreachable.
Thanks,
Serguei.
A couple of points:
We should probably open a new ticket for this issue/discussion.
I was experimenting with the mount command which lists NIDs as X1,X2:Y1,Y2 [...] With the LU-14661 patch, LNet tries to connect to all listed NIDs.
To be clear, a client mount attempt performs exactly the same with or without LU-14661 when passed "X1,X2:Y1,Y2" MGS nid format.
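For readers unfamiliar with the notation: in the mount target, commas join the NIDs of a single MGS node and colons separate failover nodes. A quick illustrative sketch (not Lustre code) of how "X1,X2:Y1,Y2" decomposes:

```python
# Illustrative only: decompose the NID portion of an MGS mount spec.
# Colons separate failover MGS nodes; commas separate the NIDs of one node.
# (The trailing ":/fsname" part of the full mount target is not handled here.)
def parse_mgs_spec(spec: str) -> list[list[str]]:
    """Return one list of NIDs per failover node."""
    return [node.split(",") for node in spec.split(":")]

nodes = parse_mgs_spec("1.1.1.1@tcp,1.1.1.2@tcp:2.2.2.1@tcp,2.2.2.2@tcp")
# nodes[0] holds the NIDs of the first MGS node, nodes[1] of its failover peer.
```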
With LU-14661 (master commit 6f490275b0):
[root@ct7-mds1 lustre-filesystem]# mount -t lustre 1.1.1.1@tcp,1.1.1.2@tcp:2.2.2.1@tcp,2.2.2.2@tcp:/lustre /mnt/lustre
mount.lustre: mount 1.1.1.1@tcp,1.1.1.2@tcp:2.2.2.1@tcp,2.2.2.2@tcp:/lustre at /mnt/lustre failed: Input/output error
Is the MGS running?
[root@ct7-mds1 lustre-filesystem]# lctl dk > /tmp/dk.log
[root@ct7-mds1 lustre-filesystem]# grep -e lnet_health_check -e TRACE -e lnet_discover_peer_locked /tmp/dk.log | egrep -e 1.1.1. -e 2.2.2. -e lnet_discover_peer_locked
00000400:00000200:0.0:1673986593.772495:0:5947:0:(peer.c:2528:lnet_discover_peer_locked()) Discovery attempt # 1
00000400:00000200:0.0:1673986593.772524:0:5509:0:(lib-move.c:2009:lnet_handle_send()) TRACE: 10.73.20.11@tcp(10.73.20.11@tcp:<?>) -> 1.1.1.1@tcp(1.1.1.1@tcp:1.1.1.1@tcp) <?> : GET try# 0
00000400:00000200:0.0:1673986652.199830:0:5506:0:(lib-msg.c:811:lnet_health_check()) health check: 10.73.20.11@tcp->1.1.1.1@tcp: GET: LOCAL_TIMEOUT
00000400:00000200:0.0:1673986652.199877:0:5510:0:(lib-move.c:2009:lnet_handle_send()) TRACE: 10.73.20.12@tcp(10.73.20.12@tcp:<?>) -> 1.1.1.2@tcp(1.1.1.1@tcp:1.1.1.2@tcp) <?> : GET try# 1
00000400:00000200:0.0:1673986703.226629:0:5506:0:(lib-msg.c:811:lnet_health_check()) health check: 10.73.20.12@tcp->1.1.1.2@tcp: GET: LOCAL_TIMEOUT
00000400:00000200:0.0:1673986703.226700:0:5510:0:(lib-move.c:2009:lnet_handle_send()) TRACE: 10.73.20.11@tcp(10.73.20.11@tcp:<?>) -> 1.1.1.1@tcp(1.1.1.2@tcp:1.1.1.1@tcp) <?> : GET try# 2
00000400:00000200:0.0:1673986721.024165:0:5502:0:(lib-msg.c:811:lnet_health_check()) health check: 10.73.20.11@tcp->1.1.1.1@tcp: GET: LOCAL_TIMEOUT
00000400:00000200:0.0:1673986721.024196:0:5947:0:(peer.c:2578:lnet_discover_peer_locked()) peer 1.1.1.1@tcp NID 1.1.1.1@tcp: -110. discovery complete
00000400:00000200:0.0:1673986721.027992:0:5947:0:(peer.c:2528:lnet_discover_peer_locked()) Discovery attempt # 1
00000400:00000200:0.0:1673986721.028022:0:5509:0:(lib-move.c:2009:lnet_handle_send()) TRACE: 10.73.20.12@tcp(10.73.20.12@tcp:<?>) -> 2.2.2.1@tcp(2.2.2.1@tcp:2.2.2.1@tcp) <?> : GET try# 0
00000400:00000200:0.0:1673986778.248439:0:5506:0:(lib-msg.c:811:lnet_health_check()) health check: 10.73.20.12@tcp->2.2.2.1@tcp: GET: LOCAL_TIMEOUT
00000400:00000200:0.0:1673986778.248497:0:5510:0:(lib-move.c:2009:lnet_handle_send()) TRACE: 10.73.20.11@tcp(10.73.20.11@tcp:<?>) -> 2.2.2.2@tcp(2.2.2.1@tcp:2.2.2.2@tcp) <?> : GET try# 1
00000400:00000200:0.0:1673986829.257114:0:5506:0:(lib-msg.c:811:lnet_health_check()) health check: 10.73.20.11@tcp->2.2.2.2@tcp: GET: LOCAL_TIMEOUT
00000400:00000200:0.0:1673986829.257158:0:5510:0:(lib-move.c:2009:lnet_handle_send()) TRACE: 10.73.20.12@tcp(10.73.20.12@tcp:<?>) -> 2.2.2.1@tcp(2.2.2.2@tcp:2.2.2.1@tcp) <?> : GET try# 2
00000400:00000200:0.0:1673986848.256819:0:5505:0:(lib-msg.c:811:lnet_health_check()) health check: 10.73.20.12@tcp->2.2.2.1@tcp: GET: LOCAL_TIMEOUT
00000400:00000200:0.0:1673986848.256871:0:5947:0:(peer.c:2578:lnet_discover_peer_locked()) peer 2.2.2.1@tcp NID 2.2.2.1@tcp: -110. discovery complete
...
[root@ct7-mds1 lustre-filesystem]# echo 1673986848.256871 - 1673986593.772495 | bc
254.484376
[root@ct7-mds1 lustre-filesystem]#
Without LU-14661 (master commit 6f490275b0 + revert of 16321de596f6395153be6cbb6192250516963077):
[root@ct7-mds1 lustre-filesystem]# mount -t lustre 1.1.1.1@tcp,1.1.1.2@tcp:2.2.2.1@tcp,2.2.2.2@tcp:/lustre /mnt/lustre
mount.lustre: mount 1.1.1.1@tcp,1.1.1.2@tcp:2.2.2.1@tcp,2.2.2.2@tcp:/lustre at /mnt/lustre failed: Input/output error
Is the MGS running?
[root@ct7-mds1 lustre-filesystem]# lctl dk > /tmp/dk.log2
[root@ct7-mds1 lustre-filesystem]# grep -e lnet_health_check -e TRACE -e lnet_discover_peer_locked /tmp/dk.log2 | egrep -e 1.1.1. -e 2.2.2. -e lnet_discover_peer_locked
00000400:00000200:0.0:1673988188.818846:0:7434:0:(peer.c:2528:lnet_discover_peer_locked()) Discovery attempt # 1
00000400:00000200:0.0:1673988188.818874:0:7032:0:(lib-move.c:2009:lnet_handle_send()) TRACE: 10.73.20.11@tcp(10.73.20.11@tcp:<?>) -> 1.1.1.1@tcp(1.1.1.1@tcp:1.1.1.1@tcp) <?> : GET try# 0
00000400:00000200:0.0:1673988249.480663:0:7029:0:(lib-msg.c:811:lnet_health_check()) health check: 10.73.20.11@tcp->1.1.1.1@tcp: GET: LOCAL_TIMEOUT
00000400:00000200:0.0:1673988249.480727:0:7033:0:(lib-move.c:2009:lnet_handle_send()) TRACE: 10.73.20.12@tcp(10.73.20.12@tcp:<?>) -> 1.1.1.1@tcp(1.1.1.1@tcp:1.1.1.1@tcp) <?> : GET try# 1
00000400:00000200:0.0:1673988300.480228:0:7029:0:(lib-msg.c:811:lnet_health_check()) health check: 10.73.20.12@tcp->1.1.1.1@tcp: GET: LOCAL_TIMEOUT
00000400:00000200:0.0:1673988300.480289:0:7033:0:(lib-move.c:2009:lnet_handle_send()) TRACE: 10.73.20.11@tcp(10.73.20.11@tcp:<?>) -> 1.1.1.1@tcp(1.1.1.1@tcp:1.1.1.1@tcp) <?> : GET try# 2
00000400:00000200:0.0:1673988316.033279:0:7025:0:(lib-msg.c:811:lnet_health_check()) health check: 10.73.20.11@tcp->1.1.1.1@tcp: GET: LOCAL_TIMEOUT
00000400:00000200:0.0:1673988316.033351:0:7434:0:(peer.c:2578:lnet_discover_peer_locked()) peer 1.1.1.1@tcp NID 1.1.1.1@tcp: -110. discovery complete
00000400:00000200:0.0:1673988316.034632:0:7434:0:(peer.c:2528:lnet_discover_peer_locked()) Discovery attempt # 1
00000400:00000200:0.0:1673988316.034665:0:7032:0:(lib-move.c:2009:lnet_handle_send()) TRACE: 10.73.20.12@tcp(10.73.20.12@tcp:<?>) -> 2.2.2.1@tcp(2.2.2.1@tcp:2.2.2.1@tcp) <?> : GET try# 0
00000400:00000200:0.0:1673988375.486277:0:7029:0:(lib-msg.c:811:lnet_health_check()) health check: 10.73.20.12@tcp->2.2.2.1@tcp: GET: LOCAL_TIMEOUT
00000400:00000200:0.0:1673988375.486332:0:7033:0:(lib-move.c:2009:lnet_handle_send()) TRACE: 10.73.20.11@tcp(10.73.20.11@tcp:<?>) -> 2.2.2.1@tcp(2.2.2.1@tcp:2.2.2.1@tcp) <?> : GET try# 1
00000400:00000200:0.0:1673988426.497680:0:7029:0:(lib-msg.c:811:lnet_health_check()) health check: 10.73.20.11@tcp->2.2.2.1@tcp: GET: LOCAL_TIMEOUT
00000400:00000200:0.0:1673988426.497744:0:7033:0:(lib-move.c:2009:lnet_handle_send()) TRACE: 10.73.20.12@tcp(10.73.20.12@tcp:<?>) -> 2.2.2.1@tcp(2.2.2.1@tcp:2.2.2.1@tcp) <?> : GET try# 2
00000400:00000200:0.0:1673988443.264923:0:7028:0:(lib-msg.c:811:lnet_health_check()) health check: 10.73.20.12@tcp->2.2.2.1@tcp: GET: LOCAL_TIMEOUT
00000400:00000200:0.0:1673988443.264993:0:7434:0:(peer.c:2578:lnet_discover_peer_locked()) peer 2.2.2.1@tcp NID 2.2.2.1@tcp: -110. discovery complete
...
[root@ct7-mds1 lustre-filesystem]# echo 1673988443.264993 - 1673988188.818846 | bc
254.446147
[root@ct7-mds1 lustre-filesystem]#
We can see the amount of time spent in blocking discovery calls is identical with or without the patch.
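The 254-second figures above come from subtracting the first discovery timestamp from the last. A small illustrative helper (not part of Lustre; it assumes the dk.log prefix layout shown above, where the epoch timestamp is the fourth colon-separated field) automates the same calculation:

```python
import re

# Illustrative only: extract the epoch timestamps from dk.log lines
# ("subsystem:mask:cpu:timestamp:..." prefix) and report the span covered
# by the blocking discovery calls.
TS_RE = re.compile(r"^\w+:\w+:[\d.]+\w*:(\d+\.\d+):")

def discovery_span(lines):
    """Seconds between the earliest and latest timestamped log line."""
    stamps = [float(m.group(1)) for line in lines
              if (m := TS_RE.match(line))]
    return max(stamps) - min(stamps)

sample = [
    "00000400:00000200:0.0:1673986593.772495:0:5947:0:(peer.c:2528:...)",
    "00000400:00000200:0.0:1673986848.256871:0:5947:0:(peer.c:2578:...)",
]
span = discovery_span(sample)  # ~254.48 s, matching the bc result above
```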
Again, the LU-14661 patch does not add any calls to discover NIDs. It simply creates a peer entry in the peer table so that LNet is aware of all NIDs available on the server.
All traffic is driven by Lustre, either by calling LNetPrimaryNID while processing the llog or by trying to send RPCs.
Yes, if you attempt to mount the MGS/MDT0 and the llog contains many unreachable NIDs, this can delay starting the MGS/MDT0, but it is not clear to me how the LU-14661 patch makes that worse. If anything, it should improve certain situations where targets are only available on a subset of the configured interfaces.
I reviewed the attached log files but they do not seem to show an identical experiment with/without LU-14661. In the "successful" case I see this mount activity:
hornc@C02V50B9HTDG Downloads % grep lustre_fill_super exa61_lustre_debug_mgs_logs_successful_failover
00000020:00000001:2.0:1669879252.327292:0:1363:0:(obd_mount.c:1603:lustre_fill_super()) Process entered
00000020:01200004:2.0:1669879252.327293:0:1363:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb ffff99d65f79c000
00000020:01000004:2.0:1669879252.327519:0:1363:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/sdl
00000020:00000001:14.0:1669879252.330511:0:1373:0:(obd_mount.c:1603:lustre_fill_super()) Process entered
00000020:01200004:14.0:1669879252.330512:0:1373:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb ffff99d67ce4e000
00000020:01000004:14.0:1669879252.330803:0:1373:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/sdh
00000020:00000001:7.0:1669879252.333040:0:1364:0:(obd_mount.c:1603:lustre_fill_super()) Process entered
00000020:01200004:7.0:1669879252.333041:0:1364:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb ffff99d65c0e4000
00000020:01000004:7.0:1669879252.333251:0:1364:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/sdi
00000020:00000001:1.0:1669879252.336037:0:1368:0:(obd_mount.c:1603:lustre_fill_super()) Process entered
00000020:01200004:1.0:1669879252.336037:0:1368:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb ffff99b235271800
00000020:01000004:1.0:1669879252.336229:0:1368:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/sdm
00000020:00000001:18.0:1669879252.839585:0:2016:0:(obd_mount.c:1603:lustre_fill_super()) Process entered
00000020:01200004:18.0:1669879252.839587:0:2016:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb ffff99d67cca0800
00000020:01000004:18.0:1669879252.839789:0:2016:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/mapper/vg_mdt0003_testfs-mdt0003
00000020:00000001:3.0:1669879252.898029:0:2180:0:(obd_mount.c:1603:lustre_fill_super()) Process entered
00000020:01200004:3.0:1669879252.898031:0:2180:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb ffff99d650fd0800
00000020:01000004:3.0:1669879252.898227:0:2180:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/mapper/vg_mdt0001_testfs-mdt0001
00000020:00000001:3.0:1669879294.415448:0:1373:0:(obd_mount.c:1677:lustre_fill_super()) Process leaving via out (rc=0 : 0 : 0x0)
00000020:00000004:3.0:1669879294.415449:0:1373:0:(obd_mount.c:1684:lustre_fill_super()) Mount /dev/sdh complete
00000020:00000001:4.0:1669879294.459274:0:1364:0:(obd_mount.c:1677:lustre_fill_super()) Process leaving via out (rc=0 : 0 : 0x0)
00000020:00000004:4.0:1669879294.459275:0:1364:0:(obd_mount.c:1684:lustre_fill_super()) Mount /dev/sdi complete
00000020:00000001:4.0:1669879294.504129:0:1363:0:(obd_mount.c:1677:lustre_fill_super()) Process leaving via out (rc=0 : 0 : 0x0)
00000020:00000004:4.0:1669879294.504129:0:1363:0:(obd_mount.c:1684:lustre_fill_super()) Mount /dev/sdl complete
00000020:00000001:7.0:1669879294.547133:0:1368:0:(obd_mount.c:1677:lustre_fill_super()) Process leaving via out (rc=0 : 0 : 0x0)
00000020:00000004:7.0:1669879294.547134:0:1368:0:(obd_mount.c:1684:lustre_fill_super()) Mount /dev/sdm complete
00000020:00000001:9.0:1669879434.214316:0:2016:0:(obd_mount.c:1677:lustre_fill_super()) Process leaving via out (rc=0 : 0 : 0x0)
00000020:00000004:9.0:1669879434.214317:0:2016:0:(obd_mount.c:1684:lustre_fill_super()) Mount /dev/mapper/vg_mdt0003_testfs-mdt0003 complete
00000020:00000001:6.0:1669879434.306276:0:2180:0:(obd_mount.c:1677:lustre_fill_super()) Process leaving via out (rc=0 : 0 : 0x0)
00000020:00000004:6.0:1669879434.306277:0:2180:0:(obd_mount.c:1684:lustre_fill_super()) Mount /dev/mapper/vg_mdt0001_testfs-mdt0001 complete
hornc@C02V50B9HTDG Downloads %
We can see MDT0003/MDT0001 are started on this server.
Here's the "failure" case:
00000020:01200004:18.0:1669887849.684402:0:26184:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb 000000006da5383a
00000020:01000004:18.0:1669887849.684416:0:26184:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/mapper/vg_mgs-mgs
00000020:01200004:4.0F:1669887851.305182:0:26399:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb 0000000083d05345
00000020:01000004:4.0:1669887851.305239:0:26399:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/mapper/vg_mdt0000_testfs-mdt0000
00000020:01200004:18.0:1669887896.681846:0:26695:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb 0000000011e1fd8a
00000020:01000004:18.0:1669887896.681899:0:26695:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/sdf
00000020:01200004:8.0:1669887896.689944:0:26694:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb 00000000fd561f91
00000020:01000004:8.0:1669887896.689984:0:26694:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/sdg
00000020:01200004:14.0:1669888000.593093:0:34933:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb 00000000015e2fa5
00000020:01000004:14.0:1669888000.593106:0:34933:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/mapper/vg_mgs-mgs
00000020:01200004:4.0:1669888003.512820:0:35427:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb 00000000b90f4eca
00000020:01000004:4.0:1669888003.512868:0:35427:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/mapper/vg_mdt0000_testfs-mdt0000
00000020:01200004:17.0:1669888003.512942:0:35428:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb 000000005d688404
00000020:01000004:17.0:1669888003.512993:0:35428:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/sdf
00000020:01200004:15.0:1669888158.886882:0:50631:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb 00000000fc1a622b
00000020:01000004:15.0:1669888158.886894:0:50631:0:(obd_mount.c:1632:lustre_fill_super()) Mounting client testfs-client
00000020:00000001:12.0:1669888790.461036:0:79837:0:(obd_mount.c:1603:lustre_fill_super()) Process entered
00000020:01200004:12.0:1669888790.461037:0:79837:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb 000000003e0cdb4c
00000020:01000004:12.0:1669888790.461270:0:79837:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/sdl
00000020:00000001:8.0:1669888790.462048:0:79841:0:(obd_mount.c:1603:lustre_fill_super()) Process entered
00000020:01200004:8.0:1669888790.462049:0:79841:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb 00000000d89b60bc
00000020:01000004:8.0:1669888790.462268:0:79841:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/sdh
00000020:00000001:14.0:1669888790.465089:0:79842:0:(obd_mount.c:1603:lustre_fill_super()) Process entered
00000020:01200004:14.0:1669888790.465090:0:79842:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb 000000000a7dd246
00000020:01000004:14.0:1669888790.465316:0:79842:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/sdm
00000020:00000001:13.0:1669888790.466087:0:79843:0:(obd_mount.c:1603:lustre_fill_super()) Process entered
00000020:01200004:13.0:1669888790.466088:0:79843:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb 0000000076093839
00000020:01000004:13.0:1669888790.466360:0:79843:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/sdi
00000020:00000001:18.0:1669888790.967385:0:80366:0:(obd_mount.c:1603:lustre_fill_super()) Process entered
00000020:01200004:18.0:1669888790.967386:0:80366:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb 00000000602ad4f3
00000020:01000004:18.0:1669888790.967614:0:80366:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/mapper/vg_mdt0003_testfs-mdt0003
00000020:00000001:16.0:1669888790.993980:0:80397:0:(obd_mount.c:1603:lustre_fill_super()) Process entered
00000020:01200004:16.0:1669888790.993981:0:80397:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb 000000003510bf98
00000020:01000004:16.0:1669888790.994203:0:80397:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/mapper/vg_mdt0001_testfs-mdt0001
00000020:00000001:14.0:1669888833.193485:0:79841:0:(obd_mount.c:1677:lustre_fill_super()) Process leaving via out (rc=0 : 0 : 0x0)
00000020:00000004:14.0:1669888833.193486:0:79841:0:(obd_mount.c:1684:lustre_fill_super()) Mount /dev/sdh complete
00000020:00000001:18.0:1669888833.236872:0:79842:0:(obd_mount.c:1677:lustre_fill_super()) Process leaving via out (rc=0 : 0 : 0x0)
00000020:00000004:18.0:1669888833.236873:0:79842:0:(obd_mount.c:1684:lustre_fill_super()) Mount /dev/sdm complete
00000020:00000001:3.0:1669888833.289210:0:79837:0:(obd_mount.c:1677:lustre_fill_super()) Process leaving via out (rc=0 : 0 : 0x0)
00000020:00000004:3.0:1669888833.289211:0:79837:0:(obd_mount.c:1684:lustre_fill_super()) Mount /dev/sdl complete
00000020:00000001:13.0:1669888833.341585:0:79843:0:(obd_mount.c:1677:lustre_fill_super()) Process leaving via out (rc=0 : 0 : 0x0)
00000020:00000004:13.0:1669888833.341586:0:79843:0:(obd_mount.c:1684:lustre_fill_super()) Mount /dev/sdi complete
Here we see MGS, MDT0000, a client mount, MDT0003 and MDT0001 are started on this node.
If we look at just the portion of both logs that covers the mount of MDT0003 and MDT0001, it appears they take about the same amount of time. So, again, it is not clear why the LU-14661 patch is to blame.
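One way to make that comparison concrete is to pair each "Process entered" line with the matching "Mount ... complete" line by PID (the sixth colon-separated field of the dk.log prefix). This is an illustrative sketch, not Lustre tooling:

```python
import re

# Illustrative only: time each lustre_fill_super() call by pairing
# "Process entered" with "Mount <device> complete" on the same PID.
LINE_RE = re.compile(r"^\w+:\w+:[\d.]+\w*:(\d+\.\d+):\d+:(\d+):")

def mount_durations(lines):
    """Map device -> seconds spent in lustre_fill_super()."""
    entered, durations = {}, {}
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        ts, pid = float(m.group(1)), m.group(2)
        if "Process entered" in line:
            entered[pid] = ts
        elif "Mount " in line and " complete" in line and pid in entered:
            dev = line.split("Mount ")[1].split(" complete")[0]
            durations[dev] = ts - entered.pop(pid)
    return durations
```

Running it over both logs would give a per-device duration table, making it easy to see whether MDT0003/MDT0001 mount times actually differ between the two runs.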
I suspect that you are actually seeing a delay in mounting the MGS/MDT0, and maybe the LU-14661 patch is somehow making it worse, but I do not see evidence of that.
> Chris Horn can we discover new NIDs asynchronously so that the thread processing the llog doesn't block?
This is the idea in https://jira.whamcloud.com/browse/LU-14668 but those patches have not landed.
hornc can we discover new NIDs asynchronously so that the thread processing the llog doesn't block?
> IMO, increasing the timeout isn't the optimal way: mount takes very long (and so does failover), users have a bad experience, etc.
Agree.
IMO, increasing the timeout isn't the optimal way: mount takes very long (and so does failover), users have a bad experience, etc.
Please open a new ticket and we can continue discussion there.