[LU-14661] Provide kernel API for adding peer/peer NI Created: 30/Apr/21  Updated: 15/Apr/23  Resolved: 24/Jan/23

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Improvement Priority: Minor
Reporter: Chris Horn Assignee: Chris Horn
Resolution: Fixed Votes: 0
Labels: None

Attachments: File exa61_lustre_debug_mgs_logs_successful_failover.gz     File exa62_mgs_lustre_logs_failover_with_trace.gz    
Issue Links:
Related

 Description   

Provide kernel API for adding peer and peer NI

Implement LNetAddPeer() and LNetAddPeerNI() APIs to allow other
kernel modules to add peer and peer NIs to LNet.

Peers created via these APIs are not marked as having been configured
by DLC. As such, they can be overwritten by discovery.
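
As an illustration only (not code from the patch), a kernel module might use the new API roughly as follows, assuming the 2.15-era prototype int LNetAddPeer(lnet_nid_t *nids, __u32 num_nids); the headers and NID strings are placeholders:

#include <libcfs/libcfs.h>
#include <lnet/api.h>

static int example_add_peer(void)
{
	lnet_nid_t nids[2];
	int rc;

	/* The first NID is treated as the primary NID; the second is
	 * added as an additional peer NI of the same peer. */
	nids[0] = libcfs_str2nid("192.168.1.10@o2ib");
	nids[1] = libcfs_str2nid("192.168.1.11@o2ib");

	rc = LNetAddPeer(nids, 2);
	if (rc < 0)
		CERROR("failed to add peer: rc = %d\n", rc);

	/* The peer is not flagged as configured by DLC, so a later
	 * discovery ping is free to overwrite this entry. */
	return rc;
}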



 Comments   
Comment by Gerrit Updater [ 30/Apr/21 ]

Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/43508
Subject: LU-14661 lnet: Check if discovery toggled off in ping reply
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b9beeb8b776eed1b9ecb4e274afa57cec3666fc1

Comment by Gerrit Updater [ 30/Apr/21 ]

Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/43509
Subject: LU-14661 lnet: Provide kernel API for adding peer and peer NI
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 71c89ce80d38c528d1103d9e28a8b22adabc6e48

Comment by Gerrit Updater [ 30/Apr/21 ]

Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/43510
Subject: LU-14661 obdclass: Add peer/peer NI when processing llog
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7451b3d1b5865230e2882c450812660379d00539

Comment by Gerrit Updater [ 08/Jun/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43508/
Subject: LU-14661 lnet: Check if discovery toggled off in ping reply
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 143893381d428466d4c71e075a041a9cbbd28818

Comment by Peter Jones [ 09/Jun/21 ]

Landed for 2.15

Comment by Chris Horn [ 09/Jun/21 ]

pjones there are still two outstanding patches for this ticket, so I am reopening it.

Comment by Gerrit Updater [ 18/Aug/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/43509/
Subject: LU-14661 lnet: Provide kernel API for adding peers
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ac201366ad5700edc860c139955af8a09bf53a1a

Comment by Gerrit Updater [ 18/Aug/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/43510/
Subject: LU-14661 obdclass: Add peer/peer NI when processing llog
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 16321de596f6395153be6cbb6192250516963077

Comment by Peter Jones [ 18/Aug/21 ]

Everything seems to have landed now

Comment by Alex Zhuravlev [ 01/Dec/22 ]

We hit a problem with this patch: the config log can contain unavailable NIDs, which results in a sequence like this:

00000040:00001000:12.0:1669888833.903403:0:82546:0:(llog.c:605:llog_process_thread()) processing rec 0x00000000453477cb type 0x10620000
00000040:00001000:12.0:1669888833.903403:0:82546:0:(llog.c:611:llog_process_thread()) after swabbing, type=0x10620000 idx=26
00000040:00001000:12.0:1669888833.903404:0:82546:0:(llog.c:713:llog_process_thread()) lrh_index: 26 lrh_len: 120 (4768 remains)
00000040:00001000:12.0:1669888833.903404:0:82546:0:(llog.c:730:llog_process_thread()) index: 26, lh_last_idx: 254 synced_idx: 0 lgh_last_idx: 254
00000020:00000001:12.0:1669888833.903405:0:82546:0:(obd_config.c:1807:class_config_llog_handler()) Process entered
00000020:00000010:12.0:1669888833.903405:0:82546:0:(obd_config.c:2023:class_config_llog_handler()) kmalloced '(lcfg_new)': 96 at 00000000c5e5bbaf.
00000020:00000080:12.0:1669888833.903406:0:82546:0:(obd_config.c:1399:class_process_config()) processing cmd: cf00b
00000020:00000001:12.0:1669888833.903408:0:82546:0:(obd_config.c:965:class_add_conn()) Process entered
00000020:00000001:12.0:1669888833.903408:0:82546:0:(obd_class.h:776:obd_add_conn()) Process entered
00010000:00000001:12.0:1669888833.903409:0:82546:0:(ldlm_lib.c:67:import_set_conn()) Process entered
00000100:00000200:12.0:1669888833.903410:0:82546:0:(events.c:580:ptlrpc_uuid_to_peer()) 172.16.0.80@o2ib->12345-172.16.0.80@o2ib
00000100:00000001:12.0:1669888833.903410:0:82546:0:(connection.c:84:ptlrpc_connection_get()) Process entered
00000400:00000200:12.0:1669888833.903411:0:82546:0:(api-ni.c:1405:lnet_nid_cpt_hash()) Match nid 172.16.0.80@o2ib to cpt 3
00000400:00000200:12.0:1669888833.903415:0:82546:0:(peer.c:2052:lnet_peer_queue_for_discovery()) Queue peer 172.16.0.80@o2ib: 0
00000400:00000200:12.0:1669888833.903416:0:82546:0:(peer.c:2370:lnet_discover_peer_locked()) Discovery attempt # 1
00000400:00000200:8.0:1669888867.664662:0:82546:0:(peer.c:2420:lnet_discover_peer_locked()) peer 172.16.0.80@o2ib NID 172.16.0.80@o2ib: -110. discovery complete
00000400:00000200:8.0:1669888867.664664:0:82546:0:(peer.c:1320:LNetPrimaryNID()) NID 172.16.0.80@o2ib primary NID 172.16.0.80@o2ib rc -110

Notice the 30+ second delay. A log may contain a number of unavailable NIDs, so overall config llog processing can take hundreds of seconds. The problem gets worse when several services need to start (in the case of failover, for example) because they are all serialized on cl_mgc_mutex. Sometimes mounting takes long enough that HA decides it is making no progress and initiates another failover.

Comment by Chris Horn [ 01/Dec/22 ]

bzzz Do you have a reproducer? If so, can you see if https://review.whamcloud.com/#/c/fs/lustre-release/+/47322/ helps?

Comment by Chris Horn [ 01/Dec/22 ]

Also, it isn't clear to me, based on the output you provided, that this problem is related to my patch. Peers are discovered as part of llog processing with or without the LU-14661 patch, right?

Comment by Alex Zhuravlev [ 01/Dec/22 ]

Do you have a reproducer? If so, can you see if https://review.whamcloud.com/#/c/fs/lustre-release/+/47322/ helps?

Yes, we do have a reproducer; we will try that patch.

Also, it isn't clear to me, based on the output you provided, that this problem is related to my patch

I'm not really familiar with the LNet code, but the patch adds this:

+               rc = LNetAddPeer(entry->un_nids, entry->un_nid_count);

and I've got one successful pass with that patch reverted. We'll do more testing, of course.
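
For context, here is a hedged reconstruction of where that call sits in obdclass; only the LNetAddPeer() line and the un_nids/un_nid_count fields come from the patch, while the surrounding structure is assumed from lustre_peer.c:

struct uuid_nid_data {
	struct list_head	un_list;
	struct obd_uuid		un_uuid;
	int			un_nid_count;
	lnet_nid_t		un_nids[MTI_NIDS_MAX];
};

/* When the config llog maps a UUID to a set of NIDs, hand the whole
 * set to LNet so the peer table knows every interface of the target. */
static int sketch_add_uuid_peer(struct uuid_nid_data *entry)
{
	return LNetAddPeer(entry->un_nids, entry->un_nid_count);
}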

Comment by Alex Zhuravlev [ 01/Dec/22 ]

We've tested with the patch reverted on two different systems, successfully.

Comment by Chris Horn [ 01/Dec/22 ]

but the patch adds this:

+ rc = LNetAddPeer(entry->un_nids, entry->un_nid_count);

Yes, but LNetAddPeer() does not directly result in discovery. Discovery is triggered by a call to LNetPrimaryNID() in ptlrpc_connection_get(), and these calls should exist with or without my patch. So we'll need to understand why the behavior is different with the patch.
Can you provide debug logs from the reproducer, with and without the patch, for analysis?
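
A compressed sketch of the distinction being drawn here, assuming the 2.15-era prototypes (int LNetAddPeer(lnet_nid_t *, __u32) and lnet_nid_t LNetPrimaryNID(lnet_nid_t)); this is illustrative, not verbatim Lustre source:

#include <lnet/api.h>

static void sketch_add_vs_discover(lnet_nid_t *nids, __u32 count)
{
	/* LU-14661 path, run during llog processing: record the NIDs
	 * in the peer table.  Nothing is sent on the wire here. */
	if (LNetAddPeer(nids, count) < 0)
		return;

	/* Pre-existing path, reached via ptlrpc_connection_get(): this
	 * is the call that can block for the full discovery timeout
	 * when a NID is unreachable (the rc = -110 seen in the logs). */
	(void)LNetPrimaryNID(nids[0]);
}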

Comment by Alex Zhuravlev [ 01/Dec/22 ]

hornc please check the lines I posted 2 hours ago; this is a case where llog processing took very long.
Now the corresponding part for a "good" case:

00000020:00000080:4.0:1669847133.980952:0:19526:0:(obd_config.c:1389:class_process_config()) processing cmd: cf00b
00010000:00080000:4.0:1669847133.980956:0:19526:0:(ldlm_lib.c:99:import_set_conn()) imp ffff9a54fb49b000@testfs-OST0001-osc-MDT0002: found existing conn 172.16.0.75@o2ib
00000020:00000080:4.0:1669847133.980957:0:19526:0:(obd_config.c:1389:class_process_config()) processing cmd: cf005
00000020:00000080:4.0:1669847133.980957:0:19526:0:(obd_config.c:1401:class_process_config()) adding mapping from uuid 172.16.0.77@o2ib to nid 0x50000ac10004d (172.16.0.77@o2ib)
00000020:00000080:4.0:1669847133.980958:0:19526:0:(obd_config.c:1389:class_process_config()) processing cmd: cf005
00000020:00000080:4.0:1669847133.980959:0:19526:0:(obd_config.c:1401:class_process_config()) adding mapping from uuid 172.16.0.77@o2ib to nid 0x50000ac10004e (172.16.0.78@o2ib)
00000020:00000080:4.0:1669847133.980960:0:19526:0:(obd_config.c:1389:class_process_config()) processing cmd: cf00b
00010000:00080000:4.0:1669847133.980961:0:19526:0:(ldlm_lib.c:116:import_set_conn()) imp ffff9a54fb49b000@testfs-OST0001-osc-MDT0002: add connection 172.16.0.77@o2ib at tail
00000020:00000080:4.0:1669847133.980962:0:19526:0:(obd_config.c:1389:class_process_config()) processing cmd: cf005
00000020:00000080:4.0:1669847133.980963:0:19526:0:(obd_config.c:1401:class_process_config()) adding mapping from uuid 172.16.0.79@o2ib to nid 0x50000ac10004f (172.16.0.79@o2ib)
00000020:00000080:4.0:1669847133.980964:0:19526:0:(obd_config.c:1389:class_process_config()) processing cmd: cf005
00000020:00000080:4.0:1669847133.980964:0:19526:0:(obd_config.c:1401:class_process_config()) adding mapping from uuid 172.16.0.79@o2ib to nid 0x50000ac100050 (172.16.0.80@o2ib)
00000020:00000080:4.0:1669847133.980965:0:19526:0:(obd_config.c:1389:class_process_config()) processing cmd: cf00b
00010000:00080000:4.0:1669847133.980967:0:19526:0:(ldlm_lib.c:116:import_set_conn()) imp ffff9a54fb49b000@testfs-OST0001-osc-MDT0002: add connection 172.16.0.79@o2ib at tail
00000020:00000080:4.0:1669847133.980968:0:19526:0:(obd_config.c:1389:class_process_config()) processing cmd: cf005
00000020:00000080:4.0:1669847133.980969:0:19526:0:(obd_config.c:1401:class_process_config()) adding mapping from uuid 172.16.0.81@o2ib to nid 0x50000ac100051 (172.16.0.81@o2ib)
00000020:00000080:4.0:1669847133.980970:0:19526:0:(obd_config.c:1389:class_process_config()) processing cmd: cf005
00000020:00000080:4.0:1669847133.980970:0:19526:0:(obd_config.c:1401:class_process_config()) adding mapping from uuid 172.16.0.81@o2ib to nid 0x50000ac100052 (172.16.0.82@o2ib)
00000020:00000080:4.0:1669847133.980971:0:19526:0:(obd_config.c:1389:class_process_config()) processing cmd: cf00b
Comment by Chris Horn [ 01/Dec/22 ]

The log for the good case doesn't seem to show the same llog processing; e.g. there are no calls to ptlrpc_uuid_to_peer()/ptlrpc_connection_get(). Do you see those calls anywhere in the log? Is the "net" trace enabled in the debug mask in the log for the good case? Can you compare the calls to LNetPrimaryNID() in the good and bad cases, e.g. count the number of LNetPrimaryNID() calls in each?

Comment by Alex Zhuravlev [ 02/Dec/22 ]

hornc please check the log I've just attached. This was gathered using a pre-LU-14661 version.

Comment by Chris Horn [ 02/Dec/22 ]

bzzz Can you please attach the full log for the bad case as well?

Comment by Chris Horn [ 02/Dec/22 ]

And can you provide the exact Lustre hash(es) being tested?

Comment by Alex Zhuravlev [ 02/Dec/22 ]

Alex Zhuravlev Can you please attach the full log for the bad case as well?

just attached

Comment by Chris Horn [ 02/Dec/22 ]

Is LNet peer discovery disabled in the good case but enabled in the bad case?

Comment by Alex Zhuravlev [ 02/Dec/22 ]

Is LNet peer discovery disabled in the good case but enabled in the bad case?

Not quite ready to answer; I will have to check with a colleague.
This is what we found yesterday: same branch, same setup; it gets stuck with LU-14661 and does well with LU-14661 reverted.

Comment by Alex Zhuravlev [ 07/Dec/22 ]

If so, can you see if https://review.whamcloud.com/#/c/fs/lustre-release/+/47322/ helps?

According to the testing, no, it doesn't help.

Comment by Serguei Smirnov [ 09/Dec/22 ]

I was experimenting with the mount command which lists NIDs as X1,X2:Y1,Y2

In my experiment all NIDs were fake, so all attempts to connect were expected to fail. I was just checking which NIDs LNet would try to connect to.

Basically, it looks like before the LU-14661 patch, LNet would only try to connect to X1 and Y1 when discovering X and Y respectively. With the LU-14661 patch, LNet tries to connect to all listed NIDs. I think this behavior is expected, though, and is good in the sense that it gives the mount a chance to succeed if X1 is down but X2 is up.

Comment by Alex Zhuravlev [ 09/Dec/22 ]

The problem is that mount takes too long in such a scenario; HA gives up and initiates a failover, which can hit the same problem.
Would it be possible to make such a "try to connect" asynchronous, so the mount process (adding NIDs from the config) is not blocked trying to connect to them one by one?

Comment by Gaurang Tapase [ 09/Dec/22 ]

We can definitely improve the HA agents to wait a bit longer (right now they wait for 450s before giving up). I tried doubling the timeout (900s) and still saw the problem. We need a definite timeout before which the mount would succeed.

Comment by Alex Zhuravlev [ 09/Dec/22 ]

IMO, increasing the timeout isn't the optimal way: mount takes very long (and so does failover), users have a bad experience, etc.

Comment by Gaurang Tapase [ 09/Dec/22 ]

IMO, increasing the timeout isn't the optimal way: mount takes very long (and so does failover), users have a bad experience, etc.

Agree.

Comment by Alex Zhuravlev [ 14/Dec/22 ]

hornc can we discover new NIDs asynchronously so that the thread processing the llog doesn't block?

Comment by Chris Horn [ 17/Jan/23 ]

A couple of points:
We should probably open a new ticket for this issue/discussion.

I was experimenting with the mount command which lists nids as X1,X2:Y1,Y2 [...] With LU-14661 patch, LNet tries to connect to all listed nids.

To be clear, a client mount attempt performs exactly the same with or without LU-14661 when passed the "X1,X2:Y1,Y2" MGS NID format.

With LU-14661 (master commit 6f490275b0):

[root@ct7-mds1 lustre-filesystem]# mount -t lustre 1.1.1.1@tcp,1.1.1.2@tcp:2.2.2.1@tcp,2.2.2.2@tcp:/lustre /mnt/lustre
mount.lustre: mount 1.1.1.1@tcp,1.1.1.2@tcp:2.2.2.1@tcp,2.2.2.2@tcp:/lustre at /mnt/lustre failed: Input/output error
Is the MGS running?
[root@ct7-mds1 lustre-filesystem]# lctl dk > /tmp/dk.log
[root@ct7-mds1 lustre-filesystem]# grep -e lnet_health_check -e TRACE -e lnet_discover_peer_locked /tmp/dk.log | egrep -e 1.1.1. -e 2.2.2. -e lnet_discover_peer_locked
00000400:00000200:0.0:1673986593.772495:0:5947:0:(peer.c:2528:lnet_discover_peer_locked()) Discovery attempt # 1
00000400:00000200:0.0:1673986593.772524:0:5509:0:(lib-move.c:2009:lnet_handle_send()) TRACE: 10.73.20.11@tcp(10.73.20.11@tcp:<?>) -> 1.1.1.1@tcp(1.1.1.1@tcp:1.1.1.1@tcp) <?> : GET try# 0
00000400:00000200:0.0:1673986652.199830:0:5506:0:(lib-msg.c:811:lnet_health_check()) health check: 10.73.20.11@tcp->1.1.1.1@tcp: GET: LOCAL_TIMEOUT
00000400:00000200:0.0:1673986652.199877:0:5510:0:(lib-move.c:2009:lnet_handle_send()) TRACE: 10.73.20.12@tcp(10.73.20.12@tcp:<?>) -> 1.1.1.2@tcp(1.1.1.1@tcp:1.1.1.2@tcp) <?> : GET try# 1
00000400:00000200:0.0:1673986703.226629:0:5506:0:(lib-msg.c:811:lnet_health_check()) health check: 10.73.20.12@tcp->1.1.1.2@tcp: GET: LOCAL_TIMEOUT
00000400:00000200:0.0:1673986703.226700:0:5510:0:(lib-move.c:2009:lnet_handle_send()) TRACE: 10.73.20.11@tcp(10.73.20.11@tcp:<?>) -> 1.1.1.1@tcp(1.1.1.2@tcp:1.1.1.1@tcp) <?> : GET try# 2
00000400:00000200:0.0:1673986721.024165:0:5502:0:(lib-msg.c:811:lnet_health_check()) health check: 10.73.20.11@tcp->1.1.1.1@tcp: GET: LOCAL_TIMEOUT
00000400:00000200:0.0:1673986721.024196:0:5947:0:(peer.c:2578:lnet_discover_peer_locked()) peer 1.1.1.1@tcp NID 1.1.1.1@tcp: -110. discovery complete
00000400:00000200:0.0:1673986721.027992:0:5947:0:(peer.c:2528:lnet_discover_peer_locked()) Discovery attempt # 1
00000400:00000200:0.0:1673986721.028022:0:5509:0:(lib-move.c:2009:lnet_handle_send()) TRACE: 10.73.20.12@tcp(10.73.20.12@tcp:<?>) -> 2.2.2.1@tcp(2.2.2.1@tcp:2.2.2.1@tcp) <?> : GET try# 0
00000400:00000200:0.0:1673986778.248439:0:5506:0:(lib-msg.c:811:lnet_health_check()) health check: 10.73.20.12@tcp->2.2.2.1@tcp: GET: LOCAL_TIMEOUT
00000400:00000200:0.0:1673986778.248497:0:5510:0:(lib-move.c:2009:lnet_handle_send()) TRACE: 10.73.20.11@tcp(10.73.20.11@tcp:<?>) -> 2.2.2.2@tcp(2.2.2.1@tcp:2.2.2.2@tcp) <?> : GET try# 1
00000400:00000200:0.0:1673986829.257114:0:5506:0:(lib-msg.c:811:lnet_health_check()) health check: 10.73.20.11@tcp->2.2.2.2@tcp: GET: LOCAL_TIMEOUT
00000400:00000200:0.0:1673986829.257158:0:5510:0:(lib-move.c:2009:lnet_handle_send()) TRACE: 10.73.20.12@tcp(10.73.20.12@tcp:<?>) -> 2.2.2.1@tcp(2.2.2.2@tcp:2.2.2.1@tcp) <?> : GET try# 2
00000400:00000200:0.0:1673986848.256819:0:5505:0:(lib-msg.c:811:lnet_health_check()) health check: 10.73.20.12@tcp->2.2.2.1@tcp: GET: LOCAL_TIMEOUT
00000400:00000200:0.0:1673986848.256871:0:5947:0:(peer.c:2578:lnet_discover_peer_locked()) peer 2.2.2.1@tcp NID 2.2.2.1@tcp: -110. discovery complete
...
[root@ct7-mds1 lustre-filesystem]# echo 1673986848.256871 - 1673986593.772495 | bc
254.484376
[root@ct7-mds1 lustre-filesystem]#

Without LU-14661 (master commit 6f490275b0 + revert of 16321de596f6395153be6cbb6192250516963077):

[root@ct7-mds1 lustre-filesystem]# mount -t lustre 1.1.1.1@tcp,1.1.1.2@tcp:2.2.2.1@tcp,2.2.2.2@tcp:/lustre /mnt/lustre
mount.lustre: mount 1.1.1.1@tcp,1.1.1.2@tcp:2.2.2.1@tcp,2.2.2.2@tcp:/lustre at /mnt/lustre failed: Input/output error
Is the MGS running?
[root@ct7-mds1 lustre-filesystem]# lctl dk > /tmp/dk.log2
[root@ct7-mds1 lustre-filesystem]# grep -e lnet_health_check -e TRACE -e lnet_discover_peer_locked /tmp/dk.log2 | egrep -e 1.1.1. -e 2.2.2. -e lnet_discover_peer_locked
00000400:00000200:0.0:1673988188.818846:0:7434:0:(peer.c:2528:lnet_discover_peer_locked()) Discovery attempt # 1
00000400:00000200:0.0:1673988188.818874:0:7032:0:(lib-move.c:2009:lnet_handle_send()) TRACE: 10.73.20.11@tcp(10.73.20.11@tcp:<?>) -> 1.1.1.1@tcp(1.1.1.1@tcp:1.1.1.1@tcp) <?> : GET try# 0
00000400:00000200:0.0:1673988249.480663:0:7029:0:(lib-msg.c:811:lnet_health_check()) health check: 10.73.20.11@tcp->1.1.1.1@tcp: GET: LOCAL_TIMEOUT
00000400:00000200:0.0:1673988249.480727:0:7033:0:(lib-move.c:2009:lnet_handle_send()) TRACE: 10.73.20.12@tcp(10.73.20.12@tcp:<?>) -> 1.1.1.1@tcp(1.1.1.1@tcp:1.1.1.1@tcp) <?> : GET try# 1
00000400:00000200:0.0:1673988300.480228:0:7029:0:(lib-msg.c:811:lnet_health_check()) health check: 10.73.20.12@tcp->1.1.1.1@tcp: GET: LOCAL_TIMEOUT
00000400:00000200:0.0:1673988300.480289:0:7033:0:(lib-move.c:2009:lnet_handle_send()) TRACE: 10.73.20.11@tcp(10.73.20.11@tcp:<?>) -> 1.1.1.1@tcp(1.1.1.1@tcp:1.1.1.1@tcp) <?> : GET try# 2
00000400:00000200:0.0:1673988316.033279:0:7025:0:(lib-msg.c:811:lnet_health_check()) health check: 10.73.20.11@tcp->1.1.1.1@tcp: GET: LOCAL_TIMEOUT
00000400:00000200:0.0:1673988316.033351:0:7434:0:(peer.c:2578:lnet_discover_peer_locked()) peer 1.1.1.1@tcp NID 1.1.1.1@tcp: -110. discovery complete
00000400:00000200:0.0:1673988316.034632:0:7434:0:(peer.c:2528:lnet_discover_peer_locked()) Discovery attempt # 1
00000400:00000200:0.0:1673988316.034665:0:7032:0:(lib-move.c:2009:lnet_handle_send()) TRACE: 10.73.20.12@tcp(10.73.20.12@tcp:<?>) -> 2.2.2.1@tcp(2.2.2.1@tcp:2.2.2.1@tcp) <?> : GET try# 0
00000400:00000200:0.0:1673988375.486277:0:7029:0:(lib-msg.c:811:lnet_health_check()) health check: 10.73.20.12@tcp->2.2.2.1@tcp: GET: LOCAL_TIMEOUT
00000400:00000200:0.0:1673988375.486332:0:7033:0:(lib-move.c:2009:lnet_handle_send()) TRACE: 10.73.20.11@tcp(10.73.20.11@tcp:<?>) -> 2.2.2.1@tcp(2.2.2.1@tcp:2.2.2.1@tcp) <?> : GET try# 1
00000400:00000200:0.0:1673988426.497680:0:7029:0:(lib-msg.c:811:lnet_health_check()) health check: 10.73.20.11@tcp->2.2.2.1@tcp: GET: LOCAL_TIMEOUT
00000400:00000200:0.0:1673988426.497744:0:7033:0:(lib-move.c:2009:lnet_handle_send()) TRACE: 10.73.20.12@tcp(10.73.20.12@tcp:<?>) -> 2.2.2.1@tcp(2.2.2.1@tcp:2.2.2.1@tcp) <?> : GET try# 2
00000400:00000200:0.0:1673988443.264923:0:7028:0:(lib-msg.c:811:lnet_health_check()) health check: 10.73.20.12@tcp->2.2.2.1@tcp: GET: LOCAL_TIMEOUT
00000400:00000200:0.0:1673988443.264993:0:7434:0:(peer.c:2578:lnet_discover_peer_locked()) peer 2.2.2.1@tcp NID 2.2.2.1@tcp: -110. discovery complete
...
[root@ct7-mds1 lustre-filesystem]# echo 1673988443.264993 - 1673988188.818846 | bc
254.446147
[root@ct7-mds1 lustre-filesystem]#

We can see that the amount of time spent in blocking discovery calls is identical with or without the patch.
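
As a cross-check (computed from the timestamps in the first trace above, using the same bc arithmetic), the ~254 s is almost entirely the three blocking GET timeouts per MGS failover pair, roughly 127 s per pair:

echo 1673986652.199830 - 1673986593.772524 | bc   # try# 0: ~58.4s
echo 1673986703.226629 - 1673986652.199877 | bc   # try# 1: ~51.0s
echo 1673986721.024165 - 1673986703.226700 | bc   # try# 2: ~17.8s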

Again, the LU-14661 patch does not add any calls to discover NIDs. LU-14661 simply creates a peer entry in the peer table so that LNet is aware of all NIDs available on the server.

All traffic is driven by Lustre, either by calling LNetPrimaryNID() while processing the llog or by trying to send RPCs.

Yes, if you attempt to mount the MGS/MDT0 and you have a lot of unreachable NIDs in the llog, then this can cause delays in starting the MGS/MDT0, but it is not clear to me how the LU-14661 patch makes this worse. If anything, it should improve certain situations where targets are only available on a subset of the configured interfaces.

I reviewed the attached log files but they do not seem to show an identical experiment with/without LU-14661. In the "successful" case I see this mount activity:

hornc@C02V50B9HTDG Downloads % grep lustre_fill_super exa61_lustre_debug_mgs_logs_successful_failover
00000020:00000001:2.0:1669879252.327292:0:1363:0:(obd_mount.c:1603:lustre_fill_super()) Process entered
00000020:01200004:2.0:1669879252.327293:0:1363:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb ffff99d65f79c000
00000020:01000004:2.0:1669879252.327519:0:1363:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/sdl
00000020:00000001:14.0:1669879252.330511:0:1373:0:(obd_mount.c:1603:lustre_fill_super()) Process entered
00000020:01200004:14.0:1669879252.330512:0:1373:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb ffff99d67ce4e000
00000020:01000004:14.0:1669879252.330803:0:1373:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/sdh
00000020:00000001:7.0:1669879252.333040:0:1364:0:(obd_mount.c:1603:lustre_fill_super()) Process entered
00000020:01200004:7.0:1669879252.333041:0:1364:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb ffff99d65c0e4000
00000020:01000004:7.0:1669879252.333251:0:1364:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/sdi
00000020:00000001:1.0:1669879252.336037:0:1368:0:(obd_mount.c:1603:lustre_fill_super()) Process entered
00000020:01200004:1.0:1669879252.336037:0:1368:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb ffff99b235271800
00000020:01000004:1.0:1669879252.336229:0:1368:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/sdm
00000020:00000001:18.0:1669879252.839585:0:2016:0:(obd_mount.c:1603:lustre_fill_super()) Process entered
00000020:01200004:18.0:1669879252.839587:0:2016:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb ffff99d67cca0800
00000020:01000004:18.0:1669879252.839789:0:2016:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/mapper/vg_mdt0003_testfs-mdt0003
00000020:00000001:3.0:1669879252.898029:0:2180:0:(obd_mount.c:1603:lustre_fill_super()) Process entered
00000020:01200004:3.0:1669879252.898031:0:2180:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb ffff99d650fd0800
00000020:01000004:3.0:1669879252.898227:0:2180:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/mapper/vg_mdt0001_testfs-mdt0001
00000020:00000001:3.0:1669879294.415448:0:1373:0:(obd_mount.c:1677:lustre_fill_super()) Process leaving via out (rc=0 : 0 : 0x0)
00000020:00000004:3.0:1669879294.415449:0:1373:0:(obd_mount.c:1684:lustre_fill_super()) Mount /dev/sdh complete
00000020:00000001:4.0:1669879294.459274:0:1364:0:(obd_mount.c:1677:lustre_fill_super()) Process leaving via out (rc=0 : 0 : 0x0)
00000020:00000004:4.0:1669879294.459275:0:1364:0:(obd_mount.c:1684:lustre_fill_super()) Mount /dev/sdi complete
00000020:00000001:4.0:1669879294.504129:0:1363:0:(obd_mount.c:1677:lustre_fill_super()) Process leaving via out (rc=0 : 0 : 0x0)
00000020:00000004:4.0:1669879294.504129:0:1363:0:(obd_mount.c:1684:lustre_fill_super()) Mount /dev/sdl complete
00000020:00000001:7.0:1669879294.547133:0:1368:0:(obd_mount.c:1677:lustre_fill_super()) Process leaving via out (rc=0 : 0 : 0x0)
00000020:00000004:7.0:1669879294.547134:0:1368:0:(obd_mount.c:1684:lustre_fill_super()) Mount /dev/sdm complete
00000020:00000001:9.0:1669879434.214316:0:2016:0:(obd_mount.c:1677:lustre_fill_super()) Process leaving via out (rc=0 : 0 : 0x0)
00000020:00000004:9.0:1669879434.214317:0:2016:0:(obd_mount.c:1684:lustre_fill_super()) Mount /dev/mapper/vg_mdt0003_testfs-mdt0003 complete
00000020:00000001:6.0:1669879434.306276:0:2180:0:(obd_mount.c:1677:lustre_fill_super()) Process leaving via out (rc=0 : 0 : 0x0)
00000020:00000004:6.0:1669879434.306277:0:2180:0:(obd_mount.c:1684:lustre_fill_super()) Mount /dev/mapper/vg_mdt0001_testfs-mdt0001 complete
hornc@C02V50B9HTDG Downloads %

We can see MDT0003/MDT0001 are started on this server.

Here's the "failure" case:

00000020:01200004:18.0:1669887849.684402:0:26184:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb 000000006da5383a
00000020:01000004:18.0:1669887849.684416:0:26184:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/mapper/vg_mgs-mgs
00000020:01200004:4.0F:1669887851.305182:0:26399:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb 0000000083d05345
00000020:01000004:4.0:1669887851.305239:0:26399:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/mapper/vg_mdt0000_testfs-mdt0000
00000020:01200004:18.0:1669887896.681846:0:26695:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb 0000000011e1fd8a
00000020:01000004:18.0:1669887896.681899:0:26695:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/sdf
00000020:01200004:8.0:1669887896.689944:0:26694:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb 00000000fd561f91
00000020:01000004:8.0:1669887896.689984:0:26694:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/sdg
00000020:01200004:14.0:1669888000.593093:0:34933:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb 00000000015e2fa5
00000020:01000004:14.0:1669888000.593106:0:34933:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/mapper/vg_mgs-mgs
00000020:01200004:4.0:1669888003.512820:0:35427:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb 00000000b90f4eca
00000020:01000004:4.0:1669888003.512868:0:35427:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/mapper/vg_mdt0000_testfs-mdt0000
00000020:01200004:17.0:1669888003.512942:0:35428:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb 000000005d688404
00000020:01000004:17.0:1669888003.512993:0:35428:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/sdf
00000020:01200004:15.0:1669888158.886882:0:50631:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb 00000000fc1a622b
00000020:01000004:15.0:1669888158.886894:0:50631:0:(obd_mount.c:1632:lustre_fill_super()) Mounting client testfs-client
00000020:00000001:12.0:1669888790.461036:0:79837:0:(obd_mount.c:1603:lustre_fill_super()) Process entered
00000020:01200004:12.0:1669888790.461037:0:79837:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb 000000003e0cdb4c
00000020:01000004:12.0:1669888790.461270:0:79837:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/sdl
00000020:00000001:8.0:1669888790.462048:0:79841:0:(obd_mount.c:1603:lustre_fill_super()) Process entered
00000020:01200004:8.0:1669888790.462049:0:79841:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb 00000000d89b60bc
00000020:01000004:8.0:1669888790.462268:0:79841:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/sdh
00000020:00000001:14.0:1669888790.465089:0:79842:0:(obd_mount.c:1603:lustre_fill_super()) Process entered
00000020:01200004:14.0:1669888790.465090:0:79842:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb 000000000a7dd246
00000020:01000004:14.0:1669888790.465316:0:79842:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/sdm
00000020:00000001:13.0:1669888790.466087:0:79843:0:(obd_mount.c:1603:lustre_fill_super()) Process entered
00000020:01200004:13.0:1669888790.466088:0:79843:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb 0000000076093839
00000020:01000004:13.0:1669888790.466360:0:79843:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/sdi
00000020:00000001:18.0:1669888790.967385:0:80366:0:(obd_mount.c:1603:lustre_fill_super()) Process entered
00000020:01200004:18.0:1669888790.967386:0:80366:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb 00000000602ad4f3
00000020:01000004:18.0:1669888790.967614:0:80366:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/mapper/vg_mdt0003_testfs-mdt0003
00000020:00000001:16.0:1669888790.993980:0:80397:0:(obd_mount.c:1603:lustre_fill_super()) Process entered
00000020:01200004:16.0:1669888790.993981:0:80397:0:(obd_mount.c:1605:lustre_fill_super()) VFS Op: sb 000000003510bf98
00000020:01000004:16.0:1669888790.994203:0:80397:0:(obd_mount.c:1659:lustre_fill_super()) Mounting server from /dev/mapper/vg_mdt0001_testfs-mdt0001
00000020:00000001:14.0:1669888833.193485:0:79841:0:(obd_mount.c:1677:lustre_fill_super()) Process leaving via out (rc=0 : 0 : 0x0)
00000020:00000004:14.0:1669888833.193486:0:79841:0:(obd_mount.c:1684:lustre_fill_super()) Mount /dev/sdh complete
00000020:00000001:18.0:1669888833.236872:0:79842:0:(obd_mount.c:1677:lustre_fill_super()) Process leaving via out (rc=0 : 0 : 0x0)
00000020:00000004:18.0:1669888833.236873:0:79842:0:(obd_mount.c:1684:lustre_fill_super()) Mount /dev/sdm complete
00000020:00000001:3.0:1669888833.289210:0:79837:0:(obd_mount.c:1677:lustre_fill_super()) Process leaving via out (rc=0 : 0 : 0x0)
00000020:00000004:3.0:1669888833.289211:0:79837:0:(obd_mount.c:1684:lustre_fill_super()) Mount /dev/sdl complete
00000020:00000001:13.0:1669888833.341585:0:79843:0:(obd_mount.c:1677:lustre_fill_super()) Process leaving via out (rc=0 : 0 : 0x0)
00000020:00000004:13.0:1669888833.341586:0:79843:0:(obd_mount.c:1684:lustre_fill_super()) Mount /dev/sdi complete

Here we see that the MGS, MDT0000, a client mount, MDT0003, and MDT0001 are started on this node.

If we look at just the portion of both logs that covers the mounts of MDT0003 and MDT0001, it appears they spend about the same amount of time. So, again, it is not clear why the LU-14661 patch is to blame.

I suspect that you are actually seeing a delay in mounting the MGS/MDT0, and maybe the LU-14661 patch is somehow making this worse, but I do not see evidence of it.

Chris Horn can we discover new NIDs asynchronously so that the thread processing the llog doesn't block?

This is the idea in https://jira.whamcloud.com/browse/LU-14668 but those patches have not landed.

Comment by Serguei Smirnov [ 17/Jan/23 ]

Hi Chris,

I think there's a difference in the examples you posted:  

"With LU-14661" trace shows client trying to discover 1.1.1.1@tcp, then to 1.1.1.2@tcp, then 2.2.2.1@tcp and finally 2.2.2.2@tcp.  

"Without" trace shows client trying to discover 1.1.1.1@tcp, then 2.2.2.1@tcp and that's it.

I thought that this difference in behaviour was responsible for the difference in the time it took for the mount to fail, because in the "with" case there were more NIDs being discovered sequentially, but your example demonstrates that this is not so: it looks like the same number of discovery attempts is made per peer regardless of the destination NID.

Either way, like I said before, I don't see anything wrong with LU-14661. I'm planning to revisit LU-14668, though, to see if we can get it to work, as it would speed things up versus discovering NIDs one at a time, especially when some NIDs are unreachable.

Thanks,

Serguei.

Comment by Chris Horn [ 18/Jan/23 ]

Yes, the NIDs being discovered are different, but my point was that in both cases you have three blocking discovery calls for each "MGS", and those calls contribute the same amount of time to the failed mount with or without the patch.

+1 on reviving LU-14668.

Comment by Alex Zhuravlev [ 19/Jan/23 ]

Well, the fact remains that reverting that patch fixed the problem. I guess Gaurang can confirm.

Comment by Chris Horn [ 23/Jan/23 ]

bzzz I think this ticket should be closed and a new ticket opened for this issue. I am not saying there is no problem, but the logs provided so far do not show the problem (AFAICT).

Comment by Alex Zhuravlev [ 24/Jan/23 ]

but the logs that are provided so far do not show the problem (AFAICT).

I'm fine with another ticket either way, that's totally fine. But I'm not sure why you are saying the above: the logs I pasted show that in one case config log processing took 30 seconds, and in the good case (with the patch reverted) it took less than a second. Isn't this a problem?

Comment by Chris Horn [ 24/Jan/23 ]

Please open a new ticket and we can continue discussion there.
