[LU-14381] Client stuck using single NID when multiple are available Created: 28/Jan/21 Updated: 12/Nov/21 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Chris Horn | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Attachments: | client.dklog |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Filing this as a bug, but it might be user error. The problem I'm seeing is that clients only attempt to connect to one of two available NIDs on the servers. Both clients and servers are configured with two NIDs, one on @tcp and one on @tcp1. Discovery/multi-rail is disabled. If I simulate failure of the @tcp NID on the OSS, I would expect the client to eventually try the @tcp1 NID, but that never happens. I tried modifying test-framework.sh so that it formats the OST with --servicenode=192.168.2.34@tcp,192.168.2.35@tcp1, but that didn't make any difference. Is this working as expected? Is this a bug? Am I missing some necessary config to allow the client to use either NID?

Edit 1: There was a suggestion to put the interfaces on separate subnets. I tried that and it did not resolve the issue. See the 28/Jan/21 comment below.

Edit 2: Attached a -1 debug log from the client: client.dklog

Details on how to reproduce follow.

Version under test:

sles15s01:/home/hornc/fs2 # ./LUSTRE-VERSION-GEN
2.13.57_71_gb538826
sles15s01:/home/hornc/fs2 #

LNet configuration is tcp(eth0) and tcp1(eth1) with LNet peer discovery disabled (a runtime lnetctl equivalent is sketched below, after the reproduction steps):

sles15s01:/home/hornc/fs2/lustre/tests # cat /etc/modprobe.d/lustre.conf
options libcfs libcfs_debug=320735104
options libcfs libcfs_subsystem_debug=-2049
options lnet lnet_peer_discovery_disabled=1
options lnet ip2nets="tcp(eth0) 192.168.2.[30,32,34,36,38,39,40,41]; tcp1(eth1) 192.168.2.[31,33,35,37,42,43,44,45]"
sles15s01:/home/hornc/fs2/lustre/tests #

Test-framework config:

sles15s01:/home/hornc/fs2/lustre/tests # cat cfg/hornc.sh
# facet hosts
MDSCOUNT=1
mds_HOST=sles15s01
MDSDEV1=/dev/sdc
OSTCOUNT=1
ost_HOST=sles15s03
OSTDEV1=/dev/sde
CLIENTCOUNT=1
RCLIENTS="sles15c01"
PDSH="pdsh -S -Rssh -w"
SHARED_DIRECTORY="/shared/testing"
MGSNID="192.168.2.30@tcp,192.168.2.31@tcp1"
. /home/hornc/fs2/lustre/tests/cfg/ncli.sh
sles15s01:/home/hornc/fs2/lustre/tests #

Use llmount.sh to stand up the filesystem:

sles15s01:/home/hornc/fs2/lustre/tests # NAME=hornc LOAD_MODULES_REMOTE=true VERBOSE=true /home/hornc/fs2/lustre/tests/llmount.sh
...
sles15s01:/home/hornc/fs2/lustre/tests # pdsh -w sles15s0[1,3],sles15c01 lctl list_nids | dshbak -c
----------------
sles15s01
----------------
192.168.2.30@tcp
192.168.2.31@tcp1
----------------
sles15c01
----------------
192.168.2.38@tcp
192.168.2.42@tcp1
----------------
sles15s03
----------------
192.168.2.34@tcp
192.168.2.35@tcp1
sles15s01:/home/hornc/fs2 # pdsh -w sles15s0[1,3],sles15c01 'lnetctl global show' | dshbak -c
----------------
sles15c01,sles15s[01,03]
----------------
global:
numa_range: 0
max_intf: 200
discovery: 0
drop_asym_route: 0
retry_count: 2
transaction_timeout: 50
health_sensitivity: 100
recovery_interval: 1
router_sensitivity: 100
lnd_timeout: 16
response_tracking: 3
sles15s01:/home/hornc/fs2 #
Unmount lustre from the client, then re-mount with a drop rule in place so that any traffic the client sends to eth0 on the OSS (a.k.a. 192.168.2.34@tcp) is dropped:

sles15c01:~ # umount /mnt/lustre
sles15c01:~ # lctl net down; lustre_rmmod
LNET busy
sles15c01:~ # tf_start.sh
Loading modules from /home/hornc/fs2/lustre
detected 1 online CPUs by sysfs
libcfs will create CPU partition based on online CPUs
../libcfs/libcfs/libcfs options: 'libcfs_debug=320735104 libcfs_subsystem_debug=-2049'
../lnet/lnet/lnet options: 'lnet_peer_discovery_disabled=1 ip2nets="tcp(eth0) 192.168.2.[30,32,34,36,38,39,40,41]; tcp1(eth1) 192.168.2.[31,33,35,37,42,43,44,45]" accept=all'
quota/lquota options: 'hash_lqs_cur_bits=3'
sles15c01:~ # lctl set_param debug=+'net rpctrace'
debug=+net rpctrace
sles15c01:~ # lctl get_param debug
debug=
super ioctl neterror net warning dlmtrace error emerg ha rpctrace vfstrace config console lfsck
sles15c01:~ # lctl net_drop_add -s *@tcp -d 192.168.2.34@tcp -r 1 -e remote_timeout
Added drop rule 255.255.255.255@tcp->192.168.2.34@tcp (1/1)
sles15c01:~ # mount -t lustre -o user_xattr,flock 192.168.2.30@tcp,192.168.2.31@tcp1:/lustre /mnt/lustre
sles15c01:~ # lfs check servers
lfs check: error: check 'lustre-OST0000-osc-ffff9c00a8b70000': Resource temporarily unavailable (11)
lustre-MDT0000-mdc-ffff9c00a8b70000 active.
sles15c01:~ #

Wait a couple of minutes and dump the log:

sles15c01:~ # sleep 120; lfs check servers; lctl dk > /tmp/dk.log
lfs check: error: check 'lustre-OST0000-osc-ffff9c00a8b70000': Resource temporarily unavailable (11)
lustre-MDT0000-mdc-ffff9c00a8b70000 active.
sles15c01:~ #

With net tracing enabled, LNet logs every send that it performs. Those entries look like this:

00000400:00000200:0.0:1611862288.864372:0:11737:0:(lib-move.c:1833:lnet_handle_send()) TRACE: 192.168.2.38@tcp(192.168.2.38@tcp:<?>) -> 192.168.2.34@tcp(192.168.2.34@tcp:192.168.2.34@tcp) <?> : GET try# 0

So if LNet had sent any message to an @tcp1 NID, we should have a record of it in the log.
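As an aside, the destination NID of every send can be tallied straight from those TRACE lines. A minimal sketch, assuming the lnet_handle_send() TRACE format shown above (the field after "->" is the final destination):

grep 'lnet_handle_send()) TRACE:' /tmp/dk.log | \
    awk '{ for (i = 1; i <= NF; i++) if ($i == "->") print $(i + 1) }' | \
    cut -d'(' -f1 | sort | uniq -c

On this reproducer the tally should list only @tcp destinations, consistent with the grep output below.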
We can see in the debug log that we haven't sent any messages to any tcp1 NIDs:

sles15c01:~ # grep tcp1 /tmp/dk.log
00000400:02000000:0.0:1611862138.860810:0:11727:0:(api-ni.c:2336:lnet_startup_lndni()) Added LNI 192.168.2.42@tcp1 [8/256/0/180]
00000020:01000004:0.0:1611862159.890388:0:12205:0:(obd_mount.c:968:lmd_print()) device: 192.168.2.30@tcp,192.168.2.31@tcp1:/lustre
00000020:00000080:0.0:1611862159.890446:0:12205:0:(obd_config.c:1383:class_process_config()) adding mapping from uuid MGC192.168.2.30@tcp_0 to nid 0x20001c0a8021f (192.168.2.31@tcp1)
00000020:00000080:0.0:1611862159.900401:0:12218:0:(obd_config.c:1383:class_process_config()) adding mapping from uuid 192.168.2.30@tcp to nid 0x20001c0a8021f (192.168.2.31@tcp1)
00000020:01000004:0.0:1611862159.903289:0:12218:0:(obd_mount.c:1004:lustre_check_exclusion()) Check exclusion lustre-OST0000 (0) in 0 of 192.168.2.30@tcp,192.168.2.31@tcp1:/lustre
00000020:00000080:0.0:1611862159.903298:0:12218:0:(obd_config.c:1383:class_process_config()) adding mapping from uuid 192.168.2.34@tcp to nid 0x20001c0a80223 (192.168.2.35@tcp1)
00000020:00000004:0.0:1611862160.910881:0:12205:0:(obd_mount.c:1683:lustre_fill_super()) Mount 192.168.2.30@tcp,192.168.2.31@tcp1:/lustre complete
sles15c01:~ # grep 192.168.2.35 /tmp/dk.log
00000020:00000080:0.0:1611862159.903298:0:12218:0:(obd_config.c:1383:class_process_config()) adding mapping from uuid 192.168.2.34@tcp to nid 0x20001c0a80223 (192.168.2.35@tcp1)
sles15c01:~ #

Statistics on the OSS confirm no traffic on tcp1:

sles15s03:~ # lnetctl net show --net tcp1 -v 2 | egrep -e send_count -e recv_count
send_count: 0
recv_count: 0
sles15s03:~ #
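For reference, a rough runtime equivalent of the modprobe configuration above, sketched with lnetctl (an illustration only; it assumes the LNet modules are already loaded, and the tcp net on eth0 would be added the same way):

lnetctl set discovery 0                  # same intent as lnet_peer_discovery_disabled=1
lnetctl net add --net tcp1 --if eth1     # bring up the second interface on tcp1
lnetctl global show                      # should report discovery: 0, as in the output above

The key point for this ticket is discovery: 0, which the lnetctl global show output above confirms on all three nodes.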
|
| Comments |
| Comment by Andreas Dilger [ 28/Jan/21 ] |
|
Have you tried this with eth0 and eth1 on different subnets? There are strange things in the kernel TCP code that make it difficult to guarantee sending from one interface or the other when both are on the same subnet. |
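One way to check which local interface the kernel routing table would actually pick for a given peer when both NICs sit on one subnet; a sketch only, using addresses from this ticket:

ip route get 192.168.2.34      # shows the egress device the kernel selects for the OSS @tcp NID
sysctl net.ipv4.conf.all.arp_filter net.ipv4.conf.all.rp_filter      # knobs that influence same-subnet interface behaviour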
| Comment by Chris Horn [ 28/Jan/21 ] |
|
I can try that, though I'm pretty certain it won't make any difference. The tracing on the client is clear. The presence of the net drop rule on the client ensures that any traffic being sent to the @tcp NID will be dropped before it even hits the wire (nay, before it even hits the kernel networking code). Furthermore, the tracing makes it clear that no attempt is made to send to any @tcp1 NID. So whether the OSS is inappropriately using eth0 when it should use eth1, or vice versa, doesn't make any difference. It never receives a single connect request from the client, because those connect RPCs are being sent to the @tcp NID and are thus dropped before hitting the wire. |
| Comment by Chris Horn [ 28/Jan/21 ] |
|
Confirmed that putting the interfaces on different networks doesn't resolve the issue:

sles15s01:~ # pdsh -w sles15s0[1,3],sles15c01 lctl list_nids | dshbak -c
----------------
sles15s01
----------------
192.168.2.30@tcp
10.10.0.50@tcp1
----------------
sles15c01
----------------
192.168.2.38@tcp
10.10.0.54@tcp1
----------------
sles15s03
----------------
192.168.2.34@tcp
10.10.0.52@tcp1
sles15s01:~ #

sles15c01:~ # umount /mnt/lustre
sles15c01:~ # lctl net down; lustre_rmmod
LNET busy
sles15c01:~ # tf_start.sh
Loading modules from /home/hornc/fs2/lustre
detected 1 online CPUs by sysfs
libcfs will create CPU partition based on online CPUs
../libcfs/libcfs/libcfs options: 'libcfs_debug=320735104 libcfs_subsystem_debug=-2049'
../lnet/lnet/lnet options: 'lnet_peer_discovery_disabled=1 ip2nets="tcp(eth0) 192.168.2.[30,32,34,36,38,39,40,41]; tcp1(eth1) 10.10.0.[50,52,54]" accept=all'
quota/lquota options: 'hash_lqs_cur_bits=3'
sles15c01:~ # lctl set_param debug=+"net rpctrace"
debug=+net rpctrace
sles15c01:~ # lctl get_param debug
debug=
super ioctl neterror net warning dlmtrace error emerg ha rpctrace vfstrace config console lfsck
sles15c01:~ # lctl net_drop_add -s *@tcp -d 192.168.2.34@tcp -r 1 -e remote_timeout
Added drop rule 255.255.255.255@tcp->192.168.2.34@tcp (1/1)
sles15c01:~ # mount -t lustre -o user_xattr,flock 192.168.2.30@tcp,10.10.0.50@tcp1:/lustre /mnt/lustre
sles15c01:~ # lfs check servers
lfs check: error: check 'lustre-OST0000-osc-ffff9ab8f6caf000': Resource temporarily unavailable (11)
lustre-MDT0000-mdc-ffff9ab8f6caf000 active.
sles15c01:~ # sleep 120; lfs check servers; lctl dk > /tmp/dk.log
lfs check: error: check 'lustre-OST0000-osc-ffff9ab8f6caf000': Resource temporarily unavailable (11)
lustre-MDT0000-mdc-ffff9ab8f6caf000 active.
sles15c01:~ # grep tcp1 /tmp/dk.log
00000400:02000000:0.0:1611866400.918012:0:7175:0:(api-ni.c:2336:lnet_startup_lndni()) Added LNI 10.10.0.54@tcp1 [8/256/0/180]
00000020:01000004:0.0:1611866458.583906:0:7653:0:(obd_mount.c:968:lmd_print()) device: 192.168.2.30@tcp,10.10.0.50@tcp1:/lustre
00000020:00000080:0.0:1611866458.583963:0:7653:0:(obd_config.c:1383:class_process_config()) adding mapping from uuid MGC192.168.2.30@tcp_0 to nid 0x200010a0a0032 (10.10.0.50@tcp1)
00000020:00000080:0.0:1611866458.594953:0:7666:0:(obd_config.c:1383:class_process_config()) adding mapping from uuid 192.168.2.30@tcp to nid 0x200010a0a0032 (10.10.0.50@tcp1)
00000020:01000004:0.0:1611866458.598044:0:7666:0:(obd_mount.c:1004:lustre_check_exclusion()) Check exclusion lustre-OST0000 (0) in 0 of 192.168.2.30@tcp,10.10.0.50@tcp1:/lustre
00000020:00000080:0.0:1611866458.598052:0:7666:0:(obd_config.c:1383:class_process_config()) adding mapping from uuid 192.168.2.34@tcp to nid 0x200010a0a0034 (10.10.0.52@tcp1)
00000020:00000004:0.0:1611866459.606590:0:7653:0:(obd_mount.c:1683:lustre_fill_super()) Mount 192.168.2.30@tcp,10.10.0.50@tcp1:/lustre complete
sles15c01:~ # grep 10.10.0.52 /tmp/dk.log
00000020:00000080:0.0:1611866458.598052:0:7666:0:(obd_config.c:1383:class_process_config()) adding mapping from uuid 192.168.2.34@tcp to nid 0x200010a0a0034 (10.10.0.52@tcp1)
sles15c01:~ #
|
| Comment by Amir Shehata (Inactive) [ 29/Jan/21 ] |
|
Below is the call flow for setting a connection
+-< import_set_conn
+-< import_set_conn_priority
| +-< ptlrpc_recover_import
| | +-< mdc_iocontrol
| | +-< osc_iocontrol
| | +-< osp_iocontrol
| | +-< ptlrpc_reconnect_import
| | +-< ldebugfs_import_seq_write
| | +-< ptlrpc_set_import_active
+-< client_import_add_conn
| +-< client_obd_setup
| | +-< mgc_setup
| | +-< osc_setup_common
| | +-< lwp_setup
| | +-< osp_init0
+-< client_import_dyn_add_conn
| +-< mgc_apply_recover_logs
| | +-< mgc_process_recover_nodemap_log
The first path, from osc_setup_common(), is the one taken on initial connect. |
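A note on the ldebugfs_import_seq_write() path above: that handler is reached by writing a "connection=<NID>" string to an osc/mdc import file, which then calls ptlrpc_recover_import() against that connection. A sketch of driving it by hand (the device name is taken from the lfs check output earlier in this ticket; the exact "connection=" syntax should be verified against the running version):

lctl set_param osc.lustre-OST0000-osc-*.import=connection=192.168.2.35@tcp1

In this reproducer that would only help if the underlying send could actually reach the @tcp1 NID, which is exactly the behaviour being investigated.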
| Comment by Chris Horn [ 03/Feb/21 ] |
|
I ran an experiment to test the above theory. I modified LNetDist to discover the supplied dstnid, and if that fails then we return < 0. I confirmed that this change is sufficient on the initial mount to get a working connection, i.e. the client uses the working @tcp1 NID rather than the broken @tcp NID. However, if I modify the drop rule to drop @tcp1 traffic and remove the drop rule for @tcp traffic, then we hit the same issue: the client continually tries to connect using @tcp1 and never tries the @tcp NID. I tried another experiment where I modified test-framework.sh so that it formatted the OST using --servicenode=192.168.2.34@tcp:192.168.2.35@tcp1 (i.e. specifying the NIDs like failover partners), and that does resolve the issue (perhaps as expected). |
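For comparison, the two target formats tried over the course of this ticket, as a rough sketch (the exact invocation test-framework.sh generates is not shown here; the fsname, MGS NIDs, index, and OST device are the ones used above):

# format from the original description -- client never tries @tcp1:
mkfs.lustre --reformat --ost --fsname=lustre --index=0 \
    --mgsnode=192.168.2.30@tcp,192.168.2.31@tcp1 \
    --servicenode=192.168.2.34@tcp,192.168.2.35@tcp1 /dev/sde

# format from this comment, NIDs specified like failover partners -- client does fail over:
mkfs.lustre --reformat --ost --fsname=lustre --index=0 \
    --mgsnode=192.168.2.30@tcp,192.168.2.31@tcp1 \
    --servicenode=192.168.2.34@tcp:192.168.2.35@tcp1 /dev/sde

Whether the colon form is actually required here, or whether the comma form should also work, is effectively the open question in this ticket.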
| Comment by Amir Shehata (Inactive) [ 04/Feb/21 ] |
|
Regarding the second test, where you use "--servicenode=192.168.2.34@tcp:192.168.2.35@tcp1": does that work without the LNetDist change, or does it need the LNetDist change to work? |