[LU-14381] Client stuck using single NID when multiple are available Created: 28/Jan/21  Updated: 12/Nov/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Chris Horn Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Attachments: File client.dklog    
Issue Links:
Duplicate
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Filing this as a bug, but it might be user error. The problem I'm seeing is that clients only attempt to connect to one of the two available NIDs on the servers. Both clients and servers are configured with two NIDs, one on @tcp and one on @tcp1. Discovery/multi-rail is disabled. If I simulate failure of the @tcp NID on the OSS, then I would expect the client to eventually try the @tcp1 NID, but that never happens. I tried modifying test-framework.sh so that it formats the OST with --servicenode=192.168.2.34@tcp,192.168.2.35@tcp1, but that didn't make any difference. Is this working as expected? Is this a bug? Am I missing some necessary config to allow the client to use either NID?

Edit 1: There was a suggestion to put the interfaces on separate subnets. I tried that and it did not resolve the issue; see the comment below.

Edit 2: Attached a -1 debug log from the client, client.dklog. Note this debug log was captured after I changed the configuration to put the interfaces on separate networks, as noted in the comment below.

Details on how to reproduce follow.

Version under test:

sles15s01:/home/hornc/fs2 # ./LUSTRE-VERSION-GEN
2.13.57_71_gb538826
sles15s01:/home/hornc/fs2 #

LNet configuration is tcp(eth0) and tcp1(eth1) with LNet peer discovery disabled:

sles15s01:/home/hornc/fs2/lustre/tests # cat /etc/modprobe.d/lustre.conf
options libcfs libcfs_debug=320735104
options libcfs libcfs_subsystem_debug=-2049
options lnet lnet_peer_discovery_disabled=1
options lnet ip2nets="tcp(eth0) 192.168.2.[30,32,34,36,38,39,40,41]; tcp1(eth1) 192.168.2.[31,33,35,37,42,43,44,45]"
sles15s01:/home/hornc/fs2/lustre/tests #

Test-framework config:

sles15s01:/home/hornc/fs2/lustre/tests # cat cfg/hornc.sh
# facet hosts
MDSCOUNT=1

mds_HOST=sles15s01
MDSDEV1=/dev/sdc

OSTCOUNT=1

ost_HOST=sles15s03
OSTDEV1=/dev/sde

CLIENTCOUNT=1
RCLIENTS="sles15c01"
PDSH="pdsh -S -Rssh -w"

SHARED_DIRECTORY="/shared/testing"
MGSNID="192.168.2.30@tcp,192.168.2.31@tcp1"

. /home/hornc/fs2/lustre/tests/cfg/ncli.sh
sles15s01:/home/hornc/fs2/lustre/tests #

Use llmount.sh to stand up the filesystem:

sles15s01:/home/hornc/fs2/lustre/tests # NAME=hornc LOAD_MODULES_REMOTE=true VERBOSE=true /home/hornc/fs2/lustre/tests/llmount.sh
...
sles15s01:/home/hornc/fs2/lustre/tests # pdsh -w sles15s0[1,3],sles15c01 lctl list_nids | dshbak -c
----------------
sles15s01
----------------
192.168.2.30@tcp
192.168.2.31@tcp1
----------------
sles15c01
----------------
192.168.2.38@tcp
192.168.2.42@tcp1
----------------
sles15s03
----------------
192.168.2.34@tcp
192.168.2.35@tcp1
sles15s01:/home/hornc/fs2 # pdsh -w sles15s0[1,3],sles15c01 'lnetctl global show' | dshbak -c
----------------
sles15c01,sles15s[01,03]
----------------
global:
    numa_range: 0
    max_intf: 200
    discovery: 0
    drop_asym_route: 0
    retry_count: 2
    transaction_timeout: 50
    health_sensitivity: 100
    recovery_interval: 1
    router_sensitivity: 100
    lnd_timeout: 16
    response_tracking: 3
sles15s01:/home/hornc/fs2 #

Unmount Lustre from the client, then re-mount with a drop rule in place so that any traffic the client sends to eth0 on the OSS (i.e. 192.168.2.34@tcp) is dropped:

sles15c01:~ # umount /mnt/lustre
sles15c01:~ # lctl net down; lustre_rmmod
LNET busy
sles15c01:~ # tf_start.sh
Loading modules from /home/hornc/fs2/lustre
detected 1 online CPUs by sysfs
libcfs will create CPU partition based on online CPUs
../libcfs/libcfs/libcfs options: 'libcfs_debug=320735104 libcfs_subsystem_debug=-2049'
../lnet/lnet/lnet options: 'lnet_peer_discovery_disabled=1 ip2nets="tcp(eth0) 192.168.2.[30,32,34,36,38,39,40,41]; tcp1(eth1) 192.168.2.[31,33,35,37,42,43,44,45]" accept=all'
quota/lquota options: 'hash_lqs_cur_bits=3'
sles15c01:~ # lctl set_param debug=+'net rpctrace'
debug=+net rpctrace
sles15c01:~ # lctl get_param debug
debug=
super ioctl neterror net warning dlmtrace error emerg ha rpctrace vfstrace config console lfsck
sles15c01:~ # lctl net_drop_add -s *@tcp -d 192.168.2.34@tcp -r 1 -e remote_timeout
Added drop rule 255.255.255.255@tcp->192.168.2.34@tcp (1/1)
sles15c01:~ # mount -t lustre -o user_xattr,flock 192.168.2.30@tcp,192.168.2.31@tcp1:/lustre /mnt/lustre
sles15c01:~ # lfs check servers
lfs check: error: check 'lustre-OST0000-osc-ffff9c00a8b70000': Resource temporarily unavailable (11)
lustre-MDT0000-mdc-ffff9c00a8b70000 active.
sles15c01:~ #

Wait a couple minutes and dump the log:

sles15c01:~ # sleep 120; lfs check servers; lctl dk > /tmp/dk.log
lfs check: error: check 'lustre-OST0000-osc-ffff9c00a8b70000': Resource temporarily unavailable (11)
lustre-MDT0000-mdc-ffff9c00a8b70000 active.
sles15c01:~ #

With net trace enabled, LNet logs every send that it performs. Those entries look like this:

00000400:00000200:0.0:1611862288.864372:0:11737:0:(lib-move.c:1833:lnet_handle_send()) TRACE: 192.168.2.38@tcp(192.168.2.38@tcp:<?>) -> 192.168.2.34@tcp(192.168.2.34@tcp:192.168.2.34@tcp) <?> : GET try# 0

So if LNet sent any message to an @tcp1 NID, we should have a record of it in the log.
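
For reference, one quick way to summarize which destination NIDs the client actually sent to is to tally the TRACE lines. This is only a sketch, assuming the whitespace layout of the lnet_handle_send() TRACE line shown above (field positions may differ between versions):

grep 'lnet_handle_send()) TRACE' /tmp/dk.log | awk '{sub(/\(.*/, "", $5); print $5}' | sort | uniq -c

This prints a per-destination-NID send count; for the log captured here, only @tcp destinations would appear.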

We can see in the debug log that we haven't sent any messages to any tcp1 NIDs:

sles15c01:~ # grep tcp1 /tmp/dk.log
00000400:02000000:0.0:1611862138.860810:0:11727:0:(api-ni.c:2336:lnet_startup_lndni()) Added LNI 192.168.2.42@tcp1 [8/256/0/180]
00000020:01000004:0.0:1611862159.890388:0:12205:0:(obd_mount.c:968:lmd_print()) device:  192.168.2.30@tcp,192.168.2.31@tcp1:/lustre
00000020:00000080:0.0:1611862159.890446:0:12205:0:(obd_config.c:1383:class_process_config()) adding mapping from uuid MGC192.168.2.30@tcp_0 to nid 0x20001c0a8021f (192.168.2.31@tcp1)
00000020:00000080:0.0:1611862159.900401:0:12218:0:(obd_config.c:1383:class_process_config()) adding mapping from uuid 192.168.2.30@tcp to nid 0x20001c0a8021f (192.168.2.31@tcp1)
00000020:01000004:0.0:1611862159.903289:0:12218:0:(obd_mount.c:1004:lustre_check_exclusion()) Check exclusion lustre-OST0000 (0) in 0 of 192.168.2.30@tcp,192.168.2.31@tcp1:/lustre
00000020:00000080:0.0:1611862159.903298:0:12218:0:(obd_config.c:1383:class_process_config()) adding mapping from uuid 192.168.2.34@tcp to nid 0x20001c0a80223 (192.168.2.35@tcp1)
00000020:00000004:0.0:1611862160.910881:0:12205:0:(obd_mount.c:1683:lustre_fill_super()) Mount 192.168.2.30@tcp,192.168.2.31@tcp1:/lustre complete
sles15c01:~ # grep 192.168.2.35 /tmp/dk.log
00000020:00000080:0.0:1611862159.903298:0:12218:0:(obd_config.c:1383:class_process_config()) adding mapping from uuid 192.168.2.34@tcp to nid 0x20001c0a80223 (192.168.2.35@tcp1)
sles15c01:~ #

Statistics on the OSS confirm no traffic on tcp1:

sles15s03:~ # lnetctl net show --net tcp1 -v 2 | egrep -e send_count -e recv_count
              send_count: 0
              recv_count: 0
sles15s03:~ #


 Comments   
Comment by Andreas Dilger [ 28/Jan/21 ]

Have you tried this with eth0 and eth1 on different subnets? There are strange things in the kernel TCP code that make it difficult to guarantee sending from one interface or the other when both are on the same subnet.
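
For reference, the same-subnet behavior being alluded to is most likely the Linux "ARP flux" issue: with eth0 and eth1 on one subnet, the kernel may answer ARP for either address on either interface and picks the egress interface from the routing table rather than from the bound source address. If the interfaces do have to share a subnet, the usual mitigation is to tighten the ARP sysctls; this is a sketch only, not something that was tried in this ticket:

sysctl -w net.ipv4.conf.all.arp_filter=1     # only answer ARP on the interface the kernel would route back out of
sysctl -w net.ipv4.conf.all.arp_ignore=1     # only answer for addresses configured on the receiving interface
sysctl -w net.ipv4.conf.all.arp_announce=2   # always use the best local address as the ARP source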

Comment by Chris Horn [ 28/Jan/21 ]

I can try that, though I'm pretty certain it won't make any difference. The tracing on the client is clear. The presence of the net drop rule on the client ensures that any traffic being sent to the @tcp NID will be dropped before it even hits the wire (nay, before it even hits the kernel networking code). Furthermore, the tracing makes it clear that no attempt is made to send to any @tcp1 NID. So whether the OSS is inappropriately using eth0 when it should use eth1, or vice versa, doesn't make any difference. It never receives a single connect request from the client, because those connect RPCs are being sent to the @tcp NID and are thus dropped before hitting the wire.

Comment by Chris Horn [ 28/Jan/21 ]

Confirmed that putting the interfaces on different networks doesn't resolve the issue:

sles15s01:~ # pdsh -w sles15s0[1,3],sles15c01 lctl list_nids | dshbak -c
----------------
sles15s01
----------------
192.168.2.30@tcp
10.10.0.50@tcp1
----------------
sles15c01
----------------
192.168.2.38@tcp
10.10.0.54@tcp1
----------------
sles15s03
----------------
192.168.2.34@tcp
10.10.0.52@tcp1
sles15s01:~ #

sles15c01:~ # umount /mnt/lustre
sles15c01:~ # lctl net down; lustre_rmmod
LNET busy
sles15c01:~ # tf_start.sh
Loading modules from /home/hornc/fs2/lustre
detected 1 online CPUs by sysfs
libcfs will create CPU partition based on online CPUs
../libcfs/libcfs/libcfs options: 'libcfs_debug=320735104 libcfs_subsystem_debug=-2049'
../lnet/lnet/lnet options: 'lnet_peer_discovery_disabled=1 ip2nets="tcp(eth0) 192.168.2.[30,32,34,36,38,39,40,41]; tcp1(eth1) 10.10.0.[50,52,54]" accept=all'
quota/lquota options: 'hash_lqs_cur_bits=3'
sles15c01:~ # lctl set_param debug=+"net rpctrace"
debug=+net rpctrace
sles15c01:~ # lctl get_param debug
debug=
super ioctl neterror net warning dlmtrace error emerg ha rpctrace vfstrace config console lfsck
sles15c01:~ # lctl net_drop_add -s *@tcp -d 192.168.2.34@tcp -r 1 -e remote_timeout
Added drop rule 255.255.255.255@tcp->192.168.2.34@tcp (1/1)
sles15c01:~ # mount -t lustre -o user_xattr,flock 192.168.2.30@tcp,10.10.0.50@tcp1:/lustre /mnt/lustre
sles15c01:~ # lfs check servers
lfs check: error: check 'lustre-OST0000-osc-ffff9ab8f6caf000': Resource temporarily unavailable (11)
lustre-MDT0000-mdc-ffff9ab8f6caf000 active.
sles15c01:~ # sleep 120; lfs check servers; lctl dk > /tmp/dk.log
lfs check: error: check 'lustre-OST0000-osc-ffff9ab8f6caf000': Resource temporarily unavailable (11)
lustre-MDT0000-mdc-ffff9ab8f6caf000 active.
sles15c01:~ # grep tcp1 /tmp/dk.log
00000400:02000000:0.0:1611866400.918012:0:7175:0:(api-ni.c:2336:lnet_startup_lndni()) Added LNI 10.10.0.54@tcp1 [8/256/0/180]
00000020:01000004:0.0:1611866458.583906:0:7653:0:(obd_mount.c:968:lmd_print()) device:  192.168.2.30@tcp,10.10.0.50@tcp1:/lustre
00000020:00000080:0.0:1611866458.583963:0:7653:0:(obd_config.c:1383:class_process_config()) adding mapping from uuid MGC192.168.2.30@tcp_0 to nid 0x200010a0a0032 (10.10.0.50@tcp1)
00000020:00000080:0.0:1611866458.594953:0:7666:0:(obd_config.c:1383:class_process_config()) adding mapping from uuid 192.168.2.30@tcp to nid 0x200010a0a0032 (10.10.0.50@tcp1)
00000020:01000004:0.0:1611866458.598044:0:7666:0:(obd_mount.c:1004:lustre_check_exclusion()) Check exclusion lustre-OST0000 (0) in 0 of 192.168.2.30@tcp,10.10.0.50@tcp1:/lustre
00000020:00000080:0.0:1611866458.598052:0:7666:0:(obd_config.c:1383:class_process_config()) adding mapping from uuid 192.168.2.34@tcp to nid 0x200010a0a0034 (10.10.0.52@tcp1)
00000020:00000004:0.0:1611866459.606590:0:7653:0:(obd_mount.c:1683:lustre_fill_super()) Mount 192.168.2.30@tcp,10.10.0.50@tcp1:/lustre complete
sles15c01:~ # grep 10.10.0.52 /tmp/dk.log
00000020:00000080:0.0:1611866458.598052:0:7666:0:(obd_config.c:1383:class_process_config()) adding mapping from uuid 192.168.2.34@tcp to nid 0x200010a0a0034 (10.10.0.52@tcp1)
sles15c01:~ #
Comment by Amir Shehata (Inactive) [ 29/Jan/21 ]

Below is the call flow for setting up a connection:

  +-< import_set_conn
    +-< import_set_conn_priority
    | +-< ptlrpc_recover_import
    | | +-< mdc_iocontrol
    | | +-< osc_iocontrol
    | | +-< osp_iocontrol
    | | +-< ptlrpc_reconnect_import
    | | +-< ldebugfs_import_seq_write
    | | +-< ptlrpc_set_import_active
    +-< client_import_add_conn
    | +-< client_obd_setup
    | | +-< mgc_setup
    | | +-< osc_setup_common
    | | +-< lwp_setup
    | | +-< osp_init0
    +-< client_import_dyn_add_conn
    | +-< mgc_apply_recover_logs
    | | +-< mgc_process_recover_nodemap_log

The first path is from osc_setup_common(), i.e. on the initial connect. The second one is from ptlrpc_recover_import(), i.e. on failover. Both of them eventually funnel into ptlrpc_uuid_to_peer(), which calls LNetDist() (as far as I understand). So if LNetDist() took into consideration the status of the NID, reachable or not (it doesn't right now), then it seems like we should be able to switch away from the NID currently in use if it goes down. Because LNetDist() only uses the LNet configuration to determine the distance, the potentially-down NID will always be selected.
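
For reference, the per-NID health state that LNetDist() does not consult is tracked by LNet health and can be inspected from userspace; a sketch, assuming the health fields exposed by lnetctl in this version (field names and verbosity levels vary):

# how the client currently rates the server's NIDs (health value, failure counts)
lnetctl peer show --nid 192.168.2.34@tcp -v 3
# health of the client's own local NIs
lnetctl net show -v 3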

Comment by Chris Horn [ 03/Feb/21 ]

I ran an experiment to test the above theory. I modified LNetDist() to discover the supplied dstnid and to return < 0 if that fails. I confirmed that this change is sufficient, on the initial mount, to get a working connection, i.e. the client uses the working @tcp1 NID rather than the broken @tcp NID. However, if I modify the drop rule to drop @tcp1 traffic and remove the drop rule for @tcp traffic, then we hit the same issue: the client continually tries to connect using @tcp1 and never tries the @tcp NID.

I tried another experiment where I modified test-framework.sh so that it formatted the OST using --servicenode=192.168.2.34@tcp:192.168.2.35@tcp1 (i.e. specifying the NIDs like failover partners), and that does resolve the issue (perhaps as expected).
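
For illustration, roughly what the two formats correspond to at the mkfs.lustre level; this is a sketch only, since the real invocation is assembled by test-framework.sh and other options are omitted here:

# comma-separated: both NIDs described as NIDs of one service node; client stayed stuck on @tcp
mkfs.lustre --reformat --ost --index=0 --fsname=lustre --mgsnode=192.168.2.30@tcp,192.168.2.31@tcp1 --servicenode=192.168.2.34@tcp,192.168.2.35@tcp1 /dev/sde

# colon-separated: NIDs presented like failover partners; client switches to @tcp1 when @tcp is down
mkfs.lustre --reformat --ost --index=0 --fsname=lustre --mgsnode=192.168.2.30@tcp,192.168.2.31@tcp1 --servicenode=192.168.2.34@tcp:192.168.2.35@tcp1 /dev/sde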

Comment by Amir Shehata (Inactive) [ 04/Feb/21 ]

Regarding the second test, where you use "--servicenode=192.168.2.34@tcp:192.168.2.35@tcp1": does that work without the LNetDist() change, or does it need the LNetDist() change to work?
