
Local recovery pings on MR nodes may not exercise all available paths

Details

    • Improvement
    • Resolution: Fixed
    • Minor
    • Lustre 2.15.0

    Description

      Typically, LNet peers do not perform discovery on themselves, so it is often the case that there is a non-MR peer entry for each local interface. For example:

      [root@kjcf01n05 ~]# lctl list_nids
      10.253.100.9@o2ib
      10.253.100.10@o2ib
      [root@kjcf01n05 ~]# lnetctl peer show --nid 10.253.100.9@o2ib
      peer:
          - primary nid: 10.253.100.9@o2ib
            Multi-Rail: False
            peer ni:
              - nid: 10.253.100.9@o2ib
                state: NA
      [root@kjcf01n05 ~]# lnetctl peer show --nid 10.253.100.10@o2ib
      peer:
          - primary nid: 10.253.100.10@o2ib
            Multi-Rail: False
            peer ni:
              - nid: 10.253.100.10@o2ib
                state: NA
      [root@kjcf01n05 ~]#
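
The check above can be scripted. What follows is a minimal sketch, not part of the ticket: it reads `lnetctl peer show` style output on stdin and prints each primary NID together with its Multi-Rail flag. The function name `peer_mr_status` and the `sample` variable are illustrative, and the parsing assumes the YAML layout shown in the example above.

```shell
#!/bin/bash
# Print "<primary nid> <Multi-Rail flag>" for each peer found on stdin.
# Assumes the YAML layout printed by `lnetctl peer show` in the example above.
peer_mr_status() {
	awk '
		/primary nid:/ { nid = $NF }
		/Multi-Rail:/  { print nid, $NF }
	'
}

# Sample input copied from the example above. On a live node the output of
# `lnetctl peer show --nid <nid>` could be piped into the function instead.
sample='peer:
    - primary nid: 10.253.100.9@o2ib
      Multi-Rail: False
      peer ni:
        - nid: 10.253.100.9@o2ib
          state: NA'

peer_mr_status <<< "$sample"
```

For the sample this prints `10.253.100.9@o2ib False`, that is, a non-MR peer entry for the local interface.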
      

      Because of this, LNet sets a "preferred" local NI to use when sending traffic to these non-MR peers. This prevents LNet recovery pings from exercising other paths. For example, consider a node with two local interfaces, heth0 and heth1. The following paths are available for sending to heth0:

       heth0 -> heth0
       heth1 -> heth0

      And the following paths are available for sending to heth1:

       heth0 -> heth1
       heth1 -> heth1

      Because of the preferred-NI logic for non-MR peers, whichever path is chosen first is then used for every future send to that NI (unless the peer entry is deleted, in which case a new path may be chosen). It is not clear whether these local recovery pings are particularly useful for ascertaining the health of local interfaces, but if they are, LNet ought to be allowed to exercise all possible paths.
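
The effect of pinning can be illustrated with a toy model. This is a sketch only and not LNet code: every name in it (`pick_pinned`, `pick_round_robin`, `paths_exercised`, the two tables) is hypothetical. It assumes that the first local NI is the one chosen first and that an unpinned sender would simply rotate through the local NIs. It needs bash 4 or later for associative arrays.

```shell
#!/bin/bash
# Toy model only; this is not LNet code. All names here are hypothetical.
local_nis=(heth0 heth1)
declare -A pref_ni    # destination NI -> pinned source NI
declare -A rr_index   # destination NI -> next round-robin position

# Pinned policy: the first source chosen for a destination is reused for
# every later send. The model assumes the first local NI is chosen first.
# The result is returned in the global variable $src.
pick_pinned() {
	local dst=$1
	[[ -n ${pref_ni[$dst]:-} ]] || pref_ni[$dst]=${local_nis[0]}
	src=${pref_ni[$dst]}
}

# Unpinned policy: rotate through every local NI for each destination.
pick_round_robin() {
	local dst=$1
	local i=${rr_index[$dst]:-0}
	src=${local_nis[i]}
	rr_index[$dst]=$(( (i + 1) % ${#local_nis[@]} ))
}

# Send $2 rounds of pings to every local NI using policy $1 and print
# the distinct "source -> destination" paths that were exercised.
paths_exercised() {
	local policy=$1 rounds=$2 dst n
	for ((n = 0; n < rounds; n++)); do
		for dst in "${local_nis[@]}"; do
			"$policy" "$dst"
			echo "$src -> $dst"
		done
	done | sort -u
}

echo "pinned:"
paths_exercised pick_pinned 4
echo "round robin:"
paths_exercised pick_round_robin 4
```

In this model the pinned policy only ever exercises two of the four paths, however many rounds are run, while the rotating policy covers all four.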

        Activity

          [LU-15446] Local recovery pings on MR nodes may not exercise all available paths
          pjones Peter Jones added a comment -

          Landed for 2.15


          gerrit Gerrit Updater added a comment -

          "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46078/
          Subject: LU-15446 lnet: Don't use pref NI for reserved portal
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: a2815441381cb6cee8eb9865d9279541ea04828e

          hornc Chris Horn added a comment -

          Test report for LU-15446:

          Build/execute test case from patch:

          [hornc@ct7-adm lustre-filesystem]$ git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/78/46078/3 && git checkout FETCH_HEAD
          remote: Counting objects: 3726, done
          remote: Finding sources: 100% (1/1)
          remote: Total 1 (delta 0), reused 1 (delta 0)
          Unpacking objects: 100% (1/1), done.
          From https://review.whamcloud.com/fs/lustre-release
           * branch                  refs/changes/78/46078/3 -> FETCH_HEAD
          Previous HEAD position was 1eecd524de LU-15440 lnet: lnet_peer_data_present() memory leak
          HEAD is now at b79f82c23c LU-15446 lnet: Don't use pref NI for reserved portal
          [hornc@ct7-adm lustre-filesystem]$ git reset --soft HEAD^
          [hornc@ct7-adm lustre-filesystem]$ git status
          HEAD detached from FETCH_HEAD
          Changes to be committed:
            (use "git restore --staged <file>..." to unstage)
          	modified:   lnet/lnet/lib-move.c
          	modified:   lustre/tests/sanity-lnet.sh
          
          Untracked files:
            (use "git add <file>..." to include in what will be committed)
          	lustre/tests/lutf/Makefile.in
          	lustre/tests/lutf/src/Makefile.in
          
          [hornc@ct7-adm lustre-filesystem]$ git reset HEAD lnet/lnet/lib-move.c
          Unstaged changes after reset:
          M	lnet/lnet/lib-move.c
          [hornc@ct7-adm lustre-filesystem]$ git checkout lnet/lnet/lib-move.c
          Updated 1 path from the index
          [hornc@ct7-adm lustre-filesystem]$ git --no-pager diff --cached
          diff --git a/lustre/tests/sanity-lnet.sh b/lustre/tests/sanity-lnet.sh
          index 72e28eb497..c2d6f345e4 100755
          --- a/lustre/tests/sanity-lnet.sh
          +++ b/lustre/tests/sanity-lnet.sh
          @@ -92,6 +92,7 @@ load_lnet() {
           }
          
           do_lnetctl() {
          +	$LCTL mark "$LNETCTL $@"
           	echo "$LNETCTL $@"
           	$LNETCTL "$@"
           }
          @@ -2348,6 +2349,59 @@ test_217() {
           }
           run_test 217 "Don't leak memory when discovering peer with nnis <= 1"
          
          +test_218() {
          +	reinit_dlc || return $?
          +
          +	[[ ${#INTERFACES[@]} -lt 2 ]] &&
          +		skip "Need two LNet interfaces"
          +
          +	add_net "tcp" "${INTERFACES[0]}" || return $?
          +
          +	local nid1=$($LCTL list_nids | head -n 1)
          +
          +	do_lnetctl ping $nid1 ||
          +		error "ping failed"
          +
          +	add_net "tcp" "${INTERFACES[1]}" || return $?
          +
          +	local nid2=$($LCTL list_nids | tail --lines 1)
          +
          +	do_lnetctl ping $nid2 ||
          +		error "ping failed"
          +
          +	$LCTL net_drop_add -s $nid1 -d $nid1 -e local_error -r 1
          +
          +	do_lnetctl ping $nid1 &&
          +		error "ping should have failed"
          +
          +	local health_recovered
          +	local i
          +
          +	for i in $(seq 1 5); do
          +		health_recovered=$($LNETCTL net show -v 2 |
          +				   grep -c 'health value: 1000')
          +
          +		if [[ $health_recovered -ne 2 ]]; then
          +			echo "Wait 1 second for health to recover"
          +			sleep 1
          +		else
          +			break
          +		fi
          +	done
          +
          +	health_recovered=$($LNETCTL net show -v 2 |
          +			   grep -c 'health value: 1000')
          +
          +	$LCTL net_drop_del -a
          +
          +	[[ $health_recovered -ne 2 ]] &&
          +		do_lnetctl net show -v 2 | egrep -e nid -e health &&
          +		error "Health hasn't recovered"
          +
          +	return 0
          +}
          +run_test 218 "Local recovery pings should exercise all available paths"
          +
           test_230() {
           	# LU-12815
           	echo "Check valid values; Should succeed"
          [hornc@ct7-adm lustre-filesystem]$ make -j 32
          ...
          [root@ct7-adm tests]# cat /etc/modprobe.d/lustre.conf
          options lnet networks=tcp(eth0,eth1)
          [root@ct7-adm tests]# ./auster -N -v sanity-lnet --only 218
          Started at Sat Jan 29 03:06:42 UTC 2022
          ct7-adm: executing check_logdir /tmp/test_logs/2022-01-29/030642
          Logging to shared log directory: /tmp/test_logs/2022-01-29/030642
          ct7-adm: executing yml_node
          IOC_LIBCFS_GET_NI error 22: Invalid argument
          Client: 2.14.57.60
          MDS: 2.14.57.60
          OSS: 2.14.57.60
          running: sanity-lnet ONLY=218
          run_suite sanity-lnet /home/hornc/lustre-filesystem/lustre/tests/sanity-lnet.sh
          -----============= acceptance-small: sanity-lnet ============----- Sat Jan 29 03:06:44 UTC 2022
          Running: bash /home/hornc/lustre-filesystem/lustre/tests/sanity-lnet.sh
          excepting tests:
          opening /dev/obd failed: No such file or directory
          hint: the kernel modules may not be loaded
          Stopping clients: ct7-adm /mnt/lustre (opts:-f)
          Stopping clients: ct7-adm /mnt/lustre2 (opts:-f)
          modules unloaded.
          ip netns exec test_ns ip addr add 10.1.2.3/31 dev test1pg
          ip netns exec test_ns ip link set test1pg up
          Loading modules from /home/hornc/lustre-filesystem/lustre
          detected 2 online CPUs by sysfs
          Force libcfs to create 2 CPU partitions
          ../libcfs/libcfs/libcfs options: 'cpu_npartitions=2'
          ../lnet/lnet/lnet options: 'networks=tcp(eth0,eth1) accept=all'
          ptlrpc/ptlrpc options: 'lbug_on_grant_miscount=1'
          quota/lquota options: 'hash_lqs_cur_bits=3'
          /home/hornc/lustre-filesystem/lustre/../lnet/utils/lnetctl net show
          net:
              - net type: lo
                local NI(s):
                  - nid: 0@lo
                    status: up
              - net type: tcp
                local NI(s):
                  - nid: 10.0.2.15@tcp
                    status: up
                    interfaces:
                        0: eth0
                  - nid: 10.73.10.10@tcp
                    status: up
                    interfaces:
                        0: eth1
          1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
              link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
              inet 127.0.0.1/8 scope host lo
                 valid_lft forever preferred_lft forever
              inet6 ::1/128 scope host
                 valid_lft forever preferred_lft forever
          2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
              link/ether 52:54:00:4d:77:d3 brd ff:ff:ff:ff:ff:ff
              inet 10.0.2.15/24 brd 10.0.2.255 scope global noprefixroute dynamic eth0
                 valid_lft 69598sec preferred_lft 69598sec
              inet6 fe80::5054:ff:fe4d:77d3/64 scope link
                 valid_lft forever preferred_lft forever
          3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
              link/ether 08:00:27:27:de:86 brd ff:ff:ff:ff:ff:ff
              inet 10.73.10.10/24 brd 10.73.10.255 scope global noprefixroute eth1
                 valid_lft forever preferred_lft forever
              inet6 fe80::a00:27ff:fe27:de86/64 scope link
                 valid_lft forever preferred_lft forever
          10: test1pl@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
              link/ether 5a:38:d6:b4:bd:2f brd ff:ff:ff:ff:ff:ff link-netnsid 0
              inet6 fe80::5838:d6ff:feb4:bd2f/64 scope link
                 valid_lft forever preferred_lft forever
          Cleaning up LNet
          modules unloaded.
          
          == sanity-lnet test 218: Local recovery pings should exercise all available paths ========================================================== 03:06:49 (1643425609)
          Loading LNet and configuring DLC
          ../lnet/lnet/lnet options: 'networks=tcp(eth0,eth1) accept=all'
          /home/hornc/lustre-filesystem/lustre/../lnet/utils/lnetctl lnet configure
          /home/hornc/lustre-filesystem/lustre/../lnet/utils/lnetctl net add --net tcp --if eth0
          /home/hornc/lustre-filesystem/lustre/../lnet/utils/lnetctl ping 10.0.2.15@tcp
          ping:
              - primary nid: 10.0.2.15@tcp
                Multi-Rail: False
                peer ni:
                  - nid: 10.0.2.15@tcp
          /home/hornc/lustre-filesystem/lustre/../lnet/utils/lnetctl net add --net tcp --if eth1
          /home/hornc/lustre-filesystem/lustre/../lnet/utils/lnetctl ping 10.73.10.10@tcp
          ping:
              - primary nid: 10.73.10.10@tcp
                Multi-Rail: False
                peer ni:
                  - nid: 10.0.2.15@tcp
                  - nid: 10.73.10.10@tcp
          Added drop rule 10.0.2.15@tcp->10.0.2.15@tcp (1/1)
          /home/hornc/lustre-filesystem/lustre/../lnet/utils/lnetctl ping 10.0.2.15@tcp
          manage:
              - ping:
                    errno: -1
                    descr: failed to ping 10.0.2.15@tcp: Input/output error
          
          Wait 1 second for health to recover
          Wait 1 second for health to recover
          Wait 1 second for health to recover
          Wait 1 second for health to recover
          Wait 1 second for health to recover
          Removed 1 drop rules
                  - nid: 0@lo
                    health stats:
                        health value: 0
                  - nid: 10.0.2.15@tcp
                    health stats:
                        health value: 900
                  - nid: 10.73.10.10@tcp
                    health stats:
                        health value: 1000
           sanity-lnet test_218: @@@@@@ FAIL: Health hasn't recovered
            Trace dump:
            = /home/hornc/lustre-filesystem/lustre/tests/test-framework.sh:6336:error()
            = /home/hornc/lustre-filesystem/lustre/tests/sanity-lnet.sh:2399:test_218()
            = /home/hornc/lustre-filesystem/lustre/tests/test-framework.sh:6640:run_one()
            = /home/hornc/lustre-filesystem/lustre/tests/test-framework.sh:6687:run_one_logged()
            = /home/hornc/lustre-filesystem/lustre/tests/test-framework.sh:6513:run_test()
            = /home/hornc/lustre-filesystem/lustre/tests/sanity-lnet.sh:2403:main()
          Dumping lctl log to /tmp/test_logs/2022-01-29/030642/sanity-lnet.test_218.*.1643425617.log
          Dumping logs only on local client.
          FAIL 218 (9s)
          Cleaning up LNet
          opening /dev/obd failed: No such file or directory
          hint: the kernel modules may not be loaded
          modules unloaded.
          sanity-lnet returned 1
          Finished at Sat Jan 29 03:06:59 UTC 2022 in 17s
          ./auster: completed with rc 0
          [root@ct7-adm tests]#
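
To see which NI is holding the count below 2 in a run like the one above, the health output can be filtered. This is a minimal sketch under stated assumptions: the function name `unhealthy_nis` is illustrative, the input is assumed to follow the `nid:` / `health value:` layout printed above, 1000 is taken as full health because that is the value test_218 greps for, and `0@lo` is skipped on the assumption that the loopback NI is not expected to report it.

```shell
#!/bin/bash
# List NIs whose health value is below 1000, reading
# `lnetctl net show -v 2` style output from stdin.
# Skipping 0@lo is an assumption made for this sketch.
unhealthy_nis() {
	awk '
		/- nid:/        { nid = $NF }
		/health value:/ { if (nid != "0@lo" && $NF < 1000) print nid, $NF }
	'
}

# Sample input copied from the failing run above.
sample='        - nid: 0@lo
          health stats:
              health value: 0
        - nid: 10.0.2.15@tcp
          health stats:
              health value: 900
        - nid: 10.73.10.10@tcp
          health stats:
              health value: 1000'

unhealthy_nis <<< "$sample"
```

For the sample this prints `10.0.2.15@tcp 900`, the NI that the drop rule targeted.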
          

          Apply fix and re-test:

          [hornc@ct7-adm lustre-filesystem]$ git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/78/46078/3 && git reset --hard FETCH_HEAD
          From https://review.whamcloud.com/fs/lustre-release
           * branch                  refs/changes/78/46078/3 -> FETCH_HEAD
          HEAD is now at b79f82c23c LU-15446 lnet: Don't use pref NI for reserved portal
          [hornc@ct7-adm lustre-filesystem]$ make -j 32
          ...
          [root@ct7-adm tests]# ./auster -N -v sanity-lnet --only 218
          Started at Sat Jan 29 03:08:13 UTC 2022
          ct7-adm: executing check_logdir /tmp/test_logs/2022-01-29/030812
          Logging to shared log directory: /tmp/test_logs/2022-01-29/030812
          ct7-adm: executing yml_node
          IOC_LIBCFS_GET_NI error 22: Invalid argument
          Client: 2.14.57.60
          MDS: 2.14.57.60
          OSS: 2.14.57.60
          running: sanity-lnet ONLY=218
          run_suite sanity-lnet /home/hornc/lustre-filesystem/lustre/tests/sanity-lnet.sh
          -----============= acceptance-small: sanity-lnet ============----- Sat Jan 29 03:08:15 UTC 2022
          Running: bash /home/hornc/lustre-filesystem/lustre/tests/sanity-lnet.sh
          excepting tests:
          opening /dev/obd failed: No such file or directory
          hint: the kernel modules may not be loaded
          Stopping clients: ct7-adm /mnt/lustre (opts:-f)
          Stopping clients: ct7-adm /mnt/lustre2 (opts:-f)
          modules unloaded.
          ip netns exec test_ns ip addr add 10.1.2.3/31 dev test1pg
          ip netns exec test_ns ip link set test1pg up
          Loading modules from /home/hornc/lustre-filesystem/lustre
          detected 2 online CPUs by sysfs
          Force libcfs to create 2 CPU partitions
          ../libcfs/libcfs/libcfs options: 'cpu_npartitions=2'
          ../lnet/lnet/lnet options: 'networks=tcp(eth0,eth1) accept=all'
          ptlrpc/ptlrpc options: 'lbug_on_grant_miscount=1'
          quota/lquota options: 'hash_lqs_cur_bits=3'
          /home/hornc/lustre-filesystem/lustre/../lnet/utils/lnetctl net show
          net:
              - net type: lo
                local NI(s):
                  - nid: 0@lo
                    status: up
              - net type: tcp
                local NI(s):
                  - nid: 10.0.2.15@tcp
                    status: up
                    interfaces:
                        0: eth0
                  - nid: 10.73.10.10@tcp
                    status: up
                    interfaces:
                        0: eth1
          1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
              link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
              inet 127.0.0.1/8 scope host lo
                 valid_lft forever preferred_lft forever
              inet6 ::1/128 scope host
                 valid_lft forever preferred_lft forever
          2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
              link/ether 52:54:00:4d:77:d3 brd ff:ff:ff:ff:ff:ff
              inet 10.0.2.15/24 brd 10.0.2.255 scope global noprefixroute dynamic eth0
                 valid_lft 69508sec preferred_lft 69508sec
              inet6 fe80::5054:ff:fe4d:77d3/64 scope link
                 valid_lft forever preferred_lft forever
          3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
              link/ether 08:00:27:27:de:86 brd ff:ff:ff:ff:ff:ff
              inet 10.73.10.10/24 brd 10.73.10.255 scope global noprefixroute eth1
                 valid_lft forever preferred_lft forever
              inet6 fe80::a00:27ff:fe27:de86/64 scope link
                 valid_lft forever preferred_lft forever
          11: test1pl@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
              link/ether 8e:e0:41:a6:d6:1c brd ff:ff:ff:ff:ff:ff link-netnsid 0
              inet6 fe80::8ce0:41ff:fea6:d61c/64 scope link
                 valid_lft forever preferred_lft forever
          Cleaning up LNet
          modules unloaded.
          
          == sanity-lnet test 218: Local recovery pings should exercise all available paths ========================================================== 03:08:20 (1643425700)
          Loading LNet and configuring DLC
          ../lnet/lnet/lnet options: 'networks=tcp(eth0,eth1) accept=all'
          /home/hornc/lustre-filesystem/lustre/../lnet/utils/lnetctl lnet configure
          /home/hornc/lustre-filesystem/lustre/../lnet/utils/lnetctl net add --net tcp --if eth0
          /home/hornc/lustre-filesystem/lustre/../lnet/utils/lnetctl ping 10.0.2.15@tcp
          ping:
              - primary nid: 10.0.2.15@tcp
                Multi-Rail: False
                peer ni:
                  - nid: 10.0.2.15@tcp
          /home/hornc/lustre-filesystem/lustre/../lnet/utils/lnetctl net add --net tcp --if eth1
          /home/hornc/lustre-filesystem/lustre/../lnet/utils/lnetctl ping 10.73.10.10@tcp
          ping:
              - primary nid: 10.73.10.10@tcp
                Multi-Rail: False
                peer ni:
                  - nid: 10.0.2.15@tcp
                  - nid: 10.73.10.10@tcp
          Added drop rule 10.0.2.15@tcp->10.0.2.15@tcp (1/1)
          /home/hornc/lustre-filesystem/lustre/../lnet/utils/lnetctl ping 10.0.2.15@tcp
          manage:
              - ping:
                    errno: -1
                    descr: failed to ping 10.0.2.15@tcp: Input/output error
          
          Removed 1 drop rules
          PASS 218 (2s)
          == sanity-lnet test complete, duration 7 sec ============= 03:08:22 (1643425702)
          Cleaning up LNet
          opening /dev/obd failed: No such file or directory
          hint: the kernel modules may not be loaded
          modules unloaded.
          sanity-lnet returned 0
          Finished at Sat Jan 29 03:08:25 UTC 2022 in 13s
          ./auster: completed with rc 0
          [root@ct7-adm tests]#
          

          gerrit Gerrit Updater added a comment -

          "Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46078
          Subject: LU-15446 lnet: Don't use pref NI for reserved portal
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 011a77e02255925eb29deaf6dddb24c2d969152d

          People

            hornc Chris Horn
            hornc Chris Horn