Lustre / LU-16244

After dropping a router during ior, lst add_group fails with "create session RPC failed on 12345-192.168.128.110@o2ib44: Unknown error -22", and lnetctl ping fails intermittently

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.15.1
    • Environment: lustre-2.15.1_7.llnl
      4.18.0-372.26.1.1toss.t4.x86_64
      omnipath
    • Severity: 3

    Description

      (Update: this ticket was created while investigating another issue which we think we've reproduced.)

      After running a test in which a router (mutt2) was powered off during an ior run and then brought back up, we observe problems in the logs: one of the client nodes involved in the ior (mutt18) keeps reporting that its route through mutt2 is going up and down.

      After mutt2 came back up, we observe lctl ping to it failing intermittently. Similarly, we see lnet_selftest fail between mutt2 and mutt18.

      After mutt2 is back up, we start seeing LNet and LNetError console log messages:

      • On the router node that was rebooted, mutt2, console log messages like these with the NID of a compute node
        • "LNetError.*kiblnd_cm_callback.*DISCONNECTED"
        • "LNetError.*kiblnd_reconnect_peer.*Abort reconnection of"
      • On the router nodes that were not rebooted, console log messages like these
        • "LNet:.*kiblnd_handle_rx.*PUT_NACK" with the NID of a garter node (an OSS)
      • On the compute node mutt18, console log messages like these with the NID of the router that was power cycled
        • "LnetError.*kiblnd_post_locked.*Error -22 posting transmit"
        • "LNetError.*lnet_set_route_aliveness.*route to ... has gone from up to down"

      From the mutt18 console logs, 192.168.128.2@o2ib44 is mutt2:

      2022-10-27 09:40:42 [577896.982939] LNetError: 201529:0:(lib-lnet.h:1241:lnet_set_route_aliveness()) route to o2ib100 through 192.168.128.2@o2ib44 has gone from down to up
      2022-10-27 11:42:44 [585218.762479] LNetError: 201549:0:(lib-lnet.h:1241:lnet_set_route_aliveness()) route to o2ib100 through 192.168.128.2@o2ib44 has gone from up to down
      2022-10-27 11:42:48 [585222.858555] LNetError: 201531:0:(lib-lnet.h:1241:lnet_set_route_aliveness()) route to o2ib100 through 192.168.128.2@o2ib44 has gone from down to up
      2022-10-27 12:27:25 [587899.915999] LNetError: 201515:0:(lib-lnet.h:1241:lnet_set_route_aliveness()) route to o2ib100 through 192.168.128.2@o2ib44 has gone from up to down
      2022-10-27 12:27:48 [587923.166182] LNetError: 201529:0:(lib-lnet.h:1241:lnet_set_route_aliveness()) route to o2ib100 through 192.168.128.2@o2ib44 has gone from down to up
      2022-10-27 12:44:56 [588951.269523] LNetError: 201549:0:(lib-lnet.h:1241:lnet_set_route_aliveness()) route to o2ib100 through 192.168.128.2@o2ib44 has gone from up to down
      2022-10-27 12:44:57 [588952.294216] LNetError: 201529:0:(lib-lnet.h:1241:lnet_set_route_aliveness()) route to o2ib100 through 192.168.128.2@o2ib44 has gone from down to up
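
      A minimal sketch of how the route state can be watched from mutt18 while this is happening (assumes lnetctl is available and that the remote net is o2ib100 and the router NID is 192.168.128.2@o2ib44, as in the messages above; the polling loop itself is illustrative):

      # Poll the o2ib100 route state and ping the router NID once a second,
      # timestamping failures so they can be correlated with the console
      # messages above.
      while true; do
          state=$(lnetctl route show --net o2ib100 --verbose 2>/dev/null | awk '/state:/ {print $2; exit}')
          lnetctl ping 192.168.128.2@o2ib44 > /dev/null 2>&1 || \
              echo "$(date +'%F %T') ping to 192.168.128.2@o2ib44 failed (route state: ${state})"
          sleep 1
      done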

      Original description:

      An LNet selftest session cannot be created because lst add_group fails:

      # lst add_group anodes mutt110@o2ib44
      create session RPC failed on 12345-192.168.128.110@o2ib44: Unknown error -22
      No nodes added successfully, deleting group anodes
      Group is deleted
      
      # lctl ping mutt110@o2ib44
      12345-0@lo
      12345-192.168.128.110@o2ib44
      
      ## I had already loaded the lnet_selftest module on mutt110, but the error is the same either way
      # pdsh -w emutt110 lsmod | grep lnet_selftest
      emutt110: lnet_selftest         270336  0
      emutt110: lnet                  704512  9 osc,ko2iblnd,obdclass,ptlrpc,lnet_selftest,mgc,lmv,lustre
      emutt110: libcfs                266240  13 fld,lnet,osc,fid,ko2iblnd,obdclass,ptlrpc,lnet_selftest,mgc,lov,mdc,lmv,lustre
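
      A minimal sketch of how the module can be loaded on both ends before creating the session (assumes the same pdsh hostname as above; lnet_selftest has to be loaded on every node lst touches):

      # Load lnet_selftest on the local lst console node and on the target
      # node, then confirm it is present before running lst add_group.
      modprobe lnet_selftest
      pdsh -w emutt110 modprobe lnet_selftest
      pdsh -w emutt110 'lsmod | grep lnet_selftest'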

      In addition to the lnet_selftest failure, we see intermittent ping failures, so the issue is not lnet_selftest itself (as I first believed) but a more general problem.
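
      A minimal sketch of how those intermittent failures can be quantified (assumes the NID from the transcript above; the loop and count are illustrative):

      # Issue 100 pings to the problem NID and report how many fail.
      fail=0
      for i in $(seq 1 100); do
          lctl ping 192.168.128.110@o2ib44 > /dev/null 2>&1 || fail=$((fail + 1))
      done
      echo "${fail} of 100 pings to 192.168.128.110@o2ib44 failed"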

      For my tracking purposes, our local ticket is TOSS5812

      Attachments

        1. garter_1-8_debug_with_net.gz
          28.34 MB
        2. mutt_10-11_17-18_27-28_30-31_debug_with_net.gz
          5.11 MB
        3. mutt_1-4_debug_with_net.gz
          15.20 MB
        4. mutt2-mutt18.tar.gz
          4.93 MB
        5. with-debug.try1.tar.bz2
          66.57 MB


          Activity


            defazio Gian-Carlo Defazio added a comment -

            Hi Cyril,

            After attempting some additional IOR tests, I do see one of the same major symptoms as before. That is, the ior job starts out with normal bandwidth, then stops and has only short periods of activity; it appears stuck (transferring 0 MB/s) most of the time. The job may or may not time out in this condition. Anyway, it now looks like dropping a router did trigger the same, or at least very similar, symptoms when Olaf did his test on Tuesday, so you can go ahead and look at the files Olaf mentioned in the post above, with-debug.try1.tar.bz2.

            ofaaland Olaf Faaland added a comment -

            Hi Cyril,

            For now, ignore with-debug.try1.tar.bz2, which I uploaded. I see now that, following the same steps, I'm getting only a subset of the symptoms reported originally. The Lustre servers have been updated with a newer version of Lustre since then, so I'm going to take a look at the patches and see whether that might explain it, and maybe revert the servers and try again.

            Olaf


            "Cyril Bordage <cbordage@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49297
            Subject: LU-16244 debug: add debug messages
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 4a22255e05d315ebfe140c929257d1e4567cf8a2

            ofaaland Olaf Faaland added a comment - - edited

            Hi Cyril,
            Yes, we can patch Lustre to get more information. If you push the patch to Gerrit so it goes through the regular test cycle, that would be ideal.
            thanks


            cbordage Cyril Bordage added a comment -

            Thank you for the updates with all the logs.

            I am not sure I understood everything that was going on… I will keep working on the logs. Would it be possible to patch your Lustre version to add more debug info?

            ofaaland Olaf Faaland added a comment -

            mutt is running this lustre: tag 2.15.1_7.llnl, https://github.com/LLNL/lustre/releases/tag/2.15.1_7.llnl


            defazio Gian-Carlo Defazio added a comment -

            I've uploaded logs for the nodes involved in the stalled ior job; these logs were collected after the job was already stuck.

            garter_1-8_debug_with_net.gz (servers)

            mutt_1-4_debug_with_net.gz (routers)

            mutt_10-11_17-18_27-28_30-31_debug_with_net.gz (clients)

            defazio Gian-Carlo Defazio added a comment - - edited

            I've uploaded mutt2-mutt18.tar.gz which has lnet_selftests and lnetctl pings between mutt2 (router) and mutt18 (client).

            Note that these are not the same nodes as in the above posts; they are the nodes mentioned in the updated ticket description.

            The lnet_selftest wrapper used is from https://wiki.lustre.org/LNET_Selftest, with the NIDs for mutt2 and mutt18.

             

            #!/bin/sh
            #
            # Simple wrapper script for LNET Selftest
            ## Parameters are supplied as environment variables
            # The defaults are reasonable for quick verification.
            # For in-depth benchmarking, increase the time (TM)
            # variable to e.g. 60 seconds, and iterate over
            # concurrency to find optimal values.
            #
            # Reference: http://wiki.lustre.org/LNET_Selftest

            # Concurrency
            CN=${CN:-32}
            # Size
            SZ=${SZ:-1M}
            # Length of time to run test (secs)
            TM=${TM:-10}
            # Which BRW test to run (read or write)
            BRW=${BRW:-"read"}
            # Checksum calculation (simple or full)
            CKSUM=${CKSUM:-"simple"}

            # The LST "from" list -- e.g. Lustre clients. Space separated list of NIDs.
            # LFROM="10.10.2.21@tcp"
            LFROM=192.168.128.18@o2ib44
            # The LST "to" list -- e.g. Lustre servers. Space separated list of NIDs.
            # LTO="10.10.2.22@tcp"
            LTO=192.168.128.2@o2ib44
            
            ### End of customisation.
            
            export LST_SESSION=$$
            echo LST_SESSION = ${LST_SESSION}
            lst new_session lst${BRW}
            lst add_group lfrom ${LFROM}
            lst add_group lto ${LTO}
            lst add_batch bulk_${BRW}
            lst add_test --batch bulk_${BRW} --from lfrom --to lto brw ${BRW} \
              --concurrency=${CN} check=${CKSUM} size=${SZ}
            lst run bulk_${BRW}
            echo -n "Capturing statistics for ${TM} secs "
            lst stat lfrom lto &
            LSTPID=$!
            # Delay loop with interval markers displayed every 5 secs.
            # Test time is rounded up to the nearest 5 seconds.
            i=1
            j=$((${TM}/5))
            if [ $((${TM}%5)) -ne 0 ]; then let j++; fi
            while [ $i -le $j ]; do
              sleep 5
              let i++
            done
            kill ${LSTPID} && wait ${LSTPID} >/dev/null 2>&1
            echo
            lst show_error lfrom lto
            lst stop bulk_${BRW}
            lst end_session
            
            cbordage Cyril Bordage added a comment - - edited

            Hello,

            To find out what is going on, I would need debugging logs. For that, could you enable LNet traces with:

            lctl set_param debug=+net
            lctl clear
            

            Then, run your selftest script and dump the logs with:

            lctl dk > logfile.txt
            

            And finally, disable logging with:

            lctl set_param debug=-net
            

            Could you give me the values of the parameters for the selftest script, so that I have the context to understand the logs?

            Also, could you do the same for ping and capture an error?

             

            Thank you.
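
            A minimal sketch of that capture sequence as a single script (assumes the selftest wrapper above is saved as ./selftest.sh on the node running lst; the console output file name is only a placeholder):

            #!/bin/sh
            # Enable LNet tracing, clear the old debug buffer, run the selftest
            # wrapper, then dump the kernel debug buffer and turn tracing back off.
            lctl set_param debug=+net
            lctl clear
            ./selftest.sh 2>&1 | tee selftest-console.txt
            lctl dk > logfile.txt
            lctl set_param debug=-net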


            defazio Gian-Carlo Defazio added a comment -

            These outputs are from using the wrapper at the end of https://wiki.lustre.org/LNET_Selftest.

            In these tests I'm testing between a router node and a client node over an OPA (Omni-Path) network.

            First, a node group (of just one node) fails to add:

            [root@mutt29:defazio1]# ./selftest.sh
            LST_SESSION = 309315
            SESSION: lstread FEATURES: 1 TIMEOUT: 300 FORCE: No
            192.168.128.29@o2ib44 are added to session
            create session RPC failed on 12345-192.168.128.3@o2ib44: Unknown error -110
            No nodes added successfully, deleting group lto
            Group is deleted
            Can't get count of nodes from lto: No such file or directory
            bulk_read is running now
            Capturing statistics for 10 secs Invalid nid: lto
            Failed to get count of nodes from lto: Success
            ./selftest.sh: line 55: kill: (309323) - No such process
            Invalid nid: lto
            Failed to get count of nodes from lto: Success
            Batch is stopped
            session is ended

            I also had a run where the `lto` node was added successfully, but the test still failed:

            [root@mutt29:defazio1]# ./selftest.sh
            LST_SESSION = 309335
            SESSION: lstread FEATURES: 1 TIMEOUT: 300 FORCE: No
            192.168.128.29@o2ib44 are added to session
            192.168.128.3@o2ib44 are added to session
            Test was added successfully
            bulk_read is running now
            Capturing statistics for 10 secs [LNet Rates of lfrom]
            [R] Avg: 20       RPC/s Min: 20       RPC/s Max: 20       RPC/s
            [W] Avg: 9        RPC/s Min: 9        RPC/s Max: 9        RPC/s
            [LNet Bandwidth of lfrom]
            [R] Avg: 9.80     MiB/s Min: 9.80     MiB/s Max: 9.80     MiB/s 
            [W] Avg: 0.00     MiB/s Min: 0.00     MiB/s Max: 0.00     MiB/s 
            [LNet Rates of lto]
            [R] Avg: 9        RPC/s Min: 9        RPC/s Max: 9        RPC/s
            [W] Avg: 18       RPC/s Min: 18       RPC/s Max: 18       RPC/s
            [LNet Bandwidth of lto]
            [R] Avg: 0.00     MiB/s Min: 0.00     MiB/s Max: 0.00     MiB/s 
            [W] Avg: 8.80     MiB/s Min: 8.80     MiB/s Max: 8.80     MiB/s lfrom:
            12345-192.168.128.29@o2ib44: [Session 6 brw errors, 0 ping errors] [RPC: 17 errors, 0 dropped, 0 expired]
            Total 1 error nodes in lfrom
            lto:
            RPC failure, can't show error on 12345-192.168.128.3@o2ib44
            Total 1 error nodes in lto
            1 batch in stopping 
            ...
            (repeat 52 times total)
            ...
            1 batch in stopping
            Batch is stopped
            session is ended

            And here's a successful test on a different cluster running the same version of Lustre:

            [root@opal64:~]# ./selftest.sh 
            LST_SESSION = 284017
            SESSION: lstread FEATURES: 1 TIMEOUT: 300 FORCE: No
            192.168.128.4@o2ib18 are added to session
            192.168.128.125@o2ib18 are added to session
            Test was added successfully
            bulk_read is running now
            Capturing statistics for 10 secs [LNet Rates of lfrom]
            [R] Avg: 13878    RPC/s Min: 13878    RPC/s Max: 13878    RPC/s
            [W] Avg: 6938     RPC/s Min: 6938     RPC/s Max: 6938     RPC/s
            [LNet Bandwidth of lfrom]
            [R] Avg: 6940.33  MiB/s Min: 6940.33  MiB/s Max: 6940.33  MiB/s 
            [W] Avg: 1.06     MiB/s Min: 1.06     MiB/s Max: 1.06     MiB/s 
            [LNet Rates of lto]
            [R] Avg: 6939     RPC/s Min: 6939     RPC/s Max: 6939     RPC/s
            [W] Avg: 13879    RPC/s Min: 13879    RPC/s Max: 13879    RPC/s
            [LNet Bandwidth of lto]
            [R] Avg: 1.06     MiB/s Min: 1.06     MiB/s Max: 1.06     MiB/s 
            [W] Avg: 6940.73  MiB/s Min: 6940.73  MiB/s Max: 6940.73  MiB/s lfrom:
            Total 0 error nodes in lfrom
            lto:
            Total 0 error nodes in lto
            1 batch in stopping
            Batch is stopped
            session is ended
            
            ofaaland Olaf Faaland added a comment -

            Gian-Carlo has more information and will update the ticket. 

             


            People

              Assignee: cbordage Cyril Bordage
              Reporter: ofaaland Olaf Faaland
              Votes: 0
              Watchers: 6
