o2iblnd.c:3049:kiblnd_shutdown() <NID>: waiting for <N> peers to disconnect

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.15.1
    • None
    • 4.18.0-372.32.1.1toss.t4.x86_64
      lustre-2.15.1_7.llnl-2.t4.x86_64
    • 3
    • 9223372036854775807

    Description

      Observed on a Lustre router node while the servers and some of the clients were up and connected. The router node has Omni-Path on the client side and IB on the Lustre server side.

      lnetctl lnet unconfigure 

      hangs with stack

      [<0>] kiblnd_shutdown+0x347/0x4e0 [ko2iblnd]
      [<0>] lnet_shutdown_lndni+0x2b6/0x4c0 [lnet]
      [<0>] lnet_shutdown_lndnet+0x6c/0xb0 [lnet]
      [<0>] lnet_shutdown_lndnets+0x11e/0x300 [lnet]
      [<0>] LNetNIFini+0xb7/0x130 [lnet]
      [<0>] lnet_ioctl+0x220/0x260 [lnet]
      [<0>] notifier_call_chain+0x47/0x70
      [<0>] blocking_notifier_call_chain+0x42/0x60
      [<0>] libcfs_psdev_ioctl+0x346/0x590 [libcfs]
      [<0>] do_vfs_ioctl+0xa5/0x740
      [<0>] ksys_ioctl+0x64/0xa0
      [<0>] __x64_sys_ioctl+0x16/0x20
      [<0>] do_syscall_64+0x5b/0x1b0
      [<0>] entry_SYSCALL_64_after_hwframe+0x61/0xc6 

      Debug log shows it is still waiting for 3 peers after more than 3700 seconds (the two log lines below are about 3784 seconds apart):

      00000800:00000200:1.0:1667256015.359743:0:35023:0:(o2iblnd.c:3049:kiblnd_shutdown()) 172.19.1.108@o2ib100: waiting for 3 peers to disconnect 
      ...
      00000800:00000200:3.0:1667259799.039743:0:35023:0:(o2iblnd.c:3049:kiblnd_shutdown()) 172.19.1.108@o2ib100: waiting for 3 peers to disconnect
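To check how long the shutdown has been stuck, the epoch timestamp in the fourth colon-separated field of each debug line can be compared. The following is a minimal sketch, not part of the ticket: the regular expression is an assumption based only on the two lines quoted above, and the helper names `parse_wait_lines` and `wait_summary` are hypothetical.

```python
import re

# Pattern derived from the two debug lines quoted in this ticket (an assumption,
# not a general parser for Lustre debug logs).
WAIT_RE = re.compile(
    r"^[0-9a-f]+:[0-9a-f]+:[\d.]+:(?P<ts>\d+\.\d+):.*"
    r"kiblnd_shutdown\(\)\)\s+(?P<nid>\S+): "
    r"waiting for (?P<peers>\d+) peers to disconnect"
)


def parse_wait_lines(lines):
    """Return (timestamp, nid, peer_count) for each matching debug line."""
    events = []
    for line in lines:
        m = WAIT_RE.match(line.strip())
        if m:
            events.append((float(m["ts"]), m["nid"], int(m["peers"])))
    return events


def wait_summary(lines):
    """Return (elapsed_seconds, last_peer_count), or None if nothing matched."""
    events = parse_wait_lines(lines)
    if not events:
        return None
    elapsed = events[-1][0] - events[0][0]
    return elapsed, events[-1][2]


sample = [
    "00000800:00000200:1.0:1667256015.359743:0:35023:0:"
    "(o2iblnd.c:3049:kiblnd_shutdown()) 172.19.1.108@o2ib100: "
    "waiting for 3 peers to disconnect",
    "00000800:00000200:3.0:1667259799.039743:0:35023:0:"
    "(o2iblnd.c:3049:kiblnd_shutdown()) 172.19.1.108@o2ib100: "
    "waiting for 3 peers to disconnect",
]

elapsed, peers = wait_summary(sample)
print(f"still waiting for {peers} peers after {elapsed:.0f} s")
```

Run against a full debug log, the same helper would also show whether the peer count ever drops between the first and last message.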

      Before the shutdown there were 38 peers, all reported as "up".

      For patch stack, see https://github.com/LLNL/lustre/releases/tag/2.15.1_7.llnl

      For my reference, my local ticket is TOSS5826

      Attachments

        1. dk.mutt4.1.gz (33 kB)
        2. dk.mutt4.2.gz (256 kB)
        3. dk.mutt4.3.gz (57 kB)
        4. dmesg.mutt4.1667256190.gz (32 kB)
        5. dmesg.mutt4.1667259716.gz (0.6 kB)
        6. lnetctl.peer.show.mutt4.1.gz (1 kB)

          Activity

            [LU-16283] o2iblnd.c:3049:kiblnd_shutdown() <NID>: waiting for <N> peers to disconnect

            ofaaland Olaf Faaland added a comment -

            This workaround seems to be working well for us. We do not normally bring NIs up and down dynamically in production, so the potential problem scenario probably won't occur and won't hurt us. I'll remove topllnl, but leave the ticket open until an actual fix is identified.

            ssmirnov Serguei Smirnov added a comment -

            Hi Olaf,

            I haven't been able to conclusively identify the problem yet. I believe it is some sort of race on LNet shutdown, but that much is fairly obvious. The workaround you applied should be good for most cases; the only scenario it probably doesn't cover is when active router NIs are being brought down/up dynamically.

            Thanks,

            Serguei.
            ofaaland Olaf Faaland added a comment -

            Hi Serguei,
            I've added "lnetctl set routing 0" to our lnet service file. Have you had any success identifying the problem? Thanks

            ofaaland Olaf Faaland added a comment -

            Yes, we are adding "lnetctl set routing 0" to the shutdown tasks in our lnet service file after the holiday break.
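The shutdown order discussed in this thread (disable routing first, then unconfigure LNet) could be sketched as below. This is only an illustration under stated assumptions: the two `lnetctl` invocations are the ones quoted in this ticket, while the Python wrapper, the function names `shutdown_commands` and `run_shutdown`, and the injectable `runner` parameter are hypothetical and are not the contents of the actual lnet service file.

```python
import subprocess


def shutdown_commands():
    """Commands in the order described in this ticket."""
    return [
        # Workaround: stop routing before tearing down the NIs.
        ["lnetctl", "set", "routing", "0"],
        # The step that was observed to hang in kiblnd_shutdown().
        ["lnetctl", "lnet", "unconfigure"],
    ]


def run_shutdown(runner=subprocess.run):
    """Run each command in order; stop at the first failure.

    `runner` is injectable (hypothetical parameter) so the ordering can be
    exercised without lnetctl installed.
    """
    for cmd in shutdown_commands():
        runner(cmd, check=True)
```

On a router node, calling `run_shutdown()` as root would issue the two commands in sequence. As noted in the thread, this works around the hang rather than addressing the root cause.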


            ssmirnov Serguei Smirnov added a comment -

            Hi Olaf,

            Sorry, not yet. It does not address the root cause, but, for lack of better ideas, I was considering changing the shutdown procedure to include "lnetctl set routing 0". I haven't submitted the patch yet.

            Thanks,

            Serguei.

            ofaaland Olaf Faaland added a comment -

            Hi Serguei,

            Do you have any update on this issue?

            Thanks

            People

              Assignee: ssmirnov Serguei Smirnov
              Reporter: ofaaland Olaf Faaland