Lustre / LU-16949

LNet: deadlock on o2ib NI going down under CentOS 7.9


Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.16.0
    • None
    • CentOS 7.9 VM, kernel 3.10.0-1160.25.1.el7_lustre.x86_64
      Could not reproduce on CentOS 8.2

    Description

      The issue can be reproduced by adding an o2ib NI and then interrupting the corresponding link: pulling the cable, shutting down the switch connection, or shutting down the whole switch.

      Alternatively, one can add the o2ib NI when the corresponding link is already down (cable pulled) to the same effect.

      Using "ifdown" to bring the whole interface down doesn't reproduce the problem. 

      I could reproduce this on a CentOS 7.9 VM, but not on a CentOS 8.2 system.
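
A minimal sketch of the setup half of the reproduction described above, assuming the node has the Lustre utilities installed and that the o2ib NI sits on an IPoIB interface named ib0 (a hypothetical name, as are the helper functions). It only adds the NI and shows its state; interrupting the link is still a manual step (pull the cable or shut down the switch connection). The script defaults to a dry run that prints the commands without executing them.

```python
import subprocess


def repro_commands(net="o2ib", ifname="ib0"):
    """Build the command sequence that adds an o2ib NI.

    ifname is a placeholder: substitute the interface backing the
    link that will be interrupted afterwards.
    """
    return [
        ["modprobe", "lnet"],
        ["lnetctl", "lnet", "configure"],
        ["lnetctl", "net", "add", "--net", net, "--if", ifname],
        ["lnetctl", "net", "show", "--net", net, "--verbose"],
    ]


def run(commands, dry_run=True):
    """Print each command and, unless dry_run is set, execute it as root."""
    for cmd in commands:
        print("+ " + " ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(repro_commands())
    print("Now interrupt the link (pull the cable or shut down the switch connection).")
```

Per the report, bringing the interface down with "ifdown" is not a substitute for the manual step, since it does not reproduce the problem.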

      The issue was introduced by

      commit da230373bd14306cb97fb48748ebce205f09d468
      Author: Serguei Smirnov <ssmirnov@whamcloud.com>
      Date:   Thu Feb 16 10:34:03 2023 -0800
      LU-16563 lnet: use discovered ni status to set initial health 

      It was then masked by another issue that caused adding an o2ib NI to fail, starting from

      commit cc5594df3e70d1924f34ccdf4c3178654d277ad0
      Author: Shaun Tancheff <shaun.tancheff@hpe.com>
      Date:   Sun Apr 23 07:19:11 2023 -0500
      LU-16759 o2ib: MOFED 5.5+ ib_dma_virt_map_sg

      until some later commit, which I did not identify, made it possible to add an o2iblnd NI again. The latest master behaves on CentOS 7.9 as described.
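
To check where a given checkout falls relative to the two commits named above, something like the following could be used. It is a sketch only: the helper names are hypothetical, it assumes a git clone of the Lustre tree, and because the commit that re-enabled adding the NI was not identified, it cannot tell the masked range apart from the later unmasked one.

```python
import subprocess

INTRODUCED_BY = "da230373bd14306cb97fb48748ebce205f09d468"  # LU-16563
MASKED_FROM = "cc5594df3e70d1924f34ccdf4c3178654d277ad0"    # LU-16759


def contains(commit, ref="HEAD", repo="."):
    """Return True if commit is an ancestor of ref in the checkout at repo.

    git exits with 0 when it is an ancestor; any other exit status
    (not an ancestor, unknown commit, not a repository) yields False.
    """
    result = subprocess.run(
        ["git", "-C", repo, "merge-base", "--is-ancestor", commit, ref],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def classify(has_introducing, has_masking):
    """Map the presence of the two commits to the situation in the report."""
    if not has_introducing:
        return "predates the issue"
    if not has_masking:
        return "deadlock expected to be reproducible"
    return "masked by the NI add failure, or exposed again by a later commit"


if __name__ == "__main__":
    print(classify(contains(INTRODUCED_BY), contains(MASKED_FROM)))
```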

          People

            ssmirnov Serguei Smirnov