
o2iblnd: LU-13368 changes cause shutdown procedure to not complete

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.16.0

    Description

      Changes applied by the patches from LU-13368 appear to sometimes prevent the o2iblnd shutdown procedure from completing when lustre_rmmod is run.

      In that case, messages similar to the following keep showing up in the log:

      [51025.354675] LNet: 9402:0:(o2iblnd.c:3107:kiblnd_shutdown()) 10.1.11.124@o2ib10: waiting for 3 peers to disconnect
      [51029.354481] LNet: 9402:0:(o2iblnd.c:3107:kiblnd_shutdown()) 10.1.11.124@o2ib10: waiting for 3 peers to disconnect
      [51037.353971] LNet: 9402:0:(o2iblnd.c:3107:kiblnd_shutdown()) 10.1.11.124@o2ib10: waiting for 3 peers to disconnect
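
      To check a saved kernel log for this symptom, the messages above can be matched mechanically. The following is a minimal sketch, not taken from the Lustre tree: the function names `parse_waits` and `looks_stuck`, the `min_repeats` threshold, and the assumption that all lines belong to one NID are illustrative choices. It treats a shutdown as stalled when the reported peer count has not dropped across the most recent messages.

```python
import re

# Matches the kiblnd_shutdown() wait message format shown in the log excerpt.
WAIT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+LNet:.*kiblnd_shutdown\(\)\)\s+"
    r"(?P<nid>\S+): waiting for (?P<peers>\d+) peers to disconnect"
)


def parse_waits(lines):
    """Return a list of (timestamp, nid, peer_count) for each matching line."""
    waits = []
    for line in lines:
        m = WAIT_RE.search(line)
        if m:
            waits.append(
                (float(m.group("ts")), m.group("nid"), int(m.group("peers")))
            )
    return waits


def looks_stuck(waits, min_repeats=3):
    """Heuristic (assumption): stalled if the peer count has not decreased
    over the last `min_repeats` messages. Assumes one NID per input."""
    if len(waits) < min_repeats:
        return False
    counts = [peers for _, _, peers in waits[-min_repeats:]]
    return counts[-1] >= counts[0]


if __name__ == "__main__":
    sample = [
        "[51025.354675] LNet: 9402:0:(o2iblnd.c:3107:kiblnd_shutdown()) 10.1.11.124@o2ib10: waiting for 3 peers to disconnect",
        "[51029.354481] LNet: 9402:0:(o2iblnd.c:3107:kiblnd_shutdown()) 10.1.11.124@o2ib10: waiting for 3 peers to disconnect",
        "[51037.353971] LNet: 9402:0:(o2iblnd.c:3107:kiblnd_shutdown()) 10.1.11.124@o2ib10: waiting for 3 peers to disconnect",
    ]
    print(looks_stuck(parse_waits(sample)))
```

      The input could come from `dmesg` output saved to a file; the threshold is arbitrary and a legitimately slow shutdown may trip it.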
      

       

          Activity

            [LU-14499] o2iblnd: LU-13368 changes cause shutdown procedure to not complete
            pjones Peter Jones added a comment -

            Merged for 2.16


            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/41937/
            Subject: LU-14499 lnet: Revert "LU-13368 lnet: discard the callback"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: fc28666b2f648dc3d52f7ceaeb552405e17883da

            ssmirnov Serguei Smirnov added a comment -

            Hi Olaf, the problem here appears to be that even though the patches are code-complete and Maloo-tested, we're not able to verify Yang Sheng's fixes in a proper IB environment as Shuichi doesn't have the available resources. Would you be able to give these patches a try on your system?

            ofaaland Olaf Faaland added a comment -

            Hi Serguei and Yang Sheng,

            Thanks for clarifying. It looks like changes 40937 and 41970 aren't progressing. Are you waiting on something?

            Thanks

            ys Yang Sheng added a comment -

            Sorry for the delay. Yes, Serguei is right.
            https://review.whamcloud.com/#/c/fs/lustre-release/+/38845/ is the original patch.
            https://review.whamcloud.com/#/c/fs/lustre-release/+/40937/ is a patch that works together with 38845 to provide full functionality.
            https://review.whamcloud.com/#/c/fs/lustre-release/+/41970/ is a bug-fix patch for this ticket. Since I think it should be tested first, I have marked it as a 'test patch'.


            ssmirnov Serguei Smirnov added a comment -

            Hi Olaf,

            ys will correct me if I'm wrong, but I believe these are the two changes which are supposed to be fixing the original "discard the callback":

            https://review.whamcloud.com/#/c/fs/lustre-release/+/40937/

            https://review.whamcloud.com/#/c/fs/lustre-release/+/41970/

            Thanks,

            Serguei.

            ofaaland Olaf Faaland added a comment - edited

            What are the gerrit URLs for those changes? Thanks.

            ys Yang Sheng added a comment -

            Hi, Serguei, Yes, you are right.


            ssmirnov Serguei Smirnov added a comment -

            Hi Olaf,

            From comments in LU-13368 and this ticket, it looks like "lnet: discard the callback" change should be reverted. On the other hand, there were potential fixes supplied by Yang Sheng which didn't get tested. If I remember correctly, this got stuck pending the test results, which would help decide whether to revert the change, or keep it and add the fixes.

            ys, sihara: is my understanding correct?

            Thanks,

            Serguei.

            ofaaland Olaf Faaland added a comment -

            Hi Serguei,
            Is this stuck because you need more information?
            thanks,


            People

              Assignee: ssmirnov Serguei Smirnov
              Reporter: ssmirnov Serguei Smirnov
              Votes: 0
              Watchers: 8
