OSS Servers stuck in connecting/disconnect loop

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor

    Description

      Several of our OSSes have started getting into a state of disconnecting from and reconnecting with clients. Sometimes they clear up and then re-enter the same state later. Even after a reboot they enter the same state again.

      Attaching a Lustre debug dump. Please advise on what additional info is needed for debugging.

          Activity

            [LU-7676] OSS Servers stuck in connecting/disconnect loop
            pjones Peter Jones added a comment -

            Actually, the fix landed under LU-7569.

            mhanafi Mahmoud Hanafi added a comment - Need to add NASA label.
            jaylan Jay Lan (Inactive) added a comment - - edited

            We need a b2_7_fe backport also. ATM we plan to stop running 2.7.1 until we receive the backport.

            jaylan Jay Lan (Inactive) added a comment - It looks like http://review.whamcloud.com/#/c/18025/ is a backport of patch LU-7569 http://review.whamcloud.com/#/c/17892/ that I have been asking for. Thanks.

            jaylan Jay Lan (Inactive) added a comment - Patch 18026 missed a newline at line 1909.

            doug Doug Oucharek (Inactive) added a comment - NASA: Please apply both patches being discussed here: http://review.whamcloud.com/#/c/18025/ and http://review.whamcloud.com/18026.

            doug Doug Oucharek (Inactive) added a comment - Liang: is your patch in addition to the one Amir ported or is it a replacement for it?
            liang Liang Zhen (Inactive) added a comment - - edited

            I checked my original patch; it seems I forgot to call set_current_state() before schedule_timeout(), so the patch couldn't really help because the current thread wouldn't sleep. I have updated the patch uploaded by Amir (http://review.whamcloud.com/#/c/16470/), and I also ported it to 2_5_fe (http://review.whamcloud.com/18026).
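The pitfall described in the comment above can be illustrated with a small user-space model. This is only a sketch, not the LU-7569 patch and not kernel code: `model_set_current_state`, `model_schedule_timeout`, `buggy_backoff`, and `fixed_backoff` are hypothetical names that mimic the documented kernel rule that schedule_timeout() does not sleep while the task state is still TASK_RUNNING.

```c
/*
 * User-space model (assumption: simplified semantics, no real scheduler).
 * It only mimics one rule of the kernel API: schedule_timeout() sleeps
 * only if the caller changed the task state away from TASK_RUNNING first.
 */
enum task_state { TASK_RUNNING, TASK_INTERRUPTIBLE };

enum task_state current_state = TASK_RUNNING;
long slept_ticks = 0; /* total ticks the modeled thread really slept */

/* Stand-in for the kernel's set_current_state(). */
void model_set_current_state(enum task_state s)
{
    current_state = s;
}

/*
 * Stand-in for schedule_timeout(): returns the ticks still remaining.
 * With TASK_RUNNING the call comes straight back and nothing is slept.
 */
long model_schedule_timeout(long timeout)
{
    if (current_state == TASK_RUNNING)
        return timeout;           /* no sleep happened */

    slept_ticks += timeout;       /* pretend the full timeout elapsed */
    current_state = TASK_RUNNING; /* state is TASK_RUNNING again on return */
    return 0;
}

/* Buggy pattern: state never set, so the intended delay is skipped. */
long buggy_backoff(long ticks)
{
    return model_schedule_timeout(ticks);
}

/* Fixed pattern: set the state first, then ask the scheduler to sleep. */
long fixed_backoff(long ticks)
{
    model_set_current_state(TASK_INTERRUPTIBLE);
    return model_schedule_timeout(ticks);
}
```

In this model, buggy_backoff(100) returns with all 100 ticks still remaining, while fixed_backoff(100) returns 0 after accounting for 100 slept ticks. In real kernel code the corresponding fix is to call set_current_state(TASK_INTERRUPTIBLE) or set_current_state(TASK_UNINTERRUPTIBLE) immediately before schedule_timeout().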
            ashehata Amir Shehata (Inactive) added a comment - Ported the patch here: http://review.whamcloud.com/#/c/18025/

            ashehata Amir Shehata (Inactive) added a comment - I looked through the attached log file and see 442 instances of connection races, which occur when two nodes attempt to reconnect at the same time. This could result in a flurry of reconnects, which could consume memory. A prototype patch has been written to address the same issue at another site. I'm in the process of porting it to NASA's branch and will push it later today for you to try.

            People

              green Oleg Drokin
              mhanafi Mahmoud Hanafi