Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-7210

ASSERTION( peer->ibp_connecting == 0 )

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.8.0
    • Lustre 2.5.3
    • None
    • 3
    • 9223372036854775807

    Description

      As part of LU-7054 we applied patch http://review.whamcloud.com/#/c/14600 put we are hitting LBUG on the server

      LNetError: 6521:0:(o2iblnd.c:377:kiblnd_destroy_peer()) ASSERTION( peer->ibp_connecting == 0 ) failed:
      LNetError: 6521:0:(o2iblnd.c:377:kiblnd_destroy_peer()) LBUG
      Pid: 6521, comm: kiblnd_connd

      Attachments

        Issue Links

          Activity

            [LU-7210] ASSERTION( peer->ibp_connecting == 0 )

            Thank you Doug for your investigation. We will have 17527 reverted in our nas-2.7.2.

            jaylan Jay Lan (Inactive) added a comment - Thank you Doug for your investigation. We will have 17527 reverted in our nas-2.7.2.

            Looking at o2iblnd_cb.c_3nasC, I can see the problem. Under LU-7099, kiblnd_check_sends() was changed to kiblnd_check_sends_locked(). So, the locking was removed from this routine with the expectation that the callers will be holding the lock.

            In o2iblnd_cb.c_3nasC, kiblnd_check_sends_locked() has a remaining call to spin_unlock(&conn->ibc_lock) at the end which should not be there. So, you end up with double-unlock calls which will cause the problems you are seeing.

            I suspect this happened due to a collision of patches. I noticed that LU-7099 landed for 2.7 FE but not 2.8 FE (I'll see that this gets addressed).

            An important side effect of LU-7099: with the removal of all the "lock-dances" when check_sends is called, we no longer need the extra reference count on the connection introduced in this ticket. So, LU-7099 removes 17527 and that is ok.

            doug Doug Oucharek (Inactive) added a comment - Looking at o2iblnd_cb.c_3nasC, I can see the problem. Under LU-7099 , kiblnd_check_sends() was changed to kiblnd_check_sends_locked(). So, the locking was removed from this routine with the expectation that the callers will be holding the lock. In o2iblnd_cb.c_3nasC, kiblnd_check_sends_locked() has a remaining call to spin_unlock(&conn->ibc_lock) at the end which should not be there. So, you end up with double-unlock calls which will cause the problems you are seeing. I suspect this happened due to a collision of patches. I noticed that LU-7099 landed for 2.7 FE but not 2.8 FE (I'll see that this gets addressed). An important side effect of LU-7099 : with the removal of all the "lock-dances" when check_sends is called, we no longer need the extra reference count on the connection introduced in this ticket. So, LU-7099 removes 17527 and that is ok.

            o2iblnd_cb.c_3nasC has been attached.

            I compared the o2iblnd_cb.c from 3nasC and from 2nasC, which worked fine, the changes as far as change 17527's concern are the same.

            Note that we have been carrying change 17527 for several months without the mount problem until I did the rebase.

            jaylan Jay Lan (Inactive) added a comment - o2iblnd_cb.c_3nasC has been attached. I compared the o2iblnd_cb.c from 3nasC and from 2nasC, which worked fine, the changes as far as change 17527's concern are the same. Note that we have been carrying change 17527 for several months without the mount problem until I did the rebase.

            As you can see from b2_8_fe, change 17527 is needed. Can you attach your patched version o2iblnd_cb.c to this ticket? A mismatch of changes may be present causing a problem.

            doug Doug Oucharek (Inactive) added a comment - As you can see from b2_8_fe, change 17527 is needed. Can you attach your patched version o2iblnd_cb.c to this ticket? A mismatch of changes may be present causing a problem.

            I rebased NASA Ames's nas-2.7.2 branch to b2_7_fe on Aug 20, which picked up 58 patches up to this one:
            LU-8019 llite: Restore proper opencache operations

            The resulting client codes hung on mounting a lustre fs with errors:

            LustreError: 15c-8: MGC10.151.26.117@o2ib: The configuration from log 'nbp1-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.

            The culprit has been identified to be this commit (change 17527)
            LU-7210 o2iblnd: take extra refcount in kiblnd_connreq_done

            We have carried that patch since May 4th (nas-2.7.1-2nas) to address our lnet disconnect/reconnect problem. It showed no problem until we rebased to b2_7_fe (9f42af2).

            I noticed that b2_7_fe does not include change 17527, On the other hand, b2_8_fe includes two patches from LU-2710: change 17527 and change 17004. So I cherry-picked #17004 to our nas-2.7.2, but the mount problem persisted. If I reverted 17527 instead, the rebased code works fine on mount.

            However, I am concerned if the disconnect/reconnect problem we hit before would come back if I back out 17527. Please advise. Thanks!

            jaylan Jay Lan (Inactive) added a comment - I rebased NASA Ames's nas-2.7.2 branch to b2_7_fe on Aug 20, which picked up 58 patches up to this one: LU-8019 llite: Restore proper opencache operations The resulting client codes hung on mounting a lustre fs with errors: LustreError: 15c-8: MGC10.151.26.117@o2ib: The configuration from log 'nbp1-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information. The culprit has been identified to be this commit (change 17527) LU-7210 o2iblnd: take extra refcount in kiblnd_connreq_done We have carried that patch since May 4th (nas-2.7.1-2nas) to address our lnet disconnect/reconnect problem. It showed no problem until we rebased to b2_7_fe (9f42af2). I noticed that b2_7_fe does not include change 17527, On the other hand, b2_8_fe includes two patches from LU-2710 : change 17527 and change 17004. So I cherry-picked #17004 to our nas-2.7.2, but the mount problem persisted. If I reverted 17527 instead, the rebased code works fine on mount. However, I am concerned if the disconnect/reconnect problem we hit before would come back if I back out 17527. Please advise. Thanks!

            Reopening/resolving in order to be able to apply the FixVersion of 2.8.0 to reflect the work that was landed here.

            jgmitter Joseph Gmitter (Inactive) added a comment - Reopening/resolving in order to be able to apply the FixVersion of 2.8.0 to reflect the work that was landed here.

            Thanks for a great summary of the patches, Doug! This helps!

            jaylan Jay Lan (Inactive) added a comment - Thanks for a great summary of the patches, Doug! This helps!

            The original problem reported in this ticket was caused by http://review.whamcloud.com/#/c/14600/ where it put two flag bits in the same byte. That created a potential race as flipping bits is not atomic operations and there was not a lock which was protecting flag changes. So, ibp_connecting would get set back to 0 when there was a race with a change to another bit (such as ibp_scheduled).

            Patch http://review.whamcloud.com/#/c/14600/ has been reverted which will eliminate this assert from happening (so we are closing this ticket). However we still need the fix done by http://review.whamcloud.com/#/c/14600/ and that is being redone under LU-7569.

            With regards to 17004: this corrected an inconsistency I found in the code which opened a very small race between the lock being released and reacquired. It is not a fix for this issue, but is still a good cleanup to have in master to rule it out when we investigate future races.

            So, in summary, 17004 does not need to be applied to existing installations, but should land to master as a good cleanup. Likewise, 17527 is a good cleanup. The true fix here is to revert 14600 which has been done to master recently.

            doug Doug Oucharek (Inactive) added a comment - The original problem reported in this ticket was caused by http://review.whamcloud.com/#/c/14600/ where it put two flag bits in the same byte. That created a potential race as flipping bits is not atomic operations and there was not a lock which was protecting flag changes. So, ibp_connecting would get set back to 0 when there was a race with a change to another bit (such as ibp_scheduled). Patch http://review.whamcloud.com/#/c/14600/ has been reverted which will eliminate this assert from happening (so we are closing this ticket). However we still need the fix done by http://review.whamcloud.com/#/c/14600/ and that is being redone under LU-7569 . With regards to 17004: this corrected an inconsistency I found in the code which opened a very small race between the lock being released and reacquired. It is not a fix for this issue, but is still a good cleanup to have in master to rule it out when we investigate future races. So, in summary, 17004 does not need to be applied to existing installations, but should land to master as a good cleanup. Likewise, 17527 is a good cleanup. The true fix here is to revert 14600 which has been done to master recently.

            Yeah I was surprised that patch 17004 was landed. Liang do you recommend Jay use this patch?

            simmonsja James A Simmons added a comment - Yeah I was surprised that patch 17004 was landed. Liang do you recommend Jay use this patch?

            Hmmm. Should we take http://review.whamcloud.com/#/c/17004/ or drop it?
            James said to drop it on Dec 18 then that patch was landed today...

            jaylan Jay Lan (Inactive) added a comment - Hmmm. Should we take http://review.whamcloud.com/#/c/17004/ or drop it? James said to drop it on Dec 18 then that patch was landed today...

            People

              doug Doug Oucharek (Inactive)
              mhanafi Mahmoud Hanafi
              Votes:
              0 Vote for this issue
              Watchers:
              12 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: