[LU-7210] ASSERTION( peer->ibp_connecting == 0 ) Created: 25/Sep/15  Updated: 13/Oct/16  Resolved: 16/Mar/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.3
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Critical
Reporter: Mahmoud Hanafi Assignee: Doug Oucharek (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: File o2iblnd_cb.c_3nasC    
Issue Links:
Related
is related to LU-7569 IB leaf switch caused LNet routers to... Resolved
is related to LU-5718 RDMA too fragmented with router Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

As part of LU-7054 we applied patch http://review.whamcloud.com/#/c/14600 put we are hitting LBUG on the server

LNetError: 6521:0:(o2iblnd.c:377:kiblnd_destroy_peer()) ASSERTION( peer->ibp_connecting == 0 ) failed:
LNetError: 6521:0:(o2iblnd.c:377:kiblnd_destroy_peer()) LBUG
Pid: 6521, comm: kiblnd_connd



 Comments   
Comment by Joseph Gmitter (Inactive) [ 25/Sep/15 ]

Hi Amir,
This appears to be an Lnet issue. Can you have a look?
Thanks.
Joe

Comment by Amir Shehata (Inactive) [ 29/Sep/15 ]

Would it be possible to provide a stack trace for the crash?
thanks

Comment by Mahmoud Hanafi [ 29/Sep/15 ]
PID: 7228   TASK: ffff880c333e1540  CPU: 4   COMMAND: "kiblnd_connd"
 #0 [ffff880c345f3a68] machine_kexec at ffffffff81038f0b
 #1 [ffff880c345f3ac8] crash_kexec at ffffffff810c5552
 #2 [ffff880c345f3b98] kdb_kdump_check at ffffffff81291727
 #3 [ffff880c345f3ba8] kdb_main_loop at ffffffff81294917
 #4 [ffff880c345f3cb8] kdb_save_running at ffffffff8128ea7e
 #5 [ffff880c345f3cc8] kdba_main_loop at ffffffff81476a88
 #6 [ffff880c345f3d08] kdb at ffffffff81291c16
 #7 [ffff880c345f3d78] panic at ffffffff8155948f
 #8 [ffff880c345f3df8] lbug_with_loc at ffffffffa05a4eeb [libcfs]
 #9 [ffff880c345f3e18] kiblnd_destroy_peer at ffffffffa0a42e43 [ko2iblnd]
#10 [ffff880c345f3e38] kiblnd_destroy_conn at ffffffffa0a4d197 [ko2iblnd]
#11 [ffff880c345f3e68] kiblnd_connd at ffffffffa0a56fd2 [ko2iblnd]
#12 [ffff880c345f3ee8] kthread at ffffffff8109a106
#13 [ffff880c345f3f48] kernel_thread at ffffffff8100c20a
Comment by Amir Shehata (Inactive) [ 29/Sep/15 ]

I don't seem to see the case where the assert could be triggered. Would it be possible to capture net and neterror debug prior to the bug to see how we get into this case.

In order to get into this assert, ibp_connecting > 0 and the refcount on the peer has to have gone to 0. I don't see an immediate path in the code where that can happen, but it's happening, so there are assumptions that have been made that are false. More debug output would help narrow down my investigation.

Comment by Mahmoud Hanafi [ 29/Sep/15 ]

Our hosts are setup to panic on lbug, so we are not able to dump the logs. If I can get a easy reproducer I'll try to dump the logs.

Is there any info from the crash dump that can help you?

Comment by Amir Shehata (Inactive) [ 29/Sep/15 ]

I'm trying to understand the sequence of events that lead up to the crash. But if you can provide the System.map, vmlinux and vmcore I can try investigating the crash dump a bit more.

In the meantime I'm still working out the different paths in the code that could lead to the system state I'm suspecting:

The peer thinks that it is connecting: ibp_connecting is > 0
The connection being cleaned off the zombie list is likely in one of the following states: 
IBLND_CONN_ACTIVE_CONNECT ||
IBLND_CONN_PASSIVE_WAIT
The peer's refcount is == 1
Comment by Doug Oucharek (Inactive) [ 22/Oct/15 ]

I believe I have found the scenario which will cause this assert:

The change that 14600 made to kiblnd_reconnect() is to "retry connection if it's still needed and not other connection attempts are in progress". In the following if statement, it is incrementing ibp_connecting from a value of 1 to 2 regardless of whether we are retrying now or waiting for the failed connection to be cleaned up due to a race condition (objective of the patch).

The failed connection ends up on the zombie list for connd to clean up. Remember that ibp_connecting is still a value of 2. In the connd processing, it cleans up the connection by calling kiblnd_destroy_conn() (start of the call chain seen above). That routine was altered by patch 14600 to trigger a new peer connection attempt if the ibp_tx_queue is not empty. Otherwise, it decrements ibp_connecting by one. In both cases, kiblnd_peer_decref() gets called to drop reference to the peer because the connection was cleaned up.

Now, if we take the path where ibp_tx_queue is empty, we decrement ibp_connecting making it a value of 1. When kiblnd_peer_decref() is called, it will find the peer refcount at 1 so it believes it needs to call kiblnd_destroy_peer(). That routine will find ibp_connecting with a value of 1 and asserts.

If we launch a new peer connection, the peer's reference counter gets incremented so the connection cleanup does not also try to destroy the peer.

So, the question is this: why does ibp_connecting have a value of 2 when we are holding off connecting due to a race condition? Is this a bug in patch 14600?

Comment by Gerrit Updater [ 31/Oct/15 ]

Doug Oucharek (doug.s.oucharek@intel.com) uploaded a new patch: http://review.whamcloud.com/17004
Subject: LU-7210 lnet: Change connect peer failed cleanup order
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 048108cac815ae543debbc3a01449fa7223bd8f7

Comment by Mahmoud Hanafi [ 07/Dec/15 ]

We have tried this patch it does help in reducing the occurrence of the LBUG but doesn't eliminate it.

Comment by Gerrit Updater [ 09/Dec/15 ]

Liang Zhen (liang.zhen@intel.com) uploaded a new patch: http://review.whamcloud.com/17527
Subject: LU-7210 o2iblnd: take extra refcount in kiblnd_connreq_done
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7a9e3e5e0cfb527a9a9d7a096e128bd2c78000d7

Comment by Jay Lan (Inactive) [ 10/Dec/15 ]

Do we take both #17004 and #17527?

Comment by Doug Oucharek (Inactive) [ 10/Dec/15 ]

Yes, take both patches please. Each blocks a window which could trigger this assert.

Comment by Jeremy Filizetti [ 17/Dec/15 ]

I've tried with both of these patches and I am still seeing this LBUG.

If a host gets rejected with a conn race there is an increment of ibp_connecting that it's possible won't be removed. For instance if on the kiblnd_destroy_conn->kiblnd_connect_peer reconnect there is another rejection that is not a conn race (ie: frags/queue depth) then in kiblnd_reconnect:

        /* retry connection if it's still needed and no other connection
         * attempts (active or passive) are in progress
         * NB: reconnect is still needed even when ibp_tx_queue is
         * empty if ibp_version != version because reconnect may be
         * initiated by kiblnd_query() */
        if ((!list_empty(&peer->ibp_tx_queue) ||
             peer->ibp_version != version) &&
            peer->ibp_connecting == 1 &&
            peer->ibp_accepting == 0) {
                if (why == IBLND_REJECT_CONN_RACE) {
                        /* don't reconnect immediately, intensive reconnecting
                         * may consume a lot of memory. kiblnd_destroy_conn
                         * will reconnect after releasing all resources of
                         * this connection */
                        conn->ibc_conn_race = 1;
                } else {
                        retry_now = 1;
                }
                peer->ibp_connecting++;
                peer->ibp_version     = version;
                peer->ibp_incarnation = incarnation;
        }

        write_unlock_irqrestore(&kiblnd_data.kib_global_lock, flags);

        if (!retry_now)
                return;

There ends up being no connect and so no call is made to kiblnd_connreq_done or kiblnd_peer_connect_failed to decrement ibp_connecting.

Comment by Doug Oucharek (Inactive) [ 17/Dec/15 ]

The fix http://review.whamcloud.com/#/c/14600 was just redesigned last night under patch: http://review.whamcloud.com/#/c/17661. Please apply the new version and try that out. It is very new and needs to be validated (something I will be working on today). If you do try it, please share feedback of your results here or LU-7569.

Comment by James A Simmons [ 17/Dec/15 ]

There are two patches for this ticket. Do I drop both and just use 17661 for testing?

Comment by Jeremy Filizetti [ 17/Dec/15 ]

Just a quick look at the code I can see it doesn't account for my last comment but it does look much better to me from a design point. I'll add a check for my issue and test the code and let you know.

Comment by Doug Oucharek (Inactive) [ 17/Dec/15 ]

I feel that 17527 is also useful. So 17527 and 17661 should both be used.

Comment by James A Simmons [ 17/Dec/15 ]

The patch 17527 conflicts with 17661.

Comment by Doug Oucharek (Inactive) [ 17/Dec/15 ]

In that case, drop 17527..

Comment by Liang Zhen (Inactive) [ 18/Dec/15 ]

I think I have removed the conflict part from 17527 so you can keep both of them?

Comment by Jay Lan (Inactive) [ 18/Dec/15 ]

How about 17004? We currently picked 17004 and 17527. Should I take 17661 on top of both?

Comment by James A Simmons [ 18/Dec/15 ]

Drop 17004. It conflicts with 17661. Also Liang has some reservations about the 17004 patch. You need 17661 and 17527 for your testing.

Comment by Gerrit Updater [ 05/Jan/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17527/
Subject: LU-7210 o2iblnd: take extra refcount in kiblnd_connreq_done
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2f033b088745bb98eeeed33c58518fddd16397db

Comment by Gerrit Updater [ 08/Jan/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17004/
Subject: LU-7210 lnet: Change connect peer failed cleanup order
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 53737bf2055cddabf53140dce08d0e1009aab483

Comment by James A Simmons [ 08/Jan/16 ]

Both patches have landed for this ticket.

Comment by Doug Oucharek (Inactive) [ 08/Jan/16 ]

An even more comprehensive fix for these asserts is being done under LU-7569. I'm going to close this ticket, given the patches associated with it are landed, and let LU-7569 be used for further work.

Comment by Jay Lan (Inactive) [ 08/Jan/16 ]

The patchset #3 and #4 of patch http://review.whamcloud.com/17527/ are the same.
If I do not take #4, then one day I would think my patch is one behind the top of tree.
Ir I am to take #4, I would need to do two extra steps: revert #3 and then cherry-pick #4 to have a clean bookkeeping.

Is the creation of #4 really necessary?

Comment by Jay Lan (Inactive) [ 08/Jan/16 ]

Hmmm. Should we take http://review.whamcloud.com/#/c/17004/ or drop it?
James said to drop it on Dec 18 then that patch was landed today...

Comment by James A Simmons [ 08/Jan/16 ]

Yeah I was surprised that patch 17004 was landed. Liang do you recommend Jay use this patch?

Comment by Doug Oucharek (Inactive) [ 08/Jan/16 ]

The original problem reported in this ticket was caused by http://review.whamcloud.com/#/c/14600/ where it put two flag bits in the same byte. That created a potential race as flipping bits is not atomic operations and there was not a lock which was protecting flag changes. So, ibp_connecting would get set back to 0 when there was a race with a change to another bit (such as ibp_scheduled).

Patch http://review.whamcloud.com/#/c/14600/ has been reverted which will eliminate this assert from happening (so we are closing this ticket). However we still need the fix done by http://review.whamcloud.com/#/c/14600/ and that is being redone under LU-7569.

With regards to 17004: this corrected an inconsistency I found in the code which opened a very small race between the lock being released and reacquired. It is not a fix for this issue, but is still a good cleanup to have in master to rule it out when we investigate future races.

So, in summary, 17004 does not need to be applied to existing installations, but should land to master as a good cleanup. Likewise, 17527 is a good cleanup. The true fix here is to revert 14600 which has been done to master recently.

Comment by Jay Lan (Inactive) [ 09/Jan/16 ]

Thanks for a great summary of the patches, Doug! This helps!

Comment by Joseph Gmitter (Inactive) [ 16/Mar/16 ]

Reopening/resolving in order to be able to apply the FixVersion of 2.8.0 to reflect the work that was landed here.

Comment by Jay Lan (Inactive) [ 12/Oct/16 ]

I rebased NASA Ames's nas-2.7.2 branch to b2_7_fe on Aug 20, which picked up 58 patches up to this one:
LU-8019 llite: Restore proper opencache operations

The resulting client codes hung on mounting a lustre fs with errors:

LustreError: 15c-8: MGC10.151.26.117@o2ib: The configuration from log 'nbp1-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.

The culprit has been identified to be this commit (change 17527)
LU-7210 o2iblnd: take extra refcount in kiblnd_connreq_done

We have carried that patch since May 4th (nas-2.7.1-2nas) to address our lnet disconnect/reconnect problem. It showed no problem until we rebased to b2_7_fe (9f42af2).

I noticed that b2_7_fe does not include change 17527, On the other hand, b2_8_fe includes two patches from LU-2710: change 17527 and change 17004. So I cherry-picked #17004 to our nas-2.7.2, but the mount problem persisted. If I reverted 17527 instead, the rebased code works fine on mount.

However, I am concerned if the disconnect/reconnect problem we hit before would come back if I back out 17527. Please advise. Thanks!

Comment by Doug Oucharek (Inactive) [ 12/Oct/16 ]

As you can see from b2_8_fe, change 17527 is needed. Can you attach your patched version o2iblnd_cb.c to this ticket? A mismatch of changes may be present causing a problem.

Comment by Jay Lan (Inactive) [ 12/Oct/16 ]

o2iblnd_cb.c_3nasC has been attached.

I compared the o2iblnd_cb.c from 3nasC and from 2nasC, which worked fine, the changes as far as change 17527's concern are the same.

Note that we have been carrying change 17527 for several months without the mount problem until I did the rebase.

Comment by Doug Oucharek (Inactive) [ 13/Oct/16 ]

Looking at o2iblnd_cb.c_3nasC, I can see the problem. Under LU-7099, kiblnd_check_sends() was changed to kiblnd_check_sends_locked(). So, the locking was removed from this routine with the expectation that the callers will be holding the lock.

In o2iblnd_cb.c_3nasC, kiblnd_check_sends_locked() has a remaining call to spin_unlock(&conn->ibc_lock) at the end which should not be there. So, you end up with double-unlock calls which will cause the problems you are seeing.

I suspect this happened due to a collision of patches. I noticed that LU-7099 landed for 2.7 FE but not 2.8 FE (I'll see that this gets addressed).

An important side effect of LU-7099: with the removal of all the "lock-dances" when check_sends is called, we no longer need the extra reference count on the connection introduced in this ticket. So, LU-7099 removes 17527 and that is ok.

Comment by Jay Lan (Inactive) [ 13/Oct/16 ]

Thank you Doug for your investigation. We will have 17527 reverted in our nas-2.7.2.

Generated at Sat Feb 10 02:06:56 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.