[LU-7210] ASSERTION( peer->ibp_connecting == 0 ) Created: 25/Sep/15 Updated: 13/Oct/16 Resolved: 16/Mar/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.3 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Mahmoud Hanafi | Assignee: | Doug Oucharek (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Attachments: | o2iblnd_cb.c_3nasC |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
As part of LNetError: 6521:0:(o2iblnd.c:377:kiblnd_destroy_peer()) ASSERTION( peer->ibp_connecting == 0 ) failed: |
| Comments |
| Comment by Joseph Gmitter (Inactive) [ 25/Sep/15 ] |
|
Hi Amir, |
| Comment by Amir Shehata (Inactive) [ 29/Sep/15 ] |
|
Would it be possible to provide a stack trace for the crash? |
| Comment by Mahmoud Hanafi [ 29/Sep/15 ] |
PID: 7228 TASK: ffff880c333e1540 CPU: 4 COMMAND: "kiblnd_connd"
#0 [ffff880c345f3a68] machine_kexec at ffffffff81038f0b
#1 [ffff880c345f3ac8] crash_kexec at ffffffff810c5552
#2 [ffff880c345f3b98] kdb_kdump_check at ffffffff81291727
#3 [ffff880c345f3ba8] kdb_main_loop at ffffffff81294917
#4 [ffff880c345f3cb8] kdb_save_running at ffffffff8128ea7e
#5 [ffff880c345f3cc8] kdba_main_loop at ffffffff81476a88
#6 [ffff880c345f3d08] kdb at ffffffff81291c16
#7 [ffff880c345f3d78] panic at ffffffff8155948f
#8 [ffff880c345f3df8] lbug_with_loc at ffffffffa05a4eeb [libcfs]
#9 [ffff880c345f3e18] kiblnd_destroy_peer at ffffffffa0a42e43 [ko2iblnd]
#10 [ffff880c345f3e38] kiblnd_destroy_conn at ffffffffa0a4d197 [ko2iblnd]
#11 [ffff880c345f3e68] kiblnd_connd at ffffffffa0a56fd2 [ko2iblnd]
#12 [ffff880c345f3ee8] kthread at ffffffff8109a106
#13 [ffff880c345f3f48] kernel_thread at ffffffff8100c20a
|
| Comment by Amir Shehata (Inactive) [ 29/Sep/15 ] |
|
I don't see a case where the assert could be triggered. Would it be possible to capture net and neterror debug prior to the bug so we can see how we get into this state? In order to get into this assert, ibp_connecting must be > 0 while the refcount on the peer has gone to 0. I don't see an immediate path in the code where that can happen, but it's clearly happening, so some assumptions that have been made are false. More debug output would help narrow down my investigation. |
| Comment by Mahmoud Hanafi [ 29/Sep/15 ] |
|
Our hosts are set up to panic on lbug, so we are not able to dump the logs. If I can get an easy reproducer I'll try to dump the logs. Is there any info from the crash dump that can help you? |
| Comment by Amir Shehata (Inactive) [ 29/Sep/15 ] |
|
I'm trying to understand the sequence of events that lead up to the crash. If you can provide the System.map, vmlinux, and vmcore, I can try investigating the crash dump a bit more. In the meantime I'm still working out the different paths in the code that could lead to the system state I'm suspecting:
- The peer thinks that it is connecting: ibp_connecting > 0
- The connection being cleaned off the zombie list is likely in state IBLND_CONN_ACTIVE_CONNECT or IBLND_CONN_PASSIVE_WAIT
- The peer's refcount == 1 |
| Comment by Doug Oucharek (Inactive) [ 22/Oct/15 ] |
|
I believe I have found the scenario which will cause this assert.

The change that patch 14600 made to kiblnd_reconnect() is to "retry connection if it's still needed and no other connection attempts are in progress". In the if statement that follows, ibp_connecting is incremented from 1 to 2 regardless of whether we are retrying now or waiting for the failed connection to be cleaned up due to a race condition (the objective of the patch). The failed connection ends up on the zombie list for connd to clean up. Remember that ibp_connecting is still 2 at this point.

In the connd processing, the connection is cleaned up by calling kiblnd_destroy_conn() (the start of the call chain seen above). That routine was altered by patch 14600 to trigger a new peer connection attempt if ibp_tx_queue is not empty; otherwise, it decrements ibp_connecting by one. In both cases, kiblnd_peer_decref() gets called to drop the reference to the peer because the connection was cleaned up.

Now, if we take the path where ibp_tx_queue is empty, we decrement ibp_connecting, making it 1. When kiblnd_peer_decref() is called, it finds the peer refcount at 1, so it believes it needs to call kiblnd_destroy_peer(). That routine finds ibp_connecting with a value of 1 and asserts. If we launch a new peer connection instead, the peer's reference counter gets incremented, so the connection cleanup does not also try to destroy the peer.

So, the question is this: why does ibp_connecting have a value of 2 when we are holding off connecting due to a race condition? Is this a bug in patch 14600? |
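A minimal userspace model of the sequence described above (illustrative only: the struct fields and helper names merely stand in for ibp_refcount, ibp_connecting, kiblnd_peer_decref() and kiblnd_destroy_peer(); this is not the real ko2iblnd code):

#include <assert.h>
#include <stdio.h>

struct peer {
        int refcount;    /* stands in for ibp_refcount */
        int connecting;  /* stands in for ibp_connecting */
};

static void peer_decref(struct peer *p)
{
        if (--p->refcount == 0) {
                /* models kiblnd_destroy_peer(): the assertion fires
                 * because connecting is still 1, not 0 */
                assert(p->connecting == 0);
        }
}

int main(void)
{
        /* state after kiblnd_reconnect() bumped connecting from 1 to 2
         * and the failed connection (holding the last peer reference)
         * was put on the zombie list */
        struct peer p = { .refcount = 1, .connecting = 2 };
        int tx_queue_empty = 1;

        if (tx_queue_empty)
                p.connecting--;  /* 2 -> 1, never reaches 0 */

        peer_decref(&p);         /* last reference dropped: assert fails */
        printf("not reached\n");
        return 0;
}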
| Comment by Gerrit Updater [ 31/Oct/15 ] |
|
Doug Oucharek (doug.s.oucharek@intel.com) uploaded a new patch: http://review.whamcloud.com/17004 |
| Comment by Mahmoud Hanafi [ 07/Dec/15 ] |
|
We have tried this patch; it does help in reducing the occurrence of the LBUG but doesn't eliminate it. |
| Comment by Gerrit Updater [ 09/Dec/15 ] |
|
Liang Zhen (liang.zhen@intel.com) uploaded a new patch: http://review.whamcloud.com/17527 |
| Comment by Jay Lan (Inactive) [ 10/Dec/15 ] |
|
Do we take both #17004 and #17527? |
| Comment by Doug Oucharek (Inactive) [ 10/Dec/15 ] |
|
Yes, take both patches please. Each blocks a window which could trigger this assert. |
| Comment by Jeremy Filizetti [ 17/Dec/15 ] |
|
I've tried with both of these patches and I am still seeing this LBUG. If a host gets rejected with a conn race, there is an increment of ibp_connecting that possibly won't be removed. For instance, if on the kiblnd_destroy_conn->kiblnd_connect_peer reconnect there is another rejection that is not a conn race (i.e. frags/queue depth), then in kiblnd_reconnect:

        /* retry connection if it's still needed and no other connection
         * attempts (active or passive) are in progress
         * NB: reconnect is still needed even when ibp_tx_queue is
         * empty if ibp_version != version because reconnect may be
         * initiated by kiblnd_query() */
        if ((!list_empty(&peer->ibp_tx_queue) ||
             peer->ibp_version != version) &&
            peer->ibp_connecting == 1 &&
            peer->ibp_accepting == 0) {
                if (why == IBLND_REJECT_CONN_RACE) {
                        /* don't reconnect immediately, intensive reconnecting
                         * may consume a lot of memory. kiblnd_destroy_conn
                         * will reconnect after releasing all resources of
                         * this connection */
                        conn->ibc_conn_race = 1;
                } else {
                        retry_now = 1;
                }
                peer->ibp_connecting++;
                peer->ibp_version = version;
                peer->ibp_incarnation = incarnation;
        }

        write_unlock_irqrestore(&kiblnd_data.kib_global_lock, flags);

        if (!retry_now)
                return;

There ends up being no connect, and so no call is made to kiblnd_connreq_done() or kiblnd_peer_connect_failed() to decrement ibp_connecting. |
| Comment by Doug Oucharek (Inactive) [ 17/Dec/15 ] |
|
The fix http://review.whamcloud.com/#/c/14600 was just redesigned last night under a new patch: http://review.whamcloud.com/#/c/17661. Please apply the new version and try it out. It is very new and needs to be validated (something I will be working on today). If you do try it, please share feedback on your results here or |
| Comment by James A Simmons [ 17/Dec/15 ] |
|
There are two patches for this ticket. Do I drop both and just use 17661 for testing? |
| Comment by Jeremy Filizetti [ 17/Dec/15 ] |
|
From just a quick look at the code, I can see it doesn't account for my last comment, but it does look much better to me from a design point of view. I'll add a check for my issue, test the code, and let you know. |
| Comment by Doug Oucharek (Inactive) [ 17/Dec/15 ] |
|
I feel that 17527 is also useful. So 17527 and 17661 should both be used. |
| Comment by James A Simmons [ 17/Dec/15 ] |
|
The patch 17527 conflicts with 17661. |
| Comment by Doug Oucharek (Inactive) [ 17/Dec/15 ] |
|
In that case, drop 17527. |
| Comment by Liang Zhen (Inactive) [ 18/Dec/15 ] |
|
I think I have removed the conflicting part from 17527, so you should be able to keep both of them. |
| Comment by Jay Lan (Inactive) [ 18/Dec/15 ] |
|
How about 17004? We currently picked 17004 and 17527. Should I take 17661 on top of both? |
| Comment by James A Simmons [ 18/Dec/15 ] |
|
Drop 17004. It conflicts with 17661. Also Liang has some reservations about the 17004 patch. You need 17661 and 17527 for your testing. |
| Comment by Gerrit Updater [ 05/Jan/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17527/ |
| Comment by Gerrit Updater [ 08/Jan/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17004/ |
| Comment by James A Simmons [ 08/Jan/16 ] |
|
Both patches have landed for this ticket. |
| Comment by Doug Oucharek (Inactive) [ 08/Jan/16 ] |
|
An even more comprehensive fix for these asserts is being done under |
| Comment by Jay Lan (Inactive) [ 08/Jan/16 ] |
|
Patch sets #3 and #4 of http://review.whamcloud.com/17527/ are the same. Was the creation of #4 really necessary? |
| Comment by Jay Lan (Inactive) [ 08/Jan/16 ] |
|
Hmmm. Should we take http://review.whamcloud.com/#/c/17004/ or drop it? |
| Comment by James A Simmons [ 08/Jan/16 ] |
|
Yeah, I was surprised that patch 17004 was landed. Liang, do you recommend that Jay use this patch? |
| Comment by Doug Oucharek (Inactive) [ 08/Jan/16 ] |
|
The original problem reported in this ticket was caused by http://review.whamcloud.com/#/c/14600/, which put two flag bits in the same byte. That created a potential race: flipping a bit is not an atomic operation, and there was no lock protecting the flag changes. So, ibp_connecting would get set back to 0 when a change to another bit (such as ibp_scheduled) raced with it.

Patch http://review.whamcloud.com/#/c/14600/ has been reverted, which will eliminate this assert from happening (so we are closing this ticket). However, we still need the fix done by http://review.whamcloud.com/#/c/14600/ and that is being redone under

With regards to 17004: this corrected an inconsistency I found in the code which opened a very small race between the lock being released and reacquired. It is not a fix for this issue, but it is still a good cleanup to have in master so we can rule it out when we investigate future races.

So, in summary: 17004 does not need to be applied to existing installations, but should land in master as a good cleanup. Likewise, 17527 is a good cleanup. The true fix here is to revert 14600, which has been done in master recently. |
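A hypothetical illustration of that kind of lost update when two one-bit flags share a byte and the writers do not hold a common lock (the bitfield names only stand in for ibp_connecting and ibp_scheduled, and the interleaving of the two CPUs is simulated sequentially rather than with real threads):

#include <stdio.h>

struct peer_flags {
        unsigned char connecting:1;  /* stands in for ibp_connecting */
        unsigned char scheduled:1;   /* stands in for ibp_scheduled */
};

int main(void)
{
        struct peer_flags flags = { .connecting = 0, .scheduled = 0 };
        struct peer_flags cpu_a;

        /* CPU A begins "flags.scheduled = 1": a bitfield update is a
         * read-modify-write of the whole byte, so it starts from a
         * snapshot of the current value */
        cpu_a = flags;

        /* CPU B sets the other flag in the meantime */
        flags.connecting = 1;

        /* CPU A finishes its read-modify-write using the stale snapshot,
         * overwriting CPU B's update: connecting is back to 0 */
        cpu_a.scheduled = 1;
        flags = cpu_a;

        printf("connecting=%d scheduled=%d\n",
               flags.connecting, flags.scheduled);  /* prints: connecting=0 scheduled=1 */
        return 0;
}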
| Comment by Jay Lan (Inactive) [ 09/Jan/16 ] |
|
Thanks for a great summary of the patches, Doug! This helps! |
| Comment by Joseph Gmitter (Inactive) [ 16/Mar/16 ] |
|
Reopening/resolving in order to be able to apply the FixVersion of 2.8.0 to reflect the work that was landed here. |
| Comment by Jay Lan (Inactive) [ 12/Oct/16 ] |
|
I rebased NASA Ames's nas-2.7.2 branch to b2_7_fe on Aug 20, which picked up 58 patches up to this one:

The resulting client code hung on mounting a lustre fs with errors:

LustreError: 15c-8: MGC10.151.26.117@o2ib: The configuration from log 'nbp1-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.

The culprit has been identified to be this commit (change 17527)

We have carried that patch since May 4th (nas-2.7.1-2nas) to address our lnet disconnect/reconnect problem. It showed no problem until we rebased to b2_7_fe (9f42af2). I noticed that b2_7_fe does not include change 17527,

On the other hand, b2_8_fe includes two patches from

However, I am concerned that the disconnect/reconnect problem we hit before would come back if I back out 17527. Please advise. Thanks! |
| Comment by Doug Oucharek (Inactive) [ 12/Oct/16 ] |
|
As you can see from b2_8_fe, change 17527 is needed. Can you attach your patched version of o2iblnd_cb.c to this ticket? A mismatch of changes may be present, causing a problem. |
| Comment by Jay Lan (Inactive) [ 12/Oct/16 ] |
|
o2iblnd_cb.c_3nasC has been attached. I compared the o2iblnd_cb.c from 3nasC with the one from 2nasC, which worked fine; as far as change 17527 is concerned, the changes are the same. Note that we have been carrying change 17527 for several months without the mount problem until I did the rebase. |
| Comment by Doug Oucharek (Inactive) [ 13/Oct/16 ] |
|
Looking at o2iblnd_cb.c_3nasC, I can see the problem. Under

In o2iblnd_cb.c_3nasC, kiblnd_check_sends_locked() has a remaining call to spin_unlock(&conn->ibc_lock) at the end which should not be there. So, you end up with double-unlock calls, which will cause the problems you are seeing. I suspect this happened due to a collision of patches.

I noticed that

An important side effect of |
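A minimal userspace sketch of the locking convention described here, using a pthread mutex in place of the kernel spinlock (the names are illustrative, not the actual ko2iblnd source). The _locked suffix means the caller already holds ibc_lock and keeps holding it after the call returns, so a leftover unlock at the end of the routine turns the caller's own unlock into a double unlock:

#include <pthread.h>
#include <stdio.h>

struct conn {
        pthread_mutex_t ibc_lock;  /* stands in for the kernel spinlock */
};

/* caller holds ibc_lock across this call and releases it afterwards */
static void check_sends_locked(struct conn *conn)
{
        /* ... work done while the lock is held ... */

        /* The mis-merged file had an extra unlock here, equivalent to:
         *     pthread_mutex_unlock(&conn->ibc_lock);
         * which makes the caller's unlock below a double unlock. */
}

int main(void)
{
        struct conn c;

        pthread_mutex_init(&c.ibc_lock, NULL);

        pthread_mutex_lock(&c.ibc_lock);
        check_sends_locked(&c);
        pthread_mutex_unlock(&c.ibc_lock);  /* would be the second unlock
                                             * if the routine had already
                                             * released the lock */

        pthread_mutex_destroy(&c.ibc_lock);
        puts("ok");
        return 0;
}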
| Comment by Jay Lan (Inactive) [ 13/Oct/16 ] |
|
Thank you Doug for your investigation. We will have 17527 reverted in our nas-2.7.2. |