[LU-13652] [1575337.260035] LNetError: 8719:0:(peer.c:280:lnet_destroy_peer_locked()) LBUG Created: 09/Jun/20 Updated: 19/Sep/20 Resolved: 19/Sep/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Mahmoud Hanafi | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Severity: | 2 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
OSS LBUG. First time we have seen this.
[1574769.939126] LNetError: 7420:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
[1574769.972906] LNetError: 7420:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 10.151.11.102@o2ib (293): c: 32, oc: 0, rc: 32
[1574968.944839] LNetError: 7420:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
[1574968.978608] LNetError: 7420:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Skipped 3 previous similar messages
[1574969.012379] LNetError: 7420:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 10.151.24.203@o2ib (247): c: 32, oc: 0, rc: 32
[1574969.053585] LNetError: 7420:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Skipped 3 previous similar messages
[1575256.183968] Lustre: nbp8-OST0103: Connection restored to 10dfb7c6-2481-1ba8-d8c9-5458677b6b29 (at 10.151.31.52@o2ib)
[1575256.183973] Lustre: Skipped 15281 previous similar messages
[1575337.223394] LNetError: 8719:0:(peer.c:280:lnet_destroy_peer_locked()) ASSERTION( list_empty(&lp->lp_peer_nets) ) failed:
[1575337.260035] LNetError: 8719:0:(peer.c:280:lnet_destroy_peer_locked()) LBUG
[1575337.283229] Pid: 8719, comm: lnet_discovery 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020
[1575337.283233] Call Trace:
[1575337.283243] [<ffffffffc0cbd7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[1575337.305316] [<ffffffffc0cbd87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[1575337.305340] [<ffffffffc0d56a8a>] lnet_destroy_peer_locked+0x24a/0x350 [lnet]
[1575337.305351] [<ffffffffc0d570c5>] lnet_peer_discovery_complete+0x2a5/0x350 [lnet]
[1575337.305361] [<ffffffffc0d5bd20>] lnet_peer_discovery+0x6c0/0x1150 [lnet]
[1575337.305365] [<ffffffffb20c61f1>] kthread+0xd1/0xe0
[1575337.305368] [<ffffffffb278dd37>] ret_from_fork_nospec_end+0x0/0x39
[1575337.305389] [<ffffffffffffffff>] 0xffffffffffffffff
[1575337.305391] Kernel panic - not syncing: LBUG
[1575337.305393] CPU: 11 PID: 8719 Comm: lnet_discovery Kdump: loaded Tainted: G OE ------------ 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1
[1575337.305394] Hardware name: SGI.COM SUMMIT/S2600GZ, BIOS SE5C600.86B.02.01.0002.082220131453 08/22/2013
[1575337.305395] Call Trace:
[1575337.305399] [<ffffffffb277ac43>] dump_stack+0x19/0x1b
[1575337.305402] [<ffffffffb2774987>] panic+0xe8/0x21f
[1575337.305408] [<ffffffffc0cbd8cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[1575337.305417] [<ffffffffc0d56a8a>] lnet_destroy_peer_locked+0x24a/0x350 [lnet]
[1575337.305425] [<ffffffffc0d570c5>] lnet_peer_discovery_complete+0x2a5/0x350 [lnet]
[1575337.305434] [<ffffffffc0d5bd20>] lnet_peer_discovery+0x6c0/0x1150 [lnet]
[1575337.305436] [<ffffffffb20c72e0>] ? wake_up_atomic_t+0x30/0x30
[1575337.305444] [<ffffffffc0d5b660>] ? lnet_peer_merge_data+0xde0/0xde0 [lnet]
[1575337.305446] [<ffffffffb20c61f1>] kthread+0xd1/0xe0
[1575337.305448] [<ffffffffb20c6120>] ? insert_kthread_work+0x40/0x40
[1575337.305450] [<ffffffffb278dd37>] ret_from_fork_nospec_begin+0x21/0x21
[1575337.305452] [<ffffffffb20c6120>] ? insert_kthread_work+0x40/0x40 |
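When a console log like the one above arrives with its line breaks lost, the kernel timestamps can still be used to pull the entries apart and locate the failed assertion. The following is a minimal sketch, not part of the ticket: the helper names `split_entries` and `find_fatal` are hypothetical, and the sample text is copied from the log in the description.

```python
import re

# Kernel timestamps look like "[1575337.260035]". Call-trace addresses such as
# "[<ffffffffc0cbd7cc>]" and module tags such as "[lnet]" do not match.
TIMESTAMP = re.compile(r"\[(\d+\.\d+)\]")


def split_entries(log_text):
    """Split flattened kernel log text into (timestamp, message) pairs."""
    parts = TIMESTAMP.split(log_text)
    # parts is [prefix, ts1, msg1, ts2, msg2, ...] because of the capture group.
    return [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts) - 1, 2)]


def find_fatal(entries):
    """Keep only the entries that report a failed assertion or an LBUG."""
    return [(ts, msg) for ts, msg in entries
            if "ASSERTION(" in msg or msg.endswith("LBUG")]


sample = (
    "[1575337.223394] LNetError: 8719:0:(peer.c:280:lnet_destroy_peer_locked()) "
    "ASSERTION( list_empty(&lp->lp_peer_nets) ) failed: "
    "[1575337.260035] LNetError: 8719:0:(peer.c:280:lnet_destroy_peer_locked()) LBUG "
    "[1575337.283233] Call Trace: "
    "[1575337.283243] [<ffffffffc0cbd7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]"
)

entries = split_entries(sample)
for ts, msg in find_fatal(entries):
    print(ts, msg)
```

Timestamps are kept as strings so they can be matched against the original log text exactly.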
| Comments |
| Comment by Amir Shehata (Inactive) [ 09/Jun/20 ] |
|
Hi Mahmoud, This looks like it could be a duplicate of this: https://jira.whamcloud.com/browse/LU-9971 Do you have the patches which are outlined on this ticket? |
| Comment by Jay Lan (Inactive) [ 09/Jun/20 ] |
|
Hi Amir,
Thanks, Jay |
| Comment by Peter Jones [ 09/Jun/20 ] |
|
Amir, could you please port these fixes to b2_12 so that they can be considered for 2.12.6 (and NASA can carry the patches in the meantime)? Thanks, Peter |
| Comment by Amir Shehata (Inactive) [ 10/Jun/20 ] |
|
I uploaded the relevant patches to b2_12. |
| Comment by Jay Lan (Inactive) [ 10/Jun/20 ] |
|
Thank you, Amir. I will cherry-pick the two patches you uploaded once they land to b2_12. |
| Comment by Amir Shehata (Inactive) [ 11/Jun/20 ] |
|
Hey Jay, There was a problem with one of the patches, which I just fixed. You'll need three patches:
https://review.whamcloud.com/#/c/38890/2
https://review.whamcloud.com/#/c/38891/2
https://review.whamcloud.com/#/c/38892/2
|
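For anyone carrying these changes locally before they land, each review URL has to be turned into a Git ref that can be fetched and cherry-picked. The sketch below assumes the review server uses the standard Gerrit ref layout, `refs/changes/<last two digits of change>/<change>/<patch set>`; the helper name `gerrit_ref` is hypothetical, and the Git remote to fetch from is not given in this ticket.

```python
import re


def gerrit_ref(change_url):
    """Map '<server>/#/c/<change>/<patchset>' to the matching Gerrit change ref.

    Assumes the standard Gerrit layout refs/changes/NN/<change>/<patchset>,
    where NN is the change number modulo 100, zero-padded to two digits.
    """
    match = re.search(r"/c/(\d+)/(\d+)/?$", change_url)
    if match is None:
        raise ValueError(f"not a change/patch-set URL: {change_url}")
    change, patchset = match.groups()
    shard = int(change) % 100
    return f"refs/changes/{shard:02d}/{change}/{patchset}"


# URLs as listed in the comment above.
urls = [
    "https://review.whamcloud.com/#/c/38890/2",
    "https://review.whamcloud.com/#/c/38891/2",
    "https://review.whamcloud.com/#/c/38892/2",
]

for url in urls:
    print(gerrit_ref(url))
```

Each printed ref can then be passed to `git fetch <remote> <ref>` followed by `git cherry-pick FETCH_HEAD`. The patch-set numbers are the ones quoted on 11/Jun/20 and may be older than what finally landed.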
| Comment by Peter Jones [ 19/Sep/20 ] |
|
It ended up being two patches, but everything is in b2_12 for 2.12.6 now. |