[LU-13652] [1575337.260035] LNetError: 8719:0:(peer.c:280:lnet_destroy_peer_locked()) LBUG
Created: 09/Jun/20
Updated: 19/Sep/20
Resolved: 19/Sep/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.4
Fix Version/s: None

Type: Bug
Priority: Major
Reporter: Mahmoud Hanafi
Assignee: Amir Shehata (Inactive)
Resolution: Duplicate
Votes: 0
Labels: None

Issue Links:
Related
is related to LU-9971 MR: ABA problem in lnet_discover_peer... Resolved
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

An OSS hit an LBUG in lnet_destroy_peer_locked(). This is the first time we have seen it.

 

[1574769.939126] LNetError: 7420:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
[1574769.972906] LNetError: 7420:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 10.151.11.102@o2ib (293): c: 32, oc: 0, rc: 32
[1574968.944839] LNetError: 7420:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
[1574968.978608] LNetError: 7420:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Skipped 3 previous similar messages
[1574969.012379] LNetError: 7420:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 10.151.24.203@o2ib (247): c: 32, oc: 0, rc: 32
[1574969.053585] LNetError: 7420:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Skipped 3 previous similar messages
[1575256.183968] Lustre: nbp8-OST0103: Connection restored to 10dfb7c6-2481-1ba8-d8c9-5458677b6b29 (at 10.151.31.52@o2ib)
[1575256.183973] Lustre: Skipped 15281 previous similar messages
[1575337.223394] LNetError: 8719:0:(peer.c:280:lnet_destroy_peer_locked()) ASSERTION( list_empty(&lp->lp_peer_nets) ) failed: 
[1575337.260035] LNetError: 8719:0:(peer.c:280:lnet_destroy_peer_locked()) LBUG
[1575337.283229] Pid: 8719, comm: lnet_discovery 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020
[1575337.283233] Call Trace:
[1575337.283243]  [<ffffffffc0cbd7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[1575337.305316]  [<ffffffffc0cbd87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[1575337.305340]  [<ffffffffc0d56a8a>] lnet_destroy_peer_locked+0x24a/0x350 [lnet]
[1575337.305351]  [<ffffffffc0d570c5>] lnet_peer_discovery_complete+0x2a5/0x350 [lnet]
[1575337.305361]  [<ffffffffc0d5bd20>] lnet_peer_discovery+0x6c0/0x1150 [lnet]
[1575337.305365]  [<ffffffffb20c61f1>] kthread+0xd1/0xe0
[1575337.305368]  [<ffffffffb278dd37>] ret_from_fork_nospec_end+0x0/0x39
[1575337.305389]  [<ffffffffffffffff>] 0xffffffffffffffff
[1575337.305391] Kernel panic - not syncing: LBUG
[1575337.305393] CPU: 11 PID: 8719 Comm: lnet_discovery Kdump: loaded Tainted: G           OE  ------------   3.10.0-1062.12.1.el7_lustre2124.x86_64 #1
[1575337.305394] Hardware name: SGI.COM SUMMIT/S2600GZ, BIOS SE5C600.86B.02.01.0002.082220131453 08/22/2013
[1575337.305395] Call Trace:
[1575337.305399]  [<ffffffffb277ac43>] dump_stack+0x19/0x1b
[1575337.305402]  [<ffffffffb2774987>] panic+0xe8/0x21f
[1575337.305408]  [<ffffffffc0cbd8cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[1575337.305417]  [<ffffffffc0d56a8a>] lnet_destroy_peer_locked+0x24a/0x350 [lnet]
[1575337.305425]  [<ffffffffc0d570c5>] lnet_peer_discovery_complete+0x2a5/0x350 [lnet]
[1575337.305434]  [<ffffffffc0d5bd20>] lnet_peer_discovery+0x6c0/0x1150 [lnet]
[1575337.305436]  [<ffffffffb20c72e0>] ? wake_up_atomic_t+0x30/0x30
[1575337.305444]  [<ffffffffc0d5b660>] ? lnet_peer_merge_data+0xde0/0xde0 [lnet]
[1575337.305446]  [<ffffffffb20c61f1>] kthread+0xd1/0xe0
[1575337.305448]  [<ffffffffb20c6120>] ? insert_kthread_work+0x40/0x40
[1575337.305450]  [<ffffffffb278dd37>] ret_from_fork_nospec_begin+0x21/0x21
[1575337.305452]  [<ffffffffb20c6120>] ? insert_kthread_work+0x40/0x40
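When triaging a console log like the one above, it can help to pull out just the assertion and LBUG records. The following is a minimal sketch, not part of the ticket: the regular expression and the helper names `parse_line` and `find_fatal` are illustrative assumptions, and the pattern only targets the `pid:ext:(file:line:func())` message prefix seen in this log.

```python
import re

# Assumed layout, taken from the lines in this ticket:
# [1575337.260035] LNetError: 8719:0:(peer.c:280:lnet_destroy_peer_locked()) LBUG
LINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+"      # kernel timestamp
    r"(?P<level>\w+):\s+"              # e.g. LNetError
    r"(?P<pid>\d+):\d+:"               # pid and the field after it
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s*"
    r"(?P<msg>.*)"                     # remainder of the message
)


def parse_line(text):
    """Return a dict of fields for a matching log line, or None."""
    m = LINE_RE.search(text)
    if m is None:
        return None
    rec = m.groupdict()
    rec["ts"] = float(rec["ts"])
    rec["pid"] = int(rec["pid"])
    rec["line"] = int(rec["line"])
    rec["msg"] = rec["msg"].strip()
    return rec


def find_fatal(lines):
    """Collect records whose message is a failed ASSERTION or an LBUG."""
    hits = []
    for text in lines:
        rec = parse_line(text)
        if rec and (rec["msg"].startswith("ASSERTION") or rec["msg"] == "LBUG"):
            hits.append(rec)
    return hits
```

Run over this log, the sketch would report the two records from peer.c:280 in lnet_destroy_peer_locked() and skip the RDMA timeout and reconnect messages.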


 Comments   
Comment by Amir Shehata (Inactive) [ 09/Jun/20 ]

Hi Mahmoud,

This looks like it could be a duplicate of LU-9971: https://jira.whamcloud.com/browse/LU-9971

Do you have the patches outlined on that ticket?

Comment by Jay Lan (Inactive) [ 09/Jun/20 ]

Hi Amir,

The LU-9971 commit did not land to b2_12.

Thanks, Jay

Comment by Peter Jones [ 09/Jun/20 ]

Amir

Could you please port these fixes to b2_12 so that they can be considered for 2.12.6 (and NASA can carry the patches in the meantime)?

Thanks

Peter

Comment by Amir Shehata (Inactive) [ 10/Jun/20 ]

I uploaded the relevant patches to b2_12.

Comment by Jay Lan (Inactive) [ 10/Jun/20 ]

Thank you, Amir. I will cherry-pick the two patches you uploaded once they land to b2_12.

Comment by Amir Shehata (Inactive) [ 11/Jun/20 ]

Hey Jay, there was a problem with one of the patches, which I just fixed.

You'll need three patches:

https://review.whamcloud.com/#/c/38890/2

https://review.whamcloud.com/#/c/38891/2

https://review.whamcloud.com/#/c/38892/2
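For anyone carrying these changes locally before they land, the review URLs above map onto Gerrit change refs. The sketch below assumes the standard Gerrit layout `refs/changes/<last two digits of change>/<change>/<patchset>`; the helper name `gerrit_change_ref` is hypothetical and not part of any Lustre tooling. Note that a later comment on this ticket says only two patches landed in the end, so check the current state of each review before picking it.

```python
import re


def gerrit_change_ref(url):
    """Build the Gerrit fetch ref for a review URL ending in /c/<change>/<patchset>."""
    m = re.search(r"/c/(\d+)/(\d+)/?$", url)
    if m is None:
        raise ValueError("not a change/patchset URL: " + url)
    change, patchset = int(m.group(1)), int(m.group(2))
    # Gerrit shards refs by the last two digits of the change number.
    return "refs/changes/{:02d}/{}/{}".format(change % 100, change, patchset)


for url in (
    "https://review.whamcloud.com/#/c/38890/2",
    "https://review.whamcloud.com/#/c/38891/2",
    "https://review.whamcloud.com/#/c/38892/2",
):
    print(gerrit_change_ref(url))
```

Each resulting ref can then be fetched from the Gerrit remote with `git fetch <remote> <ref>` and applied with `git cherry-pick FETCH_HEAD`, where `<remote>` is whatever name the local checkout uses for the review server.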

 

Comment by Peter Jones [ 19/Sep/20 ]

It ended up being two patches rather than three, but everything is in b2_12 for 2.12.6 now.
