[LU-16484] Linux kernel BUG when deleting and adding a peer and using a filesystem
Created: 17/Jan/23  Updated: 20/Jan/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.9
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Dean Luick
Assignee: WC Triage
Resolution: Unresolved
Votes: 0
Labels: None
Environment:

2-node cluster:
Lustre Client / Rocky 8.5 with Lustre 2.12.9
Lustre Server / CentOS 7.9 with Lustre 2.12.6
Transport: OPA


Issue Links:
Related
is related to LU-16349 Excessive number of OPA disconnects /... Resolved
Severity: 3

 Description   

I am using a small 2-node cluster with 1 server and 1 client. On the client, I ran the following commands in loops, one loop in each of two windows:

Window #1:

        lnetctl peer del --prim_nid $PEER
        lnetctl peer add --prim_nid $PEER

Where $PEER is the NID of the server.

Window #2:

        touch $BASE_DIR/qq
        rm -f $BASE_DIR/*

Where $BASE_DIR is a directory within a mounted Lustre file system from $PEER.

Running either loop on its own causes no issues. Running both loops at the same time results in a kernel BUG in the lnet_discovery thread:
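The two loops above can also be started concurrently from a single shell. The following is a minimal sketch, not the script used for this report: the function names, the iteration count, and the LNETCTL override variable are assumptions added for illustration. Only the lnetctl, touch, and rm commands come from the description.

```shell
#!/bin/sh
# Hypothetical wrapper around the two loops from the description.
# LNETCTL, the function names, and the iteration counts are illustrative
# assumptions; set LNETCTL=echo for a dry run without touching LNet.
LNETCTL=${LNETCTL:-lnetctl}

# Window #1: delete and re-add the server peer $1, $2 times.
peer_loop() {
    pl_i=0
    while [ "$pl_i" -lt "$2" ]; do
        "$LNETCTL" peer del --prim_nid "$1"
        "$LNETCTL" peer add --prim_nid "$1"
        pl_i=$((pl_i + 1))
    done
}

# Window #2: create and remove files in directory $1, $2 times.
fs_loop() {
    # Refuse an empty or missing directory so "rm -f /*" can never run.
    [ -n "$1" ] && [ -d "$1" ] || return 1
    fl_i=0
    while [ "$fl_i" -lt "$2" ]; do
        touch "$1/qq"
        rm -f "$1"/*
        fl_i=$((fl_i + 1))
    done
}

# Run both loops at the same time and wait for both to finish.
# $1 = peer NID, $2 = directory, $3 = iterations (default 1000).
run_repro() {
    peer_loop "$1" "${3:-1000}" &
    rr_peer_pid=$!
    fs_loop "$2" "${3:-1000}" &
    rr_fs_pid=$!
    wait "$rr_peer_pid" "$rr_fs_pid"
}
```

With $PEER and $BASE_DIR set as described, `run_repro "$PEER" "$BASE_DIR"` starts both loops. Setting `LNETCTL=echo` first prints the lnetctl commands instead of executing them, which is a safe way to check the script before running it on a real client.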

 

[ 8763.523887] list_add corruption. prev->next should be next (ffff8e98555a72e0), but was ffff8e98056d2810. (prev=ffff8e98056d2810).
[ 8763.536952] ------------[ cut here ]------------
[ 8763.536953] kernel BUG at lib/list_debug.c:28!
[ 8763.541933] invalid opcode: 0000 [#1] SMP PTI
[ 8763.546809] CPU: 9 PID: 18262 Comm: lnet_discovery Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-348.el8.0.2.x86_64 #1
[ 8763.560434] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.1029.090220201031 09/02/2020
[ 8763.572117] RIP: 0010:__list_add_valid.cold.0+0x26/0x28
[ 8763.577964] Code: 00 00 00 c3 48 89 d1 48 c7 c7 88 6e 51 91 48 89 c2 e8 a0 da ca ff 0f 0b 48 89 c1 4c 89 c6 48 c7 c7 e0 6e 51 91 e8 8c da ca ff <0f> 0b 48 89 fe 48 89 c2 48 c7 c7 70 6f 51 91 e8 78 da ca ff 0f 0b
[ 8763.598976] RSP: 0018:ffffb7c9658e7d80 EFLAGS: 00010246
[ 8763.604823] RAX: 0000000000000075 RBX: ffff8e98555a72c0 RCX: 0000000000000000
[ 8763.612808] RDX: 0000000000000000 RSI: ffff8ea73fa56818 RDI: ffff8ea73fa56818
[ 8763.620794] RBP: ffff8e98056d2800 R08: 0000000000006475 R09: 0000000000aaaaaa
[ 8763.628779] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8e98555a730c
[ 8763.636765] R13: ffff8e98056d2810 R14: ffff8e9931905c00 R15: 0000000000000000
[ 8763.644751] FS:  0000000000000000(0000) GS:ffff8ea73fa40000(0000) knlGS:0000000000000000
[ 8763.653806] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8763.660235] CR2: 000056304a85f748 CR3: 0000000f35610005 CR4: 00000000001706e0
[ 8763.668221] Call Trace:
[ 8763.670972]  lnet_select_pathway+0xebd/0x12d0 [lnet]
[ 8763.676538]  lnet_send+0x5d/0x1b0 [lnet]
[ 8763.680936]  lnet_peer_discovery+0x277/0x11f0 [lnet]
[ 8763.686491]  ? __schedule+0x2cc/0x700
[ 8763.690591]  ? finish_wait+0x80/0x80
[ 8763.694600]  ? lnet_peer_merge_data+0xd50/0xd50 [lnet]
[ 8763.700350]  kthread+0x116/0x130
[ 8763.703962]  ? kthread_flush_work_fn+0x10/0x10
[ 8763.708933]  ret_from_fork+0x35/0x40

 

 



 Comments   
Comment by Dean Luick [ 17/Jan/23 ]

The steps in the description were an attempt to recreate a failure described in LU-16349.

Generated at Sat Feb 10 03:27:25 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.