[LU-16484] Linux kernel BUG when deleting and adding a peer and using a filesystem - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Unresolved
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.12.9
Labels: None
Environment:
2-node cluster:
Lustre Client / Rocky 8.5 with Lustre 2.12.9
Lustre Server / CentOS 7.9 with Lustre 2.12.6
Transport: OPA

Severity: 3

Description

I used a small 2-node cluster with one server and one client. On the client, I ran the following commands in loops, one loop per terminal window:

Window #1:

lnetctl peer del --prim_nid $PEER
lnetctl peer add --prim_nid $PEER

Here $PEER is the NID of the server.

Window #2:

touch $BASE_DIR/qq
rm -f $BASE_DIR/*

Here $BASE_DIR is a directory within a Lustre file system mounted from $PEER.
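The two loops above can be sketched as a single script. This is only an illustration of the reproduction steps, not the reporter's actual script: the function names (peer_loop, fs_loop, run_both) and the iteration count are assumptions, since the report only says the commands were run "in loops". The lnetctl, touch, and rm commands are the ones quoted above.

```shell
#!/bin/sh
# Reproducer sketch (hypothetical helper names; iteration count is an assumption).

# Window #1: repeatedly delete and re-add the server peer.
# $1 = NID of the server, $2 = number of iterations
peer_loop() {
    _p=0
    while [ "$_p" -lt "$2" ]; do
        lnetctl peer del --prim_nid "$1"
        lnetctl peer add --prim_nid "$1"
        _p=$((_p + 1))
    done
}

# Window #2: repeatedly create a file and remove everything in the directory.
# $1 = directory inside the mounted Lustre file system, $2 = number of iterations
fs_loop() {
    # Refuse an empty or missing directory so "rm -f $dir/*" can never expand to "/*".
    if [ -z "$1" ] || [ ! -d "$1" ]; then
        return 1
    fi
    _f=0
    while [ "$_f" -lt "$2" ]; do
        touch "$1/qq"
        rm -f "$1"/*
        _f=$((_f + 1))
    done
}

# Run both loops concurrently, which is the combination that triggers the BUG.
# $1 = server NID, $2 = directory, $3 = number of iterations
run_both() {
    peer_loop "$1" "$3" &
    fs_loop "$2" "$3" &
    wait
}
```

With the functions defined, running `run_both "$PEER" "$BASE_DIR" 1000` on the client as root would start both loops at once. Note that fs_loop deletes every file in the given directory, so it should only be pointed at a scratch directory.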

Running either loop on its own causes no issues. Running both at the same time results in a kernel BUG:

[ 8763.523887] list_add corruption. prev->next should be next (ffff8e98555a72e0), but was ffff8e98056d2810. (prev=ffff8e98056d2810).
[ 8763.536952] ------------[ cut here ]------------
[ 8763.536953] kernel BUG at lib/list_debug.c:28!
[ 8763.541933] invalid opcode: 0000 [#1] SMP PTI
[ 8763.546809] CPU: 9 PID: 18262 Comm: lnet_discovery Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-348.el8.0.2.x86_64 #1
[ 8763.560434] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.1029.090220201031 09/02/2020
[ 8763.572117] RIP: 0010:__list_add_valid.cold.0+0x26/0x28
[ 8763.577964] Code: 00 00 00 c3 48 89 d1 48 c7 c7 88 6e 51 91 48 89 c2 e8 a0 da ca ff 0f 0b 48 89 c1 4c 89 c6 48 c7 c7 e0 6e 51 91 e8 8c da ca ff <0f> 0b 48 89 fe 48 89 c2 48 c7 c7 70 6f 51 91 e8 78 da ca ff 0f 0b
[ 8763.598976] RSP: 0018:ffffb7c9658e7d80 EFLAGS: 00010246
[ 8763.604823] RAX: 0000000000000075 RBX: ffff8e98555a72c0 RCX: 0000000000000000
[ 8763.612808] RDX: 0000000000000000 RSI: ffff8ea73fa56818 RDI: ffff8ea73fa56818
[ 8763.620794] RBP: ffff8e98056d2800 R08: 0000000000006475 R09: 0000000000aaaaaa
[ 8763.628779] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8e98555a730c
[ 8763.636765] R13: ffff8e98056d2810 R14: ffff8e9931905c00 R15: 0000000000000000
[ 8763.644751] FS:  0000000000000000(0000) GS:ffff8ea73fa40000(0000) knlGS:0000000000000000
[ 8763.653806] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8763.660235] CR2: 000056304a85f748 CR3: 0000000f35610005 CR4: 00000000001706e0
[ 8763.668221] Call Trace:
[ 8763.670972]  lnet_select_pathway+0xebd/0x12d0 [lnet]
[ 8763.676538]  lnet_send+0x5d/0x1b0 [lnet]
[ 8763.680936]  lnet_peer_discovery+0x277/0x11f0 [lnet]
[ 8763.686491]  ? __schedule+0x2cc/0x700
[ 8763.690591]  ? finish_wait+0x80/0x80
[ 8763.694600]  ? lnet_peer_merge_data+0xd50/0xd50 [lnet]
[ 8763.700350]  kthread+0x116/0x130
[ 8763.703962]  ? kthread_flush_work_fn+0x10/0x10
[ 8763.708933]  ret_from_fork+0x35/0x40

Issue Links

is related to: LU-16349 Excessive number of OPA disconnects / LNET network errors in cluster (Resolved)

People

Assignee: WC Triage

Reporter: Dean Luick

Votes: 0

Watchers: 1

Dates

Created: 17/Jan/23 9:03 PM

Updated: 20/Jan/23 5:41 PM