Details
Type: Bug
Resolution: Unresolved
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.12.9
Labels: None
Environment:
2-node cluster:
Lustre Client / Rocky 8.5 with Lustre 2.12.9
Lustre Server / CentOS 7.9 with Lustre 2.12.6
Transport: OPA
Severity: 3
Rank (Obsolete): 9223372036854775807
Description
The test setup is a small 2-node cluster with one server and one client. On the client, I ran the following commands in loops, one loop per terminal window:
Window #1:
lnetctl peer del --prim_nid $PEER
lnetctl peer add --prim_nid $PEER
Where $PEER is the NID of the server
Window #2:
touch $BASE_DIR/qq
rm -f $BASE_DIR/*
Where $BASE_DIR is a directory within a mounted Lustre file system from $PEER.
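The two loops above can be combined into one script for easier reproduction. This is only a sketch: the function names, the ITERATIONS count, and the LNETCTL override variable are assumptions introduced for illustration (the report only says the commands were run "in loops"), and lnetctl normally needs root.

```shell
#!/bin/bash
# Hypothetical reproducer sketch. Only the lnetctl/touch/rm commands come
# from the report; the loop counts and function names are assumptions.
LNETCTL="${LNETCTL:-lnetctl}"      # override only for dry runs
ITERATIONS="${ITERATIONS:-1000}"   # assumed; the report gives no count

# Window #1: repeatedly delete and re-add the server peer.
peer_loop() {
    local peer="$1" n="$2" i
    for ((i = 0; i < n; i++)); do
        "$LNETCTL" peer del --prim_nid "$peer"
        "$LNETCTL" peer add --prim_nid "$peer"
    done
}

# Window #2: repeatedly create and remove a file on the Lustre mount.
file_loop() {
    local dir="$1" n="$2" i
    for ((i = 0; i < n; i++)); do
        touch "$dir/qq"
        rm -f "$dir"/*
    done
}

# Run both loops concurrently and wait for both to finish.
run_both() {
    local peer="$1" dir="$2" n="$3"
    peer_loop "$peer" "$n" &
    local peer_pid=$!
    file_loop "$dir" "$n" &
    local file_pid=$!
    wait "$peer_pid" "$file_pid"
}

# Example invocation (commented out; may crash the client kernel):
# run_both "$PEER" "$BASE_DIR" "$ITERATIONS"
```

As in the original steps, $PEER is the server NID and $BASE_DIR is a directory on the Lustre mount. Note that file_loop removes everything in that directory, so it should point at a scratch directory.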
Running either loop on its own causes no issues. Running both at the same time triggers a kernel BUG:
[ 8763.523887] list_add corruption. prev->next should be next (ffff8e98555a72e0), but was ffff8e98056d2810. (prev=ffff8e98056d2810).
[ 8763.536952] ------------[ cut here ]------------
[ 8763.536953] kernel BUG at lib/list_debug.c:28!
[ 8763.541933] invalid opcode: 0000 [#1] SMP PTI
[ 8763.546809] CPU: 9 PID: 18262 Comm: lnet_discovery Kdump: loaded Tainted: G OE --------- - - 4.18.0-348.el8.0.2.x86_64 #1
[ 8763.560434] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.1029.090220201031 09/02/2020
[ 8763.572117] RIP: 0010:__list_add_valid.cold.0+0x26/0x28
[ 8763.577964] Code: 00 00 00 c3 48 89 d1 48 c7 c7 88 6e 51 91 48 89 c2 e8 a0 da ca ff 0f 0b 48 89 c1 4c 89 c6 48 c7 c7 e0 6e 51 91 e8 8c da ca ff <0f> 0b 48 89 fe 48 89 c2 48 c7 c7 70 6f 51 91 e8 78 da ca ff 0f 0b
[ 8763.598976] RSP: 0018:ffffb7c9658e7d80 EFLAGS: 00010246
[ 8763.604823] RAX: 0000000000000075 RBX: ffff8e98555a72c0 RCX: 0000000000000000
[ 8763.612808] RDX: 0000000000000000 RSI: ffff8ea73fa56818 RDI: ffff8ea73fa56818
[ 8763.620794] RBP: ffff8e98056d2800 R08: 0000000000006475 R09: 0000000000aaaaaa
[ 8763.628779] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8e98555a730c
[ 8763.636765] R13: ffff8e98056d2810 R14: ffff8e9931905c00 R15: 0000000000000000
[ 8763.644751] FS: 0000000000000000(0000) GS:ffff8ea73fa40000(0000) knlGS:0000000000000000
[ 8763.653806] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8763.660235] CR2: 000056304a85f748 CR3: 0000000f35610005 CR4: 00000000001706e0
[ 8763.668221] Call Trace:
[ 8763.670972] lnet_select_pathway+0xebd/0x12d0 [lnet]
[ 8763.676538] lnet_send+0x5d/0x1b0 [lnet]
[ 8763.680936] lnet_peer_discovery+0x277/0x11f0 [lnet]
[ 8763.686491] ? __schedule+0x2cc/0x700
[ 8763.690591] ? finish_wait+0x80/0x80
[ 8763.694600] ? lnet_peer_merge_data+0xd50/0xd50 [lnet]
[ 8763.700350] kthread+0x116/0x130
[ 8763.703962] ? kthread_flush_work_fn+0x10/0x10
[ 8763.708933] ret_from_fork+0x35/0x40
Issue Links
- is related to LU-16349 Excessive number of OPA disconnects / LNET network errors in cluster (Resolved)