Details
-
Bug
-
Resolution: Unresolved
-
Major
-
None
-
Lustre 2.12.0
-
CentOS 7.6, kernel 3.10.0-957.1.3.el7_lustre.x86_64, MOFED 4.5
-
3
Description
This just occurred on a Lustre 2.12.0 client running Robinhood. It is a many-core machine with 2 x AMD EPYC 7401, so 96 CPU threads in total.
[1488057.711176] CPU: 63 PID: 54246 Comm: ptlrpcd_07_06 Kdump: loaded Tainted: G OEL ------------ 3.10.0-957.1.3.el7_lustre.x86_64 #1
[1488057.711177] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.3.6 04/20/2018
[1488057.711178] task: ffff8abafc522080 ti: ffff8abaf4f54000 task.ti: ffff8abaf4f54000
[1488057.711182] RIP: 0010:[<ffffffff897121e6>] [<ffffffff897121e6>] native_queued_spin_lock_slowpath+0x126/0x200
[1488057.711183] RSP: 0018:ffff8abaf4f57b78 EFLAGS: 00000246
[1488057.711183] RAX: 0000000000000000 RBX: ffffffffc0ff5a97 RCX: 0000000001f90000
[1488057.711184] RDX: ffff8a9affa9b780 RSI: 0000000000910000 RDI: ffff8adafa7b1b00
[1488057.711185] RBP: ffff8abaf4f57b78 R08: ffff8afb3f9db780 R09: 0000000000000000
[1488057.711185] R10: 0000000000000000 R11: 000000000000000f R12: ffff8aed569e2a00
[1488057.711186] R13: 0005c4aa17c51860 R14: ffff8aed24f64b00 R15: 0000000000000007
[1488057.711187] FS: 00007ea1d389b700(0000) GS:ffff8afb3f9c0000(0000) knlGS:0000000000000000
[1488057.711188] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1488057.711189] CR2: 00007ecf3e8c7ff6 CR3: 000000276ae10000 CR4: 00000000003407e0
[1488057.711190] Call Trace:
[1488057.711192] [<ffffffff89d5bfcb>] queued_spin_lock_slowpath+0xb/0xf
[1488057.711193] [<ffffffff89d6a480>] _raw_spin_lock+0x20/0x30
[1488057.711201] [<ffffffffc0c23418>] cfs_percpt_lock+0x58/0x110 [libcfs]
[1488057.711211] [<ffffffffc0c889d8>] LNetMDUnlink+0x78/0x180 [lnet]
[1488057.711250] [<ffffffffc0f276bf>] ptlrpc_unregister_reply+0xbf/0x790 [ptlrpc]
[1488057.711287] [<ffffffffc0f2c35e>] ptlrpc_expire_one_request+0xee/0x520 [ptlrpc]
[1488057.711324] [<ffffffffc0f2c83f>] ptlrpc_expired_set+0xaf/0x1a0 [ptlrpc]
[1488057.711362] [<ffffffffc0f5cc5c>] ptlrpcd+0x28c/0x550 [ptlrpc]
[1488057.711364] [<ffffffff896d67b0>] ? wake_up_state+0x20/0x20
[1488057.711402] [<ffffffffc0f5c9d0>] ? ptlrpcd_check+0x590/0x590 [ptlrpc]
[1488057.711404] [<ffffffff896c1c31>] kthread+0xd1/0xe0
[1488057.711406] [<ffffffff896c1b60>] ? insert_kthread_work+0x40/0x40
[1488057.711408] [<ffffffff89d74c24>] ret_from_fork_nospec_begin+0xe/0x21
[1488057.711410] [<ffffffff896c1b60>] ? insert_kthread_work+0x40/0x40
Many CPUs were stuck in LNetMDUnlink. The server eventually crashed with a hard lockup. A vmcore is available on demand, and vmcore-dmesg.txt is attached.
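For anyone who wants to check how many CPUs were in LNetMDUnlink from the attached vmcore-dmesg.txt, here is a rough Python sketch. The function name `cpus_in_function` and the `SAMPLE` string are made up for illustration, and it assumes each per-CPU dump starts with a "CPU: N PID:" record followed by its call trace. If dumps from several CPUs are interleaved in the log, the attribution can be wrong, so treat the result as a hint only.

```python
import re

# Matches a kernel timestamp such as "[1488057.711176]".
TS = re.compile(r"\[\s*\d+\.\d+\]")
# Matches the first record of a per-CPU dump, e.g. "CPU: 63 PID: 54246 ...".
CPU = re.compile(r"^CPU:\s*(\d+)\s+PID:")


def cpus_in_function(dmesg_text, func="LNetMDUnlink"):
    """Return the sorted CPU numbers whose dump mentions `func` in a trace.

    Splitting on timestamps (rather than newlines) means this also works
    on a log whose records were flattened onto one line.
    """
    frame = re.compile(r"\b%s\+0x" % re.escape(func))
    cpus = set()
    current = None  # CPU number of the dump we are currently inside
    for record in TS.split(dmesg_text):
        record = record.strip()
        m = CPU.match(record)
        if m:
            current = int(m.group(1))
        elif current is not None and frame.search(record):
            cpus.add(current)
    return sorted(cpus)


# Hypothetical, shortened sample built from the trace in this ticket.
SAMPLE = (
    "[1488057.711176] CPU: 63 PID: 54246 Comm: ptlrpcd_07_06\n"
    "[1488057.711190] Call Trace:\n"
    "[1488057.711211] [<ffffffffc0c889d8>] LNetMDUnlink+0x78/0x180 [lnet]\n"
)

print(cpus_in_function(SAMPLE))
```

On the full file, something like `cpus_in_function(open("vmcore-dmesg.txt").read())` should give the list of CPUs to compare against the 96 threads on this machine.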
I kept the default lru_size/lru_max_age values (0/3900000) on this server, so I'll try reducing them as follows:
- lru_size=100
- lru_max_age=1200
like on our Lustre 2.10 robinhood server for Oak.
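The two settings above could be applied with `lctl set_param` on the `ldlm.namespaces.*` parameters. Below is a minimal Python sketch that builds those commands; the helper names `lru_commands` and `apply_commands` are hypothetical, and applying to every namespace via `*` is an assumption. The unit of lru_max_age may differ between Lustre versions, so it is worth checking the current value with `lctl get_param` before applying 1200.

```python
import subprocess


def lru_commands(lru_size=100, lru_max_age=1200, namespaces="*"):
    """Build lctl argument lists for the two LDLM LRU settings.

    `namespaces` is a glob handled by lctl itself; "*" (all namespaces)
    is an assumption, not something stated in the ticket.
    """
    prefix = "ldlm.namespaces.%s" % namespaces
    return [
        ["lctl", "set_param", "%s.lru_size=%d" % (prefix, lru_size)],
        ["lctl", "set_param", "%s.lru_max_age=%d" % (prefix, lru_max_age)],
    ]


def apply_commands(commands, dry_run=True):
    """Print the commands, or run them (as root on the client) if dry_run is False."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


# Dry run only: shows what would be executed without touching the client.
apply_commands(lru_commands())
```

Values set this way do not survive a remount, so they would need to be reapplied or made persistent separately.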
Any other recommendations are welcome.
Thanks!
Stephane
Attachments
Issue Links
- is related to
  - LU-12194 clients getting soft lockups on 2.10.7 (Open)