Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.0
    • Environment: CentOS 7.6, kernel 3.10.0-957.1.3.el7_lustre.x86_64, MOFED 4.5
    • Severity: 3
    • 9223372036854775807

    Description

      This just occurred on a Lustre 2.12.0 client running Robinhood. It is a many-core machine with 2 x AMD EPYC 7401 CPUs, so 96 CPU threads total.

      [1488057.711176] CPU: 63 PID: 54246 Comm: ptlrpcd_07_06 Kdump: loaded Tainted: G           OEL ------------   3.10.0-957.1.3.el7_lustre.x86_64 #1
      [1488057.711177] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.3.6 04/20/2018
      [1488057.711178] task: ffff8abafc522080 ti: ffff8abaf4f54000 task.ti: ffff8abaf4f54000
      [1488057.711182] RIP: 0010:[<ffffffff897121e6>]  [<ffffffff897121e6>] native_queued_spin_lock_slowpath+0x126/0x200
      [1488057.711183] RSP: 0018:ffff8abaf4f57b78  EFLAGS: 00000246
      [1488057.711183] RAX: 0000000000000000 RBX: ffffffffc0ff5a97 RCX: 0000000001f90000
      [1488057.711184] RDX: ffff8a9affa9b780 RSI: 0000000000910000 RDI: ffff8adafa7b1b00
      [1488057.711185] RBP: ffff8abaf4f57b78 R08: ffff8afb3f9db780 R09: 0000000000000000
      [1488057.711185] R10: 0000000000000000 R11: 000000000000000f R12: ffff8aed569e2a00
      [1488057.711186] R13: 0005c4aa17c51860 R14: ffff8aed24f64b00 R15: 0000000000000007
      [1488057.711187] FS:  00007ea1d389b700(0000) GS:ffff8afb3f9c0000(0000) knlGS:0000000000000000
      [1488057.711188] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [1488057.711189] CR2: 00007ecf3e8c7ff6 CR3: 000000276ae10000 CR4: 00000000003407e0
      [1488057.711190] Call Trace:
      [1488057.711192]  [<ffffffff89d5bfcb>] queued_spin_lock_slowpath+0xb/0xf
      [1488057.711193]  [<ffffffff89d6a480>] _raw_spin_lock+0x20/0x30
      [1488057.711201]  [<ffffffffc0c23418>] cfs_percpt_lock+0x58/0x110 [libcfs]
      [1488057.711211]  [<ffffffffc0c889d8>] LNetMDUnlink+0x78/0x180 [lnet]
      [1488057.711250]  [<ffffffffc0f276bf>] ptlrpc_unregister_reply+0xbf/0x790 [ptlrpc]
      [1488057.711287]  [<ffffffffc0f2c35e>] ptlrpc_expire_one_request+0xee/0x520 [ptlrpc]
      [1488057.711324]  [<ffffffffc0f2c83f>] ptlrpc_expired_set+0xaf/0x1a0 [ptlrpc]
      [1488057.711362]  [<ffffffffc0f5cc5c>] ptlrpcd+0x28c/0x550 [ptlrpc]
      [1488057.711364]  [<ffffffff896d67b0>] ? wake_up_state+0x20/0x20
      [1488057.711402]  [<ffffffffc0f5c9d0>] ? ptlrpcd_check+0x590/0x590 [ptlrpc]
      [1488057.711404]  [<ffffffff896c1c31>] kthread+0xd1/0xe0
      [1488057.711406]  [<ffffffff896c1b60>] ? insert_kthread_work+0x40/0x40
      [1488057.711408]  [<ffffffff89d74c24>] ret_from_fork_nospec_begin+0xe/0x21
      [1488057.711410]  [<ffffffff896c1b60>] ? insert_kthread_work+0x40/0x40
      

      A lot of CPUs were stuck in LNetMDUnlink. The server crashed with a hard lockup at the end. A vmcore is available on demand; vmcore-dmesg.txt is attached.
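
      For reference, a dump like this would typically be examined with the crash utility against the matching kernel-debuginfo vmlinux. A minimal sketch (the paths are illustrative and assume the debuginfo RPM is installed and the dump has been uncompressed):

      # open the dump with the debug vmlinux that matches the running kernel
      crash /usr/lib/debug/lib/modules/3.10.0-957.1.3.el7_lustre.x86_64/vmlinux \
            /var/crash/<timestamp>/vmcore
      # back-traces for all CPUs, to see how many are spinning in LNetMDUnlink
      crash> bt -a
      # list the ptlrpcd threads
      crash> ps | grep ptlrpcd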

      I kept the default lru_size/lru_max_age values (0/3900000) on this server, so I'll try reducing them as follows, matching our Lustre 2.10 Robinhood server for Oak (example commands below):

      • lru_size=100
      • lru_max_age=1200
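
      A minimal sketch of applying these at runtime on the client with lctl (the ldlm.namespaces.* wildcard is an assumption about the usual parameter paths, and the values revert on remount/reboot):

      # apply on the Lustre client
      lctl set_param ldlm.namespaces.*.lru_size=100
      lctl set_param ldlm.namespaces.*.lru_max_age=1200
      # verify
      lctl get_param ldlm.namespaces.*.lru_size ldlm.namespaces.*.lru_max_age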

      Any other recommendations are welcome.

      Thanks!
      Stephane

          Activity

            [LU-11895] CPU lockup in LNetMDUnlink

            Uploaded original crash dump of fir-rbh01 to your FTP as vmcore-fir-rbh01-2019-01-28-15-10-41.gz

            The kernel used is 3.10.0-957.1.3.el7_lustre.x86_64 (even though it's not a Lustre server).
            kernel-debuginfo-3.10.0-957.1.3.el7_lustre.x86_64.rpm and -common should already be available on the FTP.

            The issue that occurred today is on a server with fewer CPUs (dual Xeon E5-2650 v2), running kernel 3.10.0-957.10.1.el7.x86_64 (unpatched).

            sthiell Stephane Thiell added a comment

            Hi Amir,

            Oops, sorry for being so slow on this one; we had only seen the problem once so far. But today a login node started doing the same thing, although it did recover by itself!

            Anyway, we had a lot of soft lockups like this one:

            [1230992.435041] NMI watchdog: BUG: soft lockup - CPU#8 stuck for 23s! [ptlrpcd_00_10:42801]
            [1230992.444173] Modules linked in: binfmt_misc squashfs overlay(T) rpcsec_gss_krb5 nfsv4 dns_resolver fuse mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ptlrpc(OE) ko2iblnd(OE) obdclass(OE)
             lnet(OE) libcfs(OE) xt_multiport ip_set_hash_ip nfsv3 nfs_acl nfs lockd grace ip6t_MASQUERADE nf_nat_masquerade_ipv6 ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_set ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJ
            ECT nf_reject_ipv6 xt_conntrack ip_set_hash_net ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw
             iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter
            [1230992.524464]  rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx5_ib(OE) ib_uverbs(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) dell_rbu cachefiles fscache sb
            _edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm mgag200 i2c_algo_bit irqbypass ttm drm_kms_helper crc32_pclmul syscopyarea ghash_clmulni_intel sysfillrect sysimgblt aesni_intel lrw fb_sys_fops 
            gf128mul drm iTCO_wdt iTCO_vendor_support glue_helper ablk_helper dcdbas cryptd cdc_ether usbnet mii drm_panel_orientation_quirks pcspkr sg ipmi_si joydev wmi lpc_ich ipmi_devintf mei_me mei ipmi_msghandler acpi
            _pad acpi_power_meter ext4 mbcache jbd2 loop auth_rpcgss sunrpc ip_tables xfs mlx4_ib(OE) ib_core(OE) sr_mod cdrom sd_mod crc_t10dif crct10dif_generic bnx2x ahci crct10dif_pclmul
            [1230992.602267]  mlx4_core(OE) crct10dif_common libahci mlx_compat(OE) crc32c_intel mdio devlink ptp libata megaraid_sas pps_core libcrc32c
            [1230992.614517] CPU: 8 PID: 42801 Comm: ptlrpcd_00_10 Kdump: loaded Tainted: G           OEL ------------ T 3.10.0-957.10.1.el7.x86_64 #1
            [1230992.628103] Hardware name: Dell Inc. PowerEdge R620/0GFKVD, BIOS 2.7.0 05/23/2018
            [1230992.636647] task: ffff8eb9f3334100 ti: ffff8eb9f47c4000 task.ti: ffff8eb9f47c4000
            [1230992.645192] RIP: 0010:[<ffffffff9cd12226>]  [<ffffffff9cd12226>] native_queued_spin_lock_slowpath+0x126/0x200
            [1230992.656465] RSP: 0018:ffff8eb9f47c7b78  EFLAGS: 00000246
            [1230992.662583] RAX: 0000000000000000 RBX: ffffffffc1014956 RCX: 0000000000410000
            [1230992.670738] RDX: ffff8eba1f09b780 RSI: 0000000000290001 RDI: ffff8eb21a6e0a00
            [1230992.678893] RBP: ffff8eb9f47c7b78 R08: ffff8eb21f91b780 R09: 0000000000000000
            [1230992.687047] R10: 0000000000000000 R11: 000000000000000f R12: ffff8eb01a572a00
            [1230992.695201] R13: 0005c991f29a0f30 R14: ffff8eb69a22b000 R15: 0000000000000007
            [1230992.703356] FS:  0000000000000000(0000) GS:ffff8eb21f900000(0000) knlGS:0000000000000000
            [1230992.712577] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [1230992.719182] CR2: 00007fd45e897090 CR3: 0000000fb8c10000 CR4: 00000000001607e0
            [1230992.727339] Call Trace:
            [1230992.730262]  [<ffffffff9d35cfcb>] queued_spin_lock_slowpath+0xb/0xf
            [1230992.737453]  [<ffffffff9d36b480>] _raw_spin_lock+0x20/0x30
            [1230992.743785]  [<ffffffffc0bbe418>] cfs_percpt_lock+0x58/0x110 [libcfs]
            [1230992.751181]  [<ffffffffc0c3c9d8>] LNetMDUnlink+0x78/0x180 [lnet]
            [1230992.758106]  [<ffffffffc0fa685f>] ptlrpc_unregister_reply+0xbf/0x790 [ptlrpc]
            [1230992.766280]  [<ffffffffc0fab4fe>] ptlrpc_expire_one_request+0xee/0x520 [ptlrpc]
            [1230992.774648]  [<ffffffffc0fab9df>] ptlrpc_expired_set+0xaf/0x1a0 [ptlrpc]
            [1230992.782343]  [<ffffffffc0fdae8c>] ptlrpcd+0x28c/0x550 [ptlrpc]
            [1230992.789047]  [<ffffffff9ccd67f0>] ? wake_up_state+0x20/0x20
            [1230992.795477]  [<ffffffffc0fdac00>] ? ptlrpcd_check+0x590/0x590 [ptlrpc]
            [1230992.802956]  [<ffffffff9ccc1c71>] kthread+0xd1/0xe0
            [1230992.808590]  [<ffffffff9ccc1ba0>] ? insert_kthread_work+0x40/0x40
            [1230992.815585]  [<ffffffff9d375c37>] ret_from_fork_nospec_begin+0x21/0x21
            [1230992.823063]  [<ffffffff9ccc1ba0>] ? insert_kthread_work+0x40/0x40
            

            When I checked at the time, the login node (Lustre client) was completely loaded, with all 32 of its CPUs at 100% (in ptlrpcd).

            Then, the client recovered with the following logs:

            [1231021.312842] Lustre: 42815:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1554765376/real 1554764903]  req@ffff8eb33a496000 x1629003522849584/t0(0) o103->fir-OST0007-osc-ffff8eba05ae6000@10.0.10.102@o2ib7:17/18 lens 328/224 e 0 to 1 dl 1554765478 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
            [1231021.345646] Lustre: 42815:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1744666 previous similar messages
            [1231161.240794] LNetError: 42764:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
            [1231161.254982] LNetError: 42764:0:(lib-msg.c:811:lnet_is_health_check()) Skipped 1 previous similar message
            [1231171.900676] LNetError: 42765:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
            [1231171.914863] LNetError: 42765:0:(lib-msg.c:811:lnet_is_health_check()) Skipped 1 previous similar message
            [1231247.531671] Lustre: fir-OST000d-osc-ffff8eba05ae6000: Connection to fir-OST000d (at 10.0.10.104@o2ib7) was lost; in progress operations using this service will wait for recovery to complete
            [1231247.550706] Lustre: Skipped 104 previous similar messages                  
            [1231374.205074] LNetError: 42762:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
            [1231374.219259] LNetError: 42762:0:(lib-msg.c:811:lnet_is_health_check()) Skipped 1 previous similar message
            [1231416.399571] LNetError: 42768:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
            [1231416.413765] LNetError: 42768:0:(lib-msg.c:811:lnet_is_health_check()) Skipped 1 previous similar message
            [1231486.750358] LNetError: 42768:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
            [1231486.755230] LustreError: 167-0: fir-MDT0003-mdc-ffff8eba05ae6000: This client was evicted by fir-MDT0003; in progress operations using this service will fail.
            [1231486.755365] LustreError: 34127:0:(file.c:4393:ll_inode_revalidate_fini()) fir: revalidate FID [0x28000f583:0x2cf0:0x0] error: rc = -5
            [1231486.755368] LustreError: 34127:0:(file.c:4393:ll_inode_revalidate_fini()) Skipped 1 previous similar message
            [1231486.805305] LNetError: 42768:0:(lib-msg.c:811:lnet_is_health_check()) Skipped 1 previous similar message
            [1231486.844678] Lustre: fir-MDT0003-mdc-ffff8eba05ae6000: Connection restored to 10.0.10.52@o2ib7 (at 10.0.10.52@o2ib7)
            [1231486.856531] Lustre: Skipped 152 previous similar messages                  
            [1231555.033198] Lustre: Evicted from MGS (at 10.0.10.51@o2ib7) after server handle changed from 0x5c08f3702ce50dae to 0x5c08f37030f7e96c
            [1231557.546491] Lustre: Evicted from fir-MDT0000_UUID (at 10.0.10.51@o2ib7) after server handle changed from 0x5c08f3702ce68e45 to 0x5c08f37030f7e957
            [1231557.561265] LustreError: 167-0: fir-MDT0000-mdc-ffff8eba05ae6000: This client was evicted by fir-MDT0000; in progress operations using this service will fail.
            [1231557.578742] LustreError: 31049:0:(file.c:4393:ll_inode_revalidate_fini()) fir: revalidate FID [0x240000406:0x138:0x0] error: rc = -5
            [1231557.592241] LustreError: 31049:0:(file.c:4393:ll_inode_revalidate_fini()) Skipped 9 previous similar messages
            [1231558.858554] LustreError: 32190:0:(file.c:216:ll_close_inode_openhandle()) fir-clilmv-ffff8eba05ae6000: inode [0x20000fd69:0x154bf:0x0] mdc close failed: rc = -5
            [1231566.830744] LustreError: 32327:0:(file.c:4393:ll_inode_revalidate_fini()) fir: revalidate FID [0x240000406:0x138:0x0] error: rc = -5
            [1231566.844246] LustreError: 32327:0:(file.c:4393:ll_inode_revalidate_fini()) Skipped 2 previous similar messages
            [1231596.575109] LustreError: 166-1: MGC10.0.10.51@o2ib7: Connection to MGS (at 10.0.10.51@o2ib7) was lost; in progress operations using this service will fail
            [1231621.286801] Lustre: 42791:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1554766045/real 0]  req@ffff8eace510a900 x1629003644418896/t0(0) o103->fir-OST002b-osc-ffff8eba05ae6000@10.0.10.108@o2ib7:17/18 lens 328/224 e 0 to 1 dl 1554766077 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
            [1231621.318732] Lustre: 42791:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1009312 previous similar messages
            [1231723.524429] LustreError: 43001:0:(mgc_request.c:599:do_requeue()) failed processing log: -5
            [1231873.637411] Lustre: DEBUG MARKER: Mon Apr  8 16:32:27 2019   
            

            We are not running with the patches from LU-11100 for now. I'll see if I can find the original crash dump.

            sthiell Stephane Thiell added a comment

            This appears similar to: https://jira.whamcloud.com/browse/LU-11100

            Would you be able to try the patches suggested there?

            In the meantime, if you're able to make the vmcore and vmlinux available so I can take a further look, that would be helpful.

            ashehata Amir Shehata (Inactive) added a comment
            [root@fir-rbh01 ~]# lctl get_param cpu*
            cpu_partition_distance=
            0	: 0:10 1:16 2:16 3:16 4:28 5:28 6:22 7:28
            1	: 0:16 1:10 2:16 3:16 4:28 5:28 6:28 7:22
            2	: 0:16 1:16 2:10 3:16 4:22 5:28 6:28 7:28
            3	: 0:16 1:16 2:16 3:10 4:28 5:22 6:28 7:28
            4	: 0:28 1:28 2:22 3:28 4:10 5:16 6:16 7:16
            5	: 0:28 1:28 2:28 3:22 4:16 5:10 6:16 7:16
            6	: 0:22 1:28 2:28 3:28 4:16 5:16 6:10 7:16
            7	: 0:28 1:22 2:28 3:28 4:16 5:16 6:16 7:10
            cpu_partition_table=
            0	: 0 8 16 24 32 40 48 56 64 72 80 88
            1	: 2 10 18 26 34 42 50 58 66 74 82 90
            2	: 4 12 20 28 36 44 52 60 68 76 84 92
            3	: 6 14 22 30 38 46 54 62 70 78 86 94
            4	: 1 9 17 25 33 41 49 57 65 73 81 89
            5	: 3 11 19 27 35 43 51 59 67 75 83 91
            6	: 5 13 21 29 37 45 53 61 69 77 85 93
            7	: 7 15 23 31 39 47 55 63 71 79 87 95
            

            I think that makes sense, as each EPYC CPU socket has 4 NUMA domains.

            [root@fir-rbh01 ~]# numactl --hardware
            available: 8 nodes (0-7)
            node 0 cpus: 0 8 16 24 32 40 48 56 64 72 80 88
            node 0 size: 65213 MB
            node 0 free: 41206 MB
            node 1 cpus: 2 10 18 26 34 42 50 58 66 74 82 90
            node 1 size: 65535 MB
            node 1 free: 34082 MB
            node 2 cpus: 4 12 20 28 36 44 52 60 68 76 84 92
            node 2 size: 65535 MB
            node 2 free: 63939 MB
            node 3 cpus: 6 14 22 30 38 46 54 62 70 78 86 94
            node 3 size: 65535 MB
            node 3 free: 63611 MB
            node 4 cpus: 1 9 17 25 33 41 49 57 65 73 81 89
            node 4 size: 65535 MB
            node 4 free: 63923 MB
            node 5 cpus: 3 11 19 27 35 43 51 59 67 75 83 91
            node 5 size: 65535 MB
            node 5 free: 63569 MB
            node 6 cpus: 5 13 21 29 37 45 53 61 69 77 85 93
            node 6 size: 65535 MB
            node 6 free: 63449 MB
            node 7 cpus: 7 15 23 31 39 47 55 63 71 79 87 95
            node 7 size: 65535 MB
            node 7 free: 63535 MB
            node distances:
            node   0   1   2   3   4   5   6   7 
              0:  10  16  16  16  28  28  22  28 
              1:  16  10  16  16  28  28  28  22 
              2:  16  16  10  16  22  28  28  28 
              3:  16  16  16  10  28  22  28  28 
              4:  28  28  22  28  10  16  16  16 
              5:  28  28  28  22  16  10  16  16 
              6:  22  28  28  28  16  16  10  16 
              7:  28  22  28  28  16  16  16  10
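
            For reference (not a recommendation from this ticket, just a sketch of where the knob lives): the CPT layout shown above can be overridden at module load time through the standard libcfs options, for example to experiment with fewer partitions. The values below are purely illustrative:

            # /etc/modprobe.d/libcfs.conf (takes effect on next module load)
            # collapse the auto-detected 8 CPTs (one per NUMA node) into 2:
            options libcfs cpu_npartitions=2
            # or pin explicit CPU ranges per partition instead:
            # options libcfs cpu_pattern="0[0-47] 1[48-95]"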
            
            sthiell Stephane Thiell added a comment

            How many CPTs on this system? lctl get_param cpu*

            adilger Andreas Dilger added a comment

            Amir

            Could you please investigate?

            Thanks

            Peter

            pjones Peter Jones added a comment

            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 9
