[LU-15510] Soft lockups on OSS servers during failover with MDS. Created: 01/Feb/22  Updated: 11/Jul/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.6
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: James A Simmons Assignee: Peter Jones
Resolution: Unresolved Votes: 0
Labels: ORNL, ornl
Environment:

OSS server running RHEL7 3.10.0-1160.49.1.el7.x86_64 with ZFS 2.0.7.
Lustre version 2.12.6


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Communication between the MDS and OSS servers failed, so recovery started (IR is disabled). The recovery on the OSS server failed with a lockup:

NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [ll_ost_io01_074:30838]                   

With stack trace:

[95181.855433] CPU: 3 PID: 30807 Comm: ll_ost_io01_070 Kdump: loaded Tainted: P           OE  ------------ T 3.10.0-1160.49.1.el7.x86_64 #1
[95181.855433] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.6.13 12/17/2018
[95181.855434] task: ffff93abd41c6300 ti: ffff93abdc84c000 task.ti: ffff93abdc84c000
[95181.855435] RIP: 0010:[<ffffffff85f17aa2>]  
[95181.855441]  [<ffffffff85f17aa2>] native_queued_spin_lock_slowpath+0x122/0x200
[95181.855441] RSP: 0018:ffff93abdc84fcb8  EFLAGS: 00000246
[95181.855442] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000190000
[95181.855443] RDX: ffff93c4dd69b8c0 RSI: 0000000000290000 RDI: ffff93c367157830
[95181.855443] RBP: ffff93abdc84fcb8 R08: ffff93c4dd65b8c0 R09: 0000000000000000
[95181.855444] R10: ffff93c4dd65f160 R11: fffff40dfe6a9200 R12: ffff93abdc84fc58
[95181.855444] R13: ffff93c3b5c66000 R14: ffff93b8aaa89850 R15: ffffffffc17a7b96
[95181.855445] FS:  0000000000000000(0000) GS:ffff93c4dd640000(0000) knlGS:0000000000000000
[95181.855446] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[95181.855447] CR2: 000000c002f51000 CR3: 0000002fd1d42000 CR4: 00000000007607e0
[95181.855448] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[95181.855448] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[95181.855449] PKRU: 00000000
[95181.855449] Call Trace:
[95181.855454]  [<ffffffff8657dcf3>] queued_spin_lock_slowpath+0xb/0xf
[95181.855459]  [<ffffffff8658baa0>] _raw_spin_lock+0x20/0x30
[95181.855518]  [<ffffffffc1463232>] ptlrpc_server_drop_request+0x1c2/0x6d0 [ptlrpc]
[95181.855545]  [<ffffffffc14637d2>] ptlrpc_server_finish_active_request+0x92/0x140 [ptlrpc]
[95181.855572]  [<ffffffffc1465a41>] ptlrpc_server_handle_request+0x401/0xab0 [ptlrpc]
[95181.855597]  [<ffffffffc14626a5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[95181.855600]  [<ffffffff85ed3233>] ? __wake_up+0x13/0x20
[95181.855625]  [<ffffffffc14691f4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[95181.855650]  [<ffffffffc14686c0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
[95181.855653]  [<ffffffff85ec5e61>] kthread+0xd1/0xe0
[95181.855655]  [<ffffffff85ec5d90>] ? insert_kthread_work+0x40/0x40
[95181.855657]  [<ffffffff86595ddd>] ret_from_fork_nospec_begin+0x7/0x21
[95181.855659]  [<ffffffff85ec5d90>] ? insert_kthread_work+0x40/0x40
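The backtrace shows ll_ost_io service threads spinning in native_queued_spin_lock_slowpath while dropping finished requests, i.e. many I/O threads contending on the same spinlock for long enough to trip the soft-lockup watchdog. If this happens again, a little extra data from the OSS would help narrow down which lock is contended. The commands below are only a sketch of what could be collected, using standard sysrq and lctl facilities (file names are placeholders):

# Soft-lockup threshold is 2 * kernel.watchdog_thresh seconds (22s here implies thresh=10)
sysctl kernel.watchdog_thresh

# Dump backtraces of the tasks currently running on every CPU to dmesg/console,
# to confirm whether all CPUs are spinning in the same path
echo l > /proc/sysrq-trigger

# Dump all task states (very verbose, also goes to dmesg/console)
echo t > /proc/sysrq-trigger

# Capture the Lustre kernel debug log covering the same window
lctl dk /tmp/lustre-debug-oss.log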



 Comments   
Comment by Peter Jones [ 02/Feb/22 ]

James

Is this something that you are working on or an operational issue? If the latter, are there any logs available? Are any patches applied to the vanilla 2.12.6 release?

Peter

Comment by James A Simmons [ 08/Feb/22 ]

I have never seen this bug before, so I was hoping you had run into it before.

Comment by James A Simmons [ 22/Jun/22 ]

We just hit this bug again. Currently our CPT layout looks like:

0:   0 2 4 6 8 10 12 14 16 18 20 22

1:   1 3 5 7 9 11 13 15 17 19 21 23

I wonder if doubling the CPT count would lower the lock contention.
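For reference, the CPT layout is driven by the libcfs module parameters cpu_npartitions and cpu_pattern, so doubling from 2 to 4 partitions could be tried with something like the sketch below. The exact CPU-to-CPT mapping shown is an assumption (it simply splits each existing partition in half) and would need to match the node's real NUMA layout; the setting only takes effect after the Lustre modules are reloaded.

# /etc/modprobe.d/lustre.conf (sketch): 4 CPTs instead of 2 on this 24-CPU node
options libcfs cpu_pattern="0[0,2,4,6,8,10] 1[12,14,16,18,20,22] 2[1,3,5,7,9,11] 3[13,15,17,19,21,23]"
# or let libcfs choose the layout itself:
# options libcfs cpu_npartitions=4

# After reloading the modules, check the resulting layout:
lctl get_param cpu_partition_table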

Comment by Dustin Leverman [ 11/Jul/22 ]

Peter, 

To answer your question from above, this is an operational issue that impacted production. We reverted the code change, so things are now stable. I know that there were patches applied to 2.12.6, but I'm not sure what they are. James would know the details.

 

Thanks,

Dustin 
