[LU-15510] Soft locks on OSS servers with fail over with MDS. Created: 01/Feb/22 Updated: 11/Jul/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | James A Simmons | Assignee: | Peter Jones |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | ORNL, ornl | ||
| Environment: |
OSS server running RHEL7 3.10.0-1160.49.1.el7.x86_64 with ZFS 2.0.7. |
||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Communication between the MDS and OSS servers failed, so recovery started (IR is disabled). Recovery on the OSS server then failed with a soft lockup:
NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [ll_ost_io01_074:30838]
With stack trace:
[95181.855433] CPU: 3 PID: 30807 Comm: ll_ost_io01_070 Kdump: loaded Tainted: P OE ------------ T 3.10.0-1160.49.1.el7.x86_64 #1 |
| Comments |
| Comment by Peter Jones [ 02/Feb/22 ] |
|
James, is this something that you are working on, or an operational issue? If the latter, are there any logs available? Are any patches applied to the vanilla 2.12.6 release? Peter |
| Comment by James A Simmons [ 08/Feb/22 ] |
|
I have never seen this bug before, so I was hoping you had run into it before. |
| Comment by James A Simmons [ 22/Jun/22 ] |
|
We just hit this bug again. Currently our CPT looks like:
0: 0 2 4 6 8 10 12 14 16 18 20 22
1: 1 3 5 7 9 11 13 15 17 19 21 23
I wonder if doubling the CPT count would lower the lock contention. |
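To make the "doubling the CPT count" idea concrete, here is a minimal Python sketch. It assumes the two partitions hold the even and the odd CPUs 0-23 respectively, splits each partition in half, and prints the result in the bracketed form accepted by the libcfs `cpu_pattern` module option. The helper names are hypothetical, and the half-and-half split is only one possible layout; whether it would actually reduce the lock contention seen here is untested.

```python
def parse_cpt_table(text):
    """Parse lines such as '0: 0 2 4' into {0: [0, 2, 4]}."""
    table = {}
    for line in text.strip().splitlines():
        index, _, cpus = line.partition(":")
        table[int(index)] = [int(c) for c in cpus.split()]
    return table


def double_partitions(table):
    """Split every partition into two halves, renumbering from 0."""
    doubled = {}
    for index in sorted(table):
        cpus = table[index]
        half = (len(cpus) + 1) // 2
        for chunk in (cpus[:half], cpus[half:]):
            if chunk:  # skip the empty half of a single-CPU partition
                doubled[len(doubled)] = chunk
    return doubled


def to_cpu_pattern(table):
    """Render a table as 'N[cpu,cpu,...] ...' (cpu_pattern-style string)."""
    return " ".join(
        f"{index}[{','.join(str(c) for c in cpus)}]"
        for index, cpus in sorted(table.items())
    )


# Assumed current layout: even CPUs in partition 0, odd CPUs in partition 1.
current = """
0: 0 2 4 6 8 10 12 14 16 18 20 22
1: 1 3 5 7 9 11 13 15 17 19 21 23
"""

doubled = double_partitions(parse_cpt_table(current))
print(to_cpu_pattern(doubled))
```

The printed string could then be tried as the value of `cpu_pattern` in the libcfs module options on a test node before touching production; the simpler `cpu_npartitions` option is an alternative if the exact CPU-to-partition mapping does not matter.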
| Comment by Dustin Leverman [ 11/Jul/22 ] |
|
Peter, To answer your question from above, this is an operational issue that impacted production. We reverted the code change so things are now stable. I know that there were patches applied to 2.12.6, but I'm not sure what they are. James would know the details.
Thanks, Dustin |