Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Affects Version/s: Lustre 2.10.3
    • Environment: CentOS 7.4, kernel 3.10.0-693.2.2.el7_lustre.pl1.x86_64
    • Severity: 3

    Description

      We got another OSS deadlock on Oak last night. It is likely a regression in 2.10.3.

      Since the upgrade to 2.10.3, these servers have generally not been stable for more than 48 hours. This issue might be related to the OSS situation described in LU-10697. As for the latest MDS instabilities, it sounds like they will be fixed by LU-10680.

      In this case, the OSS deadlock occurred on oak-io2-s1. The OSTs from its partner (oak-io2-s2) had already been migrated to it due to a previous deadlock/issue, so 48 OSTs were mounted.

      Timeframe overview:
      Feb 21 11:28:49: OSTs from oak-io2-s2 migrated to oak-io2-s1
      Feb 23 19:05:04: first stack trace of stuck thread (oak-io2-s1 kernel: Pid: 17265, comm: ll_ost00_032)
      Feb 23 22:59: monitoring reports that ssh to oak-io2-s1 doesn't work anymore
      Feb 23 23:01:51 oak-io2-s1 kernel: INFO: task kswapd0:264 blocked for more than 120 seconds.
      Feb 24 02:03:56 manual crash dump taken of oak-io2-s1

      Attaching the following files:

      • kernel logs in oak-io2-s1_kernel.log (where you can find most of the details in the timeframe above)
      • vmcore-dmesg: oak-io2-s1_vmcore-dmesg.txt
      • crash foreach bt: oak_io2-s1_foreach_bt.txt
      • kernel memory usage: oak-io2-s1_kmem.txt
      • vmcore (oak-io2-s1-vmcore-2018-02-24-02_03_56.gz):

      https://stanford.box.com/s/n8ft8quvr6ubuvd12ukdsoarmrz4uixr
      (debuginfo files are available in comment-221257).

      We decided to downgrade all servers on this system to 2.10.2 because these issues have had a significant impact on production lately.

      Thanks much!

      Stephane

       


          Activity

            [LU-10709] OSS deadlock in 2.10.3

            We upgraded our kernel on Oak from the Bruno-patched CentOS 7.4 kernel to CentOS 7.6 (3.10.0-957.27.2.el7.x86_64 + Lustre patches = 3.10.0-957.27.2.el7_lustre.pl1.x86_64). After one or two days, a similar deadlock occurred. It looks like the kernfs interface still has the same issue.

            • vmcore uploaded to the WC ftp server as vmcore-oak-io1-s1-2019-09-01-21-43-46
            • kernel-debuginfo available there too for 3.10.0-957.27.2.el7_lustre.pl1.x86_64
            • foreach bt attached as foreach_bt-oak-io1-s1-2019-09-01-21-43-46.log

            I know this is a kernel bug, but I wanted to update this ticket for the sake of completeness, and the deadlock is triggered by Lustre through lu_cache_shrink.

            User-space tool accessing the kernfs interface and triggering lu_cache_shrink:

            PID: 254093  TASK: ffff9f16acadd140  CPU: 30  COMMAND: "sas_counters"
             #0 [ffff9f1de48af3d8] __schedule at ffffffffa096aa72
             #1 [ffff9f1de48af460] schedule at ffffffffa096af19
             #2 [ffff9f1de48af470] rwsem_down_read_failed at ffffffffa096c54d
             #3 [ffff9f1de48af4f8] call_rwsem_down_read_failed at ffffffffa0588bf8
             #4 [ffff9f1de48af548] down_read at ffffffffa096a200
             #5 [ffff9f1de48af560] lu_cache_shrink at ffffffffc0e5ee7a [obdclass]
             #6 [ffff9f1de48af5b0] shrink_slab at ffffffffa03cb08e
             #7 [ffff9f1de48af650] do_try_to_free_pages at ffffffffa03ce412
             #8 [ffff9f1de48af6c8] try_to_free_pages at ffffffffa03ce62c
             #9 [ffff9f1de48af760] __alloc_pages_slowpath at ffffffffa09604ef
            #10 [ffff9f1de48af850] __alloc_pages_nodemask at ffffffffa03c2524
            #11 [ffff9f1de48af900] alloc_pages_current at ffffffffa040f438
            #12 [ffff9f1de48af948] new_slab at ffffffffa041a4c5
            #13 [ffff9f1de48af980] ___slab_alloc at ffffffffa041bf2c
            #14 [ffff9f1de48afa58] __slab_alloc at ffffffffa096190c
            #15 [ffff9f1de48afa98] kmem_cache_alloc at ffffffffa041d7cb
            #16 [ffff9f1de48afad8] alloc_inode at ffffffffa045eee1
            #17 [ffff9f1de48afaf8] iget_locked at ffffffffa046025b
            #18 [ffff9f1de48afb38] kernfs_get_inode at ffffffffa04c9c17
            #19 [ffff9f1de48afb58] kernfs_iop_lookup at ffffffffa04ca93b
            #20 [ffff9f1de48afb80] lookup_real at ffffffffa044d573
            #21 [ffff9f1de48afba0] __lookup_hash at ffffffffa044df92
            #22 [ffff9f1de48afbd0] lookup_slow at ffffffffa0961de1
            #23 [ffff9f1de48afc08] link_path_walk at ffffffffa045289f
            #24 [ffff9f1de48afcb8] path_lookupat at ffffffffa0452aaa
            #25 [ffff9f1de48afd50] filename_lookup at ffffffffa045330b
            #26 [ffff9f1de48afd88] user_path_at_empty at ffffffffa04552f7
            #27 [ffff9f1de48afe58] user_path_at at ffffffffa0455361
            #28 [ffff9f1de48afe68] vfs_fstatat at ffffffffa0448223
            #29 [ffff9f1de48afeb8] SYSC_newlstat at ffffffffa0448641
            #30 [ffff9f1de48aff40] sys_newlstat at ffffffffa0448aae
            #31 [ffff9f1de48aff50] system_call_fastpath at ffffffffa0977ddb
                RIP: 00007fdc07510ab5  RSP: 00007ffe9a9e7b30  RFLAGS: 00010202
                RAX: 0000000000000006  RBX: 00000000ffffff9c  RCX: 00007ffe9a9e7b30
                RDX: 00007ffe9a9e6b50  RSI: 00007ffe9a9e6b50  RDI: 00007fdbf86babd0
                RBP: 00000000012d2ca0   R8: 0000000000000001   R9: 0000000000000001
                R10: 00007fdc0834be97  R11: 0000000000000246  R12: 00007ffe9a9e6b50
                R13: 0000000000000001  R14: 00007fdc087ade08  R15: 00007fdbfba9c1d0
                ORIG_RAX: 0000000000000006  CS: 0033  SS: 002b
            

            mdraid task blocked on kernfs too:

            PID: 283550  TASK: ffff9f35c54a0000  CPU: 19  COMMAND: "md0_raid6"
             #0 [ffff9f35c5423b68] __schedule at ffffffffa096aa72
             #1 [ffff9f35c5423bf8] schedule_preempt_disabled at ffffffffa096be39
             #2 [ffff9f35c5423c08] __mutex_lock_slowpath at ffffffffa0969db7
             #3 [ffff9f35c5423c60] mutex_lock at ffffffffa096919f
             #4 [ffff9f35c5423c78] kernfs_find_and_get_ns at ffffffffa04ca883
             #5 [ffff9f35c5423ca0] sysfs_notify at ffffffffa04cd00b
             #6 [ffff9f35c5423cc8] md_update_sb at ffffffffa0795a89
             #7 [ffff9f35c5423d48] md_check_recovery at ffffffffa079681a
             #8 [ffff9f35c5423d68] raid5d at ffffffffc0d9a466 [raid456]
             #9 [ffff9f35c5423e50] md_thread at ffffffffa078dedd
            #10 [ffff9f35c5423ec8] kthread at ffffffffa02c2e81
            

            The original kernel report (https://bugzilla.kernel.org/show_bug.cgi?id=199589) has been dismissed, and I am not sure whether it was ever actually reported to Red Hat.
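
            For reference, here is a minimal C sketch of the lock interaction implied by the two backtraces above. The lock and function names (lu_sites_guard, kernfs_mutex, the shrinker callback) follow the traces, but the code itself is purely illustrative and is not the actual Lustre or kernel implementation:

            #include <linux/mutex.h>
            #include <linux/rwsem.h>
            #include <linux/shrinker.h>

            /* Global rwsem taken for read by the Lustre shrinker (frame #5). */
            static DECLARE_RWSEM(lu_sites_guard);
            /* Mutex serializing kernfs lookups and sysfs_notify() (frames #19 and #4). */
            static DEFINE_MUTEX(kernfs_mutex);

            /*
             * Shrinker callback: runs from direct reclaim of any GFP_KERNEL
             * allocation. If a writer already holds lu_sites_guard, the caller
             * sleeps here, keeping whatever locks it entered reclaim with.
             */
            static unsigned long lu_cache_shrink_sketch(struct shrinker *sk,
                                                        struct shrink_control *sc)
            {
                    down_read(&lu_sites_guard);   /* frames #4/#5 of sas_counters */
                    /* ... walk the registered lu_sites and count/release cached objects ... */
                    up_read(&lu_sites_guard);
                    return 0;
            }

            /*
             * sas_counters: kernfs_iop_lookup() runs with kernfs_mutex held and
             * allocates the new inode with GFP_KERNEL; the allocation enters
             * direct reclaim and blocks in the shrinker above, still holding
             * kernfs_mutex.
             *
             * md0_raid6: sysfs_notify() from md_update_sb() needs kernfs_mutex
             * and queues behind the stuck lookup, so the RAID superblock update
             * can no longer make progress.
             */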

            sthiell Stephane Thiell added a comment

            Hey Bruno,

            Great, thanks! It would definitely be nice to get some feedback from the kernel developers and/or have this patch integrated upstream.

            Our Oak system has been rock solid since this patch: right now we have about 45 days of uptime without any server crash, even though the filesystem is still very busy, mdraid checks are running almost all the time, and sas_counters is launched every minute on all OSS nodes.

            Note: I can't find your email to linux-raid@, maybe it didn't go through?

            Thanks!

            Stephane

            sthiell Stephane Thiell added a comment

            Hello Stephane,
            Following your previous requests for external reporting of this problem/bug:
            • I have created a bug report at kernel.org: https://bugzilla.kernel.org/show_bug.cgi?id=199589
            • I have also asked the MD-RAID maintainers for their opinion in an email to linux-raid@vger.kernel.org, with the title "Deadlock during memory reclaim path involving sysfs and MD-Raid layers".

            Last, the code in recent 4.x kernels seems to indicate that the problem is still there, but now in kernfs instead of sysfs: the latter uses the former's methods internally, and the same potential deadlock seems to exist around kernfs_mutex.
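
            As an illustration only: 4.x kernels (since roughly 4.12) provide the scoped memalloc_nofs_save()/memalloc_nofs_restore() helpers, which achieve the same effect as passing GFP_NOFS without changing every allocation call. The wrapper name below (example_kernfs_inode_alloc) is hypothetical and only shows the shape of such a fix, not an actual kernel patch:

            #include <linux/fs.h>
            #include <linux/sched/mm.h>   /* memalloc_nofs_save/restore, kernels >= 4.12 */

            /* Hypothetical call site: make any direct reclaim triggered while
             * allocating a kernfs/sysfs inode behave as if GFP_NOFS had been
             * requested, so filesystem shrinkers are skipped. */
            static struct inode *example_kernfs_inode_alloc(struct super_block *sb)
            {
                    unsigned int nofs_flags;
                    struct inode *inode;

                    nofs_flags = memalloc_nofs_save();
                    inode = new_inode(sb);          /* allocation now implies GFP_NOFS */
                    memalloc_nofs_restore(nofs_flags);

                    return inode;
            }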

            bfaccini Bruno Faccini (Inactive) added a comment

            Bruno,

            Great, I'll follow that with much attention. Thank you again, your patch has really saved us.

            Stephane

            sthiell Stephane Thiell added a comment

            Stephane,
            Thanks for your help and patch testing!
            I will take care of both the sysfs patch submission and the report to linux-raid soon.
            I will also double-check the 4.x kernels and give you an answer soon.

            bfaccini Bruno Faccini (Inactive) added a comment

            Hi Bruno,

            The system has been very stable lately with the patch. I think we can consider the issue fixed by next week (just to be sure).

            A few questions for you when you have time (no rush):

            • do you plan to submit the sysfs patch upstream to Red Hat?
            • do you want to notify linux-raid about this sysfs race condition, or would you like me to do it?
            • do you think this issue is automatically fixed on more recent 4.x kernels, given that the sysfs interface has changed?

            Thanks!!

            Stephane

            sthiell Stephane Thiell added a comment

            Stephane, thanks for the update, and let's cross our fingers now...

            bfaccini Bruno Faccini (Inactive) added a comment

            Hey Bruno,

            Quick status update: the patch was only deployed last Sunday morning (3/18) due to earlier production constraints. sas_counters is running quite frequently again, and I started the mdraid checks manually (they usually start on Saturday night). So far there is no issue to report and things are looking good, but we need more time (at least a week) to be sure. I will keep you posted!

            sthiell Stephane Thiell added a comment

            OK. Excellent, thank you!! I just built a new kernel with this patch. It is not a kernel version update, just the same kernel as before with this patch added (the new version is kernel-3.10.0-693.2.2.el7_lustre.pl2.x86_64). I will perform the kernel change on all Oak servers early tomorrow morning (Pacific time), when fewer users are connected to the system, and report back.

            sthiell Stephane Thiell added a comment
            bfaccini Bruno Faccini (Inactive) added a comment (edited)

            > Does that mean that the MD layer would then have to use this new sysfs_alloc_inode() method? Or everyone?

            Everyone. By the way, in the deadlock scenario it is the sas_counters user-land thread that triggers the memory reclaim during sysfs inode allocation.

            > But yes, we'd be very interested to test such a patch!

            Attached sysfs_alloc_inode_GFP_NOFS.patch file.
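
            The attached patch is not reproduced here, but as a rough sketch of the general idea only (the cache name, helper layout, and sysfs_ops_sketch below are assumptions for illustration, not the contents of sysfs_alloc_inode_GFP_NOFS.patch): giving sysfs its own ->alloc_inode() that allocates with GFP_NOFS clears __GFP_FS for any reclaim triggered by that allocation, so shrinkers that honor __GFP_FS, as lu_cache_shrink does, bail out instead of taking their locks.

            #include <linux/fs.h>
            #include <linux/gfp.h>
            #include <linux/slab.h>

            /* Assumed private inode cache for sysfs, created at init time with
             * kmem_cache_create() (not shown). */
            static struct kmem_cache *sysfs_inode_cachep;

            static struct inode *sysfs_alloc_inode(struct super_block *sb)
            {
                    struct inode *inode;

                    /* GFP_NOFS: reclaim triggered here will not call back into
                     * filesystem shrinkers, breaking the inversion above. */
                    inode = kmem_cache_alloc(sysfs_inode_cachep, GFP_NOFS);
                    if (!inode)
                            return NULL;
                    if (inode_init_always(sb, inode)) {
                            kmem_cache_free(sysfs_inode_cachep, inode);
                            return NULL;
                    }
                    return inode;
            }

            static void sysfs_destroy_inode(struct inode *inode)
            {
                    /* A real implementation would defer the free with call_rcu(),
                     * as other filesystems do; kept simple for the sketch. */
                    kmem_cache_free(sysfs_inode_cachep, inode);
            }

            /* Wired into the existing sysfs super_operations: */
            static const struct super_operations sysfs_ops_sketch = {
                    .alloc_inode   = sysfs_alloc_inode,
                    .destroy_inode = sysfs_destroy_inode,
                    /* ... remaining operations unchanged ... */
            };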



            Thanks Bruno! Does that mean that the MD layer would then have to use this new sysfs_alloc_inode() method? Or everyone?

            But yes, we'd be very interested to test such a patch!

             

            sthiell Stephane Thiell added a comment

            People

              bfaccini Bruno Faccini (Inactive)
              sthiell Stephane Thiell
              Votes: 1
              Watchers: 8
