Details
- Type: Bug
- Resolution: Unresolved
- Priority: Critical
- Fix Version/s: None
- Affects Version/s: Lustre 2.10.3
- Labels: None
- Environment: CentOS 7.4, kernel 3.10.0-693.2.2.el7_lustre.pl1.x86_64
- Severity: 3
Description
We got another OSS deadlock last night on Oak, likely a regression in 2.10.3.
Since the upgrade to 2.10.3, these servers have generally not stayed stable for more than 48 hours. This issue may be related to the OSS situation described in LU-10697; as for the latest MDS instabilities, it sounds like they will be fixed by LU-10680.
In this case, the deadlock hit OSS oak-io2-s1. The OSTs from its partner (oak-io2-s2) had already been migrated to it due to a previous deadlock/issue, so 48 OSTs were mounted on oak-io2-s1 (a sketch of this failover migration follows below).
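For reference, a minimal sketch of what that migration amounts to on the surviving OSS: once oak-io2-s2 is out of service, its OST targets are simply mounted on oak-io2-s1. The device and mount-point names below are placeholders, not the actual Oak configuration:

    # on oak-io2-s1, repeated for each OST normally served by oak-io2-s2 (names are illustrative)
    mount -t lustre /dev/mapper/oak-OST0030 /mnt/oak-OST0030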
Timeframe overview:
- Feb 21 11:28:49: OSTs from oak-io2-s2 migrated to oak-io2-s1
- Feb 23 19:05:04: first stack trace of a stuck thread (oak-io2-s1 kernel: Pid: 17265, comm: ll_ost00_032)
- Feb 23 22:59: monitoring reports that ssh to oak-io2-s1 no longer works
- Feb 23 23:01:51: oak-io2-s1 kernel: INFO: task kswapd0:264 blocked for more than 120 seconds. (see the note after this list)
- Feb 24 02:03:56: manual crash dump taken of oak-io2-s1
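For context on the last two entries: the 120-second message comes from the kernel's hung-task watchdog, and a manual dump on a box that no longer answers ssh is usually forced through sysrq from the console. The exact method used that night is not recorded here, so the commands below are only an illustration:

    # hung-task watchdog threshold behind the "blocked for more than 120 seconds" messages
    sysctl kernel.hung_task_timeout_secs

    # one common way to force a crash dump when the node is unreachable over ssh
    # (assumption: sysrq was used to take the dump at 02:03:56)
    echo c > /proc/sysrq-trigger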
Attaching the following files (a sketch of the crash commands behind the bt/kmem/dmesg outputs follows the list):
- kernel logs in oak-io2-s1_kernel.log (where you can find most of the details in the timeframe above)
- vmcore-dmesg: oak-io2-s1_vmcore-dmesg.txt
- crash foreach bt: oak-io2-s1_foreach_bt.txt
- kernel memory usage: oak-io2-s1_kmem.txt
- vmcore (oak-io2-s1-vmcore-2018-02-24-02_03_56.gz):
https://stanford.box.com/s/n8ft8quvr6ubuvd12ukdsoarmrz4uixr
(debuginfo files are available in comment-221257).
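For anyone reproducing the analysis from the vmcore, the attached backtrace, kmem, and dmesg files correspond to standard crash(8) commands. A minimal sketch, assuming the vmcore has been gunzipped and the kernel-debuginfo from comment-221257 is installed in the usual CentOS location:

    crash /usr/lib/debug/lib/modules/3.10.0-693.2.2.el7_lustre.pl1.x86_64/vmlinux vmcore
    crash> foreach bt > oak-io2-s1_foreach_bt.txt    # backtrace of every task
    crash> kmem -i > oak-io2-s1_kmem.txt             # kernel memory usage summary
    crash> log > oak-io2-s1_vmcore-dmesg.txt         # kernel ring buffer (dmesg)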
We decided to downgrade all servers on this system to 2.10.2, because these issues have had a significant impact on production lately.
Thanks much!
Stephane
Issue Links
- is related to LU-10697 MDT locking issues after failing over OSTs from hung OSS (Open)