Lustre / LU-10709

OSS deadlock in 2.10.3


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.3
    • Labels: None
    • Environment: CentOS 7.4, kernel 3.10.0-693.2.2.el7_lustre.pl1.x86_64
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      We hit another OSS deadlock on Oak last night. This is likely a regression in 2.10.3.

      Since the upgrade to 2.10.3, these servers have generally not stayed up for more than 48 hours. This issue might be related to the OSS situation described in LU-10697. The latest MDS instabilities appear likely to be fixed by LU-10680.

      In this case, the deadlocked OSS was oak-io2-s1. The OSTs of its partner (oak-io2-s2) had already been migrated to oak-io2-s1 after a previous deadlock/issue, so 48 OSTs were mounted on oak-io2-s1.

      Timeline:
      Feb 21 11:28:49: OSTs from oak-io2-s2 migrated to oak-io2-s1
      Feb 23 19:05:04: first stack trace of a stuck thread (oak-io2-s1 kernel: Pid: 17265, comm: ll_ost00_032)
      Feb 23 22:59: monitoring reports that SSH to oak-io2-s1 no longer works
      Feb 23 23:01:51 oak-io2-s1 kernel: INFO: task kswapd0:264 blocked for more than 120 seconds.
      Feb 24 02:03:56: manual crash dump of oak-io2-s1 taken

      Attached files:

      • kernel logs: oak-io2-s1_kernel.log (contains most of the details for the timeframe above)
      • vmcore-dmesg: oak-io2-s1_vmcore-dmesg.txt
      • crash foreach bt: oak_io2-s1_foreach_bt.xt
      • kernel memory usage: oak-io2-s1_kmem.txt
      • vmcore: oak-io2-s1-vmcore-2018-02-24-02_03_56.gz, available at
        https://stanford.box.com/s/n8ft8quvr6ubuvd12ukdsoarmrz4uixr
        (debuginfo files are available in comment-221257)

      Because of the significant impact on production lately, we have decided to downgrade all servers on this system to 2.10.2.

      Thanks much!

      Stephane



            People

              Assignee: Bruno Faccini (bfaccini, Inactive)
              Reporter: Stephane Thiell (sthiell)
              Votes: 1
              Watchers: 8
