  1. Lustre
  2. LU-10144

sanityn test_77c: OST hang

Details

    • Bug
    • Resolution: Unresolved
    • Minor

    Description

      This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/a76f9770-b57e-11e7-9d39-52540065bddc.

      The sub-test test_77c failed with the following error:

      Timeout occurred after 271 mins, last suite running was sanityn, restarting cluster to continue tests
      

      The following was seen in the console log from the OST:

      [12776.179843] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [ll_ost_io00_025:31888]
      [12776.180678] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver[12776.182841] NMI watchdog: BUG: soft lo
      

      This looks like the beginning of a kernel oops report, but the bulk of the report is lost or overwritten. This has happened a lot since console log collection was switched from per-test to per-session.
      I think it is an artifact of reusing the same console log on a node both before and after a failure. Incomplete or overwritten failure information in the console logs makes it nearly impossible to see what really happened.
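
      A minimal sketch of how damaged lines like the one above could be flagged when reading a console log. It assumes that a second kernel timestamp in the middle of a line means one message was written over an unfinished one. The function name `scan_console_line` and both patterns are hypothetical, written only against the excerpt in this ticket; they are not part of Lustre or maloo.

```python
import re

# Kernel console timestamp such as "[12776.179843]".
TIMESTAMP = re.compile(r"\[\s*\d+\.\d+\]")

# Soft lockup header, modelled on the excerpt in this ticket (assumption:
# other kernels may word this message differently).
SOFT_LOCKUP = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] NMI watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>[^:\]]+):(?P<pid>\d+)\]"
)


def scan_console_line(line):
    """Return (lockups, interleaved) for a single console log line.

    lockups     -- list of dicts, one per complete soft lockup header found
    interleaved -- True if the line carries more than one timestamp, which
                   hints that a message was cut off by the next one
    """
    lockups = [m.groupdict() for m in SOFT_LOCKUP.finditer(line)]
    interleaved = len(TIMESTAMP.findall(line)) > 1
    return lockups, interleaved
```

      Run over the two log lines quoted above, the first yields one complete soft lockup record and the second is flagged as interleaved with no complete record, which matches the truncation described here.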

      This failure in test_77c has only been seen for the past few days, so it seems likely that something that landed in master very recently is causing it.

      Info required for matching: sanityn 77c

          People

            wc-triage WC Triage
            maloo Maloo