[LU-10144] sanityn test_77c: OST hang Created: 20/Oct/17  Updated: 13/May/18

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/a76f9770-b57e-11e7-9d39-52540065bddc.

The sub-test test_77c failed with the following error:

Timeout occurred after 271 mins, last suite running was sanityn, restarting cluster to continue tests

The following was seen in the console log from the OST:

[12776.179843] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [ll_ost_io00_025:31888]
[12776.180678] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver[12776.182841] NMI watchdog: BUG: soft lo

This looks like the beginning of a kernel oops report, but the bulk of the report is lost or was thrown away. This has happened a lot since console log collection was switched from per-test to per-session.
I think it's an artifact of keeping and using the same console log on a node both before and after a failure. Incomplete or overwritten failure information in console logs makes it nearly impossible to see what really happened.
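
As a rough aid for triaging logs like the excerpt above, the following Python sketch pulls out whatever complete soft lockup messages survive in a console log. The function name `find_soft_lockups` and the shortened sample string are hypothetical and only for illustration; the pattern is based solely on the message format visible in this report and may not cover other kernel versions. It scans the whole text rather than line by line, so messages fused onto another line are still found, while truncated ones (such as the trailing "soft lo") are skipped.

```python
import re

# Matches the watchdog message format seen in the console excerpt above.
# Assumption: other kernels may word this message differently.
SOFT_LOCKUP = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NMI watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>[^:\]]+):(?P<pid>\d+)\]"
)


def find_soft_lockups(console_text):
    """Return one dict per complete soft lockup message in console_text."""
    hits = []
    # finditer over the whole text also catches messages fused mid-line.
    for m in SOFT_LOCKUP.finditer(console_text):
        hits.append({
            "timestamp": float(m.group("ts")),
            "cpu": int(m.group("cpu")),
            "stuck_seconds": int(m.group("secs")),
            "task": m.group("task"),
            "pid": int(m.group("pid")),
        })
    return hits


# Sample shortened from the console excerpt in this report (module list abbreviated).
sample = (
    "[12776.179843] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! "
    "[ll_ost_io00_025:31888]\n"
    "[12776.180678] Modules linked in: osp(OE) dns_resolver"
    "[12776.182841] NMI watchdog: BUG: soft lo\n"
)

for hit in find_soft_lockups(sample):
    print(hit)
```

Only the first message is reported for this sample, since the second one is cut off before the CPU and task fields.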

This failure in test_77c has only been seen for a few days. It seems likely that something bad landed in master very recently and is causing it.

Info required for matching: sanityn 77c



 Comments   
Comment by Bob Glossman (Inactive) [ 13/May/18 ]

Another occurrence on b2_10:
https://testing.hpdd.intel.com/test_sets/129c1e6e-5643-11e8-b303-52540065bddc

Generated at Sat Feb 10 02:32:27 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.