[LU-7393] OSS hung with high load and blocked ll_{*} threads Created: 05/Nov/15  Updated: 24/Jan/17  Resolved: 24/Jan/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Frank Heckes (Inactive) Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: soak
Environment:

lola
build: 2.7.62-28-g0754bc8, 0754bc8f2623bea184111af216f7567608db35b6; soakbuild '20151104.1'


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The error occurred during soak testing of build '20151104.1' on cluster lola (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20151104.1). MDTs are formatted with ldiskfs and OSTs with zfs as the storage backend. DNE is enabled. MDSes are set up in an HA failover configuration.
OSS nodes are neither restarted nor failed over.

Symptom:

  • OSS node (lola-3) shows high load due to a large number of blocked processes. No iowait, high disk load, or long queue and wait times can be seen
  • The list of blocked processes can be seen from the 'w' and 't' sysrq-trigger dumps initiated at Nov 5 08:19:12 PST 2015 and 08:23:3 PST 2015, respectively (see attached messages file)
  • Problems most likely started at Nov 4, 18:50
    see messages file and debug log file (lustre-log.1446691819.85273.bz2) attached
  • 220 additional debug log files have been written which could be provided on demand
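
For reference, on Linux the load average includes tasks in uninterruptible sleep (state D), which is why a node can show high load without disk pressure. The sysrq 'w' dump lists such blocked tasks in the kernel log. A minimal Python sketch of a complementary check is given below; it counts D-state threads per command name by reading /proc. The function names and the sample thread names are illustrative assumptions, not taken from the lola logs.

```python
import glob
from collections import Counter


def parse_stat_line(line):
    """Return (pid, comm, state) from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_idx = line.index("(")
    close_idx = line.rindex(")")
    pid = int(line[:open_idx].strip())
    comm = line[open_idx + 1:close_idx]
    state = line[close_idx + 1:].split()[0]
    return pid, comm, state


def count_blocked(stat_lines):
    """Count threads in uninterruptible sleep (state 'D') per command name."""
    counts = Counter()
    for line in stat_lines:
        try:
            _pid, comm, state = parse_stat_line(line)
        except (ValueError, IndexError):
            continue  # skip malformed or empty lines
        if state == "D":
            counts[comm] += 1
    return counts


def read_all_stat_lines():
    """Read the stat line of every thread currently visible under /proc."""
    lines = []
    for path in glob.glob("/proc/[0-9]*/task/[0-9]*/stat"):
        try:
            with open(path, errors="replace") as fh:
                lines.append(fh.read())
        except OSError:
            continue  # thread exited between listing and reading
    return lines


if __name__ == "__main__":
    # Print the 20 command names with the most blocked threads.
    for comm, n in count_blocked(read_all_stat_lines()).most_common(20):
        print(f"{n:6d}  {comm}")
```

Run periodically on the affected OSS, a check like this would show whether the number of blocked ll_* threads grows over time, without needing a full sysrq task dump.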


 Comments   
Comment by Joseph Gmitter (Inactive) [ 05/Nov/15 ]

Frank,
Are the debug logs available?
Thanks.
Joe

Comment by Frank Heckes (Inactive) [ 23/Nov/15 ]

Joe, I accidentally did not attach the files mentioned in the description, although I was convinced I had. After checking possible locations on the soak nodes and my laptop, I'm sure they're gone. I'm very sorry.

Comment by Cliff White (Inactive) [ 24/Jan/17 ]

Issue was not reproduced

Generated at Sat Feb 10 02:08:31 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.