[LU-8548] MDS crash during DNE2 testing with Lustre 2.9 Created: 26/Aug/16  Updated: 26/Aug/16  Resolved: 26/Aug/16

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: James A Simmons Assignee: Lai Siyao
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Power8 clients running 2.8.56 and back end servers running the same. This OOM happened while running mdtest with the directory striped across 16 MDTs.


Attachments: Text File kern.log    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

I attached the kern log that captured the OOM that happened while running mdtest with the directory striped across 16 MDTs.



 Comments   
Comment by Joseph Gmitter (Inactive) [ 26/Aug/16 ]

Hi Lai,

Could you please help to investigate this issue?

Thanks.
Joe

Comment by Lai Siyao [ 26/Aug/16 ]

The backtrace shows a soft lockup in mdt_lock_root_xattr()->mdt_remote_object_lock(), which was introduced in LU-7660. However, this root XATTR lock is not held (i.e., it is decref-ed right after it is taken), so I don't see why it would cause a soft lockup. Could you get backtraces from the MDS where the soft lockup occurs? Also, are all MDTs locked up?

BTW, do you think it happens with Power8 clients only?

Comment by James A Simmons [ 26/Aug/16 ]

I updated the server side to the latest master, since it was over a month old, and when I attempted to reproduce this problem it went away. So it looks like a version mismatch among pre-2.9 builds caused this issue.
