[LU-5333] rm cause MDS to complain hung tasks and disconnecting clients Created: 11/Jul/14  Updated: 22/Jun/16  Resolved: 22/Jun/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Haisong Cai (Inactive) Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: sdsc
Environment:

Linux puma-mds-10-6.local 2.6.32-358.23.2.el6_lustre.x86_64 #1 SMP Thu Dec 19 19:57:45 PST 2013 x86_64 x86_64 x86_64 GNU/Linux


Attachments: File dmesg.3369     HTML File dmesg_log     File lustre-log.tgz    
Issue Links:
Related
is related to LU-5726 MDS buffer not freed when deleting files Resolved
is related to LU-5503 MDS (2.4.2) are getting "Service thre... Resolved
Severity: 2
Rank (Obsolete): 14881

 Description   

A client was running "rm" to remove a couple of million files when the MDS system load shot to 30 and the kernel began dumping stack traces complaining about hung tasks. See the attached output from "dmesg".

I would think this is a normal workload for a dual-Westmere CPU / 24GB RAM system with bonded Myricom 10Gbps networking.

We have been seeing this happen more frequently on 2.4.3 than when we were on 1.8.7.

Any suggestions?

thanks,
Haisong



 Comments   
Comment by Peter Jones [ 12/Jul/14 ]

Niu

Could you please advise on this issue?

Thanks

Peter

Comment by Oleg Drokin [ 14/Jul/14 ]

From the traces it looks like it's a combination of OOM and journal deadlock of some sort.

Comment by Haisong Cai (Inactive) [ 15/Jul/14 ]

Hi Oleg,

Later that day, the MDS came to a point where local commands were hanging.
We rebooted the server and ran e2fsck. That fixed a bunch of quota entries and a couple of inodes.

It has been stable so far.

Haisong

Comment by Haisong Cai (Inactive) [ 20/Oct/14 ]

We have another case where removing several million files from the filesystem caused the MDS to dump stack traces and gradually hang. I will attach some stack traces and dmesg output following this message.

Comment by Haisong Cai (Inactive) [ 20/Oct/14 ]

Correction: the server is running 2.4.2, not 2.4.3.

Comment by Niu Yawei (Inactive) [ 27/Oct/14 ]

I think this could be related to LU-5726, and LU-5503 looks like another instance of the same problem.

Comment by Haisong Cai (Inactive) [ 28/Oct/14 ]

Hi Yawei,

LU-5726 indicates the issue is fixed in 2.7.0.
Could you comment on whether the fix can be back-ported into earlier versions, specifically 2.5.*?

thanks,
Haisong

Comment by Peter Jones [ 28/Oct/14 ]

Haisong

To be clear, LU-5726 is targeted to be fixed in the 2.7 release but is not fixed yet. Your interest in this issue will raise the priority of this work, and Niu will look at the options for backporting a fix to 2.5.x as part of this effort.

Regards

Peter

Comment by Niu Yawei (Inactive) [ 06/Feb/15 ]

b2_5 port: http://review.whamcloud.com/#/c/13464/
