[LU-1021] MDS dumps call traces after run find and du commands Created: 22/Jan/12  Updated: 31/Aug/12  Resolved: 31/Aug/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.6
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Shuichi Ihara (Inactive) Assignee: Johann Lombardi (Inactive)
Resolution: Duplicate Votes: 0
Labels: None

Attachments: File messages     File messages.1     File messages.2    
Severity: 3
Rank (Obsolete): 6481

 Description   

When the user ran the "find" or "du -sh" command on the client, the MDS dumped many call traces and error messages and could not respond to the clients. (The MDS did not hang or go down, but its responses were very slow.)
I'm attaching /var/log/messages; many call traces are visible in these log files. I'm not sure they are all caused by the same problem, but the user did run several "find" and "du -sh" commands on Jan 17 and Jan 18.



 Comments   
Comment by Oleg Drokin [ 22/Jan/12 ]

These all look like the MDS spending significant time trying to allocate memory and then getting stuck on a resource semaphore in the ldlm pool cleaning thread.

I think an entire class of patches was accepted that would prevent Lustre allocations from diving into this code.

It would be good to get Johann to take a look, as he must remember that work.

Comment by Johann Lombardi (Inactive) [ 23/Feb/12 ]

I remember such changes on the client side, but not on the MDS side.

That said, there are many messages indicating that the threads finally got unstuck:

Jan 15 20:47:37 ALPL505 kernel: Lustre: Service thread pid 14282 completed after 1630.26s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Jan 15 22:00:44 ALPL505 kernel: Lustre: Service thread pid 15016 completed after 343.70s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Jan 15 23:50:51 ALPL505 kernel: Lustre: Service thread pid 14956 completed after 746.53s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Jan 15 23:50:51 ALPL505 kernel: Lustre: Service thread pid 14308 completed after 746.52s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
...

So it seems that the system isn't really deadlocked, "just" awfully slow. Can you reproduce this issue easily?
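
To gauge how long service threads were stalled across the attached logs, the completion messages quoted above could be tallied with a short script. This is a minimal sketch, not part of the original report: the regular expression is assumed from the four sample lines (other Lustre versions may word the message differently), the sample lines are abridged after the duration, and the names parse_stalls and summarize are hypothetical.

```python
import re

# Pattern assumed from the syslog lines quoted in this ticket.
STALL_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+kernel: Lustre: "
    r"Service thread pid (?P<pid>\d+) completed after (?P<secs>\d+(?:\.\d+)?)s"
)


def parse_stalls(lines):
    """Return (timestamp, host, pid, seconds) for each completion message."""
    stalls = []
    for line in lines:
        m = STALL_RE.match(line)
        if m:
            stalls.append(
                (m.group("stamp"), m.group("host"),
                 int(m.group("pid")), float(m.group("secs")))
            )
    return stalls


def summarize(stalls):
    """Return the count, longest, and mean stall duration in seconds."""
    if not stalls:
        return {"count": 0, "max_s": 0.0, "mean_s": 0.0}
    secs = [s[3] for s in stalls]
    return {"count": len(secs), "max_s": max(secs), "mean_s": sum(secs) / len(secs)}


# The four messages quoted above, abridged after the duration.
SAMPLE = [
    "Jan 15 20:47:37 ALPL505 kernel: Lustre: Service thread pid 14282 completed after 1630.26s.",
    "Jan 15 22:00:44 ALPL505 kernel: Lustre: Service thread pid 15016 completed after 343.70s.",
    "Jan 15 23:50:51 ALPL505 kernel: Lustre: Service thread pid 14956 completed after 746.53s.",
    "Jan 15 23:50:51 ALPL505 kernel: Lustre: Service thread pid 14308 completed after 746.52s.",
]

if __name__ == "__main__":
    print(summarize(parse_stalls(SAMPLE)))
```

Feeding the full messages files to parse_stalls (for example via open(path) as the line iterable) would show whether the long stalls cluster around Jan 17 and Jan 18, when the "find" and "du -sh" commands were run.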

Comment by Johann Lombardi (Inactive) [ 23/Feb/12 ]

BTW, do you apply any patches on top of 1.8.6-wc1?

Comment by Kit Westneat (Inactive) [ 31/Aug/12 ]

This looks like a dupe of LU-1535, can someone confirm?

Comment by Peter Jones [ 31/Aug/12 ]

Lai, could you please confirm whether this is a duplicate of LU-1535?

Comment by Lai Siyao [ 31/Aug/12 ]

Yes, it is.

Comment by Peter Jones [ 31/Aug/12 ]

Thanks Lai!
