[LU-1021] MDS dumps call traces after run find and du commands Created: 22/Jan/12 Updated: 31/Aug/12 Resolved: 31/Aug/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Shuichi Ihara (Inactive) | Assignee: | Johann Lombardi (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 6481 |
| Description |
|
When the user ran "find" or "du -sh" on the client, the MDS dumped many call traces and error messages and could not respond to the clients. (The MDS did not hang or go down, but its responses were really slow.) |
| Comments |
| Comment by Oleg Drokin [ 22/Jan/12 ] |
|
These all look like the MDS getting stuck and spending significant time trying to allocate memory, and then getting stuck on a resource semaphore in the ldlm pool cleaning thread. I think there was an entire class of patches accepted that would prevent Lustre allocations from diving into this code. It would be good to get Johann to take a look, as he must remember that work. |
| Comment by Johann Lombardi (Inactive) [ 23/Feb/12 ] |
|
I remember such changes on the client side, but not on the MDS side. That said, there are many messages indicating that the threads finally got unstuck:
Jan 15 20:47:37 ALPL505 kernel: Lustre: Service thread pid 14282 completed after 1630.26s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Jan 15 22:00:44 ALPL505 kernel: Lustre: Service thread pid 15016 completed after 343.70s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Jan 15 23:50:51 ALPL505 kernel: Lustre: Service thread pid 14956 completed after 746.53s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Jan 15 23:50:51 ALPL505 kernel: Lustre: Service thread pid 14308 completed after 746.52s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
...
So it seems that the system isn't really deadlocked, "just" awfully slow. Can you reproduce this issue easily? |
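For anyone triaging a similar MDS log, the "completed after" messages quoted above can be pulled out and ranked to see how long the service threads were stalled. The following is only a minimal sketch: the function name `completed_threads` and the `SAMPLE` constant are hypothetical, and the regular expression assumes the message wording is exactly as quoted in this ticket.

```python
import re

# Matches the "completed after" message as quoted in this ticket.
# Assumption: the wording is identical in the log being examined.
COMPLETED_RE = re.compile(
    r"Service thread pid (?P<pid>\d+) completed after (?P<secs>\d+(?:\.\d+)?)s"
)


def completed_threads(log_text):
    """Return (pid, seconds) pairs, in log order, for each matching message."""
    return [
        (int(m.group("pid")), float(m.group("secs")))
        for m in COMPLETED_RE.finditer(log_text)
    ]


# The four log lines quoted in the comment above (message tail shortened).
SAMPLE = """\
Jan 15 20:47:37 ALPL505 kernel: Lustre: Service thread pid 14282 completed after 1630.26s. This indicates the system was overloaded
Jan 15 22:00:44 ALPL505 kernel: Lustre: Service thread pid 15016 completed after 343.70s. This indicates the system was overloaded
Jan 15 23:50:51 ALPL505 kernel: Lustre: Service thread pid 14956 completed after 746.53s. This indicates the system was overloaded
Jan 15 23:50:51 ALPL505 kernel: Lustre: Service thread pid 14308 completed after 746.52s. This indicates the system was overloaded
"""

if __name__ == "__main__":
    # Print the slowest threads first.
    for pid, secs in sorted(completed_threads(SAMPLE), key=lambda p: -p[1]):
        print(f"pid {pid}: {secs:.2f}s ({secs / 60:.1f} min)")
```

Threads that eventually report a completion time were slow rather than deadlocked, which is the distinction drawn in the comment above.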
| Comment by Johann Lombardi (Inactive) [ 23/Feb/12 ] |
|
BTW, do you apply any patches on top of 1.8.6-wc1? |
| Comment by Kit Westneat (Inactive) [ 31/Aug/12 ] |
|
This looks like a dupe of |
| Comment by Peter Jones [ 31/Aug/12 ] |
|
Lai, could you please confirm whether this is a duplicate of |
| Comment by Lai Siyao [ 31/Aug/12 ] |
|
Yes, it is. |
| Comment by Peter Jones [ 31/Aug/12 ] |
|
Thanks Lai! |