[LU-4498] MDT thread hung, ls fails on directory Created: 16/Jan/14 Updated: 21/Mar/14 Resolved: 21/Mar/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Kit Westneat (Inactive) | Assignee: | Zhenyu Xu |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 12303 |
| Description |
|
IU is running into an issue where running ls on a certain file causes clients to get evicted. It appears as if there is a hung MDT thread holding a lock on the file. After the MDT is rebooted, listing the directory and file works fine. We were able to capture client debug logs and a backtrace of all the threads from the running system, but due to an issue with STONITH, we were unable to get a good vmcore from the system. Also when we tried to get debug logs from the MDT, the log overflowed, even with a 20GB buffer. We are currently waiting for the issue to reappear and will get debug logs on a quiesced system, as well as a good vmcore. I'll upload the logs we have. Is there anything else we should be looking to get? |
| Comments |
| Comment by Kit Westneat (Inactive) [ 16/Jan/14 ] |
|
Here's an example of the directory listing. I forgot to mention it takes hours for the MDT to actually evict the client. ls -al |
| Comment by Cliff White (Inactive) [ 16/Jan/14 ] |
|
Is there anything else you can tell us about the 'certain file' ? |
| Comment by Kit Westneat (Inactive) [ 17/Jan/14 ] |
|
It's happened three times on seemingly unrelated files. It looks like the last e2fsck was on June 12. FWIW there don't seem to be any ldiskfs errors in the logs anytime recently. |
| Comment by Peter Jones [ 17/Jan/14 ] |
|
Bobijam, could you please help with this one? Thanks, Peter |
| Comment by Zhenyu Xu [ 20/Jan/14 ] |
|
Fanyong, does it look like a dir hash collision issue? |
| Comment by nasf (Inactive) [ 20/Jan/14 ] |
|
According to the client-side log, every getattr RPC is for a different file, so we cannot say that the "ls" ran into a hash collision. On the other hand, the MDT side shows some "open-create" operations during the "ls" processing. What are they for? |
| Comment by Zhenyu Xu [ 20/Jan/14 ] |
|
The client log shows that the ls ran from 2014/01/09 23:46:11 to 2014/01/09 23:46:12, which took 1 second and does not match the report in the first comment (from 2013-12-20_13:37:32 to 2013-12-20_15:44:49). Do you have logs that cover the time span of the issue? |
| Comment by Zhenyu Xu [ 21/Jan/14 ] |
|
Would you mind trying this patch: http://review.whamcloud.com/8936 ? It is a backport of the dir hash collision fix. |
| Comment by Peter Jones [ 20/Mar/14 ] |
|
Bobijam Could this be related to Peter |
| Comment by Zhenyu Xu [ 21/Mar/14 ] |
|
yes, I think it's related to |
| Comment by Peter Jones [ 21/Mar/14 ] |
|
ok so then let's mark it as a duplicate of |