[LU-2218] lots of small IO causes OST deadlock Created: 22/Oct/12 Updated: 09/May/14 Resolved: 09/May/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.8 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Kit Westneat (Inactive) | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: |
SLES kernel on debian |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 5276 |
| Description |
|
Sanger have been running into an issue where one of their applications seems to deadlock OSTs. They have an application that does lots of small IO and seems to create and delete a lot of files. It also seems to saturate the network, so there are a lot of bulk IO errors. It looks like the quota and jbd sections are getting into some kind of deadlock. I'm uploading the full logs, but there is a lot of:

Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264411] [<ffffffff8139ba25>] rwsem_down_failed_common+0x95/0x1e0

and

Oct 21 12:02:13 lus08-oss2 kernel: [ 3406.266346] Call Trace:

I asked them to turn down the OSS threads to try to reduce contention on the disks and network, but that didn't seem to help. Let me know if there are any other logs you need. |
| Comments |
| Comment by Peter Jones [ 22/Oct/12 ] |
|
Niu, could you please help with this one? Thanks, Peter |
| Comment by Kit Westneat (Inactive) [ 22/Oct/12 ] |
|
Here is additional information from Sanger:

The pipeline runs 200-1500 concurrent jobs; each job creates its own
There does not appear to be any concurrent write access to any of the
All of this IO is happening on un-striped files & directories.
One thing we also noticed: the user concerned was at 98% of their inode quota.
Unfortunately, their pipeline does not trap errors; it simply re-tries
We have upped the user's quota, and re-activated the pipeline, so we |
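For context, a minimal sketch of what one of these jobs might look like (hypothetical, not code from the ticket): a loop that creates a small file, writes a few KB, and unlinks it, so that many concurrent copies generate constant small-IO and create/delete churn against un-striped files. The directory argument, file naming, and iteration count are made up for illustration.

    /* Hypothetical reproducer sketch (not from the ticket): many small files,
     * created, written, and unlinked in a tight loop. Run many copies
     * concurrently against one directory to approximate the pipeline. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char *dir = argc > 1 ? argv[1] : ".";
        char path[4096];
        char buf[4096] = { 0 };              /* one small write per file */

        for (long i = 0; i < 100000; i++) {
            snprintf(path, sizeof(path), "%s/smallio.%d.%ld", dir, getpid(), i);
            int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
            if (fd < 0) { perror("open"); return 1; }
            if (write(fd, buf, sizeof(buf)) < 0)
                perror("write");             /* the pipeline ignores errors and retries */
            close(fd);
            unlink(path);                    /* constant create/delete churn */
        }
        return 0;
    }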
| Comment by Guy Coates [ 24/Oct/12 ] |
|
Hi,

We can reproduce the deadlock even when the user is well within quota. |
| Comment by Kit Westneat (Inactive) [ 24/Oct/12 ] |
|
I've looked more closely at the logs and it looks like there is some kind of deadlock between 3 different functions:

mutex_lock

They all seem to be called by either filter_destroy or ldlm_cancel_callback originally. Are there any debug flags that might help get information? |
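To make the hypothesis concrete, here is a toy C/pthreads illustration of the kind of AB-BA lock inversion being suspected. This is not Lustre code; the lock names and thread functions are invented stand-ins for the quota and journal locking discussed in this ticket, and the program intentionally deadlocks when run.

    /* Toy AB-BA lock inversion (not Lustre code): two threads take the same
     * two locks in opposite order and hang forever. The names are stand-ins
     * for the suspected quota / journal-start ordering.
     * WARNING: this program deadlocks by design. Build with -lpthread. */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t quota_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t journal_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *destroy_path(void *arg)   /* e.g. a filter_destroy-like path */
    {
        pthread_mutex_lock(&journal_lock); /* "start a journal handle" ... */
        sleep(1);
        pthread_mutex_lock(&quota_lock);   /* ... then touch quota state */
        pthread_mutex_unlock(&quota_lock);
        pthread_mutex_unlock(&journal_lock);
        return NULL;
    }

    static void *quota_path(void *arg)     /* e.g. a quota-update path */
    {
        pthread_mutex_lock(&quota_lock);   /* takes the same locks the other way */
        sleep(1);
        pthread_mutex_lock(&journal_lock);
        pthread_mutex_unlock(&journal_lock);
        pthread_mutex_unlock(&quota_lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, destroy_path, NULL);
        pthread_create(&b, NULL, quota_path, NULL);
        pthread_join(a, NULL);             /* never returns: classic deadlock */
        pthread_join(b, NULL);
        puts("unreachable");
        return 0;
    }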
| Comment by Niu Yawei (Inactive) [ 24/Oct/12 ] |
|
Hi Kit,

I checked the log; there are indeed many service threads waiting to start a journal handle, and some are waiting on the dqptr_sem, but that is not necessarily a deadlock. I haven't seen a lock ordering issue in the code so far (there was a deadlock problem, but it's already fixed in

Could you try to get the full stack trace (on the hung OST) via sysrq? I think that could be helpful. Thanks. |
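For reference, the requested dump uses the standard Linux sysrq interface: writing 't' to /proc/sysrq-trigger asks the kernel to dump every task's stack to the console/kernel log, equivalent to echo t > /proc/sysrq-trigger on the command line. The sketch below is only an assumption about how the site might collect it (not part of the ticket) and must be run as root on the hung OSS.

    /* Minimal sketch: request a dump of all task stacks via sysrq.
     * Output goes to the console / kernel log (dmesg, syslog). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/proc/sys/kernel/sysrq", O_WRONLY);
        if (fd >= 0) {                       /* make sure sysrq is enabled */
            if (write(fd, "1", 1) != 1)
                perror("enable sysrq");
            close(fd);
        }

        fd = open("/proc/sysrq-trigger", O_WRONLY);
        if (fd < 0) {
            perror("open /proc/sysrq-trigger");
            return 1;
        }
        if (write(fd, "t", 1) != 1)          /* 't' = dump all task stacks */
            perror("write");
        close(fd);
        return 0;
    }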
| Comment by Niu Yawei (Inactive) [ 24/Oct/12 ] |
|
Hi Yang Sheng,

I saw that ext4-back-dquot-to-rhel54.patch was removed by ba5dd769f66194a80920cf93d6014c78729efaae ( |
| Comment by Niu Yawei (Inactive) [ 24/Oct/12 ] |
|
btw, Kit, what's the exact kernel version on the OST? |
| Comment by Yang Sheng [ 24/Oct/12 ] |
|
The SLES kernel may differ from RHEL, so the patches cannot be applied directly. |
| Comment by Kit Westneat (Inactive) [ 25/Oct/12 ] |
|
The kernel is version 2.6.32.59-sles-lustre-1.8.8wc1. I tried to see if there were any ext4 bugs fixed in more recent 2.6.32.y kernels, but I couldn't find any. We'll try to get a full stack trace the next time this occurs. We tried this time, but as you can see the first part was not saved for some reason. |
| Comment by Guy Coates [ 26/Oct/12 ] |
|
We have been running the same workload on one of our 2.2 file-systems for the past few days, and we were not able to trigger the problem. Our plan is to upgrade the problematic file-system to 2.X. I am happy for you to reduce the priority of the ticket in the meantime.

Guy |
| Comment by Niu Yawei (Inactive) [ 26/Oct/12 ] |
|
Thank you, Guy. The SLES kernel isn't officially supported by 1.8.8, so I'm not sure whether there is a potential deadlock between the journal lock and dqptr_sem. Anyway, glad to hear you resolved the problem by upgrading. |