[LU-81] Some JBD2 journaling deadlock at BULL | Created: 09/Feb/11 | Updated: 24/Nov/17 | Resolved: 29/Mar/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.0.0 |
| Fix Version/s: | Lustre 2.2.0, Lustre 2.1.2 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Oleg Drokin | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Issue Links: |
|
| Severity: | 2 | ||||
| Bugzilla ID: | 24,438 | ||||
| Rank (Obsolete): | 4793 | ||||
| Description |
|
BULL reports in Bugzilla that there are possible deadlock issues on the MDS with jbd2 (perhaps just runaway transactions?): At CEA, they have encountered several occurrences of the same scenario where all Lustre activity is frozen. As a consequence, the MDS has to be rebooted and the Lustre layer restarted on it with recovery. The MDS threads that appear to be most strongly involved in the frozen situation have the following stacks. There are about 234 tasks with the same stack: PID 5250 mdt_rdpg_143. One is: Pid: 4990 mdt_395. And another: Pid: 4534 "jbd2/sdd-8". Analyzing the crash dump shows that the task hung in jbd2_journal_commit_transaction() is in this state. This problem looks like bug 16667, but unfortunately that fix is not applicable 'as is', as it dates back to an older code base. Can you see the reason for this deadlock? I should stress that this bug is critical, as it blocks normal cluster operation (i.e. with HSM). |
| Comments |
| Comment by Alex Zhuravlev [ 15/Feb/11 ] |
|
Is there a possibility to reproduce the issue and grab a crash image, so we have access to the stacks with the offsets? Or perhaps the customer saved the crash? Also, was a changelog consumer (the HSM userspace agent) running? It's important to understand whether the MDS was only generating records or whether records were being cancelled as well. |
| Comment by Peter Jones [ 16/Feb/11 ] |
|
I have added Bull to this ticket in the hope that someone there can answer Alex's question and help move this issue forward |
| Comment by Sebastien Buisson (Inactive) [ 16/Feb/11 ] |
|
Hi Alex, Peter, Thanks for opening this Jira ticket. I think CEA saved the crash dumps, but as the cluster is classified it is not possible to get them out. So please tell us precisely what you need, and I will have our on-site support team send it (foreach bt? bt -a? ...). I do not know whether the HSM userspace agent was running; I will forward this question. Cheers, |
| Comment by Alex Zhuravlev [ 16/Feb/11 ] |
|
Hello Sebastien, I think the very first info we need is detailed stacks for all the processes. |
| Comment by Peter Jones [ 15/Mar/11 ] |
|
Update from Bull is that the onsite support team are working on getting this information |
| Comment by Cory Spitz [ 15/Mar/11 ] |
|
Given the information presented here, I was reminded of Lustre Bug 21406. Perhaps that ticket could be inspected to see if the conditions are similar. Further, implementing the workaround from attachment 28496 (https://bugzilla.lustre.org/attachment.cgi?id=28496), which did not land to 2.x, may be a useful experiment if the problem can be easily reproduced. However, I also remember that 21406 was associated with OST threads, not MDT threads, so perhaps it doesn't apply. |
| Comment by Peter Jones [ 15/Mar/11 ] |
|
Alex, Cray observed on the 2.1 call that this seems somewhat similar to bz 21760. Does this seem plausible from the evidence available? Thanks, Peter |
| Comment by Peter Jones [ 15/Mar/11 ] |
|
Johann, you were involved in 21760. Are you able to comment on this theory? If so, what evidence should the on-site Bull support staff look for to prove/disprove this theory? Is there a workaround/fix that could be tried out to see if it prevents this problem? Thanks, Peter |
| Comment by Cory Spitz [ 15/Mar/11 ] |
|
Oops, did I say 21760? I meant 21406, but I also missed that this issue was MDT related. See my earlier (edited) comment. Sorry if I caused any misdirection. |
| Comment by Peter Jones [ 15/Mar/11 ] |
|
Heh. Actually, my notes said 21706, so I guess I made the wrong transposition. |
| Comment by Johann Lombardi (Inactive) [ 16/Mar/11 ] |
|
I don't think that bugzilla ticket 21706 is related to this issue. That said, I have noticed that the jbd2-commit-timer-no-jiffies-rounding.diff patch appears to be missing from the kernel in use. HTH |
| Comment by Peter Jones [ 24/Mar/11 ] |
|
I think that the patch Johann mentions is http://review.whamcloud.com/#change,358 |
| Comment by Peter Jones [ 04/Apr/11 ] |
|
Any word back from CEA as to whether this issue still manifests itself with the missing patch applied? |
| Comment by Sebastien Buisson (Inactive) [ 04/Apr/11 ] |
|
Hi Peter, Still no news from CEA on this. At least we will have more information on Thursday. Cheers, |
| Comment by Sebastien Buisson (Inactive) [ 06/Apr/11 ] |
|
Hi, Bad news from CEA. They reactivated Changelogs yesterday evening, and this bug appeared this afternoon. Any 'new' ideas on how to tackle this issue? Sebastien. |
| Comment by Johann Lombardi (Inactive) [ 06/Apr/11 ] |
|
Not without looking at the crash dump. |
| Comment by Sebastien Buisson (Inactive) [ 06/Apr/11 ] |
|
Hi Johann, > Not without looking at the crash dump. Unfortunately the crash dump cannot be taken out of CEA. What crash commands would you like Bruno to launch? Also, are changelogs activated by default, or do they have to be enabled explicitly? Cheers, |
| Comment by Johann Lombardi (Inactive) [ 06/Apr/11 ] |
|
Hi Sebastien, No, changelogs are not activated by default, you need to register a changelog user to enable it. |
| Comment by Alex Zhuravlev [ 06/Apr/11 ] |
|
> Unfortunately the crash dump cannot be taken out of CEA. What crash commands would you like Bruno to launch? A list of all the threads with backtraces would be a good start. |
| Comment by Peter Jones [ 07/Apr/11 ] |
|
Update from Bull: "problem reoccurred yesterday, after less than 24 hours with ChangeLogs activated." |
| Comment by Peter Jones [ 21/Apr/11 ] |
|
As per Bull, CEA do not expect to be able to gather this data until the end of May. |
| Comment by Sebastien Buisson (Inactive) [ 01/Jun/11 ] |
|
Hi, Here is the long-awaited 'foreach bt' (in fact, the Alt+SysRq+T console output taken live during one occurrence of the problem). Cheers, |
| Comment by Alex Zhuravlev [ 22/Jun/11 ] |
|
PID: 26299 TASK: ffff88047d851620 CPU: 28 COMMAND: "llog_process_th" and PID: 22091 TASK: ffff8808695bad90 CPU: 22 COMMAND: "mdt_attr_101". This seems to be the known ordering issue between journal_start() and the catalog semaphore. |
| Comment by Sebastien Buisson (Inactive) [ 12/Jul/11 ] |
|
Hi, Any news about this? TIA, |
| Comment by Alex Zhuravlev [ 18/Jul/11 ] |
|
Hello Sebastien, the fix I was thinking of is part of the work being done for the Sequoia project; we can't land it onto master as it stands. In general, cancelling code should follow the "start transaction first, then do locking in llog" rule. |
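For illustration, here is a minimal C sketch of the problematic ordering this rule is meant to avoid. The names are hypothetical and a plain mutex stands in for the llog catalog lock; this is not the actual Lustre code, just the shape of the deadlock: a thread sleeps in jbd2 while holding the catalog lock, while the commit it waits for cannot finish because other handle owners are queued behind that same lock.

```c
#include <linux/err.h>
#include <linux/jbd2.h>
#include <linux/mutex.h>

/*
 * Hypothetical cancel path showing the bad ordering: the catalog lock
 * is taken first, then a journal handle is opened.  If jbd2 is already
 * committing, jbd2_journal_start() sleeps until the running transaction
 * is done; meanwhile any thread that opened a handle first and now
 * wants cat_lock (e.g. a changelog writer) is stuck, so the commit
 * never completes and the whole MDS freezes.
 */
static int llog_cancel_bad_order(journal_t *journal, struct mutex *cat_lock)
{
        handle_t *handle;

        mutex_lock(cat_lock);                        /* lock first ...          */
        handle = jbd2_journal_start(journal, 8);     /* ... then sleep in jbd2  */
        if (IS_ERR(handle)) {                        /* while holding cat_lock  */
                mutex_unlock(cat_lock);
                return PTR_ERR(handle);
        }

        /* cancel the llog records here */

        jbd2_journal_stop(handle);
        mutex_unlock(cat_lock);
        return 0;
}
```

The rule Alex states simply inverts the two steps: open the journal handle before taking any llog lock, so no thread can ever sleep in jbd2 while holding the lock.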
| Comment by Alexandre Louvet [ 02/Aug/11 ] |
|
Hi, Just wanted to report that the hit frequency of this problem has increased recently. We are now at about 2 or 3 hangs a day. Is there anything we can provide to help? Alex. |
| Comment by Diego Moreno (Inactive) [ 04/Aug/11 ] |
|
Hi, As the priority of this issue is rising, just another question: do you think we can deploy any kind of workaround (other than simply "deactivate changelogs", of course)? Thanks, |
| Comment by Peter Jones [ 04/Aug/11 ] |
|
Niu, could you please look into a workaround/fix for this issue that will work with the existing master code? Thanks, Peter |
| Comment by Patrick Valentin (Inactive) [ 05/Aug/11 ] |
|
Hi, TIA |
| Comment by Peter Jones [ 06/Aug/11 ] |
|
Patrick, I have sent you information on this. Peter |
| Comment by Patrick Valentin (Inactive) [ 09/Aug/11 ] |
|
The tarball containing the kernel core dump and kernel image is available on the whamcloud ftp server.
It must be analysed on a 2.6.32 kernel using the corresponding crash command (5.0.0). In case of trouble, crash 5.1.7 ("crash_5.1.7.tar.gz") is also available in the tarball. To use it, you have to set the following variables: Let me know if you need additional information. Regards, |
| Comment by Niu Yawei (Inactive) [ 17/Aug/11 ] |
|
Hi Alex, Johann, Given that it's difficult to port the Orion llog changes onto master, I think we could introduce a simple temporary workaround for master: add an rw lock to each mdd device to protect the changelog. Each changelog addition takes the read lock, and changelog cancellation has to hold the write lock; since cancellation only happens when a user issues the changelog clear command, I think the performance impact will be acceptable. Considering it's just a temporary workaround, I want to minimize the code changes as much as possible; another advantage of this approach is that it doesn't affect any llog users other than the changelog. If this workaround sounds OK to you, I'll make the patch soon. Thanks |
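A rough sketch of how such a lock could be used, assuming hypothetical structure and field names (the real mdd_device and changelog code are far richer; this only shows the reader/writer split described above):

```c
#include <linux/rwsem.h>

/* hypothetical per-device state; init_rwsem() would run at device setup */
struct mdd_device_sketch {
        struct rw_semaphore mdd_cl_sem;         /* protects the changelog */
};

/* every changelog record addition takes the lock in read (shared) mode */
static int changelog_add_sketch(struct mdd_device_sketch *mdd)
{
        down_read(&mdd->mdd_cl_sem);
        /* start the transaction and append the changelog record here */
        up_read(&mdd->mdd_cl_sem);
        return 0;
}

/*
 * "lfs changelog_clear" is rare, so taking the lock in write (exclusive)
 * mode here excludes all adders without hurting the common path much.
 */
static int changelog_cancel_sketch(struct mdd_device_sketch *mdd)
{
        down_write(&mdd->mdd_cl_sem);
        /* cancel the plain-llog records here */
        up_write(&mdd->mdd_cl_sem);
        return 0;
}
```

The design point is simply that a cancel can never overlap an add, so the cancel path can no longer wait in the journal behind a blocked adder.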
| Comment by Diego Moreno (Inactive) [ 17/Aug/11 ] |
|
Hi Niu, From my point of view, this is what we are looking for: just a workaround based on a simple lock, with a moderate impact on performance. |
| Comment by Johann Lombardi (Inactive) [ 17/Aug/11 ] |
|
Could we start a transaction earlier, as was done in bugzilla 18030? |
| Comment by Niu Yawei (Inactive) [ 18/Aug/11 ] |
|
OK, I have made a patch that starts the transaction before the catalog locking in llog_cat_cancel_records(). Thanks. |
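For reference, a minimal sketch of the ordering this kind of patch aims for, again with hypothetical names and a plain mutex standing in for the catalog lock (not the real llog_cat_cancel_records()): reserve all the journal credits for the batch first, then take the catalog lock.

```c
#include <linux/err.h>
#include <linux/jbd2.h>
#include <linux/mutex.h>

/*
 * Hypothetical cancel path with the corrected ordering: the journal
 * handle is started, with enough credits for the whole batch, before
 * the catalog lock is taken, so the thread never sleeps inside jbd2
 * while other threads are queued behind cat_lock.
 */
static int llog_cancel_fixed_order(journal_t *journal, struct mutex *cat_lock,
                                   int count, int credits_per_rec)
{
        handle_t *handle;
        int i, rc;

        /* 1. start the transaction before any llog locking */
        handle = jbd2_journal_start(journal, count * credits_per_rec);
        if (IS_ERR(handle))
                return PTR_ERR(handle);

        /* 2. only now serialize on the catalog */
        mutex_lock(cat_lock);
        for (i = 0; i < count; i++) {
                /* cancel record i here */
        }
        mutex_unlock(cat_lock);

        rc = jbd2_journal_stop(handle);
        return rc;
}
```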
| Comment by Peter Jones [ 04/Jan/12 ] |
|
Landed for 2.2 |
| Comment by Peter Jones [ 05/Jan/12 ] |
|
Bull reports that this has recurred at CEA. |
| Comment by Bruno Faccini (Inactive) [ 05/Jan/12 ] |
|
So it seems that the work-around (the "patch that starts the transaction before catalog locking in llog_cat_cancel_records()") is not sufficient. Should we go for a more brute-force locking approach instead? What do you think??? |
| Comment by Niu Yawei (Inactive) [ 05/Jan/12 ] |
|
Could you provide the stack trace? If we don't know the exact reason, I'm afraid the brute-force lock won't resolve the problem either. |
| Comment by Bruno Faccini (Inactive) [ 10/Jan/12 ] |
|
This last time, the thread that has been hung for a long time in jbd2_journal_commit_transaction() is named "jbd2/dm-0-8", but still with the same stack!!! Then there is a bunch of other Lustre threads (ll_<...>, mdt_[rdpg_]<id>, ...) stuck with the same ending stages in their stacks. |
| Comment by Niu Yawei (Inactive) [ 10/Jan/12 ] |
|
Hi Bruno, is the exact full stack trace available? |
| Comment by Peter Jones [ 29/Mar/12 ] |
|
Landed for 2.2. Bull advised separately that this issue no longer exists with the fix |
| Comment by Nathan Rutman [ 30/Jul/12 ] |
I believe there is a general problem here that is not resolved by simply increasing the journal credits, which really just serves to mask the problem in some cases. We're looking at a case now where cancelling lots of unlink records results in a similar lock inversion caused by the journal restart in the llog updates. The code really needs to be changed so that the lock inversion can't happen. |
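A minimal sketch of the pattern described here, with hypothetical names and a plain mutex standing in for the llog/catalog lock: even if the handle was started before the lock was taken, restarting it inside the lock once credits run low re-creates the inversion (h_buffer_credits is the handle field available on the 2.6.32-era kernels discussed in this ticket).

```c
#include <linux/err.h>
#include <linux/jbd2.h>
#include <linux/mutex.h>

#define CREDITS_PER_RECORD 8    /* hypothetical per-cancel credit cost */

/*
 * Cancelling a large batch of unlink records: when the handle runs out
 * of credits, jbd2_journal_restart() may wait for the running
 * transaction to commit.  The commit in turn waits for every other open
 * handle to stop, and if one of those handle owners is blocked on
 * cat_lock, we deadlock again despite the "transaction before lock"
 * ordering at entry.
 */
static int cancel_unlink_batch_sketch(handle_t *handle,
                                      struct mutex *cat_lock, int nrecords)
{
        int i, rc = 0;

        mutex_lock(cat_lock);
        for (i = 0; i < nrecords && rc == 0; i++) {
                /* cancel one unlink record here */

                if (handle->h_buffer_credits < CREDITS_PER_RECORD)
                        /* BAD: may sleep on the commit while holding cat_lock */
                        rc = jbd2_journal_restart(handle, CREDITS_PER_RECORD);
        }
        mutex_unlock(cat_lock);
        return rc;
}
```

Avoiding this means either reserving enough credits for the whole batch up front or dropping the lock before any restart, which is the structural change being argued for here rather than simply increasing the credit count.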
| Comment by Bruno Faccini (Inactive) [ 01/Aug/12 ] |
|
I understand there are strong indications that we don't actually have a definitive fix for this rather infrequent problem/deadlock... And by the way, I just got a new occurrence of the same scenario, but on an OSS this time, running Lustre 2.1.1 and kernel 2.6.32-131.12.1, which contains the jbd2-commit-timer-no-jiffies-rounding.diff JBD2 patch... The hung threads' stacks look about the same: PID: 15704 TASK: ffff88062c52c0c0 CPU: 4 COMMAND: "jbd2/dm-5-8", and many others like this one: PID: 15892 TASK: ffff88062c73f4c0 CPU: 4 COMMAND: "ll_ost_io_36". But since we are on an OSS, this cannot be attributed to any "llog" activity; can we just consider that we are back to a "pure" JBD2 issue here??? |
| Comment by William Power [ 01/Aug/12 ] |
|
Bruno, can you post/attach the full set of stack traces for this lockup? |