[LU-4271] mds load goes very high and filesystem hangs after mounting mdt Created: 19/Nov/13 Updated: 14/Aug/14 Resolved: 14/Aug/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Mahmoud Hanafi | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None |
| Attachments: |
|
| Issue Links: |
|
| Severity: | 4 |
| Rank (Obsolete): | 11731 |
| Description |
|
After recovery of a crashed MDS the system load goes to >800. Here is the error: Call Trace: |
| Comments |
| Comment by Mahmoud Hanafi [ 19/Nov/13 ] |
|
Console log showing the start of the issue. |
| Comment by Bruno Faccini (Inactive) [ 19/Nov/13 ] |
|
Hello Mahmoud, I suppose the stack you attached is one that shows up frequently during the problem; are there others showing attached/running/spinning processes instead of blocked ones? This stack indicates we are waiting for an OSS reply; did you check the health of the OSSes? |
| Comment by Bruno Faccini (Inactive) [ 19/Nov/13 ] |
|
Having a look at the Console log you attached, I think more interesting is the fact that there are multiple threads reported as stuck in the JBD2 layer, and particularly the jbd2/dm-3-8 instances with stacks like:

sp                 ip                 Function (args)
0xffff881e643c9c60 0xffffffff8151c712 thread_return
0xffff881e643c9d28 0xffffffffa0c338df [jbd2]jbd2_journal_commit_transaction+0x19f (0xffff883f0c696000)
0xffff881e643c9e68 0xffffffffa0c3a6c8 [jbd2]kjournald2+0xb8 (0xffff883f0c696000)
0xffff881e643c9ee8 0xffffffff8108fb96 kthread+0x96 (0xffff881e641458e8)
0xffff881e643c9f48 0xffffffff8100c0ca child_rip+0xa (unknown, unknown)

This looks like the situation described in |
| Comment by Mahmoud Hanafi [ 19/Nov/13 ] |
|
we were able to get things working by |
| Comment by Bruno Faccini (Inactive) [ 19/Nov/13 ] |
|
Hello Mahmoud, since you have been able to restart your filesystem, can we downgrade the priority from blocker? |
| Comment by Jay Lan (Inactive) [ 19/Nov/13 ] |
|
Please note that I cherry-picked patch set #3 of |
| Comment by Mahmoud Hanafi [ 19/Nov/13 ] |
|
Yes, we can downgrade it from blocker. dm-3-8 is the MDT. I have gone through the RAID logs and nothing shows up there. This is the first trace that is printed. Wouldn't this suggest that it is stuck in ko2iblnd? |
| Comment by Mahmoud Hanafi [ 20/Nov/13 ] |
|
What do these errors mean? We see a lot of these. |
| Comment by Bruno Faccini (Inactive) [ 25/Nov/13 ] |
|
The "? kiblnd_send+0x2a0/0x9e0 [ko2iblnd]" entry in the stack is likely an old/stale reference found by the unwinder. The obd_set_info_async() messages should have been printed upon your MDS+MGS restart and are likely to occur during the massive MGS re-connect from all clients. It probably just means that target_handle_connect() generically calls obd_set_info_async() for the MGS device (when some cleanup need is detected in the import/export), but the MGS has no o_set_info_async method/operation available to proceed with. So we can easily consider the message harmless. |
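As an illustration of that last point (not the actual Lustre code): the sketch below uses simplified, made-up type and function names (obd_ops_sketch, obd_device_sketch, set_info_async_sketch) to show why a missing o_set_info_async method only produces a logged message and an error return, with no other side effect.

#include <stdio.h>
#include <stddef.h>

/* Simplified stand-in for an obd device's method table (hypothetical names). */
struct obd_ops_sketch {
	int (*o_set_info_async)(void *exp, size_t keylen, void *key,
	                        size_t vallen, void *val);
};

struct obd_device_sketch {
	const char *obd_name;
	struct obd_ops_sketch *obd_ops;
};

/*
 * Generic dispatch helper: if the target device (here, the MGS) does not
 * implement o_set_info_async, log a message and return "not supported".
 * Nothing fails afterwards; the generic cleanup path simply has no method
 * to call, which is why the console message can be considered harmless.
 */
static int set_info_async_sketch(struct obd_device_sketch *obd)
{
	if (obd->obd_ops == NULL || obd->obd_ops->o_set_info_async == NULL) {
		printf("%s: no o_set_info_async method, skipping\n", obd->obd_name);
		return -1; /* would be an errno such as -EOPNOTSUPP in kernel code */
	}
	return obd->obd_ops->o_set_info_async(NULL, 0, NULL, 0, NULL);
}

int main(void)
{
	struct obd_device_sketch mgs = { .obd_name = "MGS", .obd_ops = NULL };

	/* Prints the "skipping" message instead of doing any work. */
	return set_info_async_sketch(&mgs) == -1 ? 0 : 1;
}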
| Comment by Bruno Faccini (Inactive) [ 25/Nov/13 ] |
|
I downgraded the priority to Major, is that OK? I don't think there is a lot more to be done for this ticket; my feeling is that we faced a JBD2 issue upon MDS restart+reconnection, and it is unlikely to recur, but if it does happen again we will need a crash-dump. |
| Comment by John Fuchs-Chesney (Inactive) [ 14/Mar/14 ] |
|
Hello Mahmoud, |
| Comment by Mahmoud Hanafi [ 26/Mar/14 ] |
|
We have hit this bug again. Here are the details of the stack:

Lustre: 7240:0:(ldlm_lib.c:952:target_handle_connect()) Skipped 1 previous similar message
Lustre: nbp8-MDT0000: haven't heard from client fae2954d-57b1-3826-45b9-252e2516d6f0 (at 10.151.55.93@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff883f62fef400, cur 1395779273 expire 1395779123 last 1395779046
Lustre: 7228:0:(ldlm_lib.c:952:target_handle_connect()) MGS: connection from e3993dc5-8b59-86ee-9304-faf298c767b4@10.151.23.158@o2ib t0 exp (null) cur 1395779280 last 0
Lustre: 7228:0:(ldlm_lib.c:952:target_handle_connect()) Skipped 1 previous similar message
Lustre: 7398:0:(quota_interface.c:543:quota_chk_acq_common()) still haven't managed to acquire quota space from the quota master after 10 retries (err=0, rc=0)
INFO: task ldlm_cn_01:6369 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ldlm_cn_01 D 000000000000001d 0 6369 2 0x00000080
ffff881e9b8cdb30 0000000000000046 0000000000000000 ffffffffa0925cd0
ffff881e9b8cdad0 00000000fdc0d821 ffff883feeb58e00 ffff883e4235124d
ffff881ff32bf098 ffff881e9b8cdfd8 000000000000fc40 ffff881ff32bf098
Call Trace:
[<ffffffffa0925cd0>] ? kiblnd_send+0x2a0/0x9e0 [ko2iblnd]
[<ffffffffa055c14a>] start_this_handle+0x27a/0x4a0 [jbd2]
[<ffffffff8108ff00>] ? autoremove_wake_function+0x0/0x40
[<ffffffffa055c570>] jbd2_journal_start+0xd0/0x110 [jbd2]
[<ffffffffa08e6338>] ldiskfs_journal_start_sb+0x58/0x90 [ldiskfs]
[<ffffffffa072c017>] fsfilt_ldiskfs_start+0x77/0x5e0 [fsfilt_ldiskfs]
[<ffffffffa07a9ac0>] llog_origin_handle_cancel+0x4b0/0xd70 [ptlrpc]
[<ffffffffa04f9923>] ? cfs_alloc+0x63/0x90 [libcfs]
[<ffffffffa0630ebf>] ? keys_fill+0x6f/0x1a0 [obdclass]
[<ffffffffa076f71f>] ldlm_cancel_handler+0x1bf/0x5e0 [ptlrpc]
[<ffffffffa079fb4e>] ptlrpc_main+0xc4e/0x1a40 [ptlrpc]
[<ffffffffa079ef00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffa079ef00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffffa079ef00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20
INFO: task ldlm_cb_00:6370 blocked for more than 120 seconds.

Attached full dmesg and full process list to the ftp site (nbp8-mds1-dmesg.mar25.gz). |
| Comment by Bruno Faccini (Inactive) [ 27/Mar/14 ] |
|
Did you also successfully take a crash-dump of your MDS, as I requested in my update/comment dated "25/Nov/13 4:47 PM" for this ticket, and as the end of the dmesg you uploaded for this latest occurrence seems to indicate? If yes, can you upload it with the associated vmlinux (or kernel-debuginfo* RPMs) and Lustre modules? BTW, at the time you dumped all thread stacks with the Alt+SysRq+T key-stroke, there were 502 threads stuck in jbd2_journal_start()/start_this_handle() and 129 threads stuck in dqacq_handler()/... because one of them is itself stuck in the journal/jbd2 layer via lustre_commit_dquot(), so the load should have been greater than 630 at that time ... But I am unable to find who is responsible for blocking all of these threads from the full stack listing alone. |
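As a side note, counts like the 502 and 129 above can be reproduced from a saved all-threads stack log with a trivial filter. The sketch below is only an illustration; the default file name "sysrq_t.log" is a hypothetical placeholder, not a file attached to this ticket.

#include <stdio.h>
#include <string.h>

/* Count how many lines of a SysRq-T style dump mention a given function. */
int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "sysrq_t.log";        /* hypothetical */
	const char *needle = argc > 2 ? argv[2] : "start_this_handle";
	char line[1024];
	int count = 0;

	FILE *fp = fopen(path, "r");
	if (fp == NULL) {
		perror(path);
		return 1;
	}
	while (fgets(line, sizeof(line), fp) != NULL) {
		if (strstr(line, needle) != NULL)
			count++; /* roughly one matching frame per blocked thread */
	}
	fclose(fp);
	printf("%d stack lines mention %s\n", count, needle);
	return 0;
}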
| Comment by Mahmoud Hanafi [ 27/Mar/14 ] |
|
We did take a crash dump. I will encrypt and upload it. It will have to be accessed by US citizens only. |
| Comment by Mahmoud Hanafi [ 31/Mar/14 ] |
|
The crash dump was uploaded and an email was sent to Cliff White. |
| Comment by Bruno Faccini (Inactive) [ 15/Apr/14 ] |
|
Cliff, I wonder if you found the same situation in the crash-dump as the one I described from the all-threads stack log? |
| Comment by Cliff White (Inactive) [ 15/Apr/14 ] |
|
Mahmoud, apparently I missed your email, very sorry, can you re-send the information on the dump? |
| Comment by Mahmoud Hanafi [ 15/Apr/14 ] |
|
Cliff, Email sent -Mahmoud |
| Comment by Cliff White (Inactive) [ 16/Apr/14 ] |
|
Mahmoud, you included a large number of kernel RPMs in the tarball; can you please specify exactly which version corresponds to the kernel running on the node where the vmcore was taken? |
| Comment by Bob Glossman (Inactive) [ 21/Apr/14 ] |
|
I have been looking over the crash dump from the customer. As noted in previous comments, there is quite a large number of threads all stuck in jbd2_journal_start with the common calling sequence ... osd_trans_start -> ldiskfs_journal_start_sb -> jbd2_journal_start -> start_this_handle. It seems very likely there is a single other thread holding a lock or otherwise blocking all these callers of jbd2_journal_start, but so far I haven't been able to find the culprit. |
| Comment by Bob Glossman (Inactive) [ 21/Apr/14 ] |
|
I note that in addition to the 500+ threads stuck in jbd2_journal_start, there is one thread stuck elsewhere in jbd2 code. The stack trace of this thread looks like:

PID: 7885 TASK: ffff881e506e2ae0 CPU: 6 COMMAND: "mdt_439"
#0 [ffff881e506f38c0] schedule at ffffffff8151c712
#1 [ffff881e506f3988] jbd2_log_wait_commit at ffffffffa0564325 [jbd2]
#2 [ffff881e506f3a18] jbd2_journal_stop at ffffffffa055bacb [jbd2]
#3 [ffff881e506f3a78] __ldiskfs_journal_stop at ffffffffa08e62a8 [ldiskfs]
#4 [ffff881e506f3aa8] osd_trans_stop at ffffffffa0d26476 [osd_ldiskfs]
#5 [ffff881e506f3ad8] mdd_trans_stop at ffffffffa0c2d4aa [mdd]
#6 [ffff881e506f3ae8] mdd_attr_set at ffffffffa0c0ca5f [mdd]
#7 [ffff881e506f3bc8] cml_attr_set at ffffffffa0d4fa76 [cmm]
#8 [ffff881e506f3bf8] mdt_attr_set at ffffffffa0c9ec68 [mdt]
#9 [ffff881e506f3c48] mdt_reint_setattr at ffffffffa0c9f2b5 [mdt]
#10 [ffff881e506f3cd8] mdt_reint_rec at ffffffffa0c98c81 [mdt]
#11 [ffff881e506f3cf8] mdt_reint_internal at ffffffffa0c8fed4 [mdt]
#12 [ffff881e506f3d48] mdt_reint at ffffffffa0c902b4 [mdt]
#13 [ffff881e506f3d68] mdt_handle_common at ffffffffa0c84772 [mdt]
#14 [ffff881e506f3db8] mdt_regular_handle at ffffffffa0c85665 [mdt]
#15 [ffff881e506f3dc8] ptlrpc_main at ffffffffa079fb4e [ptlrpc]
#16 [ffff881e506f3f48] kernel_thread at ffffffff8100c0ca

I can't see how this thread might be blocking all the others, but it is interesting that it's the only one I have found so far in similar code but different from all the others. |
| Comment by Bob Glossman (Inactive) [ 21/Apr/14 ] |
|
Another odd outlier thread:

PID: 6425 TASK: ffff883fdc6c9500 CPU: 5 COMMAND: "jbd2/dm-3-8"
#0 [ffff883fec013c60] schedule at ffffffff8151c712
#1 [ffff883fec013d28] jbd2_journal_commit_transaction at ffffffffa055d8df [jbd2]
#2 [ffff883fec013e68] kjournald2 at ffffffffa05646c8 [jbd2]
#3 [ffff883fec013ee8] kthread at ffffffff8108fb96
#4 [ffff883fec013f48] kernel_thread at ffffffff8100c0ca

I'm wondering if this might be a dup of |
| Comment by Bruno Faccini (Inactive) [ 21/Apr/14 ] |
|
Sure Bob, this last thread should be the one blocking all the others in the JBD2 layer! And again, as in one of my original updates on 19/Nov/2013, the concerned device is "dm-3-8", which Mahmoud confirmed to be the MDT. And yes, it looks like a dup of

What would be cool now would be to identify whether this last thread has been scheduled recently; if not, why not, and if yes, why it is looping and re-schedule()ing (t_updates != 0?) ... |
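To make the suspected relationship concrete, here is a much-simplified model of the circular wait. This is not the real jbd2 code; the transaction_sketch type only loosely mirrors jbd2's transaction_t/t_updates, and it assumes (still to be confirmed at this point) that at least one handle is still attached to the committing transaction.

#include <stdio.h>
#include <stdbool.h>

struct transaction_sketch {
	int  t_updates;   /* handles still attached to this transaction */
	bool committing;  /* kjournald2 has started the commit */
};

/* start_this_handle(): a new handle must wait while the commit is in progress. */
static const char *start_handle_sketch(const struct transaction_sketch *t)
{
	if (t->committing)
		return "blocked: running transaction is committing "
		       "(where the 500+ threads in jbd2_journal_start sit)";
	return "handle started, t_updates incremented";
}

/* jbd2_journal_commit_transaction(): the commit must wait for t_updates to drain. */
static const char *commit_sketch(const struct transaction_sketch *t)
{
	if (t->t_updates > 0)
		return "blocked: waiting for open handles to be stopped "
		       "(where jbd2/dm-3-8 sits)";
	return "commit proceeds";
}

int main(void)
{
	/* Assumed state: one handle still open, commit already started. */
	struct transaction_sketch t = { .t_updates = 1, .committing = true };

	printf("new handle starter: %s\n", start_handle_sketch(&t));
	printf("commit thread:      %s\n", commit_sketch(&t));
	/* Neither side can make progress unless the open handle is stopped. */
	return 0;
}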
| Comment by Bruno Faccini (Inactive) [ 02/May/14 ] |
|
Bob, can you check if you find the same kind of dead-lock scenario/stacks that Bobi detailed in |
| Comment by Bruno Faccini (Inactive) [ 24/May/14 ] |
|
I have been working "blindly" with Bob, who has direct access to the crash-dump, and here is the analysis of the deadlock from the crash-dump you provided.

PID: 19803 TASK: ffff883ff3754080 CPU: 1 COMMAND: "mdt_504"
#0 [ffff883dc31f76b0] schedule at ffffffff8151c712
#1 [ffff883dc31f7778] cfs_waitq_wait at ffffffffa04f960e [libcfs]
#2 [ffff883dc31f7788] qctxt_wait_pending_dqacq at ffffffffa09bfe1b [lquota]
#3 [ffff883dc31f7878] qctxt_adjust_qunit at ffffffffa09c5d41 [lquota]
#4 [ffff883dc31f7908] quota_acquire_common at ffffffffa09c9653 [lquota]
#5 [ffff883dc31f7938] quota_chk_acq_common at ffffffffa09cb722 [lquota]
#6 [ffff883dc31f7a78] lquota_chkquota.clone.3 at ffffffffa0c0300b [mdd]
#7 [ffff883dc31f7ae8] mdd_attr_set at ffffffffa0c0dcf6 [mdd]
#8 [ffff883dc31f7bc8] cml_attr_set at ffffffffa0d4fa76 [cmm]
#9 [ffff883dc31f7bf8] mdt_attr_set at ffffffffa0c9ec68 [mdt]
#10 [ffff883dc31f7c48] mdt_reint_setattr at ffffffffa0c9f2b5 [mdt]
#11 [ffff883dc31f7cd8] mdt_reint_rec at ffffffffa0c98c81 [mdt]
#12 [ffff883dc31f7cf8] mdt_reint_internal at ffffffffa0c8fed4 [mdt]
#13 [ffff883dc31f7d48] mdt_reint at ffffffffa0c902b4 [mdt]
#14 [ffff883dc31f7d68] mdt_handle_common at ffffffffa0c84772 [mdt]
#15 [ffff883dc31f7db8] mdt_regular_handle at ffffffffa0c85665 [mdt]
#16 [ffff883dc31f7dc8] ptlrpc_main at ffffffffa079fb4e [ptlrpc]
#17 [ffff883dc31f7f48] kernel_thread at ffffffff8100c0ca

This thread is waiting for quota while holding an open journal handle; that is confirmed by looking at the source code of mdd_attr_set(), where we can see the mdd_trans_start()/lquota_chkquota()/mdd_trans_stop() call sequence. The reason this thread is waiting is that the qunit it wants to process is presently hashed/"inflight". Looking further into the other threads to find a reason for this blocked situation, we found a thread that is working on this same qunit but is itself stuck trying to start a new journal/transaction handle (like hundreds of other threads ...), with the following stack:

PID: 7460 TASK: ffff883f65bc9500 CPU: 25 COMMAND: "mdt_308"
#0 [ffff883f65bcb2b0] schedule at ffffffff8151c712
#1 [ffff883f65bcb378] start_this_handle at ffffffffa055c14a [jbd2]
#2 [ffff883f65bcb438] jbd2_journal_start at ffffffffa055c570 [jbd2]
#3 [ffff883f65bcb488] ldiskfs_journal_start_sb at ffffffffa08e6338 [ldiskfs]
#4 [ffff883f65bcb498] lustre_quota_journal_start at ffffffffa072c61f [fsfilt_ldiskfs]
#5 [ffff883f65bcb4a8] lustre_commit_dquot at ffffffffa0734968 [fsfilt_ldiskfs]
#6 [ffff883f65bcb548] fsfilt_ldiskfs_dquot at ffffffffa072b604 [fsfilt_ldiskfs]
#7 [ffff883f65bcb568] dqacq_handler at ffffffffa09d59fe [lquota]
#8 [ffff883f65bcb628] schedule_dqacq at ffffffffa09c2529 [lquota]
#9 [ffff883f65bcb728] qctxt_adjust_qunit at ffffffffa09c5cdc [lquota]
#10 [ffff883f65bcb7b8] quota_acquire_common at ffffffffa09c9653 [lquota]
#11 [ffff883f65bcb7e8] quota_chk_acq_common at ffffffffa09cb722 [lquota]
#12 [ffff883f65bcb928] lquota_chkquota.clone.1 at ffffffffa0c1c28b [mdd]
#13 [ffff883f65bcb998] mdd_rename at ffffffffa0c22c0a [mdd]
#14 [ffff883f65bcbb08] cml_rename at ffffffffa0d52b04 [cmm]
#15 [ffff883f65bcbb88] mdt_reint_rename at ffffffffa0c9da73 [mdt]
#16 [ffff883f65bcbcd8] mdt_reint_rec at ffffffffa0c98c81 [mdt]
#17 [ffff883f65bcbcf8] mdt_reint_internal at ffffffffa0c8fed4 [mdt]
#18 [ffff883f65bcbd48] mdt_reint at ffffffffa0c902b4 [mdt]
#19 [ffff883f65bcbd68] mdt_handle_common at ffffffffa0c84772 [mdt]
#20 [ffff883f65bcbdb8] mdt_regular_handle at ffffffffa0c85665 [mdt]
#21 [ffff883f65bcbdc8] ptlrpc_main at ffffffffa079fb4e [ptlrpc]
#22 [ffff883f65bcbf48] kernel_thread at ffffffff8100c0ca

and this handle start is presently blocked because the current/running transaction is in the commit process!!

It seems this has been possible because in mdd_attr_set() a journal_start()/mdd_trans_start() is done before starting a quota operation, which is never the case in the same layer's other routines, like mdd_rename()/mdd_[un]link()/mdd_create()/..., where all "heavy" quota operations are made outside the mdd_trans_start()/mdd_trans_stop() boundaries, precisely to prevent such dead-lock situations. So a possible fix should be to change the code of mdd_attr_set() by simply moving the #ifdef HAVE_QUOTA_SUPPORT/#endif block out of the mdd_trans_start()/mdd_trans_stop() boundaries. I give it a try at http://review.whamcloud.com/10443. BTW, this problem is a pure b2_1/2.1.x one since all the concerned code has been re-written in 2.4. |
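To visualize the proposed change, here is a schematic sketch of the ordering difference; it is not the actual patch at http://review.whamcloud.com/10443, and the *_sketch functions are placeholders standing in for mdd_trans_start(), lquota_chkquota() and mdd_trans_stop().

#include <stdio.h>

static void trans_start_sketch(void) { printf("journal handle opened\n"); }
static void trans_stop_sketch(void)  { printf("journal handle closed\n"); }
static void quota_check_sketch(void) { printf("quota checked/acquired (may block on the quota master)\n"); }

/*
 * Deadlock-prone ordering (what mdd_attr_set() did): the quota check can
 * block while a journal handle is still open, so the handle is never
 * stopped and the transaction commit can never drain its updaters.
 */
static void attr_set_before_fix(void)
{
	trans_start_sketch();
	quota_check_sketch();  /* may block with the handle still open */
	/* ... set attributes ... */
	trans_stop_sketch();
}

/*
 * Proposed ordering, matching what mdd_rename()/mdd_create()/... already do:
 * the potentially-blocking quota operation happens outside the
 * trans_start/trans_stop boundaries, so no journal handle is pinned.
 */
static void attr_set_after_fix(void)
{
	quota_check_sketch();  /* may block, but no journal handle is held */
	trans_start_sketch();
	/* ... set attributes ... */
	trans_stop_sketch();
}

int main(void)
{
	attr_set_before_fix();
	attr_set_after_fix();
	return 0;
}

The design point is the one made above: any operation that can sleep waiting on another node (here, quota acquisition from the quota master) must not be performed while a journal handle pins the running transaction.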
| Comment by John Fuchs-Chesney (Inactive) [ 29/Jul/14 ] |
|
Hello Bruno and Mahmoud, Can I ask if there is going to be more progress on this ticket? Many thanks, |
| Comment by Peter Jones [ 14/Aug/14 ] |
|
As per NASA, OK to close ticket. |