[LU-10680] MDT becoming unresponsive in 2.10.3 Created: 17/Feb/18 Updated: 23/Jan/24 Resolved: 19/Jun/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.3 |
| Fix Version/s: | Lustre 2.12.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Stephane Thiell | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
3.10.0-693.2.2.el7_lustre.pl1.x86_64 |
||
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
2.10 LTS continues to cause us some trouble, and this time it's the MDT that has been quite unstable lately. A typical problem looks to start with this:

[52968.460453] Lustre: 5897:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
[53955.181780] Lustre: 8034:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
[53955.182270] Lustre: 8034:0:(mdd_device.c:1597:mdd_changelog_clear()) Skipped 2 previous similar messages
[54269.697741] Lustre: 5911:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
[54362.202025] Lustre: 7920:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
[54362.202511] Lustre: 7920:0:(mdd_device.c:1597:mdd_changelog_clear()) Skipped 1 previous similar message
[54808.564380] perf: interrupt took too long (2579 > 2500), lowering kernel.perf_event_max_sample_rate to 77000
[56283.046677] Lustre: 7958:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
[56283.047150] Lustre: 7958:0:(mdd_device.c:1597:mdd_changelog_clear()) Skipped 14 previous similar messages
[56867.725405] Lustre: 5397:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
[60578.888437] Lustre: 8014:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
[60579.796894] Lustre: 8078:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
[60579.797374] Lustre: 8078:0:(mdd_device.c:1597:mdd_changelog_clear()) Skipped 2 previous similar messages
[62145.590529] LNet: Service thread pid 157540 was inactive for 200.33s. The thread might be hung, or it might only be slow and will resume later.
Dumping the stack trace for debugging purposes:
[62145.591250] Pid: 157540, comm: mdt_rdpg01_050
[62145.591486] Call Trace:
[62145.591949] [<ffffffff816a94e9>] schedule+0x29/0x70
[62145.592200] [<ffffffff816a6ff9>] schedule_timeout+0x239/0x2c0
[62145.592444] [<ffffffff810c9ef9>] ? select_task_rq_fair+0x549/0x700
[62145.592708] [<ffffffff816a989d>] wait_for_completion+0xfd/0x140
[62145.592971] [<ffffffff810c4810>] ? default_wake_function+0x0/0x20
[62145.593223] [<ffffffff810b07ea>] kthread_create_on_node+0xaa/0x140
[62145.593472] [<ffffffffc12f0ac0>] ? mdd_chlg_garbage_collect+0x0/0x640 [mdd]
[62145.593760] [<ffffffffc091346a>] ? llog_add+0x7a/0x1a0 [obdclass]
[62145.594025] [<ffffffff810ea8ca>] ? __getnstimeofday64+0x3a/0xd0
[62145.594272] [<ffffffffc12f7ac4>] mdd_changelog_store+0x194/0x540 [mdd]
[62145.594537] [<ffffffffc130532d>] mdd_changelog_data_store_by_fid+0xed/0x1a0 [mdd]
[62145.595020] [<ffffffffc130644b>] mdd_changelog_data_store+0x15b/0x210 [mdd]
[62145.595267] [<ffffffffc130779b>] mdd_close+0x25b/0xcf0 [mdd]
[62145.595540] [<ffffffffc11c9073>] mdt_mfd_close+0x353/0x610 [mdt]
[62145.595790] [<ffffffffc11ce651>] mdt_close_internal+0x121/0x220 [mdt]
[62145.596051] [<ffffffffc11ce970>] mdt_close+0x220/0x780 [mdt]
[62145.596352] [<ffffffffc0bebda5>] tgt_request_handle+0x925/0x1370 [ptlrpc]
[62145.596633] [<ffffffffc0b94b16>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
[62145.597138] [<ffffffffc0b91148>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[62145.597385] [<ffffffff810c4822>] ? default_wake_function+0x12/0x20
[62145.597651] [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
[62145.597927] [<ffffffffc0b98252>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[62145.598176] [<ffffffff81029557>] ? __switch_to+0xd7/0x510
[62145.598420] [<ffffffff816a8f00>] ? __schedule+0x2f0/0x8b0
[62145.598695] [<ffffffffc0b977c0>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
[62145.598955] [<ffffffff810b098f>] kthread+0xcf/0xe0
[62145.604690] [<ffffffff810b08c0>] ? kthread+0x0/0xe0
[62145.604932] [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
[62145.605175] [<ffffffff810b08c0>] ? kthread+0x0/0xe0

... leading to the end of connectivity, but the server doesn't really crash. So I took a crash dump when the MDS was unresponsive; it is available at the following links:

vmcore-oak-md1-s2-2018-02-16-11:02:21.gz: https://stanford.box.com/s/n20wj34xhkmmltiav427l3btdz9qjwvn
vmlinux-3.10.0-693.2.2.el7_lustre.pl1.x86_64.gz: https://stanford.box.com/s/r6u6xjzmsgys2kzcq26562bqo6p45xcp

I am also attaching foreach bt and memory usage. Thanks! Stephane |
| Comments |
| Comment by Peter Jones [ 17/Feb/18 ] |
|
Thanks Stephane |
| Comment by Bruno Faccini (Inactive) [ 18/Feb/18 ] |
|
Stephane, the stack trace shows thread 157540 waiting in kthread_create_on_node() to spawn the changelog garbage-collection thread (mdd_chlg_garbage_collect), but for some unknown reason kthreadd is not starting it, or has still not reported that it did... On the other hand, the mdd_changelog_clear() error messages seem to indicate that changelog garbage collection may have already started! Thus, I would like to analyze the available crash-dump, but to do so I also need the lustre-debuginfo RPM for your Lustre version; can you also make it available? |
| Comment by Stephane Thiell [ 18/Feb/18 ] |
|
Hey Bruno, Thanks for checking. I noticed the function `mdd_chlg_garbage_collect()` in the "inactive" thread's stack trace too. But this is weird, because it is very unlikely cl1 was actually idle: it's a robinhood consumer, and if I check the robinhood log just a bit before the issue/crash dump, it seems to be running:

2018/02/16 10:29:18 [7022/2] STATS | ==================== Dumping stats at 2018/02/16 10:29:18 =====================
2018/02/16 10:29:18 [7022/2] STATS | ======== General statistics =========
2018/02/16 10:29:18 [7022/2] STATS | Daemon start time: 2018/01/29 11:23:38
2018/02/16 10:29:18 [7022/2] STATS | Started modules: scan,log_reader,policy_run(modeguard)
2018/02/16 10:29:18 [7022/2] STATS | ======== FS scan statistics =========
2018/02/16 10:29:18 [7022/2] STATS | last scan = 2018/01/30 00:24:56
2018/02/16 10:29:18 [7022/2] STATS | duration = 13h 01min 18s (46878 s)
2018/02/16 10:29:18 [7022/2] STATS | status = complete
2018/02/16 10:29:18 [7022/2] STATS | current scan interval = 22.0d
2018/02/16 10:29:18 [7022/2] STATS | ChangeLog reader #0:
2018/02/16 10:29:18 [7022/2] STATS | fs_name = oak
2018/02/16 10:29:18 [7022/2] STATS | mdt_name = MDT0000
2018/02/16 10:29:18 [7022/2] STATS | reader_id = cl1
2018/02/16 10:29:18 [7022/2] STATS | records read = 176082077
2018/02/16 10:29:18 [7022/2] STATS | interesting records = 119680156
2018/02/16 10:29:18 [7022/2] STATS | suppressed records = 56401921
2018/02/16 10:29:18 [7022/2] STATS | records pending = 999
2018/02/16 10:29:18 [7022/2] STATS | last received = 2018/02/16 10:29:17
2018/02/16 10:29:18 [7022/2] STATS | last read record time = 2018/02/16 10:29:17.671089
2018/02/16 10:29:18 [7022/2] STATS | last read record id = 3459537671
2018/02/16 10:29:18 [7022/2] STATS | last pushed record id = 3459535680
2018/02/16 10:29:18 [7022/2] STATS | last committed record id = 3459535680
2018/02/16 10:29:18 [7022/2] STATS | last cleared record id = 3459535680
2018/02/16 10:29:18 [7022/2] STATS | read speed = 76977.50 record/sec (513.18 incl. idle time)
2018/02/16 10:29:18 [7022/2] STATS | processing speed ratio = 1.00
2018/02/16 10:29:18 [7022/2] STATS | ChangeLog stats:
2018/02/16 10:29:18 [7022/2] STATS | MARK: 18414, CREAT: 16458408, MKDIR: 3318418, HLINK: 35763559, SLINK: 120840, MKNOD: 726
2018/02/16 10:29:18 [7022/2] STATS | UNLNK: 43112348, RMDIR: 3143745, RENME: 6424858, RNMTO: 0, OPEN: 0, CLOSE: 32387229
2018/02/16 10:29:18 [7022/2] STATS | LYOUT: 118087, TRUNC: 3063071, SATTR: 29921326, XATTR: 0, HSM: 0, MTIME: 695102, CTIME: 1535946
2018/02/16 10:29:18 [7022/2] STATS | ATIME: 0, MIGRT: 0

I also just checked that we only have one changelog consumer, cl1:

[root@oak-md1-s2 ~]# cat /proc/fs/lustre/mdd/oak-MDT0000/changelog_users
current index: 3475865674
ID index (idle seconds)
cl1 3475865473 (1)
[root@oak-md1-s2 ~]#

You should be able to download the debuginfo RPMs at the following links:
Thanks! |
| Comment by Bruno Faccini (Inactive) [ 18/Feb/18 ] |
|
Well, after doing some analysis on the crash-dump, I am puzzled by what I seem to have discovered. Here is a dump of the assembler code for function mdd_changelog_store:
931 int mdd_changelog_store(const struct lu_env *env, struct mdd_device *mdd,
0xffffffffc12f7972 <+66>: mov 0x28(%rsi),%r14
932 struct llog_changelog_rec *rec, struct thandle *th)
933 {
0xffffffffc12f7930 <+0>: nopl 0x0(%rax,%rax,1)
0xffffffffc12f7935 <+5>: push %rbp
0xffffffffc12f7936 <+6>: mov %rsp,%rbp
0xffffffffc12f7939 <+9>: push %r15
0xffffffffc12f793b <+11>: push %r14
0xffffffffc12f793d <+13>: push %r13
0xffffffffc12f793f <+15>: mov %rdx,%r13
0xffffffffc12f7942 <+18>: push %r12
0xffffffffc12f7944 <+20>: push %rbx
0xffffffffc12f7945 <+21>: mov %rsi,%rbx
0xffffffffc12f794f <+31>: sub $0x30,%rsp
0xffffffffc12f7957 <+39>: mov %rcx,-0x58(%rbp)
0xffffffffc12f795b <+43>: mov %gs:0x28,%rax
0xffffffffc12f7964 <+52>: mov %rax,-0x30(%rbp)
0xffffffffc12f7968 <+56>: xor %eax,%eax
0xffffffffc12f796a <+58>: mov %rdi,-0x48(%rbp)
934 struct obd_device *obd = mdd2obd_dev(mdd);
935 struct llog_ctxt *ctxt;
936 struct thandle *llog_th;
937 int rc;
938 bool run_gc_task;
939
940 rec->cr_hdr.lrh_len = llog_data_len(sizeof(*rec) +
0xffffffffc12f79b2 <+130>: mov %eax,0x0(%r13)
........................
964
965 /* time to recover some space ?? */
966 spin_lock(&mdd->mdd_cl.mc_lock);
967 if (unlikely(mdd->mdd_changelog_gc && (ktime_get_real_seconds() -
0xffffffffc12f7a8f <+351>: mov 0xf8(%rbx),%eax
0xffffffffc12f7a95 <+357>: test %eax,%eax
0xffffffffc12f7a97 <+359>: jne 0xffffffffc12f7dc0 <mdd_changelog_store+1168>
0xffffffffc12f7dc0 <+1168>: callq 0xffffffffc07fec90 <ktime_get_real_seconds>
0xffffffffc12f7dc5 <+1173>: mov 0x108(%rbx),%edx
0xffffffffc12f7dcb <+1179>: sub 0xf0(%rbx),%rax
0xffffffffc12f7dd2 <+1186>: cmp %rdx,%rax
0xffffffffc12f7dd5 <+1189>: jle 0xffffffffc12f7a9d <mdd_changelog_store+365>
0xffffffffc12f7ddb <+1195>: cmpq $0x0,0xe8(%rbx)
0xffffffffc12f7de3 <+1203>: jne 0xffffffffc12f7a9d <mdd_changelog_store+365>
0xffffffffc12f7de9 <+1209>: mov 0x30(%r15),%rdi
0xffffffffc12f7ded <+1213>: callq 0xffffffffc0919a20 <llog_cat_free_space>
0xffffffffc12f7df2 <+1218>: cmp 0x10c(%rbx),%eax
0xffffffffc12f7df8 <+1224>: ja 0xffffffffc12f7a9d <mdd_changelog_store+365>
968 mdd->mdd_cl.mc_gc_time > mdd->mdd_changelog_min_gc_interval) &&
969 mdd->mdd_cl.mc_gc_task == NULL &&
970 llog_cat_free_space(ctxt->loc_handle) <=
971 mdd->mdd_changelog_min_free_cat_entries)) {
972 CWARN("%s: low on changelog_catalog free entries, starting "
0xffffffffc12f7dfe <+1230>: lea 0x40(%r14),%rdx
0xffffffffc12f7e02 <+1234>: mov $0xffffffffc1316ec8,%rsi
0xffffffffc12f7e09 <+1241>: mov $0xffffffffc132ba80,%rdi
0xffffffffc12f7e10 <+1248>: xor %eax,%eax
0xffffffffc12f7e12 <+1250>: movl $0x4,0x33c74(%rip) # 0xffffffffc132ba90 <msgdata.69570+16>
0xffffffffc12f7e1c <+1260>: movq $0xffffffffc1316b28,0x33c59(%rip) # 0xffffffffc132ba80 <msgdata.69570>
0xffffffffc12f7e27 <+1271>: movq $0xffffffffc1312b70,0x33c56(%rip) # 0xffffffffc132ba88 <msgdata.69570+8>
0xffffffffc12f7e32 <+1282>: movl $0x3cd,0x33c58(%rip) # 0xffffffffc132ba94 <msgdata.69570+20>
0xffffffffc12f7e3c <+1292>: movq $0xffffffffc132ba70,0x33c59(%rip) # 0xffffffffc132baa0 <msgdata.69570+32>
0xffffffffc12f7e47 <+1303>: movl $0x400,0x33c47(%rip) # 0xffffffffc132ba98 <msgdata.69570+24>
0xffffffffc12f7e51 <+1313>: callq 0xffffffffc0809b70 <libcfs_debug_msg>
973 "ChangeLog garbage collection thread\n", obd->obd_name);
974
975 /* indicate further kthread run will occur outside right after
976 * critical section
977 */
978 mdd->mdd_cl.mc_gc_task = (struct task_struct *)(-1);
0xffffffffc12f7e56 <+1318>: movq $0xffffffffffffffff,0xe8(%rbx)
0xffffffffc12f7e61 <+1329>: jmpq 0xffffffffc12f7a9d <mdd_changelog_store+365>
0xffffffffc12f7e66: nopw %cs:0x0(%rax,%rax,1)
979 run_gc_task = true;
980 }
981 spin_unlock(&mdd->mdd_cl.mc_lock);
982 if (run_gc_task) {
983 struct task_struct *gc_task;
984
985 gc_task = kthread_run(mdd_chlg_garbage_collect, mdd,
0xffffffffc12f7aa7 <+375>: xor %eax,%eax
0xffffffffc12f7aa9 <+377>: mov $0xffffffffc131502e,%rcx
0xffffffffc12f7ab0 <+384>: mov $0xffffffff,%edx
0xffffffffc12f7ab5 <+389>: mov %rbx,%rsi
0xffffffffc12f7ab8 <+392>: mov $0xffffffffc12f0ac0,%rdi
0xffffffffc12f7abf <+399>: callq 0xffffffff810b0740 <kthread_create_on_node>
0xffffffffc12f7ac4 <+404>: cmp $0xfffffffffffff000,%rax
0xffffffffc12f7aca <+410>: mov %rax,%r12
0xffffffffc12f7acd <+413>: ja 0xffffffffc12f7d55 <mdd_changelog_store+1061>
0xffffffffc12f7ad3 <+419>: mov %rax,%rdi
0xffffffffc12f7ad6 <+422>: callq 0xffffffff810c4750 <wake_up_process>
............................
where it appears that the compiler has completely optimized away the setting and testing of the run_gc_task variable. And this seems to be somewhat confirmed in your crash-dump:

crash> bt 2
PID: 2 TASK: ffff880169448fd0 CPU: 6 COMMAND: "kthreadd"
 #0 [ffff88016946b500] __schedule at ffffffff816a8f65
 #1 [ffff88016946b568] schedule at ffffffff816a94e9
 #2 [ffff88016946b578] wait_transaction_locked at ffffffffc0177085 [jbd2]
 #3 [ffff88016946b5d0] add_transaction_credits at ffffffffc0177368 [jbd2]
 #4 [ffff88016946b630] start_this_handle at ffffffffc01775e1 [jbd2]
 #5 [ffff88016946b6c8] jbd2__journal_start at ffffffffc0177a93 [jbd2]
 #6 [ffff88016946b710] __ldiskfs_journal_start_sb at ffffffffc0f55069 [ldiskfs]
 #7 [ffff88016946b750] ldiskfs_acquire_dquot at ffffffffc0f8b753 [ldiskfs]
 #8 [ffff88016946b770] dqget at ffffffff81267814
 #9 [ffff88016946b7d0] dquot_get_dqblk at ffffffff812685f4
#10 [ffff88016946b7f0] osd_acct_index_lookup at ffffffffc10a58df [osd_ldiskfs]
#11 [ffff88016946b828] lquota_disk_read at ffffffffc1014214 [lquota]
#12 [ffff88016946b858] qsd_refresh_usage at ffffffffc101bbfa [lquota]
#13 [ffff88016946b890] qsd_op_adjust at ffffffffc102a881 [lquota]
#14 [ffff88016946b8d0] osd_object_delete at ffffffffc106ec50 [osd_ldiskfs]
#15 [ffff88016946b910] lu_object_free at ffffffffc095ce3d [obdclass]
#16 [ffff88016946b968] lu_site_purge_objects at ffffffffc095da9e [obdclass]
#17 [ffff88016946ba10] lu_cache_shrink at ffffffffc095e8e9 [obdclass]
#18 [ffff88016946ba60] shrink_slab at ffffffff81195413
#19 [ffff88016946bb00] zone_reclaim at ffffffff81198091
#20 [ffff88016946bba8] get_page_from_freelist at ffffffff8118c264
#21 [ffff88016946bcb8] __alloc_pages_nodemask at ffffffff8118caf6
#22 [ffff88016946bd68] copy_process at ffffffff8108510d
#23 [ffff88016946bdf8] do_fork at ffffffff81086a51
#24 [ffff88016946be70] kernel_thread at ffffffff81086d06
#25 [ffff88016946be80] kthreadd at ffffffff810b1341
#26 [ffff88016946bf50] ret_from_fork at ffffffff816b4f58
crash>

even if kthreadd seems to be moving forward, as its last activity timestamp is much more recent (-1120s) than that of thread 157540, which is still stuck with the same stack. So the first thing I have to do ASAP is to push a fix that effectively initializes the run_gc_task boolean to false; second, I will continue the crash-dump analysis in order to understand the reason for the "kthreadd" pseudo-hang/slow operation. |
| Comment by Gerrit Updater [ 18/Feb/18 ] |
|
Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/31347 |
| Comment by Stephane Thiell [ 19/Feb/18 ] |
|
Bruno, Wow, what amazing troubleshooting! It is indeed strange that the uninitialized variable wasn't detected by -Wall or by QA/Maloo... I just built new 2.10.3 RPMs with your patch added, which I will deploy ASAP. Thanks much! |
| Comment by Bruno Faccini (Inactive) [ 20/Feb/18 ] |
|
Hmm, I think I have finally understood what caused the kthreadd thread to hang, and then the whole of the MDT operations to become blocked/slow. We triggered the following deadlock, where a lot of MDT service threads (mdt[0-9], mdt_rdpg) were stuck in uninterruptible state waiting for kthreadd, with the following kinds of stacks:

PID: 13741 TASK: ffff88203a31dee0 CPU: 23 COMMAND: "mdt_rdpg01_031"
 #0 [ffff880fe83efb20] __schedule at ffffffff816a8f65
 #1 [ffff880fe83efb88] schedule at ffffffff816a94e9
 #2 [ffff880fe83efb98] schedule_timeout at ffffffff816a6ff9
 #3 [ffff880fe83efc40] wait_for_completion at ffffffff816a989d
 #4 [ffff880fe83efca0] kthread_create_on_node at ffffffff810b07ea
 #5 [ffff880fe83efd58] ptlrpc_start_thread at ffffffffc0b962d5 [ptlrpc]
 #6 [ffff880fe83efde8] ptlrpc_main at ffffffffc0b9803c [ptlrpc]
 #7 [ffff880fe83efec8] kthread at ffffffff810b098f
 #8 [ffff880fe83eff50] ret_from_fork at ffffffff816b4f58
.......................
PID: 157540 TASK: ffff88103a13bf40 CPU: 11 COMMAND: "mdt_rdpg01_050"
 #0 [ffff881cb4b578d0] __schedule at ffffffff816a8f65
 #1 [ffff881cb4b57938] schedule at ffffffff816a94e9
 #2 [ffff881cb4b57948] schedule_timeout at ffffffff816a6ff9
 #3 [ffff881cb4b579f0] wait_for_completion at ffffffff816a989d
 #4 [ffff881cb4b57a50] kthread_create_on_node at ffffffff810b07ea
 #5 [ffff881cb4b57b08] mdd_changelog_store at ffffffffc12f7ac4 [mdd]
 #6 [ffff881cb4b57b70] mdd_changelog_data_store_by_fid at ffffffffc130532d [mdd]
 #7 [ffff881cb4b57bc8] mdd_changelog_data_store at ffffffffc130644b [mdd]
 #8 [ffff881cb4b57be0] mdd_close at ffffffffc130779b [mdd]
 #9 [ffff881cb4b57c40] mdt_mfd_close at ffffffffc11c9073 [mdt]
#10 [ffff881cb4b57c90] mdt_close_internal at ffffffffc11ce651 [mdt]
#11 [ffff881cb4b57cc0] mdt_close at ffffffffc11ce970 [mdt]
#12 [ffff881cb4b57d00] tgt_request_handle at ffffffffc0bebda5 [ptlrpc]
#13 [ffff881cb4b57d48] ptlrpc_server_handle_request at ffffffffc0b94b16 [ptlrpc]
#14 [ffff881cb4b57de8] ptlrpc_main at ffffffffc0b98252 [ptlrpc]
#15 [ffff881cb4b57ec8] kthread at ffffffff810b098f
#16 [ffff881cb4b57f50] ret_from_fork at ffffffff816b4f58
.....................
PID: 193852 TASK: ffff88202234bf40 CPU: 3 COMMAND: "mdt01_098"
 #0 [ffff881aa61fb5e0] __schedule at ffffffff816a8f65
 #1 [ffff881aa61fb648] schedule at ffffffff816a94e9
 #2 [ffff881aa61fb658] schedule_timeout at ffffffff816a6ff9
 #3 [ffff881aa61fb700] wait_for_completion at ffffffff816a989d
 #4 [ffff881aa61fb760] kthread_create_on_node at ffffffff810b07ea
 #5 [ffff881aa61fb818] mdd_changelog_store at ffffffffc12f7ac4 [mdd]
 #6 [ffff881aa61fb880] mdd_changelog_ns_store at ffffffffc12f8096 [mdd]
 #7 [ffff881aa61fb918] mdd_create at ffffffffc12fb40a [mdd]
 #8 [ffff881aa61fba10] mdt_reint_open at ffffffffc11cc9bc [mdt]
 #9 [ffff881aa61fbb00] mdt_reint_rec at ffffffffc11c18a0 [mdt]
#10 [ffff881aa61fbb28] mdt_reint_internal at ffffffffc11a330b [mdt]
#11 [ffff881aa61fbb60] mdt_intent_reint at ffffffffc11a3832 [mdt]
#12 [ffff881aa61fbba0] mdt_intent_policy at ffffffffc11ae59e [mdt]
#13 [ffff881aa61fbbf8] ldlm_lock_enqueue at ffffffffc0b39277 [ptlrpc]
#14 [ffff881aa61fbc50] ldlm_handle_enqueue0 at ffffffffc0b62903 [ptlrpc]
#15 [ffff881aa61fbce0] tgt_enqueue at ffffffffc0be7ea2 [ptlrpc]
#16 [ffff881aa61fbd00] tgt_request_handle at ffffffffc0bebda5 [ptlrpc]
#17 [ffff881aa61fbd48] ptlrpc_server_handle_request at ffffffffc0b94b16 [ptlrpc]
#18 [ffff881aa61fbde8] ptlrpc_main at ffffffffc0b98252 [ptlrpc]
#19 [ffff881aa61fbec8] kthread at ffffffff810b098f
#20 [ffff881aa61fbf50] ret_from_fork at ffffffff816b4f58
...................

These threads were either trying to spawn new service threads, or hit the previously described problem of one changelog garbage-collection thread being spawned for each new changelog record.

But as I have already indicated, kthreadd itself is stuck with the following stack:

PID: 2 TASK: ffff880169448fd0 CPU: 6 COMMAND: "kthreadd"
 #0 [ffff88016946b500] __schedule at ffffffff816a8f65
 #1 [ffff88016946b568] schedule at ffffffff816a94e9
 #2 [ffff88016946b578] wait_transaction_locked at ffffffffc0177085 [jbd2]
 #3 [ffff88016946b5d0] add_transaction_credits at ffffffffc0177368 [jbd2]
 #4 [ffff88016946b630] start_this_handle at ffffffffc01775e1 [jbd2]
 #5 [ffff88016946b6c8] jbd2__journal_start at ffffffffc0177a93 [jbd2]
 #6 [ffff88016946b710] __ldiskfs_journal_start_sb at ffffffffc0f55069 [ldiskfs]
 #7 [ffff88016946b750] ldiskfs_acquire_dquot at ffffffffc0f8b753 [ldiskfs]
 #8 [ffff88016946b770] dqget at ffffffff81267814
 #9 [ffff88016946b7d0] dquot_get_dqblk at ffffffff812685f4
#10 [ffff88016946b7f0] osd_acct_index_lookup at ffffffffc10a58df [osd_ldiskfs]
#11 [ffff88016946b828] lquota_disk_read at ffffffffc1014214 [lquota]
#12 [ffff88016946b858] qsd_refresh_usage at ffffffffc101bbfa [lquota]
#13 [ffff88016946b890] qsd_op_adjust at ffffffffc102a881 [lquota]
#14 [ffff88016946b8d0] osd_object_delete at ffffffffc106ec50 [osd_ldiskfs]
#15 [ffff88016946b910] lu_object_free at ffffffffc095ce3d [obdclass]
#16 [ffff88016946b968] lu_site_purge_objects at ffffffffc095da9e [obdclass]
#17 [ffff88016946ba10] lu_cache_shrink at ffffffffc095e8e9 [obdclass]
#18 [ffff88016946ba60] shrink_slab at ffffffff81195413
#19 [ffff88016946bb00] zone_reclaim at ffffffff81198091
#20 [ffff88016946bba8] get_page_from_freelist at ffffffff8118c264
#21 [ffff88016946bcb8] __alloc_pages_nodemask at ffffffff8118caf6
#22 [ffff88016946bd68] copy_process at ffffffff8108510d
#23 [ffff88016946bdf8] do_fork at ffffffff81086a51
#24 [ffff88016946be70] kernel_thread at ffffffff81086d06
#25 [ffff88016946be80] kthreadd at ffffffff810b1341
#26 [ffff88016946bf50] ret_from_fork at ffffffff816b4f58

Upon trying to launch a new thread, kthreadd unfortunately triggered a memory-reclaim condition, with the consequence that the shrinkers were called, including the one for the lu_cache, in order to drain unreferenced objects. In doing so it tries to update quotas in the quota slave file, but it is blocked waiting to be granted journal credits, because the current/running transaction has still not committed. And that is because the transaction is waiting for a lot of its attached/started handles to complete (jbd2_journal_stop!), which will never happen, since they almost all belong to the threads waiting for kthreadd. Hence the deadlock. I think that, to fix this and prevent such a scenario from recurring, I need to find a way to start the changelog garbage-collection thread outside the section of code where a journal transaction is being filled. I will push a new/2nd patch soon to implement this. |
| Comment by Bruno Faccini (Inactive) [ 21/Feb/18 ] |
|
Thinking more about it, it seems to me that kthreadd should be excluded from running the memory-reclaim mechanism in its own context; in such a case it would be better to do nothing (leaving the next thread that tries to allocate memory to hit the same conditions/thresholds and reclaim in its own context), or to simply wake up kswapd to do the job asynchronously. But in any case, we need to protect against such scenarios, and only request the creation of the changelog garbage-collection thread outside of any current journal transaction/handle. |
| Comment by Stephane Thiell [ 21/Feb/18 ] |
|
Thanks Bruno! Hope you'll find a way to fix this race condition. |
| Comment by Gerrit Updater [ 22/Feb/18 ] |
|
Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/31376 |
| Comment by Gerrit Updater [ 27/Feb/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31347/ |
| Comment by Bruno Faccini (Inactive) [ 02/Mar/18 ] |
|
Hello Stephane, Hopefully this specific case should be handled and fixed by my 2nd fix for this ticket. |
| Comment by Stephane Thiell [ 02/Mar/18 ] |
|
Hey Bruno, Thanks for the update! And yes, this happened with vm.zone_reclaim_mode=0 on MDS. |
| Comment by Gerrit Updater [ 06/Mar/18 ] |
|
John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/31552 |
| Comment by Gerrit Updater [ 08/Mar/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31552/ |
| Comment by Andreas Dilger [ 08/Mar/18 ] |
|
Moving over to 2.12 now that the GC feature is no longer enabled by default, and the failing tests have been disabled. The remaining patch " |
| Comment by Gerrit Updater [ 14/Jun/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31376/ |
| Comment by Peter Jones [ 19/Jun/18 ] |
|
Landed for 2.12 |
| Comment by Andreas Dilger [ 23/Jan/24 ] |
|
"Xing Huang <hxing@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53773 |