Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.12.0
    • Affects Version/s: Lustre 2.10.3
    • None
    • Environment: 3.10.0-693.2.2.el7_lustre.pl1.x86_64
    • 3
    • 9223372036854775807

    Description

      2.10 LTS continues to cause us some trouble, and this time it's the MDT that has been quite unstable lately. A typical problem seems to start with this:

      [52968.460453] Lustre: 5897:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [53955.181780] Lustre: 8034:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [53955.182270] Lustre: 8034:0:(mdd_device.c:1597:mdd_changelog_clear()) Skipped 2 previous similar messages
      [54269.697741] Lustre: 5911:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [54362.202025] Lustre: 7920:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [54362.202511] Lustre: 7920:0:(mdd_device.c:1597:mdd_changelog_clear()) Skipped 1 previous similar message
      [54808.564380] perf: interrupt took too long (2579 > 2500), lowering kernel.perf_event_max_sample_rate to 77000
      [56283.046677] Lustre: 7958:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [56283.047150] Lustre: 7958:0:(mdd_device.c:1597:mdd_changelog_clear()) Skipped 14 previous similar messages
      [56867.725405] Lustre: 5397:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [60578.888437] Lustre: 8014:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [60579.796894] Lustre: 8078:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [60579.797374] Lustre: 8078:0:(mdd_device.c:1597:mdd_changelog_clear()) Skipped 2 previous similar messages
      [62145.590529] LNet: Service thread pid 157540 was inactive for 200.33s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [62145.591250] Pid: 157540, comm: mdt_rdpg01_050
      [62145.591486] 
                     Call Trace:
      [62145.591949]  [<ffffffff816a94e9>] schedule+0x29/0x70
      [62145.592200]  [<ffffffff816a6ff9>] schedule_timeout+0x239/0x2c0
      [62145.592444]  [<ffffffff810c9ef9>] ? select_task_rq_fair+0x549/0x700
      [62145.592708]  [<ffffffff816a989d>] wait_for_completion+0xfd/0x140
      [62145.592971]  [<ffffffff810c4810>] ? default_wake_function+0x0/0x20
      [62145.593223]  [<ffffffff810b07ea>] kthread_create_on_node+0xaa/0x140
      [62145.593472]  [<ffffffffc12f0ac0>] ? mdd_chlg_garbage_collect+0x0/0x640 [mdd]
      [62145.593760]  [<ffffffffc091346a>] ? llog_add+0x7a/0x1a0 [obdclass]
      [62145.594025]  [<ffffffff810ea8ca>] ? __getnstimeofday64+0x3a/0xd0
      [62145.594272]  [<ffffffffc12f7ac4>] mdd_changelog_store+0x194/0x540 [mdd]
      [62145.594537]  [<ffffffffc130532d>] mdd_changelog_data_store_by_fid+0xed/0x1a0 [mdd]
      [62145.595020]  [<ffffffffc130644b>] mdd_changelog_data_store+0x15b/0x210 [mdd]
      [62145.595267]  [<ffffffffc130779b>] mdd_close+0x25b/0xcf0 [mdd]
      [62145.595540]  [<ffffffffc11c9073>] mdt_mfd_close+0x353/0x610 [mdt]
      [62145.595790]  [<ffffffffc11ce651>] mdt_close_internal+0x121/0x220 [mdt]
      [62145.596051]  [<ffffffffc11ce970>] mdt_close+0x220/0x780 [mdt]
      [62145.596352]  [<ffffffffc0bebda5>] tgt_request_handle+0x925/0x1370 [ptlrpc]
      [62145.596633]  [<ffffffffc0b94b16>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
      [62145.597138]  [<ffffffffc0b91148>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      [62145.597385]  [<ffffffff810c4822>] ? default_wake_function+0x12/0x20
      [62145.597651]  [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
      [62145.597927]  [<ffffffffc0b98252>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [62145.598176]  [<ffffffff81029557>] ? __switch_to+0xd7/0x510
      [62145.598420]  [<ffffffff816a8f00>] ? __schedule+0x2f0/0x8b0
      [62145.598695]  [<ffffffffc0b977c0>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
      [62145.598955]  [<ffffffff810b098f>] kthread+0xcf/0xe0
      [62145.604690]  [<ffffffff810b08c0>] ? kthread+0x0/0xe0
      [62145.604932]  [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
      [62145.605175]  [<ffffffff810b08c0>] ? kthread+0x0/0xe0
      
      

      ... leading to the end of connectivity, but the server doesn't really crash. So I took a crash dump while the MDS was unresponsive; it is available at the following links:

       

      vmcore-oak-md1-s2-2018-02-16-11:02:21.gz:

      https://stanford.box.com/s/n20wj34xhkmmltiav427l3btdz9qjwvn

      vmlinux-3.10.0-693.2.2.el7_lustre.pl1.x86_64.gz:

      https://stanford.box.com/s/r6u6xjzmsgys2kzcq26562bqo6p45xcp

      Also attaching the foreach bt output and memory usage information.

      Thanks!

      Stephane

      Attachments

        Issue Links

          Activity

            [LU-10680] MDT becoming unresponsive in 2.10.3

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31376/
            Subject: LU-10680 mdd: create gc thread when no current transaction
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 31fef6845e8be39ae8dfd242714a84001925b7ad

            gerrit Gerrit Updater added a comment

            Moving over to 2.12 now that the GC feature is no longer enabled by default, and the failing tests have been disabled. The remaining patch "LU-10680 mdd: create gc thread when no current transaction" for this ticket can land when it is ready, but no longer blocks the 2.11 release.

            adilger Andreas Dilger added a comment

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31552/
            Subject: LU-10680 mdd: disable changelog garbage collection by default
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 57719d0c8833b8a8a8b6516d3bd4f2ad2b226285

            gerrit Gerrit Updater added a comment

            John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/31552
            Subject: LU-10680 mdd: disable changelog garbage collection by default
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 62e4519de3dcb588b88311950e8c6b7fefcca6a8

            gerrit Gerrit Updater added a comment

            Hey Bruno,

            Thanks for the update! And yes, this happened with vm.zone_reclaim_mode=0 on MDS.

            sthiell Stephane Thiell added a comment

            Hello Stephane,
            Just one more comment, to explain why you have been able to trigger the 2nd problem/deadlock.
            As part of LU-10709 I understand that you are also running with vm.zone_reclaim_mode=0 on the MDS, so, according to the mm code in this area, this seems to be the reason why memory reclaim, with the shrinkers run, was triggered during kthreadd's creation of the new kthread, instead of grabbing new memory/pages from the preferred/current node+zone free-list.

            Hopefully this specific case will be handled and fixed by my 2nd patch for this ticket.

            bfaccini Bruno Faccini (Inactive) added a comment

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31347/
            Subject: LU-10680 mdd: fix run_gc_task uninitialized
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 25cdbbb475b8a42e3a1f594898d2ac1406b466f6

            gerrit Gerrit Updater added a comment

            Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/31376
            Subject: LU-10680 mdd: create gc thread when no current transaction
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 2985625ebbb404c01f88ada84427d5b4386fbfa4

            gerrit Gerrit Updater added a comment

            Thanks Bruno! Hope you'll find a way to fix this race condition.

            sthiell Stephane Thiell added a comment

            Thinking more about it, it seems to me that kthreadd should be excluded from running the memory-reclaim mechanism in its own context; in this case it would be better either to do nothing (and leave the next thread that tries to allocate memory to hit the same conditions/threshold and start reclaiming in its own context), or to simply wake up kswapd to do the job asynchronously.

            But anyway, we need to protect against such possible cases, and only request creation of the changelog garbage-collection thread outside of any current journal transaction/handle.

            bfaccini Bruno Faccini (Inactive) added a comment
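
            As a concrete illustration of the idea above (requesting the changelog garbage-collection thread only when no journal handle is held), a minimal C sketch might look as follows. All identifiers are hypothetical; this is not the actual LU-10680 patch, only the general pattern it aims for:

            /* Hedged sketch only: defer kthread creation until after the journal
             * handle has been stopped.  All names below are hypothetical. */
            #include <linux/kthread.h>
            #include <linux/err.h>

            static bool chlg_gc_wanted;     /* set while a handle is still open */

            static int chlg_gc_main(void *data)
            {
                    /* ... cancel changelog records of idle/stale users ... */
                    return 0;
            }

            /* Called with a journal handle open: only record that GC is wanted. */
            static void chlg_note_gc_needed(void)
            {
                    chlg_gc_wanted = true;
            }

            /* Called after the handle has been stopped (no transaction credits held):
             * it is now safe to block in kthread_run()/wait_for_completion(). */
            static void chlg_start_gc_if_needed(void)
            {
                    struct task_struct *task;

                    if (!chlg_gc_wanted)
                            return;
                    chlg_gc_wanted = false;

                    task = kthread_run(chlg_gc_main, NULL, "chlg_gc_thread");
                    if (IS_ERR(task))
                            chlg_gc_wanted = true;  /* retry on a later changelog record */
            }
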
            bfaccini Bruno Faccini (Inactive) added a comment - edited

            Hmm, I think I have finally understood what caused the kthreadd thread to hang, and then the whole set of MDT operations to become blocked/slow.

            We triggered the following situation/deadlock, where a lot of MDT service threads (mdt[0-9], mdt_rdpg) were stuck in uninterruptible state, waiting for kthreadd to spawn new threads, with the following kinds of stacks:

            PID: 13741  TASK: ffff88203a31dee0  CPU: 23  COMMAND: "mdt_rdpg01_031"
             #0 [ffff880fe83efb20] __schedule at ffffffff816a8f65
             #1 [ffff880fe83efb88] schedule at ffffffff816a94e9
             #2 [ffff880fe83efb98] schedule_timeout at ffffffff816a6ff9
             #3 [ffff880fe83efc40] wait_for_completion at ffffffff816a989d
             #4 [ffff880fe83efca0] kthread_create_on_node at ffffffff810b07ea
             #5 [ffff880fe83efd58] ptlrpc_start_thread at ffffffffc0b962d5 [ptlrpc]
             #6 [ffff880fe83efde8] ptlrpc_main at ffffffffc0b9803c [ptlrpc]
             #7 [ffff880fe83efec8] kthread at ffffffff810b098f
             #8 [ffff880fe83eff50] ret_from_fork at ffffffff816b4f58
            .......................
            PID: 157540  TASK: ffff88103a13bf40  CPU: 11  COMMAND: "mdt_rdpg01_050"
             #0 [ffff881cb4b578d0] __schedule at ffffffff816a8f65
             #1 [ffff881cb4b57938] schedule at ffffffff816a94e9
             #2 [ffff881cb4b57948] schedule_timeout at ffffffff816a6ff9
             #3 [ffff881cb4b579f0] wait_for_completion at ffffffff816a989d
             #4 [ffff881cb4b57a50] kthread_create_on_node at ffffffff810b07ea
             #5 [ffff881cb4b57b08] mdd_changelog_store at ffffffffc12f7ac4 [mdd]
             #6 [ffff881cb4b57b70] mdd_changelog_data_store_by_fid at ffffffffc130532d [mdd]
             #7 [ffff881cb4b57bc8] mdd_changelog_data_store at ffffffffc130644b [mdd]
             #8 [ffff881cb4b57be0] mdd_close at ffffffffc130779b [mdd]
             #9 [ffff881cb4b57c40] mdt_mfd_close at ffffffffc11c9073 [mdt]
            #10 [ffff881cb4b57c90] mdt_close_internal at ffffffffc11ce651 [mdt]
            #11 [ffff881cb4b57cc0] mdt_close at ffffffffc11ce970 [mdt]
            #12 [ffff881cb4b57d00] tgt_request_handle at ffffffffc0bebda5 [ptlrpc]
            #13 [ffff881cb4b57d48] ptlrpc_server_handle_request at ffffffffc0b94b16 [ptlrpc]
            #14 [ffff881cb4b57de8] ptlrpc_main at ffffffffc0b98252 [ptlrpc]
            #15 [ffff881cb4b57ec8] kthread at ffffffff810b098f
            #16 [ffff881cb4b57f50] ret_from_fork at ffffffff816b4f58
            .....................
            PID: 193852  TASK: ffff88202234bf40  CPU: 3   COMMAND: "mdt01_098"
             #0 [ffff881aa61fb5e0] __schedule at ffffffff816a8f65
             #1 [ffff881aa61fb648] schedule at ffffffff816a94e9
             #2 [ffff881aa61fb658] schedule_timeout at ffffffff816a6ff9
             #3 [ffff881aa61fb700] wait_for_completion at ffffffff816a989d
             #4 [ffff881aa61fb760] kthread_create_on_node at ffffffff810b07ea
             #5 [ffff881aa61fb818] mdd_changelog_store at ffffffffc12f7ac4 [mdd]
             #6 [ffff881aa61fb880] mdd_changelog_ns_store at ffffffffc12f8096 [mdd]
             #7 [ffff881aa61fb918] mdd_create at ffffffffc12fb40a [mdd]
             #8 [ffff881aa61fba10] mdt_reint_open at ffffffffc11cc9bc [mdt]
             #9 [ffff881aa61fbb00] mdt_reint_rec at ffffffffc11c18a0 [mdt]
            #10 [ffff881aa61fbb28] mdt_reint_internal at ffffffffc11a330b [mdt]
            #11 [ffff881aa61fbb60] mdt_intent_reint at ffffffffc11a3832 [mdt]
            #12 [ffff881aa61fbba0] mdt_intent_policy at ffffffffc11ae59e [mdt]
            #13 [ffff881aa61fbbf8] ldlm_lock_enqueue at ffffffffc0b39277 [ptlrpc]
            #14 [ffff881aa61fbc50] ldlm_handle_enqueue0 at ffffffffc0b62903 [ptlrpc]
            #15 [ffff881aa61fbce0] tgt_enqueue at ffffffffc0be7ea2 [ptlrpc]
            #16 [ffff881aa61fbd00] tgt_request_handle at ffffffffc0bebda5 [ptlrpc]
            #17 [ffff881aa61fbd48] ptlrpc_server_handle_request at ffffffffc0b94b16 [ptlrpc]
            #18 [ffff881aa61fbde8] ptlrpc_main at ffffffffc0b98252 [ptlrpc]
            #19 [ffff881aa61fbec8] kthread at ffffffff810b098f
            #20 [ffff881aa61fbf50] ret_from_fork at ffffffff816b4f58
            ...................
            

            either trying to spawn new service threads, or due to the previously described problem of one changelog garbage-collection thread being spawned for each new changelog record.

            But as I have already indicated, kthreadd is stuck with the following stack:

            PID: 2      TASK: ffff880169448fd0  CPU: 6   COMMAND: "kthreadd"
             #0 [ffff88016946b500] __schedule at ffffffff816a8f65
             #1 [ffff88016946b568] schedule at ffffffff816a94e9
             #2 [ffff88016946b578] wait_transaction_locked at ffffffffc0177085 [jbd2]
             #3 [ffff88016946b5d0] add_transaction_credits at ffffffffc0177368 [jbd2]
             #4 [ffff88016946b630] start_this_handle at ffffffffc01775e1 [jbd2]
             #5 [ffff88016946b6c8] jbd2__journal_start at ffffffffc0177a93 [jbd2]
             #6 [ffff88016946b710] __ldiskfs_journal_start_sb at ffffffffc0f55069 [ldiskfs]
             #7 [ffff88016946b750] ldiskfs_acquire_dquot at ffffffffc0f8b753 [ldiskfs]
             #8 [ffff88016946b770] dqget at ffffffff81267814
             #9 [ffff88016946b7d0] dquot_get_dqblk at ffffffff812685f4
            #10 [ffff88016946b7f0] osd_acct_index_lookup at ffffffffc10a58df [osd_ldiskfs]
            #11 [ffff88016946b828] lquota_disk_read at ffffffffc1014214 [lquota]
            #12 [ffff88016946b858] qsd_refresh_usage at ffffffffc101bbfa [lquota]
            #13 [ffff88016946b890] qsd_op_adjust at ffffffffc102a881 [lquota]
            #14 [ffff88016946b8d0] osd_object_delete at ffffffffc106ec50 [osd_ldiskfs]
            #15 [ffff88016946b910] lu_object_free at ffffffffc095ce3d [obdclass]
            #16 [ffff88016946b968] lu_site_purge_objects at ffffffffc095da9e [obdclass]
            #17 [ffff88016946ba10] lu_cache_shrink at ffffffffc095e8e9 [obdclass]
            #18 [ffff88016946ba60] shrink_slab at ffffffff81195413
            #19 [ffff88016946bb00] zone_reclaim at ffffffff81198091
            #20 [ffff88016946bba8] get_page_from_freelist at ffffffff8118c264
            #21 [ffff88016946bcb8] __alloc_pages_nodemask at ffffffff8118caf6
            #22 [ffff88016946bd68] copy_process at ffffffff8108510d
            #23 [ffff88016946bdf8] do_fork at ffffffff81086a51
            #24 [ffff88016946be70] kernel_thread at ffffffff81086d06
            #25 [ffff88016946be80] kthreadd at ffffffff810b1341
            #26 [ffff88016946bf50] ret_from_fork at ffffffff816b4f58
            

            Upon trying to launch a new thread, kthreadd unfortunately triggered the memory-reclaim condition, with the consequence that the shrinkers were called, including the lu_cache one, in order to drain unreferenced objects. While doing so it tries to update quotas in the quota slave file, but it is blocked waiting for journal credits to be granted because the current/running transaction has still not committed. And that is because the transaction is waiting for a lot of its attached/started handles to complete (jbd2_journal_stop!), which will never happen since they almost all belong to the threads waiting for kthreadd. Hence the deadlock.

            I think that to fix this and prevent such an unexpected scenario from re-occurring, I need to find a way to start the changelog garbage-collection thread outside of the interval/section of code where a journal transaction is being filled.

            I will push a new/2nd patch soon to implement this.

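            For readers less familiar with jbd2, the deadlock described above can be reduced to the following simplified C sketch (hypothetical names, not Lustre source): one side writes a changelog record and blocks on thread creation while its journal handle is still open, while the other side, reached from kthreadd via direct reclaim and the lu_cache shrinker, needs a fresh handle on the same journal and therefore waits for a transaction that can never commit:

            #include <linux/jbd2.h>
            #include <linux/kthread.h>
            #include <linux/err.h>

            static journal_t *mdt_journal;          /* the MDT's ldiskfs journal */

            static int gc_thread_main(void *data) { return 0; }

            /* Side 1 (mdt/mdt_rdpg service threads): a changelog record is written
             * under an open journal handle, then the thread blocks in kthread_run()
             * waiting for kthreadd.  The handle stays open the whole time. */
            static int changelog_store_and_spawn_gc(void)
            {
                    handle_t *handle = jbd2_journal_start(mdt_journal, 8);

                    if (IS_ERR(handle))
                            return PTR_ERR(handle);
                    /* ... write the changelog llog record under this handle ... */
                    kthread_run(gc_thread_main, NULL, "chlg_gc_thread"); /* may sleep forever */
                    return jbd2_journal_stop(handle);
            }

            /* Side 2 (kthreadd in direct reclaim): freeing a cached object updates
             * quota accounting, which needs its own handle on the same journal.
             * jbd2_journal_start() waits for the running transaction, but that
             * transaction cannot commit until side 1 stops its handle. */
            static void lu_cache_shrink_like_path(void)
            {
                    handle_t *handle = jbd2_journal_start(mdt_journal, 2); /* blocks here */

                    if (!IS_ERR(handle)) {
                            /* ... dquot update for the object being deleted ... */
                            jbd2_journal_stop(handle);
                    }
            }
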

            People

              Assignee: bfaccini Bruno Faccini (Inactive)
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 10
