Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: Lustre 2.12.0
    • Affects Version: Lustre 2.10.3
    • Labels: None
    • Environment: 3.10.0-693.2.2.el7_lustre.pl1.x86_64
    • Severity: 3

    Description

      2.10 LTS continues to cause us some trouble, and this time it's the MDT that has been quite unstable lately. A typical problem seems to start with this:

      [52968.460453] Lustre: 5897:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [53955.181780] Lustre: 8034:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [53955.182270] Lustre: 8034:0:(mdd_device.c:1597:mdd_changelog_clear()) Skipped 2 previous similar messages
      [54269.697741] Lustre: 5911:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [54362.202025] Lustre: 7920:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [54362.202511] Lustre: 7920:0:(mdd_device.c:1597:mdd_changelog_clear()) Skipped 1 previous similar message
      [54808.564380] perf: interrupt took too long (2579 > 2500), lowering kernel.perf_event_max_sample_rate to 77000
      [56283.046677] Lustre: 7958:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [56283.047150] Lustre: 7958:0:(mdd_device.c:1597:mdd_changelog_clear()) Skipped 14 previous similar messages
      [56867.725405] Lustre: 5397:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [60578.888437] Lustre: 8014:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [60579.796894] Lustre: 8078:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [60579.797374] Lustre: 8078:0:(mdd_device.c:1597:mdd_changelog_clear()) Skipped 2 previous similar messages
      [62145.590529] LNet: Service thread pid 157540 was inactive for 200.33s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [62145.591250] Pid: 157540, comm: mdt_rdpg01_050
      [62145.591486] 
                     Call Trace:
      [62145.591949]  [<ffffffff816a94e9>] schedule+0x29/0x70
      [62145.592200]  [<ffffffff816a6ff9>] schedule_timeout+0x239/0x2c0
      [62145.592444]  [<ffffffff810c9ef9>] ? select_task_rq_fair+0x549/0x700
      [62145.592708]  [<ffffffff816a989d>] wait_for_completion+0xfd/0x140
      [62145.592971]  [<ffffffff810c4810>] ? default_wake_function+0x0/0x20
      [62145.593223]  [<ffffffff810b07ea>] kthread_create_on_node+0xaa/0x140
      [62145.593472]  [<ffffffffc12f0ac0>] ? mdd_chlg_garbage_collect+0x0/0x640 [mdd]
      [62145.593760]  [<ffffffffc091346a>] ? llog_add+0x7a/0x1a0 [obdclass]
      [62145.594025]  [<ffffffff810ea8ca>] ? __getnstimeofday64+0x3a/0xd0
      [62145.594272]  [<ffffffffc12f7ac4>] mdd_changelog_store+0x194/0x540 [mdd]
      [62145.594537]  [<ffffffffc130532d>] mdd_changelog_data_store_by_fid+0xed/0x1a0 [mdd]
      [62145.595020]  [<ffffffffc130644b>] mdd_changelog_data_store+0x15b/0x210 [mdd]
      [62145.595267]  [<ffffffffc130779b>] mdd_close+0x25b/0xcf0 [mdd]
      [62145.595540]  [<ffffffffc11c9073>] mdt_mfd_close+0x353/0x610 [mdt]
      [62145.595790]  [<ffffffffc11ce651>] mdt_close_internal+0x121/0x220 [mdt]
      [62145.596051]  [<ffffffffc11ce970>] mdt_close+0x220/0x780 [mdt]
      [62145.596352]  [<ffffffffc0bebda5>] tgt_request_handle+0x925/0x1370 [ptlrpc]
      [62145.596633]  [<ffffffffc0b94b16>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
      [62145.597138]  [<ffffffffc0b91148>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      [62145.597385]  [<ffffffff810c4822>] ? default_wake_function+0x12/0x20
      [62145.597651]  [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
      [62145.597927]  [<ffffffffc0b98252>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [62145.598176]  [<ffffffff81029557>] ? __switch_to+0xd7/0x510
      [62145.598420]  [<ffffffff816a8f00>] ? __schedule+0x2f0/0x8b0
      [62145.598695]  [<ffffffffc0b977c0>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
      [62145.598955]  [<ffffffff810b098f>] kthread+0xcf/0xe0
      [62145.604690]  [<ffffffff810b08c0>] ? kthread+0x0/0xe0
      [62145.604932]  [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
      [62145.605175]  [<ffffffff810b08c0>] ? kthread+0x0/0xe0
      
      

      ... leading to a loss of connectivity, but the server doesn't really crash. So I took a crash dump while the MDS was unresponsive; it is available at the following links:

       

      vmcore-oak-md1-s2-2018-02-16-11:02:21.gz:

      https://stanford.box.com/s/n20wj34xhkmmltiav427l3btdz9qjwvn

      vmlinux-3.10.0-693.2.2.el7_lustre.pl1.x86_64.gz:

      https://stanford.box.com/s/r6u6xjzmsgys2kzcq26562bqo6p45xcp

      Also attaching the `foreach bt` output and memory usage info.

      Thanks!

      Stephane

          Activity

            [LU-10680] MDT becoming unresponsive in 2.10.3

            Hello Stephane,
            Just one more comment I have in mind, to explain why you were able to trigger the 2nd problem/deadlock.
            As part of LU-10709 I understand that you are also running with vm.zone_reclaim_mode=0 on the MDS, so, according to the mm code in this area, this seems to be the reason why you triggered memory reclaim, with shrinkers run, during kthreadd's creation of a new kthread, instead of grabbing new memory/pages from the preferred/current node+zone free-list.

            Hopefully this specific case should be handled and fixed by my 2nd fix for this ticket.

            bfaccini Bruno Faccini (Inactive) added a comment

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31347/
            Subject: LU-10680 mdd: fix run_gc_task uninitialized
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 25cdbbb475b8a42e3a1f594898d2ac1406b466f6

            gerrit Gerrit Updater added a comment

            Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/31376
            Subject: LU-10680 mdd: create gc thread when no current transaction
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 2985625ebbb404c01f88ada84427d5b4386fbfa4

            gerrit Gerrit Updater added a comment

            Thanks Bruno! Hope you'll find a way to fix this race condition.

            sthiell Stephane Thiell added a comment

            Thinking more about it, it seems to me that kthreadd should be excluded from running the memory reclaim mechanism in its own context; in this case it would be better to do nothing (and let the next thread that tries to allocate memory hit the same conditions/threshold and start reclaiming in its own context), or to simply/only wake up kswapd to do the job asynchronously?

            But anyway, we need to protect against such possible cases and only request changelog garbage-collection thread creation outside of any current journal transaction/handle.

            bfaccini Bruno Faccini (Inactive) added a comment
            bfaccini Bruno Faccini (Inactive) added a comment - - edited

            Hmm, I think I have finally understood what caused the kthreadd thread to hang and, in turn, all MDT operations to become blocked/slow.

            We have triggered the following situation/deadlock, where a lot of MDT service threads (mdt[0-9], mdt_rdpg) were stuck in uninterruptible state waiting for kthreadd to spawn new threads, with the following kinds of stacks:

            PID: 13741  TASK: ffff88203a31dee0  CPU: 23  COMMAND: "mdt_rdpg01_031"
             #0 [ffff880fe83efb20] __schedule at ffffffff816a8f65
             #1 [ffff880fe83efb88] schedule at ffffffff816a94e9
             #2 [ffff880fe83efb98] schedule_timeout at ffffffff816a6ff9
             #3 [ffff880fe83efc40] wait_for_completion at ffffffff816a989d
             #4 [ffff880fe83efca0] kthread_create_on_node at ffffffff810b07ea
             #5 [ffff880fe83efd58] ptlrpc_start_thread at ffffffffc0b962d5 [ptlrpc]
             #6 [ffff880fe83efde8] ptlrpc_main at ffffffffc0b9803c [ptlrpc]
             #7 [ffff880fe83efec8] kthread at ffffffff810b098f
             #8 [ffff880fe83eff50] ret_from_fork at ffffffff816b4f58
            .......................
            PID: 157540  TASK: ffff88103a13bf40  CPU: 11  COMMAND: "mdt_rdpg01_050"
             #0 [ffff881cb4b578d0] __schedule at ffffffff816a8f65
             #1 [ffff881cb4b57938] schedule at ffffffff816a94e9
             #2 [ffff881cb4b57948] schedule_timeout at ffffffff816a6ff9
             #3 [ffff881cb4b579f0] wait_for_completion at ffffffff816a989d
             #4 [ffff881cb4b57a50] kthread_create_on_node at ffffffff810b07ea
             #5 [ffff881cb4b57b08] mdd_changelog_store at ffffffffc12f7ac4 [mdd]
             #6 [ffff881cb4b57b70] mdd_changelog_data_store_by_fid at ffffffffc130532d [mdd]
             #7 [ffff881cb4b57bc8] mdd_changelog_data_store at ffffffffc130644b [mdd]
             #8 [ffff881cb4b57be0] mdd_close at ffffffffc130779b [mdd]
             #9 [ffff881cb4b57c40] mdt_mfd_close at ffffffffc11c9073 [mdt]
            #10 [ffff881cb4b57c90] mdt_close_internal at ffffffffc11ce651 [mdt]
            #11 [ffff881cb4b57cc0] mdt_close at ffffffffc11ce970 [mdt]
            #12 [ffff881cb4b57d00] tgt_request_handle at ffffffffc0bebda5 [ptlrpc]
            #13 [ffff881cb4b57d48] ptlrpc_server_handle_request at ffffffffc0b94b16 [ptlrpc]
            #14 [ffff881cb4b57de8] ptlrpc_main at ffffffffc0b98252 [ptlrpc]
            #15 [ffff881cb4b57ec8] kthread at ffffffff810b098f
            #16 [ffff881cb4b57f50] ret_from_fork at ffffffff816b4f58
            .....................
            PID: 193852  TASK: ffff88202234bf40  CPU: 3   COMMAND: "mdt01_098"
             #0 [ffff881aa61fb5e0] __schedule at ffffffff816a8f65
             #1 [ffff881aa61fb648] schedule at ffffffff816a94e9
             #2 [ffff881aa61fb658] schedule_timeout at ffffffff816a6ff9
             #3 [ffff881aa61fb700] wait_for_completion at ffffffff816a989d
             #4 [ffff881aa61fb760] kthread_create_on_node at ffffffff810b07ea
             #5 [ffff881aa61fb818] mdd_changelog_store at ffffffffc12f7ac4 [mdd]
             #6 [ffff881aa61fb880] mdd_changelog_ns_store at ffffffffc12f8096 [mdd]
             #7 [ffff881aa61fb918] mdd_create at ffffffffc12fb40a [mdd]
             #8 [ffff881aa61fba10] mdt_reint_open at ffffffffc11cc9bc [mdt]
             #9 [ffff881aa61fbb00] mdt_reint_rec at ffffffffc11c18a0 [mdt]
            #10 [ffff881aa61fbb28] mdt_reint_internal at ffffffffc11a330b [mdt]
            #11 [ffff881aa61fbb60] mdt_intent_reint at ffffffffc11a3832 [mdt]
            #12 [ffff881aa61fbba0] mdt_intent_policy at ffffffffc11ae59e [mdt]
            #13 [ffff881aa61fbbf8] ldlm_lock_enqueue at ffffffffc0b39277 [ptlrpc]
            #14 [ffff881aa61fbc50] ldlm_handle_enqueue0 at ffffffffc0b62903 [ptlrpc]
            #15 [ffff881aa61fbce0] tgt_enqueue at ffffffffc0be7ea2 [ptlrpc]
            #16 [ffff881aa61fbd00] tgt_request_handle at ffffffffc0bebda5 [ptlrpc]
            #17 [ffff881aa61fbd48] ptlrpc_server_handle_request at ffffffffc0b94b16 [ptlrpc]
            #18 [ffff881aa61fbde8] ptlrpc_main at ffffffffc0b98252 [ptlrpc]
            #19 [ffff881aa61fbec8] kthread at ffffffff810b098f
            #20 [ffff881aa61fbf50] ret_from_fork at ffffffff816b4f58
            ...................
            

            either trying to spawn new service threads, or due to the previously described problem of one changelog garbage collection thread being spawned for each new changelog record.

            But as I have already indicated, kthreadd is stuck with the following stack:

            PID: 2      TASK: ffff880169448fd0  CPU: 6   COMMAND: "kthreadd"
             #0 [ffff88016946b500] __schedule at ffffffff816a8f65
             #1 [ffff88016946b568] schedule at ffffffff816a94e9
             #2 [ffff88016946b578] wait_transaction_locked at ffffffffc0177085 [jbd2]
             #3 [ffff88016946b5d0] add_transaction_credits at ffffffffc0177368 [jbd2]
             #4 [ffff88016946b630] start_this_handle at ffffffffc01775e1 [jbd2]
             #5 [ffff88016946b6c8] jbd2__journal_start at ffffffffc0177a93 [jbd2]
             #6 [ffff88016946b710] __ldiskfs_journal_start_sb at ffffffffc0f55069 [ldiskfs]
             #7 [ffff88016946b750] ldiskfs_acquire_dquot at ffffffffc0f8b753 [ldiskfs]
             #8 [ffff88016946b770] dqget at ffffffff81267814
             #9 [ffff88016946b7d0] dquot_get_dqblk at ffffffff812685f4
            #10 [ffff88016946b7f0] osd_acct_index_lookup at ffffffffc10a58df [osd_ldiskfs]
            #11 [ffff88016946b828] lquota_disk_read at ffffffffc1014214 [lquota]
            #12 [ffff88016946b858] qsd_refresh_usage at ffffffffc101bbfa [lquota]
            #13 [ffff88016946b890] qsd_op_adjust at ffffffffc102a881 [lquota]
            #14 [ffff88016946b8d0] osd_object_delete at ffffffffc106ec50 [osd_ldiskfs]
            #15 [ffff88016946b910] lu_object_free at ffffffffc095ce3d [obdclass]
            #16 [ffff88016946b968] lu_site_purge_objects at ffffffffc095da9e [obdclass]
            #17 [ffff88016946ba10] lu_cache_shrink at ffffffffc095e8e9 [obdclass]
            #18 [ffff88016946ba60] shrink_slab at ffffffff81195413
            #19 [ffff88016946bb00] zone_reclaim at ffffffff81198091
            #20 [ffff88016946bba8] get_page_from_freelist at ffffffff8118c264
            #21 [ffff88016946bcb8] __alloc_pages_nodemask at ffffffff8118caf6
            #22 [ffff88016946bd68] copy_process at ffffffff8108510d
            #23 [ffff88016946bdf8] do_fork at ffffffff81086a51
            #24 [ffff88016946be70] kernel_thread at ffffffff81086d06
            #25 [ffff88016946be80] kthreadd at ffffffff810b1341
            #26 [ffff88016946bf50] ret_from_fork at ffffffff816b4f58
            

            because, upon trying to launch a new thread, it has unfortunately triggered a memory reclaim condition, with the consequence that shrinkers are called, including the one for the lu_cache in order to drain unreferenced objects. In doing so it tries to update quotas in the quota slave file, but is blocked waiting to be granted journal credits because the current/running transaction has still not committed. And this is because the transaction is waiting for a lot of its attached/started handles to complete (jbd2_journal_stop!), which will never occur as they are almost all held by the previous threads waiting for kthreadd. Hence the deadlock.

            I think that to fix this and prevent such an unexpected scenario from re-occurring, I need to find a way to start the changelog garbage-collection thread outside of the interval/section of code where a journal transaction is being filled.

            I will push a new/2nd patch soon to implement this.
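
            For illustration, here is a minimal sketch of the ordering this 2nd patch aims for: decide that GC is needed while holding the spinlock, but only call kthread_run() from a point where the journal handle has already been stopped, so a stalled kthreadd can no longer deadlock against the caller's own uncommitted transaction. The mdd_cl field names and mdd_chlg_garbage_collect() come from the listings and stack traces in this ticket; the helper name, the elided trigger condition, and the thread name string are illustrative assumptions, not the actual patch.

            /* Illustrative sketch only -- not the actual LU-10680 patch. */
            static void mdd_changelog_maybe_start_gc(struct mdd_device *mdd)
            {
                    struct task_struct *task;
                    bool run_gc_task = false;    /* note: explicitly initialized */

                    spin_lock(&mdd->mdd_cl.mc_lock);
                    if (mdd->mdd_cl.mc_gc_task == NULL /* && free-space/idle checks */) {
                            /* reserve the slot so concurrent callers don't also spawn GC */
                            mdd->mdd_cl.mc_gc_task = (struct task_struct *)(-1);
                            run_gc_task = true;
                    }
                    spin_unlock(&mdd->mdd_cl.mc_lock);

                    if (!run_gc_task)
                            return;

                    /* Only reached after the caller has stopped its journal handle,
                     * so kthreadd's reclaim path can no longer block behind this
                     * thread's own uncommitted transaction. */
                    task = kthread_run(mdd_chlg_garbage_collect, mdd, "chlg_gc_thread");
                    if (IS_ERR(task)) {
                            spin_lock(&mdd->mdd_cl.mc_lock);
                            mdd->mdd_cl.mc_gc_task = NULL;    /* drop the reservation */
                            spin_unlock(&mdd->mdd_cl.mc_lock);
                    }
            }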


            Bruno,

            Wow, what amazing troubleshooting! Strange that the uninitialized variable wasn't detected by -Wall nor by QA/Maloo, indeed... I just built new 2.10.3 RPMs with your patch added, which I will deploy ASAP.

            Thanks much!

            sthiell Stephane Thiell added a comment

            Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/31347
            Subject: LU-10680 mdd: fix run_gc_task uninitialized
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: df5102382a42a1d366fe38d9da47a1c20ac40dd8

            gerrit Gerrit Updater added a comment

            Well, after doing some analysis on the crash dump, I am puzzled by what I seem to have discovered.
            It looks like the generated assembly code of the mdd_changelog_store() routine, including my patch for LU-7340, is wrong:

            Dump of assembler code for function mdd_changelog_store:
            931     int mdd_changelog_store(const struct lu_env *env, struct mdd_device *mdd,
               0xffffffffc12f7972 <+66>:    mov    0x28(%rsi),%r14
            
            932                             struct llog_changelog_rec *rec, struct thandle *th)
            933     {
               0xffffffffc12f7930 <+0>:     nopl   0x0(%rax,%rax,1)
               0xffffffffc12f7935 <+5>:     push   %rbp
               0xffffffffc12f7936 <+6>:     mov    %rsp,%rbp
               0xffffffffc12f7939 <+9>:     push   %r15
               0xffffffffc12f793b <+11>:    push   %r14
               0xffffffffc12f793d <+13>:    push   %r13
               0xffffffffc12f793f <+15>:    mov    %rdx,%r13
               0xffffffffc12f7942 <+18>:    push   %r12
               0xffffffffc12f7944 <+20>:    push   %rbx
               0xffffffffc12f7945 <+21>:    mov    %rsi,%rbx
               0xffffffffc12f794f <+31>:    sub    $0x30,%rsp
               0xffffffffc12f7957 <+39>:    mov    %rcx,-0x58(%rbp)
               0xffffffffc12f795b <+43>:    mov    %gs:0x28,%rax
               0xffffffffc12f7964 <+52>:    mov    %rax,-0x30(%rbp)
               0xffffffffc12f7968 <+56>:    xor    %eax,%eax
               0xffffffffc12f796a <+58>:    mov    %rdi,-0x48(%rbp)
            
            934             struct obd_device       *obd = mdd2obd_dev(mdd);
            935             struct llog_ctxt        *ctxt;
            936             struct thandle          *llog_th;
            937             int                      rc;
            938             bool                     run_gc_task;
            939     
            940             rec->cr_hdr.lrh_len = llog_data_len(sizeof(*rec) +
               0xffffffffc12f79b2 <+130>:   mov    %eax,0x0(%r13)
            ........................
            964     
            965             /* time to recover some space ?? */
            966             spin_lock(&mdd->mdd_cl.mc_lock);
            967             if (unlikely(mdd->mdd_changelog_gc && (ktime_get_real_seconds() -
               0xffffffffc12f7a8f <+351>:   mov    0xf8(%rbx),%eax
               0xffffffffc12f7a95 <+357>:   test   %eax,%eax
               0xffffffffc12f7a97 <+359>:   jne    0xffffffffc12f7dc0 <mdd_changelog_store+1168>
               0xffffffffc12f7dc0 <+1168>:  callq  0xffffffffc07fec90 <ktime_get_real_seconds>
               0xffffffffc12f7dc5 <+1173>:  mov    0x108(%rbx),%edx
               0xffffffffc12f7dcb <+1179>:  sub    0xf0(%rbx),%rax
               0xffffffffc12f7dd2 <+1186>:  cmp    %rdx,%rax
               0xffffffffc12f7dd5 <+1189>:  jle    0xffffffffc12f7a9d <mdd_changelog_store+365>
               0xffffffffc12f7ddb <+1195>:  cmpq   $0x0,0xe8(%rbx)
               0xffffffffc12f7de3 <+1203>:  jne    0xffffffffc12f7a9d <mdd_changelog_store+365>
               0xffffffffc12f7de9 <+1209>:  mov    0x30(%r15),%rdi
               0xffffffffc12f7ded <+1213>:  callq  0xffffffffc0919a20 <llog_cat_free_space>
               0xffffffffc12f7df2 <+1218>:  cmp    0x10c(%rbx),%eax
               0xffffffffc12f7df8 <+1224>:  ja     0xffffffffc12f7a9d <mdd_changelog_store+365>
            
            968                 mdd->mdd_cl.mc_gc_time > mdd->mdd_changelog_min_gc_interval) &&
            969                 mdd->mdd_cl.mc_gc_task == NULL &&
            970                 llog_cat_free_space(ctxt->loc_handle) <=
            971                                     mdd->mdd_changelog_min_free_cat_entries)) {
            972                     CWARN("%s: low on changelog_catalog free entries, starting "
               0xffffffffc12f7dfe <+1230>:  lea    0x40(%r14),%rdx
               0xffffffffc12f7e02 <+1234>:  mov    $0xffffffffc1316ec8,%rsi
               0xffffffffc12f7e09 <+1241>:  mov    $0xffffffffc132ba80,%rdi
               0xffffffffc12f7e10 <+1248>:  xor    %eax,%eax
               0xffffffffc12f7e12 <+1250>:  movl   $0x4,0x33c74(%rip)        # 0xffffffffc132ba90 <msgdata.69570+16>
               0xffffffffc12f7e1c <+1260>:  movq   $0xffffffffc1316b28,0x33c59(%rip)        # 0xffffffffc132ba80 <msgdata.69570>
               0xffffffffc12f7e27 <+1271>:  movq   $0xffffffffc1312b70,0x33c56(%rip)        # 0xffffffffc132ba88 <msgdata.69570+8>
               0xffffffffc12f7e32 <+1282>:  movl   $0x3cd,0x33c58(%rip)        # 0xffffffffc132ba94 <msgdata.69570+20>
               0xffffffffc12f7e3c <+1292>:  movq   $0xffffffffc132ba70,0x33c59(%rip)        # 0xffffffffc132baa0 <msgdata.69570+32>
               0xffffffffc12f7e47 <+1303>:  movl   $0x400,0x33c47(%rip)        # 0xffffffffc132ba98 <msgdata.69570+24>
               0xffffffffc12f7e51 <+1313>:  callq  0xffffffffc0809b70 <libcfs_debug_msg>
            
            973                           "ChangeLog garbage collection thread\n", obd->obd_name);
            974     
            975                     /* indicate further kthread run will occur outside right after
            976                      * critical section
            977                      */
            978                     mdd->mdd_cl.mc_gc_task = (struct task_struct *)(-1);
               0xffffffffc12f7e56 <+1318>:  movq   $0xffffffffffffffff,0xe8(%rbx)
               0xffffffffc12f7e61 <+1329>:  jmpq   0xffffffffc12f7a9d <mdd_changelog_store+365>
               0xffffffffc12f7e66:  nopw   %cs:0x0(%rax,%rax,1)
            
            979                     run_gc_task = true;
            980             }
            981             spin_unlock(&mdd->mdd_cl.mc_lock);
            982             if (run_gc_task) {
            983                     struct task_struct *gc_task;
            984     
            985                     gc_task = kthread_run(mdd_chlg_garbage_collect, mdd,
               0xffffffffc12f7aa7 <+375>:   xor    %eax,%eax
               0xffffffffc12f7aa9 <+377>:   mov    $0xffffffffc131502e,%rcx
               0xffffffffc12f7ab0 <+384>:   mov    $0xffffffff,%edx
               0xffffffffc12f7ab5 <+389>:   mov    %rbx,%rsi
               0xffffffffc12f7ab8 <+392>:   mov    $0xffffffffc12f0ac0,%rdi
               0xffffffffc12f7abf <+399>:   callq  0xffffffff810b0740 <kthread_create_on_node>
               0xffffffffc12f7ac4 <+404>:   cmp    $0xfffffffffffff000,%rax
               0xffffffffc12f7aca <+410>:   mov    %rax,%r12
               0xffffffffc12f7acd <+413>:   ja     0xffffffffc12f7d55 <mdd_changelog_store+1061>
               0xffffffffc12f7ad3 <+419>:   mov    %rax,%rdi
               0xffffffffc12f7ad6 <+422>:   callq  0xffffffff810c4750 <wake_up_process>
            ............................
            

            where it appears that the compiler has completely elided the setting and testing of the run_gc_task variable.
            This may have been caused by the fact that run_gc_task was mistakenly never initialized, but then it is unclear why no warning/error was generated at build/compile time (even though we use gcc's -Wall option; it seems the -Wuninitialized specific check may be silently omitted when no optimization is requested, as some forum threads report?), and also why the generated assembly code does not strictly correspond to the source code.
            The side effect of all of this is that an attempt to create a kthread occurs on each ChangeLog record creation (instead of only when the garbage-collection conditions have been triggered). And the consequences seem to be aggravated by the fact that kthreads are not started in the context of the requestor: requests to do so are queued for "kthreadd" handling, which can lead to much trouble if "kthreadd" suffers some issue causing its operations to stop or slow down.

            And this seems to be somewhat the case in your crash dump:

            crash> bt 2
            PID: 2      TASK: ffff880169448fd0  CPU: 6   COMMAND: "kthreadd"
             #0 [ffff88016946b500] __schedule at ffffffff816a8f65
             #1 [ffff88016946b568] schedule at ffffffff816a94e9
             #2 [ffff88016946b578] wait_transaction_locked at ffffffffc0177085 [jbd2]
             #3 [ffff88016946b5d0] add_transaction_credits at ffffffffc0177368 [jbd2]
             #4 [ffff88016946b630] start_this_handle at ffffffffc01775e1 [jbd2]
             #5 [ffff88016946b6c8] jbd2__journal_start at ffffffffc0177a93 [jbd2]
             #6 [ffff88016946b710] __ldiskfs_journal_start_sb at ffffffffc0f55069 [ldiskfs]
             #7 [ffff88016946b750] ldiskfs_acquire_dquot at ffffffffc0f8b753 [ldiskfs]
             #8 [ffff88016946b770] dqget at ffffffff81267814
             #9 [ffff88016946b7d0] dquot_get_dqblk at ffffffff812685f4
            #10 [ffff88016946b7f0] osd_acct_index_lookup at ffffffffc10a58df [osd_ldiskfs]
            #11 [ffff88016946b828] lquota_disk_read at ffffffffc1014214 [lquota]
            #12 [ffff88016946b858] qsd_refresh_usage at ffffffffc101bbfa [lquota]
            #13 [ffff88016946b890] qsd_op_adjust at ffffffffc102a881 [lquota]
            #14 [ffff88016946b8d0] osd_object_delete at ffffffffc106ec50 [osd_ldiskfs]
            #15 [ffff88016946b910] lu_object_free at ffffffffc095ce3d [obdclass]
            #16 [ffff88016946b968] lu_site_purge_objects at ffffffffc095da9e [obdclass]
            #17 [ffff88016946ba10] lu_cache_shrink at ffffffffc095e8e9 [obdclass]
            #18 [ffff88016946ba60] shrink_slab at ffffffff81195413
            #19 [ffff88016946bb00] zone_reclaim at ffffffff81198091
            #20 [ffff88016946bba8] get_page_from_freelist at ffffffff8118c264
            #21 [ffff88016946bcb8] __alloc_pages_nodemask at ffffffff8118caf6
            #22 [ffff88016946bd68] copy_process at ffffffff8108510d
            #23 [ffff88016946bdf8] do_fork at ffffffff81086a51
            #24 [ffff88016946be70] kernel_thread at ffffffff81086d06
            #25 [ffff88016946be80] kthreadd at ffffffff810b1341
            #26 [ffff88016946bf50] ret_from_fork at ffffffff816b4f58
            crash> 
            

            and even so it seems to be moving forward, as its last activity timestamp is much more recent (-1120s) than that of thread 157540, which is still stuck with the same stack.

            So the 1st thing I have to do ASAP is to push a fix to effectively initialize the run_gc_task variable/boolean to false, and secondly, I will continue the crash-dump analysis in order to understand the reason for the "kthreadd" pseudo-hang/slow operations.
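
            For reference, a small, self-contained user-space sketch of the pattern described above (the names are simplified stand-ins, not the Lustre code): a bool assigned only inside a rarely taken branch and then read unconditionally is indeterminate whenever that branch is skipped, so the compiler may legally treat it as always true, and whether gcc -Wall flags the read depends on the gcc version and optimization level, since the -Wmaybe-uninitialized data-flow check is tied to optimization.

            /*
             * Stand-alone illustration of the run_gc_task issue (simplified names,
             * not Lustre code).  Try "gcc -Wall demo.c" vs "gcc -Wall -O2 demo.c":
             * depending on gcc version and -O level the uninitialized read below
             * may or may not be reported.
             */
            #include <stdbool.h>
            #include <stdio.h>

            static int gc_conditions_met;           /* normally 0: the unlikely() branch */

            static bool changelog_store(void)
            {
                    bool run_gc_task;               /* BUG: missing "= false" */

                    if (gc_conditions_met)
                            run_gc_task = true;     /* only assignment on any path */

                    /* Undefined when the branch above was skipped; the optimizer may
                     * treat it as always true, i.e. "spawn the GC thread" on every
                     * single changelog record, as seen in the disassembly above. */
                    return run_gc_task;
            }

            int main(void)
            {
                    printf("GC thread requested: %s\n", changelog_store() ? "yes" : "no");
                    return 0;
            }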

            bfaccini Bruno Faccini (Inactive) added a comment

            Hey Bruno,

            Thanks for checking. I noticed the function `mdd_chlg_garbage_collect()` in the "inactive thread" stack trace too. But this is weird because it is very unlikely that cl1 was actually idle; it's a robinhood consumer, and if I check the robinhood log just a bit before the issue / crash dump, it seems to be running:

             2018/02/16 10:29:18 [7022/2] STATS | ==================== Dumping stats at 2018/02/16 10:29:18 =====================
            2018/02/16 10:29:18 [7022/2] STATS | ======== General statistics =========
            2018/02/16 10:29:18 [7022/2] STATS | Daemon start time: 2018/01/29 11:23:38
            2018/02/16 10:29:18 [7022/2] STATS | Started modules: scan,log_reader,policy_run(modeguard)
            2018/02/16 10:29:18 [7022/2] STATS | ======== FS scan statistics =========
            2018/02/16 10:29:18 [7022/2] STATS | last scan  = 2018/01/30 00:24:56
            2018/02/16 10:29:18 [7022/2] STATS | duration    = 13h 01min 18s (46878 s)
            2018/02/16 10:29:18 [7022/2] STATS | status      = complete
            2018/02/16 10:29:18 [7022/2] STATS | current scan interval = 22.0d
            2018/02/16 10:29:18 [7022/2] STATS | ChangeLog reader #0:
            2018/02/16 10:29:18 [7022/2] STATS |    fs_name    =   oak
            2018/02/16 10:29:18 [7022/2] STATS |    mdt_name   =   MDT0000
            2018/02/16 10:29:18 [7022/2] STATS |    reader_id  =   cl1
            2018/02/16 10:29:18 [7022/2] STATS |    records read        = 176082077
            2018/02/16 10:29:18 [7022/2] STATS |    interesting records = 119680156
            2018/02/16 10:29:18 [7022/2] STATS |    suppressed records  = 56401921
            2018/02/16 10:29:18 [7022/2] STATS |    records pending     = 999
            2018/02/16 10:29:18 [7022/2] STATS |    last received            = 2018/02/16 10:29:17
            2018/02/16 10:29:18 [7022/2] STATS |    last read record time    = 2018/02/16 10:29:17.671089
            2018/02/16 10:29:18 [7022/2] STATS |    last read record id      = 3459537671
            2018/02/16 10:29:18 [7022/2] STATS |    last pushed record id    = 3459535680
            2018/02/16 10:29:18 [7022/2] STATS |    last committed record id = 3459535680
            2018/02/16 10:29:18 [7022/2] STATS |    last cleared record id   = 3459535680
            2018/02/16 10:29:18 [7022/2] STATS |    read speed               = 76977.50 record/sec (513.18 incl. idle time)
            2018/02/16 10:29:18 [7022/2] STATS |    processing speed ratio   = 1.00
            2018/02/16 10:29:18 [7022/2] STATS |    ChangeLog stats:
            2018/02/16 10:29:18 [7022/2] STATS |    MARK: 18414, CREAT: 16458408, MKDIR: 3318418, HLINK: 35763559, SLINK: 120840, MKNOD: 726
            2018/02/16 10:29:18 [7022/2] STATS |    UNLNK: 43112348, RMDIR: 3143745, RENME: 6424858, RNMTO: 0, OPEN: 0, CLOSE: 32387229
            2018/02/16 10:29:18 [7022/2] STATS |    LYOUT: 118087, TRUNC: 3063071, SATTR: 29921326, XATTR: 0, HSM: 0, MTIME: 695102, CTIME: 1535946
            2018/02/16 10:29:18 [7022/2] STATS |    ATIME: 0, MIGRT: 0
            
            
            

            I also just checked that we only have one changelog consumer, cl1:

            [root@oak-md1-s2 ~]# cat /proc/fs/lustre/mdd/oak-MDT0000/changelog_users 
            current index: 3475865674
            ID    index (idle seconds)
            cl1   3475865473 (1)
            [root@oak-md1-s2 ~]#
            

            You should be able to download the debuginfo RPMs at the following links:

            lustre-debuginfo-2.10.3_RC1-1.el7.centos.x86_64.rpm:
            https://stanford.box.com/s/e3l44651do73iz92aaftx3ayf4wge254

            kernel-debuginfo-3.10.0-693.2.2.el7_lustre.pl1.x86_64.rpm:
            https://stanford.box.com/s/6zknljpgd64xpas3xfv9a3rghuhwwi7f

            kernel-debuginfo-common-x86_64-3.10.0-693.2.2.el7_lustre.pl1.x86_64.rpm:
            https://stanford.box.com/s/636nhh2zxxt13o4chlgxx9qyuvc8p5gg

            Thanks!

            sthiell Stephane Thiell added a comment

            Stephane,
            I wonder if, according to the stack trace in the info you have provided, a thread is in the process of being started in order to do garbage collection (introduced by my patch for LU-7340) of the space consumed by changelogs, due to a user/consumer (cl1?) having been idle for quite a long time?

            But for some unknown reason kthreadd is not starting it, or has still not reported that it did... while on the other hand, the mdd_changelog_clear() error msgs seem to indicate that changelog garbage collection may have already started!

            Thus, I would like to analyze the available crash dump, but to do so I also need the lustre-debuginfo RPM for your Lustre version; can you make it available?

            bfaccini Bruno Faccini (Inactive) added a comment

            People

              Assignee: bfaccini Bruno Faccini (Inactive)
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 10

              Dates

                Created:
                Updated:
                Resolved: