Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
- Lustre 2.10.3
- None
- Environment: 3.10.0-693.2.2.el7_lustre.pl1.x86_64
- Severity: 3
Description
2.10 LTS continues to cause us some trouble, and this time it's the MDT that has been quite unstable lately. A typical incident starts with messages like this:
[52968.460453] Lustre: 5897:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
[53955.181780] Lustre: 8034:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
[53955.182270] Lustre: 8034:0:(mdd_device.c:1597:mdd_changelog_clear()) Skipped 2 previous similar messages
[54269.697741] Lustre: 5911:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
[54362.202025] Lustre: 7920:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
[54362.202511] Lustre: 7920:0:(mdd_device.c:1597:mdd_changelog_clear()) Skipped 1 previous similar message
[54808.564380] perf: interrupt took too long (2579 > 2500), lowering kernel.perf_event_max_sample_rate to 77000
[56283.046677] Lustre: 7958:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
[56283.047150] Lustre: 7958:0:(mdd_device.c:1597:mdd_changelog_clear()) Skipped 14 previous similar messages
[56867.725405] Lustre: 5397:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
[60578.888437] Lustre: 8014:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
[60579.796894] Lustre: 8078:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
[60579.797374] Lustre: 8078:0:(mdd_device.c:1597:mdd_changelog_clear()) Skipped 2 previous similar messages
[62145.590529] LNet: Service thread pid 157540 was inactive for 200.33s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[62145.591250] Pid: 157540, comm: mdt_rdpg01_050
[62145.591486] Call Trace:
[62145.591949] [<ffffffff816a94e9>] schedule+0x29/0x70
[62145.592200] [<ffffffff816a6ff9>] schedule_timeout+0x239/0x2c0
[62145.592444] [<ffffffff810c9ef9>] ? select_task_rq_fair+0x549/0x700
[62145.592708] [<ffffffff816a989d>] wait_for_completion+0xfd/0x140
[62145.592971] [<ffffffff810c4810>] ? default_wake_function+0x0/0x20
[62145.593223] [<ffffffff810b07ea>] kthread_create_on_node+0xaa/0x140
[62145.593472] [<ffffffffc12f0ac0>] ? mdd_chlg_garbage_collect+0x0/0x640 [mdd]
[62145.593760] [<ffffffffc091346a>] ? llog_add+0x7a/0x1a0 [obdclass]
[62145.594025] [<ffffffff810ea8ca>] ? __getnstimeofday64+0x3a/0xd0
[62145.594272] [<ffffffffc12f7ac4>] mdd_changelog_store+0x194/0x540 [mdd]
[62145.594537] [<ffffffffc130532d>] mdd_changelog_data_store_by_fid+0xed/0x1a0 [mdd]
[62145.595020] [<ffffffffc130644b>] mdd_changelog_data_store+0x15b/0x210 [mdd]
[62145.595267] [<ffffffffc130779b>] mdd_close+0x25b/0xcf0 [mdd]
[62145.595540] [<ffffffffc11c9073>] mdt_mfd_close+0x353/0x610 [mdt]
[62145.595790] [<ffffffffc11ce651>] mdt_close_internal+0x121/0x220 [mdt]
[62145.596051] [<ffffffffc11ce970>] mdt_close+0x220/0x780 [mdt]
[62145.596352] [<ffffffffc0bebda5>] tgt_request_handle+0x925/0x1370 [ptlrpc]
[62145.596633] [<ffffffffc0b94b16>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
[62145.597138] [<ffffffffc0b91148>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[62145.597385] [<ffffffff810c4822>] ? default_wake_function+0x12/0x20
[62145.597651] [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
[62145.597927] [<ffffffffc0b98252>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[62145.598176] [<ffffffff81029557>] ? __switch_to+0xd7/0x510
[62145.598420] [<ffffffff816a8f00>] ? __schedule+0x2f0/0x8b0
[62145.598695] [<ffffffffc0b977c0>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
[62145.598955] [<ffffffff810b098f>] kthread+0xcf/0xe0
[62145.604690] [<ffffffff810b08c0>] ? kthread+0x0/0xe0
[62145.604932] [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
[62145.605175] [<ffffffff810b08c0>] ? kthread+0x0/0xe0
... leading to a complete loss of connectivity, although the server doesn't actually crash. So I took a crash dump while the MDS was unresponsive; it is available at the following links:
vmcore-oak-md1-s2-2018-02-16-11:02:21.gz:
https://stanford.box.com/s/n20wj34xhkmmltiav427l3btdz9qjwvn
vmlinux-3.10.0-693.2.2.el7_lustre.pl1.x86_64.gz:
https://stanford.box.com/s/r6u6xjzmsgys2kzcq26562bqo6p45xcp
I'm also attaching the 'foreach bt' output and memory usage information.
Thanks!
Stephane
Attachments
Issue Links
- is related to:
  - LU-10734 sanity test_160g: User cl8 still found in changelog_users (Resolved)
  - LU-12865 sanity test 160f fails with ‘mds1: User cl6 not registered’ (Resolved)
  - LU-12871 enable changelog garbage collection by default (Resolved)
- is related to:
  - LU-7340 ChangeLogs catalog full condition should be handled more gracefully (Resolved)
Hmm, I think I have finally understood what caused the kthreadd thread to hang and, in turn, all MDT operations to become blocked or slow.
We have triggered the following deadlock, where a lot of MDT service threads (mdt[0-9], mdt_rdpg) were stuck in uninterruptible state, waiting for kthreadd to spawn a new thread, with the following kinds of stacks:
They were either trying to spawn new service threads, or hitting the previously described problem of one changelog garbage-collection thread being spawned for each new changelog record (a simplified sketch of this pattern follows below).
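For illustration only, here is a minimal sketch of that pattern; it is not the actual Lustre source: chlg_garbage_collect() merely stands in for the real mdd_chlg_garbage_collect(), and gc_wanted()/store_record() are hypothetical stubs. The point it shows is the changelog-store path calling kthread_run() while the caller's journal transaction handle is still open, so the service thread blocks on kthreadd without releasing its handle.

/*
 * Minimal sketch of the problematic call pattern -- NOT the real
 * mdd_changelog_store().  gc_wanted() and store_record() are hypothetical
 * stand-ins for the real changelog bookkeeping.
 */
#include <linux/kthread.h>
#include <linux/err.h>

static int chlg_garbage_collect(void *data)
{
        /* cancel changelog records of idle users, then exit */
        return 0;
}

static bool gc_wanted(void)          { return true; }   /* stub */
static int  store_record(void *rec)  { return 0; }      /* stub */

static int changelog_store_sketch(void *rec)
{
        /*
         * The caller has already started a journal transaction handle;
         * jbd2_journal_stop() has not been called yet.
         */
        if (gc_wanted()) {
                struct task_struct *tsk;

                /*
                 * kthread_run() only returns once kthreadd has created the
                 * new task.  If kthreadd is blocked (here: in memory
                 * reclaim), this service thread sleeps in uninterruptible
                 * state while still holding its open transaction handle.
                 */
                tsk = kthread_run(chlg_garbage_collect, NULL, "chlg_gc");
                if (IS_ERR(tsk))
                        return PTR_ERR(tsk);
        }

        return store_record(rec);
}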
But as I have already indicated, kthreadd itself is stuck with the following stack:
Upon trying to launch a new thread, it unfortunately triggered a memory-reclaim condition, with the consequence that the shrinkers were called, including the one for the lu_cache, in order to drain unreferenced objects. While doing so it tried to update quotas in the quota slave file, but was blocked waiting for journal credits to be granted, because the current/running transaction had still not committed. And that transaction was waiting for a lot of its attached/started handles to complete (jbd2_journal_stop!), which will never happen because almost all of them belong to the threads described above that are waiting for kthreadd. Hence the deadlock.
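To make the cycle concrete, here is a hedged sketch of the reclaim side, under the assumption that freeing cached objects requires a journal handle for quota accounting. It is not the actual lu_object shrinker: have_idle_objects(), update_quota_accounting() and osd_journal are made-up names, and only the jbd2 calls are the real API.

/*
 * Hedged sketch of the reclaim side of the deadlock -- not the actual
 * lu_cache shrinker.  The jbd2 journal calls are real; the helpers
 * have_idle_objects(), update_quota_accounting() and the osd_journal
 * variable are illustrative only.
 */
#include <linux/shrinker.h>
#include <linux/jbd2.h>
#include <linux/err.h>

static bool have_idle_objects(void)                { return false; } /* stub */
static void update_quota_accounting(handle_t *jh)  { }               /* stub */

static journal_t *osd_journal;  /* hypothetical pointer to the MDT's journal */

static unsigned long lu_cache_scan_sketch(struct shrinker *sh,
                                          struct shrink_control *sc)
{
        unsigned long freed = 0;

        while (have_idle_objects()) {
                /*
                 * Blocks until the journal can grant credits, i.e. until the
                 * running transaction commits.  It cannot commit, because its
                 * open handles belong to service threads that are waiting on
                 * kthreadd -- and kthreadd is waiting right here.  The cycle:
                 * kthreadd -> shrinker -> journal credits -> transaction
                 * commit -> open handles -> service threads -> kthreadd.
                 */
                handle_t *jh = jbd2_journal_start(osd_journal, 8);

                if (IS_ERR(jh))
                        break;
                update_quota_accounting(jh);    /* quota slave file update */
                jbd2_journal_stop(jh);
                freed++;
        }
        return freed;
}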
I think that, to fix this and prevent such an unexpected scenario from re-occurring, I need to find a way to start the changelog garbage-collection thread outside of the section of code where a journal transaction is being filled.
I will push a new/second patch soon to implement this.
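Since the patch is not written yet, the following is only a rough sketch of that direction, not the actual fix: all structure and helper names are illustrative, and the jbd2 calls merely stand in for the real OSD transaction API. The idea shown is to note that garbage collection is wanted while the record is written, and to call kthread_run() only after the transaction handle has been stopped.

/*
 * Rough sketch of the intended direction (NOT the actual patch): defer
 * kthread_run() until no journal transaction handle is open.  All names
 * below are illustrative; jbd2 calls stand in for the OSD layer.
 */
#include <linux/kthread.h>
#include <linux/jbd2.h>
#include <linux/atomic.h>
#include <linux/err.h>

struct chlg_dev_sketch {
        journal_t   *journal;
        atomic_t     gc_running;    /* 0 = no GC thread, 1 = running */
};

static int chlg_garbage_collect(void *data)
{
        struct chlg_dev_sketch *dev = data;

        /* ... cancel stale changelog records ... */
        atomic_set(&dev->gc_running, 0);
        return 0;
}

static bool gc_threshold_reached(struct chlg_dev_sketch *dev) { return false; } /* stub */
static void write_changelog_record(handle_t *jh)              { }               /* stub */

static int changelog_store_deferred_gc(struct chlg_dev_sketch *dev)
{
        bool want_gc = false;
        handle_t *jh;

        jh = jbd2_journal_start(dev->journal, 8);
        if (IS_ERR(jh))
                return PTR_ERR(jh);

        write_changelog_record(jh);
        if (gc_threshold_reached(dev))
                want_gc = true;         /* note it, but do NOT spawn here */

        jbd2_journal_stop(jh);          /* transaction handle released */

        /*
         * Safe point: this thread holds no journal credits, so even if
         * kthreadd ends up in memory reclaim the transaction can still
         * commit.  atomic_cmpxchg() keeps a single GC thread per device.
         */
        if (want_gc && atomic_cmpxchg(&dev->gc_running, 0, 1) == 0) {
                struct task_struct *tsk;

                tsk = kthread_run(chlg_garbage_collect, dev, "chlg_gc");
                if (IS_ERR(tsk))
                        atomic_set(&dev->gc_running, 0);
        }
        return 0;
}

An alternative along the same lines would be to hand the cleanup to a pre-created thread or workqueue, so that the changelog-store path never has to wait on kthreadd at all.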