Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
- Affects Version: Lustre 2.10.3
- Fix Version: None
- Environment: 3.10.0-693.2.2.el7_lustre.pl1.x86_64
- Severity: 3
Description
2.10 LTS continues to cause us some trouble, and this time it's the MDT that has been quite unstable lately. A typical problem starts with messages like these:
[52968.460453] Lustre: 5897:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
[53955.181780] Lustre: 8034:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
[53955.182270] Lustre: 8034:0:(mdd_device.c:1597:mdd_changelog_clear()) Skipped 2 previous similar messages
[54269.697741] Lustre: 5911:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
[54362.202025] Lustre: 7920:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
[54362.202511] Lustre: 7920:0:(mdd_device.c:1597:mdd_changelog_clear()) Skipped 1 previous similar message
[54808.564380] perf: interrupt took too long (2579 > 2500), lowering kernel.perf_event_max_sample_rate to 77000
[56283.046677] Lustre: 7958:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
[56283.047150] Lustre: 7958:0:(mdd_device.c:1597:mdd_changelog_clear()) Skipped 14 previous similar messages
[56867.725405] Lustre: 5397:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
[60578.888437] Lustre: 8014:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
[60579.796894] Lustre: 8078:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
[60579.797374] Lustre: 8078:0:(mdd_device.c:1597:mdd_changelog_clear()) Skipped 2 previous similar messages
[62145.590529] LNet: Service thread pid 157540 was inactive for 200.33s. The thread might be hung, or it might only be slow and will resume later.
Dumping the stack trace for debugging purposes:
[62145.591250] Pid: 157540, comm: mdt_rdpg01_050
[62145.591486] Call Trace:
[62145.591949] [<ffffffff816a94e9>] schedule+0x29/0x70
[62145.592200] [<ffffffff816a6ff9>] schedule_timeout+0x239/0x2c0
[62145.592444] [<ffffffff810c9ef9>] ? select_task_rq_fair+0x549/0x700
[62145.592708] [<ffffffff816a989d>] wait_for_completion+0xfd/0x140
[62145.592971] [<ffffffff810c4810>] ? default_wake_function+0x0/0x20
[62145.593223] [<ffffffff810b07ea>] kthread_create_on_node+0xaa/0x140
[62145.593472] [<ffffffffc12f0ac0>] ? mdd_chlg_garbage_collect+0x0/0x640 [mdd]
[62145.593760] [<ffffffffc091346a>] ? llog_add+0x7a/0x1a0 [obdclass]
[62145.594025] [<ffffffff810ea8ca>] ? __getnstimeofday64+0x3a/0xd0
[62145.594272] [<ffffffffc12f7ac4>] mdd_changelog_store+0x194/0x540 [mdd]
[62145.594537] [<ffffffffc130532d>] mdd_changelog_data_store_by_fid+0xed/0x1a0 [mdd]
[62145.595020] [<ffffffffc130644b>] mdd_changelog_data_store+0x15b/0x210 [mdd]
[62145.595267] [<ffffffffc130779b>] mdd_close+0x25b/0xcf0 [mdd]
[62145.595540] [<ffffffffc11c9073>] mdt_mfd_close+0x353/0x610 [mdt]
[62145.595790] [<ffffffffc11ce651>] mdt_close_internal+0x121/0x220 [mdt]
[62145.596051] [<ffffffffc11ce970>] mdt_close+0x220/0x780 [mdt]
[62145.596352] [<ffffffffc0bebda5>] tgt_request_handle+0x925/0x1370 [ptlrpc]
[62145.596633] [<ffffffffc0b94b16>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
[62145.597138] [<ffffffffc0b91148>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[62145.597385] [<ffffffff810c4822>] ? default_wake_function+0x12/0x20
[62145.597651] [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
[62145.597927] [<ffffffffc0b98252>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[62145.598176] [<ffffffff81029557>] ? __switch_to+0xd7/0x510
[62145.598420] [<ffffffff816a8f00>] ? __schedule+0x2f0/0x8b0
[62145.598695] [<ffffffffc0b977c0>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
[62145.598955] [<ffffffff810b098f>] kthread+0xcf/0xe0
[62145.604690] [<ffffffff810b08c0>] ? kthread+0x0/0xe0
[62145.604932] [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
[62145.605175] [<ffffffff810b08c0>] ? kthread+0x0/0xe0
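As an aside, the "-22" in the mdd_changelog_clear() messages above is a negative errno value. A minimal sketch (ordinary user-space Python, nothing Lustre-specific) showing what that code decodes to:

```python
import errno
import os

# Lustre, like the rest of the kernel, reports failures as negative
# errno values, so "-22" in the log corresponds to errno 22.
code = 22
print(errno.errorcode[code])  # EINVAL
print(os.strerror(code))      # Invalid argument
```

In other words, the changelog-clear request for user 1 is being rejected as an invalid argument.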
... eventually leading to a complete loss of connectivity, although the server doesn't actually crash. So I took a crash dump while the MDS was unresponsive; it is available at the following links:
vmcore-oak-md1-s2-2018-02-16-11:02:21.gz:
https://stanford.box.com/s/n20wj34xhkmmltiav427l3btdz9qjwvn
vmlinux-3.10.0-693.2.2.el7_lustre.pl1.x86_64.gz:
https://stanford.box.com/s/r6u6xjzmsgys2kzcq26562bqo6p45xcp
Also attaching the "foreach bt" output and memory usage.
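For reference, the attached data came from a crash(8) session against the dump; an illustrative session (file names as linked above, uncompressed) would look like:

```
$ crash vmlinux-3.10.0-693.2.2.el7_lustre.pl1.x86_64 vmcore-oak-md1-s2-2018-02-16-11:02:21
crash> foreach bt      # backtrace of every task (the attached "foreach bt")
crash> kmem -i         # system memory usage summary
crash> log             # kernel ring buffer from the dump
```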
Thanks!
Stephane
Attachments
Issue Links
- is related to:
  - LU-10734 sanity test_160g: User cl8 still found in changelog_users (Resolved)
  - LU-12865 sanity test 160f fails with 'mds1: User cl6 not registered' (Resolved)
  - LU-12871 enable changelog garbage collection by default (Resolved)
  - LU-7340 ChangeLogs catalog full condition should be handled more gracefully (Resolved)