Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.12.0
    • Affects Version/s: Lustre 2.10.3
    • Labels: None
    • Environment: 3.10.0-693.2.2.el7_lustre.pl1.x86_64
    • Severity: 3

    Description

      2.10 LTS continues to cause us some trouble, and this time it's the MDT that has been quite unstable lately. A typical problem seems to start with this: 

      [52968.460453] Lustre: 5897:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [53955.181780] Lustre: 8034:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [53955.182270] Lustre: 8034:0:(mdd_device.c:1597:mdd_changelog_clear()) Skipped 2 previous similar messages
      [54269.697741] Lustre: 5911:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [54362.202025] Lustre: 7920:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [54362.202511] Lustre: 7920:0:(mdd_device.c:1597:mdd_changelog_clear()) Skipped 1 previous similar message
      [54808.564380] perf: interrupt took too long (2579 > 2500), lowering kernel.perf_event_max_sample_rate to 77000
      [56283.046677] Lustre: 7958:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [56283.047150] Lustre: 7958:0:(mdd_device.c:1597:mdd_changelog_clear()) Skipped 14 previous similar messages
      [56867.725405] Lustre: 5397:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [60578.888437] Lustre: 8014:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [60579.796894] Lustre: 8078:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [60579.797374] Lustre: 8078:0:(mdd_device.c:1597:mdd_changelog_clear()) Skipped 2 previous similar messages
      [62145.590529] LNet: Service thread pid 157540 was inactive for 200.33s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [62145.591250] Pid: 157540, comm: mdt_rdpg01_050
      [62145.591486] 
                     Call Trace:
      [62145.591949]  [<ffffffff816a94e9>] schedule+0x29/0x70
      [62145.592200]  [<ffffffff816a6ff9>] schedule_timeout+0x239/0x2c0
      [62145.592444]  [<ffffffff810c9ef9>] ? select_task_rq_fair+0x549/0x700
      [62145.592708]  [<ffffffff816a989d>] wait_for_completion+0xfd/0x140
      [62145.592971]  [<ffffffff810c4810>] ? default_wake_function+0x0/0x20
      [62145.593223]  [<ffffffff810b07ea>] kthread_create_on_node+0xaa/0x140
      [62145.593472]  [<ffffffffc12f0ac0>] ? mdd_chlg_garbage_collect+0x0/0x640 [mdd]
      [62145.593760]  [<ffffffffc091346a>] ? llog_add+0x7a/0x1a0 [obdclass]
      [62145.594025]  [<ffffffff810ea8ca>] ? __getnstimeofday64+0x3a/0xd0
      [62145.594272]  [<ffffffffc12f7ac4>] mdd_changelog_store+0x194/0x540 [mdd]
      [62145.594537]  [<ffffffffc130532d>] mdd_changelog_data_store_by_fid+0xed/0x1a0 [mdd]
      [62145.595020]  [<ffffffffc130644b>] mdd_changelog_data_store+0x15b/0x210 [mdd]
      [62145.595267]  [<ffffffffc130779b>] mdd_close+0x25b/0xcf0 [mdd]
      [62145.595540]  [<ffffffffc11c9073>] mdt_mfd_close+0x353/0x610 [mdt]
      [62145.595790]  [<ffffffffc11ce651>] mdt_close_internal+0x121/0x220 [mdt]
      [62145.596051]  [<ffffffffc11ce970>] mdt_close+0x220/0x780 [mdt]
      [62145.596352]  [<ffffffffc0bebda5>] tgt_request_handle+0x925/0x1370 [ptlrpc]
      [62145.596633]  [<ffffffffc0b94b16>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
      [62145.597138]  [<ffffffffc0b91148>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      [62145.597385]  [<ffffffff810c4822>] ? default_wake_function+0x12/0x20
      [62145.597651]  [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
      [62145.597927]  [<ffffffffc0b98252>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [62145.598176]  [<ffffffff81029557>] ? __switch_to+0xd7/0x510
      [62145.598420]  [<ffffffff816a8f00>] ? __schedule+0x2f0/0x8b0
      [62145.598695]  [<ffffffffc0b977c0>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
      [62145.598955]  [<ffffffff810b098f>] kthread+0xcf/0xe0
      [62145.604690]  [<ffffffff810b08c0>] ? kthread+0x0/0xe0
      [62145.604932]  [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
      [62145.605175]  [<ffffffff810b08c0>] ? kthread+0x0/0xe0
      
      

      ... eventually leading to a loss of connectivity, though the server doesn't actually crash. So I took a crash dump while the MDS was unresponsive; it is available at the following links:

       

      vmcore-oak-md1-s2-2018-02-16-11:02:21.gz:

      https://stanford.box.com/s/n20wj34xhkmmltiav427l3btdz9qjwvn

      vmlinux-3.10.0-693.2.2.el7_lustre.pl1.x86_64.gz:

      https://stanford.box.com/s/r6u6xjzmsgys2kzcq26562bqo6p45xcp

      Also attaching the "foreach bt" output and memory usage information.

      Thanks!

      Stephane
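
      For context, the -22 (EINVAL) failures above come from mdd_changelog_clear(). The registered changelog users and the record each has cleared up to can be inspected on the MDS; a minimal sketch, assuming standard lctl/lfs tools and the oak target from the logs (the exact cause of EINVAL, e.g. a bad user id or record number, would need confirmation from the code):

```shell
# List registered changelog users (cl1, cl2, ...) and the record number
# each has cleared up to (run on the MDS; device name from the logs above)
lctl get_param mdd.oak-MDD0000.changelog_users

# Clear consumed records for a user:
#   lfs changelog_clear <mdtdev> <userid> <endrec>
# endrec 0 means "clear everything seen so far"; an invalid user id or
# record number is the kind of request mdd_changelog_clear() rejects
# with -22 (EINVAL)
lfs changelog_clear oak-MDT0000 cl1 0
```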

      Attachments

      Issue Links

      Activity

            [LU-10680] MDT becoming unresponsive in 2.10.3
            pjones Peter Jones added a comment -

            Landed for 2.12


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31376/
            Subject: LU-10680 mdd: create gc thread when no current transaction
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 31fef6845e8be39ae8dfd242714a84001925b7ad

            adilger Andreas Dilger added a comment -

            Moving over to 2.12 now that the GC feature is no longer enabled by default, and the failing tests have been disabled. The remaining patch "LU-10680 mdd: create gc thread when no current transaction" for this ticket can land when it is ready, but it no longer blocks the 2.11 release.

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31552/
            Subject: LU-10680 mdd: disable changelog garbage collection by default
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 57719d0c8833b8a8a8b6516d3bd4f2ad2b226285

            gerrit Gerrit Updater added a comment -

            John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/31552
            Subject: LU-10680 mdd: disable changelog garbage collection by default
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 62e4519de3dcb588b88311950e8c6b7fefcca6a8

            sthiell Stephane Thiell added a comment -

            Hey Bruno,

            Thanks for the update! And yes, this happened with vm.zone_reclaim_mode=0 on the MDS.

            bfaccini Bruno Faccini (Inactive) added a comment -

            Hello Stephane,
            Just one more comment, to explain why you were able to trigger the 2nd problem/deadlock.
            From LU-10709 I understand that you are also running with vm.zone_reclaim_mode=0 on the MDS, so, judging from the mm code in this area, that appears to be why kthreadd ran memory reclaim, with shrinkers invoked, while creating the new kthread, instead of grabbing new memory/pages from the preferred/current node+zone free-list.

            Hopefully this specific case is handled and fixed by my 2nd fix for this ticket.
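
            The NUMA reclaim setting discussed above can be checked and pinned as follows (a minimal sketch; the sysctl.d file name is only an example, not from this ticket):

```shell
# 0 = prefer allocating from other NUMA nodes over reclaiming pages
# from the local node (the setting in effect on this MDS)
sysctl vm.zone_reclaim_mode

# Set it at runtime
sysctl -w vm.zone_reclaim_mode=0

# Persist across reboots (file name is just an example)
echo 'vm.zone_reclaim_mode = 0' > /etc/sysctl.d/99-vm-reclaim.conf
sysctl -p /etc/sysctl.d/99-vm-reclaim.conf
```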

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31347/
            Subject: LU-10680 mdd: fix run_gc_task uninitialized
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 25cdbbb475b8a42e3a1f594898d2ac1406b466f6

            gerrit Gerrit Updater added a comment -

            Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/31376
            Subject: LU-10680 mdd: create gc thread when no current transaction
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 2985625ebbb404c01f88ada84427d5b4386fbfa4

            sthiell Stephane Thiell added a comment -

            Thanks Bruno! Hope you'll find a way to fix this race condition.

            bfaccini Bruno Faccini (Inactive) added a comment -

            Thinking more about it, it seems to me that kthreadd should be excluded from running the memory-reclaim mechanism in its own context. In that case it would be better to either do nothing (and leave the next thread that tries to allocate memory to hit the same conditions/threshold and reclaim in its own context), or to simply wake up kswapd to do the job asynchronously.

            But anyway, we need to protect against such possible cases, and only request creation of the changelog garbage-collection thread outside of any current journal transaction/handle.
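
            The ordering constraint described above can be sketched as follows. This is illustrative C-style pseudocode only, not the actual LU-10680 patch; every name except kthread_run() is invented for the sketch:

```c
/* Illustrative pseudocode -- not the actual Lustre code.
 *
 * Deadlock pattern: kthread_run() hands the creation off to kthreadd
 * and waits for completion; if kthreadd enters memory reclaim, the
 * shrinkers it runs may wait on the very journal transaction this
 * thread is still holding, so nobody makes progress.
 */

/* BAD: spawning the GC thread while a transaction handle is open */
handle = journal_start(...);
record_changelog_entry(handle, rec);
kthread_run(chlg_garbage_collect, mdd, "chlg_gc");  /* may deadlock */
journal_stop(handle);

/* GOOD: inside the transaction, only note that GC is wanted; spawn
 * the thread later, from a context holding no transaction handle. */
handle = journal_start(...);
record_changelog_entry(handle, rec);
mdd->gc_wanted = 1;                 /* defer the thread creation */
journal_stop(handle);

if (mdd->gc_wanted)                 /* no transaction held here */
        kthread_run(chlg_garbage_collect, mdd, "chlg_gc");
```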

            People

              Assignee: bfaccini Bruno Faccini (Inactive)
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 10
