Lustre / LU-10680

MDT becoming unresponsive in 2.10.3


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.12.0
    • Affects Version/s: Lustre 2.10.3
    • Labels: None
    • Environment: 3.10.0-693.2.2.el7_lustre.pl1.x86_64
    • Severity: 3

    Description

      2.10 LTS continues to cause us some trouble, and this time it's the MDT that has been quite unstable lately. A typical incident appears to start with this:

      [52968.460453] Lustre: 5897:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [53955.181780] Lustre: 8034:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [53955.182270] Lustre: 8034:0:(mdd_device.c:1597:mdd_changelog_clear()) Skipped 2 previous similar messages
      [54269.697741] Lustre: 5911:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [54362.202025] Lustre: 7920:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [54362.202511] Lustre: 7920:0:(mdd_device.c:1597:mdd_changelog_clear()) Skipped 1 previous similar message
      [54808.564380] perf: interrupt took too long (2579 > 2500), lowering kernel.perf_event_max_sample_rate to 77000
      [56283.046677] Lustre: 7958:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [56283.047150] Lustre: 7958:0:(mdd_device.c:1597:mdd_changelog_clear()) Skipped 14 previous similar messages
      [56867.725405] Lustre: 5397:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [60578.888437] Lustre: 8014:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [60579.796894] Lustre: 8078:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [60579.797374] Lustre: 8078:0:(mdd_device.c:1597:mdd_changelog_clear()) Skipped 2 previous similar messages
      [62145.590529] LNet: Service thread pid 157540 was inactive for 200.33s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [62145.591250] Pid: 157540, comm: mdt_rdpg01_050
      [62145.591486] 
                     Call Trace:
      [62145.591949]  [<ffffffff816a94e9>] schedule+0x29/0x70
      [62145.592200]  [<ffffffff816a6ff9>] schedule_timeout+0x239/0x2c0
      [62145.592444]  [<ffffffff810c9ef9>] ? select_task_rq_fair+0x549/0x700
      [62145.592708]  [<ffffffff816a989d>] wait_for_completion+0xfd/0x140
      [62145.592971]  [<ffffffff810c4810>] ? default_wake_function+0x0/0x20
      [62145.593223]  [<ffffffff810b07ea>] kthread_create_on_node+0xaa/0x140
      [62145.593472]  [<ffffffffc12f0ac0>] ? mdd_chlg_garbage_collect+0x0/0x640 [mdd]
      [62145.593760]  [<ffffffffc091346a>] ? llog_add+0x7a/0x1a0 [obdclass]
      [62145.594025]  [<ffffffff810ea8ca>] ? __getnstimeofday64+0x3a/0xd0
      [62145.594272]  [<ffffffffc12f7ac4>] mdd_changelog_store+0x194/0x540 [mdd]
      [62145.594537]  [<ffffffffc130532d>] mdd_changelog_data_store_by_fid+0xed/0x1a0 [mdd]
      [62145.595020]  [<ffffffffc130644b>] mdd_changelog_data_store+0x15b/0x210 [mdd]
      [62145.595267]  [<ffffffffc130779b>] mdd_close+0x25b/0xcf0 [mdd]
      [62145.595540]  [<ffffffffc11c9073>] mdt_mfd_close+0x353/0x610 [mdt]
      [62145.595790]  [<ffffffffc11ce651>] mdt_close_internal+0x121/0x220 [mdt]
      [62145.596051]  [<ffffffffc11ce970>] mdt_close+0x220/0x780 [mdt]
      [62145.596352]  [<ffffffffc0bebda5>] tgt_request_handle+0x925/0x1370 [ptlrpc]
      [62145.596633]  [<ffffffffc0b94b16>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
      [62145.597138]  [<ffffffffc0b91148>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      [62145.597385]  [<ffffffff810c4822>] ? default_wake_function+0x12/0x20
      [62145.597651]  [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
      [62145.597927]  [<ffffffffc0b98252>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [62145.598176]  [<ffffffff81029557>] ? __switch_to+0xd7/0x510
      [62145.598420]  [<ffffffff816a8f00>] ? __schedule+0x2f0/0x8b0
      [62145.598695]  [<ffffffffc0b977c0>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
      [62145.598955]  [<ffffffff810b098f>] kthread+0xcf/0xe0
      [62145.604690]  [<ffffffff810b08c0>] ? kthread+0x0/0xe0
      [62145.604932]  [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
      [62145.605175]  [<ffffffff810b08c0>] ? kthread+0x0/0xe0
      
      

      ... leading to a complete loss of connectivity, although the server doesn't actually crash. So I took a crash dump while the MDS was unresponsive; it is available at the following links:

      vmcore-oak-md1-s2-2018-02-16-11:02:21.gz:

      https://stanford.box.com/s/n20wj34xhkmmltiav427l3btdz9qjwvn

      vmlinux-3.10.0-693.2.2.el7_lustre.pl1.x86_64.gz:

      https://stanford.box.com/s/r6u6xjzmsgys2kzcq26562bqo6p45xcp

      I'm also attaching the 'foreach bt' output and memory usage information.

      Thanks!

      Stephane
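
      For reference, error -22 in the mdd_changelog_clear() messages above is -EINVAL, and the hung mdt_rdpg01_050 thread is blocked in wait_for_completion() inside kthread_create_on_node() while mdd_changelog_store() spawns the changelog garbage-collection thread (mdd_chlg_garbage_collect). Below is a minimal, hypothetical C sketch of that blocking pattern, not the actual Lustre code; chlg_gc_main() and handle_changelog_record() are placeholder names.

      /*
       * Minimal sketch (not the actual Lustre code) of the pattern visible in
       * the stack trace above: a request-handling thread synchronously spawns
       * a garbage-collection kthread.  kthread_run() hands the request to
       * kthreadd and blocks in wait_for_completion() inside
       * kthread_create_on_node() until the new task exists, so if kthreadd
       * cannot make progress, the service thread stays stuck and the MDT
       * appears unresponsive.
       */
      #include <linux/kthread.h>
      #include <linux/err.h>

      /* hypothetical garbage-collection thread body */
      static int chlg_gc_main(void *data)
      {
              /* scan and trim stale changelog records, then exit */
              return 0;
      }

      /* hypothetical stand-in for the path taken by mdd_changelog_store() */
      static int handle_changelog_record(void *env)
      {
              struct task_struct *gc;

              /*
               * Synchronous thread creation: kthread_run() does not return
               * until kthreadd has created the new task, which is the
               * wait_for_completion() frame where the hung mdt_rdpg thread
               * sits in the trace above.
               */
              gc = kthread_run(chlg_gc_main, env, "chlg_gc_thread");
              if (IS_ERR(gc))
                      return PTR_ERR(gc);

              /* ... continue writing the changelog record ... */
              return 0;
      }

      If the trace is read this way, any delay in creating the GC thread is felt directly by the client RPC being serviced, which would match the "Service thread pid 157540 was inactive for 200.33s" warning in the log above.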

            People

              Assignee: bfaccini Bruno Faccini (Inactive)
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 10
