Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.12.0
    • Affects Version/s: Lustre 2.10.3
    • Labels: None
    • Environment: 3.10.0-693.2.2.el7_lustre.pl1.x86_64
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      2.10 LTS continues to cause us some trouble, and this time it's the MDT that has been quite unstable lately. A typical problem seems to start with this:

      [52968.460453] Lustre: 5897:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [53955.181780] Lustre: 8034:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [53955.182270] Lustre: 8034:0:(mdd_device.c:1597:mdd_changelog_clear()) Skipped 2 previous similar messages
      [54269.697741] Lustre: 5911:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [54362.202025] Lustre: 7920:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [54362.202511] Lustre: 7920:0:(mdd_device.c:1597:mdd_changelog_clear()) Skipped 1 previous similar message
      [54808.564380] perf: interrupt took too long (2579 > 2500), lowering kernel.perf_event_max_sample_rate to 77000
      [56283.046677] Lustre: 7958:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [56283.047150] Lustre: 7958:0:(mdd_device.c:1597:mdd_changelog_clear()) Skipped 14 previous similar messages
      [56867.725405] Lustre: 5397:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [60578.888437] Lustre: 8014:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [60579.796894] Lustre: 8078:0:(mdd_device.c:1597:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      [60579.797374] Lustre: 8078:0:(mdd_device.c:1597:mdd_changelog_clear()) Skipped 2 previous similar messages
      [62145.590529] LNet: Service thread pid 157540 was inactive for 200.33s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [62145.591250] Pid: 157540, comm: mdt_rdpg01_050
      [62145.591486]  Call Trace:
      [62145.591949]  [<ffffffff816a94e9>] schedule+0x29/0x70
      [62145.592200]  [<ffffffff816a6ff9>] schedule_timeout+0x239/0x2c0
      [62145.592444]  [<ffffffff810c9ef9>] ? select_task_rq_fair+0x549/0x700
      [62145.592708]  [<ffffffff816a989d>] wait_for_completion+0xfd/0x140
      [62145.592971]  [<ffffffff810c4810>] ? default_wake_function+0x0/0x20
      [62145.593223]  [<ffffffff810b07ea>] kthread_create_on_node+0xaa/0x140
      [62145.593472]  [<ffffffffc12f0ac0>] ? mdd_chlg_garbage_collect+0x0/0x640 [mdd]
      [62145.593760]  [<ffffffffc091346a>] ? llog_add+0x7a/0x1a0 [obdclass]
      [62145.594025]  [<ffffffff810ea8ca>] ? __getnstimeofday64+0x3a/0xd0
      [62145.594272]  [<ffffffffc12f7ac4>] mdd_changelog_store+0x194/0x540 [mdd]
      [62145.594537]  [<ffffffffc130532d>] mdd_changelog_data_store_by_fid+0xed/0x1a0 [mdd]
      [62145.595020]  [<ffffffffc130644b>] mdd_changelog_data_store+0x15b/0x210 [mdd]
      [62145.595267]  [<ffffffffc130779b>] mdd_close+0x25b/0xcf0 [mdd]
      [62145.595540]  [<ffffffffc11c9073>] mdt_mfd_close+0x353/0x610 [mdt]
      [62145.595790]  [<ffffffffc11ce651>] mdt_close_internal+0x121/0x220 [mdt]
      [62145.596051]  [<ffffffffc11ce970>] mdt_close+0x220/0x780 [mdt]
      [62145.596352]  [<ffffffffc0bebda5>] tgt_request_handle+0x925/0x1370 [ptlrpc]
      [62145.596633]  [<ffffffffc0b94b16>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
      [62145.597138]  [<ffffffffc0b91148>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      [62145.597385]  [<ffffffff810c4822>] ? default_wake_function+0x12/0x20
      [62145.597651]  [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
      [62145.597927]  [<ffffffffc0b98252>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [62145.598176]  [<ffffffff81029557>] ? __switch_to+0xd7/0x510
      [62145.598420]  [<ffffffff816a8f00>] ? __schedule+0x2f0/0x8b0
      [62145.598695]  [<ffffffffc0b977c0>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
      [62145.598955]  [<ffffffff810b098f>] kthread+0xcf/0xe0
      [62145.604690]  [<ffffffff810b08c0>] ? kthread+0x0/0xe0
      [62145.604932]  [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
      [62145.605175]  [<ffffffff810b08c0>] ? kthread+0x0/0xe0
      
      

      ... leading to loss of connectivity, though the server doesn't actually crash. So I took a crash dump while the MDS was unresponsive; it is available at the following links:

      vmcore-oak-md1-s2-2018-02-16-11:02:21.gz:

      https://stanford.box.com/s/n20wj34xhkmmltiav427l3btdz9qjwvn

      vmlinux-3.10.0-693.2.2.el7_lustre.pl1.x86_64.gz:

      https://stanford.box.com/s/r6u6xjzmsgys2kzcq26562bqo6p45xcp

      I'm also attaching the foreach bt output and a memory usage summary.
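
      For reference, those were produced with the crash utility, roughly along these lines (a sketch of the session; the output file names are just examples):

        # open the dump with the matching debuginfo kernel image
        crash vmlinux-3.10.0-693.2.2.el7_lustre.pl1.x86_64 vmcore-oak-md1-s2-2018-02-16-11:02:21

        # inside crash: backtraces of all tasks, plus a memory usage summary
        crash> foreach bt > foreach_bt.txt
        crash> kmem -i > kmem_i.txt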

      Thanks!

      Stephane
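
      P.S. On the changelog_clear -22 errors above: "user 1" is a registered changelog consumer. The registered users and their current record positions can be listed on the MDS, and records cleared explicitly, with something like the following (a sketch; oak-MDT0000 matches our setup, verify the parameter path on your version):

        lctl get_param mdd.oak-MDT0000.changelog_users
        # clear everything up to the current last record for consumer cl1
        lfs changelog_clear oak-MDT0000 cl1 0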

    Attachments

    Issue Links

    Activity

    [LU-10680] MDT becoming unresponsive in 2.10.3

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53773/
            Subject: LU-10680 tests: fix interop for sanity test_160h
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: c929f30fd103f9544ee15237f8680839664fdef7


            "Xing Huang <hxing@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53773
            Subject: LU-10680 tests: fix interop for sanity test_160h
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: 32f7dd91985cd43e6ad2ed9854f9bed114a0e640

            pjones Peter Jones added a comment -

            Landed for 2.12


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31376/
            Subject: LU-10680 mdd: create gc thread when no current transaction
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 31fef6845e8be39ae8dfd242714a84001925b7ad


            adilger Andreas Dilger added a comment -

            Moving over to 2.12 now that the GC feature is no longer enabled by default, and the failing tests have been disabled. The remaining patch "LU-10680 mdd: create gc thread when no current transaction" for this ticket can land when it is ready, but no longer blocks the 2.11 release.
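
            For reference, with these patches the collection can still be toggled per MDT via an mdd tunable (parameter name as used in the patch series; a sketch, verify on your version):

              # disable (the new default) or re-enable changelog garbage collection
              lctl set_param mdd.oak-MDT0000.changelog_gc=0
              lctl get_param mdd.oak-MDT0000.changelog_gc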


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31552/
            Subject: LU-10680 mdd: disable changelog garbage collection by default
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 57719d0c8833b8a8a8b6516d3bd4f2ad2b226285


            gerrit Gerrit Updater added a comment -

            John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/31552
            Subject: LU-10680 mdd: disable changelog garbage collection by default
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 62e4519de3dcb588b88311950e8c6b7fefcca6a8


            sthiell Stephane Thiell added a comment -

            Hey Bruno,

            Thanks for the update! And yes, this happened with vm.zone_reclaim_mode=0 on MDS.
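
            For completeness, that's the runtime value as checked/set with the standard sysctl interface (the sysctl.d path below is just an example of how we pin it):

              sysctl vm.zone_reclaim_mode
              sysctl -w vm.zone_reclaim_mode=0
              # persisted, e.g., in /etc/sysctl.d/99-lustre.conf:
              # vm.zone_reclaim_mode = 0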


            bfaccini Bruno Faccini (Inactive) added a comment -

            Hello Stephane,
            Just one more comment I have in mind, to explain why you have been able to trigger the 2nd problem/deadlock.
            As part of LU-10709 I understand that you are also running with vm.zone_reclaim_mode=0 on the MDS, so, according to the mm code in this area, this seems to be the reason why memory reclaim (with shrinkers run) was triggered during kthreadd's creation of the new kthread, instead of new memory/pages being grabbed from the preferred/current node+zone free list.

            Hopefully this specific case is handled and fixed by my 2nd fix for this ticket.
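
            If it helps to double-check the per-node/zone free-list state the next time this happens, these standard (non-Lustre) interfaces can be captured on the MDS:

              # free pages per node and zone, by allocation order
              cat /proc/buddyinfo
              # per-node memory usage summary (numactl package)
              numastat -m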


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31347/
            Subject: LU-10680 mdd: fix run_gc_task uninitialized
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 25cdbbb475b8a42e3a1f594898d2ac1406b466f6


            People

              Assignee: bfaccini Bruno Faccini (Inactive)
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 10

              Dates

                Created:
                Updated:
                Resolved: