Lustre / LU-12834

MDT hung during failover


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.13.0
    • Environment: lustre-master-ib #328 EL7.7
    • Severity: 3

    Description

      One MDT got stuck after it came back from a reboot. This issue looks like LU-12354, but I am not sure whether it is a duplicate.

      [2019-10-04T18:13:18+00:00] INFO: template[/etc/ssh/sshd_config] sending restart action to service[sshd] (delayed)
      [2019-10-04T18:13:18+00:00] INFO: Processing service[sshd] action restart (ssh::server line 19)
      [2019-10-04T18:13:18+00:00] INFO: service[sshd] restarted
      [2019-10-04T18:13:18+00:00] INFO: Chef Run complete in 40.090920258 seconds
      [2019-10-04T18:13:18+00:00] INFO: Running report handlers
      [2019-10-04T18:13:18+00:00] INFO: Creating JSON run report
      [2019-10-04T18:13:18+00:00] INFO: Report handlers complete
      [  238.881257] LNet: HW NUMA nodes: 2, HW CPU cores: 32, npartitions: 2
      [  238.891233] alg: No test for adler32 (adler32-zlib)
      [  239.698265] Lustre: Lustre: Build Version: 2.12.58_104_g279c264
      [  239.880390] LNet: Using FMR for registration
      [  239.882506] LNetError: 215:0:(o2iblnd_cb.c:2496:kiblnd_passive_connect()) Can't accept conn from 192.168.1.118@o2ib on NA (ib0:0:192.168.1.109): bad dst nid 192.168.1.109@o2ib
      [  239.915188] LNet: Added LNI 192.168.1.109@o2ib [8/256/0/180]
      [  239.969648] LDISKFS-fs warning (device dm-1): ldiskfs_multi_mount_protect:321: MMP interval 42 higher than expected, please wait.
      [  292.347499] LDISKFS-fs (dm-1): recovery complete
      [  292.353139] LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,user_xattr,no_mbcache,nodelalloc
      [  293.137144] Lustre: osd-ldiskfs create tunables for soaked-MDT0001
      [  293.641326] Lustre: soaked-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
      [  299.332695] Lustre: soaked-MDT0001: Will be in recovery for at least 2:30, or until 28 clients reconnect
      [  300.344031] Lustre: soaked-MDT0001: Connection restored to 431caaf2-d303-4 (at 192.168.1.136@o2ib)
      [  301.115589] Lustre: soaked-MDT0001: Connection restored to 81b942e1-4e72-4 (at 192.168.1.125@o2ib)
      [  302.209118] Lustre: soaked-MDT0001: Connection restored to d81c2840-9c51-4 (at 192.168.1.128@o2ib)
      [  302.219237] Lustre: Skipped 2 previous similar messages
      [  305.751655] Lustre: soaked-MDT0001: Connection restored to 6780d2ed-c1f6-4 (at 192.168.1.119@o2ib)
      [  305.761769] Lustre: Skipped 5 previous similar messages
      [  312.381584] Lustre: soaked-MDT0001: Connection restored to 156ff267-111c-4 (at 192.168.1.122@o2ib)
      [  312.391687] Lustre: Skipped 2 previous similar messages
      [  320.759167] Lustre: soaked-MDT0001: Connection restored to 925c0ae9-c415-4 (at 192.168.1.126@o2ib)
      [  320.769289] Lustre: Skipped 4 previous similar messages
      [  338.178116] Lustre: soaked-MDT0001: Connection restored to soaked-MDT0001-lwp-OST0008_UUID (at 192.168.1.104@o2ib)
      [  338.189826] Lustre: Skipped 21 previous similar messages
      [  348.048244] Lustre: soaked-MDT0001: Recovery over after 0:49, of 28 clients 28 recovered and 0 were evicted.
      [  548.286678] Lustre: mdt00_005: service thread pid 5290 was inactive for 200.215 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
      [  548.286697] Lustre: mdt01_004: service thread pid 5269 was inactive for 200.229 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [  548.286701] Pid: 5271, comm: mdt01_006 3.10.0-1062.el7_lustre.x86_64 #1 SMP Mon Sep 30 22:06:44 UTC 2019
      [  548.286702] Lustre: Skipped 5 previous similar messages
      [  548.286703] Call Trace:
      [  548.286811]  [<ffffffffc1035b10>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      [  548.286864]  [<ffffffffc1037caf>] ldlm_cli_enqueue_fini+0x96f/0xdf0 [ptlrpc]
      [  548.286917]  [<ffffffffc103a561>] ldlm_cli_enqueue+0x421/0x930 [ptlrpc]
      [  548.286935]  [<ffffffffc1655d62>] osp_md_object_lock+0x162/0x2d0 [osp]
      [  548.286959]  [<ffffffffc1566974>] lod_object_lock+0xf4/0x780 [lod]
      [  548.286980]  [<ffffffffc15ebbfe>] mdd_object_lock+0x3e/0xe0 [mdd]
      [  548.287009]  [<ffffffffc14847d1>] mdt_remote_object_lock_try+0x1e1/0x520 [mdt]
      [  548.287028]  [<ffffffffc1484b3a>] mdt_remote_object_lock+0x2a/0x30 [mdt]
      [  548.287050]  [<ffffffffc149947e>] mdt_rename_lock+0xbe/0x4b0 [mdt]
      [  548.287071]  [<ffffffffc149ad75>] mdt_reint_rename+0x2c5/0x2b60 [mdt]
      [  548.287092]  [<ffffffffc14a6883>] mdt_reint_rec+0x83/0x210 [mdt]
      [  548.287110]  [<ffffffffc1480930>] mdt_reint_internal+0x7b0/0xba0 [mdt]
      [  548.287129]  [<ffffffffc148be37>] mdt_reint+0x67/0x140 [mdt]
      [  548.287216]  [<ffffffffc10d772a>] tgt_request_handle+0x98a/0x1630 [ptlrpc]
      [  548.287278]  [<ffffffffc1079976>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
      [  548.287338]  [<ffffffffc107d4ac>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
      [  548.287345]  [<ffffffff838c50d1>] kthread+0xd1/0xe0
      [  548.287350]  [<ffffffff83f8bd37>] ret_from_fork_nospec_end+0x0/0x39
      [  548.287383]  [<ffffffffffffffff>] 0xffffffffffffffff
      [  548.287386] Pid: 5270, comm: mdt01_005 3.10.0-1062.el7_lustre.x86_64 #1 SMP Mon Sep 30 22:06:44 UTC 2019
      [  548.287387] Call Trace:
      [  548.287463]  [<ffffffffc1035b10>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      [  548.287516]  [<ffffffffc1037caf>] ldlm_cli_enqueue_fini+0x96f/0xdf0 [ptlrpc]
      [  548.287568]  [<ffffffffc103a561>] ldlm_cli_enqueue+0x421/0x930 [ptlrpc]
      [  548.287582]  [<ffffffffc1655d62>] osp_md_object_lock+0x162/0x2d0 [osp]
      [  548.287599]  [<ffffffffc1566974>] lod_object_lock+0xf4/0x780 [lod]
      [  548.287614]  [<ffffffffc15ebbfe>] mdd_object_lock+0x3e/0xe0 [mdd]
      [  548.287634]  [<ffffffffc14847d1>] mdt_remote_object_lock_try+0x1e1/0x520 [mdt]
      [  548.287678]  [<ffffffffc1484b3a>] mdt_remote_object_lock+0x2a/0x30 [mdt]
      [  548.287701]  [<ffffffffc149947e>] mdt_rename_lock+0xbe/0x4b0 [mdt]
      [  548.287722]  [<ffffffffc149ad75>] mdt_reint_rename+0x2c5/0x2b60 [mdt]
      [  548.287744]  [<ffffffffc14a6883>] mdt_reint_rec+0x83/0x210 [mdt]
      [  548.287764]  [<ffffffffc1480930>] mdt_reint_internal+0x7b0/0xba0 [mdt]
      [  548.287784]  [<ffffffffc148be37>] mdt_reint+0x67/0x140 [mdt]
      [  548.287863]  [<ffffffffc10d772a>] tgt_request_handle+0x98a/0x1630 [ptlrpc]
      [  548.287930]  [<ffffffffc1079976>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
      [  548.287996]  [<ffffffffc107d4ac>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
      [  548.288001]  [<ffffffff838c50d1>] kthread+0xd1/0xe0
      [  548.288005]  [<ffffffff83f8bd37>] ret_from_fork_nospec_end+0x0/0x39
      [  548.288017]  [<ffffffffffffffff>] 0xffffffffffffffff
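
      My reading of the traces (not confirmed): both dumped mdt service threads are blocked in ldlm_completion_ast() waiting on a remote object lock taken through osp_md_object_lock() from mdt_reint_rename(), i.e. a cross-MDT rename whose LDLM lock is never granted after recovery, which would match the LU-12354 symptom. If more state from the hung MDS would help, a minimal set of commands to capture it could look like the following (pid 5290 is just the first hung thread reported above; the dump path is illustrative):

      # dump the Lustre kernel debug log to a file
      lctl dk /tmp/lustre-debug.log
      # dump stack traces of all tasks to the console ring buffer
      echo t > /proc/sysrq-trigger
      # stack of one of the hung mdt service threads
      cat /proc/5290/stack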
      

            People

              Assignee: Hongchao Zhang (hongchao.zhang)
              Reporter: Sarah Liu (sarah)
              Votes: 0
              Watchers: 6
