LU-8250: MDT recovery stalled on secondary node

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version/s: Lustre 2.9.0
    • Fix Version/s: Lustre 2.9.0
    • Environment: lola
      build: commit aa84fbc8165f526dae4bd824a48c186c3ac2f639 + patches
    • Severity: 3
    • 9223372036854775807

    Description

      The error happened during soak testing of build '20160601' (see: https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160601). DNE is enabled. MDTs have been formatted using ldiskfs, OSTs using zfs. Each MDS hosts one MDT, and the MDSes are configured in an active-active failover configuration.

      The issue might be related to LU-7848, although that change is part of the build under test.
      The error results in the oom-killer being invoked, which is documented in LU-7836. This ticket might be a duplicate of LU-7836.

      Events:
      1st Event:

      • 2016-06-03 11:31:10 - failover resource of lola-10 (MDT-2) --> lola-11
      • 2016-06-03 11:36:37 - ... soaked-MDT0002 mounted successfully on lola-11
      • until 2016-06-04 00:44 - soaked-MDT0002 in status 'RECOVERING'
      • 2016-06-04 00:44:52 - lola-11 crashed with oom-killer

      2nd Event:

      • 2016-06-07 08:34:06,621 triggering fault mds_failover lola-10 (MDT-2) --> lola-11
      • 2016-06-07 08:38:42 - Mounting soaked-MDT0002 on lola-11
      • since 2016-06-07 08:39:32,155 - waiting for recovery to complete
      • memory resources are nearly exhausted:
        [root@lola-11 ~]# date
        Wed Jun  8 07:59:49 PDT 2016
        [root@lola-11 ~]# collectl -sm --verbose
        waiting for 1 second sample...
        
        # MEMORY SUMMARY
        #<-------------------------------Physical Memory--------------------------------------><-----------Swap------------><-------Paging------>
        #   Total    Used    Free    Buff  Cached    Slab  Mapped    Anon  Commit  Locked Inact Total  Used  Free   In  Out Fault MajFt   In  Out
           32006M  30564M   1441M 127144K 676256K  28701M  16196K  69072K 201740K   5008K  509M   15G     0   15G    0    0    28     0    0    8
           32006M  30565M   1441M 127144K 676256K  28701M  16196K  69072K 201740K   5008K  509M   15G     0   15G    0    0    63     0    0    4
           32006M  30565M   1441M 127144K 676256K  28701M  16196K  69072K 201740K   5008K  509M   15G     0   15G    0    0     1     0    0    0
           32006M  30564M   1441M 127144K 676256K  28701M  16196K  69072K 201740K   5008K  509M   15G     0   15G    0    0    17     0    0    0
        

      Attached files:

      • 1st event only: Saved crash dump file to lhn.hpdd.intel.com:/var/crashdumps/lu-7836/lola-11/127.0.0.1-2016-06-04-00:44:52
      • 2nd event only: kernel debug log of lola-11, dmesg
      • Both events: messages, console logs


          Activity

            [LU-8250] MDT recovery stalled on secondary node
            pjones Peter Jones added a comment -

            Landed for 2.9


            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/23096/
            Subject: LU-8250 mdd: move linkea prepare out of transaction.
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: b596047cae1d3381cafae9c4132e1a84e99ca9d0


            jgmitter Joseph Gmitter (Inactive) added a comment -

            http://review.whamcloud.com/23111 has been moved to LU-8704 per discussion with Di.

            di.wang Di Wang added a comment - - edited

            There is another panic with patch 23111. Since this is a 2.9 blocker, I will create a new ticket (LU-8704).


            wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/23111
            Subject: LU-8250 osd: add journal info check
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: c8bc88c85d6f273b18f7741cc9c0cdc1bcfca4d5


            adilger Andreas Dilger added a comment -

            It would be useful to add a check for an open journal handle in the declare code and in the ptlrpc code to detect any other code paths like this. That can be done relatively easily with ldiskfs by checking current->journal_info in osd_trans_declare_op() and ptlrpc_set_wait().

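            A minimal sketch of the suggested check, assuming a plain debug helper is enough to illustrate it; the helper name is hypothetical and the check actually landed in http://review.whamcloud.com/23111 may differ. jbd2 publishes the running handle in current->journal_info, so a non-NULL value at these call sites means an RPC or a new declaration is being issued while a transaction is still open:

             /* Hypothetical debug helper, e.g. called at the top of
              * osd_trans_declare_op() and ptlrpc_set_wait(). */
             #include <linux/sched.h>    /* current, task_struct::journal_info */
             #include <linux/printk.h>   /* pr_err() */

             static inline void check_no_journal_handle(const char *caller)
             {
                     /* jbd2 stores the running handle in current->journal_info,
                      * so a non-NULL value means a journal handle is open here */
                     if (current->journal_info != NULL)
                             pr_err("%s: called with journal handle %p open\n",
                                    caller, current->journal_info);
             }

            With a check like this in place, a path such as the one in the stack trace below (ptlrpc_set_wait() reached from inside an open transaction) would be reported immediately instead of only surfacing as a recovery stall.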

            wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/23096
            Subject: LU-8250 mdd: move linkea prepare out of transaction.
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: b67d62eefef30fbb3365da3382021820985b5e39

            di.wang Di Wang added a comment -

            It turns out one MDS thread is stuck here:

            mdt00_016     S 0000000000000005     0  6217      2 0x00000080
             ffff880404c37760 0000000000000046 0000000000000000 ffff880404c37724
             0000000000000000 ffff88043fe82800 00000db7708e9e5b 0000000000000286
             ffff880404c37700 ffffffff81089e8c ffff8804066d9068 ffff880404c37fd8
            Call Trace:
             [<ffffffff81089e8c>] ? lock_timer_base+0x3c/0x70
             [<ffffffff8153a9b2>] schedule_timeout+0x192/0x2e0
             [<ffffffff81089fa0>] ? process_timeout+0x0/0x10
             [<ffffffffa0b09821>] ptlrpc_set_wait+0x321/0x960 [ptlrpc]
             [<ffffffffa0afe980>] ? ptlrpc_interrupted_set+0x0/0x120 [ptlrpc]
             [<ffffffff81067650>] ? default_wake_function+0x0/0x20
             [<ffffffffa0b15d05>] ? lustre_msg_set_jobid+0xf5/0x130 [ptlrpc]
             [<ffffffffa0b09ee1>] ptlrpc_queue_wait+0x81/0x220 [ptlrpc]
             [<ffffffffa13d4bc2>] osp_remote_sync+0xf2/0x1e0 [osp]
             [<ffffffffa13ba821>] osp_xattr_get+0x681/0xf90 [osp]
             [<ffffffffa12eb5b5>] lod_xattr_get+0x185/0x760 [lod]
             [<ffffffffa134f917>] mdd_links_read+0x117/0x270 [mdd]
             [<ffffffffa1364996>] ? mdd_attr_set_internal+0xd6/0x2c0 [mdd]
             [<ffffffffa13515bc>] mdd_linkea_prepare+0x3ec/0x4d0 [mdd]
             [<ffffffffa13573f2>] mdd_link+0xde2/0x10c0 [mdd]
             [<ffffffffa1222a8f>] mdt_reint_link+0xb2f/0xce0 [mdt]
             [<ffffffff81299b7a>] ? strlcpy+0x4a/0x60
             [<ffffffffa12178cf>] ? ucred_set_jobid+0x5f/0x70 [mdt]
             [<ffffffffa121b04d>] mdt_reint_rec+0x5d/0x200 [mdt]
             [<ffffffffa1205d5b>] mdt_reint_internal+0x62b/0xa50 [mdt]
             [<ffffffffa120662b>] mdt_reint+0x6b/0x120 [mdt]
             [<ffffffffa0b790cc>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
             [<ffffffffa0b25821>] ptlrpc_main+0xd31/0x1800 [ptlrpc]
             [<ffffffff8106ee50>] ? pick_next_task_fair+0xd0/0x130
             [<ffffffff81539896>] ? schedule+0x176/0x3a0
             [<ffffffffa0b24af0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
             [<ffffffff810a138e>] kthread+0x9e/0xc0
             [<ffffffff8100c28a>] child_rip+0xa/0x20
             [<ffffffff810a12f0>] ? kthread+0x0/0xc0
             [<ffffffff8100c280>] ? child_rip+0x0/0x20
            

            Then other threads are waiting for the journal:

            mdt_out01_008 D 000000000000000b     0  6281      2 0x00000080
             ffff8804030a7a40 0000000000000046 0000000000000000 ffff88079c63b5c0
             ffff8807fbe51ad8 ffff88082d91f400 00000bef4b83696a ffff880828bdba80
             ffff8804030a7a10 0000000100c3b96b ffff8803fbd31068 ffff8804030a7fd8
            Call Trace:
             [<ffffffffa0fe7fca>] start_this_handle+0x25a/0x480 [jbd2]
             [<ffffffff811781fb>] ? cache_alloc_refill+0x15b/0x240
             [<ffffffff810a1820>] ? autoremove_wake_function+0x0/0x40
             [<ffffffffa0fe83d5>] jbd2_journal_start+0xb5/0x100 [jbd2]
             [<ffffffffa1036a36>] ldiskfs_journal_start_sb+0x56/0xe0 [ldiskfs]
             [<ffffffffa10870f1>] osd_trans_start+0x1e1/0x430 [osd_ldiskfs]
             [<ffffffffa0b7c53c>] out_tx_end+0x9c/0x5d0 [ptlrpc]
             [<ffffffffa0b81ec9>] out_handle+0x11d9/0x18d0 [ptlrpc]
             [<ffffffff8105e9b6>] ? enqueue_task+0x66/0x80
             [<ffffffff8105ab8d>] ? check_preempt_curr+0x6d/0x90
             [<ffffffffa080bc5a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
             [<ffffffff8153afce>] ? mutex_lock+0x1e/0x50
             [<ffffffffa0b71eda>] ? req_can_reconstruct+0x6a/0x120 [ptlrpc]
             [<ffffffffa0b790cc>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
             [<ffffffffa0b25821>] ptlrpc_main+0xd31/0x1800 [ptlrpc]
             [<ffffffff8106ee50>] ? pick_next_task_fair+0xd0/0x130
             [<ffffffff81539896>] ? schedule+0x176/0x3a0
             [<ffffffffa0b24af0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
             [<ffffffff810a138e>] kthread+0x9e/0xc0
             [<ffffffff8100c28a>] child_rip+0xa/0x20
             [<ffffffff810a12f0>] ? kthread+0x0/0xc0
             [<ffffffff8100c280>] ? child_rip+0x0/0x20
            

            This then causes other MDTs to be unable to reconnect to this MDT:

            LustreError: 11-0: soaked-MDT0001-osp-MDT0003: operation mds_connect to node 192.168.1.109@o2ib10 failed: rc = -114
            LustreError: 11-0: soaked-MDT0001-osp-MDT0003: operation mds_connect to node 192.168.1.109@o2ib10 failed: rc = -114
            LustreError: 11-0: soaked-MDT0001-osp-MDT0003: operation mds_connect to node 192.168.1.109@o2ib10 failed: rc = -114
            

            This then causes the whole recovery to get stuck. I will cook a fix.

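            The fix that eventually landed, http://review.whamcloud.com/23096 ("mdd: move linkea prepare out of transaction"), takes the direction sketched below. The sketch is illustrative only: argument lists are elided, the transaction start/stop steps are paraphrased, and only the call ordering matters; the named functions are the ones visible in the stack traces above.

             /*
              * Stuck ordering (as in the first trace): the linkEA is prepared
              * while the local jbd2 handle is already open, so the thread can
              * sleep in ptlrpc_set_wait() waiting on a remote MDT while every
              * other thread that needs to start a journal handle queues up
              * behind it (see start_this_handle() in the second trace).
              *
              *     declare updates / start transaction   <-- jbd2 handle open
              *     mdd_linkea_prepare()                   <-- mdd_links_read() ->
              *                                                lod_xattr_get() ->
              *                                                osp_xattr_get() ->
              *                                                ptlrpc_set_wait()
              *                                                blocks on remote MDT
              *     apply updates / stop transaction
              *
              * Fixed ordering: prepare the linkEA, which may need a synchronous
              * OSP RPC, before the local transaction is started, so no thread
              * ever sleeps in ptlrpc_set_wait() with current->journal_info set.
              *
              *     mdd_linkea_prepare()                   <-- remote RPCs allowed
              *     declare updates / start transaction
              *     apply updates (local only)
              *     stop transaction
              */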

            tappro Mikhail Pershin added a comment -

            Frank, is the patch from http://review.whamcloud.com/#/c/13726/ applied in the build or not? If not, I'd try with it.

            di.wang Di Wang added a comment -

            Hmm, right now we do not fail the connection between MDTs, to make sure the FS will not be silently corrupted.

            Jun  7 08:40:11 lola-11 kernel: Lustre: soaked-MDT0002: Client f6c679b2-f46e-20c5-89f7-51193ed93a53 (at 192.168.1.121@o2ib100) reconnecting, waiting for 20 clients in recovery for 1:52
            Jun  7 08:40:11 lola-11 kernel: Lustre: Skipped 99 previous similar messages
            Jun  7 08:40:18 lola-11 sshd[7507]: Accepted publickey for root from 10.4.0.116 port 47943 ssh2
            Jun  7 08:40:18 lola-11 sshd[7507]: pam_unix(sshd:session): session opened for user root by (uid=0)
            Jun  7 08:40:18 lola-11 sshd[7507]: Received disconnect from 10.4.0.116: 11: disconnected by user
            Jun  7 08:40:18 lola-11 sshd[7507]: pam_unix(sshd:session): session closed for user root
            Jun  7 08:40:33 lola-11 sshd[7530]: Accepted publickey for root from 10.4.0.116 port 47963 ssh2
            Jun  7 08:40:33 lola-11 sshd[7530]: pam_unix(sshd:session): session opened for user root by (uid=0)
            Jun  7 08:40:33 lola-11 sshd[7530]: Received disconnect from 10.4.0.116: 11: disconnected by user
            Jun  7 08:40:33 lola-11 sshd[7530]: pam_unix(sshd:session): session closed for user root
            Jun  7 08:40:47 lola-11 kernel: Lustre: soaked-MDT0003: recovery is timed out, evict stale exports
            Jun  7 08:40:47 lola-11 kernel: Lustre: 6183:0:(ldlm_lib.c:2016:target_recovery_overseer()) soaked-MDT0003 recovery is aborted by hard timeout
            Jun  7 08:40:47 lola-11 kernel: Lustre: 6183:0:(ldlm_lib.c:2026:target_recovery_overseer()) recovery is aborted, evict exports in recovery
            Jun  7 08:40:47 lola-11 kernel: Lustre: soaked-MDT0003: disconnecting 1 stale clients
            

            I just glanced at the log a bit, and it seems MDT0002 and MDT0003 were recovering at the same time. I thought the soak test would wait for the MDT recovery to finish before failing another MDT, no? I might be missing something. Thanks.


            People

              Assignee: tappro Mikhail Pershin
              Reporter: heckes Frank Heckes (Inactive)
              Votes: 0
              Watchers: 13
