Details
- Type: Bug
- Resolution: Fixed
- Priority: Blocker
- Fix Version/s: Lustre 2.9.0
- Environment: lola, build: commit aa84fbc8165f526dae4bd824a48c186c3ac2f639 + patches
- Severity: 3
Description
The error happened during soak testing of build '20160601' (see: https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160601). DNE is enabled. The MDTs have been formatted using ldiskfs, the OSTs using zfs. Each MDS hosts one MDT, and the MDSes are configured in an active-active failover configuration.
The issue might be related to LU-7848, although that change is part of the build under test.
The error results in the oom-killer being triggered, which is documented in LU-7836; this ticket might be a duplicate of LU-7836.
Events:
1st Event:
- 2016-06-03 11:31:10 - failover resource of lola-10 (MDT-2) --> lola-11
- 2016-06-03 11:36:37 - ... soaked-MDT0002 mounted successfully on lola-11
- until 2016-06-04 00:44 - soaked-MDT0002 remained in status 'RECOVERING'
- 2016-06-04 00:44:52 - lola-11 crashed with oom-killer
2nd Event:
- 2016-06-07 08:34:06,621 triggering fault mds_failover lola-10 (MDT-2) --> lola-11
- 2016-06-07 08:38:42 - Mounting soaked-MDT0002 on lola-11
- since 2016-06-07 08:39:32,155 - waiting for recovery to complete
- memory resources are nearly exhausted:
[root@lola-11 ~]# date
Wed Jun 8 07:59:49 PDT 2016
[root@lola-11 ~]# collectl -sm --verbose
waiting for 1 second sample...

# MEMORY SUMMARY
#<-------------------------------Physical Memory--------------------------------------><-----------Swap------------><-------Paging------>
#   Total    Used    Free    Buff  Cached    Slab  Mapped    Anon  Commit  Locked Inact Total  Used  Free   In  Out Fault MajFt   In  Out
   32006M  30564M   1441M 127144K 676256K  28701M  16196K  69072K 201740K   5008K  509M   15G     0   15G    0    0    28     0    0    8
   32006M  30565M   1441M 127144K 676256K  28701M  16196K  69072K 201740K   5008K  509M   15G     0   15G    0    0    63     0    0    4
   32006M  30565M   1441M 127144K 676256K  28701M  16196K  69072K 201740K   5008K  509M   15G     0   15G    0    0     1     0    0    0
   32006M  30564M   1441M 127144K 676256K  28701M  16196K  69072K 201740K   5008K  509M   15G     0   15G    0    0    17     0    0    0
Attached files:
- 1st event only: saved crash dump file to lhn.hpdd.intel.com:/var/crashdumps/lu-7836/lola-11/127.0.0.1-2016-06-04-00:44:52
- 2nd event only: kernel debug log of lola-11, dmesg
- Both events: messages, console logs
Activity
It would be useful to add a check for an open journal handle in the declare code and in the ptlrpc code to detect any other code paths like this. That can be done relatively easily with ldiskfs: check current->journal_info in osd_trans_declare_op() and ptlrpc_set_wait().
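For illustration only (this is not the actual patch), such a check might look like the following minimal sketch; the helper name is hypothetical, and the two call sites mirror the suggestion above:

/*
 * Sketch of the suggested debug check: warn if the current thread still
 * holds an open jbd2/ldiskfs journal handle.  jbd2 stores the running
 * handle in current->journal_info, so a non-NULL value inside
 * osd_trans_declare_op() or ptlrpc_set_wait() means we are declaring a new
 * operation or waiting on an RPC while a transaction is still open.
 */
#include <linux/sched.h>
#include <linux/kernel.h>

static inline void lu8250_check_journal_info(const char *caller)
{
	/* non-NULL while a jbd2 handle started by this thread is running */
	if (unlikely(current->journal_info != NULL))
		WARN_ONCE(1, "%s: called with journal handle %p still open\n",
			  caller, current->journal_info);
}

/* hypothetical call sites:
 *   osd_trans_declare_op():  lu8250_check_journal_info(__func__);
 *   ptlrpc_set_wait():       lu8250_check_journal_info(__func__);
 */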
wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/23096
Subject: LU-8250 mdd: move linkea prepare out of transaction.
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b67d62eefef30fbb3365da3382021820985b5e39
It turns out one MDS service thread is stuck here:
mdt00_016     S 0000000000000005     0  6217      2 0x00000080
 ffff880404c37760 0000000000000046 0000000000000000 ffff880404c37724
 0000000000000000 ffff88043fe82800 00000db7708e9e5b 0000000000000286
 ffff880404c37700 ffffffff81089e8c ffff8804066d9068 ffff880404c37fd8
Call Trace:
 [<ffffffff81089e8c>] ? lock_timer_base+0x3c/0x70
 [<ffffffff8153a9b2>] schedule_timeout+0x192/0x2e0
 [<ffffffff81089fa0>] ? process_timeout+0x0/0x10
 [<ffffffffa0b09821>] ptlrpc_set_wait+0x321/0x960 [ptlrpc]
 [<ffffffffa0afe980>] ? ptlrpc_interrupted_set+0x0/0x120 [ptlrpc]
 [<ffffffff81067650>] ? default_wake_function+0x0/0x20
 [<ffffffffa0b15d05>] ? lustre_msg_set_jobid+0xf5/0x130 [ptlrpc]
 [<ffffffffa0b09ee1>] ptlrpc_queue_wait+0x81/0x220 [ptlrpc]
 [<ffffffffa13d4bc2>] osp_remote_sync+0xf2/0x1e0 [osp]
 [<ffffffffa13ba821>] osp_xattr_get+0x681/0xf90 [osp]
 [<ffffffffa12eb5b5>] lod_xattr_get+0x185/0x760 [lod]
 [<ffffffffa134f917>] mdd_links_read+0x117/0x270 [mdd]
 [<ffffffffa1364996>] ? mdd_attr_set_internal+0xd6/0x2c0 [mdd]
 [<ffffffffa13515bc>] mdd_linkea_prepare+0x3ec/0x4d0 [mdd]
 [<ffffffffa13573f2>] mdd_link+0xde2/0x10c0 [mdd]
 [<ffffffffa1222a8f>] mdt_reint_link+0xb2f/0xce0 [mdt]
 [<ffffffff81299b7a>] ? strlcpy+0x4a/0x60
 [<ffffffffa12178cf>] ? ucred_set_jobid+0x5f/0x70 [mdt]
 [<ffffffffa121b04d>] mdt_reint_rec+0x5d/0x200 [mdt]
 [<ffffffffa1205d5b>] mdt_reint_internal+0x62b/0xa50 [mdt]
 [<ffffffffa120662b>] mdt_reint+0x6b/0x120 [mdt]
 [<ffffffffa0b790cc>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
 [<ffffffffa0b25821>] ptlrpc_main+0xd31/0x1800 [ptlrpc]
 [<ffffffff8106ee50>] ? pick_next_task_fair+0xd0/0x130
 [<ffffffff81539896>] ? schedule+0x176/0x3a0
 [<ffffffffa0b24af0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
 [<ffffffff810a138e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff810a12f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
Other threads are then left waiting for the journal:
mdt_out01_008 D 000000000000000b     0  6281      2 0x00000080
 ffff8804030a7a40 0000000000000046 0000000000000000 ffff88079c63b5c0
 ffff8807fbe51ad8 ffff88082d91f400 00000bef4b83696a ffff880828bdba80
 ffff8804030a7a10 0000000100c3b96b ffff8803fbd31068 ffff8804030a7fd8
Call Trace:
 [<ffffffffa0fe7fca>] start_this_handle+0x25a/0x480 [jbd2]
 [<ffffffff811781fb>] ? cache_alloc_refill+0x15b/0x240
 [<ffffffff810a1820>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa0fe83d5>] jbd2_journal_start+0xb5/0x100 [jbd2]
 [<ffffffffa1036a36>] ldiskfs_journal_start_sb+0x56/0xe0 [ldiskfs]
 [<ffffffffa10870f1>] osd_trans_start+0x1e1/0x430 [osd_ldiskfs]
 [<ffffffffa0b7c53c>] out_tx_end+0x9c/0x5d0 [ptlrpc]
 [<ffffffffa0b81ec9>] out_handle+0x11d9/0x18d0 [ptlrpc]
 [<ffffffff8105e9b6>] ? enqueue_task+0x66/0x80
 [<ffffffff8105ab8d>] ? check_preempt_curr+0x6d/0x90
 [<ffffffffa080bc5a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
 [<ffffffff8153afce>] ? mutex_lock+0x1e/0x50
 [<ffffffffa0b71eda>] ? req_can_reconstruct+0x6a/0x120 [ptlrpc]
 [<ffffffffa0b790cc>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
 [<ffffffffa0b25821>] ptlrpc_main+0xd31/0x1800 [ptlrpc]
 [<ffffffff8106ee50>] ? pick_next_task_fair+0xd0/0x130
 [<ffffffff81539896>] ? schedule+0x176/0x3a0
 [<ffffffffa0b24af0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
 [<ffffffff810a138e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff810a12f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
This then causes other MDTs to be unable to reconnect to this MDT:
LustreError: 11-0: soaked-MDT0001-osp-MDT0003: operation mds_connect to node 192.168.1.109@o2ib10 failed: rc = -114
LustreError: 11-0: soaked-MDT0001-osp-MDT0003: operation mds_connect to node 192.168.1.109@o2ib10 failed: rc = -114
LustreError: 11-0: soaked-MDT0001-osp-MDT0003: operation mds_connect to node 192.168.1.109@o2ib10 failed: rc = -114
This causes the whole recovery to get stuck. I will cook a fix.
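For illustration, here is a minimal standalone sketch of the reordering that http://review.whamcloud.com/23096 ("move linkea prepare out of transaction") aims for. The stub functions are hypothetical stand-ins, not the real mdd/osd code; only the call-chain names in the stack trace above are from Lustre itself:

/*
 * Sketch: anything that may wait on a remote RPC (the linkea prepare step,
 * which can call osp_xattr_get() across MDTs) must run before the local
 * journal handle is opened, so no thread ever blocks in ptlrpc_set_wait()
 * while holding the journal and starving start_this_handle() on other
 * threads.
 */
#include <stdio.h>

static void linkea_prepare(void) { puts("prepare linkea (may send remote RPC and wait)"); }
static void journal_start(void)  { puts("start local transaction (journal handle held)"); }
static void linkea_update(void)  { puts("write prepared linkea, local I/O only"); }
static void journal_stop(void)   { puts("stop transaction (journal handle released)"); }

int main(void)
{
	/* old, deadlock-prone ordering: journal_start(); linkea_prepare(); ... */

	/* fixed ordering: RPC-capable work first, then the transaction */
	linkea_prepare();
	journal_start();
	linkea_update();
	journal_stop();
	return 0;
}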
Frank, is the patch from http://review.whamcloud.com/#/c/13726/ applied in this build or not? If not, I'd try with it.
Hmm, right now we do not fail the connection between MDTs, to make sure the FS will not be corrupted silently.
Jun 7 08:40:11 lola-11 kernel: Lustre: soaked-MDT0002: Client f6c679b2-f46e-20c5-89f7-51193ed93a53 (at 192.168.1.121@o2ib100) reconnecting, waiting for 20 clients in recovery for 1:52
Jun 7 08:40:11 lola-11 kernel: Lustre: Skipped 99 previous similar messages
Jun 7 08:40:18 lola-11 sshd[7507]: Accepted publickey for root from 10.4.0.116 port 47943 ssh2
Jun 7 08:40:18 lola-11 sshd[7507]: pam_unix(sshd:session): session opened for user root by (uid=0)
Jun 7 08:40:18 lola-11 sshd[7507]: Received disconnect from 10.4.0.116: 11: disconnected by user
Jun 7 08:40:18 lola-11 sshd[7507]: pam_unix(sshd:session): session closed for user root
Jun 7 08:40:33 lola-11 sshd[7530]: Accepted publickey for root from 10.4.0.116 port 47963 ssh2
Jun 7 08:40:33 lola-11 sshd[7530]: pam_unix(sshd:session): session opened for user root by (uid=0)
Jun 7 08:40:33 lola-11 sshd[7530]: Received disconnect from 10.4.0.116: 11: disconnected by user
Jun 7 08:40:33 lola-11 sshd[7530]: pam_unix(sshd:session): session closed for user root
Jun 7 08:40:47 lola-11 kernel: Lustre: soaked-MDT0003: recovery is timed out, evict stale exports
Jun 7 08:40:47 lola-11 kernel: Lustre: 6183:0:(ldlm_lib.c:2016:target_recovery_overseer()) soaked-MDT0003 recovery is aborted by hard timeout
Jun 7 08:40:47 lola-11 kernel: Lustre: 6183:0:(ldlm_lib.c:2026:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Jun 7 08:40:47 lola-11 kernel: Lustre: soaked-MDT0003: disconnecting 1 stale clients
I just glanced through the log a bit, and it seems MDT0002 and MDT0003 were recovering at the same time. I thought the soak test would wait for the MDT recovery to finish before failing over another MDT, no? I might be missing something. Thanks.
Hi Hongchao,
there's a crash dump file available, stored at lhn.hpdd.intel.com:/var/crashdumps/lu-7836/lola-11/127.0.0.1-2016-06-04-00:44:52 (see above). It's easy to access if you have an account on Lola (DCO will create one within 2 hours if you file a ticket containing your ssh public key). The crash dump tools and lustre.so are installed on the head node.
I have looked at the logs, but I have not found useful information to help identify the cause of the problem.
The increased memory usage could be related to the lock replay requests being continuously queued for enqueue:
...
00010000:00080000:26.0:1465394173.485502:0:6068:0:(ldlm_lib.c:2733:target_queue_recovery_request()) @@@ queue lock replay req req@ffff88017e947050 x1536423085514416/t0(0) o101->soaked-MDT0001-mdtlov_UUID@192.168.1.109@o2ib10:334/0 lens 328/0 e 0 to 0 dl 1465394179 ref 2 fl Interpret:/40/ffffffff rc 0/-1
00010000:00080000:27.0:1465394173.485521:0:6067:0:(ldlm_lib.c:2733:target_queue_recovery_request()) @@@ queue lock replay req req@ffff880247be9c50 x1536423085514432/t0(0) o101->soaked-MDT0001-mdtlov_UUID@192.168.1.109@o2ib10:334/0 lens 328/0 e 0 to 0 dl 1465394179 ref 2 fl Interpret:/40/ffffffff rc 0/-1
...
00010000:00080000:26.0:1465394179.502412:0:6068:0:(ldlm_lib.c:2733:target_queue_recovery_request()) @@@ queue lock replay req req@ffff88016a6c2450 x1536423085514928/t0(0) o101->soaked-MDT0001-mdtlov_UUID@192.168.1.109@o2ib10:340/0 lens 328/0 e 0 to 0 dl 1465394185 ref 2 fl Interpret:/40/ffffffff rc 0/-1
00010000:00080000:27.0:1465394179.502417:0:6067:0:(ldlm_lib.c:2733:target_queue_recovery_request()) @@@ queue lock replay req req@ffff8802828f5050 x1536423085514944/t0(0) o101->soaked-MDT0001-mdtlov_UUID@192.168.1.109@o2ib10:340/0 lens 328/0 e 0 to 0 dl 1465394185 ref 2 fl Interpret:/40/ffffffff rc 0/-1
...
00010000:00080000:27.0:1465394185.503466:0:6067:0:(ldlm_lib.c:2733:target_queue_recovery_request()) @@@ queue lock replay req req@ffff880141b1e850 x1536423085515456/t0(0) o101->soaked-MDT0001-mdtlov_UUID@192.168.1.109@o2ib10:346/0 lens 328/0 e 0 to 0 dl 1465394191 ref 2 fl Interpret:/40/ffffffff rc 0/-1
00010000:00080000:26.0:1465394185.503491:0:6068:0:(ldlm_lib.c:2733:target_queue_recovery_request()) @@@ queue lock replay req req@ffff88031ca74050 x1536423085515472/t0(0) o101->soaked-MDT0001-mdtlov_UUID@192.168.1.109@o2ib10:346/0 lens 328/0 e 0 to 0 dl 1465394191 ref 2 fl Interpret:/40/ffffffff rc 0/-1
...
00010000:00080000:27.0:1465394191.504477:0:6067:0:(ldlm_lib.c:2733:target_queue_recovery_request()) @@@ queue lock replay req req@ffff88023309b450 x1536423085516016/t0(0) o101->soaked-MDT0001-mdtlov_UUID@192.168.1.109@o2ib10:352/0 lens 328/0 e 0 to 0 dl 1465394197 ref 2 fl Interpret:/40/ffffffff rc 0/-1
00010000:00080000:10.0:1465394191.504487:0:6068:0:(ldlm_lib.c:2733:target_queue_recovery_request()) @@@ queue lock replay req req@ffff88023309bc50 x1536423085516032/t0(0) o101->soaked-MDT0001-mdtlov_UUID@192.168.1.109@o2ib10:352/0 lens 328/0 e 0 to 0 dl 1465394197 ref 2 fl Interpret:/40/ffffffff rc 0/-1
...
00010000:00080000:10.0:1465394209.507404:0:6068:0:(ldlm_lib.c:2733:target_queue_recovery_request()) @@@ queue lock replay req req@ffff880247b59850 x1536423085517600/t0(0) o101->soaked-MDT0001-mdtlov_UUID@192.168.1.109@o2ib10:370/0 lens 328/0 e 0 to 0 dl 1465394215 ref 2 fl Interpret:/40/ffffffff rc 0/-1
00010000:00080000:11.0:1465394209.507410:0:6067:0:(ldlm_lib.c:2733:target_queue_recovery_request()) @@@ queue lock replay req req@ffff88017e9e1c50 x1536423085517616/t0(0) o101->soaked-MDT0001-mdtlov_UUID@192.168.1.109@o2ib10:370/0 lens 328/0 e 0 to 0 dl 1465394215 ref 2 fl Interpret:/40/ffffffff rc 0/-1
The problem that the MDT can't be mounted (with error -114, -EALREADY) is the same problem described in LU-7794.
Is the stack trace of all processes available? It could help to find where the recovery is stuck.
Thanks!
We seem to have hit this again.
Build lustre-master 3424.
MDS crashes:
Kernel panic - not syncing: Attempted to kill init!
Pid: 1, comm: init Tainted: P           -- ------------    2.6.32-573.26.1.el6_lustre.x86_64 #1
Call Trace:
 [<ffffffff81539407>] ? panic+0xa7/0x16f
 [<ffffffff8107d0a7>] ? do_exit+0x867/0x870
 [<ffffffff8107d1b7>] ? sys_exit+0x17/0x20
 [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
Upon restart, the MDS fails to complete recovery and times out. During this time the soak controller halted lola-11 for failover. Attempting to abort recovery results in timeouts:
LustreError: 7532:0:(ldlm_lib.c:2576:target_stop_recovery_thread()) soaked-MDT0003: Aborting recovery
Lustre: 6032:0:(ldlm_lib.c:2025:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Lustre: 4563:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1471532976/real 1471532976] req@ffff8804070af9c0 x1543013122722384/t0(0) o38->soaked-MDT0002-osp-MDT0003@192.168.1.111@o2ib10:24/4 lens 520/544 e 0 to 1 dl 1471533032 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
Lustre: 4563:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
Aug 18 08:10:01 lola-10 TIME: Time stamp for console
INFO: task lctl:7532 blocked for more than 120 seconds.
      Tainted: P           -- ------------    2.6.32-573.26.1.el6_lustre.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lctl          D 0000000000000011     0  7532   7440 0x00000080
 ffff8804069dbb68 0000000000000086 0000000000000000 ffff880835ef3520
 ffff880835ef3558 ffff880835ef3bc0 000000d247916847 0000000000000000
 ffff880437394078 0000000100092ec2 ffff8804373945f8 ffff8804069dbfd8
Call Trace:
 [<ffffffff8153aa35>] schedule_timeout+0x215/0x2e0
 [<ffffffff8153a693>] wait_for_common+0x123/0x180
 [<ffffffff81067650>] ? default_wake_function+0x0/0x20
 [<ffffffff8153a7cd>] wait_for_completion+0x1d/0x20
 [<ffffffffa0970f80>] target_stop_recovery_thread+0x50/0xe0 [ptlrpc]
 [<ffffffffa10ebf22>] mdt_iocontrol+0x6a2/0xa80 [mdt]
 [<ffffffff812337cf>] ? security_inode_permission+0x1f/0x30
 [<ffffffffa07843ec>] class_handle_ioctl+0x15fc/0x20d0 [obdclass]
 [<ffffffff811a5998>] ? do_filp_open+0x798/0xd20
 [<ffffffffa07682ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
 [<ffffffff811a7f82>] vfs_ioctl+0x22/0xa0
 [<ffffffff811a8124>] do_vfs_ioctl+0x84/0x580
 [<ffffffff811a03f6>] ? final_putname+0x26/0x50
 [<ffffffff811a86a1>] sys_ioctl+0x81/0xa0
 [<ffffffff810e905e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
LustreError: 137-5: soaked-MDT0002_UUID: not available for connect from 192.168.1.131@o2ib100 (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 351 previous similar messages
INFO: task lctl:7532 blocked for more than 120 seconds.
      Tainted: P           -- ------------    2.6.32-573.26.1.el6_lustre.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lctl          D 0000000000000011     0  7532   7440 0x00000080
 ffff8804069dbb68 0000000000000086 0000000000000000 ffff880835ef3520
 ffff880835ef3558 ffff880835ef3bc0 000000d247916847 0000000000000000
 ffff880437394078 0000000100092ec2 ffff8804373945f8 ffff8804069dbfd8
Call Trace:
 [<ffffffff8153aa35>] schedule_timeout+0x215/0x2e0
 [<ffffffff8153a693>] wait_for_common+0x123/0x180
 [<ffffffff81067650>] ? default_wake_function+0x0/0x20
 [<ffffffff8153a7cd>] wait_for_completion+0x1d/0x20
 [<ffffffffa0970f80>] target_stop_recovery_thread+0x50/0xe0 [ptlrpc]
 [<ffffffffa10ebf22>] mdt_iocontrol+0x6a2/0xa80 [mdt]
 [<ffffffff812337cf>] ? security_inode_permission+0x1f/0x30
 [<ffffffffa07843ec>] class_handle_ioctl+0x15fc/0x20d0 [obdclass]
 [<ffffffff811a5998>] ? do_filp_open+0x798/0xd20
 [<ffffffffa07682ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
 [<ffffffff811a7f82>] vfs_ioctl+0x22/0xa0
 [<ffffffff811a8124>] do_vfs_ioctl+0x84/0x580
 [<ffffffff811a03f6>] ? final_putname+0x26/0x50
 [<ffffffff811a86a1>] sys_ioctl+0x81/0xa0
 [<ffffffff810e905e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
INFO: task lctl:7532 blocked for more than 120 seconds.
At this point the systems were rebooted.
After the reboot, the MDT could not be mounted, as the MGS configuration step failed with a duplicate export:
Aug 18 11:24:44 lola-10 kernel: LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. quota=on. Opts:
Aug 18 11:24:45 lola-10 kernel: LustreError: 6045:0:(tgt_lastrcvd.c:1474:tgt_clients_data_init()) soaked-MDT0002: duplicate export for client generation 1
Aug 18 11:24:45 lola-10 kernel: LustreError: 6045:0:(obd_config.c:578:class_setup()) setup soaked-MDT0002 failed (-114)
Aug 18 11:24:45 lola-10 kernel: LustreError: 6045:0:(obd_config.c:1671:class_config_llog_handler()) MGC192.168.1.108@o2ib10: cfg command failed: rc = -114
Aug 18 11:24:45 lola-10 kernel: Lustre: cmd=cf003 0:soaked-MDT0002 1:soaked-MDT0002_UUID 2:2 3:soaked-MDT0002-mdtlov 4:f
Aug 18 11:24:45 lola-10 kernel:
Aug 18 11:24:45 lola-10 kernel: LustreError: 15c-8: MGC192.168.1.108@o2ib10: The configuration from log 'soaked-MDT0002' failed (-114). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Aug 18 11:24:45 lola-10 kernel: LustreError: 5951:0:(obd_mount_server.c:1352:server_start_targets()) failed to start server soaked-MDT0002: -114
Aug 18 11:24:45 lola-10 kernel: LustreError: 5951:0:(obd_mount_server.c:1844:server_fill_super()) Unable to start targets: -114
Aug 18 11:24:45 lola-10 kernel: LustreError: 5951:0:(obd_config.c:625:class_cleanup()) Device 4 not setup
Aug 18 11:24:46 lola-10 kernel: Lustre: server umount soaked-MDT0002 complete
Aug 18 11:24:46 lola-10 kernel: LustreError: 5951:0:(obd_mount.c:1453:lustre_fill_super()) Unable to mount (-114)
Triggering an abort of the recovery hangs:
LustreError: 139015:0:(mdt_handler.c:5896:mdt_iocontrol()) soaked-MDT0002: Aborting recovery for device
LustreError: 139015:0:(ldlm_lib.c:2577:target_stop_recovery_thread()) soaked-MDT0002: Aborting recovery
INFO: task lctl:139015 blocked for more than 120 seconds.
      Tainted: P           -- ------------    2.6.32-573.26.1.el6_lustre.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lctl          D 0000000000000013     0 139015 129872 0x00000080
 ffff8804187cfb68 0000000000000082 0000000000000000 ffff880431d7eab0
 ffff880431d7eae8 ffff880431d7f150 00ff8804187cfb18 0000000000000000
 ffff8803fc2c0078 ffff880038775a00 ffff8803fc2c05f8 ffff8804187cffd8
Call Trace:
 [<ffffffff81539b0e>] ? thread_return+0x4e/0x7d0
 [<ffffffff8153aa35>] schedule_timeout+0x215/0x2e0
 [<ffffffff8153a693>] wait_for_common+0x123/0x180
 [<ffffffff81067650>] ? default_wake_function+0x0/0x20
 [<ffffffff8153a7cd>] wait_for_completion+0x1d/0x20
 [<ffffffffa0ae3e80>] target_stop_recovery_thread+0x50/0xe0 [ptlrpc]
 [<ffffffffa1223c02>] mdt_iocontrol+0x6b2/0x920 [mdt]
 [<ffffffff812337cf>] ? security_inode_permission+0x1f/0x30
 [<ffffffffa08d4edc>] class_handle_ioctl+0x15fc/0x20d0 [obdclass]
 [<ffffffff811a5998>] ? do_filp_open+0x798/0xd20
 [<ffffffffa08b92ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
 [<ffffffff811a7f82>] vfs_ioctl+0x22/0xa0
 [<ffffffff811a8124>] do_vfs_ioctl+0x84/0x580
 [<ffffffff811a03f6>] ? final_putname+0x26/0x50
 [<ffffffff811a86a1>] sys_ioctl+0x81/0xa0
 [<ffffffff810e905e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/23111
Subject: LU-8250 osd: add journal info check
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c8bc88c85d6f273b18f7741cc9c0cdc1bcfca4d5