[LU-8250] MDT recovery stalled on secondary node Created: 08/Jun/16 Updated: 26/Oct/16 Resolved: 26/Oct/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.9.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Frank Heckes (Inactive) | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | soak |
| Environment: | lola |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
The error happened during soak testing of build '20160601' (see: https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160601). DNE is enabled. The MDTs were formatted using ldiskfs, the OSTs using zfs. Each MDS hosts one MDT, and the MDSes are configured in an active-active failover configuration. The issue might be related to the following events:
2nd Event:
|
| Comments |
| Comment by Frank Heckes (Inactive) [ 08/Jun/16 ] |
|
Triggering an abort of the recovery hangs:
LustreError: 139015:0:(mdt_handler.c:5896:mdt_iocontrol()) soaked-MDT0002: Aborting recovery for device
LustreError: 139015:0:(ldlm_lib.c:2577:target_stop_recovery_thread()) soaked-MDT0002: Aborting recovery
INFO: task lctl:139015 blocked for more than 120 seconds.
Tainted: P -- ------------ 2.6.32-573.26.1.el6_lustre.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lctl D 0000000000000013 0 139015 129872 0x00000080
ffff8804187cfb68 0000000000000082 0000000000000000 ffff880431d7eab0
ffff880431d7eae8 ffff880431d7f150 00ff8804187cfb18 0000000000000000
ffff8803fc2c0078 ffff880038775a00 ffff8803fc2c05f8 ffff8804187cffd8
Call Trace:
[<ffffffff81539b0e>] ? thread_return+0x4e/0x7d0
[<ffffffff8153aa35>] schedule_timeout+0x215/0x2e0
[<ffffffff8153a693>] wait_for_common+0x123/0x180
[<ffffffff81067650>] ? default_wake_function+0x0/0x20
[<ffffffff8153a7cd>] wait_for_completion+0x1d/0x20
[<ffffffffa0ae3e80>] target_stop_recovery_thread+0x50/0xe0 [ptlrpc]
[<ffffffffa1223c02>] mdt_iocontrol+0x6b2/0x920 [mdt]
[<ffffffff812337cf>] ? security_inode_permission+0x1f/0x30
[<ffffffffa08d4edc>] class_handle_ioctl+0x15fc/0x20d0 [obdclass]
[<ffffffff811a5998>] ? do_filp_open+0x798/0xd20
[<ffffffffa08b92ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
[<ffffffff811a7f82>] vfs_ioctl+0x22/0xa0
[<ffffffff811a8124>] do_vfs_ioctl+0x84/0x580
[<ffffffff811a03f6>] ? final_putname+0x26/0x50
[<ffffffff811a86a1>] sys_ioctl+0x81/0xa0
[<ffffffff810e905e>] ? __audit_syscall_exit+0x25e/0x290
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
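For illustration only, a minimal user-space sketch (plain pthreads, hypothetical names, not the Lustre code) of the hang pattern in the trace above: target_stop_recovery_thread() waits on a completion that only the recovery thread can post, so if the recovery thread is itself stuck, the lctl ioctl never returns.

/* Hedged model of the hang: the "lctl" caller waits for a completion that
 * the recovery thread never posts because it is blocked elsewhere. */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

static sem_t recovery_done;              /* the "completion" */
static volatile int replay_finished;     /* never set in this model */

static void *recovery_thread(void *arg)
{
    /* models target_recovery_overseer(): stuck waiting for lock replay
     * that never completes, so the completion below is never posted */
    while (!replay_finished)
        sleep(1);
    sem_post(&recovery_done);
    return NULL;
}

static void stop_recovery_thread(void)
{
    /* models target_stop_recovery_thread(): wait_for_completion() */
    sem_wait(&recovery_done);            /* lctl blocks here forever */
}

int main(void)
{
    pthread_t tid;

    sem_init(&recovery_done, 0, 0);
    pthread_create(&tid, NULL, recovery_thread, NULL);
    stop_recovery_thread();
    printf("never printed\n");
    return 0;
}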
|
| Comment by Cliff White (Inactive) [ 18/Aug/16 ] |
|
We seem to have hit this again.

Kernel panic - not syncing: Attempted to kill init!
Pid: 1, comm: init Tainted: P -- ------------ 2.6.32-573.26.1.el6_lustre.x86_64 #1
Call Trace:
[<ffffffff81539407>] ? panic+0xa7/0x16f
[<ffffffff8107d0a7>] ? do_exit+0x867/0x870
[<ffffffff8107d1b7>] ? sys_exit+0x17/0x20
[<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
Copyright(c) 2009 - 2012 Intel Corporation. All rights reserved.

Upon restart, the MDS fails to complete recovery and times out. During this time the soak controller halted lola-11 for failover. Attempting to abort recovery results in timeouts:

LustreError: 7532:0:(ldlm_lib.c:2576:target_stop_recovery_thread()) soaked-MDT0003: Aborting recovery
Lustre: 6032:0:(ldlm_lib.c:2025:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Lustre: 4563:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1471532976/real 1471532976] req@ffff8804070af9c0 x1543013122722384/t0(0) o38->soaked-MDT0002-osp-MDT0003@192.168.1.111@o2ib10:24/4 lens 520/544 e 0 to 1 dl 1471533032 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
Lustre: 4563:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
Aug 18 08:10:01 lola-10 TIME: Time stamp for console
INFO: task lctl:7532 blocked for more than 120 seconds.
Tainted: P -- ------------ 2.6.32-573.26.1.el6_lustre.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lctl D 0000000000000011 0 7532 7440 0x00000080
ffff8804069dbb68 0000000000000086 0000000000000000 ffff880835ef3520
ffff880835ef3558 ffff880835ef3bc0 000000d247916847 0000000000000000
ffff880437394078 0000000100092ec2 ffff8804373945f8 ffff8804069dbfd8
Call Trace:
[<ffffffff8153aa35>] schedule_timeout+0x215/0x2e0
[<ffffffff8153a693>] wait_for_common+0x123/0x180
[<ffffffff81067650>] ? default_wake_function+0x0/0x20
[<ffffffff8153a7cd>] wait_for_completion+0x1d/0x20
[<ffffffffa0970f80>] target_stop_recovery_thread+0x50/0xe0 [ptlrpc]
[<ffffffffa10ebf22>] mdt_iocontrol+0x6a2/0xa80 [mdt]
[<ffffffff812337cf>] ? security_inode_permission+0x1f/0x30
[<ffffffffa07843ec>] class_handle_ioctl+0x15fc/0x20d0 [obdclass]
[<ffffffff811a5998>] ? do_filp_open+0x798/0xd20
[<ffffffffa07682ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
[<ffffffff811a7f82>] vfs_ioctl+0x22/0xa0
[<ffffffff811a8124>] do_vfs_ioctl+0x84/0x580
[<ffffffff811a03f6>] ? final_putname+0x26/0x50
[<ffffffff811a86a1>] sys_ioctl+0x81/0xa0
[<ffffffff810e905e>] ? __audit_syscall_exit+0x25e/0x290
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
LustreError: 137-5: soaked-MDT0002_UUID: not available for connect from 192.168.1.131@o2ib100 (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 351 previous similar messages
INFO: task lctl:7532 blocked for more than 120 seconds.
Tainted: P -- ------------ 2.6.32-573.26.1.el6_lustre.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lctl D 0000000000000011 0 7532 7440 0x00000080
ffff8804069dbb68 0000000000000086 0000000000000000 ffff880835ef3520
ffff880835ef3558 ffff880835ef3bc0 000000d247916847 0000000000000000
ffff880437394078 0000000100092ec2 ffff8804373945f8 ffff8804069dbfd8
Call Trace:
[<ffffffff8153aa35>] schedule_timeout+0x215/0x2e0
[<ffffffff8153a693>] wait_for_common+0x123/0x180
[<ffffffff81067650>] ? default_wake_function+0x0/0x20
[<ffffffff8153a7cd>] wait_for_completion+0x1d/0x20
[<ffffffffa0970f80>] target_stop_recovery_thread+0x50/0xe0 [ptlrpc]
[<ffffffffa10ebf22>] mdt_iocontrol+0x6a2/0xa80 [mdt]
[<ffffffff812337cf>] ? security_inode_permission+0x1f/0x30
[<ffffffffa07843ec>] class_handle_ioctl+0x15fc/0x20d0 [obdclass]
[<ffffffff811a5998>] ? do_filp_open+0x798/0xd20
[<ffffffffa07682ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
[<ffffffff811a7f82>] vfs_ioctl+0x22/0xa0
[<ffffffff811a8124>] do_vfs_ioctl+0x84/0x580
[<ffffffff811a03f6>] ? final_putname+0x26/0x50
[<ffffffff811a86a1>] sys_ioctl+0x81/0xa0
[<ffffffff810e905e>] ? __audit_syscall_exit+0x25e/0x290
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
INFO: task lctl:7532 blocked for more than 120 seconds.

At this point the systems were rebooted.

Aug 18 11:24:44 lola-10 kernel: LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. quota=on. Opts:
Aug 18 11:24:45 lola-10 kernel: LustreError: 6045:0:(tgt_lastrcvd.c:1474:tgt_clients_data_init()) soaked-MDT0002: duplicate export for client generation 1
Aug 18 11:24:45 lola-10 kernel: LustreError: 6045:0:(obd_config.c:578:class_setup()) setup soaked-MDT0002 failed (-114)
Aug 18 11:24:45 lola-10 kernel: LustreError: 6045:0:(obd_config.c:1671:class_config_llog_handler()) MGC192.168.1.108@o2ib10: cfg command failed: rc = -114
Aug 18 11:24:45 lola-10 kernel: Lustre: cmd=cf003 0:soaked-MDT0002 1:soaked-MDT0002_UUID 2:2 3:soaked-MDT0002-mdtlov 4:f
Aug 18 11:24:45 lola-10 kernel:
Aug 18 11:24:45 lola-10 kernel: LustreError: 15c-8: MGC192.168.1.108@o2ib10: The configuration from log 'soaked-MDT0002' failed (-114). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Aug 18 11:24:45 lola-10 kernel: LustreError: 5951:0:(obd_mount_server.c:1352:server_start_targets()) failed to start server soaked-MDT0002: -114
Aug 18 11:24:45 lola-10 kernel: LustreError: 5951:0:(obd_mount_server.c:1844:server_fill_super()) Unable to start targets: -114
Aug 18 11:24:45 lola-10 kernel: LustreError: 5951:0:(obd_config.c:625:class_cleanup()) Device 4 not setup
Aug 18 11:24:46 lola-10 kernel: Lustre: server umount soaked-MDT0002 complete
Aug 18 11:24:46 lola-10 kernel: LustreError: 5951:0:(obd_mount.c:1453:lustre_fill_super()) Unable to mount (-114) |
| Comment by Peter Jones [ 19/Aug/16 ] |
|
Hongchao, could you please look into this issue? Thanks, Peter |
| Comment by Hongchao Zhang [ 23/Aug/16 ] |
|
I have looked at the logs, but I have not found useful information to help identify the cause of the problem. The increased memory usage could be related to the continuous lock replay requests being queued:

...
00010000:00080000:26.0:1465394173.485502:0:6068:0:(ldlm_lib.c:2733:target_queue_recovery_request()) @@@ queue lock replay req req@ffff88017e947050 x1536423085514416/t0(0) o101->soaked-MDT0001-mdtlov_UUID@192.168.1.109@o2ib10:334/0 lens 328/0 e 0 to 0 dl 1465394179 ref 2 fl Interpret:/40/ffffffff rc 0/-1
00010000:00080000:27.0:1465394173.485521:0:6067:0:(ldlm_lib.c:2733:target_queue_recovery_request()) @@@ queue lock replay req req@ffff880247be9c50 x1536423085514432/t0(0) o101->soaked-MDT0001-mdtlov_UUID@192.168.1.109@o2ib10:334/0 lens 328/0 e 0 to 0 dl 1465394179 ref 2 fl Interpret:/40/ffffffff rc 0/-1
...
00010000:00080000:26.0:1465394179.502412:0:6068:0:(ldlm_lib.c:2733:target_queue_recovery_request()) @@@ queue lock replay req req@ffff88016a6c2450 x1536423085514928/t0(0) o101->soaked-MDT0001-mdtlov_UUID@192.168.1.109@o2ib10:340/0 lens 328/0 e 0 to 0 dl 1465394185 ref 2 fl Interpret:/40/ffffffff rc 0/-1
00010000:00080000:27.0:1465394179.502417:0:6067:0:(ldlm_lib.c:2733:target_queue_recovery_request()) @@@ queue lock replay req req@ffff8802828f5050 x1536423085514944/t0(0) o101->soaked-MDT0001-mdtlov_UUID@192.168.1.109@o2ib10:340/0 lens 328/0 e 0 to 0 dl 1465394185 ref 2 fl Interpret:/40/ffffffff rc 0/-1
...
00010000:00080000:27.0:1465394185.503466:0:6067:0:(ldlm_lib.c:2733:target_queue_recovery_request()) @@@ queue lock replay req req@ffff880141b1e850 x1536423085515456/t0(0) o101->soaked-MDT0001-mdtlov_UUID@192.168.1.109@o2ib10:346/0 lens 328/0 e 0 to 0 dl 1465394191 ref 2 fl Interpret:/40/ffffffff rc 0/-1
00010000:00080000:26.0:1465394185.503491:0:6068:0:(ldlm_lib.c:2733:target_queue_recovery_request()) @@@ queue lock replay req req@ffff88031ca74050 x1536423085515472/t0(0) o101->soaked-MDT0001-mdtlov_UUID@192.168.1.109@o2ib10:346/0 lens 328/0 e 0 to 0 dl 1465394191 ref 2 fl Interpret:/40/ffffffff rc 0/-1
...
00010000:00080000:27.0:1465394191.504477:0:6067:0:(ldlm_lib.c:2733:target_queue_recovery_request()) @@@ queue lock replay req req@ffff88023309b450 x1536423085516016/t0(0) o101->soaked-MDT0001-mdtlov_UUID@192.168.1.109@o2ib10:352/0 lens 328/0 e 0 to 0 dl 1465394197 ref 2 fl Interpret:/40/ffffffff rc 0/-1
00010000:00080000:10.0:1465394191.504487:0:6068:0:(ldlm_lib.c:2733:target_queue_recovery_request()) @@@ queue lock replay req req@ffff88023309bc50 x1536423085516032/t0(0) o101->soaked-MDT0001-mdtlov_UUID@192.168.1.109@o2ib10:352/0 lens 328/0 e 0 to 0 dl 1465394197 ref 2 fl Interpret:/40/ffffffff rc 0/-1
...
00010000:00080000:10.0:1465394209.507404:0:6068:0:(ldlm_lib.c:2733:target_queue_recovery_request()) @@@ queue lock replay req req@ffff880247b59850 x1536423085517600/t0(0) o101->soaked-MDT0001-mdtlov_UUID@192.168.1.109@o2ib10:370/0 lens 328/0 e 0 to 0 dl 1465394215 ref 2 fl Interpret:/40/ffffffff rc 0/-1
00010000:00080000:11.0:1465394209.507410:0:6067:0:(ldlm_lib.c:2733:target_queue_recovery_request()) @@@ queue lock replay req req@ffff88017e9e1c50 x1536423085517616/t0(0) o101->soaked-MDT0001-mdtlov_UUID@192.168.1.109@o2ib10:370/0 lens 328/0 e 0 to 0 dl 1465394215 ref 2 fl Interpret:/40/ffffffff rc 0/-1

The problem that the MDT can't be mounted (with error -114, -EALREADY) is the same problem described in
Are the stack traces of all processes available? They could help find where the recovery is stuck. |
| Comment by Frank Heckes (Inactive) [ 23/Aug/16 ] |
|
Hi Hongchao, |
| Comment by Di Wang [ 21/Sep/16 ] |
|
Hmm, right now we do not fail the connection between MDTs, to make sure the FS will not be silently corrupted.

Jun 7 08:40:11 lola-11 kernel: Lustre: soaked-MDT0002: Client f6c679b2-f46e-20c5-89f7-51193ed93a53 (at 192.168.1.121@o2ib100) reconnecting, waiting for 20 clients in recovery for 1:52
Jun 7 08:40:11 lola-11 kernel: Lustre: Skipped 99 previous similar messages
Jun 7 08:40:18 lola-11 sshd[7507]: Accepted publickey for root from 10.4.0.116 port 47943 ssh2
Jun 7 08:40:18 lola-11 sshd[7507]: pam_unix(sshd:session): session opened for user root by (uid=0)
Jun 7 08:40:18 lola-11 sshd[7507]: Received disconnect from 10.4.0.116: 11: disconnected by user
Jun 7 08:40:18 lola-11 sshd[7507]: pam_unix(sshd:session): session closed for user root
Jun 7 08:40:33 lola-11 sshd[7530]: Accepted publickey for root from 10.4.0.116 port 47963 ssh2
Jun 7 08:40:33 lola-11 sshd[7530]: pam_unix(sshd:session): session opened for user root by (uid=0)
Jun 7 08:40:33 lola-11 sshd[7530]: Received disconnect from 10.4.0.116: 11: disconnected by user
Jun 7 08:40:33 lola-11 sshd[7530]: pam_unix(sshd:session): session closed for user root
Jun 7 08:40:47 lola-11 kernel: Lustre: soaked-MDT0003: recovery is timed out, evict stale exports
Jun 7 08:40:47 lola-11 kernel: Lustre: 6183:0:(ldlm_lib.c:2016:target_recovery_overseer()) soaked-MDT0003 recovery is aborted by hard timeout
Jun 7 08:40:47 lola-11 kernel: Lustre: 6183:0:(ldlm_lib.c:2026:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Jun 7 08:40:47 lola-11 kernel: Lustre: soaked-MDT0003: disconnecting 1 stale clients

I just glanced at the log a bit, and it seems MDT0002 and MDT0003 were recovering at the same time. I thought the soak test would wait for MDT recovery to finish before failing another MDT, no? I might be missing something. Thanks. |
| Comment by Mikhail Pershin [ 30/Sep/16 ] |
|
Frank, is the patch from http://review.whamcloud.com/#/c/13726/ applied in the build or not? If not, I'd try with it applied. |
| Comment by Di Wang [ 12/Oct/16 ] |
|
It turns out one MDS is stuck here:

mdt00_016 S 0000000000000005 0 6217 2 0x00000080
ffff880404c37760 0000000000000046 0000000000000000 ffff880404c37724
0000000000000000 ffff88043fe82800 00000db7708e9e5b 0000000000000286
ffff880404c37700 ffffffff81089e8c ffff8804066d9068 ffff880404c37fd8
Call Trace:
[<ffffffff81089e8c>] ? lock_timer_base+0x3c/0x70
[<ffffffff8153a9b2>] schedule_timeout+0x192/0x2e0
[<ffffffff81089fa0>] ? process_timeout+0x0/0x10
[<ffffffffa0b09821>] ptlrpc_set_wait+0x321/0x960 [ptlrpc]
[<ffffffffa0afe980>] ? ptlrpc_interrupted_set+0x0/0x120 [ptlrpc]
[<ffffffff81067650>] ? default_wake_function+0x0/0x20
[<ffffffffa0b15d05>] ? lustre_msg_set_jobid+0xf5/0x130 [ptlrpc]
[<ffffffffa0b09ee1>] ptlrpc_queue_wait+0x81/0x220 [ptlrpc]
[<ffffffffa13d4bc2>] osp_remote_sync+0xf2/0x1e0 [osp]
[<ffffffffa13ba821>] osp_xattr_get+0x681/0xf90 [osp]
[<ffffffffa12eb5b5>] lod_xattr_get+0x185/0x760 [lod]
[<ffffffffa134f917>] mdd_links_read+0x117/0x270 [mdd]
[<ffffffffa1364996>] ? mdd_attr_set_internal+0xd6/0x2c0 [mdd]
[<ffffffffa13515bc>] mdd_linkea_prepare+0x3ec/0x4d0 [mdd]
[<ffffffffa13573f2>] mdd_link+0xde2/0x10c0 [mdd]
[<ffffffffa1222a8f>] mdt_reint_link+0xb2f/0xce0 [mdt]
[<ffffffff81299b7a>] ? strlcpy+0x4a/0x60
[<ffffffffa12178cf>] ? ucred_set_jobid+0x5f/0x70 [mdt]
[<ffffffffa121b04d>] mdt_reint_rec+0x5d/0x200 [mdt]
[<ffffffffa1205d5b>] mdt_reint_internal+0x62b/0xa50 [mdt]
[<ffffffffa120662b>] mdt_reint+0x6b/0x120 [mdt]
[<ffffffffa0b790cc>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
[<ffffffffa0b25821>] ptlrpc_main+0xd31/0x1800 [ptlrpc]
[<ffffffff8106ee50>] ? pick_next_task_fair+0xd0/0x130
[<ffffffff81539896>] ? schedule+0x176/0x3a0
[<ffffffffa0b24af0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
[<ffffffff810a138e>] kthread+0x9e/0xc0
[<ffffffff8100c28a>] child_rip+0xa/0x20
[<ffffffff810a12f0>] ? kthread+0x0/0xc0
[<ffffffff8100c280>] ? child_rip+0x0/0x20

Other threads are then left waiting for the journal:

mdt_out01_008 D 000000000000000b 0 6281 2 0x00000080
ffff8804030a7a40 0000000000000046 0000000000000000 ffff88079c63b5c0
ffff8807fbe51ad8 ffff88082d91f400 00000bef4b83696a ffff880828bdba80
ffff8804030a7a10 0000000100c3b96b ffff8803fbd31068 ffff8804030a7fd8
Call Trace:
[<ffffffffa0fe7fca>] start_this_handle+0x25a/0x480 [jbd2]
[<ffffffff811781fb>] ? cache_alloc_refill+0x15b/0x240
[<ffffffff810a1820>] ? autoremove_wake_function+0x0/0x40
[<ffffffffa0fe83d5>] jbd2_journal_start+0xb5/0x100 [jbd2]
[<ffffffffa1036a36>] ldiskfs_journal_start_sb+0x56/0xe0 [ldiskfs]
[<ffffffffa10870f1>] osd_trans_start+0x1e1/0x430 [osd_ldiskfs]
[<ffffffffa0b7c53c>] out_tx_end+0x9c/0x5d0 [ptlrpc]
[<ffffffffa0b81ec9>] out_handle+0x11d9/0x18d0 [ptlrpc]
[<ffffffff8105e9b6>] ? enqueue_task+0x66/0x80
[<ffffffff8105ab8d>] ? check_preempt_curr+0x6d/0x90
[<ffffffffa080bc5a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
[<ffffffff8153afce>] ? mutex_lock+0x1e/0x50
[<ffffffffa0b71eda>] ? req_can_reconstruct+0x6a/0x120 [ptlrpc]
[<ffffffffa0b790cc>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
[<ffffffffa0b25821>] ptlrpc_main+0xd31/0x1800 [ptlrpc]
[<ffffffff8106ee50>] ? pick_next_task_fair+0xd0/0x130
[<ffffffff81539896>] ? schedule+0x176/0x3a0
[<ffffffffa0b24af0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
[<ffffffff810a138e>] kthread+0x9e/0xc0
[<ffffffff8100c28a>] child_rip+0xa/0x20
[<ffffffff810a12f0>] ? kthread+0x0/0xc0
[<ffffffff8100c280>] ? child_rip+0x0/0x20

As a result, other MDTs cannot reconnect to this MDT:

LustreError: 11-0: soaked-MDT0001-osp-MDT0003: operation mds_connect to node 192.168.1.109@o2ib10 failed: rc = -114
LustreError: 11-0: soaked-MDT0001-osp-MDT0003: operation mds_connect to node 192.168.1.109@o2ib10 failed: rc = -114
LustreError: 11-0: soaked-MDT0001-osp-MDT0003: operation mds_connect to node 192.168.1.109@o2ib10 failed: rc = -114

That leaves the whole recovery stuck. I will cook a fix. |
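To make the dependency cycle described above easier to follow, here is a hedged, self-contained user-space sketch (plain pthreads, hypothetical names, not Lustre code): one thread keeps a journal handle open while it waits synchronously for a reply, while the thread that would let things progress must first start its own journal handle, so neither makes progress and incoming requests are never serviced.

/* Simplified single-process model of the stall: the cross-server RPC
 * dependency is collapsed into two local threads for illustration. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t journal = PTHREAD_MUTEX_INITIALIZER; /* "open handle" slot */
static pthread_mutex_t lock    = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  reply   = PTHREAD_COND_INITIALIZER;
static int reply_arrived;

/* models the mdt_out thread: needs a journal handle before it can reply */
static void *remote_update_handler(void *arg)
{
    pthread_mutex_lock(&journal);        /* blocks: handle already held in main */
    pthread_mutex_lock(&lock);
    reply_arrived = 1;
    pthread_cond_signal(&reply);
    pthread_mutex_unlock(&lock);
    pthread_mutex_unlock(&journal);
    return NULL;
}

int main(void)
{
    pthread_t tid;

    /* models mdt00_016: opens a journal handle ... */
    pthread_mutex_lock(&journal);
    pthread_create(&tid, NULL, remote_update_handler, NULL);

    /* ... then waits synchronously (ptlrpc_set_wait analogue) while the
     * handle is still open; the reply can therefore never arrive */
    pthread_mutex_lock(&lock);
    while (!reply_arrived)
        pthread_cond_wait(&reply, &lock);
    pthread_mutex_unlock(&lock);

    pthread_mutex_unlock(&journal);
    printf("never reached: recovery would be stuck here\n");
    return 0;
}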
| Comment by Gerrit Updater [ 12/Oct/16 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/23096 |
| Comment by Andreas Dilger [ 12/Oct/16 ] |
|
It would be useful to add a check for an open journal handle in the declare code and in the ptlrpc code to detect any other code paths like this. That can be done relatively easily with ldiskfs by checking current->journal_info in osd_trans_declare_op() and ptlrpc_set_wait(). |
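A sketch of the shape such a check could take, assuming it lived inside the Lustre tree (osd-ldiskfs and ptlrpc); the helper name is hypothetical and this is not a tested patch. current->journal_info is set by jbd2 while a handle is open, so it can catch a thread that is about to block (or declare a new transaction) while already inside a transaction.

/* Hedged kernel-style sketch only; not the actual fix. */
#include <linux/sched.h>        /* current, task_struct::journal_info */
#include <libcfs/libcfs.h>      /* LASSERTF() */

static inline void check_no_open_journal_handle(const char *where)
{
        /* intended call sites: osd_trans_declare_op() and ptlrpc_set_wait() */
        LASSERTF(current->journal_info == NULL,
                 "%s called with journal handle %p still open\n",
                 where, current->journal_info);
}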
| Comment by Gerrit Updater [ 12/Oct/16 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/23111 |
| Comment by Di Wang [ 13/Oct/16 ] |
|
There is another panic with patch 23111. Since this ticket is a 2.9 blocker, I will create a new ticket (LU-8704) for it. |
| Comment by Joseph Gmitter (Inactive) [ 18/Oct/16 ] |
|
http://review.whamcloud.com/23111 has been moved to LU-8704 per discussion with Di. |
| Comment by Gerrit Updater [ 26/Oct/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/23096/ |
| Comment by Peter Jones [ 26/Oct/16 ] |
|
Landed for 2.9 |