Details
- Type: Bug
- Resolution: Fixed
- Priority: Blocker
- Fix Version/s: Lustre 2.9.0
- Environment: lola, build: commit aa84fbc8165f526dae4bd824a48c186c3ac2f639 + patches
- Severity: 3
Description
The error happened during soak testing of build '20160601' (see: https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160601). DNE is enabled. The MDTs have been formatted using ldiskfs, the OSTs using zfs. Each MDS hosts one MDT, and the MDSes are configured in an active-active failover configuration.
The issue might be related to LU-7848, although that change is part of the build under test.
The error results in the oom-killer being triggered, which is documented in LU-7836; this ticket might be a duplicate of LU-7836.
Events:
1st Event:
- 2016-06-03 11:31:10 - failover resource of lola-10 (MDT-2) --> lola-11
- 2016-06-03 11:36:37 - ... soaked-MDT0002 mounted successfully on lola-11
- until 2016-06-04 00:44 - soaked-MDT0002 remained in status 'RECOVERING'
- 2016-06-04 00:44:52 - lola-11 crashed with oom-killer
2nd Event:
- 2016-06-07 08:34:06,621 triggering fault mds_failover lola-10 (MDT-2) --> lola-11
- 2016-06-07 08:38:42 - Mounting soaked-MDT0002 on lola-11
- since 2016-06-07 08:39:32,155 - waiting for recovery to complete
- memory resources are nearly exhausted:
[root@lola-11 ~]# date
Wed Jun 8 07:59:49 PDT 2016
[root@lola-11 ~]# collectl -sm --verbose
waiting for 1 second sample...

# MEMORY SUMMARY
#<-------------------------------Physical Memory--------------------------------------><-----------Swap------------><-------Paging------>
#   Total    Used    Free    Buff  Cached    Slab  Mapped    Anon  Commit  Locked Inact Total  Used  Free   In  Out Fault MajFt   In  Out
   32006M  30564M   1441M 127144K 676256K  28701M  16196K  69072K 201740K   5008K  509M   15G     0   15G    0    0    28     0    0    8
   32006M  30565M   1441M 127144K 676256K  28701M  16196K  69072K 201740K   5008K  509M   15G     0   15G    0    0    63     0    0    4
   32006M  30565M   1441M 127144K 676256K  28701M  16196K  69072K 201740K   5008K  509M   15G     0   15G    0    0     1     0    0    0
   32006M  30564M   1441M 127144K 676256K  28701M  16196K  69072K 201740K   5008K  509M   15G     0   15G    0    0    17     0    0    0
Attached files:
- 1st event only: saved crash dump file to lhn.hpdd.intel.com:/var/crashdumps/lu-7836/lola-11/127.0.0.1-2016-06-04-00:44:52
- 2nd event only: kernel debug log of lola-11, dmesg
- Both events: messages, console logs
Activity
It would be useful to add a check for an open journal handle in the declare code and in the ptlrpc code to detect any other code paths like this. That can be done relatively easily with ldiskfs: check current->journal_info in osd_trans_declare_op() and ptlrpc_set_wait().
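For illustration only (this is not the actual patch), such a check might look like the following minimal sketch; the helper name is hypothetical, and the two call sites mirror the suggestion above:

/*
 * Sketch of the suggested debug check: warn if the current thread still
 * holds an open jbd2/ldiskfs journal handle.  jbd2 stores the running
 * handle in current->journal_info, so a non-NULL value inside
 * osd_trans_declare_op() or ptlrpc_set_wait() means we are declaring a new
 * operation or waiting on an RPC while a transaction is still open.
 */
#include <linux/sched.h>
#include <linux/kernel.h>

static inline void lu8250_check_journal_info(const char *caller)
{
	/* non-NULL while a jbd2 handle started by this thread is running */
	if (unlikely(current->journal_info != NULL))
		WARN_ONCE(1, "%s: called with journal handle %p still open\n",
			  caller, current->journal_info);
}

/* hypothetical call sites:
 *   osd_trans_declare_op():  lu8250_check_journal_info(__func__);
 *   ptlrpc_set_wait():       lu8250_check_journal_info(__func__);
 */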
wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/23096
Subject: LU-8250 mdd: move linkea prepare out of transaction.
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b67d62eefef30fbb3365da3382021820985b5e39
It turns out one MDS service thread is stuck here:
mdt00_016     S 0000000000000005     0  6217      2 0x00000080
 ffff880404c37760 0000000000000046 0000000000000000 ffff880404c37724
 0000000000000000 ffff88043fe82800 00000db7708e9e5b 0000000000000286
 ffff880404c37700 ffffffff81089e8c ffff8804066d9068 ffff880404c37fd8
Call Trace:
 [<ffffffff81089e8c>] ? lock_timer_base+0x3c/0x70
 [<ffffffff8153a9b2>] schedule_timeout+0x192/0x2e0
 [<ffffffff81089fa0>] ? process_timeout+0x0/0x10
 [<ffffffffa0b09821>] ptlrpc_set_wait+0x321/0x960 [ptlrpc]
 [<ffffffffa0afe980>] ? ptlrpc_interrupted_set+0x0/0x120 [ptlrpc]
 [<ffffffff81067650>] ? default_wake_function+0x0/0x20
 [<ffffffffa0b15d05>] ? lustre_msg_set_jobid+0xf5/0x130 [ptlrpc]
 [<ffffffffa0b09ee1>] ptlrpc_queue_wait+0x81/0x220 [ptlrpc]
 [<ffffffffa13d4bc2>] osp_remote_sync+0xf2/0x1e0 [osp]
 [<ffffffffa13ba821>] osp_xattr_get+0x681/0xf90 [osp]
 [<ffffffffa12eb5b5>] lod_xattr_get+0x185/0x760 [lod]
 [<ffffffffa134f917>] mdd_links_read+0x117/0x270 [mdd]
 [<ffffffffa1364996>] ? mdd_attr_set_internal+0xd6/0x2c0 [mdd]
 [<ffffffffa13515bc>] mdd_linkea_prepare+0x3ec/0x4d0 [mdd]
 [<ffffffffa13573f2>] mdd_link+0xde2/0x10c0 [mdd]
 [<ffffffffa1222a8f>] mdt_reint_link+0xb2f/0xce0 [mdt]
 [<ffffffff81299b7a>] ? strlcpy+0x4a/0x60
 [<ffffffffa12178cf>] ? ucred_set_jobid+0x5f/0x70 [mdt]
 [<ffffffffa121b04d>] mdt_reint_rec+0x5d/0x200 [mdt]
 [<ffffffffa1205d5b>] mdt_reint_internal+0x62b/0xa50 [mdt]
 [<ffffffffa120662b>] mdt_reint+0x6b/0x120 [mdt]
 [<ffffffffa0b790cc>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
 [<ffffffffa0b25821>] ptlrpc_main+0xd31/0x1800 [ptlrpc]
 [<ffffffff8106ee50>] ? pick_next_task_fair+0xd0/0x130
 [<ffffffff81539896>] ? schedule+0x176/0x3a0
 [<ffffffffa0b24af0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
 [<ffffffff810a138e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff810a12f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
Other threads are then left waiting for the journal:
mdt_out01_008 D 000000000000000b     0  6281      2 0x00000080
 ffff8804030a7a40 0000000000000046 0000000000000000 ffff88079c63b5c0
 ffff8807fbe51ad8 ffff88082d91f400 00000bef4b83696a ffff880828bdba80
 ffff8804030a7a10 0000000100c3b96b ffff8803fbd31068 ffff8804030a7fd8
Call Trace:
 [<ffffffffa0fe7fca>] start_this_handle+0x25a/0x480 [jbd2]
 [<ffffffff811781fb>] ? cache_alloc_refill+0x15b/0x240
 [<ffffffff810a1820>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa0fe83d5>] jbd2_journal_start+0xb5/0x100 [jbd2]
 [<ffffffffa1036a36>] ldiskfs_journal_start_sb+0x56/0xe0 [ldiskfs]
 [<ffffffffa10870f1>] osd_trans_start+0x1e1/0x430 [osd_ldiskfs]
 [<ffffffffa0b7c53c>] out_tx_end+0x9c/0x5d0 [ptlrpc]
 [<ffffffffa0b81ec9>] out_handle+0x11d9/0x18d0 [ptlrpc]
 [<ffffffff8105e9b6>] ? enqueue_task+0x66/0x80
 [<ffffffff8105ab8d>] ? check_preempt_curr+0x6d/0x90
 [<ffffffffa080bc5a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
 [<ffffffff8153afce>] ? mutex_lock+0x1e/0x50
 [<ffffffffa0b71eda>] ? req_can_reconstruct+0x6a/0x120 [ptlrpc]
 [<ffffffffa0b790cc>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
 [<ffffffffa0b25821>] ptlrpc_main+0xd31/0x1800 [ptlrpc]
 [<ffffffff8106ee50>] ? pick_next_task_fair+0xd0/0x130
 [<ffffffff81539896>] ? schedule+0x176/0x3a0
 [<ffffffffa0b24af0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
 [<ffffffff810a138e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff810a12f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
This then causes other MDTs to be unable to reconnect to this MDT:
LustreError: 11-0: soaked-MDT0001-osp-MDT0003: operation mds_connect to node 192.168.1.109@o2ib10 failed: rc = -114
LustreError: 11-0: soaked-MDT0001-osp-MDT0003: operation mds_connect to node 192.168.1.109@o2ib10 failed: rc = -114
LustreError: 11-0: soaked-MDT0001-osp-MDT0003: operation mds_connect to node 192.168.1.109@o2ib10 failed: rc = -114
This causes the whole recovery to get stuck. I will cook a fix.
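For illustration, here is a minimal standalone sketch of the reordering that http://review.whamcloud.com/23096 ("move linkea prepare out of transaction") aims for. The stub functions are hypothetical stand-ins, not the real mdd/osd code; only the call-chain names in the stack trace above are from Lustre itself:

/*
 * Sketch: anything that may wait on a remote RPC (the linkea prepare step,
 * which can call osp_xattr_get() across MDTs) must run before the local
 * journal handle is opened, so no thread ever blocks in ptlrpc_set_wait()
 * while holding the journal and starving start_this_handle() on other
 * threads.
 */
#include <stdio.h>

static void linkea_prepare(void) { puts("prepare linkea (may send remote RPC and wait)"); }
static void journal_start(void)  { puts("start local transaction (journal handle held)"); }
static void linkea_update(void)  { puts("write prepared linkea, local I/O only"); }
static void journal_stop(void)   { puts("stop transaction (journal handle released)"); }

int main(void)
{
	/* old, deadlock-prone ordering: journal_start(); linkea_prepare(); ... */

	/* fixed ordering: RPC-capable work first, then the transaction */
	linkea_prepare();
	journal_start();
	linkea_update();
	journal_stop();
	return 0;
}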
Frank, is the patch from http://review.whamcloud.com/#/c/13726/ applied in this build or not? If not, I'd try with it.
Hmm, right now we do not fail the connection between MDTs, to make sure the FS will not be corrupted silently.
Jun 7 08:40:11 lola-11 kernel: Lustre: soaked-MDT0002: Client f6c679b2-f46e-20c5-89f7-51193ed93a53 (at 192.168.1.121@o2ib100) reconnecting, waiting for 20 clients in recovery for 1:52
Jun 7 08:40:11 lola-11 kernel: Lustre: Skipped 99 previous similar messages
Jun 7 08:40:18 lola-11 sshd[7507]: Accepted publickey for root from 10.4.0.116 port 47943 ssh2
Jun 7 08:40:18 lola-11 sshd[7507]: pam_unix(sshd:session): session opened for user root by (uid=0)
Jun 7 08:40:18 lola-11 sshd[7507]: Received disconnect from 10.4.0.116: 11: disconnected by user
Jun 7 08:40:18 lola-11 sshd[7507]: pam_unix(sshd:session): session closed for user root
Jun 7 08:40:33 lola-11 sshd[7530]: Accepted publickey for root from 10.4.0.116 port 47963 ssh2
Jun 7 08:40:33 lola-11 sshd[7530]: pam_unix(sshd:session): session opened for user root by (uid=0)
Jun 7 08:40:33 lola-11 sshd[7530]: Received disconnect from 10.4.0.116: 11: disconnected by user
Jun 7 08:40:33 lola-11 sshd[7530]: pam_unix(sshd:session): session closed for user root
Jun 7 08:40:47 lola-11 kernel: Lustre: soaked-MDT0003: recovery is timed out, evict stale exports
Jun 7 08:40:47 lola-11 kernel: Lustre: 6183:0:(ldlm_lib.c:2016:target_recovery_overseer()) soaked-MDT0003 recovery is aborted by hard timeout
Jun 7 08:40:47 lola-11 kernel: Lustre: 6183:0:(ldlm_lib.c:2026:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Jun 7 08:40:47 lola-11 kernel: Lustre: soaked-MDT0003: disconnecting 1 stale clients
I just glanced through the log a bit, and it seems MDT0002 and MDT0003 were recovering at the same time. I thought the soak test would wait for the MDT recovery to finish before failing over another MDT, no? I might be missing something. Thanks.
Hi Hongchao,
there's a crash dump file available, stored at lhn.hpdd.intel.com:/var/crashdumps/lu-7836/lola-11/127.0.0.1-2016-06-04-00:44:52 (see above). It's easy to access if you have an account on Lola (DCO will create one within 2 hours if you file a ticket containing your ssh public key). The crash dump tools and lustre.so are installed on the head node.
I have looked at the logs, but I have not found useful information to help identify the cause of the problem.
The increased memory usage could be related to the lock replay requests being continuously queued for enqueue:
...
00010000:00080000:26.0:1465394173.485502:0:6068:0:(ldlm_lib.c:2733:target_queue_recovery_request()) @@@ queue lock replay req req@ffff88017e947050 x1536423085514416/t0(0) o101->soaked-MDT0001-mdtlov_UUID@192.168.1.109@o2ib10:334/0 lens 328/0 e 0 to 0 dl 1465394179 ref 2 fl Interpret:/40/ffffffff rc 0/-1
00010000:00080000:27.0:1465394173.485521:0:6067:0:(ldlm_lib.c:2733:target_queue_recovery_request()) @@@ queue lock replay req req@ffff880247be9c50 x1536423085514432/t0(0) o101->soaked-MDT0001-mdtlov_UUID@192.168.1.109@o2ib10:334/0 lens 328/0 e 0 to 0 dl 1465394179 ref 2 fl Interpret:/40/ffffffff rc 0/-1
...
00010000:00080000:26.0:1465394179.502412:0:6068:0:(ldlm_lib.c:2733:target_queue_recovery_request()) @@@ queue lock replay req req@ffff88016a6c2450 x1536423085514928/t0(0) o101->soaked-MDT0001-mdtlov_UUID@192.168.1.109@o2ib10:340/0 lens 328/0 e 0 to 0 dl 1465394185 ref 2 fl Interpret:/40/ffffffff rc 0/-1
00010000:00080000:27.0:1465394179.502417:0:6067:0:(ldlm_lib.c:2733:target_queue_recovery_request()) @@@ queue lock replay req req@ffff8802828f5050 x1536423085514944/t0(0) o101->soaked-MDT0001-mdtlov_UUID@192.168.1.109@o2ib10:340/0 lens 328/0 e 0 to 0 dl 1465394185 ref 2 fl Interpret:/40/ffffffff rc 0/-1
...
00010000:00080000:27.0:1465394185.503466:0:6067:0:(ldlm_lib.c:2733:target_queue_recovery_request()) @@@ queue lock replay req req@ffff880141b1e850 x1536423085515456/t0(0) o101->soaked-MDT0001-mdtlov_UUID@192.168.1.109@o2ib10:346/0 lens 328/0 e 0 to 0 dl 1465394191 ref 2 fl Interpret:/40/ffffffff rc 0/-1
00010000:00080000:26.0:1465394185.503491:0:6068:0:(ldlm_lib.c:2733:target_queue_recovery_request()) @@@ queue lock replay req req@ffff88031ca74050 x1536423085515472/t0(0) o101->soaked-MDT0001-mdtlov_UUID@192.168.1.109@o2ib10:346/0 lens 328/0 e 0 to 0 dl 1465394191 ref 2 fl Interpret:/40/ffffffff rc 0/-1
...
00010000:00080000:27.0:1465394191.504477:0:6067:0:(ldlm_lib.c:2733:target_queue_recovery_request()) @@@ queue lock replay req req@ffff88023309b450 x1536423085516016/t0(0) o101->soaked-MDT0001-mdtlov_UUID@192.168.1.109@o2ib10:352/0 lens 328/0 e 0 to 0 dl 1465394197 ref 2 fl Interpret:/40/ffffffff rc 0/-1
00010000:00080000:10.0:1465394191.504487:0:6068:0:(ldlm_lib.c:2733:target_queue_recovery_request()) @@@ queue lock replay req req@ffff88023309bc50 x1536423085516032/t0(0) o101->soaked-MDT0001-mdtlov_UUID@192.168.1.109@o2ib10:352/0 lens 328/0 e 0 to 0 dl 1465394197 ref 2 fl Interpret:/40/ffffffff rc 0/-1
...
00010000:00080000:10.0:1465394209.507404:0:6068:0:(ldlm_lib.c:2733:target_queue_recovery_request()) @@@ queue lock replay req req@ffff880247b59850 x1536423085517600/t0(0) o101->soaked-MDT0001-mdtlov_UUID@192.168.1.109@o2ib10:370/0 lens 328/0 e 0 to 0 dl 1465394215 ref 2 fl Interpret:/40/ffffffff rc 0/-1
00010000:00080000:11.0:1465394209.507410:0:6067:0:(ldlm_lib.c:2733:target_queue_recovery_request()) @@@ queue lock replay req req@ffff88017e9e1c50 x1536423085517616/t0(0) o101->soaked-MDT0001-mdtlov_UUID@192.168.1.109@o2ib10:370/0 lens 328/0 e 0 to 0 dl 1465394215 ref 2 fl Interpret:/40/ffffffff rc 0/-1
The problem that the MDT can't be mounted (with error -114, -EALREADY) is the same problem described in LU-7794.
Is the stack trace of all processes available? It could help to find where the recovery is stuck.
Thanks!
We seem to have hit this again.
Build lustre-master 3424.
MDS crashes:
Kernel panic - not syncing: Attempted to kill init!
Pid: 1, comm: init Tainted: P           -- ------------    2.6.32-573.26.1.el6_lustre.x86_64 #1
Call Trace:
 [<ffffffff81539407>] ? panic+0xa7/0x16f
 [<ffffffff8107d0a7>] ? do_exit+0x867/0x870
 [<ffffffff8107d1b7>] ? sys_exit+0x17/0x20
 [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
Upon restart, the MDS fails to complete recovery and times out. During this time the soak controller halted lola-11 for failover. Attempting to abort recovery results in timeouts:
LustreError: 7532:0:(ldlm_lib.c:2576:target_stop_recovery_thread()) soaked-MDT0003: Aborting recovery
Lustre: 6032:0:(ldlm_lib.c:2025:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Lustre: 4563:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1471532976/real 1471532976] req@ffff8804070af9c0 x1543013122722384/t0(0) o38->soaked-MDT0002-osp-MDT0003@192.168.1.111@o2ib10:24/4 lens 520/544 e 0 to 1 dl 1471533032 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
Lustre: 4563:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
Aug 18 08:10:01 lola-10 TIME: Time stamp for console
INFO: task lctl:7532 blocked for more than 120 seconds.
      Tainted: P           -- ------------    2.6.32-573.26.1.el6_lustre.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lctl          D 0000000000000011     0  7532   7440 0x00000080
 ffff8804069dbb68 0000000000000086 0000000000000000 ffff880835ef3520
 ffff880835ef3558 ffff880835ef3bc0 000000d247916847 0000000000000000
 ffff880437394078 0000000100092ec2 ffff8804373945f8 ffff8804069dbfd8
Call Trace:
 [<ffffffff8153aa35>] schedule_timeout+0x215/0x2e0
 [<ffffffff8153a693>] wait_for_common+0x123/0x180
 [<ffffffff81067650>] ? default_wake_function+0x0/0x20
 [<ffffffff8153a7cd>] wait_for_completion+0x1d/0x20
 [<ffffffffa0970f80>] target_stop_recovery_thread+0x50/0xe0 [ptlrpc]
 [<ffffffffa10ebf22>] mdt_iocontrol+0x6a2/0xa80 [mdt]
 [<ffffffff812337cf>] ? security_inode_permission+0x1f/0x30
 [<ffffffffa07843ec>] class_handle_ioctl+0x15fc/0x20d0 [obdclass]
 [<ffffffff811a5998>] ? do_filp_open+0x798/0xd20
 [<ffffffffa07682ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
 [<ffffffff811a7f82>] vfs_ioctl+0x22/0xa0
 [<ffffffff811a8124>] do_vfs_ioctl+0x84/0x580
 [<ffffffff811a03f6>] ? final_putname+0x26/0x50
 [<ffffffff811a86a1>] sys_ioctl+0x81/0xa0
 [<ffffffff810e905e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
LustreError: 137-5: soaked-MDT0002_UUID: not available for connect from 192.168.1.131@o2ib100 (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 351 previous similar messages
INFO: task lctl:7532 blocked for more than 120 seconds.
      Tainted: P           -- ------------    2.6.32-573.26.1.el6_lustre.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lctl          D 0000000000000011     0  7532   7440 0x00000080
 ffff8804069dbb68 0000000000000086 0000000000000000 ffff880835ef3520
 ffff880835ef3558 ffff880835ef3bc0 000000d247916847 0000000000000000
 ffff880437394078 0000000100092ec2 ffff8804373945f8 ffff8804069dbfd8
Call Trace:
 [<ffffffff8153aa35>] schedule_timeout+0x215/0x2e0
 [<ffffffff8153a693>] wait_for_common+0x123/0x180
 [<ffffffff81067650>] ? default_wake_function+0x0/0x20
 [<ffffffff8153a7cd>] wait_for_completion+0x1d/0x20
 [<ffffffffa0970f80>] target_stop_recovery_thread+0x50/0xe0 [ptlrpc]
 [<ffffffffa10ebf22>] mdt_iocontrol+0x6a2/0xa80 [mdt]
 [<ffffffff812337cf>] ? security_inode_permission+0x1f/0x30
 [<ffffffffa07843ec>] class_handle_ioctl+0x15fc/0x20d0 [obdclass]
 [<ffffffff811a5998>] ? do_filp_open+0x798/0xd20
 [<ffffffffa07682ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
 [<ffffffff811a7f82>] vfs_ioctl+0x22/0xa0
 [<ffffffff811a8124>] do_vfs_ioctl+0x84/0x580
 [<ffffffff811a03f6>] ? final_putname+0x26/0x50
 [<ffffffff811a86a1>] sys_ioctl+0x81/0xa0
 [<ffffffff810e905e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
INFO: task lctl:7532 blocked for more than 120 seconds.
At this point the systems were rebooted.
After the reboot, the MDT could not be mounted, as the MGS configuration step failed with a duplicate export:
Aug 18 11:24:44 lola-10 kernel: LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. quota=on. Opts:
Aug 18 11:24:45 lola-10 kernel: LustreError: 6045:0:(tgt_lastrcvd.c:1474:tgt_clients_data_init()) soaked-MDT0002: duplicate export for client generation 1
Aug 18 11:24:45 lola-10 kernel: LustreError: 6045:0:(obd_config.c:578:class_setup()) setup soaked-MDT0002 failed (-114)
Aug 18 11:24:45 lola-10 kernel: LustreError: 6045:0:(obd_config.c:1671:class_config_llog_handler()) MGC192.168.1.108@o2ib10: cfg command failed: rc = -114
Aug 18 11:24:45 lola-10 kernel: Lustre: cmd=cf003 0:soaked-MDT0002 1:soaked-MDT0002_UUID 2:2 3:soaked-MDT0002-mdtlov 4:f
Aug 18 11:24:45 lola-10 kernel:
Aug 18 11:24:45 lola-10 kernel: LustreError: 15c-8: MGC192.168.1.108@o2ib10: The configuration from log 'soaked-MDT0002' failed (-114). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Aug 18 11:24:45 lola-10 kernel: LustreError: 5951:0:(obd_mount_server.c:1352:server_start_targets()) failed to start server soaked-MDT0002: -114
Aug 18 11:24:45 lola-10 kernel: LustreError: 5951:0:(obd_mount_server.c:1844:server_fill_super()) Unable to start targets: -114
Aug 18 11:24:45 lola-10 kernel: LustreError: 5951:0:(obd_config.c:625:class_cleanup()) Device 4 not setup
Aug 18 11:24:46 lola-10 kernel: Lustre: server umount soaked-MDT0002 complete
Aug 18 11:24:46 lola-10 kernel: LustreError: 5951:0:(obd_mount.c:1453:lustre_fill_super()) Unable to mount (-114)
Triggering an abort of the recovery hangs:
LustreError: 139015:0:(mdt_handler.c:5896:mdt_iocontrol()) soaked-MDT0002: Aborting recovery for device
LustreError: 139015:0:(ldlm_lib.c:2577:target_stop_recovery_thread()) soaked-MDT0002: Aborting recovery
INFO: task lctl:139015 blocked for more than 120 seconds.
      Tainted: P           -- ------------    2.6.32-573.26.1.el6_lustre.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lctl          D 0000000000000013     0 139015 129872 0x00000080
 ffff8804187cfb68 0000000000000082 0000000000000000 ffff880431d7eab0
 ffff880431d7eae8 ffff880431d7f150 00ff8804187cfb18 0000000000000000
 ffff8803fc2c0078 ffff880038775a00 ffff8803fc2c05f8 ffff8804187cffd8
Call Trace:
 [<ffffffff81539b0e>] ? thread_return+0x4e/0x7d0
 [<ffffffff8153aa35>] schedule_timeout+0x215/0x2e0
 [<ffffffff8153a693>] wait_for_common+0x123/0x180
 [<ffffffff81067650>] ? default_wake_function+0x0/0x20
 [<ffffffff8153a7cd>] wait_for_completion+0x1d/0x20
 [<ffffffffa0ae3e80>] target_stop_recovery_thread+0x50/0xe0 [ptlrpc]
 [<ffffffffa1223c02>] mdt_iocontrol+0x6b2/0x920 [mdt]
 [<ffffffff812337cf>] ? security_inode_permission+0x1f/0x30
 [<ffffffffa08d4edc>] class_handle_ioctl+0x15fc/0x20d0 [obdclass]
 [<ffffffff811a5998>] ? do_filp_open+0x798/0xd20
 [<ffffffffa08b92ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
 [<ffffffff811a7f82>] vfs_ioctl+0x22/0xa0
 [<ffffffff811a8124>] do_vfs_ioctl+0x84/0x580
 [<ffffffff811a03f6>] ? final_putname+0x26/0x50
 [<ffffffff811a86a1>] sys_ioctl+0x81/0xa0
 [<ffffffff810e905e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/23111
Subject: LU-8250 osd: add journal info check
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c8bc88c85d6f273b18f7741cc9c0cdc1bcfca4d5