Details
- Type: Bug
- Resolution: Duplicate
- Priority: Blocker
- Fix Version/s: None
- Environment: lola; build: master, 2.7.64-81-g6fc8da4, 6fc8da41f2ff5156639e89f379adcdbb73ac8567
- Severity: 3
Description
An error occurred during an lfsck run of the soak FS using build '20160108' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160108).
DNE is enabled.
- lfsck started on MDS hosting mdt-0:
[root@lola-8 ~]# date; lctl lfsck_start -M soaked-MDT0000 -s 1000 -t all -A ; date
Wed Jan 13 04:42:28 PST 2016
Started LFSCK on the device soaked-MDT0000: scrub layout namespace
Wed Jan 13 04:42:28 PST 2016
No soak test was running
- lfsck_namespace does not complete scanning-phase2
- MDSes lola-9 and lola-11 showed an increasing number of blocked mdt_out* threads
- Triggering a stack trace dump led to a kernel panic on lola-11 (2016-01-13 08:15:22)
- All MDSes showed only minimal utilization of system resources
Attached files:
- console and messages files of lola-9 and lola-11, containing stack trace information
- vmcore-dmesg.txt of lola-11
- lfsck status information of all MDTs
See the next comment for the crash file location.
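For reference, one way to collect the attached per-MDT LFSCK status is sketched below; this is a hedged sketch, assuming the standard mdd.*.lfsck_namespace / mdd.*.lfsck_layout status parameters and the pdsh "mds" host group used elsewhere in this ticket:

# collect LFSCK state from all MDTs in one pass
pdsh -g mds 'lctl get_param mdd.*.lfsck_namespace mdd.*.lfsck_layout' | dshbak -c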
Issue Links
- duplicates: LU-6684 "lctl lfsck_stop hangs" (Resolved)
Activity
The patch http://review.whamcloud.com/#/c/18082/ has been improved to handle the lola trouble more properly.
The patch http://review.whamcloud.com/#/c/18082/ has been verified on lola today and works, but there are still some things that can be improved.
In triage today, it was reported that further work on the patch is needed after experiencing more failures. nasf is actively looking at it.
The patch http://review.whamcloud.com/17032/ has already been landed to the latest master branch. If you are working on the latest master, then please apply the 2nd and 3rd patches directly.
Hm, the first patch can't be applied:
[soakbuilder@lhn lustre-release]$ for i in /scratch/rpms/20160126/patches/*.patch; do git am $i; done
Applying: LU-0000 dne: dne llog fixes
warning: lustre/tests/conf-sanity.sh has type 100755, expected 100644
Applying: LU-6684 lfsck: stop lfsck even if some servers offline
error: patch failed: lustre/include/lustre_net.h:605
error: lustre/include/lustre_net.h: patch does not apply
error: patch failed: lustre/include/obd_support.h:557
error: lustre/include/obd_support.h: patch does not apply
error: patch failed: lustre/lfsck/lfsck_engine.c:1577
error: lustre/lfsck/lfsck_engine.c: patch does not apply
error: patch failed: lustre/lfsck/lfsck_internal.h:817
error: lustre/lfsck/lfsck_internal.h: patch does not apply
error: patch failed: lustre/lfsck/lfsck_layout.c:3248
error: lustre/lfsck/lfsck_layout.c: patch does not apply
error: patch failed: lustre/lfsck/lfsck_lib.c:31
error: lustre/lfsck/lfsck_lib.c: patch does not apply
error: patch failed: lustre/lfsck/lfsck_namespace.c:3931
error: lustre/lfsck/lfsck_namespace.c: patch does not apply
error: patch failed: lustre/obdclass/obd_mount_server.c:477
error: lustre/obdclass/obd_mount_server.c: patch does not apply
error: patch failed: lustre/osp/osp_trans.c:454
error: lustre/osp/osp_trans.c: patch does not apply
error: patch failed: lustre/ptlrpc/client.c:1661
error: lustre/ptlrpc/client.c: patch does not apply
error: patch failed: lustre/tests/sanity-lfsck.sh:4291
error: lustre/tests/sanity-lfsck.sh: patch does not apply
Patch failed at 0001 LU-6684 lfsck: stop lfsck even if some servers offline
When you have resolved this problem run "git am --resolved".
If you would prefer to skip this patch, instead run "git am --skip".
To restore the original branch and stop patching run "git am --abort".
previous rebase directory /home/soakbuilder/repos/lustre-release/.git/rebase-apply still exists but mbox given.
Patch details:
[soakbuilder@lhn lustre-release]$ ls -1 /scratch/rpms/20160126/patches/
001-LU-0000_dne_dne_llog_fixes-PatchSet39.patch
002-LU-6684_lfsck_stop_lfsck_even_if_some_servers_offline-PatchSet6.patch
003-LU-6684_lfsck_set_the_lfsck_notify_as_interruptable-PatchSet3.patch
004-LU-7680_mdd_put_migrated_object_on_the_orphan_list
Status of the master branch used to create the sub-branch:
[soakbuilder@lhn lustre-release]$ git describe ; git log | head -1
2.7.65-38-g607f691
commit 607f6919ea67b101796630d4b55649a12ea0e859
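For completeness, one way to recover from the failed apply is to clear the stale git-am state that the error message mentions and retry the remaining patches on a clean base; a hedged sketch (the paths and the skip step are illustrative):

# abort the stuck am session left in .git/rebase-apply
git am --abort
# confirm which commit the sub-branch is based on
git describe ; git log --oneline -1
# retry the conflicting patch; if it conflicts because it already landed on this base, skip it
git am /scratch/rpms/20160126/patches/002-*.patch || git am --skip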
Ok. I'll prepare a build together with Di's latest DNE patches contained in http://review.whamcloud.com/#/c/16838/39
This ticket covers server-side issues. Before the LFSCK run there had been a lot of DNE tests on the lola cluster, so the consistency of the system is unknown.
According to the log lfsck-info.txt.bz2, there were some lfsck failures, but without further logs, we do not know what happened (related LFSCK debug logs have been overwritten).
In the log console-lola-9.log.bz2, the namespace LFSCK was in its double-scan phase, verifying some remote directory's linkEA. For that, the namespace LFSCK needs to locate the remote parent via an OUT RPC, and such an RPC is uninterruptible. If the remote MDT did not handle the RPC (or the RPC handler was blocked on the remote MDT), then lfsck_stop would hang there. That is why "lfsck_stop" did not work. This is a known problem and has been resolved by the patches http://review.whamcloud.com/17032/ and http://review.whamcloud.com/#/c/18082/. These two patches allow the LFSCK to be stopped even if the related RPCs are blocked.
lfsck_namespa S 0000000000000009 0 15163 2 0x00000080
 ffff880336279540 0000000000000046 0000000000000000 000000005696abdc
 0000000000000000 0000000000000000 00013ca15062f3e9 ffffffffa0a77805
 ffff880700000018 0000000114be3200 ffff88033d7ec5f8 ffff880336279fd8
Call Trace:
 [<ffffffffa0a77805>] ? ptl_send_rpc+0x685/0xea0 [ptlrpc]
 [<ffffffff8152b222>] schedule_timeout+0x192/0x2e0
 [<ffffffff81087540>] ? process_timeout+0x0/0x10
 [<ffffffffa0a70cd9>] ptlrpc_set_wait+0x319/0xa20 [ptlrpc]
 [<ffffffff81064c00>] ? default_wake_function+0x0/0x20
 [<ffffffffa0a7d385>] ? lustre_msg_set_jobid+0xf5/0x130 [ptlrpc]
 [<ffffffffa0a71461>] ptlrpc_queue_wait+0x81/0x220 [ptlrpc]
 [<ffffffffa138e101>] osp_remote_sync+0x121/0x190 [osp]
 [<ffffffffa1370cd8>] osp_attr_get+0x428/0x6e0 [osp]
 [<ffffffffa13726f7>] osp_object_init+0x1c7/0x330 [osp]
 [<ffffffffa0853648>] lu_object_alloc+0xd8/0x320 [obdclass]
 [<ffffffffa0854a31>] lu_object_find_try+0x151/0x260 [obdclass]
 [<ffffffffa0854bf1>] lu_object_find_at+0xb1/0xe0 [obdclass]
 [<ffffffffa0cf8b73>] ? fld_server_lookup+0x53/0x330 [fld]
 [<ffffffffa0854c5f>] lu_object_find_slice+0x1f/0x80 [obdclass]
 [<ffffffffa1101220>] lfsck_namespace_dsd_single+0x200/0xd50 [lfsck]
 [<ffffffffa1106406>] lfsck_namespace_double_scan_dir+0x6d6/0xe40 [lfsck]
 [<ffffffffa1106ec4>] lfsck_namespace_double_scan_one+0x354/0x1330 [lfsck]
 [<ffffffffa0854bf1>] ? lu_object_find_at+0xb1/0xe0 [obdclass]
 [<ffffffffa110845d>] lfsck_namespace_double_scan_one_trace_file+0x5bd/0x8d0 [lfsck]
 [<ffffffffa110c643>] lfsck_namespace_assistant_handler_p2+0x3b3/0x1830 [lfsck]
 [<ffffffff81087540>] ? process_timeout+0x0/0x10
 [<ffffffffa10ed913>] lfsck_assistant_engine+0x1633/0x2010 [lfsck]
 [<ffffffff81064c00>] ? default_wake_function+0x0/0x20
 [<ffffffffa10ec2e0>] ? lfsck_assistant_engine+0x0/0x2010 [lfsck]
 [<ffffffff8109e78e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
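As a quick cross-check of the scenario described above (an OUT RPC that never completes), one could inspect where the lfsck_namespace thread is sleeping and whether the remote MDS is reachable at the LNet level; a hedged sketch, with the NID left as a placeholder:

# PID 15163 is the blocked lfsck_namespace thread from the trace above
cat /proc/15163/stack
# verify basic LNet connectivity to the remote MDS (<remote-mds-nid> is a placeholder)
lctl ping <remote-mds-nid>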
Another problem is that a lot of "mdt_outxxx" threads were hung on the MDT with the following stack:
[<ffffffffa0866b7d>] lu_object_find_at+0x3d/0xe0 [obdclass]
[<ffffffff81064c00>] ? default_wake_function+0x0/0x20
[<ffffffffa0867a9c>] dt_locate_at+0x1c/0xa0 [obdclass]
[<ffffffffa0b01e2e>] out_handle+0x105e/0x19a0 [ptlrpc]
[<ffffffff8105872d>] ? check_preempt_curr+0x6d/0x90
[<ffffffff8152b83e>] ? mutex_lock+0x1e/0x50
[<ffffffffa0af132a>] ? req_can_reconstruct+0x6a/0x120 [ptlrpc]
[<ffffffffa0af8bbc>] tgt_request_handle+0x8ec/0x1470 [ptlrpc]
[<ffffffffa0aa0231>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
[<ffffffff8152a39e>] ? thread_return+0x4e/0x7d0
[<ffffffffa0a9f3f0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
[<ffffffff8109e78e>] kthread+0x9e/0xc0
[<ffffffff8100c28a>] child_rip+0xa/0x20
[<ffffffff8109e6f0>] ? kthread+0x0/0xc0
[<ffffffff8100c280>] ? child_rip+0x0/0x20
The log shows that all the "mdt_outxxx" threads were blocked in lu_object_find_at(). The related OUT RPCs were triggered by the namespace LFSCK on a remote MDT to locate the remote parent object. This must be because the parent object was in cache and marked as dying, but it is unknown (no logs) who was destroying the parent object and why it was not purged from the cache after the destroy.
Di, is there any known issue with destroying objects under DNE mode?
One suspect is LU-7680. So please try the following three patches for further verification. Thanks!
1) http://review.whamcloud.com/17032/
2) http://review.whamcloud.com/#/c/18082/
3) http://review.whamcloud.com/#/c/18032/
I tried to stop lfsck by running
[root@lola-8 ~]# lctl lfsck_stop -M soaked-MDT0000 -A
on the node hosting mdt-0.
The operation has been stalled for hours:
---------------- lola-8 ----------------
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 30223 0.1 0.0 0 0 ? S Jan13 2:32 [lfsck]
root 30226 0.0 0.0 0 0 ? S Jan13 1:04 [lfsck_namespace]
root 30227 0.1 0.0 0 0 ? S Jan13 2:48 [lfsck]
root 30230 0.0 0.0 0 0 ? S Jan13 1:30 [lfsck_namespace]
root 42585 0.0 0.0 11140 684 pts/0 S+ 04:52 0:00 lctl lfsck_stop -M soaked-MDT0000 -A
---------------- lola-9 ----------------
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 15157 0.1 0.0 0 0 ? S Jan13 2:33 [lfsck]
root 15158 0.1 0.0 0 0 ? S Jan13 2:31 [lfsck]
root 15163 0.0 0.0 0 0 ? S Jan13 1:11 [lfsck_namespace]
root 15164 0.0 0.0 0 0 ? S Jan13 0:53 [lfsck_namespace]
---------------- lola-10 ----------------
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 19219 0.1 0.0 0 0 ? S Jan13 2:30 [lfsck]
root 19220 0.1 0.0 0 0 ? S Jan13 2:40 [lfsck]
root 19225 0.0 0.0 0 0 ? S Jan13 0:59 [lfsck_namespace]
root 19226 0.0 0.0 0 0 ? S Jan13 0:57 [lfsck_namespace]
---------------- lola-11 ----------------
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 6170 0.0 0.0 0 0 ? S Jan13 0:00 [lfsck]
root 6172 0.0 0.0 0 0 ? S Jan13 0:09 [lfsck_namespace]
A huge number of mdt_out threads are blocked on lola-9 and lola-11:
[root@lola-16 soaked]# pdsh -g mds 'uptime' | dshbak -c
---------------- lola-8 ----------------
 07:47:57 up 3 days, 6:18, 1 user, load average: 1.47, 1.36, 1.33
---------------- lola-9 ----------------
 07:47:57 up 4 days, 20:49, 0 users, load average: 198.46, 198.44, 198.39
---------------- lola-10 ----------------
 07:47:57 up 5 days, 7:52, 0 users, load average: 0.52, 0.44, 0.37
---------------- lola-11 ----------------
 07:47:57 up 23:25, 1 user, load average: 195.07, 195.23, 194.51
---------------------------------------------------------------------
[root@lola-11 crash]# uptime
 07:38:43 up 23:15, 1 user, load average: 195.39, 194.76, 193.83
[root@lola-11 crash]# ps aux | grep 'D ' | grep -v grep | wc -l
196
[root@lola-11 crash]# ps aux | grep 'D ' | grep -v grep | grep 'mdt_out' | wc -l
196
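A less invasive way to capture the stacks of the blocked service threads (rather than a full sysrq task dump, which here ended in a panic on lola-11) would be per-thread /proc stacks; a hedged sketch:

# list D-state mdt_out* threads and dump each one's kernel stack
ps -eo pid,stat,comm | awk '$2 ~ /^D/ && $3 ~ /^mdt_out/ {print $1}' | while read pid; do
    echo "== pid $pid =="
    cat /proc/$pid/stack
done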
A typical dump file contains (see the attached debug log) a continuous stream of CONN -> DISCONN and DISCONN -> CONN messages.
Since this ticket, which was a blocker, is a duplicate of LU-6684, shouldn't LU-6684 be marked as a blocker then?