[LU-7662] lfsck don't complete Created: 13/Jan/16 Updated: 29/Jan/16 Resolved: 29/Jan/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Frank Heckes (Inactive) | Assignee: | nasf (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | soak |
| Environment: | lola |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Description |
|
An error occurred during an lfsck run on the soak file system using build '20160108' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160108).
Attached files:
Crash file location: see next comment. |
| Comments |
| Comment by Frank Heckes (Inactive) [ 13/Jan/16 ] |
|
Crash file has been saved to lola-1:/scratch/crashdumps/lu-7662/lola-11-127.0.0.1-2016-01-13-08:15:22 |
| Comment by Frank Heckes (Inactive) [ 14/Jan/16 ] |
|
I tried to stop lfsck on the node hosting mdt-0 by running:

  [root@lola-8 ~]# lctl lfsck_stop -M soaked-MDT0000 -A

The lfsck threads kept running on all MDS nodes, and the lfsck_stop command itself was still pending:

  ---------------- lola-8 ----------------
  USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
  root 30223 0.1 0.0 0 0 ? S Jan13 2:32 [lfsck]
  root 30226 0.0 0.0 0 0 ? S Jan13 1:04 [lfsck_namespace]
  root 30227 0.1 0.0 0 0 ? S Jan13 2:48 [lfsck]
  root 30230 0.0 0.0 0 0 ? S Jan13 1:30 [lfsck_namespace]
  root 42585 0.0 0.0 11140 684 pts/0 S+ 04:52 0:00 lctl lfsck_stop -M soaked-MDT0000 -A
  ---------------- lola-9 ----------------
  USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
  root 15157 0.1 0.0 0 0 ? S Jan13 2:33 [lfsck]
  root 15158 0.1 0.0 0 0 ? S Jan13 2:31 [lfsck]
  root 15163 0.0 0.0 0 0 ? S Jan13 1:11 [lfsck_namespace]
  root 15164 0.0 0.0 0 0 ? S Jan13 0:53 [lfsck_namespace]
  ---------------- lola-10 ----------------
  USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
  root 19219 0.1 0.0 0 0 ? S Jan13 2:30 [lfsck]
  root 19220 0.1 0.0 0 0 ? S Jan13 2:40 [lfsck]
  root 19225 0.0 0.0 0 0 ? S Jan13 0:59 [lfsck_namespace]
  root 19226 0.0 0.0 0 0 ? S Jan13 0:57 [lfsck_namespace]
  ---------------- lola-11 ----------------
  USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
  root 6170 0.0 0.0 0 0 ? S Jan13 0:00 [lfsck]
  root 6172 0.0 0.0 0 0 ? S Jan13 0:09 [lfsck_namespace]

A huge number of mdt_out threads are blocked on lola-9 and lola-11:

  [root@lola-16 soaked]# pdsh -g mds 'uptime' | dshbak -c
  ---------------- lola-8 ----------------
  07:47:57 up 3 days, 6:18, 1 user, load average: 1.47, 1.36, 1.33
  ---------------- lola-9 ----------------
  07:47:57 up 4 days, 20:49, 0 users, load average: 198.46, 198.44, 198.39
  ---------------- lola-10 ----------------
  07:47:57 up 5 days, 7:52, 0 users, load average: 0.52, 0.44, 0.37
  ---------------- lola-11 ----------------
  07:47:57 up 23:25, 1 user, load average: 195.07, 195.23, 194.51
  ---------------------------------------------------------------------
  [root@lola-11 crash]# uptime
  07:38:43 up 23:15, 1 user, load average: 195.39, 194.76, 193.83
  [root@lola-11 crash]# ps aux | grep 'D ' | grep -v grep | wc -l
  196
  [root@lola-11 crash]# ps aux | grep 'D ' | grep -v grep | grep 'mdt_out' | wc -l
  196

A typical dump file (see the attached debug log) contains a continuous stream of CONN -> DISCONN (and vice versa) messages. |
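For anyone repeating these checks, the commands above can be collected into a short sequence. This is a minimal sketch only; the mdd.soaked-MDT*.lfsck_namespace parameter path is an assumption that should be verified against the local /proc layout:

  # Show the namespace LFSCK status/phase for the MDTs on this node
  # (parameter path is an assumption; adjust the fsname/MDT index as needed)
  lctl get_param -n mdd.soaked-MDT*.lfsck_namespace | grep -E 'status|phase'

  # Ask the LFSCK instances to stop (as attempted above)
  lctl lfsck_stop -M soaked-MDT0000 -A

  # Count uninterruptible (D-state) mdt_out service threads
  ps -eo state,comm | awk '$1 == "D" && $2 ~ /^mdt_out/' | wc -l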
| Comment by nasf (Inactive) [ 26/Jan/16 ] |
|
This ticket covers several server-side issues. Before the LFSCK run there had been a lot of DNE tests on the lola cluster, so the consistency of the file system is unknown. According to the log lfsck-info.txt.bz2 there were some LFSCK failures, but without further logs we do not know what happened (the related LFSCK debug logs have been overwritten).

In the log console-lola-9.log.bz2, the namespace LFSCK was in its double-scan phase verifying the linkEA of a remote directory. For that, the namespace LFSCK needs to locate the remote parent via an OUT RPC, and such an RPC is uninterruptible. If the remote MDT did not handle the RPC (or the RPC handler was blocked on the remote MDT), then lfsck_stop would hang there. That is why "lfsck_stop" did not work. This is a known problem and has been resolved by the patches http://review.whamcloud.com/17032/ and http://review.whamcloud.com/#/c/18082/. These two patches allow the LFSCK to be stopped even if the related RPCs are blocked. The blocked lfsck_namespace thread:

  lfsck_namespa S 0000000000000009 0 15163 2 0x00000080
   ffff880336279540 0000000000000046 0000000000000000 000000005696abdc
   0000000000000000 0000000000000000 00013ca15062f3e9 ffffffffa0a77805
   ffff880700000018 0000000114be3200 ffff88033d7ec5f8 ffff880336279fd8
  Call Trace:
   [<ffffffffa0a77805>] ? ptl_send_rpc+0x685/0xea0 [ptlrpc]
   [<ffffffff8152b222>] schedule_timeout+0x192/0x2e0
   [<ffffffff81087540>] ? process_timeout+0x0/0x10
   [<ffffffffa0a70cd9>] ptlrpc_set_wait+0x319/0xa20 [ptlrpc]
   [<ffffffff81064c00>] ? default_wake_function+0x0/0x20
   [<ffffffffa0a7d385>] ? lustre_msg_set_jobid+0xf5/0x130 [ptlrpc]
   [<ffffffffa0a71461>] ptlrpc_queue_wait+0x81/0x220 [ptlrpc]
   [<ffffffffa138e101>] osp_remote_sync+0x121/0x190 [osp]
   [<ffffffffa1370cd8>] osp_attr_get+0x428/0x6e0 [osp]
   [<ffffffffa13726f7>] osp_object_init+0x1c7/0x330 [osp]
   [<ffffffffa0853648>] lu_object_alloc+0xd8/0x320 [obdclass]
   [<ffffffffa0854a31>] lu_object_find_try+0x151/0x260 [obdclass]
   [<ffffffffa0854bf1>] lu_object_find_at+0xb1/0xe0 [obdclass]
   [<ffffffffa0cf8b73>] ? fld_server_lookup+0x53/0x330 [fld]
   [<ffffffffa0854c5f>] lu_object_find_slice+0x1f/0x80 [obdclass]
   [<ffffffffa1101220>] lfsck_namespace_dsd_single+0x200/0xd50 [lfsck]
   [<ffffffffa1106406>] lfsck_namespace_double_scan_dir+0x6d6/0xe40 [lfsck]
   [<ffffffffa1106ec4>] lfsck_namespace_double_scan_one+0x354/0x1330 [lfsck]
   [<ffffffffa0854bf1>] ? lu_object_find_at+0xb1/0xe0 [obdclass]
   [<ffffffffa110845d>] lfsck_namespace_double_scan_one_trace_file+0x5bd/0x8d0 [lfsck]
   [<ffffffffa110c643>] lfsck_namespace_assistant_handler_p2+0x3b3/0x1830 [lfsck]
   [<ffffffff81087540>] ? process_timeout+0x0/0x10
   [<ffffffffa10ed913>] lfsck_assistant_engine+0x1633/0x2010 [lfsck]
   [<ffffffff81064c00>] ? default_wake_function+0x0/0x20
   [<ffffffffa10ec2e0>] ? lfsck_assistant_engine+0x0/0x2010 [lfsck]
   [<ffffffff8109e78e>] kthread+0x9e/0xc0
   [<ffffffff8100c28a>] child_rip+0xa/0x20
   [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
   [<ffffffff8100c280>] ? child_rip+0x0/0x20

Another problem is that a lot of "mdt_outxxx" threads hung on the MDT with the following stack:

   [<ffffffffa0866b7d>] lu_object_find_at+0x3d/0xe0 [obdclass]
   [<ffffffff81064c00>] ? default_wake_function+0x0/0x20
   [<ffffffffa0867a9c>] dt_locate_at+0x1c/0xa0 [obdclass]
   [<ffffffffa0b01e2e>] out_handle+0x105e/0x19a0 [ptlrpc]
   [<ffffffff8105872d>] ? check_preempt_curr+0x6d/0x90
   [<ffffffff8152b83e>] ? mutex_lock+0x1e/0x50
   [<ffffffffa0af132a>] ? req_can_reconstruct+0x6a/0x120 [ptlrpc]
   [<ffffffffa0af8bbc>] tgt_request_handle+0x8ec/0x1470 [ptlrpc]
   [<ffffffffa0aa0231>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
   [<ffffffff8152a39e>] ? thread_return+0x4e/0x7d0
   [<ffffffffa0a9f3f0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
   [<ffffffff8109e78e>] kthread+0x9e/0xc0
   [<ffffffff8100c28a>] child_rip+0xa/0x20
   [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
   [<ffffffff8100c280>] ? child_rip+0x0/0x20

The log shows that all the "mdt_outxxx" threads were blocked in lu_object_find_at(). The related OUT RPCs were triggered by the namespace LFSCK on a remote MDT to locate the remote parent object. This is most likely because the parent object was in cache and marked as dying, but it is unknown (there are no logs) who was destroying the parent object and why it was not purged from the cache after being destroyed. Di, is there any known issue about destroying objects under DNE mode? One thing to be suspected is |
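As a side note, the symptom described here (OUT service threads stuck in lu_object_find_at()) can be confirmed without a crash dump by dumping the kernel stacks of the D-state threads. A minimal sketch, assuming /proc/<pid>/stack and sysrq are available on this kernel:

  # Dump the kernel stack of every D-state mdt_out thread
  for pid in $(ps -eo pid,state,comm | awk '$2 == "D" && $3 ~ /^mdt_out/ {print $1}'); do
      echo "=== mdt_out thread $pid ==="
      cat /proc/$pid/stack
  done

  # Alternatively, dump all blocked tasks to the console/dmesg
  echo w > /proc/sysrq-trigger

  # Capture the Lustre debug buffer for later analysis
  lctl dk /tmp/lustre-debug.$(hostname).log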
| Comment by Frank Heckes (Inactive) [ 26/Jan/16 ] |
|
Ok. I'll prepare a build together with Di's latest DNE patches contained in http://review.whamcloud.com/#/c/16838/39 |
| Comment by Frank Heckes (Inactive) [ 26/Jan/16 ] |
|
Hm, one of the patches (LU-6684 "stop lfsck even if some servers offline") can't be applied:

  [soakbuilder@lhn lustre-release]$ for i in /scratch/rpms/20160126/patches/*.patch; do git am $i; done
  Applying: LU-0000 dne: dne llog fixes
  warning: lustre/tests/conf-sanity.sh has type 100755, expected 100644
  Applying: LU-6684 lfsck: stop lfsck even if some servers offline
  error: patch failed: lustre/include/lustre_net.h:605
  error: lustre/include/lustre_net.h: patch does not apply
  error: patch failed: lustre/include/obd_support.h:557
  error: lustre/include/obd_support.h: patch does not apply
  error: patch failed: lustre/lfsck/lfsck_engine.c:1577
  error: lustre/lfsck/lfsck_engine.c: patch does not apply
  error: patch failed: lustre/lfsck/lfsck_internal.h:817
  error: lustre/lfsck/lfsck_internal.h: patch does not apply
  error: patch failed: lustre/lfsck/lfsck_layout.c:3248
  error: lustre/lfsck/lfsck_layout.c: patch does not apply
  error: patch failed: lustre/lfsck/lfsck_lib.c:31
  error: lustre/lfsck/lfsck_lib.c: patch does not apply
  error: patch failed: lustre/lfsck/lfsck_namespace.c:3931
  error: lustre/lfsck/lfsck_namespace.c: patch does not apply
  error: patch failed: lustre/obdclass/obd_mount_server.c:477
  error: lustre/obdclass/obd_mount_server.c: patch does not apply
  error: patch failed: lustre/osp/osp_trans.c:454
  error: lustre/osp/osp_trans.c: patch does not apply
  error: patch failed: lustre/ptlrpc/client.c:1661
  error: lustre/ptlrpc/client.c: patch does not apply
  error: patch failed: lustre/tests/sanity-lfsck.sh:4291
  error: lustre/tests/sanity-lfsck.sh: patch does not apply
  Patch failed at 0001 LU-6684 lfsck: stop lfsck even if some servers offline
  When you have resolved this problem run "git am --resolved".
  If you would prefer to skip this patch, instead run "git am --skip".
  To restore the original branch and stop patching run "git am --abort".
  previous rebase directory /home/soakbuilder/repos/lustre-release/.git/rebase-apply still exists but mbox given.

Patch details:

  [soakbuilder@lhn lustre-release]$ ls -1 /scratch/rpms/20160126/patches/
  001-LU-0000_dne_dne_llog_fixes-PatchSet39.patch
  002-LU-6684_lfsck_stop_lfsck_even_if_some_servers_offline-PatchSet6.patch
  003-LU-6684_lfsck_set_the_lfsck_notify_as_interruptable-PatchSet3.patch
  004-LU-7680_mdd_put_migrated_object_on_the_orphan_list

Status of the master branch used to create the sub-branch:

  [soakbuilder@lhn lustre-release]$ git describe ; git log | head -1
  2.7.65-38-g607f691
  commit 607f6919ea67b101796630d4b55649a12ea0e859 |
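One possible way to recover from the failed git am and pull the remaining changes straight from Gerrit instead of the exported mbox files. This is only a sketch; the Gerrit project path (fs/lustre-release) and the patch-set number are assumptions that need to be checked against the review pages:

  # Clean up the git am state left behind in .git/rebase-apply
  git am --abort

  # Fetch a change directly from Gerrit and cherry-pick it
  # (refs/changes/<last-2-digits-of-change>/<change>/<patch-set>; '7' is a placeholder)
  git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/82/18082/7 \
      && git cherry-pick FETCH_HEAD

  # Check whether http://review.whamcloud.com/17032/ (LU-6684) already landed on this branch
  git log --oneline | grep -i 'LU-6684'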
| Comment by nasf (Inactive) [ 26/Jan/16 ] |
|
The patch http://review.whamcloud.com/17032/ has already landed on the latest master branch. If you are working on the latest master, please apply the 2nd and 3rd patches directly. |
| Comment by Joseph Gmitter (Inactive) [ 27/Jan/16 ] |
|
In triage today it was reported that, after more failures were seen, further work on the patch is needed. nasf is actively looking at it. |
| Comment by nasf (Inactive) [ 29/Jan/16 ] |
|
The patch http://review.whamcloud.com/#/c/18082/ was verified on lola today and works, but there are some things that can be improved. |
| Comment by nasf (Inactive) [ 29/Jan/16 ] |
|
The patch http://review.whamcloud.com/#/c/18082/ has been improved to handle the trouble seen on lola more properly. |
| Comment by nasf (Inactive) [ 29/Jan/16 ] |
|
This is another failure instance of |
| Comment by James A Simmons [ 29/Jan/16 ] |
|
Since this ticket, which was a blocker, is a duplicate of |