[LU-7662] lfsck doesn't complete Created: 13/Jan/16  Updated: 29/Jan/16  Resolved: 29/Jan/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Blocker
Reporter: Frank Heckes (Inactive) Assignee: nasf (Inactive)
Resolution: Duplicate Votes: 0
Labels: soak
Environment:

lola
build: master, 2.7.64-81-g6fc8da4, 6fc8da41f2ff5156639e89f379adcdbb73ac8567


Attachments: File console-lola-11.log.bz2     File console-lola-9.log.bz2     File lfsck-info.txt.bz2     File lu-7662-lola-11-1452785464.17420-lustre-log     File messages-lola-11.log.bz2     File messages-lola-9.log.bz2     File vmcore-dmesg.txt.bz2    
Issue Links:
Duplicate
duplicates LU-6684 lctl lfsck_stop hangs Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

An error occurred during an lfsck run on the soak file system using build '20160108' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160108).
DNE is enabled.

  • lfsck started on MDS hosting mdt-0:
    [root@lola-8 ~]# date; lctl lfsck_start -M soaked-MDT0000 -s 1000 -t all -A ; date
    Wed Jan 13 04:42:28 PST 2016
    Started LFSCK on the device soaked-MDT0000: scrub layout namespace
    Wed Jan 13 04:42:28 PST 2016
    

    No soak test was running

  • lfsck_namespace does not complete the scanning-phase2 phase (see the status-query sketch after this list)
  • MDSes lola-9 and lola-11 showed an increasing number of blocked mdt_out* threads
  • Triggering a stack trace dump led to a kernel panic on lola-11 (2016-01-13-08:15:22)
  • All MDSes showed only minimal utilization of system resources
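For reference, the per-MDT LFSCK state can be queried with lctl. Below is a minimal sketch of such a status check; the soaked-MDT0000 device name is taken from the lfsck_start command above, and the exact field names inside the proc file ("status", "phase") are assumptions that may vary between Lustre versions.

# Query the namespace LFSCK state of one MDT on its MDS
lctl get_param -n mdd.soaked-MDT0000.lfsck_namespace | egrep 'status|phase'

# Loop over every MDT hosted on this MDS
for p in $(lctl list_param mdd.*.lfsck_namespace); do
    echo "== $p =="
    lctl get_param -n "$p" | egrep 'status|phase'
done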

Attached files:

  • console and messages files of lola-9 and lola-11, containing stack trace information
  • vmcore-dmesg.txt of lola-11
  • lfsck status information of all MDTs

See the next comment for the crash file location.



 Comments   
Comment by Frank Heckes (Inactive) [ 13/Jan/16 ]

Crash file has been saved to lola-1:/scratch/crashdumps/lu-7662/lola-11-127.0.0.1-2016-01-13-08:15:22

Comment by Frank Heckes (Inactive) [ 14/Jan/16 ]

I tried to stop lfsck by running

[root@lola-8 ~]#  lctl lfsck_stop -M soaked-MDT0000 -A

on the node hosting mdt-0.
The operation has been stalled for hours:

----------------
lola-8
----------------
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      30223  0.1  0.0      0     0 ?        S    Jan13   2:32 [lfsck]
root      30226  0.0  0.0      0     0 ?        S    Jan13   1:04 [lfsck_namespace]
root      30227  0.1  0.0      0     0 ?        S    Jan13   2:48 [lfsck]
root      30230  0.0  0.0      0     0 ?        S    Jan13   1:30 [lfsck_namespace]
root      42585  0.0  0.0  11140   684 pts/0    S+   04:52   0:00 lctl lfsck_stop -M soaked-MDT0000 -A
----------------
lola-9
----------------
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      15157  0.1  0.0      0     0 ?        S    Jan13   2:33 [lfsck]
root      15158  0.1  0.0      0     0 ?        S    Jan13   2:31 [lfsck]
root      15163  0.0  0.0      0     0 ?        S    Jan13   1:11 [lfsck_namespace]
root      15164  0.0  0.0      0     0 ?        S    Jan13   0:53 [lfsck_namespace]
----------------
lola-10
----------------
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      19219  0.1  0.0      0     0 ?        S    Jan13   2:30 [lfsck]
root      19220  0.1  0.0      0     0 ?        S    Jan13   2:40 [lfsck]
root      19225  0.0  0.0      0     0 ?        S    Jan13   0:59 [lfsck_namespace]
root      19226  0.0  0.0      0     0 ?        S    Jan13   0:57 [lfsck_namespace]
----------------
lola-11
----------------
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       6170  0.0  0.0      0     0 ?        S    Jan13   0:00 [lfsck]
root       6172  0.0  0.0      0     0 ?        S    Jan13   0:09 [lfsck_namespace]

A huge number of mdt_out threads are blocked on lola-9 and lola-11:

[root@lola-16 soaked]# pdsh -g mds 'uptime' | dshbak -c
----------------
lola-8
----------------
 07:47:57 up 3 days,  6:18,  1 user,  load average: 1.47, 1.36, 1.33
----------------
lola-9
----------------
 07:47:57 up 4 days, 20:49,  0 users,  load average: 198.46, 198.44, 198.39
----------------
lola-10
----------------
 07:47:57 up 5 days,  7:52,  0 users,  load average: 0.52, 0.44, 0.37
----------------
lola-11
----------------
 07:47:57 up 23:25,  1 user,  load average: 195.07, 195.23, 194.51

---------------------------------------------------------------------

[root@lola-11 crash]# uptime
 07:38:43 up 23:15,  1 user,  load average: 195.39, 194.76, 193.83
[root@lola-11 crash]# ps aux | grep 'D ' | grep -v grep | wc -l
196
[root@lola-11 crash]# ps aux | grep 'D ' | grep -v grep | grep 'mdt_out' | wc -l
196
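To see where the blocked threads are stuck without a full sysrq task dump (which panicked lola-11 earlier), the kernel stack of each D-state mdt_out thread can be read from procfs. A minimal sketch, assuming /proc/<pid>/stack is enabled in the running kernel:

# Dump the kernel stack of every mdt_out* thread in uninterruptible sleep
for pid in $(ps -eo pid,stat,comm | awk '$2 ~ /^D/ && $3 ~ /^mdt_out/ {print $1}'); do
    echo "== mdt_out thread $pid =="
    cat /proc/"$pid"/stack
done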

A typical dump file (see the attached debug log) contains a continuous stream of CONN -> DISCONN and DISCONN -> CONN messages.
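The flapping can also be confirmed from syslog on the affected MDSes. A minimal check, assuming the usual Lustre "Connection restored" / "was lost" console messages end up in /var/log/messages:

# Count connect/disconnect churn on an MDS
grep -c "Connection restored" /var/log/messages
grep -c "was lost" /var/log/messages

# Show the most recent flapping events with timestamps
grep -E "Connection restored|was lost" /var/log/messages | tail -20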

Comment by nasf (Inactive) [ 26/Jan/16 ]

This ticket covers several server issues. Before the LFSCK run there had been a lot of DNE tests on the lola cluster, so the consistency of the file system is unknown.

According to the log lfsck-info.txt.bz2, there were some lfsck failures, but without further logs we do not know what happened (the related LFSCK debug logs have been overwritten).

In the log console-lola-9.log.bz2, the namespace LFSCK was in the double-scan phase to verify the linkEA of some remote directory. For that, the namespace LFSCK needs to locate the remote parent via an OUT RPC, and such an RPC is uninterruptible. If the remote MDT did not handle the RPC (or the RPC handler was blocked on the remote MDT), then lfsck_stop would hang there. That is why "lfsck_stop" did not work. This is a known issue and has been resolved by the patches http://review.whamcloud.com/17032/ and http://review.whamcloud.com/#/c/18082/. These two patches allow the LFSCK to be stopped even if the related RPCs are blocked.

lfsck_namespa S 0000000000000009     0 15163      2 0x00000080^M
 ffff880336279540 0000000000000046 0000000000000000 000000005696abdc^M
 0000000000000000 0000000000000000 00013ca15062f3e9 ffffffffa0a77805^M
 ffff880700000018 0000000114be3200 ffff88033d7ec5f8 ffff880336279fd8^M
Call Trace:^M
 [<ffffffffa0a77805>] ? ptl_send_rpc+0x685/0xea0 [ptlrpc]^M
 [<ffffffff8152b222>] schedule_timeout+0x192/0x2e0^M
 [<ffffffff81087540>] ? process_timeout+0x0/0x10^M
 [<ffffffffa0a70cd9>] ptlrpc_set_wait+0x319/0xa20 [ptlrpc]^M
 [<ffffffff81064c00>] ? default_wake_function+0x0/0x20^M
 [<ffffffffa0a7d385>] ? lustre_msg_set_jobid+0xf5/0x130 [ptlrpc]^M
 [<ffffffffa0a71461>] ptlrpc_queue_wait+0x81/0x220 [ptlrpc]^M
 [<ffffffffa138e101>] osp_remote_sync+0x121/0x190 [osp]^M
 [<ffffffffa1370cd8>] osp_attr_get+0x428/0x6e0 [osp]^M
 [<ffffffffa13726f7>] osp_object_init+0x1c7/0x330 [osp]^M
 [<ffffffffa0853648>] lu_object_alloc+0xd8/0x320 [obdclass]^M
 [<ffffffffa0854a31>] lu_object_find_try+0x151/0x260 [obdclass]^M
 [<ffffffffa0854bf1>] lu_object_find_at+0xb1/0xe0 [obdclass]^M
 [<ffffffffa0cf8b73>] ? fld_server_lookup+0x53/0x330 [fld]^M
 [<ffffffffa0854c5f>] lu_object_find_slice+0x1f/0x80 [obdclass]^M
 [<ffffffffa1101220>] lfsck_namespace_dsd_single+0x200/0xd50 [lfsck]^M
 [<ffffffffa1106406>] lfsck_namespace_double_scan_dir+0x6d6/0xe40 [lfsck]^M
 [<ffffffffa1106ec4>] lfsck_namespace_double_scan_one+0x354/0x1330 [lfsck]^M
 [<ffffffffa0854bf1>] ? lu_object_find_at+0xb1/0xe0 [obdclass]^M
 [<ffffffffa110845d>] lfsck_namespace_double_scan_one_trace_file+0x5bd/0x8d0 [lfsck]^M
 [<ffffffffa110c643>] lfsck_namespace_assistant_handler_p2+0x3b3/0x1830 [lfsck]^M
 [<ffffffff81087540>] ? process_timeout+0x0/0x10^M
 [<ffffffffa10ed913>] lfsck_assistant_engine+0x1633/0x2010 [lfsck]^M
 [<ffffffff81064c00>] ? default_wake_function+0x0/0x20^M
 [<ffffffffa10ec2e0>] ? lfsck_assistant_engine+0x0/0x2010 [lfsck]^M
 [<ffffffff8109e78e>] kthread+0x9e/0xc0^M
 [<ffffffff8100c28a>] child_rip+0xa/0x20^M
 [<ffffffff8109e6f0>] ? kthread+0x0/0xc0^M
 [<ffffffff8100c280>] ? child_rip+0x0/0x20^M

Another problem is that a lot of "mdt_outxxx" threads were hung on the MDT with the following stack:

[<ffffffffa0866b7d>] lu_object_find_at+0x3d/0xe0 [obdclass]
[<ffffffff81064c00>] ? default_wake_function+0x0/0x20
[<ffffffffa0867a9c>] dt_locate_at+0x1c/0xa0 [obdclass]
[<ffffffffa0b01e2e>] out_handle+0x105e/0x19a0 [ptlrpc]
[<ffffffff8105872d>] ? check_preempt_curr+0x6d/0x90
[<ffffffff8152b83e>] ? mutex_lock+0x1e/0x50
[<ffffffffa0af132a>] ? req_can_reconstruct+0x6a/0x120 [ptlrpc]
[<ffffffffa0af8bbc>] tgt_request_handle+0x8ec/0x1470 [ptlrpc]
[<ffffffffa0aa0231>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
[<ffffffff8152a39e>] ? thread_return+0x4e/0x7d0
[<ffffffffa0a9f3f0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
[<ffffffff8109e78e>] kthread+0x9e/0xc0
[<ffffffff8100c28a>] child_rip+0xa/0x20
[<ffffffff8109e6f0>] ? kthread+0x0/0xc0
[<ffffffff8100c280>] ? child_rip+0x0/0x20

The log shows that all the "mdt_outxxx" threads were blocked in lu_object_find_at(). The related OUT RPCs were triggered by the namespace LFSCK on the remote MDT to locate the remote parent object. It must be that the parent object was in the cache and marked as dying, but it is unknown (no logs) who was destroying the parent object and why it was not purged out of the cache after the destroy.

Di, is there any known issue with destroying objects under DNE mode?

One suspected cause is LU-7680. So please try with the following three patches for further verification. Thanks!
1) http://review.whamcloud.com/17032/
2) http://review.whamcloud.com/#/c/18082/
3) http://review.whamcloud.com/#/c/18032/

Comment by Frank Heckes (Inactive) [ 26/Jan/16 ]

Ok. I'll prepare a build together with Di's latest DNE patches contained in http://review.whamcloud.com/#/c/16838/39

Comment by Frank Heckes (Inactive) [ 26/Jan/16 ]

Hm, the first patch can't be applied:

[soakbuilder@lhn lustre-release]$ for i in /scratch/rpms/20160126/patches/*.patch; do git am $i; done
Applying: LU-0000 dne: dne llog fixes
warning: lustre/tests/conf-sanity.sh has type 100755, expected 100644
Applying: LU-6684 lfsck: stop lfsck even if some servers offline
error: patch failed: lustre/include/lustre_net.h:605
error: lustre/include/lustre_net.h: patch does not apply
error: patch failed: lustre/include/obd_support.h:557
error: lustre/include/obd_support.h: patch does not apply
error: patch failed: lustre/lfsck/lfsck_engine.c:1577
error: lustre/lfsck/lfsck_engine.c: patch does not apply
error: patch failed: lustre/lfsck/lfsck_internal.h:817
error: lustre/lfsck/lfsck_internal.h: patch does not apply
error: patch failed: lustre/lfsck/lfsck_layout.c:3248
error: lustre/lfsck/lfsck_layout.c: patch does not apply
error: patch failed: lustre/lfsck/lfsck_lib.c:31
error: lustre/lfsck/lfsck_lib.c: patch does not apply
error: patch failed: lustre/lfsck/lfsck_namespace.c:3931
error: lustre/lfsck/lfsck_namespace.c: patch does not apply
error: patch failed: lustre/obdclass/obd_mount_server.c:477
error: lustre/obdclass/obd_mount_server.c: patch does not apply
error: patch failed: lustre/osp/osp_trans.c:454
error: lustre/osp/osp_trans.c: patch does not apply
error: patch failed: lustre/ptlrpc/client.c:1661
error: lustre/ptlrpc/client.c: patch does not apply
error: patch failed: lustre/tests/sanity-lfsck.sh:4291
error: lustre/tests/sanity-lfsck.sh: patch does not apply
Patch failed at 0001 LU-6684 lfsck: stop lfsck even if some servers offline
When you have resolved this problem run "git am --resolved".
If you would prefer to skip this patch, instead run "git am --skip".
To restore the original branch and stop patching run "git am --abort".
previous rebase directory /home/soakbuilder/repos/lustre-release/.git/rebase-apply still exists but mbox given.

Patch details:

[soakbuilder@lhn lustre-release]$ ls -1 /scratch/rpms/20160126/patches/
001-LU-0000_dne_dne_llog_fixes-PatchSet39.patch
002-LU-6684_lfsck_stop_lfsck_even_if_some_servers_offline-PatchSet6.patch
003-LU-6684_lfsck_set_the_lfsck_notify_as_interruptable-PatchSet3.patch
004-LU-7680_mdd_put_migrated_object_on_the_orphan_list

Status of the master branch used to create the sub-branch:

[soakbuilder@lhn lustre-release]$ git describe ; git log | head -1
2.7.65-38-g607f691
commit 607f6919ea67b101796630d4b55649a12ea0e859

Comment by nasf (Inactive) [ 26/Jan/16 ]

The patch http://review.whamcloud.com/17032/ has already landed on the latest master branch. If you are working on the latest master, then please apply only the 2nd and 3rd patches directly.
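A minimal sketch of how that can be verified locally before re-applying, assuming the landed commit kept the "LU-6684 lfsck: stop lfsck even if some servers offline" subject and that files 003/004 in the patches directory correspond to the 2nd and 3rd review links; the leftover rebase-apply state from the failed run also has to be cleared first:

# Clear the stale state left behind by the failed 'git am' run
git am --abort

# Check whether the stop-lfsck fix has already landed on the local master
git log --oneline --grep='stop lfsck even if some servers offline' master

# If it shows up, apply only the remaining two patches
git am /scratch/rpms/20160126/patches/003-LU-6684_lfsck_set_the_lfsck_notify_as_interruptable-PatchSet3.patch
git am /scratch/rpms/20160126/patches/004-LU-7680_mdd_put_migrated_object_on_the_orphan_list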

Comment by Joseph Gmitter (Inactive) [ 27/Jan/16 ]

In triage today, it was reported that further work on the patch is needed after experiencing more failures. nasf is actively looking at it.

Comment by nasf (Inactive) [ 29/Jan/16 ]

The patch http://review.whamcloud.com/#/c/18082/ has been verified on lola today and works. But there are some things that can be improved.

Comment by nasf (Inactive) [ 29/Jan/16 ]

The patch http://review.whamcloud.com/#/c/18082/ has been improved to handle the lola trouble more properly.

Comment by nasf (Inactive) [ 29/Jan/16 ]

This is another failure instance of LU-6684.

Comment by James A Simmons [ 29/Jan/16 ]

Since this ticket, which was a blocker, is a duplicate of LU-6684, shouldn't LU-6684 be marked as a blocker then?
