[LU-8886] LFSCK failed to resume from the last checkpoint Created: 01/Dec/16  Updated: 23/Dec/16  Resolved: 23/Dec/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Minor
Reporter: nasf (Inactive) Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

LFSCK failed just after resuming from the last checkpoint. The log is as follows:

00100000:10000000:8.0:1480550565.633225:0:8251:0:(lfsck_lib.c:2573:lfsck_post_generic()) soaked-MDT0000-osd: waiting for assistant to do lfsck_layout post, rc = -61
00100000:10000000:8.0:1480550565.633247:0:8254:0:(lfsck_engine.c:1781:lfsck_assistant_engine()) soaked-MDT0000-osd: LFSCK assistant unknown status: rc = 0
00100000:10000000:8.0:1480550565.633252:0:8254:0:(lfsck_engine.c:1805:lfsck_assistant_engine()) soaked-MDT0000-osd: LFSCK assistant sync before exit
00100000:10000000:25.0:1480550565.633853:0:8253:0:(lfsck_engine.c:1805:lfsck_assistant_engine()) soaked-MDT0000-osd: LFSCK assistant sync before exit
00100000:10000000:8.0:1480550565.636178:0:8254:0:(lfsck_engine.c:1811:lfsck_assistant_engine()) soaked-MDT0000-osd: LFSCK assistant synced before exit: rc = 0
00100000:10000000:8.0:1480550565.636190:0:8254:0:(lfsck_engine.c:1838:lfsck_assistant_engine()) soaked-MDT0000-osd: lfsck_namespace LFSCK assistant thread exit: rc = 0
00100000:10000000:25.0:1480550565.650715:0:8253:0:(lfsck_engine.c:1811:lfsck_assistant_engine()) soaked-MDT0000-osd: LFSCK assistant synced before exit: rc = 0
00100000:10000000:25.0:1480550565.650725:0:8253:0:(lfsck_engine.c:1838:lfsck_assistant_engine()) soaked-MDT0000-osd: lfsck_layout LFSCK assistant thread exit: rc = -61
00100000:10000000:24.0:1480550565.650761:0:8251:0:(lfsck_lib.c:2585:lfsck_post_generic()) soaked-MDT0000-osd: the assistant has done lfsck_layout post, rc = -61
00100000:10000000:24.0:1480550565.650810:0:8251:0:(lfsck_layout.c:4680:lfsck_layout_master_post()) soaked-MDT0000-osd: layout LFSCK master post done: rc = 0
00100000:10000000:24.0:1480550565.650813:0:8251:0:(lfsck_lib.c:2573:lfsck_post_generic()) soaked-MDT0000-osd: waiting for assistant to do lfsck_namespace post, rc = -61
00100000:10000000:24.0:1480550565.650815:0:8251:0:(lfsck_lib.c:2585:lfsck_post_generic()) soaked-MDT0000-osd: the assistant has done lfsck_namespace post, rc = -61
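
Every message in the log above ends with an "rc = N" field, and the failing ones carry rc = -61, which corresponds to -ENODATA in the Linux errno numbering. As a reading aid only, here is a minimal sketch of pulling that value out of a log line. The helper name parse_rc is hypothetical and is not part of Lustre or its tooling.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not a Lustre API): find the "rc = N" field in a
 * debug log line and store N in *rc.
 * Returns 0 on success, -1 if the line has no parsable "rc = " field.
 */
static int parse_rc(const char *line, int *rc)
{
	const char *field;

	assert(line != NULL && rc != NULL);

	field = strstr(line, "rc = ");
	if (field == NULL)
		return -1;

	/* sscanf returns the number of converted items; we expect one. */
	return sscanf(field, "rc = %d", rc) == 1 ? 0 : -1;
}
```

Called on the lfsck_layout "assistant thread exit" line, this would store -61 in *rc; a line without an rc field, such as "LFSCK assistant sync before exit", makes it return -1.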


 Comments   
Comment by Gerrit Updater [ 01/Dec/16 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/24056
Subject: LU-8886 lfsck: handle -ENODATA for the end of iteration
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f6a09af4d58f7d49e0988a9a8e595b1b152d972b
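
The patch subject suggests that -ENODATA marks the end of iteration and should not be reported as a failure. The sketch below illustrates that general pattern under stated assumptions; it is not the code from the patch. The names demo_iter, demo_iter_next and demo_scan are hypothetical, and the iterator walks a plain in-memory array, not any Lustre structure.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical iterator over an in-memory array; pos acts as the checkpoint. */
struct demo_iter {
	const int *items;
	size_t     count;
	size_t     pos;
};

/* Returns 0 and stores the next item, or -ENODATA when nothing is left. */
static int demo_iter_next(struct demo_iter *it, int *out)
{
	assert(it != NULL && out != NULL);

	if (it->pos >= it->count)
		return -ENODATA;

	*out = it->items[it->pos++];
	return 0;
}

/*
 * Walks the iterator from its current position to the end and sums the
 * items. -ENODATA from demo_iter_next() is treated as "end of iteration"
 * and mapped to 0; any other negative value is passed back as an error.
 */
static int demo_scan(struct demo_iter *it, long *sum)
{
	int item;
	int rc;

	*sum = 0;
	while ((rc = demo_iter_next(it, &item)) == 0)
		*sum += item;

	return rc == -ENODATA ? 0 : rc;
}
```

With this mapping, a scan resumed from a checkpoint that already sits at the end of the data (pos equal to count) returns 0 immediately, where an unmapped -ENODATA would be propagated as an error.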

Comment by Cliff White (Inactive) [ 02/Dec/16 ]

Tried the patch; it does not appear to be working. The first time lfsck started, the node crashed:

Dec  2 09:56:17 lola-8 kernel: LustreError: 14454:0:(lfsck_namespace.c:4492:lfsck_namespace_double_scan()) ASSERTION( list_empty(&lad->lad_req_list) ) failed:
Dec  2 09:56:17 lola-8 kernel: LustreError: 14454:0:(lfsck_namespace.c:4492:lfsck_namespace_double_scan()) LBUG
Dec  2 09:56:17 lola-8 kernel: Pid: 14454, comm: lfsck
Dec  2 09:56:17 lola-8 kernel:
Dec  2 09:56:17 lola-8 kernel: Call Trace:
Dec  2 09:56:17 lola-8 kernel: [<ffffffffa081e875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Dec  2 09:56:17 lola-8 kernel: [<ffffffffa081e9bf>] lbug_with_loc+0x3f/0x90 [libcfs]
Dec  2 09:56:17 lola-8 kernel: [<ffffffffa11d1c2e>] lfsck_namespace_double_scan+0xee/0x120 [lfsck]
Dec  2 09:56:17 lola-8 kernel: [<ffffffffa11cee40>] lfsck_master_engine+0x590/0x1460 [lfsck]
Dec  2 09:56:17 lola-8 kernel: [<ffffffff81067650>] ? default_wake_function+0x0/0x20
Dec  2 09:56:17 lola-8 kernel: [<ffffffffa11ce8b0>] ? lfsck_master_engine+0x0/0x1460 [lfsck]
Dec  2 09:56:17 lola-8 kernel: [<ffffffff810a138e>] kthread+0x9e/0xc0
Dec  2 09:56:17 lola-8 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
Dec  2 09:56:17 lola-8 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0
Dec  2 09:56:17 lola-8 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20

Comment by nasf (Inactive) [ 03/Dec/16 ]

The new failure looks like LU-8647. Have you applied the patch for LU-8647 (https://jira.hpdd.intel.com/browse/LU-8647) in the test?

Comment by Cliff White (Inactive) [ 05/Dec/16 ]

No, has that patch landed on master? If not, why not?

Comment by James Nunez (Inactive) [ 05/Dec/16 ]

The fix for LU-8647 landed on master in October under ticket LU-8569 at https://review.whamcloud.com/#/c/22723/
(at least, that is what ticket LU-8647 states).

Comment by nasf (Inactive) [ 06/Dec/16 ]

Cliff wrote: "No, has that patch landed on master? If not, why not?"

Cliff,

As James mentioned, that patch is already in master, but I am not sure whether the build you are testing on Lola contains it.

Comment by Gerrit Updater [ 23/Dec/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/24056/
Subject: LU-8886 lfsck: handle -ENODATA for the end of iteration
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8ff4556055edf496a0d23cf35fd0d63619143363

Generated at Sat Feb 10 02:21:24 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.