LU-8886: LFSCK failed to resume from the last checkpoint

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.10.0
    • 3

    Description

      LFSCK fails immediately after resuming from the last checkpoint. The log is as follows:

      00100000:10000000:8.0:1480550565.633225:0:8251:0:(lfsck_lib.c:2573:lfsck_post_generic()) soaked-MDT0000-osd: waiting for assistant to do lfsck_layout post, rc = -61
      00100000:10000000:8.0:1480550565.633247:0:8254:0:(lfsck_engine.c:1781:lfsck_assistant_engine()) soaked-MDT0000-osd: LFSCK assistant unknown status: rc = 0
      00100000:10000000:8.0:1480550565.633252:0:8254:0:(lfsck_engine.c:1805:lfsck_assistant_engine()) soaked-MDT0000-osd: LFSCK assistant sync before exit
      00100000:10000000:25.0:1480550565.633853:0:8253:0:(lfsck_engine.c:1805:lfsck_assistant_engine()) soaked-MDT0000-osd: LFSCK assistant sync before exit
      00100000:10000000:8.0:1480550565.636178:0:8254:0:(lfsck_engine.c:1811:lfsck_assistant_engine()) soaked-MDT0000-osd: LFSCK assistant synced before exit: rc = 0
      00100000:10000000:8.0:1480550565.636190:0:8254:0:(lfsck_engine.c:1838:lfsck_assistant_engine()) soaked-MDT0000-osd: lfsck_namespace LFSCK assistant thread exit: rc = 0
      00100000:10000000:25.0:1480550565.650715:0:8253:0:(lfsck_engine.c:1811:lfsck_assistant_engine()) soaked-MDT0000-osd: LFSCK assistant synced before exit: rc = 0
      00100000:10000000:25.0:1480550565.650725:0:8253:0:(lfsck_engine.c:1838:lfsck_assistant_engine()) soaked-MDT0000-osd: lfsck_layout LFSCK assistant thread exit: rc = -61
      00100000:10000000:24.0:1480550565.650761:0:8251:0:(lfsck_lib.c:2585:lfsck_post_generic()) soaked-MDT0000-osd: the assistant has done lfsck_layout post, rc = -61
      00100000:10000000:24.0:1480550565.650810:0:8251:0:(lfsck_layout.c:4680:lfsck_layout_master_post()) soaked-MDT0000-osd: layout LFSCK master post done: rc = 0
      00100000:10000000:24.0:1480550565.650813:0:8251:0:(lfsck_lib.c:2573:lfsck_post_generic()) soaked-MDT0000-osd: waiting for assistant to do lfsck_namespace post, rc = -61
      00100000:10000000:24.0:1480550565.650815:0:8251:0:(lfsck_lib.c:2585:lfsck_post_generic()) soaked-MDT0000-osd: the assistant has done lfsck_namespace post, rc = -61
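
      On Linux, rc = -61 in the log above is -ENODATA. The patch subject in this ticket ("handle -ENODATA for the end of iteration") suggests that the iterator returns this code once it has nothing left to scan, and that the caller should treat it as a normal end of scan rather than as a failure. The following is only a minimal sketch of that idea, not the actual Lustre change: `demo_iter`, `demo_iter_next()` and `demo_scan()` are hypothetical names invented for illustration.

```c
#include <errno.h>

/* Hypothetical iterator over a fixed number of entries; not Lustre code. */
struct demo_iter {
	int pos;	/* current position, e.g. restored from a checkpoint */
	int count;	/* total number of entries */
};

/* Returns 0 and advances, or -ENODATA when there is nothing left. */
static int demo_iter_next(struct demo_iter *it)
{
	if (it->pos >= it->count)
		return -ENODATA;
	it->pos++;
	return 0;
}

/*
 * Walk the iterator from its current position.
 * -ENODATA from demo_iter_next() means the scan reached the end,
 * so it is mapped to success instead of being reported as a failure.
 * Any other negative value is passed back to the caller unchanged.
 */
static int demo_scan(struct demo_iter *it, int *visited)
{
	int rc;

	*visited = 0;
	while ((rc = demo_iter_next(it)) == 0)
		(*visited)++;

	if (rc == -ENODATA)
		rc = 0;
	return rc;
}
```

      In this sketch, a scan resumed at a position that is already at the end (`pos == count`) returns 0 with zero entries visited, instead of propagating -ENODATA to the caller as seen in the log.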
      

        Activity

          gerrit Gerrit Updater added a comment:

          Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/24056/
          Subject: LU-8886 lfsck: handle -ENODATA for the end of iteration
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 8ff4556055edf496a0d23cf35fd0d63619143363

          yong.fan nasf (Inactive) added a comment:

          "No, has that patch landed on master? If not, why not?"

          Cliff,

          As James mentioned, that patch is already in master. But I am not sure whether the build you are testing on Lola contains it.

          jamesanunez James Nunez (Inactive) added a comment:

          LU-8647 landed to master in October under the ticket number LU-8569 at https://review.whamcloud.com/#/c/22723/
          (... that's what is stated in the ticket LU-8647)

          cliffw Cliff White (Inactive) added a comment:

          No, has that patch landed on master? If not, why not?

          yong.fan nasf (Inactive) added a comment:

          The new failure looks like LU-8647. Have you applied the patch from https://jira.hpdd.intel.com/browse/LU-8647 in the test?

          cliffw Cliff White (Inactive) added a comment:

          Tried the patch; it does not appear to be working. The first time lfsck started, the node crashed:

          Dec  2 09:56:17 lola-8 kernel: LustreError: 14454:0:(lfsck_namespace.c:4492:lfsck_namespace_double_scan()) ASSERTION( list_empty(&lad->lad_req_list) ) failed:
          Dec  2 09:56:17 lola-8 kernel: LustreError: 14454:0:(lfsck_namespace.c:4492:lfsck_namespace_double_scan()) LBUG
          Dec  2 09:56:17 lola-8 kernel: Pid: 14454, comm: lfsck
          Dec  2 09:56:17 lola-8 kernel:
          Dec  2 09:56:17 lola-8 kernel: Call Trace:
          Dec  2 09:56:17 lola-8 kernel: [<ffffffffa081e875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
          Dec  2 09:56:17 lola-8 kernel: [<ffffffffa081e9bf>] lbug_with_loc+0x3f/0x90 [libcfs]
          Dec  2 09:56:17 lola-8 kernel: [<ffffffffa11d1c2e>] lfsck_namespace_double_scan+0xee/0x120 [lfsck]
          Dec  2 09:56:17 lola-8 kernel: [<ffffffffa11cee40>] lfsck_master_engine+0x590/0x1460 [lfsck]
          Dec  2 09:56:17 lola-8 kernel: [<ffffffff81067650>] ? default_wake_function+0x0/0x20
          Dec  2 09:56:17 lola-8 kernel: [<ffffffffa11ce8b0>] ? lfsck_master_engine+0x0/0x1460 [lfsck]
          Dec  2 09:56:17 lola-8 kernel: [<ffffffff810a138e>] kthread+0x9e/0xc0
          Dec  2 09:56:17 lola-8 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
          Dec  2 09:56:17 lola-8 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0
          Dec  2 09:56:17 lola-8 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20

          gerrit Gerrit Updater added a comment:

          Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/24056
          Subject: LU-8886 lfsck: handle -ENODATA for the end of iteration
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: f6a09af4d58f7d49e0988a9a8e595b1b152d972b

          People

            yong.fan nasf (Inactive)
            yong.fan nasf (Inactive)