
LFSCK fails to complete, node cannot recover after LFSCK aborted.

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.9.0, Lustre 2.10.0
    • Environment: Soak cluster lustre: 2.8.60_5_gcc5601d
    • Severity: 3

    Description

      LFSCK fails to complete on lola-8, MDT0000:

      lctl lfsck_start -M soaked-MDT0000 -s 1000 -t namespace
      2016-11-21 13:21:36,440:fsmgmt.fsmgmt:INFO     lfsck started on lola-8
      2016-11-21 13:21:52,069:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 15s
      2016-11-21 13:22:22,672:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 45s
      2016-11-21 13:23:23,898:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 105s
      2016-11-21 13:25:26,280:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 225s
      2016-11-21 13:29:31,072:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 465s
      2016-11-21 13:37:40,601:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 945s
      2016-11-21 13:53:59,778:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 1905s
      2016-11-21 14:26:38,226:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 3825s
      2016-11-21 15:31:55,117:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 7665s
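
      The elapsed times in the log above (15s, 45s, 105s, ...) are consistent with a status check whose wait doubles after every poll, starting at 15 seconds. The fsmgmt source is not shown in this ticket, so the following is only a minimal sketch of such a schedule under that assumption; `poll_offsets` and its parameters are hypothetical names, not part of fsmgmt or Lustre.

```python
from itertools import islice


def poll_offsets(first_wait=15, factor=2):
    """Yield the cumulative seconds at which each status check would fire.

    Assumption: the wait before each check is `factor` times the previous
    wait, starting at `first_wait` seconds.
    """
    elapsed, wait = 0, first_wait
    while True:
        elapsed += wait
        yield elapsed
        wait *= factor


# First nine check times, for comparison with the "after Ns" values in the log.
offsets = list(islice(poll_offsets(), 9))
print(offsets)
```

      With this schedule the gap between checks grows quickly: the ninth check comes more than an hour after the eighth, which matches the spacing of the last two log entries.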
      

      I aborted LFSCK with lfsck_stop.
      The LFSCK stopped, but clients and other servers were not able to reconnect.
      Example client:

      Lustre: Evicted from MGS (at MGC192.168.1.108@o2ib10_0) after server handle changed from 0xd9fafa0ca7b8e5dc to 0x732870fe43aa2fe7
      Lustre: MGC192.168.1.108@o2ib10: Connection restored to MGC192.168.1.108@o2ib10_0 (at 192.168.1.108@o2ib10)
      Lustre: Skipped 1 previous similar message
      LustreError: 183198:0:(lmv_obd.c:1402:lmv_statfs()) can't stat MDS #0 (soaked-MDT0000-mdc-ffff880426c9c000), error -4
      LustreError: 183198:0:(llite_lib.c:1736:ll_statfs_internal()) md_statfs fails: rc = -4
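
      The `error -4` and `rc = -4` values in the client messages can be decoded, assuming they follow the usual Linux kernel convention of returning a negated errno. The helper below is a small sketch for that lookup; `describe_rc` is a hypothetical name, and the result reflects the errno table of the host running the script, so it should be run on Linux.

```python
import errno
import os


def describe_rc(rc):
    """Map a negative kernel-style return code to (symbol, message).

    Assumption: rc is a negated errno value, as in "rc = -4".
    """
    code = abs(rc)
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


symbol, message = describe_rc(-4)
print(symbol, "-", message)
```

      Under that assumption, -4 corresponds to EINTR, meaning the statfs call was interrupted rather than rejected by the MDS.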
      

      The system appears to be wedged in this state; rebooting and remounting the lola-8 MDT does not fix the issue.
      I dumped the Lustre log on lola-8 while LFSCK was running; it is attached.
      Also, the lfsck_layout

          Activity

            [LU-8863] LFSCK fails to complete, node cannot recover after LFSCK aborted.
            adilger Andreas Dilger made changes -
            Resolution New: Cannot Reproduce [ 5 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            pjones Peter Jones made changes -
            Link Original: This issue is duplicated by BULL-43 [ BULL-43 ]
            pjones Peter Jones made changes -
            Link New: This issue is duplicated by BULL-43 [ BULL-43 ]
            pjones Peter Jones made changes -
            Fix Version/s Original: Lustre 2.10.0 [ 12204 ]
            jamesanunez James Nunez (Inactive) made changes -
            Labels New: soak
            jamesanunez James Nunez (Inactive) made changes -
            Remote Link New: This issue links to "Page (HPDD Community Wiki)" [ 19709 ]
            jamesanunez James Nunez (Inactive) made changes -
            Affects Version/s New: Lustre 2.10.0 [ 12204 ]
            jgmitter Joseph Gmitter (Inactive) made changes -
            Fix Version/s New: Lustre 2.10.0 [ 12204 ]
            jamesanunez James Nunez (Inactive) made changes -
            Link New: This issue is related to LU-8647 [ LU-8647 ]
            pjones Peter Jones made changes -
            Assignee Original: WC Triage [ wc-triage ] New: Lai Siyao [ laisiyao ]
            cliffw Cliff White (Inactive) created issue -

            People

              Assignee: laisiyao Lai Siyao
              Reporter: cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 7
