  1. Lustre
  2. LU-8863

LFSCK fails to complete, node cannot recover after LFSCK aborted.

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: Lustre 2.9.0, Lustre 2.10.0
    • Fix Version/s: None
    • Labels:
    • Environment:
      Soak cluster lustre: 2.8.60_5_gcc5601d
    • Severity:
      3

      Description

      LFSCK fails to complete on lola-8, MDT0000:

                      lctl lfsck_start -M soaked-MDT0000 -s 1000 -t namespace
              fi
      fi
      2016-11-21 13:21:36,440:fsmgmt.fsmgmt:INFO     lfsck started on lola-8
      2016-11-21 13:21:52,069:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 15s
      2016-11-21 13:22:22,672:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 45s
      2016-11-21 13:23:23,898:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 105s
      2016-11-21 13:25:26,280:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 225s
      2016-11-21 13:29:31,072:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 465s
      2016-11-21 13:37:40,601:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 945s
      2016-11-21 13:53:59,778:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 1905s
      2016-11-21 14:26:38,226:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 3825s
      2016-11-21 15:31:55,117:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 7665s
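
The elapsed times in the log above (15s, 45s, 105s, ... 7665s) are consistent with a poller that doubles its wait after every check, starting at 15 seconds. The fsmgmt polling code itself is not shown in this report, so the following is only a sketch of that inferred schedule; the function name `poll_checkpoints` and its parameters are hypothetical.

```python
def poll_checkpoints(first_wait=15, polls=9):
    """Return cumulative elapsed seconds at each poll when the wait doubles.

    Assumption: the waits are 15s, 30s, 60s, ... as inferred from the
    "still in progress ... after Ns" log lines, not from the fsmgmt source.
    """
    elapsed = 0
    wait = first_wait
    checkpoints = []
    for _ in range(polls):
        elapsed += wait          # sleep for the current interval
        checkpoints.append(elapsed)
        wait *= 2                # double the interval before the next check
    return checkpoints


if __name__ == "__main__":
    # Reproduces the nine "after Ns" values reported in the log.
    print(poll_checkpoints())
```

With this schedule the next check after 7665s would not occur until 15345s, so a hang is only confirmed at increasingly coarse granularity.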
      

      I aborted LFSCK with lfsck_stop.
      LFSCK stopped, but clients and other servers were unable to reconnect.
      Example from a client:

      Lustre: Evicted from MGS (at MGC192.168.1.108@o2ib10_0) after server handle changed from 0xd9fafa0ca7b8e5dc to 0x732870fe43aa2fe7
      Lustre: MGC192.168.1.108@o2ib10: Connection restored to MGC192.168.1.108@o2ib10_0 (at 192.168.1.108@o2ib10)
      Lustre: Skipped 1 previous similar message
      LustreError: 183198:0:(lmv_obd.c:1402:lmv_statfs()) can't stat MDS #0 (soaked-MDT0000-mdc-ffff880426c9c000), error -4
      LustreError: 183198:0:(llite_lib.c:1736:ll_statfs_internal()) md_statfs fails: rc = -4
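
Lustre reports kernel-style return codes, which are negated errno values, so the `-4` in both LustreError lines corresponds to EINTR (interrupted system call) on Linux. The sketch below decodes such lines; the helper name `decode_rc` and the regular expression are illustrative and only cover the two message shapes seen above.

```python
import errno
import os
import re

# Matches a trailing "error -4" or "rc = -4", the two forms in the client log.
RC_PATTERN = re.compile(r"(?:error|rc =)\s*(-\d+)\s*$")


def decode_rc(line):
    """Return (rc, errno name, description) for a log line, or None."""
    match = RC_PATTERN.search(line)
    if match is None:
        return None
    rc = int(match.group(1))
    code = -rc  # kernel return codes are negated errno values
    name = errno.errorcode.get(code, "UNKNOWN")
    return rc, name, os.strerror(code)


if __name__ == "__main__":
    sample = ("LustreError: 183198:0:(llite_lib.c:1736:ll_statfs_internal()) "
              "md_statfs fails: rc = -4")
    print(decode_rc(sample))
```

An EINTR from statfs suggests the request to MDT0000 was interrupted rather than answered with an error by the server, though the log excerpt alone does not establish why.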
      

      The system appears to be wedged in this state; rebooting and remounting the lola-8 MDT does not fix the issue.
      I dumped the Lustre log on lola-8 while LFSCK was running; it is attached.
      Also, the lfsck_layout


              People

              • Assignee:
                Lai Siyao (laisiyao)
              • Reporter:
                Cliff White (cliffw, inactive)
              • Votes:
                0
              • Watchers:
                7
