  1. Lustre
  2. LU-8863

LFSCK fails to complete, node cannot recover after LFSCK aborted.


Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 2.9.0, Lustre 2.10.0
    • Soak cluster lustre: 2.8.60_5_gcc5601d
    • 3

    Description

      LFSCK fails to complete on lola-8, MDT0000:

      lctl lfsck_start -M soaked-MDT0000 -s 1000 -t namespace
      2016-11-21 13:21:36,440:fsmgmt.fsmgmt:INFO     lfsck started on lola-8
      2016-11-21 13:21:52,069:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 15s
      2016-11-21 13:22:22,672:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 45s
      2016-11-21 13:23:23,898:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 105s
      2016-11-21 13:25:26,280:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 225s
      2016-11-21 13:29:31,072:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 465s
      2016-11-21 13:37:40,601:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 945s
      2016-11-21 13:53:59,778:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 1905s
      2016-11-21 14:26:38,226:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 3825s
      2016-11-21 15:31:55,117:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 7665s
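
The "after Ns" values in the log above are consistent with a monitor that doubles its wait between status checks, starting from 15 seconds. The sketch below reproduces that schedule. The function name `poll_schedule` and the doubling rule are assumptions inferred from the logged values, not taken from the fsmgmt source.

```python
def poll_schedule(first_delay=15, checks=9):
    """Return the cumulative elapsed seconds at each status check.

    Assumes the wait before each check doubles, starting at first_delay.
    """
    elapsed = 0
    delay = first_delay
    schedule = []
    for _ in range(checks):
        elapsed += delay          # time at which this check fires
        schedule.append(elapsed)
        delay *= 2                # back off: wait twice as long next time
    return schedule


if __name__ == "__main__":
    # Matches the logged values: 15, 45, 105, 225, 465, 945, 1905, 3825, 7665
    print(poll_schedule())
```

Under this assumption the last logged check (7665s) is the ninth, so the monitor had been waiting just over two hours when LFSCK was aborted.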
      

      I aborted LFSCK with lfsck_stop.
      LFSCK stopped, but clients and other servers were unable to reconnect.
      Example client:

      Lustre: Evicted from MGS (at MGC192.168.1.108@o2ib10_0) after server handle changed from 0xd9fafa0ca7b8e5dc to 0x732870fe43aa2fe7
      Lustre: MGC192.168.1.108@o2ib10: Connection restored to MGC192.168.1.108@o2ib10_0 (at 192.168.1.108@o2ib10)
      Lustre: Skipped 1 previous similar message
      LustreError: 183198:0:(lmv_obd.c:1402:lmv_statfs()) can't stat MDS #0 (soaked-MDT0000-mdc-ffff880426c9c000), error -4
      LustreError: 183198:0:(llite_lib.c:1736:ll_statfs_internal()) md_statfs fails: rc = -4
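
For readers triaging the client messages above, here is a minimal sketch that pulls the reporting function and return code out of each LustreError line and maps the code to a symbolic errno name. It assumes the codes are negated Linux errno values, in which case -4 corresponds to EINTR (interrupted system call). The regular expression and the `decode` helper are illustrative and are not part of any Lustre tool.

```python
import errno
import re

# The two LustreError lines quoted in this report.
LOG_LINES = [
    "LustreError: 183198:0:(lmv_obd.c:1402:lmv_statfs()) can't stat MDS #0 "
    "(soaked-MDT0000-mdc-ffff880426c9c000), error -4",
    "LustreError: 183198:0:(llite_lib.c:1736:ll_statfs_internal()) "
    "md_statfs fails: rc = -4",
]

# Matches "(file.c:LINE:func())" followed later by "error N" or "rc = N".
PATTERN = re.compile(
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\)"
    r".*?(?:error|rc =)\s*(?P<rc>-?\d+)"
)


def decode(line):
    """Return (function, rc, errno_name) for a LustreError line, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    rc = int(match.group("rc"))
    # Assumption: rc is a negated errno value, so negate it for the lookup.
    name = errno.errorcode.get(-rc, "unknown")
    return match.group("func"), rc, name


if __name__ == "__main__":
    for entry in LOG_LINES:
        print(decode(entry))
```

Both messages decode to the same code, which suggests the statfs request to MDT0000 was interrupted rather than rejected by the server.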
      

      The system appears to be wedged in this state; rebooting and remounting the lola-8 MDT does not fix the issue.
      I dumped the Lustre log on lola-8 while LFSCK was running; it is attached.
      Also, the lfsck_layout

            People

              laisiyao Lai Siyao
              cliffw Cliff White (Inactive)