Lustre / LU-8863

[LU-8863] LFSCK fails to complete, node cannot recover after LFSCK aborted.

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.9.0, Lustre 2.10.0
    • Environment: Soak cluster lustre: 2.8.60_5_gcc5601d
    • Severity: 3

    Description

      LFSCK fails to complete on lola-8, MDT0000:

      lctl lfsck_start -M soaked-MDT0000 -s 1000 -t namespace
      2016-11-21 13:21:36,440:fsmgmt.fsmgmt:INFO     lfsck started on lola-8
      2016-11-21 13:21:52,069:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 15s
      2016-11-21 13:22:22,672:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 45s
      2016-11-21 13:23:23,898:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 105s
      2016-11-21 13:25:26,280:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 225s
      2016-11-21 13:29:31,072:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 465s
      2016-11-21 13:37:40,601:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 945s
      2016-11-21 13:53:59,778:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 1905s
      2016-11-21 14:26:38,226:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 3825s
      2016-11-21 15:31:55,117:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 7665s
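
      For reference, the start-and-poll sequence above can be reproduced by hand; a minimal sketch, assuming the soaked filesystem name and that the commands are run on the MDS hosting MDT0000:

          # start a namespace LFSCK on MDT0000, rate-limited to 1000 objects/sec
          lctl lfsck_start -M soaked-MDT0000 -s 1000 -t namespace

          # poll the scan state; the scan is still running while status reports scanning-phase1/2
          lctl get_param mdd.soaked-MDT0000.lfsck_namespace | grep -E 'status|current_position'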
      

      I aborted LFSCK with lfsck_stop.
      LFSCK stopped, but clients and other servers were unable to reconnect.
      Example client:

      Lustre: Evicted from MGS (at MGC192.168.1.108@o2ib10_0) after server handle changed from 0xd9fafa0ca7b8e5dc to 0x732870fe43aa2fe7
      Lustre: MGC192.168.1.108@o2ib10: Connection restored to MGC192.168.1.108@o2ib10_0 (at 192.168.1.108@o2ib10)
      Lustre: Skipped 1 previous similar message
      LustreError: 183198:0:(lmv_obd.c:1402:lmv_statfs()) can't stat MDS #0 (soaked-MDT0000-mdc-ffff880426c9c000), error -4
      LustreError: 183198:0:(llite_lib.c:1736:ll_statfs_internal()) md_statfs fails: rc = -4
      

      The system appears to be wedged in this state; rebooting and remounting the lola-8 MDT does not fix the issue.
      I dumped the Lustre log on lola-8 while LFSCK was running; it is attached.
      Also attached is the lfsck_layout output.
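
      For reference, a minimal sketch of the abort step and of capturing the attached data, assuming the soaked-MDT0000 target and illustrative output paths:

          # abort the running LFSCK on MDT0000
          lctl lfsck_stop -M soaked-MDT0000

          # dump the kernel debug log on lola-8 (path is only an example)
          lctl dk > /tmp/lola-8-lustre-debug.log

          # capture the layout LFSCK state attached to this ticket
          lctl get_param mdd.soaked-MDT0000.lfsck_layout > /tmp/lola-8-lfsck_layout.txt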

      Attachments

        Issue Links

          Activity


            Hitting this again on latest master

            # lctl get_param mdd.*.lfsck_layout
            mdd.soaked-MDT0000.lfsck_layout=
            name: lfsck_layout
            magic: 0xb1734d76
            version: 2
            status: scanning-phase1
            flags:
            param:
            last_completed_time: 1485198303
            time_since_last_completed: 89193 seconds
            latest_start_time: 1485275978
            time_since_latest_start: 11518 seconds
            last_checkpoint_time: 1485287462
            time_since_last_checkpoint: 34 seconds
            latest_start_position: 77
            last_checkpoint_position: 769655424
            first_failure_position: 0
            success_count: 2
            repaired_dangling: 807
            repaired_unmatched_pair: 0
            repaired_multiple_referenced: 0
            repaired_orphan: 0
            repaired_inconsistent_owner: 2310456
            repaired_others: 0
            skipped: 0
            failed_phase1: 0
            failed_phase2: 0
            checked_phase1: 4033934
            checked_phase2: 0
            run_time_phase1: 11517 seconds
            run_time_phase2: 0 seconds
            average_speed_phase1: 350 items/sec
            average_speed_phase2: N/A
            real-time_speed_phase1: 246 items/sec
            real-time_speed_phase2: N/A
            current_position: 773325008
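
            As a sanity check, the counters above are self-consistent: checked_phase1 / run_time_phase1 = 4033934 / 11517 ≈ 350 items/sec, matching average_speed_phase1, so the scan is advancing, just slowly. A minimal sketch for confirming that current_position keeps moving rather than the scan being hung (same parameter as above):

                # re-sample the layout LFSCK state a few times; an increasing
                # current_position means the scan is progressing, not hung
                for i in 1 2 3; do
                    lctl get_param mdd.soaked-MDT0000.lfsck_layout | grep -E 'status|current_position'
                    sleep 30
                done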
            

            Logs

            ERROR    lfsck found errors lola-8/soaked-MDT0000: lf_repaired: 0
            ERROR    lfsck found errors lola-8/soaked-MDT0000: lf_repaired: 0
            ERROR    lfsck found errors lola-8/soaked-MDT0000: lf_repaired: 0
            ERROR    lfsck found errors lola-8/soaked-MDT0000: lf_repaired: 0
            
            cliffw Cliff White (Inactive) added a comment

            This is new to me; I am not sure how to check for DNE recovery as opposed to ordinary recovery. The filesystem recovery completes before we trigger LFSCK in soak.

            cliffw Cliff White (Inactive) added a comment
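
            One hedged way to check this: per-MDT recovery state, including the MDT-to-MDT (DNE) connections, is visible in recovery_status on each MDS; under DNE the other MDTs' osp connections are counted among the "clients" there. A minimal sketch, run on each MDS (target names as in this ticket):

                # show recovery state for the local MDTs; under DNE the other MDTs
                # connect as clients, so their recovery shows up in these counters
                lctl get_param mdt.soaked-MDT*.recovery_status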

            Cliff, have you checked whether DNE recovery and filesystem usage are always blocked until all MDTs are available? It may be that LFSCK is being blocked by the DNE recovery. Preferably, we'd want the filesystem to be usable as soon as MDT0000 is up (at least for those parts that are on available MDTs), but it may be that DNE2 recovery waits for all MDTs to be up before it allows recovery to complete.

            adilger Andreas Dilger added a comment

            According to the logs, both the namespace LFSCK and the layout LFSCK were running before the lfsck_stop. Apart from some inconsistent owners repaired by the layout LFSCK, everything else looks normal. In fact, the inconsistent-owner repairs are false positives caused by asynchronous chown operations during the layout LFSCK. So from the LFSCK point of view, no other inconsistency was found during the log interval.

            But just like LU-8742, the LFSCK ran somewhat slowly, which is why you had to stop it manually with lfsck_stop. According to the logs, though, the LFSCK really was scanning, not hung. How long it will take (already more than 720 seconds) depends on the object count in the system.

            As for the system hang, I cannot establish any relationship with the uncompleted LFSCK, because LFSCK runs in the background and it did not find any inconsistency. That means, from the LFSCK point of view, the system was mountable. As the last comment described, recovery eventually timed out and some clients were evicted; that may be related to why the clients remain hung.

            yong.fan nasf (Inactive) added a comment
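
            A note on the speed: the run above was started with -s 1000, which caps the scan at 1000 objects per second. That cap is one knob to check when LFSCK seems slow (per the lfsck_start documentation, -s 0 means no limit), though the speeds in the lfsck_layout output above were already below it. A minimal sketch, reusing the invocation from this ticket:

                # same invocation the soak framework used, but with the rate cap removed
                lctl lfsck_start -M soaked-MDT0000 -s 0 -t namespace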
            cliffw Cliff White (Inactive) added a comment - edited

            I have been restarting the four MDS nodes to attempt to clear this. Upon restarting the fourth MDS (192.168.1.111@o2ib), MDT0000 finally completed recovery.

            Lustre: soaked-MDT0000: Received new LWP connection from 192.168.1.110@o2ib10, removing former export from same NID
            Lustre: Skipped 47 previous similar messages
            Lustre: soaked-MDT0000: Client soaked-MDT0002-mdtlov_UUID (at 192.168.1.110@o2ib10) refused connection, still busy with 10 references
            Lustre: Skipped 47 previous similar messages
            Lustre: 5798:0:(client.c:2111:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1479919197/real 1479919241]  req@ffff88041ce450c0 x1551735784078192/t0(0) o38->soaked-MDT0003-osp-MDT0000@192.168.1.111@o2ib10:24/4 lens 520/544 e 0 to 1 dl 1479919252 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
            Lustre: 5798:0:(client.c:2111:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
            Lustre: MGS: Connection restored to 192.168.1.111@o2ib10 (at 192.168.1.111@o2ib10)
            format at ldlm_lib.c:1221:target_handle_connect doesn't end in newline
            Lustre: soaked-MDT0000: Rejecting reconnect from the known client soaked-MDT0000-lwp-MDT0003_UUID (at 192.168.1.111@o2ib10) because it is indicating it is a new client
            Lustre: soaked-MDT0000: recovery is timed out, evict stale exports
            Lustre: soaked-MDT0000: disconnecting 12 stale clients
            Lustre: 5967:0:(ldlm_lib.c:1624:abort_req_replay_queue()) @@@ aborted:  req@ffff880821355080 x1551097408033824/t0(171804417562) o35->d7ef0a77-d6cc-d6e6-b6ee-f0c4dd8b3805@192.168.1.113@o2ib100:-1/-1 lens 512/0 e 2753 to 0 dl 1479919489 ref 1 fl Complete:/4/ffffffff rc 0/-1
            LustreError: 5967:0:(ldlm_lib.c:1645:abort_lock_replay_queue()) @@@ aborted:  req@ffff880827c96050 x1551509616864032/t0(0) o101->soaked-MDT0003-mdtlov_UUID@192.168.1.111@o2ib10:-1/-1 lens 328/0 e 2752 to 0 dl 1479919496 ref 1 fl Complete:/40/ffffffff rc 0/-1
            Lustre: soaked-MDT0000: Denying connection for new client c3576180-37f0-1109-b79d-ead98e580c5d(at 192.168.1.127@o2ib100), waiting for 21 known clients (6 recovered, 3 in progress, and 12 evicted) to recover in 21187373:09
            Lustre: Skipped 3 previous similar messages
            Lustre: 5967:0:(ldlm_lib.c:2035:target_recovery_overseer()) soaked-MDT0000 recovery is aborted by hard timeout
            Lustre: 5967:0:(ldlm_lib.c:2045:target_recovery_overseer()) recovery is aborted, evict exports in recovery
            Lustre: soaked-MDT0000: Recovery over after 1147:10, of 21 clients 6 recovered and 15 were evicted.
            Lustre: soaked-MDT0000: Connection restored to soaked-MDT0001-mdtlov_UUID (at 192.168.1.109@o2ib10)
            Lustre: soaked-MDT0000: Connection restored to 3f3093dd-91db-9fe9-0a49-09ac8c98d9ec (at 192.168.1.125@o2ib100)
            Lustre: Skipped 1 previous similar message
            Lustre: soaked-MDT0000: Connection restored to b111dd6e-3c52-31d6-0957-ebaf9d411485 (at 192.168.1.123@o2ib100)
            Lustre: Skipped 1 previous similar message
            Lustre: soaked-MDT0000: Connection restored to 0ac581dd-e230-df88-61be-a489f0ddcab5 (at 192.168.1.126@o2ib100)
            Lustre: Skipped 1 previous similar message
            

            However, clients remain hung; still investigating.

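            If recovery itself is what is wedging MDT0000, it can be abandoned explicitly instead of waiting for the hard timeout; a hedged sketch of the two usual options (the device path and mount point below are only placeholders), with the caveat that aborting recovery evicts the clients that have not yet replayed:

                # abort recovery on the running MDT device
                lctl --device soaked-MDT0000 abort_recovery

                # or skip recovery entirely when remounting the MDT
                mount -t lustre -o abort_recov /dev/<mdt-device> /mnt/soaked-mdt0000
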
            laisiyao Lai Siyao added a comment -

            I just applied for access to Lola: DCO-6349. I'll check the system tomorrow.


            After a reboot of the second MDT (lola-9), reconnection is still hung.

            Lustre: Skipped 8 previous similar messages
            Lustre: soaked-MDT0000: Received new LWP connection from 192.168.1.109@o2ib10, removing former export from same NID
            Lustre: soaked-MDT0000: Client soaked-MDT0001-mdtlov_UUID (at 192.168.1.109@o2ib10) refused connection, still busy with 7 references
            format at ldlm_lib.c:1221:target_handle_connect doesn't end in newline
            Lustre: soaked-MDT0000: Rejecting reconnect from the known client soaked-MDT0000-lwp-MDT0001_UUID (at 192.168.1.109@o2ib10) because it is indicating it is a new client
            
            cliffw Cliff White (Inactive) added a comment
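
            The "refused connection, still busy with N references" messages indicate that the old export for that NID has not been released yet. One hedged way to see which exports MDT0000 is still holding (standard mdt proc paths):

                # list the exports MDT0000 currently holds and their client UUIDs
                lctl get_param mdt.soaked-MDT0000.exports.*.uuid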
            pjones Peter Jones added a comment -

            Lai

            Please could you advise on this one? The system is left in the hung state so you should be able to access it directly if that would help.

            Peter


            People

              Assignee: laisiyao Lai Siyao
              Reporter: cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 7
