[LU-8863] LFSCK fails to complete, node cannot recover after LFSCK aborted. Created: 22/Nov/16 Updated: 12/Aug/22 Resolved: 12/Aug/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.9.0, Lustre 2.10.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Cliff White (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | soak | ||
| Environment: |
Soak cluster lustre: 2.8.60_5_gcc5601d |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
LFSCK fails to complete on lola-8, MDT0000: lctl lfsck_start -M soaked-MDT0000 -s 1000 -t namespace
2016-11-21 13:21:36,440:fsmgmt.fsmgmt:INFO lfsck started on lola-8
2016-11-21 13:21:52,069:fsmgmt.fsmgmt:INFO lfsck still in progress for soaked-MDT0000 after 15s
2016-11-21 13:22:22,672:fsmgmt.fsmgmt:INFO lfsck still in progress for soaked-MDT0000 after 45s
2016-11-21 13:23:23,898:fsmgmt.fsmgmt:INFO lfsck still in progress for soaked-MDT0000 after 105s
2016-11-21 13:25:26,280:fsmgmt.fsmgmt:INFO lfsck still in progress for soaked-MDT0000 after 225s
2016-11-21 13:29:31,072:fsmgmt.fsmgmt:INFO lfsck still in progress for soaked-MDT0000 after 465s
2016-11-21 13:37:40,601:fsmgmt.fsmgmt:INFO lfsck still in progress for soaked-MDT0000 after 945s
2016-11-21 13:53:59,778:fsmgmt.fsmgmt:INFO lfsck still in progress for soaked-MDT0000 after 1905s
2016-11-21 14:26:38,226:fsmgmt.fsmgmt:INFO lfsck still in progress for soaked-MDT0000 after 3825s
2016-11-21 15:31:55,117:fsmgmt.fsmgmt:INFO lfsck still in progress for soaked-MDT0000 after 7665s
I aborted LFSCK with lfsck_stop.

Lustre: Evicted from MGS (at MGC192.168.1.108@o2ib10_0) after server handle changed from 0xd9fafa0ca7b8e5dc to 0x732870fe43aa2fe7
Lustre: MGC192.168.1.108@o2ib10: Connection restored to MGC192.168.1.108@o2ib10_0 (at 192.168.1.108@o2ib10)
Lustre: Skipped 1 previous similar message
LustreError: 183198:0:(lmv_obd.c:1402:lmv_statfs()) can't stat MDS #0 (soaked-MDT0000-mdc-ffff880426c9c000), error -4
LustreError: 183198:0:(llite_lib.c:1736:ll_statfs_internal()) md_statfs fails: rc = -4

The system appears to be wedged in this state; rebooting and remounting the lola-8 MDT does not fix the issue. |
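The poll timestamps in the log above (15s, 45s, 105s, 225s, 465s, ...) suggest the soak harness re-checks LFSCK status on a doubling backoff, each report landing at 2t+15 seconds. A minimal sketch of such a loop, with `check_lfsck` stubbed out for illustration (a hypothetical stand-in; on a live MDS it would wrap `lctl get_param -n mdd.soaked-MDT0000.lfsck_namespace`):

```shell
#!/bin/sh
# Sketch only: reproduces the polling cadence seen in the soak log above.
# check_lfsck is a hypothetical stub; on a real MDS it would be e.g.
#   lctl get_param -n mdd.soaked-MDT0000.lfsck_namespace | awk '/^status:/ {print $2}'
check_lfsck() { echo "scanning-phase1"; }

elapsed=15
for i in 1 2 3 4 5; do
    status=$(check_lfsck)
    echo "lfsck ${status} for soaked-MDT0000 after ${elapsed}s"
    [ "$status" = "completed" ] && break
    elapsed=$((elapsed * 2 + 15))   # 15 -> 45 -> 105 -> 225 -> 465 -> ...
done
```

With the stub, the loop prints exactly the intervals seen in the log before giving up; the harness apparently repeats this until a hard cutoff, then the operator stops LFSCK by hand with `lctl lfsck_stop -M soaked-MDT0000`.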
| Comments |
| Comment by Peter Jones [ 22/Nov/16 ] |
|
Lai, please could you advise on this one? The system is left in the hung state, so you should be able to access it directly if that would help. Peter |
| Comment by Cliff White (Inactive) [ 22/Nov/16 ] |
|
After reboot of the second MDT (lola-9), reconnection is still hung.

Lustre: Skipped 8 previous similar messages
Lustre: soaked-MDT0000: Received new LWP connection from 192.168.1.109@o2ib10, removing former export from same NID
Lustre: soaked-MDT0000: Client soaked-MDT0001-mdtlov_UUID (at 192.168.1.109@o2ib10) refused connection, still busy with 7 references
format at ldlm_lib.c:1221:target_handle_connect doesn't end in newline
Lustre: soaked-MDT0000: Rejecting reconnect from the known client soaked-MDT0000-lwp-MDT0001_UUID (at 192.168.1.109@o2ib10) because it is indicating it is a new client |
| Comment by Lai Siyao [ 23/Nov/16 ] |
|
I just applied for access to Lola: DCO-6349. I'll check the system tomorrow. |
| Comment by Cliff White (Inactive) [ 23/Nov/16 ] |
|
I have been restarting the 4 MDS nodes to attempt to clear this; upon restarting the 4th MDS (192.168.1.111@o2ib), MDT0000 finally completed recovery.

Lustre: soaked-MDT0000: Received new LWP connection from 192.168.1.110@o2ib10, removing former export from same NID
Lustre: Skipped 47 previous similar messages
Lustre: soaked-MDT0000: Client soaked-MDT0002-mdtlov_UUID (at 192.168.1.110@o2ib10) refused connection, still busy with 10 references
Lustre: Skipped 47 previous similar messages
Lustre: 5798:0:(client.c:2111:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1479919197/real 1479919241] req@ffff88041ce450c0 x1551735784078192/t0(0) o38->soaked-MDT0003-osp-MDT0000@192.168.1.111@o2ib10:24/4 lens 520/544 e 0 to 1 dl 1479919252 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
Lustre: 5798:0:(client.c:2111:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
Lustre: MGS: Connection restored to 192.168.1.111@o2ib10 (at 192.168.1.111@o2ib10)
format at ldlm_lib.c:1221:target_handle_connect doesn't end in newline
Lustre: soaked-MDT0000: Rejecting reconnect from the known client soaked-MDT0000-lwp-MDT0003_UUID (at 192.168.1.111@o2ib10) because it is indicating it is a new client
Lustre: soaked-MDT0000: recovery is timed out, evict stale exports
Lustre: soaked-MDT0000: disconnecting 12 stale clients
Lustre: 5967:0:(ldlm_lib.c:1624:abort_req_replay_queue()) @@@ aborted: req@ffff880821355080 x1551097408033824/t0(171804417562) o35->d7ef0a77-d6cc-d6e6-b6ee-f0c4dd8b3805@192.168.1.113@o2ib100:-1/-1 lens 512/0 e 2753 to 0 dl 1479919489 ref 1 fl Complete:/4/ffffffff rc 0/-1
LustreError: 5967:0:(ldlm_lib.c:1645:abort_lock_replay_queue()) @@@ aborted: req@ffff880827c96050 x1551509616864032/t0(0) o101->soaked-MDT0003-mdtlov_UUID@192.168.1.111@o2ib10:-1/-1 lens 328/0 e 2752 to 0 dl 1479919496 ref 1 fl Complete:/40/ffffffff rc 0/-1
Lustre: soaked-MDT0000: Denying connection for new client c3576180-37f0-1109-b79d-ead98e580c5d (at 192.168.1.127@o2ib100), waiting for 21 known clients (6 recovered, 3 in progress, and 12 evicted) to recover in 21187373:09
Lustre: Skipped 3 previous similar messages
Lustre: 5967:0:(ldlm_lib.c:2035:target_recovery_overseer()) soaked-MDT0000 recovery is aborted by hard timeout
Lustre: 5967:0:(ldlm_lib.c:2045:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Lustre: soaked-MDT0000: Recovery over after 1147:10, of 21 clients 6 recovered and 15 were evicted.
Lustre: soaked-MDT0000: Connection restored to soaked-MDT0001-mdtlov_UUID (at 192.168.1.109@o2ib10)
Lustre: soaked-MDT0000: Connection restored to 3f3093dd-91db-9fe9-0a49-09ac8c98d9ec (at 192.168.1.125@o2ib100)
Lustre: Skipped 1 previous similar message
Lustre: soaked-MDT0000: Connection restored to b111dd6e-3c52-31d6-0957-ebaf9d411485 (at 192.168.1.123@o2ib100)
Lustre: Skipped 1 previous similar message
Lustre: soaked-MDT0000: Connection restored to 0ac581dd-e230-df88-61be-a489f0ddcab5 (at 192.168.1.126@o2ib100)
Lustre: Skipped 1 previous similar message

However, clients remain hung; still investigating. |
| Comment by nasf (Inactive) [ 26/Nov/16 ] |
|
According to the logs, both the namespace LFSCK and the layout LFSCK were running before the lfsck_stop. Except for some inconsistent owners repaired by the layout LFSCK, everything else looks normal. In fact, those inconsistent-owner reports are false positives caused by asynchronous chown operations running concurrently with the layout LFSCK. So from the LFSCK point of view, no other inconsistency was found during the log interval. As for the system hang, I cannot establish any relationship with the uncompleted LFSCK: LFSCK runs in the background, and it found no inconsistency, meaning that from the LFSCK point of view the filesystem was mountable. As the last comment described, recovery eventually timed out and some clients were evicted; that may be related to why 'clients remain hung'. |
| Comment by Andreas Dilger [ 02/Dec/16 ] |
|
Cliff, have you checked whether DNE recovery and filesystem usage are always blocked until all MDTs are available? It may be that LFSCK is being blocked by the DNE recovery. Preferably, we'd want the filesystem to be usable as soon as MDT0000 is up (at least for those parts that are on available MDTs), but it may be that DNE2 recovery waits for all MDTs to be up before it allows recovery to complete. |
| Comment by Cliff White (Inactive) [ 02/Dec/16 ] |
|
This is new to me; I am not sure how to check for DNE recovery as opposed to ordinary recovery. The filesystem recovery completes before we trigger LFSCK in soak. |
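For what it's worth, per-target recovery state is exposed through the `mdt.*.recovery_status` parameter (`lctl get_param mdt.soaked-MDT*.recovery_status` on each MDS), which reports the status and client counts. A hedged sketch that parses a sample transcript instead of a live system; the field names are assumptions from the 2.x `recovery_status` format, and the sample values are hand-written from the recovery summary in the earlier comment (21 clients: 6 recovered, 15 evicted):

```shell
#!/bin/sh
# Sketch: on a live MDS, $sample would come from
#   lctl get_param -n mdt.soaked-MDT0000.recovery_status
# Field names below are assumptions; values are copied from the log summary.
sample='status: COMPLETE
recovery_duration: 68830
completed_clients: 6/21
evicted_clients: 15'

completed=$(printf '%s\n' "$sample" | awk -F'[:/ ]+' '/^completed_clients/ {print $2}')
total=$(printf '%s\n' "$sample" | awk -F'[:/ ]+' '/^completed_clients/ {print $3}')
evicted=$(printf '%s\n' "$sample" | awk -F'[: ]+' '/^evicted_clients/ {print $2}')

# Flag a recovery that "completed" only by evicting clients.
if [ "$completed" -lt "$total" ]; then
    echo "WARNING: only ${completed}/${total} clients recovered, ${evicted} evicted"
fi
```

This only distinguishes a clean recovery from one that timed out and evicted clients; whether the remaining MDTs were blocking DNE recovery would still need the cross-MDT osp/lwp connection state.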
| Comment by Cliff White (Inactive) [ 24/Jan/17 ] |
|
Hitting this again on latest master:

# lctl get_param mdd.*.lfsck_layout
mdd.soaked-MDT0000.lfsck_layout=
name: lfsck_layout
magic: 0xb1734d76
version: 2
status: scanning-phase1
flags:
param:
last_completed_time: 1485198303
time_since_last_completed: 89193 seconds
latest_start_time: 1485275978
time_since_latest_start: 11518 seconds
last_checkpoint_time: 1485287462
time_since_last_checkpoint: 34 seconds
latest_start_position: 77
last_checkpoint_position: 769655424
first_failure_position: 0
success_count: 2
repaired_dangling: 807
repaired_unmatched_pair: 0
repaired_multiple_referenced: 0
repaired_orphan: 0
repaired_inconsistent_owner: 2310456
repaired_others: 0
skipped: 0
failed_phase1: 0
failed_phase2: 0
checked_phase1: 4033934
checked_phase2: 0
run_time_phase1: 11517 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 350 items/sec
average_speed_phase2: N/A
real-time_speed_phase1: 246 items/sec
real-time_speed_phase2: N/A
current_position: 773325008

Logs:

ERROR lfsck found errors lola-8/soaked-MDT0000: lf_repaired: 0
ERROR lfsck found errors lola-8/soaked-MDT0000: lf_repaired: 0
ERROR lfsck found errors lola-8/soaked-MDT0000: lf_repaired: 0
ERROR lfsck found errors lola-8/soaked-MDT0000: lf_repaired: 0 |
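As a sanity check, the phase-1 counters in that output are internally consistent: the reported average speed is simply items checked divided by run time, so the scan is progressing (slowly) rather than stalled. A throwaway arithmetic sketch, with the values copied from the `lfsck_layout` output above:

```shell
#!/bin/sh
# Values copied from the lfsck_layout output in this comment.
checked=4033934      # checked_phase1
run_time=11517       # run_time_phase1, in seconds
avg=$((checked / run_time))
echo "average_speed_phase1 = ${avg} items/sec"   # matches the reported 350
```

The real-time speed (246 items/sec) being below the average suggests phase 1 was slowing down, not stopped, when the soak framework declared the timeout.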