[LU-8863] LFSCK fails to complete, node cannot recover after LFSCK aborted. Created: 22/Nov/16  Updated: 12/Aug/22  Resolved: 12/Aug/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0, Lustre 2.10.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Cliff White (Inactive) Assignee: Lai Siyao
Resolution: Cannot Reproduce Votes: 0
Labels: soak
Environment:

Soak cluster lustre: 2.8.60_5_gcc5601d


Attachments: Text File lola-8.layout.txt     File lola-8.nov21-2016.txt.gz    
Issue Links:
Related
is related to LU-8647 lfsck_namespace_double_scan()) ASSERT... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

LFSCK fails to complete on lola-8, MDT0000:

                                lctl lfsck_start -M soaked-MDT0000 -s 1000 -t namespace
                        fi
                fi
2016-11-21 13:21:36,440:fsmgmt.fsmgmt:INFO     lfsck started on lola-8
2016-11-21 13:21:52,069:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 15s
2016-11-21 13:22:22,672:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 45s
2016-11-21 13:23:23,898:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 105s
2016-11-21 13:25:26,280:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 225s
2016-11-21 13:29:31,072:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 465s
2016-11-21 13:37:40,601:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 945s
2016-11-21 13:53:59,778:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 1905s
2016-11-21 14:26:38,226:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 3825s
2016-11-21 15:31:55,117:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 7665s

I aborted LFSCK with lfsck_stop.
The LFSCK stopped, but clients and other servers were not able to reconnect.
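
For reference, a minimal sketch of the abort and follow-up status check, assuming the standard lctl LFSCK interface and the device name used above:

# stop the running LFSCK on MDT0000
lctl lfsck_stop -M soaked-MDT0000
# confirm both scanners have left the scanning state
lctl get_param mdd.soaked-MDT0000.lfsck_namespace
lctl get_param mdd.soaked-MDT0000.lfsck_layout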
Example client:

Lustre: Evicted from MGS (at MGC192.168.1.108@o2ib10_0) after server handle changed from 0xd9fafa0ca7b8e5dc to 0x732870fe43aa2fe7
Lustre: MGC192.168.1.108@o2ib10: Connection restored to MGC192.168.1.108@o2ib10_0 (at 192.168.1.108@o2ib10)
Lustre: Skipped 1 previous similar message
LustreError: 183198:0:(lmv_obd.c:1402:lmv_statfs()) can't stat MDS #0 (soaked-MDT0000-mdc-ffff880426c9c000), error -4
LustreError: 183198:0:(llite_lib.c:1736:ll_statfs_internal()) md_statfs fails: rc = -4

The system appears to be wedged in this state; rebooting and remounting the lola-8 MDT does not fix the issue.
I dumped the Lustre log on lola-8 while it was in LFSCK; it is attached.
The lfsck_layout output is also attached.



 Comments   
Comment by Peter Jones [ 22/Nov/16 ]

Lai

Please could you advise on this one? The system is left in the hung state so you should be able to access it directly if that would help.

Peter

Comment by Cliff White (Inactive) [ 22/Nov/16 ]

After a reboot of the second MDT (lola-9), reconnection is still hung.

Lustre: Skipped 8 previous similar messages
Lustre: soaked-MDT0000: Received new LWP connection from 192.168.1.109@o2ib10, removing former export from same NID
Lustre: soaked-MDT0000: Client soaked-MDT0001-mdtlov_UUID (at 192.168.1.109@o2ib10) refused connection, still busy with 7 references
format at ldlm_lib.c:1221:target_handle_connect doesn't end in newline
Lustre: soaked-MDT0000: Rejecting reconnect from the known client soaked-MDT0000-lwp-MDT0001_UUID (at 192.168.1.109@o2ib10) because it is indicating it is a new client
Comment by Lai Siyao [ 23/Nov/16 ]

I just applied for access to Lola: DCO-6349. I'll check the system tomorrow.

Comment by Cliff White (Inactive) [ 23/Nov/16 ]

I have been restarting the 4 MDS nodes to attempt to clear this; upon restarting the 4th MDS (192.168.1.111@o2ib), MDT0000 finally completed recovery.

Lustre: soaked-MDT0000: Received new LWP connection from 192.168.1.110@o2ib10, removing former export from same NID
Lustre: Skipped 47 previous similar messages
Lustre: soaked-MDT0000: Client soaked-MDT0002-mdtlov_UUID (at 192.168.1.110@o2ib10) refused connection, still busy with 10 references
Lustre: Skipped 47 previous similar messages
Lustre: 5798:0:(client.c:2111:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1479919197/real 1479919241]  req@ffff88041ce450c0 x1551735784078192/t0(0) o38->soaked-MDT0003-osp-MDT0000@192.168.1.111@o2ib10:24/4 lens 520/544 e 0 to 1 dl 1479919252 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
Lustre: 5798:0:(client.c:2111:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
Lustre: MGS: Connection restored to 192.168.1.111@o2ib10 (at 192.168.1.111@o2ib10)
format at ldlm_lib.c:1221:target_handle_connect doesn't end in newline
Lustre: soaked-MDT0000: Rejecting reconnect from the known client soaked-MDT0000-lwp-MDT0003_UUID (at 192.168.1.111@o2ib10) because it is indicating it is a new client
Lustre: soaked-MDT0000: recovery is timed out, evict stale exports
Lustre: soaked-MDT0000: disconnecting 12 stale clients
Lustre: 5967:0:(ldlm_lib.c:1624:abort_req_replay_queue()) @@@ aborted:  req@ffff880821355080 x1551097408033824/t0(171804417562) o35->d7ef0a77-d6cc-d6e6-b6ee-f0c4dd8b3805@192.168.1.113@o2ib100:-1/-1 lens 512/0 e 2753 to 0 dl 1479919489 ref 1 fl Complete:/4/ffffffff rc 0/-1
LustreError: 5967:0:(ldlm_lib.c:1645:abort_lock_replay_queue()) @@@ aborted:  req@ffff880827c96050 x1551509616864032/t0(0) o101->soaked-MDT0003-mdtlov_UUID@192.168.1.111@o2ib10:-1/-1 lens 328/0 e 2752 to 0 dl 1479919496 ref 1 fl Complete:/40/ffffffff rc 0/-1
Lustre: soaked-MDT0000: Denying connection for new client c3576180-37f0-1109-b79d-ead98e580c5d(at 192.168.1.127@o2ib100), waiting for 21 known clients (6 recovered, 3 in progress, and 12 evicted) to recover in 21187373:09
Lustre: Skipped 3 previous similar messages
Lustre: 5967:0:(ldlm_lib.c:2035:target_recovery_overseer()) soaked-MDT0000 recovery is aborted by hard timeout
Lustre: 5967:0:(ldlm_lib.c:2045:target_recovery_overseer()) recovery is aborted, evict exports in recovery
Lustre: soaked-MDT0000: Recovery over after 1147:10, of 21 clients 6 recovered and 15 were evicted.
Lustre: soaked-MDT0000: Connection restored to soaked-MDT0001-mdtlov_UUID (at 192.168.1.109@o2ib10)
Lustre: soaked-MDT0000: Connection restored to 3f3093dd-91db-9fe9-0a49-09ac8c98d9ec (at 192.168.1.125@o2ib100)
Lustre: Skipped 1 previous similar message
Lustre: soaked-MDT0000: Connection restored to b111dd6e-3c52-31d6-0957-ebaf9d411485 (at 192.168.1.123@o2ib100)
Lustre: Skipped 1 previous similar message
Lustre: soaked-MDT0000: Connection restored to 0ac581dd-e230-df88-61be-a489f0ddcab5 (at 192.168.1.126@o2ib100)
Lustre: Skipped 1 previous similar message

However, clients remain hung; still investigating.
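
A sketch of how the client-side connection state could be inspected while the hang persists, assuming the standard import/state parameters on a client node:

# on a hung client, dump the MDC import for MDT0000
lctl get_param mdc.soaked-MDT0000-mdc-*.import
# connection state history for all MDC targets
lctl get_param mdc.*.state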

Comment by nasf (Inactive) [ 26/Nov/16 ]

According to the logs, both the namespace LFSCK and the layout LFSCK were running before the lfsck_stop. Apart from some inconsistent owners repaired by the layout LFSCK, everything else looks normal. In fact, the inconsistent-owner repairs are spurious, caused by async chown operations racing with the layout LFSCK. So from the LFSCK point of view, no other inconsistencies were found during the log interval.

But, as in LU-8742, the LFSCK ran somewhat slowly, to the point that you had to stop it manually with lfsck_stop. According to the logs, however, the LFSCK was genuinely scanning, not hung. How long it takes (already more than 720 seconds here) depends on the object count in the system.

As for the system hang, I cannot establish any relationship with the uncompleted LFSCK, because LFSCK runs in the background and it did not find inconsistencies; from the LFSCK point of view the system was mountable. As the last comment described, recovery eventually timed out and some clients were evicted, which may be related to why the clients remain hung.
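
A minimal sketch of how scan progress can be confirmed from the MDS, assuming the mdd LFSCK parameters shown elsewhere in this ticket; checked_phase1 and current_position should keep advancing while the scan is alive:

# sample the scan counters twice, a few seconds apart
lctl get_param mdd.soaked-MDT0000.lfsck_namespace | egrep 'status|checked_phase1|current_position'
sleep 10
lctl get_param mdd.soaked-MDT0000.lfsck_namespace | egrep 'status|checked_phase1|current_position'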

Comment by Andreas Dilger [ 02/Dec/16 ]

Cliff, have you checked whether DNE recovery and filesystem usage are always blocked until all MDTs are available? It may be that LFSCK is being blocked by the DNE recovery. Preferably, we'd want the filesystem to be usable as soon as MDT0000 is up (at least for those parts that are on available MDTs), but it may be that DNE2 recovery waits for all MDTs to be up before it allows recovery to complete.

Comment by Cliff White (Inactive) [ 02/Dec/16 ]

This is new to me; I am not sure how to check for DNE recovery as opposed to ordinary recovery. The filesystem recovery completes before we trigger LFSCK in soak.
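
One possible way to check per-MDT recovery state, assuming the standard recovery_status parameter (it reports status, connected/recovered client counts, and time remaining):

# run on each MDS node
lctl get_param mdt.soaked-MDT*.recovery_status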

Comment by Cliff White (Inactive) [ 24/Jan/17 ]

Hitting this again on the latest master:

# lctl get_param mdd.*.lfsck_layout
mdd.soaked-MDT0000.lfsck_layout=
name: lfsck_layout
magic: 0xb1734d76
version: 2
status: scanning-phase1
flags:
param:
last_completed_time: 1485198303
time_since_last_completed: 89193 seconds
latest_start_time: 1485275978
time_since_latest_start: 11518 seconds
last_checkpoint_time: 1485287462
time_since_last_checkpoint: 34 seconds
latest_start_position: 77
last_checkpoint_position: 769655424
first_failure_position: 0
success_count: 2
repaired_dangling: 807
repaired_unmatched_pair: 0
repaired_multiple_referenced: 0
repaired_orphan: 0
repaired_inconsistent_owner: 2310456
repaired_others: 0
skipped: 0
failed_phase1: 0
failed_phase2: 0
checked_phase1: 4033934
checked_phase2: 0
run_time_phase1: 11517 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 350 items/sec
average_speed_phase2: N/A
real-time_speed_phase1: 246 items/sec
real-time_speed_phase2: N/A
current_position: 773325008

Logs

ERROR    lfsck found errors lola-8/soaked-MDT0000: lf_repaired: 0
ERROR    lfsck found errors lola-8/soaked-MDT0000: lf_repaired: 0
ERROR    lfsck found errors lola-8/soaked-MDT0000: lf_repaired: 0
ERROR    lfsck found errors lola-8/soaked-MDT0000: lf_repaired: 0