[LU-8177] osp-syn threads in D state Created: 21/May/16  Updated: 17/Jun/16  Resolved: 17/Jun/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.3
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Mahmoud Hanafi Assignee: John Fuchs-Chesney (Inactive)
Resolution: Incomplete Votes: 0
Labels: None

Issue Links:
Related
is related to LU-7079 OSP shouldn't discard requests due to... Resolved
Severity: 1
Rank (Obsolete): 9223372036854775807

 Description   

When trying to mount the MDT, all osp-syn threads are stuck in 'D' state.

Debug logs are filled with these messages

00000004:00080000:8.0:1463850740.016156:0:14081:0:(osp_sync.c:317:osp_sync_request_commit_cb()) commit req ffff883ebf799800, transno 0
00000004:00080000:8.0:1463850740.016164:0:14081:0:(osp_sync.c:351:osp_sync_interpret()) reply req ffff883ebf799800/1, rc -2, transno 0
00000100:00100000:8.0:1463850740.016176:0:14081:0:(client.c:1872:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_3:nbp2-MDT0000-mdtlov_UUID:14081:1534957896521600:10.151.26.98@o2ib:6
00000004:00080000:9.0:1463850740.016219:0:14087:0:(osp_sync.c:317:osp_sync_request_commit_cb()) commit req ffff883ebed48800, transno 0
00000004:00080000:9.0:1463850740.016226:0:14087:0:(osp_sync.c:351:osp_sync_interpret()) reply req ffff883ebed48800/1, rc -2, transno 0

I will upload full debug logs to ftp site.
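
For reference, a quick way to confirm which threads are stuck in uninterruptible sleep and to capture their stacks is sketched below (generic Linux commands, nothing Lustre-specific; <tid> is a placeholder for a real thread id):

# List D-state (uninterruptible) threads; the stuck osp-syn threads show up here
ps -eLo state,pid,tid,comm | awk '$1 == "D"'

# Dump the kernel stack of one stuck thread
cat /proc/<tid>/stack

# Or dump all blocked tasks at once (requires sysrq to be enabled), then read dmesg
echo w > /proc/sysrq-trigger
dmesg | tail -n 200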



 Comments   
Comment by Mahmoud Hanafi [ 21/May/16 ]

Uploaded file to ftp:/uploads/LU8177/s600.debug.out.gz

Comment by John Fuchs-Chesney (Inactive) [ 21/May/16 ]

Mahmoud,

Can you please clarify whether you have a production-site-down emergency? We rate that as a SEV-1 event, and you have selected SEV-4.

Thanks,
~ jfc.

Comment by Mahmoud Hanafi [ 21/May/16 ]

Sorry, it should be level 1.

Comment by John Fuchs-Chesney (Inactive) [ 21/May/16 ]

Email from Mahmoud: "Sorry this should be severity1. The production site is down and unusable."

~ jfc.

Comment by John Fuchs-Chesney (Inactive) [ 21/May/16 ]

Assigning to me – Oleg is looking.
~ jfc.

Comment by Oleg Drokin [ 21/May/16 ]

Are there any messages in dmesg on the MDS or OSTs?
Is this a normal mount after a normal shutdown, or a failover after something else?
Are the OSTs up?

Comment by Alex Zhuravlev [ 21/May/16 ]

I think this may be a dup of LU-7079.

Comment by Alex Zhuravlev [ 21/May/16 ]

Basically, some llog cancels got lost by mistake, causing lots of I/O to rescan llogs at startup.
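
A rough way to gauge how large that llog backlog is (a sketch only; the osp parameter names below are taken from the OSP sync proc interface and may be named slightly differently on 2.5):

# Per-OSP counters of llog records still queued, in flight, and being processed
lctl get_param osp.*.sync_changes osp.*.sync_in_flight osp.*.sync_in_progress

These numbers should shrink as the osp-syn threads work through the old records.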

Comment by Mahmoud Hanafi [ 21/May/16 ]

This was a remount after power down.
The OSTs are mounted.
The shutdown was normal.

I unmounted all the OSTs, and then the MDT got mounted. Then I remounted the OSTs and the MDT went back to osp-sync in 'D' state, but at least I am able to mount it on the client.

So do we need to apply the patch from LU-7079 and remount? Or can we somehow stop the osp-sync?

Comment by Oleg Drokin [ 21/May/16 ]

Alex advises that the condition will clear on its own after all llogs are reprocessed. The duration of that is hard to tell, as it depends on the number of those llogs.

Comment by Oleg Drokin [ 21/May/16 ]

If you really need to clear the condition immediately, it's possible to unmount the MDT, mount it as ldiskfs, remove the stale llogs, unmount ldiskfs, and remount the MDT as Lustre.
Perhaps not all of them need removing, just the really old ones (you can tell by the date).
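
A sketch of that sequence (device and mount-point names such as /dev/mdtdev, /mnt/lustre-mdt and /mnt/mdt-ldiskfs are placeholders, not the actual ones on this system; verify which llog objects are stale before removing anything):

umount /mnt/lustre-mdt                         # stop the MDT
mount -t ldiskfs /dev/mdtdev /mnt/mdt-ldiskfs  # mount the backend as ldiskfs

# llog objects live under O/1/d*/ on the MDT backend; list them oldest first
ls -ltr /mnt/mdt-ldiskfs/O/1/d*/

# remove only the clearly stale (old) llog objects, keeping the recent ones
# rm /mnt/mdt-ldiskfs/O/1/d*/<old-object>

umount /mnt/mdt-ldiskfs
mount -t lustre /dev/mdtdev /mnt/lustre-mdt    # bring the MDT back up as Lustre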

Comment by Mahmoud Hanafi [ 21/May/16 ]

I looked in /O/1/d* and there were files going back to 2015.

Should I just delete everything in /O/1/* and remount?

Comment by Oleg Drokin [ 21/May/16 ]

do you use changelogs too?

Comment by Mahmoud Hanafi [ 21/May/16 ]

No, we don't.

Comment by Oleg Drokin [ 21/May/16 ]

Generally, since I believe your system is now mountable, it's safer to just let the sync threads run their course. It would put additional load on the system, but it should not be too bad.

Once you apply the LU-7079 patch, it should kill those records for good the next time you reboot.
Without the patch, some of the records would still be killed, but not all of them (and more might amass before the next reboot), so you are looking at a similar situation the next time you remount anyway.

Comment by Mahmoud Hanafi [ 21/May/16 ]

OK, thanks. You may lower the priority of the case. It did finish.

Comment by John Fuchs-Chesney (Inactive) [ 21/May/16 ]

Thank you for the update Mahmoud.

I think we'll keep the priority as it is for the time being (for recording purposes).

Do you want us to keep the ticket open for a while longer? Or do you think this event is now resolved?

Best regards,
~ jfc.

Comment by Mahmoud Hanafi [ 21/May/16 ]

Please leave the case open for now.

Comment by Jay Lan (Inactive) [ 22/May/16 ]

We have a b2_7_fe version of the patch, but need a backport to b2_5_fe. Thanks!

Comment by Jian Yu [ 24/May/16 ]

Hello Jay,

Here is the back-ported patch for Lustre b2_5_fe branch: http://review.whamcloud.com/20392

Comment by Jay Lan (Inactive) [ 25/May/16 ]

Thanks!

Comment by John Fuchs-Chesney (Inactive) [ 03/Jun/16 ]

Hello Mahmoud,

Do you want us to continue to keep this ticket open?

Thanks,
~ jfc.

Comment by John Fuchs-Chesney (Inactive) [ 17/Jun/16 ]

Resolving as incomplete.

Please let us know if any further work is required on this ticket.

Thanks,
~ jfc.
