Details

    • Type: Bug
    • Resolution: Incomplete
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.5.3
    • Labels: None
    • Severity: 1
    • Rank (Obsolete): 9223372036854775807

    Description

      When trying to mount the MDT, all osp-syn threads get stuck in 'D' state.

      The debug logs are filled with messages like these:

      00000004:00080000:8.0:1463850740.016156:0:14081:0:(osp_sync.c:317:osp_sync_request_commit_cb()) commit req ffff883ebf799800, transno 0
      00000004:00080000:8.0:1463850740.016164:0:14081:0:(osp_sync.c:351:osp_sync_interpret()) reply req ffff883ebf799800/1, rc -2, transno 0
      00000100:00100000:8.0:1463850740.016176:0:14081:0:(client.c:1872:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_3:nbp2-MDT0000-mdtlov_UUID:14081:1534957896521600:10.151.26.98@o2ib:6
      00000004:00080000:9.0:1463850740.016219:0:14087:0:(osp_sync.c:317:osp_sync_request_commit_cb()) commit req ffff883ebed48800, transno 0
      00000004:00080000:9.0:1463850740.016226:0:14087:0:(osp_sync.c:351:osp_sync_interpret()) reply req ffff883ebed48800/1, rc -2, transno 0
      

      I will upload the full debug logs to the FTP site.
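
      To confirm which threads are stuck and capture their kernel stacks, one option is the standard Linux sysrq route (nothing here is specific to this ticket, and sysrq must be enabled):

      # List threads in uninterruptible sleep ('D' state).
      ps -eo pid,stat,comm | awk '$2 ~ /^D/'

      # Dump stack traces of all blocked tasks to the kernel log
      # (assumes sysrq is enabled, e.g. sysctl kernel.sysrq=1).
      echo w > /proc/sysrq-trigger
      dmesg | tail -n 200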

    Activity

            [LU-8177] osp-syn threads in D state

            mhanafi Mahmoud Hanafi added a comment -

            I looked in /O/1/d* and there were files going back to 2015.

            Should I just delete everything in /O/1/* and remount?
            green Oleg Drokin added a comment -

            If you really need to clear the condition immediately, it's possible to unmount the MDT, mount it as ldiskfs, remove the stale llogs, unmount ldiskfs, and remount the MDT as Lustre.
            Perhaps not all of them need removing, just the really old ones (you can tell by the date).

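            A rough sketch of that procedure follows. The device name (/dev/mdtdev), the mountpoints, and the 2016-01-01 cutoff are placeholders, not values from this ticket; review the file list before deleting anything:

            # Stop the Lustre MDT, then mount the same device as plain ldiskfs.
            umount /mnt/lustre-mdt
            mount -t ldiskfs /dev/mdtdev /mnt/mdt-ldiskfs

            # The llog objects live under O/1/d*; the mtime shows which are stale.
            ls -l /mnt/mdt-ldiskfs/O/1/d*

            # Review first, then remove only the clearly old ones.
            find /mnt/mdt-ldiskfs/O/1/d* -type f ! -newermt "2016-01-01" -print
            find /mnt/mdt-ldiskfs/O/1/d* -type f ! -newermt "2016-01-01" -delete

            # Put the MDT back into service.
            umount /mnt/mdt-ldiskfs
            mount -t lustre /dev/mdtdev /mnt/lustre-mdt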
            green Oleg Drokin added a comment -

            Alex advises that the condition will clear on its own after all llogs are reprocessed. The duration is hard to tell, as it depends on the number of those llogs.

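            If you let the recovery run, the backlog can be watched from the MDS via the OSP sync counters (parameter names as found in recent Lustre releases; they may differ on 2.5.x):

            # Records each OSP still has to ship to its OST; these should
            # trend toward zero as the old llogs are consumed.
            lctl get_param osp.*.sync_changes osp.*.sync_in_flight osp.*.sync_in_progress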

            mhanafi Mahmoud Hanafi added a comment -

            This was a remount after a power down.
            The OSTs are mounted.
            The shutdown was normal.

            I unmounted all the OSTs and then the MDT got mounted. Then I remounted the OSTs and the MDT went back to osp-sync in 'D' state, but at least I am able to mount it on the client.

            So do we need to apply the patch from LU-7079 and remount? Or can we somehow stop the osp-sync?

            bzzz Alex Zhuravlev added a comment -

            Basically, some llog cancels got lost by mistake, causing lots of IO to rescan the llogs at startup.
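            To gauge the size of that backlog without mounting the MDT, the llog objects can be inspected read-only with debugfs from e2fsprogs (the device name is a placeholder):

            # -c opens the device read-only; a huge file count in O/1/d*
            # is what drives the heavy IO at startup.
            debugfs -c -R 'ls -l /O/1/d0' /dev/mdtdev | head -n 20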

            bzzz Alex Zhuravlev added a comment -

            I think this can be a dup of LU-7079.
            green Oleg Drokin added a comment -

            Are there any messages in dmesg on the MDS or OSTs?
            Is this a normal mount after a normal shutdown, or a failover after something else?
            Are the OSTs up?


            jfc John Fuchs-Chesney (Inactive) added a comment -

            Assigning to me – Oleg is looking.
            ~ jfc.

            jfc John Fuchs-Chesney (Inactive) added a comment -

            Email from Mahmoud: "Sorry this should be severity1. The production site is down and unusable."

            ~ jfc.

            mhanafi Mahmoud Hanafi added a comment -

            Sorry, it should be level 1.

            jfc John Fuchs-Chesney (Inactive) added a comment -

            Mahmoud,

            Can you please clarify whether you have a production-site-down emergency? We rate that as a SEV-1 event, and you have selected SEV-4.

            Thanks,
            ~ jfc.

            People

              Assignee:
              jfc John Fuchs-Chesney (Inactive)
              Reporter:
              mhanafi Mahmoud Hanafi
              Votes:
              0
              Watchers:
              9
