[LU-8177] osp-syn threads in D state Created: 21/May/16 Updated: 17/Jun/16 Resolved: 17/Jun/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Mahmoud Hanafi | Assignee: | John Fuchs-Chesney (Inactive) |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 1 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
When trying to mount the MDT, all osp-syn threads get stuck in 'D' state. The debug logs are filled with these messages:

00000004:00080000:8.0:1463850740.016156:0:14081:0:(osp_sync.c:317:osp_sync_request_commit_cb()) commit req ffff883ebf799800, transno 0
00000004:00080000:8.0:1463850740.016164:0:14081:0:(osp_sync.c:351:osp_sync_interpret()) reply req ffff883ebf799800/1, rc -2, transno 0
00000100:00100000:8.0:1463850740.016176:0:14081:0:(client.c:1872:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_3:nbp2-MDT0000-mdtlov_UUID:14081:1534957896521600:10.151.26.98@o2ib:6
00000004:00080000:9.0:1463850740.016219:0:14087:0:(osp_sync.c:317:osp_sync_request_commit_cb()) commit req ffff883ebed48800, transno 0
00000004:00080000:9.0:1463850740.016226:0:14087:0:(osp_sync.c:351:osp_sync_interpret()) reply req ffff883ebed48800/1, rc -2, transno 0

I will upload full debug logs to the ftp site. |
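For reference, the rc -2 in the osp_sync_interpret lines above is a negative errno, i.e. -ENOENT ("No such file or directory"). A minimal sketch to confirm the errno mapping:

```python
import errno
import os

# Lustre debug logs report return codes as negative errno values,
# so rc -2 corresponds to errno 2.
rc = -2
print(errno.errorcode[-rc], "-", os.strerror(-rc))
# Output: ENOENT - No such file or directory
```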
| Comments |
| Comment by Mahmoud Hanafi [ 21/May/16 ] |
|
Uploaded file to ftp:/uploads/LU8177/s600.debug.out.gz |
| Comment by John Fuchs-Chesney (Inactive) [ 21/May/16 ] |
|
Mahmoud, can you please clarify whether you have a production-site-down emergency? We rate that as a SEV-1 event, and you have selected SEV-4. Thanks, |
| Comment by Mahmoud Hanafi [ 21/May/16 ] |
|
Sorry, it should be level 1. |
| Comment by John Fuchs-Chesney (Inactive) [ 21/May/16 ] |
|
Email from Mahmoud: "Sorry this should be severity1. The production site is down and unusable." ~ jfc. |
| Comment by John Fuchs-Chesney (Inactive) [ 21/May/16 ] |
|
Assigning to me – Oleg is looking. |
| Comment by Oleg Drokin [ 21/May/16 ] |
|
Are there any messages in dmesg on mds or osts? |
| Comment by Alex Zhuravlev [ 21/May/16 ] |
|
I think this can be a dup of LU-7079. |
| Comment by Alex Zhuravlev [ 21/May/16 ] |
|
Basically, some llog cancels got lost by mistake, causing lots of IO to rescan the llogs at startup. |
| Comment by Mahmoud Hanafi [ 21/May/16 ] |
|
This was a remount after a power down. I unmounted all the OSTs, then the MDT got mounted. Then I remounted the OSTs and the MDT went back to osp-syn threads in 'D' state, but at least I am able to mount it on the client. So do we need to apply the patch from LU-7079? |
| Comment by Oleg Drokin [ 21/May/16 ] |
|
Alex advises that the condition will clear on its own after all llogs are reprocessed. The duration of that is hard to tell, as it depends on the number of those llogs. |
| Comment by Oleg Drokin [ 21/May/16 ] |
|
If you really need to clear the condition immediately, it's possible to unmount the MDT, mount it as ldiskfs, remove the stale llogs, unmount ldiskfs, and remount the MDT as Lustre. |
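For illustration, a minimal sketch of that sequence, assuming a hypothetical MDT backing device /dev/mdt_dev and mount points /mnt/mdt and /mnt/mdt_ldiskfs (adjust for the actual system, and back up anything before removing it):

```python
import subprocess

def run(cmd):
    """Print and execute one step of the (hypothetical) recovery sequence."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["umount", "/mnt/mdt"])                                   # unmount the MDT (Lustre)
run(["mount", "-t", "ldiskfs", "/dev/mdt_dev", "/mnt/mdt_ldiskfs"])
# ... inspect and remove the stale llog files here (e.g. under O/1/d*) ...
run(["umount", "/mnt/mdt_ldiskfs"])                           # unmount ldiskfs
run(["mount", "-t", "lustre", "/dev/mdt_dev", "/mnt/mdt"])    # remount the MDT as Lustre
```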
| Comment by Mahmoud Hanafi [ 21/May/16 ] |
|
I looked in /O/1/d* and there were files going back to 2015. Should I just delete everything in /O/1/* and remount? |
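As an aside, a minimal read-only sketch for listing those files by modification time before deciding what to remove, assuming a hypothetical ldiskfs mount point /mnt/mdt_ldiskfs:

```python
import glob
import os
import time

# Hypothetical ldiskfs mount point of the MDT; adjust for the actual system.
MDT = "/mnt/mdt_ldiskfs"

# List files under O/1/d* sorted by mtime so the oldest entries show up first.
# This only inspects; it does not delete anything.
for path in sorted(glob.glob(os.path.join(MDT, "O/1/d*/*")), key=os.path.getmtime):
    mtime = time.strftime("%Y-%m-%d", time.localtime(os.path.getmtime(path)))
    print(mtime, path)
```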
| Comment by Oleg Drokin [ 21/May/16 ] |
|
do you use changelogs too? |
| Comment by Mahmoud Hanafi [ 21/May/16 ] |
|
no we don't |
| Comment by Oleg Drokin [ 21/May/16 ] |
|
Generally, since I believe your system is now mountable, it's safer to just let the sync threads run their course. It would put additional load on the system, but should not be too bad. Once you apply the LU-7079 patch, it should kill those records for good the next time you reboot. |
| Comment by Mahmoud Hanafi [ 21/May/16 ] |
|
Ok thanks. You may lower the priority of the case. It did finish. |
| Comment by John Fuchs-Chesney (Inactive) [ 21/May/16 ] |
|
Thank you for the update Mahmoud. I think we'll keep the priority as it is for the time being (for recording purposes). Do you want us to keep the ticket open for a while longer? Or do you think this event is now resolved? Best regards, |
| Comment by Mahmoud Hanafi [ 21/May/16 ] |
|
Please leave the case open for now. |
| Comment by Jay Lan (Inactive) [ 22/May/16 ] |
|
We have a b2_7_fe version of the patch, but need a back port to b2_5_fe. Thanks! |
| Comment by Jian Yu [ 24/May/16 ] |
|
Hello Jay, Here is the back-ported patch for Lustre b2_5_fe branch: http://review.whamcloud.com/20392 |
| Comment by Jay Lan (Inactive) [ 25/May/16 ] |
|
Thanks! |
| Comment by John Fuchs-Chesney (Inactive) [ 03/Jun/16 ] |
|
Hello Mahmoud, Do you want us to continue to keep this ticket open? Thanks, |
| Comment by John Fuchs-Chesney (Inactive) [ 17/Jun/16 ] |
|
Resolving as incomplete. Please let us know if any further work is required on this ticket. Thanks, |