[LU-6687] ALL osp-sync in D state Created: 03/Jun/15 Updated: 16/Oct/15 Resolved: 16/Oct/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Mahmoud Hanafi | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
After we reboot and mount the MDT, we see all osp-sync threads in D state and the following error: Jun 3 14:38:03 nbp8-mds1 kernel: LustreError: 5838:0:(osp_sync.c:487:osp_sync_new_setattr_job()) nbp8-OST009e-osc-MDT0000: invalid setattr record, lsr_valid:100 |
| Comments |
| Comment by Peter Jones [ 04/Jun/15 ] |
|
Niu, could you please advise? Thanks, Peter |
| Comment by Niu Yawei (Inactive) [ 04/Jun/15 ] |
|
It looks like the related patch was applied correctly in your 2.5.3 tree (https://github.com/jlan/lustre-nas/commit/fb970b342a7fac22a17b4932e11febb6963b3dff). Is this an upgraded system, and is this the first mount after upgrading? I'm wondering if these invalid records are leftovers from the old system. |
| Comment by Niu Yawei (Inactive) [ 04/Jun/15 ] |
|
BTW, because of |
| Comment by Mahmoud Hanafi [ 04/Jun/15 ] |
|
This is a system upgraded to 2.5.3. This happens every time the MDT is mounted. How do we go about fixing the invalid records? We are going to clean up a lot of mismatches between object UID/GID and MDT records. These occurred most likely due to |
| Comment by Niu Yawei (Inactive) [ 04/Jun/15 ] |
|
Hmm, fixing these invalid records manually will be troublesome (there isn't any llog edit tool, so you would have to use a hex editor to modify the records...). Actually, if there are only a few leftover records, we can just delete all of them by removing the llog files, and then move on to mounting the MDT. 1. lctl --device $MDTDEV llog_catlist to show all the catalogs for the unlink/setattr records; |
| Comment by Mahmoud Hanafi [ 04/Jun/15 ] |
|
nbp8-mds1 ~ # lctl --device 6 llog_catlist It looks like there are a lot of records. I am not sure if I understand items #4 and #5. |
| Comment by Niu Yawei (Inactive) [ 05/Jun/15 ] |
|
Hmm, it looks like llog_catlist is only available in master now. OK, each chown and unlink on the MDT generates a llog record in a llog file, and this record is used to sync the operation to the OST objects; once the sync to the OST is done, the record is removed from the llog file. Usually after a clean shutdown, there won't be any leftover records in the llog files. However, in your case, there are some invalid records which can't be processed and so are never removed. Let's look at the on-disk structure of llog files:
I think that could make it easier to understand #4 and #5 of my previous comment? Given that "lctl llog_catlist" isn't supported in 2.5, you can remove all the leftover records as follows: mount the MDT as ldiskfs, find all the files whose names are numeric under /O/1/, and remove them all. (It's better to back up these files first.) |
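The ldiskfs cleanup described above could be sketched roughly as follows. This is an illustration only, not a tested Lustre procedure. The helper names `find_numeric_files` and `move_to_backup` are invented for this sketch, and any mount point or backup directory passed to them is up to the operator. It also assumes the numerically named files may sit in subdirectories of O/1, so it walks the tree recursively. Instead of deleting, it moves the files to a backup location, and it defaults to a dry run.

```python
import os
import shutil


def find_numeric_files(root):
    """Return sorted paths of files under `root` whose names are all ASCII digits.

    Assumption: the llog files may live in subdirectories of `root`,
    so the whole tree is walked.
    """
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # isascii() guards against non-ASCII digit characters.
            if name.isascii() and name.isdigit():
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


def move_to_backup(paths, root, backup_dir, dry_run=True):
    """Move each path into `backup_dir`, keeping its position relative to `root`.

    With dry_run=True (the default) nothing is touched; the planned
    (source, destination) pairs are only returned for review.
    """
    plan = []
    for path in paths:
        rel = os.path.relpath(path, root)
        dest = os.path.join(backup_dir, rel)
        if not dry_run:
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            shutil.move(path, dest)
        plan.append((path, dest))
    return plan
```

With the MDT mounted as ldiskfs at a hypothetical /mnt/mdt, one would call `find_numeric_files("/mnt/mdt/O/1")`, review the list, and only then call `move_to_backup(...)` with `dry_run=False`. Moving rather than deleting keeps the backup suggested above.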
| Comment by Mahmoud Hanafi [ 16/Oct/15 ] |
|
Please close this case |
| Comment by Peter Jones [ 16/Oct/15 ] |
|
ok Mahmoud |