FLR1: Landing tickets for File Level Redundancy Phase 1
(LU-9771)
|
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0 |
| Fix Version/s: | Lustre 2.11.0 |
| Type: | Technical task | Priority: | Minor |
| Reporter: | Jinshan Xiong (Inactive) | Assignee: | WC Triage |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | FLR |
| Description |
|
MDT should update the PFID information on OST objects after a layout swap, split, or merge. However, a protocol must be defined to make this correct. Fanyong proposed using a synchronous OUT setxattr. However, I tend to think it would be better to use llog, as the SETATTR and DESTROY RPCs do. This problem has existed for a long time, since Lustre 2.4. LFSCK can fix it, but it is better to have a general mechanism for this kind of problem, because there will be more and more similar problems. |
| Comments |
| Comment by nasf (Inactive) [ 16/Nov/17 ] |
|
LFSCK can handle the inconsistent PFID EA, but it is impractical to run LFSCK every time a layout is merged for the new FLR. On the other hand, compared with other operations, merging layouts for FLR is very rare, so performance is not a big issue; as long as we can guarantee system consistency, that is enough. If such operations become common in the future, we can consider a better solution. So for now, following the lod_replace_parent_fid() logic can resolve most of the issue. |
| Comment by Alex Zhuravlev [ 16/Nov/17 ] |
|
saying "to use llog" is not very informative as llog can be used (and already is) in many different ways. |
| Comment by Jinshan Xiong (Inactive) [ 16/Nov/17 ] |
|
Alex - once you add the restrictions (MDT->OST communication, plus SETATTR- and DESTROY-like RPCs), there is not much room for confusion. I think what we need here is a mechanism that does this in a highly efficient way. Using OUT SETXATTR is more of a workaround. The current way of handling SETATTR and DESTROY is a pretty good framework and we just need to extend it. This framework will also be enhanced to pack multiple records in a single RPC, which will lead to better performance. |
| Comment by Alex Zhuravlev [ 16/Nov/17 ] |
|
I disagree, OUT is a much more flexible interface. If you really want to find a good solution, let's start from the requirements. |
| Comment by Jinshan Xiong (Inactive) [ 16/Nov/17 ] |
|
When I looked at the OUT code, I saw that the operation of replacing the PFID is already there as lod_replace_parent_fid(), so I can probably just use it. Therefore, before swapping or merging a layout, I will just declare and call xattr set on XATTR_NAME_FID to change the PFID. Are there any side effects? For example, the local transaction fails but the remote operation succeeds, which would result in inconsistency, or similar fancy stuff like that. |
| Comment by Jinshan Xiong (Inactive) [ 16/Nov/17 ] |
|
Alex - If I understand it correctly, OUT can only aggregate updates that belong to the same transaction. Since the SETATTR and DESTROY mechanism runs outside of a transaction, it can potentially pack any records from the LLOG, as long as those records are sent to the same target device; the only limitation would be the size of the RPC. I don't have any expertise in this area of the code, so I am probably only seeing the surface of this issue. |
| Comment by Gerrit Updater [ 16/Nov/17 ] |
|
Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: https://review.whamcloud.com/30137 |
| Comment by Alex Zhuravlev [ 17/Nov/17 ] |
|
no, one can send a few different transactions (each composed of many updates) within a single RPC. |
| Comment by Jinshan Xiong (Inactive) [ 27/Nov/17 ] |
|
The patch in this ticket will be landed to 2.11 as a bug fix |
| Comment by Gerrit Updater [ 28/Nov/17 ] |
|
Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: https://review.whamcloud.com/30292 |
| Comment by Nathan Dauchy (Inactive) [ 04/Dec/17 ] |
|
It looks like we hit this bug at NASA, triggered simply by running lfs_migrate (possibly complicated by the migration of files with multiple hard links). If the issue is that easy to trigger, shouldn't this patch be pulled into the 2.10.x LTS branch? For those of us who hit the bug before the patch, precisely what lfsck args are needed to correct the problem? Something like:
    lctl lfsck_start -M lfstest-MDT0000 -c on -C on -o -r
(or is "-A" or "-t layout" or "-t namespace" also required?)
Also, in the test-framework script for the patch in this LU, I see what seems to be another way to trigger the fix, but can't find any documentation on it:
    $LCTL set_param -n obdfilter.${FSNAME}-OST*.lfsck_verify_pfid=1
|
| Comment by Andreas Dilger [ 04/Dec/17 ] |
|
Nathan, could you describe which problem you are hitting? Did you hit the problem in |
| Comment by Nathan Dauchy (Inactive) [ 05/Dec/17 ] |
|
9941 is not familiar to me, did you mean a different LU? What we are doing is using a modified version of Lester (an e2scan derivative) to quickly build lists of files on a set of OSTs to be drained (faster than lfs find), then running lfs_migrate on those files. To parallelize the migrations, we use xargs to run multiple copies of lfs_migrate, so there definitely could have been a race between two different instances working on the same file via different hard links. This was also started with an older ~2.7 version of lfs_migrate. The PFID issue was not seen from userspace, but by one of my colleagues working on improving the OST scanning tool, who was looking at things with debugfs. I'm afraid I did not save the details to give you the exact example though! |
| Comment by Andreas Dilger [ 05/Dec/17 ] |
|
Ah, if you are scanning the OST inodes directly, and looking at the filter_fid xattr of migrated OST objects, then indeed this is the correct issue. It wasn't clear from your initial comment. I'll let Fan Yong comment on what LFSCK command to use to fix the filter_fid xattrs on the OST objects, until such a time that this fix is implemented. |
| Comment by nasf (Inactive) [ 05/Dec/17 ] |
|
The OST-object's PFID EA is a back reference to the MDT-object that references the OST-object via its LOV EA. The PFID has the following functionalities: 1) When you enable 'lfsck_verify_pfid' on the OST (disabled by default), the I/O logic on the OST checks whether the I/O target belongs to the expected regular file. If not, the I/O is denied. Usually, the PFID EA is set on the OST-object when the OST-object is modified for the first time (write/truncate/setattr). It also needs to be updated if the MDT-object's FID changes, as in the migration case. But in the old implementation, we only swapped the LOV EA of the MDT-objects during migration and ignored the PFID EA of the related OST-objects. Usually, such inconsistency is invisible/harmless to applications except in the above three cases, or when you run a special consistency check yourself, as your colleague did via debugfs. |
| Comment by nasf (Inactive) [ 05/Dec/17 ] |
|
Jinshan's patch https://review.whamcloud.com/#/c/30292/ will handle PFID during layout swap or merge (for FLR). Anyway, the layout LFSCK can repair the inconsistent PFID EA since Lustre-2.6 |
| Comment by nasf (Inactive) [ 05/Dec/17 ] |
|
To repair an inconsistent PFID EA, you can run the layout LFSCK, for example (run the command on MDT0000):
    lctl lfsck_start -M ${fsname}-MDT0000 -A -t layout -r
If you just want to check whether there is any inconsistency, add the dry-run option "--dryrun". |
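The command above can be wrapped so that the check-only and repair variants differ only in one argument. This is a minimal sketch, not an official tool: the function name `lfsck_layout_cmd` and its `dryrun` argument are hypothetical, and the function only prints the command line (using the flags quoted in this comment) instead of running it.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the layout LFSCK command
# quoted in this ticket.
#   $1 = file system name
#   $2 = "dryrun" to append the --dryrun option (optional)
lfsck_layout_cmd() {
    fsname=$1
    mode=${2:-repair}
    cmd="lctl lfsck_start -M ${fsname}-MDT0000 -A -t layout -r"
    if [ "$mode" = "dryrun" ]; then
        # Check only; do not repair.
        cmd="$cmd --dryrun"
    fi
    printf '%s\n' "$cmd"
}
```

For example, `lfsck_layout_cmd testfs dryrun` prints the check-only command for a file system named `testfs` (an example name), which can be reviewed before it is run on the MDS.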
| Comment by Nathan Dauchy (Inactive) [ 05/Dec/17 ] |
|
Thanks much for the clarification, nasf! I performed a dry-run lfsck as follows and did get one inconsistency. Please let me know if there is any other debug info I should gather, or just go ahead and re-run without the dry-run flag.
    lctl clear
    lctl debug_daemon start /var/log/lfsck-n.debug
    lctl lfsck_start -M nbp7-MDT0000 -A -t layout -r -n
    # wait for "lctl get_param -n osd-ldiskfs.*.oi_scrub | grep status" to show "completed"
    lctl debug_daemon stop
    lctl debug_file /var/log/lfsck-n.debug > /var/log/lfsck-n.log
    egrep -v "(kiblnd_passive_connect|ping_evictor_main)" /var/log/lfsck-n.log
    00000004:00020000:15.0F:1512490067.127409:0:89302:0:(lod_dev.c:651:lod_sync()) nbp7-MDT0000-mdtlov: can't sync 79: -107
    00000004:00020000:0.0F:1512490072.101529:0:89302:0:(lod_dev.c:651:lod_sync()) nbp7-MDT0000-mdtlov: can't sync 79: -107
    00080000:12000000:4.0F:1512490467.544841:0:43141:0:(osd_handler.c:574:osd_check_lma()) nbp7-MDT0000-osd: FID [0x20016d1ea:0x627:0x0] != self_fid [0x20016d1ea:0x62f:0x0]
    Debug log: 12 lines, 12 kept, 0 dropped, 0 bad.
    client ~ # lfs fid2path /nobackupp7/ 0x20016d1ea:0x627:0x0
    fid2path: error on FID 0x20016d1ea:0x627:0x0: No such file or directory
    client ~ # lfs fid2path /nobackupp7/ 0x20016d1ea:0x62f:0x0
    /nobackupp7/somepath/esmf_field.xsd
    client ~ # lfs path2fid /nobackupp7/somepath/esmf_field.xsd
    [0x20016d1ea:0x62f:0x0]
Note that the "can't sync 79" errors are almost certainly because we have finished running lfs_migrate on one of the OSTs (#79) already and taken it offline. |
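When a decoded debug log contains many lines, the mismatched FID pairs can be pulled out mechanically and then fed to `lfs fid2path` one by one, as was done by hand in this comment. The following is a rough sketch that assumes the `osd_check_lma()` message format shown in the log line in this comment; the function name `fid_mismatches` is hypothetical.

```shell
#!/bin/sh
# Hypothetical filter: read a decoded Lustre debug log on stdin and print
# "<FID> <self_fid>" for each osd_check_lma() mismatch line.
# Assumes the message format "FID [a:b:c] != self_fid [a:b:c]".
fid_mismatches() {
    sed -n 's/.*osd_check_lma().*FID \(\[[^]]*]\) != self_fid \(\[[^]]*]\).*/\1 \2/p'
}
```

For example, `fid_mismatches < /var/log/lfsck-n.log` prints one pair per mismatch line and nothing for unrelated lines such as the `lod_sync()` errors.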
| Comment by nasf (Inactive) [ 06/Dec/17 ] |
|
Please show me the output of:
    lctl get_param -n osd-ldiskfs.*.oi_scrub
    lctl get_param -n mdd.*.lfsck_layout
|
| Comment by nasf (Inactive) [ 06/Dec/17 ] |
repaired_unmatched_pair: 283
The layout LFSCK detected 283 unmatched MDT-object and OST-object pairs. Two choices: |
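A counter such as the one quoted above can be extracted from the saved `lfsck_layout` output with a small filter. This is a sketch only: it assumes the output consists of `name: value` lines like the one quoted here, and the function name `lfsck_counter` is hypothetical.

```shell
#!/bin/sh
# Hypothetical filter: print the value of one "name: value" counter
# from lfsck_layout output supplied on stdin.
#   $1 = counter name, e.g. repaired_unmatched_pair
lfsck_counter() {
    awk -v key="$1" -F': *' '$1 == key { print $2 }'
}
```

For example, `lctl get_param -n mdd.*.lfsck_layout | lfsck_counter repaired_unmatched_pair` would print only the number, which makes it easy to compare a dry run against a later repair run.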
| Comment by Nathan Dauchy (Inactive) [ 06/Dec/17 ] |
|
Redid the dry run and it reported a lot more than 283 errors! Commands I used are in the uploaded lfsck_2017-12-06-n.txt file, results in the other files should be evident. Please advise. |
| Comment by nasf (Inactive) [ 07/Dec/17 ] |
|
There is a known issue with repaired_inconsistent_owner; there were also a lot of unexpected repaired_inconsistent_owner entries in your earlier logs. They are not real inconsistencies, so please ignore them for now. On the other hand, as you can see, the Lustre debug logs were overwritten by those fake repaired_inconsistent_owner messages, so the useful repaired_unmatched_pair logs were lost. There are two choices: I will work on the fake inconsistent owner issue; once that is done, you can apply the related patch(es) and run the layout LFSCK again. |
| Comment by nasf (Inactive) [ 07/Dec/17 ] |
|
ndauchy, what is your Lustre version? |
| Comment by Peter Jones [ 07/Dec/17 ] |
|
It really seems like it would be better to move the NASA discussion to its own ticket rather than tacking it onto the end of this ticket, which tracks part of the FLR implementation. |
| Comment by Nathan Dauchy (Inactive) [ 07/Dec/17 ] |
|
|
| Comment by Gerrit Updater [ 17/Dec/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30292/ |
| Comment by Peter Jones [ 17/Dec/17 ] |
|
Is this task complete with the recent landing to master? |
| Comment by Gerrit Updater [ 21/Dec/17 ] |
|
Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/30635 |
| Comment by Jay Lan (Inactive) [ 25/Jan/19 ] |
|
If the work on #30635 is complete, can we land it to b2_10? |