FLR1: Landing tickets for File Level Redundancy Phase 1 (LU-9771)

[LU-10248] Need to update PFID of OST objects after layout change Created: 16/Nov/17  Updated: 25/Jan/19  Resolved: 19/Dec/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: Lustre 2.11.0

Type: Technical task Priority: Minor
Reporter: Jinshan Xiong (Inactive) Assignee: WC Triage
Resolution: Fixed Votes: 0
Labels: FLR

Attachments: lctl_lfsck_layout.out, lctl_oi_scrub.out, lfsck_2017-12-06-n.err, lfsck_2017-12-06-n.log.gz, lfsck_2017-12-06-n.status, lfsck_2017-12-06-n.txt
Issue Links:
Related
is related to LU-2677 Adding LMA to OST object Resolved
is related to LU-3128 filter_fid on OST not updated during ... Closed
is related to LU-9771 FLR1: Landing tickets for File Level ... Resolved

 Description   

The MDT should update the PFID information on OST objects after layout swap, split, and merge. However, a protocol needs to be defined to make this correct. Fanyong proposed using synchronous OUT setxattr. However, I tend to think it would be better to use llog, as the SETATTR and DESTROY RPCs do.

This problem has existed for a long time, since Lustre 2.4. LFSCK can fix it, but it would be better to have a mechanism that solves this kind of problem, because more and more similar problems will appear.



 Comments   
Comment by nasf (Inactive) [ 16/Nov/17 ]

LFSCK can handle an inconsistent PFID EA, but it is impractical to run LFSCK every time a layout is merged for the new FLR feature.

On the other hand, layout merge for FLR is very rare compared with other operations, so performance is not a big issue; as long as we can guarantee system consistency, that is enough. If such operations become common in the future, we can consider a better solution. So for now, following the lod_replace_parent_fid() logic can resolve most of the issue.

Comment by Alex Zhuravlev [ 16/Nov/17 ]

Saying "to use llog" is not very informative, as llog can be used (and already is) in many different ways.

Comment by Jinshan Xiong (Inactive) [ 16/Nov/17 ]

Alex - once you add restrictions such as MDT->OST communication and SETATTR- and DESTROY-like RPCs, there shouldn't be much room for confusion.

I think what we need here is a mechanism that performs this in a highly efficient way. Using OUT SETXATTR is more of a workaround. The current way of handling SETATTR and DESTROY is a pretty good framework, and we just need to extend it. This framework will also be enhanced to pack multiple records into a single RPC, which will lead to better performance.

Comment by Alex Zhuravlev [ 16/Nov/17 ]

I disagree, OUT is a much more flexible interface. If you really want to find a good solution, let's start from the requirements.
You probably missed this, but OUT has been able to batch operations from the beginning.

Comment by Jinshan Xiong (Inactive) [ 16/Nov/17 ]

When I looked at the OUT code, I saw that the operation of replacing the PFID is already there as lod_replace_parent_fid(), so I can probably just use it.

Therefore, before swapping or merging a layout, I will just declare and call xattr set on XATTR_NAME_FID to change the PFID. Does this have any side effects? For example, the local transaction fails but the remote operation succeeds, which would result in inconsistency, or something similar.

Comment by Jinshan Xiong (Inactive) [ 16/Nov/17 ]

Alex - If I understand correctly, OUT can only aggregate updates that belong to the same transaction. Since the SETATTR and DESTROY mechanism runs outside of a transaction, it can potentially pack any records from the LLOG as long as they are being sent to the same target device; the only limitation would be the RPC size.

I don't have any expertise in this area of the code, so I am probably only seeing the surface of this issue.
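
As a rough illustration of the packing idea described in this comment (not Lustre code), the following Python sketch groups pending records by target and splits each group into batches bounded by a maximum RPC size. The `Record` type, the target names, and the byte sizes are all hypothetical; the real llog record format and RPC limits are not shown in this ticket.

```python
from collections import defaultdict
from typing import NamedTuple


class Record(NamedTuple):
    """Hypothetical stand-in for a queued llog record."""
    target: str  # made-up target name, e.g. "OST0000"
    op: str      # e.g. "SETATTR" or "DESTROY"
    size: int    # encoded size in bytes (illustrative only)


def pack_records(records, max_rpc_bytes):
    """Group records by target, then cut each group into size-bounded batches.

    Returns a list of (target, [records]) tuples, one per simulated RPC.
    """
    by_target = defaultdict(list)
    for rec in records:
        by_target[rec.target].append(rec)

    rpcs = []
    for target, recs in by_target.items():
        batch, used = [], 0
        for rec in recs:
            if rec.size > max_rpc_bytes:
                raise ValueError(f"record larger than RPC limit: {rec}")
            # Flush the current batch when the next record would not fit.
            if batch and used + rec.size > max_rpc_bytes:
                rpcs.append((target, batch))
                batch, used = [], 0
            batch.append(rec)
            used += rec.size
        if batch:
            rpcs.append((target, batch))
    return rpcs
```

With a 100-byte limit, three 40-byte records for one target end up in two batches, and a record for a different target gets a batch of its own.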

Comment by Gerrit Updater [ 16/Nov/17 ]

Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: https://review.whamcloud.com/30137
Subject: LU-10248 mdd: set PFID for swap and merge layout
Project: fs/lustre-release
Branch: flr
Current Patch Set: 1
Commit: 9e7004ecc1a6ad9e8ec8b2259a6b91021d7c520c

Comment by Alex Zhuravlev [ 17/Nov/17 ]

No, one can send several different transactions (each composed of many updates) within a single RPC.

Comment by Jinshan Xiong (Inactive) [ 27/Nov/17 ]

The patch in this ticket will be landed to 2.11 as a bug fix

Comment by Gerrit Updater [ 28/Nov/17 ]

Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: https://review.whamcloud.com/30292
Subject: LU-10248 mdd: set PFID for swap and merge layout
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 824b7b4d9ccfb6166a042ef06c340bef1ddfcd4f

Comment by Nathan Dauchy (Inactive) [ 04/Dec/17 ]

It looks like we hit this bug at NASA, triggered simply by running lfs_migrate. (possibly complicated by the migration of files with multiple hard links) If the issue is that easy to trigger, shouldn't this patch be pulled in to 2.10.x LTS branch?

For those of us that hit the bug before the patch, precisely what lfsck args are needed to correct the problem? Something like:

lctl lfsck_start -M lfstest-MDT0000 -c on -C on -o -r

  (or is "-A" or "-t layout" or "-t namespace" also required?)

Also, in the test-framework script for the patch in this LU, I see what seems to be another way to trigger the fix, but can't find any documentation on it:

$LCTL set_param -n obdfilter.${FSNAME}-OST*.lfsck_verify_pfid=1

Comment by Andreas Dilger [ 04/Dec/17 ]

Nathan, could you describe which problem you are hitting? Did you hit the problem in LU-9941 when running "lfs migrate", or something else? This ticket doesn't really have any userspace visible symptoms unless there is already MDT corruption and you need to rebuild the filesystem with LFSCK.

Comment by Nathan Dauchy (Inactive) [ 05/Dec/17 ]

9941 is not familiar to me, did you mean a different LU?

What we are doing is using a modified version of Lester (an e2scan derivative) to quickly build lists of files on a set of OSTs to be drained (faster than lfs find), and then running lfs_migrate on those files. To parallelize the migrations, we use xargs to run multiple copies of lfs_migrate, so a race could definitely have been hit between two different instances working on the same file via different hard links. This was also started with an older ~2.7 version of lfs_migrate.

The PFID issue was not seen from userspace, but by one of my colleagues working on improving the OST scanning tool, who was looking at things with debugfs.  I'm afraid I did not save the details to give you the exact example though!

Comment by Andreas Dilger [ 05/Dec/17 ]

Ah, if you are scanning the OST inodes directly, and looking at the filter_fid xattr of migrated OST objects, then indeed this is the correct issue. It wasn't clear from your initial comment.

I'll let Fan Yong comment on what LFSCK command to use to fix the filter_fid xattrs on the OST objects, until such a time that this fix is implemented.

Comment by nasf (Inactive) [ 05/Dec/17 ]

The OST-object's PFID EA is a back-reference to the MDT-object, which in turn references the OST-object via its LOV EA. The PFID serves the following purposes:

1) When 'lfsck_verify_pfid' is enabled on the OST (it is disabled by default), the I/O logic on the OST checks whether the I/O target belongs to the expected regular file. If not, the I/O is denied.
2) The layout LFSCK uses the PFID and LOV EA to verify whether each MDT-object and OST-object pair consistently reference each other.
3) If the MDT-object is damaged or lost, the layout LFSCK uses the PFID EA to regenerate the MDT-object on the MDT to resolve global orphan issues.

Usually, the PFID EA is set on the OST-object when the OST-object is modified for the first time (write/truncate/setattr). It also needs to be updated if the MDT-object's FID changes, as in the migration case. But the old implementation only swapped the LOV EA of the MDT-objects during migration and ignored the PFID EA of the related OST-objects. Such inconsistency is usually invisible and harmless to applications, except in the three cases above or when you run a special consistency check yourself, as your colleague did via debugfs.
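
A minimal Python model of the back-reference described above, assuming the LOV EA can be reduced to a map from MDT FID to OST object ids and the PFID EA to a map from OST object id to parent FID. It is a sketch only: the function names, the string FIDs, and the dictionary layout are hypothetical and do not reflect Lustre's on-disk formats or LFSCK's actual algorithm. Objects with no PFID recorded are skipped, on the simplifying assumption that they have not been modified yet.

```python
def find_unmatched_pairs(lov_ea, pfid_ea):
    """Return (mdt_fid, ost_object, recorded_parent) for each stale PFID.

    lov_ea:  dict mapping an MDT FID to the list of OST objects it references.
    pfid_ea: dict mapping an OST object to the MDT FID stored in its PFID EA.
    """
    unmatched = []
    for mdt_fid, objects in lov_ea.items():
        for obj in objects:
            parent = pfid_ea.get(obj)
            # An unset PFID is treated as "not yet modified", not as an error.
            if parent is not None and parent != mdt_fid:
                unmatched.append((mdt_fid, obj, parent))
    return unmatched


def swap_layouts(lov_ea, fid_a, fid_b, pfid_ea=None):
    """Swap the layouts of two files; optionally rewrite PFIDs as well.

    Passing pfid_ea=None models the old behaviour (only the LOV EA is
    swapped); passing the PFID map models updating the back-references too.
    """
    lov_ea[fid_a], lov_ea[fid_b] = lov_ea[fid_b], lov_ea[fid_a]
    if pfid_ea is not None:
        for fid in (fid_a, fid_b):
            for obj in lov_ea[fid]:
                pfid_ea[obj] = fid
```

In this model, swapping two layouts without touching the PFID map leaves every swapped object pointing at its former parent, which is the kind of mismatch a layout check would report; passing the PFID map to the swap keeps both directions consistent.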

Comment by nasf (Inactive) [ 05/Dec/17 ]

Jinshan's patch https://review.whamcloud.com/#/c/30292/ will handle PFID during layout swap or merge (for FLR).

In any case, the layout LFSCK has been able to repair an inconsistent PFID EA since Lustre 2.6.

Comment by nasf (Inactive) [ 05/Dec/17 ]

To repair an inconsistent PFID EA, you can run the layout LFSCK, for example (run the command on MDT0000):

lctl lfsck_start -M ${fsname}-MDT0000 -A -t layout -r

To only check whether there is any inconsistency, add the dry-run option "--dryrun".

Comment by Nathan Dauchy (Inactive) [ 05/Dec/17 ]

Thanks much for the clarification, nasf!

I performed a dry-run lfsck as follows and did get one inconsistency.  Please let me know if there is any other debug info I should gather, or just go ahead and re-run without the dry-run flag.

lctl clear
lctl debug_daemon start /var/log/lfsck-n.debug
lctl lfsck_start -M nbp7-MDT0000 -A -t layout -r -n
# wait for "lctl get_param -n osd-ldiskfs.*.oi_scrub | grep status" to show "completed"
lctl debug_daemon stop
lctl debug_file /var/log/lfsck-n.debug > /var/log/lfsck-n.log
egrep -v "(kiblnd_passive_connect|ping_evictor_main)" /var/log/lfsck-n.log
00000004:00020000:15.0F:1512490067.127409:0:89302:0:(lod_dev.c:651:lod_sync()) nbp7-MDT0000-mdtlov: can't sync 79: -107
00000004:00020000:0.0F:1512490072.101529:0:89302:0:(lod_dev.c:651:lod_sync()) nbp7-MDT0000-mdtlov: can't sync 79: -107
00080000:12000000:4.0F:1512490467.544841:0:43141:0:(osd_handler.c:574:osd_check_lma()) nbp7-MDT0000-osd: FID [0x20016d1ea:0x627:0x0] != self_fid [0x20016d1ea:0x62f:0x0]
Debug log: 12 lines, 12 kept, 0 dropped, 0 bad.
client ~ # lfs fid2path /nobackupp7/ 0x20016d1ea:0x627:0x0
fid2path: error on FID 0x20016d1ea:0x627:0x0: No such file or directory
client ~# lfs fid2path /nobackupp7/ 0x20016d1ea:0x62f:0x0
/nobackupp7/somepath/esmf_field.xsd
client ~ # lfs path2fid /nobackupp7/somepath/esmf_field.xsd
[0x20016d1ea:0x62f:0x0]

Note that the "can't sync 79" errors are almost certainly because we have finished running lfs_migrate on one of the OSTs (#79) already and taken it offline.
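
When the debug log is larger than the one above, the FID mismatch lines can be picked out mechanically and the two FIDs passed to lfs fid2path as in the transcript. The Python sketch below assumes every such line has the same shape as the single osd_check_lma() message shown here; the function name is hypothetical and the pattern has not been checked against other Lustre versions.

```python
import re

# One bracketed FID such as [0x20016d1ea:0x627:0x0]; the brackets are not captured.
_FID = r"\[(0x[0-9a-fA-F]+:0x[0-9a-fA-F]+:0x[0-9a-fA-F]+)\]"
# Device name, then "FID [..] != self_fid [..]" as printed in the log above.
_MISMATCH = re.compile(r"(\S+): FID " + _FID + r" != self_fid " + _FID)


def find_fid_mismatches(log_text):
    """Return a list of (device, fid, self_fid) tuples found in log_text."""
    found = []
    for line in log_text.splitlines():
        match = _MISMATCH.search(line)
        if match:
            found.append(match.groups())
    return found
```

Lines that do not contain the mismatch pattern, such as the "can't sync" messages, are ignored.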

Comment by nasf (Inactive) [ 06/Dec/17 ]

Please show me the output:

lctl get_param -n osd-ldiskfs.*.oi_scrub
lctl get_param -n mdd.*.lfsck_layout

Comment by nasf (Inactive) [ 06/Dec/17 ]

repaired_unmatched_pair: 283

The layout LFSCK detected 283 unmatched MDT-object and OST-object pairs. There are two choices:
1) Run layout LFSCK again without the "dryrun" option, which will repair the inconsistency directly.
2) To be safe, enable the "lfsck" debug log on the MDT, then re-run the layout LFSCK in dryrun mode. After it completes, dump the Lustre kernel debug logs on the MDT; they will contain the inconsistencies found. Please enable ONLY the "lfsck" debug log to avoid filling the log buffer and overwriting entries.

Comment by Nathan Dauchy (Inactive) [ 06/Dec/17 ]

Redid the dry run and it reported a lot more than 283 errors! Commands I used are in the uploaded lfsck_2017-12-06-n.txt file, results in the other files should be evident. Please advise.

Comment by nasf (Inactive) [ 07/Dec/17 ]

There is a known issue with repaired_inconsistent_owner; there were also many unexpected repaired_inconsistent_owner entries in your earlier logs. They are not real inconsistencies, so please ignore them for now. On the other hand, as you can see, the Lustre debug logs were overwritten because of those fake repaired_inconsistent_owner messages, so the useful repaired_unmatched_pair logs were lost.

There are two choices:
1) Run layout LFSCK again without the "dryrun" option, which will repair the inconsistency directly. But because of the large number of repaired_inconsistent_owner messages, we may not be able to tell what was fixed.
2) Keep the inconsistency there. According to your former logs, there are about 283 unmatched MDT-object and OST-object pairs. These unmatched pairs will NOT affect the normal system access as described in the comment:
https://jira.hpdd.intel.com/browse/LU-10248?focusedCommentId=215286&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-215286

I will work on the fake inconsistent owner issue; once that is done, you can apply the related patch(es) and run layout LFSCK again.
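
Until then, the counters in the lfsck_layout status output can be filtered so that the known-noisy repaired_inconsistent_owner value does not hide the ones that matter. The Python sketch below assumes the status text consists of "name: value" lines like the repaired_unmatched_pair line quoted earlier in this ticket; the function names are hypothetical, and the sample input used with it should be treated as made up apart from that one counter.

```python
# Counter reported in this ticket as unreliable for now.
NOISY_COUNTERS = frozenset({"repaired_inconsistent_owner"})


def parse_repaired_counters(status_text):
    """Collect every 'repaired_*: <integer>' line into a dict."""
    counters = {}
    for line in status_text.splitlines():
        name, sep, value = line.partition(":")
        name, value = name.strip(), value.strip()
        if sep and name.startswith("repaired_") and value.isdigit():
            counters[name] = int(value)
    return counters


def counters_to_review(status_text, ignore=NOISY_COUNTERS):
    """Return non-zero repaired_* counters, skipping the ignored names."""
    return {
        name: count
        for name, count in parse_repaired_counters(status_text).items()
        if count > 0 and name not in ignore
    }
```

Saving the output of "lctl get_param -n mdd.*.lfsck_layout" to a file and passing its contents to counters_to_review() would then leave only the counters worth a closer look.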

Comment by nasf (Inactive) [ 07/Dec/17 ]

ndauchy, what is your Lustre version?

Comment by Peter Jones [ 07/Dec/17 ]

It really seems like it would be better to move the NASA discussion to its own ticket rather than tacking it onto the end of this ticket, which tracks part of the FLR implementation.

Comment by Nathan Dauchy (Inactive) [ 07/Dec/17 ]

LU-10349 created for NASA-specific debugging efforts

Comment by Gerrit Updater [ 17/Dec/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30292/
Subject: LU-10248 mdd: set PFID for swap and merge layout
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4d534365ab214e28452c54fd2e0d4781e2f290d6

Comment by Peter Jones [ 17/Dec/17 ]

Is this task complete with the recent landing to master?

Comment by Gerrit Updater [ 21/Dec/17 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/30635
Subject: LU-10248 mdd: set PFID for swap and merge layout
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 8d034f5aa4794d4dcd761b3e5ba2537995cb4e5d

Comment by Jay Lan (Inactive) [ 25/Jan/19 ]

If the work on #30635 is complete, can we land it to b2_10?
