[LU-11459] llsom_sync updates LSOM data for some files when called with non-existent user Created: 02/Oct/18  Updated: 21/Jan/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Qian Yingjin
Resolution: Unresolved Votes: 0
Labels: LSOM, patch

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Looks like llsom_sync is not working correctly when a non-existent changelog user is entered.

For example, I registered two changelog users on the MDS, set llite.*.xattr_cache = 0 on the client (vm4), then created a new file and overwrote an existing file:

[vm3 ~]# dd if=/dev/urandom of=/lustre/scratch/ddfile3 bs=37k count=200
200+0 records in
200+0 records out
7577600 bytes (7.6 MB) copied, 0.0464932 s, 163 MB/s

[vm4 ~]# echo "aa" > /lustre/scratch/ddfile2

Looking at the LSOM data, we can see the size is correct, but the blocks are not updated.

[vm4 ~]# lfs getsom /lustre/scratch/ddfile2
file: /lustre/scratch/ddfile2 size: 3 blocks: 0 flags: 4
[vm4 ~]# lfs getsom /lustre/scratch/ddfile3
file: /lustre/scratch/ddfile3 size: 7577600 blocks: 0 flags: 4

Now, sync the LSOM data, but give it a bad changelog user ID.

[vm4 ~]# llsom_sync --mdt scratch-MDT0000 --user cli4 -v /lustre/scratch/
Start receiving records
Processed changelog record index:5 type:XATTR(0xf) FID:[0x200000402:0x1:0x0]
Processed changelog record index:6 type:CREAT(0x1) FID:[0x200000402:0x2:0x0]
Processed changelog record index:7 type:XATTR(0xf) FID:[0x200000402:0x2:0x0]
Processed changelog record index:8 type:CLOSE(0xb) FID:[0x200000402:0x2:0x0]
Processed changelog record index:9 type:XATTR(0xf) FID:[0x200000402:0x1:0x0]
Processed changelog record index:10 type:TRUNC(0xd) FID:[0x200000402:0x1:0x0]
Processed changelog record index:11 type:XATTR(0xf) FID:[0x200000402:0x1:0x0]
Processed changelog record index:12 type:CLOSE(0xb) FID:[0x200000402:0x1:0x0]
finished reading [scratch-MDT0000]
Start to sync 2 records.
record 1651949901960989620:8, updated LSOM for fid [0x200000402:0x2:0x0] size:7577600 blocks:14800
llsom_sync: cannot purge records for 'cli4': Invalid argument (22)
llsom_sync: failed to clear changelog record: cli4:8: Invalid argument (22)
[vm4 ~]# lfs getsom /lustre/scratch/ddfile3
file: /lustre/scratch/ddfile3 size: 7577600 blocks: 14800 flags: 4
[vm4 ~]# lfs getsom /lustre/scratch/ddfile2
file: /lustre/scratch/ddfile2 size: 3 blocks: 0 flags: 4

The changelog record purge error is expected, since there is no user cli4. The problem is that one file's LSOM data (blocks) is updated while the other file's is not.
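
To spot this kind of partial update across many files, the getsom output can be checked mechanically. The following is a minimal sketch, assuming the one-line "file: ... size: ... blocks: ... flags: ..." format shown in this ticket and paths without spaces; the helper names parse_getsom and stale_blocks are hypothetical, and "non-zero size with zero blocks" is only a heuristic for LSOM data that has not been synced yet.

```python
import re

# Matches one line of `lfs getsom` output as it appears in this ticket.
# Assumption: paths contain no whitespace.
GETSOM_RE = re.compile(
    r"^file: (?P<path>\S+) size: (?P<size>\d+) "
    r"blocks: (?P<blocks>\d+) flags: (?P<flags>\w+)$"
)


def parse_getsom(line):
    """Turn one getsom output line into a dict (hypothetical helper)."""
    m = GETSOM_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized getsom line: {line!r}")
    return {
        "path": m["path"],
        "size": int(m["size"]),
        "blocks": int(m["blocks"]),
        "flags": m["flags"],
    }


def stale_blocks(lines):
    """Return paths whose LSOM size is non-zero while blocks is still 0."""
    entries = (parse_getsom(line) for line in lines)
    return [e["path"] for e in entries if e["size"] > 0 and e["blocks"] == 0]


# The two getsom lines captured after the run with the bad user 'cli4'.
after_bad_user = [
    "file: /lustre/scratch/ddfile3 size: 7577600 blocks: 14800 flags: 4",
    "file: /lustre/scratch/ddfile2 size: 3 blocks: 0 flags: 4",
]

if __name__ == "__main__":
    print(stale_blocks(after_bad_user))
```

Fed with the two lines above, the check flags only ddfile2, which matches what the transcript shows.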

From the output of llsom_sync, it looks like the LSOM update is interrupted once the tool discovers that the user is not valid. There seem to be two issues here:
1. We should update the LSOM data of all files or none of the files when a bad user ID is given.
2. We are not checking the validity of the user at an appropriate time.

In the llsom_sync code, the changelog user is not checked until llapi_changelog_clear() is called to purge changelog records, and it is this routine that returns the error.
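
The ordering problem can be illustrated with a small model. This is only a sketch of the "validate first" control flow, not the llsom_sync implementation: sync_lsom, UnknownChangelogUser, and the update_lsom and clear_records callbacks are hypothetical stand-ins rather than Lustre APIs, and the sketch says nothing about how the set of registered users would be obtained from the MDS. The user name cl1 is an assumption; cl2 and cli4 come from the transcripts.

```python
class UnknownChangelogUser(Exception):
    """Raised when the requested changelog user is not registered."""


def sync_lsom(user, registered_users, pending, update_lsom, clear_records):
    """Validate the changelog user before touching any file.

    All parameters are hypothetical stand-ins for illustration:
      registered_users: set of known changelog user IDs
      pending:          list of (record_index, fid) pairs to sync
      update_lsom:      callback that updates LSOM for one FID
      clear_records:    callback that purges records up to an index
    """
    # Fail early, so a bad user ID never leads to a partial update.
    if user not in registered_users:
        raise UnknownChangelogUser(user)

    updated = []
    for index, fid in pending:
        update_lsom(fid)
        clear_records(user, index)
        updated.append(fid)
    return updated


# Record indexes and FIDs taken from the transcripts in this ticket.
pending = [
    (8, "[0x200000402:0x2:0x0]"),
    (12, "[0x200000402:0x1:0x0]"),
]
registered = {"cl1", "cl2"}  # 'cl1' is assumed; 'cl2' is the valid user above

if __name__ == "__main__":
    touched = []
    try:
        sync_lsom("cli4", registered, pending, touched.append, lambda u, i: None)
    except UnknownChangelogUser as exc:
        print(f"rejected user {exc} before updating any file: {touched}")
    print(sync_lsom("cl2", registered, pending, touched.append, lambda u, i: None))
```

With the check placed first, a bad user ID is reported immediately and no file is modified, which addresses the second issue above without requiring all-or-none semantics for the update itself.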

With a valid changelog user, the LSOM data of all files is updated.

[vm4 ~]# llsom_sync --mdt scratch-MDT0000 --user cl2 -v /lustre/scratch/
Start receiving records
Processed changelog record index:5 type:XATTR(0xf) FID:[0x200000402:0x1:0x0]
Processed changelog record index:6 type:CREAT(0x1) FID:[0x200000402:0x2:0x0]
Processed changelog record index:7 type:XATTR(0xf) FID:[0x200000402:0x2:0x0]
Processed changelog record index:8 type:CLOSE(0xb) FID:[0x200000402:0x2:0x0]
Processed changelog record index:9 type:XATTR(0xf) FID:[0x200000402:0x1:0x0]
Processed changelog record index:10 type:TRUNC(0xd) FID:[0x200000402:0x1:0x0]
Processed changelog record index:11 type:XATTR(0xf) FID:[0x200000402:0x1:0x0]
Processed changelog record index:12 type:CLOSE(0xb) FID:[0x200000402:0x1:0x0]
Processed changelog record index:13 type:XATTR(0xf) FID:[0x200000402:0x2:0x0]
finished reading [scratch-MDT0000]
Start to sync 2 records.
record 1651949901960989620:8, updated LSOM for fid [0x200000402:0x2:0x0] size:7577600 blocks:14800
record 1651949963251084516:12, updated LSOM for fid [0x200000402:0x1:0x0] size:3 blocks:8
[vm4 ~]# lfs getsom /lustre/scratch/ddfile3
file: /lustre/scratch/ddfile3 size: 7577600 blocks: 14800 flags: 4
[vm4 ~]# lfs getsom /lustre/scratch/ddfile2
file: /lustre/scratch/ddfile2 size: 3 blocks: 8 flags: 4
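
The difference between the two runs can also be read directly from the verbose output. The following is a rough sketch, assuming the "Processed changelog record ..." and "updated LSOM for fid ..." line formats shown in the transcripts; the summarize helper is hypothetical. It lists the FIDs that appeared in changelog records but never got an "updated LSOM" line.

```python
import re

# Line formats as they appear in the verbose llsom_sync output in this ticket.
PROCESSED_RE = re.compile(
    r"Processed changelog record index:(\d+) "
    r"type:(\w+)\(0x[0-9a-f]+\) FID:(\[[^\]]+\])"
)
UPDATED_RE = re.compile(
    r"updated LSOM for fid (\[[^\]]+\]) size:(\d+) blocks:(\d+)"
)


def summarize(log_text):
    """Return (FIDs seen but not updated, {fid: (size, blocks)} for updates)."""
    seen = set()
    updated = {}
    for line in log_text.splitlines():
        m = PROCESSED_RE.search(line)
        if m:
            seen.add(m.group(3))
            continue
        m = UPDATED_RE.search(line)
        if m:
            updated[m.group(1)] = (int(m.group(2)), int(m.group(3)))
    return sorted(seen - set(updated)), updated


# Excerpt of the run with the non-existent user 'cli4'.
bad_user_log = """\
Processed changelog record index:8 type:CLOSE(0xb) FID:[0x200000402:0x2:0x0]
Processed changelog record index:12 type:CLOSE(0xb) FID:[0x200000402:0x1:0x0]
Start to sync 2 records.
record 1651949901960989620:8, updated LSOM for fid [0x200000402:0x2:0x0] size:7577600 blocks:14800
llsom_sync: cannot purge records for 'cli4': Invalid argument (22)
"""

if __name__ == "__main__":
    missing, updated = summarize(bad_user_log)
    print("not updated:", missing)
    print("updated:", updated)
```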


 Comments   
Comment by Andreas Dilger [ 02/Oct/18 ]

I don't think we need an "all-or-none" semantic for LSOM data. Since LSOM is, by definition, lazy, there may be any number of reasons why the LSOM attrs are updated or not (e.g. some other client accessed the file, it was stored only on the MDT, whatever).

I agree it would be useful to have the LSOM tool check for a valid Changelog user when it is first run, so that it can complain to the user appropriately.

Comment by Peter Jones [ 03/Oct/18 ]

Qian

Can you please investigate?

Thanks

Peter

Comment by Qian Yingjin [ 08/Oct/18 ]

I dug into the code. To have the LSOM sync tool check for a valid Changelog user, we need to add an extra RPC to the MDS. I will make a patch soon.

Comment by Gerrit Updater [ 08/Oct/18 ]

Yingjin Qian (qian@ddn.com) uploaded a new patch: https://review.whamcloud.com/33315
Subject: LU-11459 changelog: valid check for a given changelog user
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2c5c1b7751b0873e2e0b1f905d2926a9d64501e8
