[LU-15193] qsd_op_begin: more than 8 qids enforced for a transaction? Created: 03/Nov/21  Updated: 11/Dec/23  Resolved: 11/Jun/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.5
Fix Version/s: Lustre 2.16.0, Lustre 2.15.4

Type: Bug Priority: Minor
Reporter: Stephane Thiell Assignee: Feng Lei
Resolution: Fixed Votes: 0
Labels: None
Environment:

CentOS 7.9, Lustre 2.12.7 on clients


Issue Links:
Related
is related to LU-12388 expand QUOTA_MAX_TRANSIDS for Project... Closed
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This morning, we hit the following problem (new for us) on Fir (2.12.5 servers, 2.12.7 clients):

# rmdir /scratch/users/ragoglia/csATAC/For_Rachel
rmdir: failed to remove ‘/scratch/users/ragoglia/csATAC/For_Rachel’: Invalid argument

 
More info on this directory:

[root@fir-rbh01 ~]# ls -lisa /scratch/users/ragoglia/csATAC/For_Rachel
total 8
198162765779658672 4 drwxr-xr-x  2 atrev    wjg      4096 Oct 20 02:58 .
198162765779658671 4 drwxrwxr-x+ 3 ragoglia hbfraser 4096 Sep  5  2019 ..

[root@fir-rbh01 ~]# lfs getdirstripe /scratch/users/ragoglia/csATAC/For_Rachel
lmv_stripe_count: 0 lmv_stripe_offset: 2 lmv_hash_type: none

[root@fir-rbh01 ~]# lfs project -d /scratch/users/ragoglia/csATAC/For_Rachel
259557 P /scratch/users/ragoglia/csATAC/For_Rachel

[root@fir-rbh01 ~]# lfs getdirstripe /scratch/users/ragoglia/csATAC
lmv_stripe_count: 0 lmv_stripe_offset: 2 lmv_hash_type: none

[root@fir-rbh01 ~]# rmdir /scratch/users/ragoglia/csATAC/For_Rachel
rmdir: failed to remove ‘/scratch/users/ragoglia/csATAC/For_Rachel’: Invalid argument

MDS logs show:

Nov 03 09:52:49 fir-md1-s3 kernel: LustreError: 103307:0:(qsd_handler.c:884:qsd_op_begin()) fir-MDT0002: more than 8 qids enforced for a transaction?
Nov 03 10:10:09 fir-md1-s3 kernel: LustreError: 103781:0:(qsd_handler.c:884:qsd_op_begin()) fir-MDT0002: more than 8 qids enforced for a transaction?
Nov 03 10:10:35 fir-md1-s3 kernel: LustreError: 103781:0:(qsd_handler.c:884:qsd_op_begin()) fir-MDT0002: more than 8 qids enforced for a transaction?
Nov 03 10:10:52 fir-md1-s3 kernel: LustreError: 103697:0:(qsd_handler.c:884:qsd_op_begin()) fir-MDT0002: more than 8 qids enforced for a transaction?
Nov 03 10:15:55 fir-md1-s3 kernel: LustreError: 103721:0:(qsd_handler.c:884:qsd_op_begin()) fir-MDT0002: more than 8 qids enforced for a transaction?

This looks like LU-12388.
 
My colleague found a workaround for rmdir: set project ID 0 before rmdir, and it worked:

# lfs project -d  /scratch/users/ragoglia/csATAC/For_Rachel/
259557 P /scratch/users/ragoglia/csATAC/For_Rachel/
# lfs project -p 0  /scratch/users/ragoglia/csATAC/For_Rachel/
# lfs project -d  /scratch/users/ragoglia/csATAC/For_Rachel/
    0 P /scratch/users/ragoglia/csATAC/For_Rachel/
# rmdir /scratch/users/ragoglia/csATAC/For_Rachel/
#
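
A minimal sketch of how the workaround above could be scripted. It assumes the `rmdir` failure surfaces as EINVAL ("Invalid argument") as shown in this ticket, and it only uses the `lfs project -p 0 <dir>` invocation quoted above. The function names `reset_project_cmd` and `rmdir_with_workaround` are hypothetical helpers, not part of Lustre.

```python
import errno
import os
import subprocess


def reset_project_cmd(path):
    """Build the `lfs project -p 0 <path>` command used in the workaround."""
    return ["lfs", "project", "-p", "0", path]


def rmdir_with_workaround(path, run=subprocess.run, rmdir=os.rmdir):
    """Remove `path`; on EINVAL, reset its project ID to 0 and retry once.

    Returns True if the workaround was needed, False otherwise.
    `run` and `rmdir` are injectable so the logic can be tested without Lustre.
    """
    try:
        rmdir(path)
        return False
    except OSError as exc:
        if exc.errno != errno.EINVAL:
            raise  # unrelated failure (e.g. ENOTEMPTY, EACCES): do not mask it
    # Same step as the manual workaround: set project ID 0, then rmdir again.
    run(reset_project_cmd(path), check=True)
    rmdir(path)
    return True
```

The retry is deliberately limited to the EINVAL case so that other errors are still reported unchanged. The commands in this ticket were run as root, so the sketch assumes the same privileges.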

I believe this is the only occurrence we've seen of this issue so far. Let me know if additional logs would be helpful the next time we hit this.
 



 Comments   
Comment by Andreas Dilger [ 03/Nov/21 ]

It looks like a trivial fix. It would also be useful if the qsd_op_begin() error message were updated to print the FID, so that this is easier to debug in the future.

Comment by Andreas Dilger [ 03/Nov/21 ]

It may not be quite as trivial as I thought. The trivial solution is to increase the max quotas to 12 (<old,new> or <parent,child> x <user,group,project> x <block,inode>), but this may needlessly increase the transaction credits and hurt performance.

In the case of "rmdir" it isn't clear why the quota of the parent directory would be updated. The operation only removes a name from the leaf block and updates the link count on the inode, so until we have directory shrink there would be no changes to the parent ID quotas. So (IMHO) there should be at most 6 different IDs involved in this transaction.
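
The counting argument above can be sketched as a small enumeration. This is only an illustration of the arithmetic (2 x 3 x 2 = 12 versus 1 x 3 x 2 = 6), not the Lustre implementation. The constants take the values 8 and 12 from the error message and the patch subject in this ticket, and the helper `worst_case_slots` is a hypothetical name.

```python
from itertools import product

QUOTA_MAX_TRANSIDS_OLD = 8   # limit named in the MDS error message
QUOTA_MAX_TRANSIDS_NEW = 12  # limit named in the patch subject

OBJECTS = ("parent", "child")             # or ("old", "new")
ID_TYPES = ("user", "group", "project")
RESOURCES = ("block", "inode")


def worst_case_slots(objects=OBJECTS):
    """Enumerate every (object, ID type, resource) combination."""
    return list(product(objects, ID_TYPES, RESOURCES))


all_slots = worst_case_slots()                      # parent and child both charged
child_only = worst_case_slots(objects=("child",))   # parent quotas untouched

print(len(all_slots))   # 12
print(len(child_only))  # 6
```

Under this model the worst case of 12 exceeds the old limit of 8, while a transaction that touches only the child's IDs would stay at 6.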

Comment by Stephane Thiell [ 03/Nov/21 ]

Simple reproducer below.

It looks like the owner/group of the parent directory has to be different from the owner/group of the directory to be removed; to trigger the problem, the project IDs have to be different too.

From a directory that has project ID set by default (and inherited):

[root@fir-rbh01 LU-15193]# lfs project -d .
282232 P .

[root@fir-rbh01 LU-15193]# mkdir parent
[root@fir-rbh01 LU-15193]# chown atrev.wjg parent
[root@fir-rbh01 LU-15193]# mkdir parent/dir
[root@fir-rbh01 LU-15193]# chown sthiell.ruthm parent/dir

[root@fir-rbh01 LU-15193]# ls -lisa parent
total 12
180150013517627412 4 drwxr-xr-x 3 atrev   wjg   4096 Nov  3 13:50 .
180150013517627393 4 drwxr-xr-x 3 root    root  4096 Nov  3 13:49 ..
180150013517627413 4 drwxr-xr-x 2 sthiell ruthm 4096 Nov  3 13:50 dir

[root@fir-rbh01 LU-15193]# lfs project -p 215845 parent/dir

[root@fir-rbh01 LU-15193]# lfs project -d parent parent/dir
282232 P parent
215845 P parent/dir

[root@fir-rbh01 LU-15193]# rmdir parent/dir
rmdir: failed to remove 'parent/dir': Invalid argument

In practice, this is a very rare case for us, which is probably why we haven't seen it before.
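
A minimal sketch for spotting directories that match the reproducer conditions above. It assumes the `lfs project -d` output format shown in this ticket (project ID, flag, path) and reads the observation as "owner, group and project ID all differ between parent and child". The names `QuotaIds`, `parse_lfs_project_line` and `matches_reproducer` are hypothetical helpers.

```python
from typing import NamedTuple


class QuotaIds(NamedTuple):
    uid: int
    gid: int
    projid: int


def parse_lfs_project_line(line):
    """Split one `lfs project -d` output line into (projid, flag, path)."""
    projid, flag, path = line.split(None, 2)
    return int(projid), flag, path


def matches_reproducer(parent, child):
    """True when owner, group and project ID all differ (assumed reading)."""
    return (parent.uid != child.uid
            and parent.gid != child.gid
            and parent.projid != child.projid)
```

For example, `parse_lfs_project_line("215845 P parent/dir")` yields the project ID 215845, which can be combined with the `st_uid` and `st_gid` fields from `os.stat()` to fill in a `QuotaIds` value for each directory.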

Comment by Gerrit Updater [ 04/Nov/21 ]

"Feng, Lei <flei@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45456
Subject: LU-15193 quota: expand QUOTA_MAX_TRANSIDS to 12
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c995e003c75b15452d48fc94a31165eb194d4455

Comment by Stephane Thiell [ 27/Jan/22 ]

Hello! We are still hitting this problem. Example from this morning using a client running 2.12.8 and servers running 2.12.7:

Client-side:

[root@sh02-hn01 VASP]# pwd
/scratch/users/dasc/Projects/BATT-RIXS/VASP

[root@sh02-hn01 VASP]# ls -al
total 12
drwxrwxr-x+ 3 311749 tpd   4096 May 27  2021 .
drwxrwxr-x+ 4 311749 tpd   4096 Jun 16  2020 ..
drwxr-xr-x  2 292018 32269 4096 May 27  2021 Na2Mn3O7

[root@sh02-hn01 VASP]# rm -Rf Na2Mn3O7/
[root@sh02-hn01 VASP]# ls -al
total 12
drwxrwxr-x+ 3 311749 tpd   4096 May 27  2021 .
drwxrwxr-x+ 4 311749 tpd   4096 Jun 16  2020 ..
drwxr-xr-x  2 292018 32269 4096 May 27  2021 Na2Mn3O7

[root@sh02-hn01 VASP]# ls -al Na2Mn3O7/
total 8
drwxr-xr-x  2 292018 32269 4096 May 27  2021 .
drwxrwxr-x+ 3 311749 tpd   4096 May 27  2021 ..

[root@sh02-hn01 VASP]# rmdir Na2Mn3O7/
rmdir: failed to remove 'Na2Mn3O7/': Invalid argument
# lfs getdirstripe /scratch/users/dasc/Projects/BATT-RIXS
lmv_stripe_count: 0 lmv_stripe_offset: 2 lmv_hash_type: none
# lfs getdirstripe /scratch/users/dasc/Projects/BATT-RIXS/VASP
lmv_stripe_count: 0 lmv_stripe_offset: 2 lmv_hash_type: none

Server-side:

fir-md1-s3: Jan 27 11:04:36 fir-md1-s3 kernel: LustreError: 24993:0:(qsd_handler.c:884:qsd_op_begin()) fir-MDT0002: more than 8 qids enforced for a transaction?

Comment by Gerrit Updater [ 11/Jun/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45456/
Subject: LU-15193 quota: expand QUOTA_MAX_TRANSIDS to 12
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 61481796ac85e9ab2469b8d2f4cc75088c65d298

Comment by Peter Jones [ 11/Jun/22 ]

Landed for 2.16

Comment by Stephane Thiell [ 12/Jan/23 ]

It would be nice to have this patch backported to 2.15 LTS. Thanks!

Comment by Andreas Dilger [ 12/Jan/23 ]

"It would be nice to have this patch backported to 2.15 LTS. Thanks!"

For simple patches like this you can usually cherry-pick the patch directly to b2_15 within Gerrit. Click the "[Cherry-pick]" button on the patch, add "b2_15" for the branch name, edit the commit message to add labels "Lustre-change:" (from "Reviewed-on:") and "Lustre-commit:" (from the "cherry-picked" line at the end), and remove the "Tested-by:" and "Reviewed-by: Oleg Drokin" lines.

As an added benefit, if I don't cherry-pick the patch myself, I'm able to review it and it can be landed more quickly instead of waiting for someone else to review it.
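
The commit-message edits described above can be sketched as a small text transformation. This is an illustration under assumptions, not an official tool: it assumes the cherry-picked line has git's usual "(cherry picked from commit <sha>)" form, and that the new labels simply replace the "Reviewed-on:" and cherry-picked lines. The function name `rewrite_for_backport` is hypothetical.

```python
import re

# Assumed format of the line git appends when cherry-picking with a back-reference.
CHERRY_RE = re.compile(r"^\(cherry picked from commit ([0-9a-f]{7,40})\)$")


def rewrite_for_backport(message):
    """Apply the label changes described in the comment to a commit message."""
    out = []
    for line in message.splitlines():
        if line.startswith("Tested-by:"):
            continue  # drop all Tested-by lines
        if line.startswith("Reviewed-by: Oleg Drokin"):
            continue  # drop only this reviewer's line, keep the others
        if line.startswith("Reviewed-on:"):
            out.append("Lustre-change:" + line[len("Reviewed-on:"):])
            continue
        match = CHERRY_RE.match(line.strip())
        if match:
            out.append("Lustre-commit: " + match.group(1))
            continue
        out.append(line)
    return "\n".join(out)
```

Other "Reviewed-by:" lines are left in place, since the comment only asks for the "Tested-by:" and "Reviewed-by: Oleg Drokin" lines to be removed.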

Comment by Gerrit Updater [ 12/Jan/23 ]

"Feng Lei <flei@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49611
Subject: LU-15193 quota: expand QUOTA_MAX_TRANSIDS to 12
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 44db441ca5ce7395cd059e3230b4ae684db01830

Comment by Stephane Thiell [ 13/Jan/23 ]

Sounds good, makes sense, I will remember for next time, thanks Andreas and Feng!

Comment by Gerrit Updater [ 19/Oct/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49611/
Subject: LU-15193 quota: expand QUOTA_MAX_TRANSIDS to 12
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: c20d23cd92c5bc748a618e9ed96e6eddd794ab45
