[LU-15193] qsd_op_begin: more than 8 qids enforced for a transaction? Created: 03/Nov/21 Updated: 11/Dec/23 Resolved: 11/Jun/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.5 |
| Fix Version/s: | Lustre 2.16.0, Lustre 2.15.4 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Stephane Thiell | Assignee: | Feng Lei |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS 7.9, Lustre 2.12.7 on clients |
||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
This morning, we hit the following problem (new for us) on Fir (2.12.5 servers, 2.12.7 clients):
# rmdir /scratch/users/ragoglia/csATAC/For_Rachel
rmdir: failed to remove ‘/scratch/users/ragoglia/csATAC/For_Rachel’: Invalid argument
[root@fir-rbh01 ~]# ls -lisa /scratch/users/ragoglia/csATAC/For_Rachel
total 8
198162765779658672 4 drwxr-xr-x 2 atrev wjg 4096 Oct 20 02:58 .
198162765779658671 4 drwxrwxr-x+ 3 ragoglia hbfraser 4096 Sep 5 2019 ..
[root@fir-rbh01 ~]# lfs getdirstripe /scratch/users/ragoglia/csATAC/For_Rachel
lmv_stripe_count: 0 lmv_stripe_offset: 2 lmv_hash_type: none
[root@fir-rbh01 ~]# lfs project -d /scratch/users/ragoglia/csATAC/For_Rachel
259557 P /scratch/users/ragoglia/csATAC/For_Rachel
[root@fir-rbh01 ~]# lfs getdirstripe /scratch/users/ragoglia/csATAC
lmv_stripe_count: 0 lmv_stripe_offset: 2 lmv_hash_type: none
[root@fir-rbh01 ~]# rmdir /scratch/users/ragoglia/csATAC/For_Rachel
rmdir: failed to remove ‘/scratch/users/ragoglia/csATAC/For_Rachel’: Invalid argument
MDS logs show:
Nov 03 09:52:49 fir-md1-s3 kernel: LustreError: 103307:0:(qsd_handler.c:884:qsd_op_begin()) fir-MDT0002: more than 8 qids enforced for a transaction?
Nov 03 10:10:09 fir-md1-s3 kernel: LustreError: 103781:0:(qsd_handler.c:884:qsd_op_begin()) fir-MDT0002: more than 8 qids enforced for a transaction?
Nov 03 10:10:35 fir-md1-s3 kernel: LustreError: 103781:0:(qsd_handler.c:884:qsd_op_begin()) fir-MDT0002: more than 8 qids enforced for a transaction?
Nov 03 10:10:52 fir-md1-s3 kernel: LustreError: 103697:0:(qsd_handler.c:884:qsd_op_begin()) fir-MDT0002: more than 8 qids enforced for a transaction?
Nov 03 10:15:55 fir-md1-s3 kernel: LustreError: 103721:0:(qsd_handler.c:884:qsd_op_begin()) fir-MDT0002: more than 8 qids enforced for a transaction?
Which looks like
# lfs project -d /scratch/users/ragoglia/csATAC/For_Rachel/
259557 P /scratch/users/ragoglia/csATAC/For_Rachel/
# lfs project -p 0 /scratch/users/ragoglia/csATAC/For_Rachel/
# lfs project -d /scratch/users/ragoglia/csATAC/For_Rachel/
0 P /scratch/users/ragoglia/csATAC/For_Rachel/
# rmdir /scratch/users/ragoglia/csATAC/For_Rachel/
#
I believe this is the only occurrence we've seen of this issue so far. Let me know if additional logs would be helpful the next time we hit this. |
| Comments |
| Comment by Andreas Dilger [ 03/Nov/21 ] |
|
It looks like a trivial fix. It would also be useful if the qsd_op_begin() error message were updated to print the FID, so that this is easier to debug in the future. |
| Comment by Andreas Dilger [ 03/Nov/21 ] |
|
It may not be quite as trivial as I thought. The trivial solution is to increase the max quotas to 12 (<old,new> or <parent,child> x <user,group,project> x <block,inode>), but this may needlessly increase the transaction credits and hurt performance. In the case of "rmdir" it isn't clear why the quota of the parent directory would be updated: it only removes a name from the leaf block and updates the link count on the inode, so until we have directory shrink there would be no changes to the parent ID quotas. So (IMHO) there should be at most 6 different IDs involved in this transaction. |
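The arithmetic in the comment above can be checked with a small shell sketch. It only counts combinations of objects, ID types and resources as described in the comment; the function name `count_qids` is made up for illustration and nothing here reflects how the Lustre quota code actually tracks IDs.

```shell
#!/bin/sh
# Hypothetical helper: count quota-ID slots for a transaction as
# (objects) x (user, group, project) x (block, inode).
count_qids() {
    n=0
    for obj in $1; do
        for id_type in user group project; do
            for resource in block inode; do
                n=$((n + 1))
            done
        done
    done
    echo "$n"
}

count_qids "parent child"   # both objects charged: 2 x 3 x 2
count_qids "child"          # only the removed directory charged: 1 x 3 x 2
```

The first call prints 12 and the second prints 6, matching the two figures quoted in the comment; the existing limit of 8 sits between them.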
| Comment by Stephane Thiell [ 03/Nov/21 ] |
|
Simple reproducer below. It looks like the owner/group of the parent directory has to be different from the owner/group of the directory to be removed, and to trigger the problem, the project IDs have to be different too.
From a directory that has a project ID set by default (and inherited):
[root@fir-rbh01 LU-15193]# lfs project -d .
282232 P .
[root@fir-rbh01 LU-15193]# mkdir parent
[root@fir-rbh01 LU-15193]# chown atrev.wjg parent
[root@fir-rbh01 LU-15193]# mkdir parent/dir
[root@fir-rbh01 LU-15193]# chown sthiell.ruthm parent/dir
[root@fir-rbh01 LU-15193]# ls -lisa parent
total 12
180150013517627412 4 drwxr-xr-x 3 atrev wjg 4096 Nov 3 13:50 .
180150013517627393 4 drwxr-xr-x 3 root root 4096 Nov 3 13:49 ..
180150013517627413 4 drwxr-xr-x 2 sthiell ruthm 4096 Nov 3 13:50 dir
[root@fir-rbh01 LU-15193]# lfs project -p 215845 parent/dir
[root@fir-rbh01 LU-15193]# lfs project -d parent parent/dir
282232 P parent
215845 P parent/dir
[root@fir-rbh01 LU-15193]# rmdir parent/dir
rmdir: failed to remove 'parent/dir': Invalid argument
In practice, for us, that's definitely a very rare case, which is probably why we haven't seen it before. |
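For convenience, the reproducer steps above could be wrapped in a script along these lines. This is a sketch only: the `run` and `reproduce` helpers and the `DRY_RUN` variable are invented for illustration, the user, group and project ID values are the site-specific ones from the session above and would need to be replaced elsewhere, and `chown` is written with the `user:group` separator rather than the dotted form used in the session. By default it only prints the commands.

```shell
#!/bin/sh
# Dry-run by default: print each command instead of executing it.
# Set DRY_RUN=0 only as root on a Lustre client, inside a directory
# that already has a default (inherited) project ID.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, else run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reproduce() {
    run mkdir parent
    run chown atrev:wjg parent              # parent owner/group
    run mkdir parent/dir
    run chown sthiell:ruthm parent/dir      # different owner/group
    run lfs project -p 215845 parent/dir    # different project ID
    run lfs project -d parent parent/dir
    run rmdir parent/dir                    # failed with EINVAL in the report
}

reproduce
```

On an affected server the final `rmdir` would be expected to fail with "Invalid argument", as in the session above.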
| Comment by Gerrit Updater [ 04/Nov/21 ] |
|
"Feng, Lei <flei@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45456 |
| Comment by Stephane Thiell [ 27/Jan/22 ] |
|
Hello! We are still hitting this problem. Example from this morning using a client running 2.12.8 and servers running 2.12.7:
Client-side:
[root@sh02-hn01 VASP]# pwd
/scratch/users/dasc/Projects/BATT-RIXS/VASP
[root@sh02-hn01 VASP]# ls -al
total 12
drwxrwxr-x+ 3 311749 tpd 4096 May 27 2021 .
drwxrwxr-x+ 4 311749 tpd 4096 Jun 16 2020 ..
drwxr-xr-x 2 292018 32269 4096 May 27 2021 Na2Mn3O7
[root@sh02-hn01 VASP]# rm -Rf Na2Mn3O7/
[root@sh02-hn01 VASP]# ls -al
total 12
drwxrwxr-x+ 3 311749 tpd 4096 May 27 2021 .
drwxrwxr-x+ 4 311749 tpd 4096 Jun 16 2020 ..
drwxr-xr-x 2 292018 32269 4096 May 27 2021 Na2Mn3O7
[root@sh02-hn01 VASP]# ls -al Na2Mn3O7/
total 8
drwxr-xr-x 2 292018 32269 4096 May 27 2021 .
drwxrwxr-x+ 3 311749 tpd 4096 May 27 2021 ..
[root@sh02-hn01 VASP]# rmdir Na2Mn3O7/
rmdir: failed to remove 'Na2Mn3O7/': Invalid argument
# lfs getdirstripe /scratch/users/dasc/Projects/BATT-RIXS
lmv_stripe_count: 0 lmv_stripe_offset: 2 lmv_hash_type: none
# lfs getdirstripe /scratch/users/dasc/Projects/BATT-RIXS/VASP
lmv_stripe_count: 0 lmv_stripe_offset: 2 lmv_hash_type: none
Server-side:
fir-md1-s3: Jan 27 11:04:36 fir-md1-s3 kernel: LustreError: 24993:0:(qsd_handler.c:884:qsd_op_begin()) fir-MDT0002: more than 8 qids enforced for a transaction? |
| Comment by Gerrit Updater [ 11/Jun/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45456/ |
| Comment by Peter Jones [ 11/Jun/22 ] |
|
Landed for 2.16 |
| Comment by Stephane Thiell [ 12/Jan/23 ] |
|
It would be nice to have this patch backported to 2.15 LTS. Thanks! |
| Comment by Andreas Dilger [ 12/Jan/23 ] |
For simple patches like this you can usually cherry-pick the patch directly to b2_15 within Gerrit. Click the "[Cherry-pick]" button on the patch, add "b2_15" for the branch name, edit the commit message to add labels "Lustre-change:" (from "Reviewed-on:") and "Lustre-commit:" (from the "cherry-picked" line at the end), and remove the "Tested-by:" and "Reviewed-by: Oleg Drokin" lines. As an added benefit, if I don't cherry-pick the patch myself, I'm able to review it and it can be landed more quickly instead of waiting for someone else to review it. |
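The commit message edits described above can be sketched as a `sed` filter. This is one reading of the instructions, not an official tool: the function names `retag_commit_message` and `sample_message` are made up, the sample message (subject, author, reviewer, tester and commit hash) is a placeholder apart from the Gerrit URL and the "Reviewed-by: Oleg Drokin" line that appear in this ticket, and the exact wording of the "cherry picked" line is assumed to follow the usual `git cherry-pick -x` format.

```shell
#!/bin/sh
# Hypothetical filter: rewrite a cherry-picked commit message on stdin.
retag_commit_message() {
    sed -E \
        -e 's/^Reviewed-on: /Lustre-change: /' \
        -e 's/^\(cherry picked from commit ([0-9a-f]+)\)$/Lustre-commit: \1/' \
        -e '/^Tested-by: /d' \
        -e '/^Reviewed-by: Oleg Drokin /d'
}

# Placeholder commit message used only to exercise the filter.
sample_message() {
    cat <<'EOF'
LU-15193 quota: placeholder subject line

Placeholder body.

Signed-off-by: Example Author <author@example.com>
Reviewed-on: https://review.whamcloud.com/45456
Tested-by: Example Tester <tester@example.com>
Reviewed-by: Example Reviewer <reviewer@example.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
(cherry picked from commit 0123456789abcdef0123456789abcdef01234567)
EOF
}

sample_message | retag_commit_message
```

The filter renames the two lines in place and leaves other "Reviewed-by:" lines untouched; in Gerrit the same edits would be made by hand in the cherry-pick dialog.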
| Comment by Gerrit Updater [ 12/Jan/23 ] |
|
"Feng Lei <flei@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49611 |
| Comment by Stephane Thiell [ 13/Jan/23 ] |
|
Sounds good, makes sense, I will remember for next time, thanks Andreas and Feng! |
| Comment by Gerrit Updater [ 19/Oct/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49611/ |