LU-15193

qsd_op_begin: more than 8 qids enforced for a transaction?

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0, Lustre 2.15.4
    • Affects Version/s: Lustre 2.12.5
    • Labels: None
    • Environment: CentOS 7.9, Lustre 2.12.7 on clients
    • Severity: 3

    Description

      This morning, we hit the following problem (new for us) on Fir (2.12.5 servers, 2.12.7 clients):

      # rmdir /scratch/users/ragoglia/csATAC/For_Rachel
      rmdir: failed to remove ‘/scratch/users/ragoglia/csATAC/For_Rachel’: Invalid argument
      

       
      More info on this directory:

      [root@fir-rbh01 ~]# ls -lisa /scratch/users/ragoglia/csATAC/For_Rachel
      total 8
      198162765779658672 4 drwxr-xr-x  2 atrev    wjg      4096 Oct 20 02:58 .
      198162765779658671 4 drwxrwxr-x+ 3 ragoglia hbfraser 4096 Sep  5  2019 ..
      
      [root@fir-rbh01 ~]# lfs getdirstripe /scratch/users/ragoglia/csATAC/For_Rachel
      lmv_stripe_count: 0 lmv_stripe_offset: 2 lmv_hash_type: none
      
      [root@fir-rbh01 ~]# lfs project -d /scratch/users/ragoglia/csATAC/For_Rachel
      259557 P /scratch/users/ragoglia/csATAC/For_Rachel
      
      [root@fir-rbh01 ~]# lfs getdirstripe /scratch/users/ragoglia/csATAC
      lmv_stripe_count: 0 lmv_stripe_offset: 2 lmv_hash_type: none
      
      [root@fir-rbh01 ~]# rmdir /scratch/users/ragoglia/csATAC/For_Rachel
      rmdir: failed to remove ‘/scratch/users/ragoglia/csATAC/For_Rachel’: Invalid argument
      

      MDS logs show:

      Nov 03 09:52:49 fir-md1-s3 kernel: LustreError: 103307:0:(qsd_handler.c:884:qsd_op_begin()) fir-MDT0002: more than 8 qids enforced for a transaction?
      Nov 03 10:10:09 fir-md1-s3 kernel: LustreError: 103781:0:(qsd_handler.c:884:qsd_op_begin()) fir-MDT0002: more than 8 qids enforced for a transaction?
      Nov 03 10:10:35 fir-md1-s3 kernel: LustreError: 103781:0:(qsd_handler.c:884:qsd_op_begin()) fir-MDT0002: more than 8 qids enforced for a transaction?
      Nov 03 10:10:52 fir-md1-s3 kernel: LustreError: 103697:0:(qsd_handler.c:884:qsd_op_begin()) fir-MDT0002: more than 8 qids enforced for a transaction?
      Nov 03 10:15:55 fir-md1-s3 kernel: LustreError: 103721:0:(qsd_handler.c:884:qsd_op_begin()) fir-MDT0002: more than 8 qids enforced for a transaction?
      

      This looks like LU-12388.
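To see how often the error fires and on which targets, log lines like the ones above can be tallied per MDT. The following is only a sketch: the function name `count_qid_errors` is hypothetical, and it assumes the kernel log line format shown in this ticket, read from standard input.

```shell
# Hypothetical helper: tally "more than 8 qids" errors per Lustre target.
# Reads kernel log lines on stdin and prints "<target> <count>" per target.
# Assumes the line format shown in this ticket (target name like fir-MDT0002).
count_qid_errors() {
    grep 'more than 8 qids enforced for a transaction' |
        grep -o '[A-Za-z0-9_-]*-MDT[0-9a-f]*' |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

On an MDS that logs to the systemd journal (an assumption), it could be fed with `journalctl -k | count_qid_errors`.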
       
      My colleague found a workaround: setting the project ID to 0 before running rmdir, which worked:

      # lfs project -d  /scratch/users/ragoglia/csATAC/For_Rachel/
      259557 P /scratch/users/ragoglia/csATAC/For_Rachel/
      # lfs project -p 0  /scratch/users/ragoglia/csATAC/For_Rachel/
      # lfs project -d  /scratch/users/ragoglia/csATAC/For_Rachel/
          0 P /scratch/users/ragoglia/csATAC/For_Rachel/
      # rmdir /scratch/users/ragoglia/csATAC/For_Rachel/
      #
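The workaround above can be wrapped in a small helper. This is a minimal sketch, not a recommended tool: `rmdir_reset_projid` and the `DRY_RUN` variable are hypothetical names, and it uses only the `lfs project -p 0` and `rmdir` commands shown in the transcript.

```shell
# Hypothetical helper: clear the project ID on a directory, then remove it.
# Set DRY_RUN to any non-empty value to print the commands instead of running them.
rmdir_reset_projid() {
    dir=$1
    if [ -n "${DRY_RUN:-}" ]; then
        run=echo
    else
        run=
    fi
    # Reset the project ID to 0 first; only attempt rmdir if that succeeded.
    $run lfs project -p 0 "$dir" && $run rmdir "$dir"
}
```

Running it with `DRY_RUN=1` first shows exactly which commands would be executed for a given path.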
      

      I believe this is the only occurrence we've seen of this issue so far. Let me know if additional logs would be helpful the next time we hit this.
       

          Activity


            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49611/
            Subject: LU-15193 quota: expand QUOTA_MAX_TRANSIDS to 12
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: c20d23cd92c5bc748a618e9ed96e6eddd794ab45

            sthiell Stephane Thiell added a comment -

            Sounds good, makes sense, I will remember for next time, thanks Andreas and Feng!

            gerrit Gerrit Updater added a comment -

            "Feng Lei <flei@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49611
            Subject: LU-15193 quota: expand QUOTA_MAX_TRANSIDS to 12
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: 44db441ca5ce7395cd059e3230b4ae684db01830

            adilger Andreas Dilger added a comment -

            "It would be nice to have this patch backported to 2.15 LTS. Thanks!"

            For simple patches like this you can usually cherry-pick the patch directly to b2_15 within Gerrit. Click the "[Cherry-pick]" button on the patch, add "b2_15" for the branch name, edit the commit message to add labels "Lustre-change:" (from "Reviewed-on:") and "Lustre-commit:" (from the "cherry-picked" line at the end), and remove the "Tested-by:" and "Reviewed-by: Oleg Drokin" lines.

            As an added benefit, if I don't cherry-pick the patch myself, I'm able to review it, and it can be landed more quickly instead of waiting for someone else to review it.
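The commit message edits described above could also be scripted for a local cherry-pick. The filter below is a sketch under assumptions: `fix_backport_msg` is a hypothetical name, and it assumes the trailers appear one per line and that the cherry-pick line has git's usual `(cherry picked from commit <sha>)` form.

```shell
# Hypothetical filter: rewrite a commit message (stdin to stdout) for a backport.
#  - "Reviewed-on:" becomes "Lustre-change:"
#  - "(cherry picked from commit <sha>)" becomes "Lustre-commit: <sha>"
#  - "Tested-by:" and "Reviewed-by: Oleg Drokin" lines are dropped
fix_backport_msg() {
    sed -E \
        -e 's/^Reviewed-on:/Lustre-change:/' \
        -e 's/^\(cherry picked from commit ([0-9a-f]+)\)$/Lustre-commit: \1/' \
        -e '/^Tested-by:/d' \
        -e '/^Reviewed-by: Oleg Drokin/d'
}
```

One possible use after a local `git cherry-pick -x` is `git log -1 --format=%B | fix_backport_msg | git commit --amend -F -`. The result should still be checked by hand before pushing to Gerrit.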

            sthiell Stephane Thiell added a comment -

            It would be nice to have this patch backported to 2.15 LTS. Thanks!

            pjones Peter Jones added a comment -

            Landed for 2.16


            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45456/
            Subject: LU-15193 quota: expand QUOTA_MAX_TRANSIDS to 12
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 61481796ac85e9ab2469b8d2f4cc75088c65d298

            sthiell Stephane Thiell added a comment -

            Hello! We are still hitting this problem. Example from this morning using a client running 2.12.8 and servers running 2.12.7:

            Client-side:

            [root@sh02-hn01 VASP]# pwd
            /scratch/users/dasc/Projects/BATT-RIXS/VASP

            [root@sh02-hn01 VASP]# ls -al
            total 12
            drwxrwxr-x+ 3 311749 tpd   4096 May 27  2021 .
            drwxrwxr-x+ 4 311749 tpd   4096 Jun 16  2020 ..
            drwxr-xr-x  2 292018 32269 4096 May 27  2021 Na2Mn3O7

            [root@sh02-hn01 VASP]# rm -Rf Na2Mn3O7/
            [root@sh02-hn01 VASP]# ls -al
            total 12
            drwxrwxr-x+ 3 311749 tpd   4096 May 27  2021 .
            drwxrwxr-x+ 4 311749 tpd   4096 Jun 16  2020 ..
            drwxr-xr-x  2 292018 32269 4096 May 27  2021 Na2Mn3O7

            [root@sh02-hn01 VASP]# ls -al Na2Mn3O7/
            total 8
            drwxr-xr-x  2 292018 32269 4096 May 27  2021 .
            drwxrwxr-x+ 3 311749 tpd   4096 May 27  2021 ..

            [root@sh02-hn01 VASP]# rmdir Na2Mn3O7/
            rmdir: failed to remove 'Na2Mn3O7/': Invalid argument

            # lfs getdirstripe /scratch/users/dasc/Projects/BATT-RIXS
            lmv_stripe_count: 0 lmv_stripe_offset: 2 lmv_hash_type: none
            # lfs getdirstripe /scratch/users/dasc/Projects/BATT-RIXS/VASP
            lmv_stripe_count: 0 lmv_stripe_offset: 2 lmv_hash_type: none

            Server-side:

            fir-md1-s3: Jan 27 11:04:36 fir-md1-s3 kernel: LustreError: 24993:0:(qsd_handler.c:884:qsd_op_begin()) fir-MDT0002: more than 8 qids enforced for a transaction?

            gerrit Gerrit Updater added a comment -

            "Feng, Lei <flei@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45456
            Subject: LU-15193 quota: expand QUOTA_MAX_TRANSIDS to 12
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: c995e003c75b15452d48fc94a31165eb194d4455

            sthiell Stephane Thiell added a comment -

            Simple reproducer below.

            It looks like the problem is triggered only when the owner/group of the parent directory differs from the owner/group of the directory being removed and the two project IDs differ as well.

            From a directory that has project ID set by default (and inherited):

            [root@fir-rbh01 LU-15193]# lfs project -d .
            282232 P .

            [root@fir-rbh01 LU-15193]# mkdir parent
            [root@fir-rbh01 LU-15193]# chown atrev.wjg parent
            [root@fir-rbh01 LU-15193]# mkdir parent/dir
            [root@fir-rbh01 LU-15193]# chown sthiell.ruthm parent/dir

            [root@fir-rbh01 LU-15193]# ls -lisa parent
            total 12
            180150013517627412 4 drwxr-xr-x 3 atrev   wjg   4096 Nov  3 13:50 .
            180150013517627393 4 drwxr-xr-x 3 root    root  4096 Nov  3 13:49 ..
            180150013517627413 4 drwxr-xr-x 2 sthiell ruthm 4096 Nov  3 13:50 dir

            [root@fir-rbh01 LU-15193]# lfs project -p 215845 parent/dir

            [root@fir-rbh01 LU-15193]# lfs project -d parent parent/dir
            282232 P parent
            215845 P parent/dir

            [root@fir-rbh01 LU-15193]# rmdir parent/dir
            rmdir: failed to remove 'parent/dir': Invalid argument

            In practice, this is a very rare case for us, which is probably why we haven't seen it before.
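The reproducer steps above can be collected into one parameterized function for testing other systems. This is a sketch only: `reproduce_lu15193` and `DRY_RUN` are hypothetical names, the owners and project ID are site-specific placeholders, and the base directory is assumed to already carry a different, inherited project ID, as in the transcript.

```shell
# Hypothetical reproducer: create parent/dir with differing owners and a
# differing project ID, then try to remove the child directory.
# Arguments: base directory, parent owner, child owner, child project ID.
# Set DRY_RUN to any non-empty value to print the commands instead of running them.
reproduce_lu15193() {
    base=$1 parent_owner=$2 child_owner=$3 child_projid=$4
    if [ -n "${DRY_RUN:-}" ]; then
        run=echo
    else
        run=
    fi
    $run mkdir "$base/parent" &&
    $run chown "$parent_owner" "$base/parent" &&
    $run mkdir "$base/parent/dir" &&
    $run chown "$child_owner" "$base/parent/dir" &&
    $run lfs project -p "$child_projid" "$base/parent/dir" &&
    $run rmdir "$base/parent/dir"
}
```

With the values from this ticket, the call would be `reproduce_lu15193 . atrev:wjg sthiell:ruthm 215845`, run as root. On the affected servers reported here, the final `rmdir` is the step that failed with "Invalid argument".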

            People

              Assignee: flei Feng Lei
              Reporter: sthiell Stephane Thiell