Lustre / LU-10040

nodemap and quota issues (ineffective GID mapping)


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.11.0, Lustre 2.10.2
    • Affects Version/s: Lustre 2.10.0, Lustre 2.10.1
    • Severity: 2

    Description

      We're using the nodemap feature with map_mode=gid_only in production, and we are seeing more and more issues with GID mapping: GIDs seem to fall back to squash_gid instead of being properly mapped. The nodemap configuration hasn't changed for these groups; we just add new groups from time to time.
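
      For reference, the nodemap was created with lctl commands along these lines (a sketch from memory, with a single idmap entry shown for illustration; the exact NID-range syntax we used may have differed):

      # on the MGS: rough recreation of the 'sherlock' nodemap
      lctl nodemap_activate 1
      lctl nodemap_add sherlock
      # one way to cover the whole o2ib4 and o2ib3 networks
      lctl nodemap_add_range --name sherlock --range '[0-255].[0-255].[0-255].[0-255]@o2ib4'
      lctl nodemap_add_range --name sherlock --range '[0-255].[0-255].[0-255].[0-255]@o2ib3'
      lctl nodemap_modify --name sherlock --property map_mode --value gid_only
      lctl nodemap_modify --name sherlock --property squash_gid --value 99
      lctl nodemap_modify --name sherlock --property deny_unknown --value 1
      lctl nodemap_modify --name sherlock --property admin --value 0
      lctl nodemap_modify --name sherlock --property trusted --value 0
      # GID mapping for the group discussed below: client 11886 -> fs 3593
      lctl nodemap_add_idmap --name sherlock --idtype gid --idmap 11886:3593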

      Example: nodemap configuration for 'sherlock' on the MGS:

      [root@oak-md1-s1 sherlock]# pwd
      /proc/fs/lustre/nodemap/sherlock
      
      [root@oak-md1-s1 sherlock]# cat ranges 
      [
       { id: 6, start_nid: 0.0.0.0@o2ib4, end_nid: 255.255.255.255@o2ib4 },
       { id: 5, start_nid: 0.0.0.0@o2ib3, end_nid: 255.255.255.255@o2ib3 }
      ]
      
      [root@oak-md1-s1 sherlock]# cat idmap 
      [
       { idtype: gid, client_id: 3525, fs_id: 3741 }
       { idtype: gid, client_id: 6401, fs_id: 3752 }
       { idtype: gid, client_id: 99001, fs_id: 3159 }
       { idtype: gid, client_id: 10525, fs_id: 3351 }
       { idtype: gid, client_id: 11886, fs_id: 3593 }
       { idtype: gid, client_id: 12193, fs_id: 3636 }
       { idtype: gid, client_id: 13103, fs_id: 3208 }
       { idtype: gid, client_id: 17079, fs_id: 3700 }
       { idtype: gid, client_id: 19437, fs_id: 3618 }
       { idtype: gid, client_id: 22959, fs_id: 3745 }
       { idtype: gid, client_id: 24369, fs_id: 3526 }
       { idtype: gid, client_id: 26426, fs_id: 3352 }
       { idtype: gid, client_id: 29361, fs_id: 3746 }
       { idtype: gid, client_id: 29433, fs_id: 3479 }
       { idtype: gid, client_id: 30289, fs_id: 3262 }
       { idtype: gid, client_id: 32264, fs_id: 3199 }
       { idtype: gid, client_id: 32774, fs_id: 3623 }
       { idtype: gid, client_id: 38517, fs_id: 3702 }
       { idtype: gid, client_id: 40387, fs_id: 3708 }
       { idtype: gid, client_id: 47235, fs_id: 3674 }
       { idtype: gid, client_id: 48931, fs_id: 3325 }
       { idtype: gid, client_id: 50590, fs_id: 3360 }
       { idtype: gid, client_id: 52892, fs_id: 3377 }
       { idtype: gid, client_id: 56316, fs_id: 3353 }
       { idtype: gid, client_id: 56628, fs_id: 3411 }
       { idtype: gid, client_id: 59943, fs_id: 3372 }
       { idtype: gid, client_id: 63938, fs_id: 3756 }
       { idtype: gid, client_id: 100533, fs_id: 3281 }
       { idtype: gid, client_id: 244300, fs_id: 3617 }
       { idtype: gid, client_id: 254778, fs_id: 3362 }
       { idtype: gid, client_id: 267829, fs_id: 3748 }
       { idtype: gid, client_id: 270331, fs_id: 3690 }
       { idtype: gid, client_id: 305454, fs_id: 3371 }
       { idtype: gid, client_id: 308753, fs_id: 3367 }
      ]
      
      [root@oak-md1-s1 sherlock]# cat squash_gid 
      99
      [root@oak-md1-s1 sherlock]# cat map_mode 
      gid_only
      
      [root@oak-md1-s1 sherlock]# cat admin_nodemap 
      0
      [root@oak-md1-s1 sherlock]# cat deny_unknown 
      1
      [root@oak-md1-s1 sherlock]# cat trusted_nodemap 
      0
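
      If it helps, I can also query the mapping directly on the MGS with the nodemap test helpers (assuming lctl nodemap_test_nid / nodemap_test_id are the right tools here), using sh-113-01's NID:

      # on the MGS
      lctl nodemap_test_nid 10.9.113.1@o2ib4
      #   -> should print the nodemap name, i.e. sherlock
      lctl nodemap_test_id --nid 10.9.113.1@o2ib4 --idtype gid --id 11886
      #   -> expected 3593 if the mapping is intact, 99 (squash_gid) if it is not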
      
      
      

      Issue with group: GID 3593 (mapped to GID 11886 on sherlock)

      lfs quota from an unmapped node, using the canonical GID 3593:

      [root@oak-rbh01 ~]# lfs quota -g oak_euan /oak
      Disk quotas for group oak_euan (gid 3593):
           Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
                 /oak 33255114444  50000000000 50000000000       -  526016  7500000 7500000       -
      
      

      Broken lfs quota from a mapped sherlock client (o2ib4):

      [root@sh-113-01 ~]# lfs quota -g euan /oak
      Disk quotas for grp euan (gid 11886):
           Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
                 /oak 2875412844*      1       1       -      26*      1       1       -
      [root@sh-113-01 ~]# lctl list_nids
      10.9.113.1@o2ib4
      
      

      It matches the quota usage for the squash_gid (99):

      [root@oak-rbh01 ~]# lfs quota -g 99 /oak
      Disk quotas for group 99 (gid 99):
           Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
                 /oak 2875412844*      1       1       -      26*      1       1       -
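
      To enumerate which groups are currently affected, I can run a quick check from a sherlock client that compares each mapped group's reported usage with the squash_gid usage (rough sketch only; the client GIDs would be taken from the idmap above):

      #!/bin/bash
      # Flag client GIDs whose reported usage equals the squash_gid usage,
      # which suggests the GID mapping fell back to squash_gid on the server.
      FS=/oak
      SQUASH_KB=$(lfs quota -q -g 99 "$FS" | awk '{print $2}')
      for gid in 11886 32264 3525 6401; do        # subset of client GIDs, for illustration
              KB=$(lfs quota -q -g "$gid" "$FS" | awk '{print $2}')
              if [ "$KB" = "$SQUASH_KB" ]; then
                      echo "GID $gid: usage $KB matches squash_gid -> mapping likely broken"
              fi
      done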
      
      

       

      Please note, though, that GID mapping works fine for most of the groups:

      3199 -> 32264 (sherlock)
      
      canonical:
      [root@oak-rbh01 ~]# lfs quota -g oak_ruthm /oak
      Disk quotas for group oak_ruthm (gid 3199):
           Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
                 /oak 10460005688  20000000000 20000000000       - 1683058  3000000 3000000       -
      
      mapped (sherlock):
      [root@sh-113-01 ~]# lfs quota -g ruthm /oak
      Disk quotas for grp ruthm (gid 32264):
           Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
                 /oak 10460005688  20000000000 20000000000       - 1683058  3000000 3000000       -
      
      
      

      Failing over the MDT resolved the issue for a few groups, but not all. Failing the MDT back reproduced the issue on exactly the same groups as before (currently 4-5 groups affected).

      While I haven't seen it myself yet, the issue seems to affect users, as a few of them have reported erroneous EDQUOT errors. This is why it is quite urgent to figure out what's wrong. Please note that the issue was already present before we applied the patch from LU-9929.

      I'm happy to attach debug logs; which debug flags should I enable on the client and server to troubleshoot this kind of quota+nodemap issue?
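
      For reference, I was planning to start with something along these lines unless you suggest otherwise (assuming the quota and sec debug masks are the relevant ones):

      # on the MDS and on one affected client
      lctl set_param debug="+quota +sec"
      lctl clear                                # empty the debug buffer
      lfs quota -g euan /oak                    # reproduce (client side)
      lctl dk > /tmp/$(hostname -s).dk.log      # dump the debug log for upload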

      Thanks!
      Stephane

      Attachments

        1. break_nodemap_rbtree.sh
          0.7 kB
        2. oak-md1-s1.glb-grp.txt
          11 kB
        3. oak-md1-s2.dk.log
          1.25 MB
        4. oak-md1-s2.glb-grp.txt
          11 kB
        5. oak-md1-s2.mdt.dk.full.log
          53.84 MB
        6. reproducer.log
          3 kB
        7. sh-101-59.client.dk.full.log
          2.25 MB
        8. sh-113-01.dk.log
          547 kB



            People

              Assignee: Emoly Liu (emoly.liu)
              Reporter: Stephane Thiell (sthiell)
              Votes: 0
              Watchers: 4
