LU-17922

nodemap should allow declaring ID ranges

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version/s: Lustre 2.16.0
    • Fix Version/s: Lustre 2.16.0

    Description

      It is currently not possible to specify an ID range when using "lctl nodemap_add_idmap" to specify a UID/GID/PROJID mapping for a nodemap; a separate record must be created for each "CLID:FSID" ID mapping. That means an environment with a large number of mapped IDs (e.g. tenants in a virtualized environment) could potentially need hundreds or thousands of nodemap records, if the IDs are not all squashed to a single FSID.

      The command line interface could specify a range of values like "lctl nodemap_add_idmap --name MAP --idtype TYPE --idmap CLSTART-CLEND:FSSTART[-FSEND]". The "FSEND" parameter is redundant in this case and would not need to be specified, but it may be good to accept it as a double-check and/or a visual reminder to the admin of which CLID and FSID ranges are used by this nodemap. Alternately, or additionally, it would be possible to specify the number of IDs being mapped, e.g. "lctl nodemap_add_idmap --name NODEMAP --idtype TYPE --idmap CLSTART:FSSTART --count=NUMBER", to avoid the chance of an inconsistency in the *END values, but this would need to be converted to a range due to internal implementation details (see further below).

      The "expedient" solution would be to loop over the specified range/count internally in jt_nodemap_add_idmap() and generate a series of commands to the kernel that maps one ID at a time. However, I worry that expanding the range into individual records could be accidentally misused, with an admin thinking "let's map a range of 1M UIDs and 1M GIDs for each nodemap just in case they are needed" and that generates many millions of configuration records that every target needs to process at mount time, if this many records can even be stored in the config logs in the first place.

      The better solution would be to store the ID range in the configuration record, and then expand it when it is processed (which would give 1M separate records in memory); since the nodemap code uses an rbtree structure for lookups, this wouldn't be fatal. The data structure that holds the ID mapping in the NODEMAP_UIDMAP_IDX, NODEMAP_GIDMAP_IDX, and NODEMAP_PROJIDMAP_IDX records is:

      struct nodemap_id_rec {
              __u32   nir_id_fs;      /* filesystem ID that the client ID maps to */
              __u32   nir_padding1;   /* unused */
              __u64   nir_padding2;   /* unused */
              __u64   nir_padding3;   /* unused */
              __u64   nir_padding4;   /* unused */
      };
      

      We could replace nir_padding1 with nir_id_fs_end (or nir_id_count, whichever is more convenient for the implementation) which would optionally specify the number of IDs to map if it is non-zero.
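      For illustration only, a minimal sketch of that record using the nir_id_count variant (the field name and semantics are just a suggestion; nir_id_fs_end would work equally well):

      struct nodemap_id_rec {
              __u32   nir_id_fs;      /* first (or only) filesystem ID */
              __u32   nir_id_count;   /* if non-zero, number of IDs mapped
                                       * starting at nir_id_fs (replaces
                                       * nir_padding1) */
              __u64   nir_padding2;
              __u64   nir_padding3;
              __u64   nir_padding4;
      };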

      The best solution would be to be able to insert the range of mappings directly into the rbtree structure (which I believe can be done efficiently in an rbtree). In this case, if the ID is mapped to a range and this is returned from the rbtree lookup, then care would be needed to properly "offset" the returned value so that it properly maps the ID within the range.
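      As an illustration of that "offset" handling, here is a hedged sketch of the arithmetic for a lookup that returns a range record (the struct and function names are hypothetical, not the actual Lustre nodemap code):

      #include <linux/types.h>

      /* illustrative range record returned by an rbtree lookup */
      struct idmap_range {
              __u32   cl_start;       /* first client ID covered by the range */
              __u32   fs_start;       /* first filesystem ID it maps to */
              __u32   count;          /* number of consecutive IDs in the range */
      };

      static __u32 idmap_range_map(const struct idmap_range *r, __u32 client_id)
      {
              /* caller has already verified:
               * r->cl_start <= client_id < r->cl_start + r->count
               */
              return r->fs_start + (client_id - r->cl_start);
      }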

      This will involve code changes in several areas:

      • lctl nodemap_add_idmap and jt_nodemap_add_idmap() to allow specifying the *END values (and/or COUNT) and pass this on to the kernel
      • the interface to the kernel that sets the idmap looks like a string-based lustre_cfg, so it should be possible to pass the range directly through to the kernel. I see that the "CLID:FSID" string is passed straight through to nodemap_parse_idmap() in the kernel, which just parses two numbers separated by ':'. It is a tossup whether it is more expedient to pass the range values directly as "CLSTART-CLEND:FSSTART[-FSEND]" or as a count "CLSTART:FSSTART+COUNT" (less convenient for users, but less prone to errors IMHO, and either could be generated internally by jt_nodemap_add_idmap(); see the parsing sketch after this list). Either format should cause kstrtoul() to fail ID parsing on older modules, which is probably desirable so that an ID range is not accidentally treated as a single value.
      • the code in cfg_nodemap_cmd() and nodemap_parse_idmap() that processes the configuration record would need to handle the extra range values, maybe just growing the idmap[] array to three elements so that idmap[2] = COUNT can be calculated and passed back to the caller?
      • the rbtree management for the range of IDs is probably the most complex part. This is pretty straightforward (if least efficient) if nodemap_add_idmap() just loops and inserts idmap[2]=COUNT separate records into the rbtree. This could be done for an initial implementation to avoid the need for more extensive changes elsewhere in the ID mapping code. The changes become more complex if the ID range is inserted into the rbtree directly, but that could be done as a separate patch once the initial parsing/passing/processing code is working properly.
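      As a rough illustration of the userspace side, here is a hedged sketch (not the actual lctl code) of how jt_nodemap_add_idmap() could normalize either "CLSTART-CLEND:FSSTART[-FSEND]" or "CLSTART:FSSTART+COUNT" into the three values passed on to the kernel, i.e. idmap[0] = client start, idmap[1] = fs start, idmap[2] = count:

      #include <errno.h>
      #include <stdlib.h>

      static int parse_idmap_range(const char *str, unsigned long idmap[3])
      {
              char *end;

              idmap[0] = strtoul(str, &end, 10);              /* CLSTART */
              if (*end == '-')                                /* optional CLEND */
                      idmap[2] = strtoul(end + 1, &end, 10) - idmap[0] + 1;
              else
                      idmap[2] = 1;
              if (*end++ != ':')
                      return -EINVAL;
              idmap[1] = strtoul(end, &end, 10);              /* FSSTART */
              if (*end == '+')                                /* explicit COUNT */
                      idmap[2] = strtoul(end + 1, &end, 10);
              else if (*end == '-')                           /* redundant FSEND */
                      idmap[2] = strtoul(end + 1, &end, 10) - idmap[1] + 1;
              /* a real implementation would also validate that CLEND >= CLSTART
               * and cross-check FSEND/COUNT against the CLID range for consistency
               */
              return *end == '\0' ? 0 : -EINVAL;
      }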


          Activity

            [LU-17922] nodemap should allow declaring ID ranges

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56551/
            Subject: LU-17922 tests: skip sanity-sec/27aa for old MDS
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 50b779a87bee4ae6456d2de4a8668efe3a09b1b2

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56551/ Subject: LU-17922 tests: skip sanity-sec/27aa for old MDS Project: fs/lustre-release Branch: master Current Patch Set: Commit: 50b779a87bee4ae6456d2de4a8668efe3a09b1b2

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56551
            Subject: LU-17922 tests: skip sanity-sec/27aa for old MDS
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 48f975bf0e368391fe3e934afd68a9dc819d3770

            gerrit Gerrit Updater added a comment - "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56551 Subject: LU-17922 tests: skip sanity-sec/27aa for old MDS Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 48f975bf0e368391fe3e934afd68a9dc819d3770
            pjones Peter Jones added a comment -

            Merged for 2.16


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/55502/
            Subject: LU-17922 utils: added idmap range functionality
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 8445f7b92f067346919069bcf9a268d41c52c5b3

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/55502/ Subject: LU-17922 utils: added idmap range functionality Project: fs/lustre-release Branch: master Current Patch Set: Commit: 8445f7b92f067346919069bcf9a268d41c52c5b3

            "Maximilian Dilger <mdilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55502
            Subject: LU-17922 utils: added idmap range functionality
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 427297f4371dbe402bf3196486e161f27c98da13

            gerrit Gerrit Updater added a comment - "Maximilian Dilger <mdilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55502 Subject: LU-17922 utils: added idmap range functionality Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 427297f4371dbe402bf3196486e161f27c98da13

            sthiell Stephane Thiell added a comment -

            @adilger yes, I think this could be quite convenient!

            Example 1: For GID mapping from Oak (storage system) to Sherlock (cluster), with map_mode=gid, we add 1000000 to each GID to avoid any collision with already existing GIDs on the cluster. It's a convention we adopted between Oak and Sherlock.
            Currently, the mapping looks like this:

            idtype: gid, client_id: 1014600, fs_id: 14600
            idtype: gid, client_id: 1014622, fs_id: 14622 
            idtype: gid, client_id: 1014633, fs_id: 14633
            

            We have 378 GIDs mapped like this. We could definitely replace these with a single ID range map.

            Example 2: For another cluster that is root squashed (admin_nodemap=0), we do a GID-to-same-GID mapping (with map_mode=gid):

            idtype: gid, client_id: 13734, fs_id: 13736 
            idtype: gid, client_id: 13765, fs_id: 13765 
            idtype: gid, client_id: 13766, fs_id: 13766 
            

            We have 684 GIDs mapped like this. Those are the only allowed GIDs. We would have to double-check whether replacing these with a single ID mapping range would be OK in terms of security, but I think it possibly could be.

            We used to have another cluster with GID mapping to "random" cluster GIDs: there, ID ranges would not have been useful, but we have stopped doing that as of today.


            sebastien Sebastien Buisson added a comment -

            Hi Andreas,

            This can certainly be useful for some workflows. In multi-tenancy implementations, what I have seen most of the time is either a squashed uid/gid for all users in a tenant, or 'trusted' set to 1 so that uids/gids on clients are kept untouched. But in case of massive uid/gid collisions, it would make sense to 'mass-map' uids/gids.

            I agree the way we handle the id ranges internally should be 'independent' of the user interface(s) we provide to define them. I would tend to prefer the "CLID:FSID+COUNT" variant; it looks simpler at first glance.
            To store the COUNT, we could convert nir_padding1 to nir_id_count in struct nodemap_id_rec. But then I guess we would need a flag or equivalent to tell whether this field is valid. Past experience with nodemap code modifications has shown that unused/padding fields are not necessarily stored on disk as 0s, so we might interpret former unused values incorrectly. I think we could introduce new types in enum nodemap_idx_type:

            enum nodemap_idx_type {
            	NODEMAP_EMPTY_IDX = 0,		/* index created with blank record */
            	NODEMAP_CLUSTER_IDX = 1,	/* a nodemap cluster of nodes */
            	NODEMAP_RANGE_IDX = 2,		/* nid range assigned to a nm cluster */
            	NODEMAP_UIDMAP_IDX = 3,		/* uid map assigned to a nm cluster */
            	NODEMAP_GIDMAP_IDX = 4,		/* gid map assigned to a nm cluster */
            	NODEMAP_PROJIDMAP_IDX = 5,	/* projid map assigned to nm cluster */
            	NODEMAP_NID_MASK_IDX = 6,	/* large NID setup for a nm cluster */
            	NODEMAP_GLOBAL_IDX = 15,	/* stores nodemap activation status */
            };
            

             e.g. NODEMAP_UIDMAP_RANGE_IDX and NODEMAP_GIDMAP_RANGE_IDX (from that perspective, I think NODEMAP_RANGE_IDX should be renamed to NODEMAP_NIDRANGE_IDX). Depending on this type, we could decide to pass nir_id_count to nodemap_add_idmap_helper from nodemap_process_keyrec, and to nodemap_idx_idmap_add from nodemap_add_idmap.
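             For illustration, the extended enum might look like the sketch below (the new names and numeric values are placeholders, and a projid range type is added by analogy):

             enum nodemap_idx_type {
             	NODEMAP_EMPTY_IDX = 0,		/* index created with blank record */
             	NODEMAP_CLUSTER_IDX = 1,	/* a nodemap cluster of nodes */
             	NODEMAP_NIDRANGE_IDX = 2,	/* nid range (was NODEMAP_RANGE_IDX) */
             	NODEMAP_UIDMAP_IDX = 3,		/* single uid map */
             	NODEMAP_GIDMAP_IDX = 4,		/* single gid map */
             	NODEMAP_PROJIDMAP_IDX = 5,	/* single projid map */
             	NODEMAP_NID_MASK_IDX = 6,	/* large NID setup for a nm cluster */
             	NODEMAP_UIDMAP_RANGE_IDX = 7,	/* uid range map (new) */
             	NODEMAP_GIDMAP_RANGE_IDX = 8,	/* gid range map (new) */
             	NODEMAP_PROJIDMAP_RANGE_IDX = 9,/* projid range map (new) */
             	NODEMAP_GLOBAL_IDX = 15,	/* stores nodemap activation status */
             };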

            I agree it would make sense to propose a first round of implementation that just loops on inserting new nodes individually in the rb trees in case we are handling an id map range. That would reduce initial patches' complexity (for instance in case a single idmap already exists in the middle of the range), and maybe there is not much we can do about mass rb tree insert?
            I am not a specialist of the rb tree API in the kernel, but it does not seem to provide any facility for adding a range or group of nodes at once. Moreover, rb_link_node/rb_insert_color are supposed to be called right after each node insertion into the rb tree, otherwise it can result in an incorrect and unbalanced tree structure, breaking the properties required for efficient rb tree operations.
            Note that in the context of id map ranges, the nodes we are inserting are already sorted (i.e. from the first id in the range to the last one, in ascending order). This can reduce the number of rotations required when calling rb_link_node/rb_insert_color after each node insert.
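             A minimal sketch of such a per-ID insertion loop using the standard kernel rbtree API is shown below; the struct and field names are illustrative, not the actual Lustre idmap structures:

             #include <linux/errno.h>
             #include <linux/rbtree.h>
             #include <linux/slab.h>
             #include <linux/types.h>

             /* hypothetical client->fs mapping node */
             struct idmap_node {
             	u32		client_id;
             	u32		fs_id;
             	struct rb_node	rb;
             };

             static int idmap_insert_range(struct rb_root *root, u32 cl_start,
             			      u32 fs_start, u32 count)
             {
             	u32 i;

             	for (i = 0; i < count; i++) {
             		struct rb_node **link = &root->rb_node, *parent = NULL;
             		struct idmap_node *new, *cur;

             		new = kzalloc(sizeof(*new), GFP_KERNEL);
             		if (!new)
             			return -ENOMEM;
             		new->client_id = cl_start + i;
             		new->fs_id = fs_start + i;

             		/* descend to the insertion point for this client ID */
             		while (*link) {
             			parent = *link;
             			cur = rb_entry(parent, struct idmap_node, rb);
             			if (new->client_id < cur->client_id)
             				link = &parent->rb_left;
             			else if (new->client_id > cur->client_id)
             				link = &parent->rb_right;
             			else
             				break;	/* ID already mapped */
             		}
             		if (*link) {		/* duplicate found, skip this ID */
             			kfree(new);
             			continue;
             		}
             		/* rb_insert_color() rebalances right after each link */
             		rb_link_node(&new->rb, parent, link);
             		rb_insert_color(&new->rb, root);
             	}
             	return 0;
             }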

            adilger Andreas Dilger added a comment - edited

            sthiell, sebastien, could you please review and comment on this proposal?

            sthiell, I know you are using GID mapping extensively in your filesystems, and I'm wondering if you are mapping large ranges of contiguous IDs and this proposed enhancement would be useful for you, or if the GIDs are all individually mapped because they do not form a contiguous range from source to destination? How many GIDs are being mapped individually in your system?

            I went back and forth on whether the better interface is to allow specifying a range of values (which I think is visually and operationally easier for the admin) or to specify the COUNT (either directly as part of the IDs or as a separate argument; either is less error prone than having to make the *END values align if they are not on nice even boundaries). I ended up thinking that both interfaces should be allowed at the command line for convenience to the admin. Internally, it should be possible to convert to a single format, and I don't think the complexity is significantly different from the implementation POV, but I think using "CLID:FSID+COUNT" internally is less likely to generate errors/bugs (only a single additional value to parse, and it fits the idmap[] array best). It is likely best to only handle this one format in the kernel to minimize the complexity and chance of bugs there.


            People

              Assignee: mdilger Max Dilger
              Reporter: adilger Andreas Dilger
              Votes: 0
              Watchers: 8
