Lustre / LU-16823

add LNet and OBD connect flags for IPv6 peers

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0

    Description

      When nodes connect to peers, the sender should set a bit in the connection indicating that it supports IPv6 (large) NIDs. This would inform LNet discovery whether it is safe to reply with large NIDs in connection or ping replies.

      At the Lustre level, new clients should set an OBD_CONNECT2_LARGE_NID = 0x100000000ULL flag in obd_connect_data so that the MGS knows whether it can safely reply with large NIDs in mgs_nidtbl_entry. That avoids the need to backport a patch (à la LU-13306) to allow old clients to mount a server with both IPv4 and IPv6 NIDs configured.
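
      A minimal sketch of the client side, assuming the existing ocd_connect_flags2 field and the OBD_CONNECT_FLAGS2 negotiation bit (the helper name below is made up, not the landed patch):

          /* new bit in the second connect-flags word */
          #define OBD_CONNECT2_LARGE_NID 0x100000000ULL /* understands IPv6 (large) NIDs */

          /* fill obd_connect_data before connecting to the MGS */
          static void client_advertise_large_nid(struct obd_connect_data *data)
          {
                  /* flags2 is only interpreted when OBD_CONNECT_FLAGS2 is set */
                  data->ocd_connect_flags |= OBD_CONNECT_FLAGS2;
                  data->ocd_connect_flags2 |= OBD_CONNECT2_LARGE_NID;
          }

      The intent is that an old MGS masks the unknown bit out of its connect reply, so large NIDs are only exchanged when the bit survives in both directions.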

    Activity


            simmonsja James A Simmons added a comment -

            Work is complete

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53398/
            Subject: LU-16823 lustre: test if large nid is support
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 165cf78ab54e6e8d172f999940c62afabc043cd5

            gerrit Gerrit Updater added a comment -

            "James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53398
            Subject: LU-16823 lustre: test if large nid is support
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 64bc4ffbff2c8a6927f8c9474887a32ec528e1b9

            simmonsja James A Simmons added a comment -

            The tendency is for IPv6 addresses to be very random within a network, so we need to think about a solution for this. Looking at lmv_setup(), it processes an lcfg that contains an lmv_desc. Perhaps in mgs_llog.c we could create a record for lmv_desc with new info for the index to be used?

            adilger Andreas Dilger added a comment - edited

            James, are you referring to this hunk of code to initialize the lmv_qos_rr_index starting value:

                    /*
                     * initialize rr_index to lower 32bit of netid, so that client
                     * can distribute subdirs evenly from the beginning.
                     */
                    while (LNetGetId(i++, &lnet_id, false) != -ENOENT) {
                            if (!nid_is_lo0(&lnet_id.nid)) {
                                    lmv->lmv_qos_rr_index = ntohl(lnet_id.nid.nid_addr[0]);
                                    break;
                            }
                    }
            

            That code doesn't really need to have the full NID. The main goal is that each client is initialized in some way to a well-balanced starting value (instead of 0) so that clients don't all start creating subdirectories on MDT0000 and go in lockstep across MDTs. Using the NID is deterministic and likely gives us a more uniform distribution compared to a purely random number, because it normally only changes the low bits among nearby clients in a single job. It could use any other kind of value that is slowly changing for each client, but it would need to be available early during the mount process.

            I don't know if IPv6 NIDs have the same property or not, or if they have too much "random" stuff in them that they may be wildly imbalanced?

            We could potentially have the MDS assign each client a sequential "client number" for this purpose in the mount reply, but that might be imbalanced for clients that are allocated into the same job because it would only "globally" be uniform (though still better than a purely random number).

            In any case, the lmv_qos_rr_index doesn't have to be perfect, as it drifts over time anyway, but it avoids the "thundering herd" problem at initial mount time.

            A similar solution is needed for lmv_select_statfs_mdt() to select an MDT to send MDS_STATFS RPCs to:

                    /* choose initial MDT for this client */
                    for (i = 0;; i++) {
                            struct lnet_processid lnet_id;
                            if (LNetGetId(i, &lnet_id, false) == -ENOENT)
                                    break;
                                    
                            if (!nid_is_lo0(&lnet_id.nid)) {
                                    /* We dont need a full 64-bit modulus, just enough
                                     * to distribute the requests across MDTs evenly.
                                     */
                                    lmv->lmv_statfs_start = nidhash(&lnet_id.nid) %
                                                            lmv->lmv_mdt_count;
                                    break;
                            }     
                    }
            

            and it probably makes sense that these both use the same mechanism instead of fetching the NIDs each time.
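
            For illustration, a minimal sketch of such a shared mechanism, reusing the LNetGetId()/nid_is_lo0()/nidhash() calls quoted above (the helper name lmv_nid_seed() is made up, and the third LNetGetId() argument is assumed to control whether large NIDs are also returned):

                /* Derive one 32-bit seed per client from the first non-loopback
                 * local NID.  nidhash() covers the whole NID, so 4-byte IPv4
                 * and 16-byte IPv6 addresses are handled the same way.
                 */
                static u32 lmv_nid_seed(void)
                {
                        struct lnet_processid lnet_id;
                        int i = 0;

                        while (LNetGetId(i++, &lnet_id, true) != -ENOENT) {
                                if (!nid_is_lo0(&lnet_id.nid))
                                        return nidhash(&lnet_id.nid);
                        }

                        return 0;
                }

            The seed could then be computed once and cached, with lmv_qos_rr_index taking it directly and lmv_statfs_start taking it modulo lmv_mdt_count, so the IPv4 and IPv6 cases are initialized the same way.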


            simmonsja James A Simmons added a comment -

            I was looking at adding the final touches for this work and I found one place where it's not an easy replacement. For lmv_setup() we call LNetGetId(), which only gets small-size NIDs. I don't see an easy way to get the connect flag here. Any suggestions? Should we move to another setup function that has an export as a parameter?

            adilger Andreas Dilger added a comment -

            James, adding the handling for the OBD_CONNECT2_LARGE_NID feature is relatively straightforward:

            • the 51108 patch has already handled the mechanics of adding a new connect flag (definition, wiretest/wirecheck, obd_connect_names[], etc.)

            When the client and MGS support for handling large NIDs is finished, a patch should be pushed that:

            • adds this flag to data->ocd_connect_flags2 on the client when they are connecting to the MGS (and possibly MDS and OSS, not sure)
            • adds this flag to MGS_CONNECT_SUPPORTED2 (and MDS_CONNECT_SUPPORTED2 and OSS_CONNECT_SUPPORTED2 if needed)
            • the old MGS will mask off OBD_CONNECT2_LARGE_NID in the reply to new clients, because it is not in the old MGS_CONNECT_SUPPORTED2
            • the new MGS will reply with OBD_CONNECT2_LARGE_NID to clients that send it
            • the MGS can check this on client exports to determine if they support large NIDs, as needed
            • clients can check this to determine if the MGS supports large NIDs, as needed

            We don't want to land the "supported" patch until the code is (substantially) working, otherwise a client/server might advertise their support for this feature, but not actually work yet.
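
            A rough sketch of the client and MGS pieces from the list above, assuming exp_connect_flags2() remains the accessor for the negotiated second flags word (the helper names and call sites below are illustrative, not the landed patch):

                /* client side: advertise the feature when building the MGS
                 * connect request (and MDS/OSS requests if needed)
                 */
                static void mgc_set_large_nid_flag(struct obd_connect_data *data)
                {
                        data->ocd_connect_flags2 |= OBD_CONNECT2_LARGE_NID;
                }

                /* MGS side: once OBD_CONNECT2_LARGE_NID is also added to
                 * MGS_CONNECT_SUPPORTED2, the bit survives the reply masking
                 * and the export records whether this client may be sent
                 * large NIDs in mgs_nidtbl_entry.
                 */
                static bool exp_supports_large_nid(struct obd_export *exp)
                {
                        return !!(exp_connect_flags2(exp) & OBD_CONNECT2_LARGE_NID);
                }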


            simmonsja James A Simmons added a comment -

            While 51108 landed, we do need one more patch here to communicate that the MGS supports large NIDs.

            simmonsja James A Simmons added a comment -

            https://review.whamcloud.com/#/c/fs/lustre-release/+/51108 covers the goal of this ticket.

            adilger Andreas Dilger added a comment -

            I haven't looked into all of the details here, but this would potentially allow a better alternative to patching old clients to ignore large NIDs.

            People

              Assignee: simmonsja James A Simmons
              Reporter: adilger Andreas Dilger
              Votes: 0
              Watchers: 4
