Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-9490

MPI-IO Lustre ADIO driver gets Lustre layout parameters incorrectly

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.10.0
    • Lustre 2.10.0
    • None
    • 3
    • 9223372036854775807

    Description

      Looking at the Lustre ADIO driver, it appears that it is retrieving the layout parameters (stripe count, stripe size) incorrectly from the kernel. It is calling an ioctl() directly to the kernel, but then not checking that the returned layout magic is something that it understands before dereferencing the fields of struct lov_mds_md. For composite files that for not have struct lov_mds_md directly at the start of the layout, this will deteference random fields in the layout and return garbage to the library.

      Attachments

        Issue Links

          Activity

            [LU-9490] MPI-IO Lustre ADIO driver gets Lustre layout parameters incorrectly

            Emoly, yes the fix should be submitted to the MPICH Git repo. The ADIO patches in lustre/contrib should be deleted.

            adilger Andreas Dilger added a comment - Emoly, yes the fix should be submitted to the MPICH Git repo. The ADIO patches in lustre/contrib should be deleted.
            emoly.liu Emoly Liu added a comment -

            adilger, I'm cloning MPICH git repo. Do you mean I need to create a patch there?
            BTW, the lustre ADIO driver patch in our repo (lustre/contrib/adio_driver_mpich2-1.0.7.patch) is too old, I remembered it was released by MPICH2 about 8 years ago. Should it be removed?

            emoly.liu Emoly Liu added a comment - adilger , I'm cloning MPICH git repo. Do you mean I need to create a patch there? BTW, the lustre ADIO driver patch in our repo (lustre/contrib/adio_driver_mpich2-1.0.7.patch) is too old, I remembered it was released by MPICH2 about 8 years ago. Should it be removed?
            emoly.liu Emoly Liu added a comment -

            I will look into this one.

            emoly.liu Emoly Liu added a comment - I will look into this one.

            Emoly to work on ADIO driver ww21

            eberglan Eric Bergland (Inactive) added a comment - Emoly to work on ADIO driver ww21

            We still need a fix for MPICH ADIO, otherwise it will never work properly with composite layouts. The current solution allows it to function, but it won't generate any optimal IO patterns.

            adilger Andreas Dilger added a comment - We still need a fix for MPICH ADIO, otherwise it will never work properly with composite layouts. The current solution allows it to function , but it won't generate any optimal IO patterns.

            Niu Yawei (yawei.niu@intel.com) uploaded a new patch: https://review.whamcloud.com/27183
            Subject: LU-9490 llite: return v1/v3 layout for legacy app
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 87da9d4b51f6874412c46ff130b36e938fa85fe4

            gerrit Gerrit Updater added a comment - Niu Yawei (yawei.niu@intel.com) uploaded a new patch: https://review.whamcloud.com/27183 Subject: LU-9490 llite: return v1/v3 layout for legacy app Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 87da9d4b51f6874412c46ff130b36e938fa85fe4

            Niu Yawei (yawei.niu@intel.com) uploaded a new patch: https://review.whamcloud.com/27180
            Subject: LU-9490 tests: ll_dirstripe_verify handles PFL layout
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: ecf66cbfaf2e5734a18098db77882a344c944f9f

            gerrit Gerrit Updater added a comment - Niu Yawei (yawei.niu@intel.com) uploaded a new patch: https://review.whamcloud.com/27180 Subject: LU-9490 tests: ll_dirstripe_verify handles PFL layout Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: ecf66cbfaf2e5734a18098db77882a344c944f9f

            What do you mean "via a message printed out in 2.11."?

            I mean that llapi_file_get_stripe() should start to print out a warning about calling this function in 2.11, and that they should change to using llapi_layout_get_by_path() instead.

            I'm wondering if it's better to return error instead of the last instantiated component? I suppose application usually calls getstripe on open/create file, so the first component will be returned in most cases, then optimizing based on first component's attributes will highly likely get opposite result.

            That's why I suggested to "have it return something reasonable for PFL files when used, such as the plain layout of the last component (if file has no size) and the last init component if it has a size." That would return the last component if the file was just created.

            Another alternative is to return the fields of interest in the current padding fields:

            struct lov_user_md_v1 {           /* LOV EA user data (host-endian) */
                    :
                    __u32 lmm_stripe_size;    /* size of stripe in bytes */
                    __u16 lmm_stripe_count;   /* num stripes in use for this object */
                    __u16 lmm_layout_gen;     /* layout generation number
                                               * used when reading */
            

            These map to the __u64 lcm_padding2 in struct lov_comp_md_v1, which is currently unused. If this ioctl() is called, it would be possible to extract those values from the proper component, temporarily store them into lov_user_md_v1, and return it to userspace. This will likely allow 99% of apps to continue using Lustre without any change.

            If course, Ideally we will get a proper fix into the hands of the users, but I think there also needs to be a workaround in place so that sites which don't apply the fix.

            adilger Andreas Dilger added a comment - What do you mean "via a message printed out in 2.11."? I mean that llapi_file_get_stripe() should start to print out a warning about calling this function in 2.11, and that they should change to using llapi_layout_get_by_path() instead. I'm wondering if it's better to return error instead of the last instantiated component? I suppose application usually calls getstripe on open/create file, so the first component will be returned in most cases, then optimizing based on first component's attributes will highly likely get opposite result. That's why I suggested to "have it return something reasonable for PFL files when used, such as the plain layout of the last component (if file has no size) and the last init component if it has a size." That would return the last component if the file was just created. Another alternative is to return the fields of interest in the current padding fields: struct lov_user_md_v1 { /* LOV EA user data (host-endian) */ : __u32 lmm_stripe_size; /* size of stripe in bytes */ __u16 lmm_stripe_count; /* num stripes in use for this object */ __u16 lmm_layout_gen; /* layout generation number * used when reading */ These map to the __u64 lcm_padding2 in struct lov_comp_md_v1 , which is currently unused. If this ioctl() is called, it would be possible to extract those values from the proper component, temporarily store them into lov_user_md_v1 , and return it to userspace. This will likely allow 99% of apps to continue using Lustre without any change. If course, Ideally we will get a proper fix into the hands of the users, but I think there also needs to be a workaround in place so that sites which don't apply the fix.

            Niu, it looks like ll_dirstripe_verify.c is not handling non-v1/v3 layouts properly. It is probably just comparing "0 == 0" in this case. We should probably deprecate the use of llapi_file_get_stripe() via attribute((deprecated)) to start (compile time), then via a message printed out in 2.11.
            In the meantime, it probably makes sense to have it return something reasonable for PFL files when used, such as the plain layout of the last component (if file has no size) and the last init component if it has a size.

            Ok, sure. What do you mean "via a message printed out in 2.11."?

            However, for existing applications that call the ioctl directly, we may still need to return something reasonable from the kernel, since otherwise every MPI-IO application running on a new version of Lustre will fail if it hasn't been patched. One option is to detect the magic value passed to the kernel LL_IOC_LOV_GETSTRIPE handling in lov_getstripe(), and if LOV_MAGIC_V1/V3 return the last init component, and only return the full composite layout if the caller passes in LOV_MAGIC_COMP_V1? This would add complexity to the kernel, but would avoid compatibility issues with older applications, at the expense of returning something useful to userspace rather than causing outright application failures.

            I'm wondering if it's better to return error instead of the last instantiated component?

            I suppose application usually calls getstripe on open/create file, so the first component will be returned in most cases, then optimizing based on first component's attributes will highly likely get opposite result. If we return an error instead of last instantiated component, app may either fallback to non-optimizing default setting (which is better than optimizing on first component), or error out (so that user will know it's time to upgrade).

            BTW, except ADIO, I didn't see other callers passing lmm magic to LL_IOC_LOV_GETSTRIPE ioctl, so this workaround only works for the apps which providing lmm magic.

            niu Niu Yawei (Inactive) added a comment - Niu, it looks like ll_dirstripe_verify.c is not handling non-v1/v3 layouts properly. It is probably just comparing "0 == 0" in this case. We should probably deprecate the use of llapi_file_get_stripe() via attribute ((deprecated)) to start (compile time), then via a message printed out in 2.11. In the meantime, it probably makes sense to have it return something reasonable for PFL files when used, such as the plain layout of the last component (if file has no size) and the last init component if it has a size. Ok, sure. What do you mean "via a message printed out in 2.11."? However, for existing applications that call the ioctl directly, we may still need to return something reasonable from the kernel, since otherwise every MPI-IO application running on a new version of Lustre will fail if it hasn't been patched. One option is to detect the magic value passed to the kernel LL_IOC_LOV_GETSTRIPE handling in lov_getstripe(), and if LOV_MAGIC_V1/V3 return the last init component, and only return the full composite layout if the caller passes in LOV_MAGIC_COMP_V1? This would add complexity to the kernel, but would avoid compatibility issues with older applications, at the expense of returning something useful to userspace rather than causing outright application failures. I'm wondering if it's better to return error instead of the last instantiated component? I suppose application usually calls getstripe on open/create file, so the first component will be returned in most cases, then optimizing based on first component's attributes will highly likely get opposite result. If we return an error instead of last instantiated component, app may either fallback to non-optimizing default setting (which is better than optimizing on first component), or error out (so that user will know it's time to upgrade). BTW, except ADIO, I didn't see other callers passing lmm magic to LL_IOC_LOV_GETSTRIPE ioctl, so this workaround only works for the apps which providing lmm magic.

            Niu, it looks like ll_dirstripe_verify.c is not handling non-v1/v3 layouts properly. It is probably just comparing "0 == 0" in this case. We should probably deprecate the use of llapi_file_get_stripe() via _attribute_((deprecated)) to start (compile time), then via a message printed out in 2.11.

            In the meantime, it probably makes sense to have it return something reasonable for PFL files when used, such as the plain layout of the last component (if file has no size) and the last init component if it has a size.

            Separately, Emoly, can you please work with Niu to create and push a patch for MPICH, as I believe you worked on this code originally:
            http://git.mpich.org/mpich.git/blob/HEAD:/src/mpi/romio/adio/ad_lustre/ad_lustre_open.c It should really be using the llapi_layout_* interface. I don't know if applications pass in a hint for the file size, but if they do it would make it easy to decide which component to use for the returned layout parameters.

            However, for existing applications that call the ioctl directly, we may still need to return something reasonable from the kernel, since otherwise every MPI-IO application running on a new version of Lustre will fail if it hasn't been patched. One option is to detect the magic value passed to the kernel LL_IOC_LOV_GETSTRIPE handling in lov_getstripe(), and if LOV_MAGIC_V1/V3 return the last init component, and only return the full composite layout if the caller passes in LOV_MAGIC_COMP_V1? This would add complexity to the kernel, but would avoid compatibility issues with older applications, at the expense of returning something useful to userspace rather than causing outright application failures.

            adilger Andreas Dilger added a comment - Niu, it looks like ll_dirstripe_verify.c is not handling non-v1/v3 layouts properly. It is probably just comparing "0 == 0" in this case. We should probably deprecate the use of llapi_file_get_stripe() via _ attribute _((deprecated)) to start (compile time), then via a message printed out in 2.11. In the meantime, it probably makes sense to have it return something reasonable for PFL files when used, such as the plain layout of the last component (if file has no size) and the last init component if it has a size. Separately, Emoly, can you please work with Niu to create and push a patch for MPICH, as I believe you worked on this code originally: http://git.mpich.org/mpich.git/blob/HEAD:/src/mpi/romio/adio/ad_lustre/ad_lustre_open.c It should really be using the llapi_layout_* interface. I don't know if applications pass in a hint for the file size, but if they do it would make it easy to decide which component to use for the returned layout parameters. However, for existing applications that call the ioctl directly, we may still need to return something reasonable from the kernel, since otherwise every MPI-IO application running on a new version of Lustre will fail if it hasn't been patched. One option is to detect the magic value passed to the kernel LL_IOC_LOV_GETSTRIPE handling in lov_getstripe() , and if LOV_MAGIC_V1/V3 return the last init component, and only return the full composite layout if the caller passes in LOV_MAGIC_COMP_V1? This would add complexity to the kernel, but would avoid compatibility issues with older applications, at the expense of returning something useful to userspace rather than causing outright application failures.

            This is the error we saw during the test shot:

            Assertion failed in file /notbackedup/users/ulib/mpt_base/mpich2/src/mpi/romio/adio/ad_cray/ad_cray_adio_open.c at line 474: _mpiio_o_lov_delay_create != 0

            simmonsja James A Simmons added a comment - This is the error we saw during the test shot: Assertion failed in file /notbackedup/users/ulib/mpt_base/mpich2/src/mpi/romio/adio/ad_cray/ad_cray_adio_open.c at line 474: _mpiio_o_lov_delay_create != 0

            People

              emoly.liu Emoly Liu
              adilger Andreas Dilger
              Votes:
              0 Vote for this issue
              Watchers:
              12 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: