Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-9490

MPI-IO Lustre ADIO driver gets Lustre layout parameters incorrectly

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.10.0
    • Lustre 2.10.0
    • None
    • 3
    • 9223372036854775807

    Description

      Looking at the Lustre ADIO driver, it appears that it is retrieving the layout parameters (stripe count, stripe size) incorrectly from the kernel. It is calling an ioctl() directly to the kernel, but then not checking that the returned layout magic is something that it understands before dereferencing the fields of struct lov_mds_md. For composite files that for not have struct lov_mds_md directly at the start of the layout, this will deteference random fields in the layout and return garbage to the library.

      Attachments

        Issue Links

          Activity

            [LU-9490] MPI-IO Lustre ADIO driver gets Lustre layout parameters incorrectly

            Niu, it looks like ll_dirstripe_verify.c is not handling non-v1/v3 layouts properly. It is probably just comparing "0 == 0" in this case. We should probably deprecate the use of llapi_file_get_stripe() via _attribute_((deprecated)) to start (compile time), then via a message printed out in 2.11.

            In the meantime, it probably makes sense to have it return something reasonable for PFL files when used, such as the plain layout of the last component (if file has no size) and the last init component if it has a size.

            Separately, Emoly, can you please work with Niu to create and push a patch for MPICH, as I believe you worked on this code originally:
            http://git.mpich.org/mpich.git/blob/HEAD:/src/mpi/romio/adio/ad_lustre/ad_lustre_open.c It should really be using the llapi_layout_* interface. I don't know if applications pass in a hint for the file size, but if they do it would make it easy to decide which component to use for the returned layout parameters.

            However, for existing applications that call the ioctl directly, we may still need to return something reasonable from the kernel, since otherwise every MPI-IO application running on a new version of Lustre will fail if it hasn't been patched. One option is to detect the magic value passed to the kernel LL_IOC_LOV_GETSTRIPE handling in lov_getstripe(), and if LOV_MAGIC_V1/V3 return the last init component, and only return the full composite layout if the caller passes in LOV_MAGIC_COMP_V1? This would add complexity to the kernel, but would avoid compatibility issues with older applications, at the expense of returning something useful to userspace rather than causing outright application failures.

            adilger Andreas Dilger added a comment - Niu, it looks like ll_dirstripe_verify.c is not handling non-v1/v3 layouts properly. It is probably just comparing "0 == 0" in this case. We should probably deprecate the use of llapi_file_get_stripe() via _ attribute _((deprecated)) to start (compile time), then via a message printed out in 2.11. In the meantime, it probably makes sense to have it return something reasonable for PFL files when used, such as the plain layout of the last component (if file has no size) and the last init component if it has a size. Separately, Emoly, can you please work with Niu to create and push a patch for MPICH, as I believe you worked on this code originally: http://git.mpich.org/mpich.git/blob/HEAD:/src/mpi/romio/adio/ad_lustre/ad_lustre_open.c It should really be using the llapi_layout_* interface. I don't know if applications pass in a hint for the file size, but if they do it would make it easy to decide which component to use for the returned layout parameters. However, for existing applications that call the ioctl directly, we may still need to return something reasonable from the kernel, since otherwise every MPI-IO application running on a new version of Lustre will fail if it hasn't been patched. One option is to detect the magic value passed to the kernel LL_IOC_LOV_GETSTRIPE handling in lov_getstripe() , and if LOV_MAGIC_V1/V3 return the last init component, and only return the full composite layout if the caller passes in LOV_MAGIC_COMP_V1? This would add complexity to the kernel, but would avoid compatibility issues with older applications, at the expense of returning something useful to userspace rather than causing outright application failures.

            This is the error we saw during the test shot:

            Assertion failed in file /notbackedup/users/ulib/mpt_base/mpich2/src/mpi/romio/adio/ad_cray/ad_cray_adio_open.c at line 474: _mpiio_o_lov_delay_create != 0

            simmonsja James A Simmons added a comment - This is the error we saw during the test shot: Assertion failed in file /notbackedup/users/ulib/mpt_base/mpich2/src/mpi/romio/adio/ad_cray/ad_cray_adio_open.c at line 474: _mpiio_o_lov_delay_create != 0

            set LOV_USER_MAGIC = LOV_USER_MAGIC_COMP_V1 in lustreapi and then detect this in the kernel LL_IOC_LOV_GETSTRIPE handling, and return a fake LOV_USER_MAGIC_V1 xattr to the caller. I don't think this will be very robust.

            Yes, I also don't think it will work well. The fake v1 xattr can't provide accurate stripe information for ADIO to leverage on, imaging FLR or other more complex layout introduced later, it'll be more messy with the fake v1. If some application cares about Lustre file layout, it has to understand the layout anyway.

            fix ADIO_Lustre to check the returned magic value and then interpret the returned composite xattr based on the magic
            fix ADIO_Lustre to use llapi_layout_get_by_path(), which creates a dependency on liblustreapi (not sure if this is an issue, is it always installed on Lustre clients?)

            I vote for using llapi_layout_xxx interfaces, but I don't know how it should be fixed, should we inform the maintainer to update the ADIO driver?

            It appears that we have similar problems in our own code, for example any program that calls llapi_file_get_stripe() will always return struct lov_user_md to the caller. For better or worse, this will result in lum->lmm_stripe_count, lum->lmm_stripe_size, and lum->lmm_stripe_offset being zero, since these fields overlap with lcm_padding2.

            I see only some places used llapi_file_get_stripe() (or call the ioctl) directly:

            • 'lfs find', 'lfs getstripe' calls the ioctl directly, but we've handled it properly.
            • 'lfs migrate' calls llapi_file_get_stripe() directly and uses result for v1/v3 only, not an error but can be improved.
            • ll_dirstripe_verify.c calls llapi_file_get_stripe() directly, and it's used in sanity.sh, that'll cause test failures when root default layout is set as composite, I think it should be handled by LU-9324. (or already being handled?)
            niu Niu Yawei (Inactive) added a comment - set LOV_USER_MAGIC = LOV_USER_MAGIC_COMP_V1 in lustreapi and then detect this in the kernel LL_IOC_LOV_GETSTRIPE handling, and return a fake LOV_USER_MAGIC_V1 xattr to the caller. I don't think this will be very robust. Yes, I also don't think it will work well. The fake v1 xattr can't provide accurate stripe information for ADIO to leverage on, imaging FLR or other more complex layout introduced later, it'll be more messy with the fake v1. If some application cares about Lustre file layout, it has to understand the layout anyway. fix ADIO_Lustre to check the returned magic value and then interpret the returned composite xattr based on the magic fix ADIO_Lustre to use llapi_layout_get_by_path(), which creates a dependency on liblustreapi (not sure if this is an issue, is it always installed on Lustre clients?) I vote for using llapi_layout_xxx interfaces, but I don't know how it should be fixed, should we inform the maintainer to update the ADIO driver? It appears that we have similar problems in our own code, for example any program that calls llapi_file_get_stripe() will always return struct lov_user_md to the caller. For better or worse, this will result in lum->lmm_stripe_count, lum->lmm_stripe_size, and lum->lmm_stripe_offset being zero, since these fields overlap with lcm_padding2. I see only some places used llapi_file_get_stripe() (or call the ioctl) directly: 'lfs find', 'lfs getstripe' calls the ioctl directly, but we've handled it properly. 'lfs migrate' calls llapi_file_get_stripe() directly and uses result for v1/v3 only, not an error but can be improved. ll_dirstripe_verify.c calls llapi_file_get_stripe() directly, and it's used in sanity.sh, that'll cause test failures when root default layout is set as composite, I think it should be handled by LU-9324 . (or already being handled?)

            James, could you please include the output from the Cray MPI-IO test run, since it was referencing a function that does not exist in the upstream MPICH code.

            adilger Andreas Dilger added a comment - James, could you please include the output from the Cray MPI-IO test run, since it was referencing a function that does not exist in the upstream MPICH code.

            Yep we hit this problem during our PFL test shot. Prevented us from running one of our apps.

            simmonsja James A Simmons added a comment - Yep we hit this problem during our PFL test shot. Prevented us from running one of our apps.

            It looks like lfs.c::lfs_migrate() is calling llapi_file_get_stripe() to get the stripe size, and this should be changed to use llapi_layout_get_by_path() and the stripe count extracted from the returned layout. One option is to set the component pointer to be the last initialized component, so that attributes returned are what the user would expect if they don't understand composite files.

            adilger Andreas Dilger added a comment - It looks like lfs.c::lfs_migrate() is calling llapi_file_get_stripe() to get the stripe size, and this should be changed to use llapi_layout_get_by_path() and the stripe count extracted from the returned layout. One option is to set the component pointer to be the last initialized component, so that attributes returned are what the user would expect if they don't understand composite files.
            pjones Peter Jones added a comment -

            Niu

            Could you please assist with this issue?

            Thanks

            Peter

            pjones Peter Jones added a comment - Niu Could you please assist with this issue? Thanks Peter

            In particular: https://trac.mpich.org/projects/mpich/browser/src/mpi/romio/adio/ad_lustre/ad_lustre_open.c?rev=f818af4a15b0a168ecab7f73baf0491c2baed26b

            void ADIOI_LUSTRE_Open(ADIO_File fd, int *error_code)
            {
            	:
            	:
            	/* get file striping information and set it in info */
            	lum.lmm_magic = LOV_USER_MAGIC;
            	err = ioctl(fd->fd_sys, LL_IOC_LOV_GETSTRIPE, (void *) &lum);
            
            	if (!err) {
            		sprintf(value, "%d", lum.lmm_stripe_size);
            		MPI_Info_set(fd->info, "striping_unit", value);
            
            		sprintf(value, "%d", lum.lmm_stripe_count);
            		MPI_Info_set(fd->info, "striping_factor", value);
            	
            		sprintf(value, "%d", lum.lmm_stripe_offset);
            		MPI_Info_set(fd->info, "start_iodevice", value);
            	}
            

            Some options that I see to handle this:

            • set LOV_USER_MAGIC = LOV_USER_MAGIC_COMP_V1 in lustreapi and then detect this in the kernel LL_IOC_LOV_GETSTRIPE handling, and return a fake LOV_USER_MAGIC_V1 xattr to the caller. I don't think this will be very robust.
            • fix ADIO_Lustre to check the returned magic value and then interpret the returned composite xattr based on the magic
            • fix ADIO_Lustre to use llapi_layout_get_by_path(), which creates a dependency on liblustreapi (not sure if this is an issue, is it always installed on Lustre clients?)

            It appears that we have similar problems in our own code, for example any program that calls llapi_file_get_stripe() will always return struct lov_user_md to the caller. For better or worse, this will result in lum->lmm_stripe_count, lum->lmm_stripe_size, and lum->lmm_stripe_offset being zero, since these fields overlap with lcm_padding2.

            adilger Andreas Dilger added a comment - In particular: https://trac.mpich.org/projects/mpich/browser/src/mpi/romio/adio/ad_lustre/ad_lustre_open.c?rev=f818af4a15b0a168ecab7f73baf0491c2baed26b void ADIOI_LUSTRE_Open(ADIO_File fd, int *error_code) { : : /* get file striping information and set it in info */ lum.lmm_magic = LOV_USER_MAGIC; err = ioctl(fd->fd_sys, LL_IOC_LOV_GETSTRIPE, (void *) &lum); if (!err) { sprintf(value, "%d" , lum.lmm_stripe_size); MPI_Info_set(fd->info, "striping_unit" , value); sprintf(value, "%d" , lum.lmm_stripe_count); MPI_Info_set(fd->info, "striping_factor" , value); sprintf(value, "%d" , lum.lmm_stripe_offset); MPI_Info_set(fd->info, "start_iodevice" , value); } Some options that I see to handle this: set LOV_USER_MAGIC = LOV_USER_MAGIC_COMP_V1 in lustreapi and then detect this in the kernel LL_IOC_LOV_GETSTRIPE handling, and return a fake LOV_USER_MAGIC_V1 xattr to the caller. I don't think this will be very robust. fix ADIO_Lustre to check the returned magic value and then interpret the returned composite xattr based on the magic fix ADIO_Lustre to use llapi_layout_get_by_path() , which creates a dependency on liblustreapi (not sure if this is an issue, is it always installed on Lustre clients?) It appears that we have similar problems in our own code, for example any program that calls llapi_file_get_stripe() will always return struct lov_user_md to the caller. For better or worse, this will result in lum->lmm_stripe_count , lum->lmm_stripe_size , and lum->lmm_stripe_offset being zero, since these fields overlap with lcm_padding2 .

            People

              emoly.liu Emoly Liu
              adilger Andreas Dilger
              Votes:
              0 Vote for this issue
              Watchers:
              12 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: