[LU-9490] MPI-IO Lustre ADIO driver gets Lustre layout parameters incorrectly Created: 11/May/17 Updated: 12/Nov/17 Resolved: 07/Jun/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0 |
| Fix Version/s: | Lustre 2.10.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Andreas Dilger | Assignee: | Emoly Liu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||||||||||
| Severity: | 3 | ||||||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||||||
| Description |
|
Looking at the Lustre ADIO driver, it appears that it is retrieving the layout parameters (stripe count, stripe size) incorrectly from the kernel. It is calling an ioctl() directly to the kernel, but then not checking that the returned layout magic is something that it understands before dereferencing the fields of struct lov_mds_md. For composite files that for not have struct lov_mds_md directly at the start of the layout, this will deteference random fields in the layout and return garbage to the library. |
| Comments |
| Comment by Andreas Dilger [ 11/May/17 ] |
void ADIOI_LUSTRE_Open(ADIO_File fd, int *error_code) { : : /* get file striping information and set it in info */ lum.lmm_magic = LOV_USER_MAGIC; err = ioctl(fd->fd_sys, LL_IOC_LOV_GETSTRIPE, (void *) &lum); if (!err) { sprintf(value, "%d", lum.lmm_stripe_size); MPI_Info_set(fd->info, "striping_unit", value); sprintf(value, "%d", lum.lmm_stripe_count); MPI_Info_set(fd->info, "striping_factor", value); sprintf(value, "%d", lum.lmm_stripe_offset); MPI_Info_set(fd->info, "start_iodevice", value); } Some options that I see to handle this:
It appears that we have similar problems in our own code, for example any program that calls llapi_file_get_stripe() will always return struct lov_user_md to the caller. For better or worse, this will result in lum->lmm_stripe_count, lum->lmm_stripe_size, and lum->lmm_stripe_offset being zero, since these fields overlap with lcm_padding2. |
| Comment by Peter Jones [ 11/May/17 ] |
|
Niu Could you please assist with this issue? Thanks Peter |
| Comment by Andreas Dilger [ 11/May/17 ] |
|
It looks like lfs.c::lfs_migrate() is calling llapi_file_get_stripe() to get the stripe size, and this should be changed to use llapi_layout_get_by_path() and the stripe count extracted from the returned layout. One option is to set the component pointer to be the last initialized component, so that attributes returned are what the user would expect if they don't understand composite files. |
| Comment by James A Simmons [ 11/May/17 ] |
|
Yep we hit this problem during our PFL test shot. Prevented us from running one of our apps. |
| Comment by Andreas Dilger [ 11/May/17 ] |
|
James, could you please include the output from the Cray MPI-IO test run, since it was referencing a function that does not exist in the upstream MPICH code. |
| Comment by Niu Yawei (Inactive) [ 12/May/17 ] |
Yes, I also don't think it will work well. The fake v1 xattr can't provide accurate stripe information for ADIO to leverage on, imaging FLR or other more complex layout introduced later, it'll be more messy with the fake v1. If some application cares about Lustre file layout, it has to understand the layout anyway.
I vote for using llapi_layout_xxx interfaces, but I don't know how it should be fixed, should we inform the maintainer to update the ADIO driver?
I see only some places used llapi_file_get_stripe() (or call the ioctl) directly:
|
| Comment by James A Simmons [ 12/May/17 ] |
|
This is the error we saw during the test shot: Assertion failed in file /notbackedup/users/ulib/mpt_base/mpich2/src/mpi/romio/adio/ad_cray/ad_cray_adio_open.c at line 474: _mpiio_o_lov_delay_create != 0 |
| Comment by Andreas Dilger [ 16/May/17 ] |
|
Niu, it looks like ll_dirstripe_verify.c is not handling non-v1/v3 layouts properly. It is probably just comparing "0 == 0" in this case. We should probably deprecate the use of llapi_file_get_stripe() via _attribute_((deprecated)) to start (compile time), then via a message printed out in 2.11. In the meantime, it probably makes sense to have it return something reasonable for PFL files when used, such as the plain layout of the last component (if file has no size) and the last init component if it has a size. Separately, Emoly, can you please work with Niu to create and push a patch for MPICH, as I believe you worked on this code originally: However, for existing applications that call the ioctl directly, we may still need to return something reasonable from the kernel, since otherwise every MPI-IO application running on a new version of Lustre will fail if it hasn't been patched. One option is to detect the magic value passed to the kernel LL_IOC_LOV_GETSTRIPE handling in lov_getstripe(), and if LOV_MAGIC_V1/V3 return the last init component, and only return the full composite layout if the caller passes in LOV_MAGIC_COMP_V1? This would add complexity to the kernel, but would avoid compatibility issues with older applications, at the expense of returning something useful to userspace rather than causing outright application failures. |
| Comment by Niu Yawei (Inactive) [ 16/May/17 ] |
Ok, sure. What do you mean "via a message printed out in 2.11."?
I'm wondering if it's better to return error instead of the last instantiated component? I suppose application usually calls getstripe on open/create file, so the first component will be returned in most cases, then optimizing based on first component's attributes will highly likely get opposite result. If we return an error instead of last instantiated component, app may either fallback to non-optimizing default setting (which is better than optimizing on first component), or error out (so that user will know it's time to upgrade). BTW, except ADIO, I didn't see other callers passing lmm magic to LL_IOC_LOV_GETSTRIPE ioctl, so this workaround only works for the apps which providing lmm magic. |
| Comment by Andreas Dilger [ 17/May/17 ] |
I mean that llapi_file_get_stripe() should start to print out a warning about calling this function in 2.11, and that they should change to using llapi_layout_get_by_path() instead.
That's why I suggested to "have it return something reasonable for PFL files when used, such as the plain layout of the last component (if file has no size) and the last init component if it has a size." That would return the last component if the file was just created. Another alternative is to return the fields of interest in the current padding fields: struct lov_user_md_v1 { /* LOV EA user data (host-endian) */
:
__u32 lmm_stripe_size; /* size of stripe in bytes */
__u16 lmm_stripe_count; /* num stripes in use for this object */
__u16 lmm_layout_gen; /* layout generation number
* used when reading */
These map to the __u64 lcm_padding2 in struct lov_comp_md_v1, which is currently unused. If this ioctl() is called, it would be possible to extract those values from the proper component, temporarily store them into lov_user_md_v1, and return it to userspace. This will likely allow 99% of apps to continue using Lustre without any change. If course, Ideally we will get a proper fix into the hands of the users, but I think there also needs to be a workaround in place so that sites which don't apply the fix. |
| Comment by Gerrit Updater [ 18/May/17 ] |
|
Niu Yawei (yawei.niu@intel.com) uploaded a new patch: https://review.whamcloud.com/27180 |
| Comment by Gerrit Updater [ 18/May/17 ] |
|
Niu Yawei (yawei.niu@intel.com) uploaded a new patch: https://review.whamcloud.com/27183 |
| Comment by Andreas Dilger [ 18/May/17 ] |
|
We still need a fix for MPICH ADIO, otherwise it will never work properly with composite layouts. The current solution allows it to function, but it won't generate any optimal IO patterns. |
| Comment by Eric Bergland (Inactive) [ 22/May/17 ] |
|
Emoly to work on ADIO driver ww21 |
| Comment by Emoly Liu [ 23/May/17 ] |
|
I will look into this one. |
| Comment by Emoly Liu [ 23/May/17 ] |
|
adilger, I'm cloning MPICH git repo. Do you mean I need to create a patch there? |
| Comment by Andreas Dilger [ 23/May/17 ] |
|
Emoly, yes the fix should be submitted to the MPICH Git repo. The ADIO patches in lustre/contrib should be deleted. |
| Comment by Emoly Liu [ 26/May/17 ] |
|
Discussed with Niu and worked on how to apply each component stripe information to the I/O pattern redistribution. |
| Comment by Gerrit Updater [ 26/May/17 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: https://review.whamcloud.com/27310 |
| Comment by Gerrit Updater [ 30/May/17 ] |
|
Andreas Dilger (andreas.dilger@intel.com) merged in patch https://review.whamcloud.com/27310/ |
| Comment by Gerrit Updater [ 03/Jun/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27183/ |
| Comment by Gerrit Updater [ 07/Jun/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27180/ |
| Comment by Peter Jones [ 07/Jun/17 ] |
|
Lustre changes all landed for 2.10. ADIO changes being tracked separately |
| Comment by Eric Bergland (Inactive) [ 13/Jun/17 ] |
|
ADIO changes being tracked in |