[LU-9657] Make Lustre ADIO driver work with PFL correctly Created: 13/Jun/17  Updated: 18/Jul/22  Resolved: 18/Jul/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.11.0

Type: Improvement Priority: Major
Reporter: Emoly Liu Assignee: Emoly Liu
Resolution: Fixed Votes: 0
Labels: pfl

Rank (Obsolete): 9223372036854775807

 Description   

The work includes:

  • using llapi_layout_* interfaces to get or set the composite layout parameters (stripe count, stripe size) correctly
  • improving the following I/O redistribution algorithms for the PFL feature:
    • Stripe-contiguous: use the LCM (least common multiple) of the component stripe counts to calculate the number of available cb nodes, and try to ensure that each I/O write among the MPI procs falls within a single component extent
    • File-contiguous: use the maximum or the LCM of the component stripe sizes as the common stripe size, so that each write stays stripe-aligned.
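As a rough illustration of the two calculations listed above, the following sketch computes the LCM of the component stripe counts and a common stripe size (either the maximum or the LCM). The function names and the `(stripe_count, stripe_size)` tuple representation are assumptions made for this example; they are not taken from the ADIO patch.

```python
from functools import reduce
from math import gcd


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def lcm_stripe_count(components):
    """LCM of the stripe counts of all components."""
    return reduce(lcm, (count for count, _size in components))


def common_stripe_size(components, use_lcm=False):
    """Maximum (default) or LCM of the component stripe sizes, in bytes."""
    sizes = [size for _count, size in components]
    return reduce(lcm, sizes) if use_lcm else max(sizes)


# Hypothetical two-component layout: (stripe_count, stripe_size in bytes).
layout = [(2, 1 << 20), (4, 512 << 10)]
print(lcm_stripe_count(layout))    # 4
print(common_stripe_size(layout))  # 1048576
```

When all stripe sizes are powers of two, the maximum and the LCM coincide, so the choice only matters for unusual stripe sizes.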

The patch will eventually be submitted to MPICH.



 Comments   
Comment by Andreas Dilger [ 15/Jun/17 ]

It should be noted that in cases where MPICH knows the total file size or the number of parallel writers in advance (I don't know how often that is true or not), then it is likely more efficient to just have it specify a single N-stripe file rather than using a PFL file. PFL files should mostly be used when the application doesn't know in advance how large the file is going to be, or the number of concurrent readers/writers.

Comment by Emoly Liu [ 20/Jun/17 ]

The ADIO driver, now converted to use the llapi_layout_* interfaces, works correctly on a non-PFL file. Next I will test it on a PFL file.

BTW, I will later post a Lustre patch that adds a new llapi_layout_* interface, which I wrote during my testing.

Comment by Gerrit Updater [ 21/Jun/17 ]

Emoly Liu (emoly.liu@intel.com) uploaded a new patch: https://review.whamcloud.com/27752
Subject: LU-9657 llapi: check if the file laout is composite
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 614b8abd2a8372e5b71c4a5ee9da1c6884601b42

Comment by Emoly Liu [ 22/Jun/17 ]

adilger, I have a question: do we need to let the user know whether the file layout is composite or not?
I have two ideas:

  • Yes, add a new API llapi_layout_is_composite(), as I did in the patch https://review.whamcloud.com/27752, to tell this explicitly. In this case, the user will probably have to call this API every time before setting/getting striping information;
  • No, a non-composite file layout can be treated as a single component, so we don't need to return -1 when calling llapi_layout_comp_use(layout, LLAPI_LAYOUT_COMP_USE_FIRST/LAST) if (!layout->llot_is_composite), and only return -1 for LLAPI_LAYOUT_COMP_USE_NEXT/PREV.

Which do you think is better?

BTW, I am testing the ADIO patch on trevis nodes. And since I'm using IOR, the file is non-PFL. Do I need to add some hints to specify PFL striping information? Thanks for any advice!

Comment by Andreas Dilger [ 22/Jun/17 ]

In general, I think composite and non-composite files should be treated similarly where possible. I wouldn't object to returning the one component in the FIRST/LAST case if that simplifies using these APIs.

I'm not against adding a helper function to return whether the layout is composite or not, but I suspect there are already several ways to check this - component count, magic, etc.
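The behaviour discussed above, where a non-composite layout is treated as a layout with exactly one component, can be modelled in a few lines. This is a toy model in Python, not the liblustreapi implementation: the `Layout` class, the `CompUse` enum, and the 0 / -1 return values are assumptions for illustration only.

```python
from enum import Enum


class CompUse(Enum):
    FIRST = 1
    LAST = 2
    NEXT = 3
    PREV = 4


class Layout:
    """Toy layout: a list of components plus a cursor to the current one."""

    def __init__(self, components):
        self.components = components
        self.current = 0


def comp_use(layout, pos):
    """Move the cursor; return 0 on success, -1 if the move is not possible."""
    last = len(layout.components) - 1
    if pos is CompUse.FIRST:
        layout.current = 0
    elif pos is CompUse.LAST:
        layout.current = last
    elif pos is CompUse.NEXT and layout.current < last:
        layout.current += 1
    elif pos is CompUse.PREV and layout.current > 0:
        layout.current -= 1
    else:
        return -1
    return 0


# A non-composite layout is modelled as a single-component list.
plain = Layout(["only component"])
print(comp_use(plain, CompUse.FIRST), comp_use(plain, CompUse.NEXT))  # 0 -1
```

With this model, callers can always start with FIRST and iterate with NEXT, without checking first whether the file is composite.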

Comment by Gerrit Updater [ 28/Jun/17 ]

Emoly Liu (emoly.liu@intel.com) uploaded a new patch: https://review.whamcloud.com/27865
Subject: LU-9657 pfl: llapi_layout_comp_use should handle non-pfl file
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9c471103fa917ae83634ad7da40381668611f4f7

Comment by Gerrit Updater [ 28/Jun/17 ]

Emoly Liu (emoly.liu@intel.com) uploaded a new patch: https://review.whamcloud.com/27869
Subject: LU-9657 adio: Lustre ADIO driver patch for PFL feature
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b5c4f60a830cc47ff56fbe5700ffdae68cb4e5bb

Comment by Emoly Liu [ 29/Jun/17 ]

adilger, could you please review this ADIO patch at https://review.whamcloud.com/27869 ? I made the following changes:

  • Use llapi_layout_* interfaces to get/set the striping information
    • use llapi_layout_file_create() instead of ioctl() to set striping information when creating a file;
    • use llapi_layout_xxx_set/get() instead of ioctl() to set/get striping information;
    • since O_LOV_DELAY_CREATE is set together with O_CREAT by default in llapi_layout_file_open(), the related code in ADIOI_LUSTRE_open() is changed.
  • Improve the following I/O redistribution algorithms for the PFL feature:
    • for stripe-contiguous: use the LCM (least common multiple) of the component stripe counts to calculate the number of available cb nodes, and try to ensure that each I/O write among the MPI procs falls within a single component extent;
    • for file-contiguous: use the maximum or the LCM of the component stripe sizes as the common stripe size, so that each write stays stripe-aligned.
  • Fix some issues:
    • set fn->hints->cb_nodes in ADIOI_LUSTRE_WriteStridedColl(), otherwise the final avail_cb_nodes is always 1;
    • since there is no mapping/initialization for ranklist[], just use the rank number directly, otherwise it will get a wrong number;
    • add LDEBUG() to print debug information;
    • remove the striping information setting in ADIOI_LUSTRE_SetInfo(), since these values can be set and retrieved easily by llapi_layout_xxx_set/get().

The patch works correctly on my two local VMs with IOR and a non-PFL file, but it sometimes fails on multiple trevis nodes; IOR+POSIX+ADIO also fails on trevis. The following is the output of a simple collective write test:

[root@centos7-2 C]# rm /mnt/lustre/iorfile 
rm: remove regular file ‘/mnt/lustre/iorfile’? y
[root@centos7-2 C]# lfs osts
OBDS:
0: lustre-OST0000_UUID ACTIVE
1: lustre-OST0001_UUID ACTIVE
2: lustre-OST0002_UUID ACTIVE
3: lustre-OST0003_UUID ACTIVE
[root@centos7-2 C]# cat hostfile 
centos7-2
centos7-3
[root@centos7-2 C]# mpirun -np 2 -machinefile ./hostfile /root/ior/src/C/IOR -a MPIIO -b 6M -o /mnt/lustre/iorfile -t 1M -v -c -w -r -W -i 1 -T 30 -k -U /root/ior/src/C/hint -H 
IOR-2.10.3: MPI Coordinated Test of Parallel I/O

Run began: Thu Jun 29 11:05:32 2017
Command line used: /root/ior/src/C/IOR -a MPIIO -b 6M -o /mnt/lustre/iorfile -t 1M -v -c -w -r -W -i 1 -T 30 -k -U /root/ior/src/C/hint -H
Machine: Linux centos7-2
Start time skew across all tasks: 0.57 sec
Path: /mnt/lustre
FS: 1.2 GiB   Used FS: 5.1%   Inodes: 0.1 Mi   Used Inodes: 0.3%
Participating tasks: 2

Summary:
	api                = MPIIO (version=3, subversion=1)
	test filename      = /mnt/lustre/iorfile
	access             = single-shared-file, collective
	pattern            = segmented (1 segment)
	ordering in a file = sequential offsets
	ordering inter file= no tasks offsets
	clients            = 2 (1 per node)
	repetitions        = 1
	xfersize           = 1 MiB
	blocksize          = 6 MiB
	aggregate filesize = 12 MiB


hints passed to MPI_File_open() {
	striping_factor = 2
	striping_unit = 524288
	directIO = disable
	romio_lustre_co_ratio = 2
	same_io_size = no
	contiguous_data = yes
	ds_in_coll = enable
	big_req_size = 40960
}

hints returned from opened file {
	direct_read = false
	direct_write = false
	romio_lustre_co_ratio = 2
	romio_lustre_coll_threshold = 0
	romio_lustre_ds_in_coll = enable
	striping_unit = 524288
	striping_factor = 2
	cb_buffer_size = 16777216
	romio_cb_read = automatic
	romio_cb_write = automatic
	cb_nodes = 2
	romio_no_indep_rw = false
	romio_cb_pfr = disable
	romio_cb_fr_types = aar
	romio_cb_fr_alignment = 1
	romio_cb_ds_threshold = 0
	romio_cb_alltoall = automatic
	ind_rd_buffer_size = 4194304
	ind_wr_buffer_size = 524288
	romio_ds_read = automatic
	romio_ds_write = automatic
	cb_config_list = *:1
	romio_filesystem_type = LUSTRE:
	romio_aggregator_list = 0 1 
	romio_lustre_start_iodevice = 3
}
Commencing write performance test.
Thu Jun 29 11:05:32 2017

Verifying contents of the file(s) just written.
Thu Jun 29 11:05:32 2017


hints passed to MPI_File_open() {
	striping_factor = 2
	striping_unit = 524288
	directIO = disable
	romio_lustre_co_ratio = 2
	same_io_size = no
	contiguous_data = yes
	ds_in_coll = enable
	big_req_size = 40960
}

hints returned from opened file {
	direct_read = false
	direct_write = false
	romio_lustre_co_ratio = 2
	romio_lustre_coll_threshold = 0
	romio_lustre_ds_in_coll = enable
	striping_unit = 524288
	striping_factor = 2
	cb_buffer_size = 16777216
	romio_cb_read = automatic
	romio_cb_write = automatic
	cb_nodes = 2
	romio_no_indep_rw = false
	romio_cb_pfr = disable
	romio_cb_fr_types = aar
	romio_cb_fr_alignment = 1
	romio_cb_ds_threshold = 0
	romio_cb_alltoall = automatic
	ind_rd_buffer_size = 4194304
	ind_wr_buffer_size = 524288
	romio_ds_read = automatic
	romio_ds_write = automatic
	cb_config_list = *:1
	romio_filesystem_type = LUSTRE:
	romio_aggregator_list = 0 1 
	romio_lustre_start_iodevice = 3
}

hints passed to MPI_File_open() {
	striping_factor = 2
	striping_unit = 524288
	directIO = disable
	romio_lustre_co_ratio = 2
	same_io_size = no
	contiguous_data = yes
	ds_in_coll = enable
	big_req_size = 40960
}

hints returned from opened file {
	direct_read = false
	direct_write = false
	romio_lustre_co_ratio = 2
	romio_lustre_coll_threshold = 0
	romio_lustre_ds_in_coll = enable
	striping_unit = 524288
	striping_factor = 2
	cb_buffer_size = 16777216
	romio_cb_read = automatic
	romio_cb_write = automatic
	cb_nodes = 2
	romio_no_indep_rw = false
	romio_cb_pfr = disable
	romio_cb_fr_types = aar
	romio_cb_fr_alignment = 1
	romio_cb_ds_threshold = 0
	romio_cb_alltoall = automatic
	ind_rd_buffer_size = 4194304
	ind_wr_buffer_size = 524288
	romio_ds_read = automatic
	romio_ds_write = automatic
	cb_config_list = *:1
	romio_filesystem_type = LUSTRE:
	romio_aggregator_list = 0 1 
	romio_lustre_start_iodevice = 3
}
Commencing read performance test.
Thu Jun 29 11:05:33 2017

Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev  Mean (s)  Op grep #Tasks tPN reps  fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize

---------  ---------  ---------  ----------   -------  ---------  ---------  ----------   -------  --------
write          62.17      62.17       62.17      0.00      62.17      62.17       62.17      0.00   0.19302   2 1 1 0 0 1 0 0 1 6291456 1048576 12582912 -1 MPIIO EXCEL
read          138.39     138.39      138.39      0.00     138.39     138.39      138.39      0.00   0.08671   2 1 1 0 0 1 0 0 1 6291456 1048576 12582912 -1 MPIIO EXCEL

Max Write: 62.17 MiB/sec (65.19 MB/sec)
Max Read:  138.39 MiB/sec (145.11 MB/sec)

Run finished: Thu Jun 29 11:05:33 2017
[root@centos7-2 C]# lfs getstripe /mnt/lustre/iorfile 
/mnt/lustre/iorfile
lmm_stripe_count:  2
lmm_stripe_size:   524288
lmm_pattern:       1
lmm_layout_gen:    0
lmm_stripe_offset: 0
	obdidx		 objid		 objid		 group
	     0	            10	          0xa	             0
	     1	            10	          0xa	             0

[root@centos7-2 C]# ls -al /mnt/lustre/iorfile 
-rw-r--r--. 1 root root 12582912 Jun 29 11:05 /mnt/lustre/iorfile

If this change for non-PFL files is OK, I will move on to adding PFL hints and run some tests.

Comment by Andreas Dilger [ 04/Jul/17 ]

It isn't totally clear why you are using the LCM of the stripe count, instead of using LCM(stripe_count * stripe_size) of each component? Also, if the first component is very small (1 stripe, small stripe size <= 1MB) then it should probably be skipped in this calculation, as it will not contribute significantly to the overall performance of the file.
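One possible reading of the suggestion above is sketched below: take the LCM of `stripe_count * stripe_size` over the components, skipping a small first component (one stripe, stripe size of 1 MiB or less). The function name, the tuple representation, and the rule of never dropping the only component are assumptions of this sketch, not part of any patch.

```python
from functools import reduce
from math import gcd

SMALL_STRIPE_SIZE = 1 << 20  # 1 MiB threshold, taken from the comment above


def full_stripe_width_lcm(components):
    """LCM of stripe_count * stripe_size over the components that matter.

    components: list of (stripe_count, stripe_size_in_bytes) in file order.
    """
    comps = list(components)
    # Skip a small first component, but never drop the only component.
    if len(comps) > 1:
        count, size = comps[0]
        if count == 1 and size <= SMALL_STRIPE_SIZE:
            comps = comps[1:]
    widths = [count * size for count, size in comps]
    return reduce(lambda a, b: a * b // gcd(a, b), widths)


# Hypothetical layout: 1 x 1 MiB, then 4 x 1 MiB, then 8 x 4 MiB.
print(full_stripe_width_lcm([(1, 1 << 20), (4, 1 << 20), (8, 4 << 20)]))  # 33554432
```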

Comment by Emoly Liu [ 05/Jul/17 ]

The LCM of the stripe counts is used to calculate avail_cb_nodes, the number of MPI processes that will take part in a given read/write.
And yes, if the component is very small, it won't help much. Another issue is that the algorithm in the ADIO driver follows a stripe-contiguous pattern, which doesn't make much sense for PFL because the component layouts can differ so much.

Comment by Gerrit Updater [ 19/Jul/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27752/
Subject: LU-9657 llapi: check if the file layout is composite
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 11670b4c5e7014e2092bfa7ca6ba96ecca5d9d4e

Comment by James A Simmons [ 08/Aug/17 ]

Has anyone pushed the PFL patch to the MPICH git repo yet?

Comment by Emoly Liu [ 09/Aug/17 ]

The ADIO+PFL patch has not been pushed to MPICH git repo. Now it can set/analyze the PFL layout parameters correctly by two new hints "romio_lustre_pfl" and "romio_lustre_pfl_layout", but it always fails if the first component stripe size < 1MB. I'm investigating the issue, then will update the patch at https://review.whamcloud.com/27869 .

Comment by Emoly Liu [ 09/Aug/17 ]

Here is my simple test to run ADIO+PFL on 4 OSTs:

[root@centos7-2 C]# cat hostfile 
centos7-2
centos7-3
[root@centos7-2 C]# cat hint 
IOR_HINT__MPI__romio_lustre_pfl=enable
IOR_HINT__MPI__romio_lustre_pfl_layout=-E 4M -c 2 -S 1M -E -1 -c 4 -S 512K
IOR_HINT__MPI__striping_factor=2
IOR_HINT__MPI__striping_unit=2097152
IOR_HINT__MPI__directIO=disable
IOR_HINT__MPI__romio_lustre_co_ratio=2
IOR_HINT__MPI__same_io_size=no
IOR_HINT__MPI__contiguous_data=yes
IOR_HINT__MPI__ds_in_coll=enable
IOR_HINT__MPI__big_req_size=40960

Here is the output:

[root@centos7-2 C]# mpirun -np 2 -machinefile ./hostfile /root/ior/src/C/IOR -a MPIIO -b 6M -o /mnt/lustre/iorfile -t 1M -v -c -w -W -i 1 -T 30 -k -U /root/ior/src/C/hint -H
IOR-2.10.3: MPI Coordinated Test of Parallel I/O

Run began: Wed Aug  9 11:49:09 2017
Command line used: /root/ior/src/C/IOR -a MPIIO -b 6M -o /mnt/lustre/iorfile -t 1M -v -c -w -W -i 1 -T 30 -k -U /root/ior/src/C/hint -H
Machine: Linux centos7-2
Start time skew across all tasks: 0.37 sec
Path: /mnt/lustre
FS: 1.2 GiB   Used FS: 5.1%   Inodes: 0.1 Mi   Used Inodes: 0.3%
Participating tasks: 2

Summary:
 api                = MPIIO (version=3, subversion=1)
 test filename      = /mnt/lustre/iorfile
 access             = single-shared-file, collective
 pattern            = segmented (1 segment)
 ordering in a file = sequential offsets
 ordering inter file= no tasks offsets
 clients            = 2 (1 per node)
 repetitions        = 1
 xfersize           = 1 MiB
 blocksize          = 6 MiB
 aggregate filesize = 12 MiB


hints passed to MPI_File_open() {
 romio_lustre_pfl = enable
 romio_lustre_pfl_layout = -E 4M -c 2 -S 1M -E -1 -c 4 -S 512K
 striping_factor = 2
 striping_unit = 2097152
 directIO = disable
 romio_lustre_co_ratio = 2
 same_io_size = no
 contiguous_data = yes
 ds_in_coll = enable
 big_req_size = 40960
}

hints returned from opened file {
 direct_read = false
 direct_write = false
 romio_lustre_co_ratio = 2
 romio_lustre_coll_threshold = 0
 romio_lustre_ds_in_coll = enable
 striping_unit = 2097152
 striping_factor = 2
 romio_lustre_pfl = enable
 cb_config_list = *:1
 cb_buffer_size = 16777216
 romio_cb_read = automatic
 romio_cb_write = automatic
 cb_nodes = 2
 romio_no_indep_rw = false
 romio_cb_pfr = disable
 romio_cb_fr_types = aar
 romio_cb_fr_alignment = 1
 romio_cb_ds_threshold = 0
 romio_cb_alltoall = automatic
 ind_rd_buffer_size = 4194304
 ind_wr_buffer_size = 524288
 romio_ds_read = automatic
 romio_ds_write = automatic
 romio_filesystem_type = LUSTRE:
 romio_aggregator_list = 0 1 
}
Commencing write performance test.
Wed Aug  9 11:49:09 2017

ADIOI_LUSTRE_Calc_my_req(371): rank(0) data needed from 0 (count = 1):
ADIOI_LUSTRE_Calc_my_req(374):  off[0] = 0, len[0] = 1048576
ADIOI_LUSTRE_Calc_my_req(371): rank(1) data needed from 0 (count = 1):
ADIOI_LUSTRE_Calc_my_req(374):  off[0] = 6291456, len[0] = 524288
ADIOI_LUSTRE_Calc_my_req(371): rank(1) data needed from 1 (count = 1):
ADIOI_LUSTRE_Calc_my_req(374):  off[0] = 6815744, len[0] = 524288
ADIOI_LUSTRE_Calc_my_req(371): rank(0) data needed from 1 (count = 1):
ADIOI_LUSTRE_Calc_my_req(374):  off[0] = 1048576, len[0] = 1048576
ADIOI_LUSTRE_Calc_my_req(371): rank(1) data needed from 0 (count = 1):
ADIOI_LUSTRE_Calc_my_req(374):  off[0] = 7340032, len[0] = 524288
ADIOI_LUSTRE_Calc_my_req(371): rank(1) data needed from 1 (count = 1):
ADIOI_LUSTRE_Calc_my_req(374):  off[0] = 7864320, len[0] = 524288
ADIOI_LUSTRE_Calc_my_req(371): rank(1) data needed from 0 (count = 1):
ADIOI_LUSTRE_Calc_my_req(374):  off[0] = 8388608, len[0] = 524288
ADIOI_LUSTRE_Calc_my_req(371): rank(1) data needed from 1 (count = 1):
ADIOI_LUSTRE_Calc_my_req(374):  off[0] = 8912896, len[0] = 524288
ADIOI_LUSTRE_Calc_my_req(371): rank(0) data needed from 0 (count = 1):
ADIOI_LUSTRE_Calc_my_req(374):  off[0] = 2097152, len[0] = 1048576
ADIOI_LUSTRE_Calc_my_req(371): rank(0) data needed from 1 (count = 1):
ADIOI_LUSTRE_Calc_my_req(374):  off[0] = 3145728, len[0] = 1048576
ADIOI_LUSTRE_Calc_my_req(371): rank(1) data needed from 0 (count = 1):
ADIOI_LUSTRE_Calc_my_req(374):  off[0] = 9437184, len[0] = 524288
ADIOI_LUSTRE_Calc_my_req(371): rank(1) data needed from 1 (count = 1):
ADIOI_LUSTRE_Calc_my_req(374):  off[0] = 9961472, len[0] = 524288
ADIOI_LUSTRE_Calc_my_req(371): rank(0) data needed from 0 (count = 1):
ADIOI_LUSTRE_Calc_my_req(374):  off[0] = 4194304, len[0] = 524288
ADIOI_LUSTRE_Calc_my_req(371): rank(0) data needed from 1 (count = 1):
ADIOI_LUSTRE_Calc_my_req(374):  off[0] = 4718592, len[0] = 524288
ADIOI_LUSTRE_Calc_my_req(371): rank(1) data needed from 0 (count = 1):
ADIOI_LUSTRE_Calc_my_req(374):  off[0] = 10485760, len[0] = 524288
ADIOI_LUSTRE_Calc_my_req(371): rank(1) data needed from 1 (count = 1):
ADIOI_LUSTRE_Calc_my_req(374):  off[0] = 11010048, len[0] = 524288
ADIOI_LUSTRE_Calc_my_req(371): rank(0) data needed from 0 (count = 1):
ADIOI_LUSTRE_Calc_my_req(374):  off[0] = 5242880, len[0] = 524288
ADIOI_LUSTRE_Calc_my_req(371): rank(0) data needed from 1 (count = 1):
ADIOI_LUSTRE_Calc_my_req(374):  off[0] = 5767168, len[0] = 524288
ADIOI_LUSTRE_Calc_my_req(371): rank(1) data needed from 0 (count = 1):
ADIOI_LUSTRE_Calc_my_req(374):  off[0] = 11534336, len[0] = 524288
ADIOI_LUSTRE_Calc_my_req(371): rank(1) data needed from 1 (count = 1):
ADIOI_LUSTRE_Calc_my_req(374):  off[0] = 12058624, len[0] = 524288
Verifying contents of the file(s) just written.
Wed Aug  9 11:49:09 2017


hints passed to MPI_File_open() {
 romio_lustre_pfl = enable
 romio_lustre_pfl_layout = -E 4M -c 2 -S 1M -E -1 -c 4 -S 512K
 striping_factor = 2
 striping_unit = 2097152
 directIO = disable
 romio_lustre_co_ratio = 2
 same_io_size = no
 contiguous_data = yes
 ds_in_coll = enable
 big_req_size = 40960
}

hints returned from opened file {
 direct_read = false
 direct_write = false
 romio_lustre_co_ratio = 2
 romio_lustre_coll_threshold = 0
 romio_lustre_ds_in_coll = enable
 striping_unit = 2097152
 striping_factor = 2
 romio_lustre_pfl = enable
 cb_config_list = *:1
 cb_buffer_size = 16777216
 romio_cb_read = automatic
 romio_cb_write = automatic
 cb_nodes = 2
 romio_no_indep_rw = false
 romio_cb_pfr = disable
 romio_cb_fr_types = aar
 romio_cb_fr_alignment = 1
 romio_cb_ds_threshold = 0
 romio_cb_alltoall = automatic
 ind_rd_buffer_size = 4194304
 ind_wr_buffer_size = 524288
 romio_ds_read = automatic
 romio_ds_write = automatic
 romio_filesystem_type = LUSTRE:
 romio_aggregator_list = 0 1 
}
Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev  Mean (s)  Op grep #Tasks tPN reps  fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize

---------  ---------  ---------  ----------   -------  ---------  ---------  ----------   -------  --------
write          56.14      56.14       56.14      0.00      56.14      56.14       56.14      0.00   0.21374   2 1 1 0 0 1 0 0 1 6291456 1048576 12582912 -1 MPIIO EXCEL

Max Write: 56.14 MiB/sec (58.87 MB/sec)

Run finished: Wed Aug  9 11:49:09 2017

Here is the layout of file iorfile:

[root@centos7-2 C]# ls -al /mnt/lustre/iorfile 
-rw-r--r--. 1 root root 12582912 Aug  9 11:49 /mnt/lustre/iorfile
[root@centos7-2 C]# lfs getstripe /mnt/lustre/iorfile 
/mnt/lustre/iorfile
  lcm_layout_gen:  3
  lcm_entry_count: 2
    lcme_id:             1
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   4194304
      lmm_stripe_count:  2
      lmm_stripe_size:   1048576
      lmm_pattern:       1
      lmm_layout_gen:    0
      lmm_stripe_offset: 0
      lmm_objects:
      - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x1e:0x0] }
      - 1: { l_ost_idx: 1, l_fid: [0x100010000:0x1e:0x0] }

    lcme_id:             2
    lcme_flags:          init
    lcme_extent.e_start: 4194304
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  4
      lmm_stripe_size:   524288
      lmm_pattern:       1
      lmm_layout_gen:    0
      lmm_stripe_offset: 3
      lmm_objects:
      - 0: { l_ost_idx: 3, l_fid: [0x100030000:0x10:0x0] }
      - 1: { l_ost_idx: 2, l_fid: [0x100020000:0x10:0x0] }
      - 2: { l_ost_idx: 0, l_fid: [0x100000000:0x1f:0x0] }
      - 3: { l_ost_idx: 1, l_fid: [0x100010000:0x1f:0x0] }
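The request offsets and lengths in the LDEBUG output above are consistent with splitting each rank's 6 MiB block at the stripe boundaries of whichever component covers it. The sketch below reproduces that splitting for the two-component layout shown by `lfs getstripe`. The function and its `(start, end, stripe_size)` tuples are hypothetical, and counting stripe boundaries from file offset 0 is an assumption of this example, not a statement about the driver's code.

```python
EOF = float("inf")


def split_request(offset, length, components):
    """Split [offset, offset + length) into stripe-aligned pieces.

    components: list of (start, end, stripe_size) byte extents in file order,
    where end is exclusive and EOF marks an open-ended last component.
    Returns a list of (offset, length) pieces.
    """
    pieces = []
    end = offset + length
    while offset < end:
        _start, comp_end, stripe_size = next(
            c for c in components if c[0] <= offset < c[1])
        # Next stripe boundary after offset (assumed to count from offset 0).
        boundary = (offset // stripe_size + 1) * stripe_size
        piece_end = min(end, boundary, comp_end)
        pieces.append((offset, piece_end - offset))
        offset = piece_end
    return pieces


# Layout from the getstripe output: [0, 4M) at 1 MiB, [4M, EOF) at 512 KiB.
layout = [(0, 4 << 20, 1 << 20), (4 << 20, EOF, 512 << 10)]
# Rank 0 writes the first 6 MiB: 4 pieces of 1 MiB, then 4 pieces of 512 KiB.
for piece in split_request(0, 6 << 20, layout):
    print(piece)
```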

Comment by Andreas Dilger [ 09/Aug/17 ]

Just to confirm, the Lustre ADIO driver should still work properly without the PFL hints - those are only hints to create a PFL file? What has been proposed for specifying a PFL template layout for nodemap is to give the FID (or in this case the pathname) of a directory with the desired layout template. That allows the user to create an arbitrarily complex layout for the output files, without having to specify a complex syntax to create the composite file.

The other option is to allow a string of YAML to specify the layout, like what Bobijam has done for saving and restoring the layout on other files.

That is especially true because after PFL files there will be FLR files, so the hint name should not be "*_pfl".

Comment by Cong Xu (Inactive) [ 09/Aug/17 ]

One issue with pushing this patch to the MPICH git repo is that the code cannot be compiled against a regular Lustre file system, because on regular Lustre the PFL header file is missing and the PFL APIs for setting/getting the striping configuration are not supported. We need to figure out how to make this code work with both regular Lustre and Lustre with the PFL feature.

As for the hints file question: yes, if the hints file is not provided, the Lustre ADIO driver still works. The file automatically inherits the striping configuration from its parent directory, and the Lustre ADIO driver uses a system call to obtain the striping information of the file and calculate the aggregators. Conversely, when the hints file is provided, the Lustre ADIO driver creates the file on Lustre, and the striping configuration of the file follows the hints file.

Comment by Emoly Liu [ 10/Aug/17 ]

Thanks to Cong for the reply.
BTW, thanks to Andreas for the reminder. I will improve the way a complex layout is set, replacing the current hint string, once I have made sure that ADIO works correctly with the PFL feature.

Comment by Emoly Liu [ 21/Aug/17 ]

Now the ADIO driver can work with the PFL feature correctly by specifying a YAML template file. I still have some questions about this work:

  • I use the last component stripe size as the common stripe size because different MPI procs will write different components, and it's hard to predict which component will have the most impact on performance. If anyone has any idea about this, please let me know.
  • I added the following checks to romio/configure.ac to verify whether Lustre supports composite layouts and YAML. Since I am not sure whether YAML is a must, I am keeping the implementation of the string-format hint "romio_lustre_comp_layout_opt" in case YAML is not present on the system.
    	    # Verify presence of composite layout functions
    	    AC_SEARCH_LIBS(llapi_layout_comp_use, lustreapi,
    	        [AC_DEFINE(HAVE_LUSTRE_COMP_LAYOUT_SUPPORT, 1, [Lustre composite layout is supported])],
    	        [AC_MSG_WARN([Lustre composite layout is not supported])])
    	    # Verify presence of yaml.h
    	    AC_CHECK_HEADERS(yaml.h,
    	        [AC_DEFINE(HAVE_YAML_SUPPORT, 1, [yaml is present])],
    	        [AC_MSG_WARN([yaml is not present])])
    

I will post the current ADIO+PFL example later.

Comment by Emoly Liu [ 21/Aug/17 ]

Here is my simple test to run ADIO+PFL on 4 OSTs:

[root@centos7-2 C]# cat hostfile 
centos7-2
centos7-3
[root@centos7-2 C]# cat hint 
IOR_HINT__MPI__romio_lustre_layout_yaml_temp=/root/ior/src/C/yaml_temp
IOR_HINT__MPI__striping_factor=2
IOR_HINT__MPI__striping_unit=1048576
IOR_HINT__MPI__directIO=disable
IOR_HINT__MPI__romio_lustre_co_ratio=2
IOR_HINT__MPI__same_io_size=no
IOR_HINT__MPI__contiguous_data=yes
IOR_HINT__MPI__ds_in_coll=enable
IOR_HINT__MPI__big_req_size=40960

[root@centos7-2 C]# cat yaml_temp 
  lcm_layout_gen:  
  lcm_entry_count: 4
  component0:
    lcme_id:             1
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   4194304
    sub_layout:
      lmm_stripe_count:  2
      lmm_stripe_size:   524288
      lmm_pattern:       1
      lmm_layout_gen:    0
      lmm_stripe_offset: 0

  component1:
    lcme_id:             2
    lcme_flags:          0
    lcme_extent.e_start: 4194304
    lcme_extent.e_end:   8388608
    sub_layout:
      lmm_stripe_count:  4
      lmm_stripe_size:   1048576
      lmm_pattern:       1
      lmm_layout_gen:    0
      lmm_stripe_offset: -1
  component2:
    lcme_id:             3
    lcme_flags:          0
    lcme_extent.e_start: 8388608
    lcme_extent.e_end:   12582912
    sub_layout:
      lmm_stripe_count:  4
      lmm_stripe_size:   262144
      lmm_pattern:       1
      lmm_layout_gen:    0
      lmm_stripe_offset: 0
  component3:
    lcme_id:             4
    lcme_flags:          0
    lcme_extent.e_start: 12582912
    lcme_extent.e_end:   EOF
    sub_layout:
      lmm_stripe_count:  2
      lmm_stripe_size:   2097152
      lmm_pattern:       1
      lmm_layout_gen:    0
      lmm_stripe_offset: -1

Here is the output:

[root@centos7-2 C]# mpirun -np 2 -machinefile ./hostfile /root/ior/src/C/IOR -a MPIIO -b 6M -o /mnt/lustre/iorfile -t 1M -v -c -w -r -W -i 1 -T 30 -k -U /root/ior/src/C/hint -H
IOR-2.10.3: MPI Coordinated Test of Parallel I/O

Run began: Mon Aug 21 18:23:24 2017
Command line used: /root/ior/src/C/IOR -a MPIIO -b 6M -o /mnt/lustre/iorfile -t 1M -v -c -w -r -W -i 1 -T 30 -k -U /root/ior/src/C/hint -H
Machine: Linux centos7-2
Start time skew across all tasks: 0.42 sec
Path: /mnt/lustre
FS: 1.2 GiB   Used FS: 5.1%   Inodes: 0.1 Mi   Used Inodes: 0.3%
Participating tasks: 2

Summary:
	api                = MPIIO (version=3, subversion=1)
	test filename      = /mnt/lustre/iorfile
	access             = single-shared-file, collective
	pattern            = segmented (1 segment)
	ordering in a file = sequential offsets
	ordering inter file= no tasks offsets
	clients            = 2 (1 per node)
	repetitions        = 1
	xfersize           = 1 MiB
	blocksize          = 6 MiB
	aggregate filesize = 12 MiB


hints passed to MPI_File_open() {
	romio_lustre_layout_yaml_temp = /root/ior/src/C/yaml_temp
	striping_factor = 2
	striping_unit = 1048576
	directIO = disable
	romio_lustre_co_ratio = 2
	same_io_size = no
	contiguous_data = yes
	ds_in_coll = enable
	big_req_size = 40960
}

hints returned from opened file {
	direct_read = false
	direct_write = false
	romio_lustre_co_ratio = 2
	romio_lustre_coll_threshold = 0
	romio_lustre_ds_in_coll = enable
	striping_unit = 1048576
	striping_factor = 2
	cb_config_list = *:1
	cb_buffer_size = 16777216
	romio_cb_read = automatic
	romio_cb_write = automatic
	cb_nodes = 2
	romio_no_indep_rw = false
	romio_cb_pfr = disable
	romio_cb_fr_types = aar
	romio_cb_fr_alignment = 1
	romio_cb_ds_threshold = 0
	romio_cb_alltoall = automatic
	ind_rd_buffer_size = 4194304
	ind_wr_buffer_size = 524288
	romio_ds_read = automatic
	romio_ds_write = automatic
	romio_filesystem_type = LUSTRE:
	romio_aggregator_list = 0 1 
}
Commencing write performance test.
Mon Aug 21 18:23:24 2017

Verifying contents of the file(s) just written.
Mon Aug 21 18:23:25 2017


hints passed to MPI_File_open() {
	romio_lustre_layout_yaml_temp = /root/ior/src/C/yaml_temp
	striping_factor = 2
	striping_unit = 1048576
	directIO = disable
	romio_lustre_co_ratio = 2
	same_io_size = no
	contiguous_data = yes
	ds_in_coll = enable
	big_req_size = 40960
}

hints returned from opened file {
	direct_read = false
	direct_write = false
	romio_lustre_co_ratio = 2
	romio_lustre_coll_threshold = 0
	romio_lustre_ds_in_coll = enable
	striping_unit = 1048576
	striping_factor = 2
	cb_config_list = *:1
	cb_buffer_size = 16777216
	romio_cb_read = automatic
	romio_cb_write = automatic
	cb_nodes = 2
	romio_no_indep_rw = false
	romio_cb_pfr = disable
	romio_cb_fr_types = aar
	romio_cb_fr_alignment = 1
	romio_cb_ds_threshold = 0
	romio_cb_alltoall = automatic
	ind_rd_buffer_size = 4194304
	ind_wr_buffer_size = 524288
	romio_ds_read = automatic
	romio_ds_write = automatic
	romio_filesystem_type = LUSTRE:
	romio_aggregator_list = 0 1 
}

hints passed to MPI_File_open() {
	romio_lustre_layout_yaml_temp = /root/ior/src/C/yaml_temp
	striping_factor = 2
	striping_unit = 1048576
	directIO = disable
	romio_lustre_co_ratio = 2
	same_io_size = no
	contiguous_data = yes
	ds_in_coll = enable
	big_req_size = 40960
}

hints returned from opened file {
	direct_read = false
	direct_write = false
	romio_lustre_co_ratio = 2
	romio_lustre_coll_threshold = 0
	romio_lustre_ds_in_coll = enable
	striping_unit = 1048576
	striping_factor = 2
	cb_config_list = *:1
	cb_buffer_size = 16777216
	romio_cb_read = automatic
	romio_cb_write = automatic
	cb_nodes = 2
	romio_no_indep_rw = false
	romio_cb_pfr = disable
	romio_cb_fr_types = aar
	romio_cb_fr_alignment = 1
	romio_cb_ds_threshold = 0
	romio_cb_alltoall = automatic
	ind_rd_buffer_size = 4194304
	ind_wr_buffer_size = 524288
	romio_ds_read = automatic
	romio_ds_write = automatic
	romio_filesystem_type = LUSTRE:
	romio_aggregator_list = 0 1 
}
Commencing read performance test.
Mon Aug 21 18:23:25 2017

Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev  Mean (s)  Op grep #Tasks tPN reps  fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize

---------  ---------  ---------  ----------   -------  ---------  ---------  ----------   -------  --------
write          15.84      15.84       15.84      0.00      15.84      15.84       15.84      0.00   0.75749   2 1 1 0 0 1 0 0 1 6291456 1048576 12582912 -1 MPIIO EXCEL
read          129.90     129.90      129.90      0.00     129.90     129.90      129.90      0.00   0.09238   2 1 1 0 0 1 0 0 1 6291456 1048576 12582912 -1 MPIIO EXCEL

Max Write: 15.84 MiB/sec (16.61 MB/sec)
Max Read:  129.90 MiB/sec (136.21 MB/sec)

Run finished: Mon Aug 21 18:23:25 2017

Here is the layout of file iorfile:

[root@centos7-2 C]# lfs getstripe /mnt/lustre/iorfile 
/mnt/lustre/iorfile
  lcm_layout_gen:  6
  lcm_entry_count: 4
    lcme_id:             1
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   4194304
      lmm_stripe_count:  2
      lmm_stripe_size:   524288
      lmm_pattern:       1
      lmm_layout_gen:    0
      lmm_stripe_offset: 0
      lmm_objects:
      - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x18:0x0] }
      - 1: { l_ost_idx: 1, l_fid: [0x100010000:0x18:0x0] }

    lcme_id:             2
    lcme_flags:          init
    lcme_extent.e_start: 4194304
    lcme_extent.e_end:   8388608
      lmm_stripe_count:  4
      lmm_stripe_size:   1048576
      lmm_pattern:       1
      lmm_layout_gen:    0
      lmm_stripe_offset: 3
      lmm_objects:
      - 0: { l_ost_idx: 3, l_fid: [0x100030000:0xe:0x0] }
      - 1: { l_ost_idx: 2, l_fid: [0x100020000:0xe:0x0] }
      - 2: { l_ost_idx: 0, l_fid: [0x100000000:0x19:0x0] }
      - 3: { l_ost_idx: 1, l_fid: [0x100010000:0x19:0x0] }

    lcme_id:             3
    lcme_flags:          init
    lcme_extent.e_start: 8388608
    lcme_extent.e_end:   12582912
      lmm_stripe_count:  4
      lmm_stripe_size:   262144
      lmm_pattern:       1
      lmm_layout_gen:    0
      lmm_stripe_offset: 0
      lmm_objects:
      - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x1a:0x0] }
      - 1: { l_ost_idx: 1, l_fid: [0x100010000:0x1a:0x0] }
      - 2: { l_ost_idx: 2, l_fid: [0x100020000:0xf:0x0] }
      - 3: { l_ost_idx: 3, l_fid: [0x100030000:0xf:0x0] }

    lcme_id:             4
    lcme_flags:          0
    lcme_extent.e_start: 12582912
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  2
      lmm_stripe_size:   2097152
      lmm_pattern:       1
      lmm_layout_gen:    0
      lmm_stripe_offset: -1

Comment by Emoly Liu [ 24/Aug/17 ]

I improved the code to use only one hint "romio_lustre_comp_layout" to specify the composite layout in 3 formats:

  • YAML template file, e.g. /a/b/layout.yaml
  • command option string, similar to "lfs setstripe" command, e.g. "-E 4M -c 2 -S 512K -E 8M -c 4 -S 1M -E -1 -S 256K"
  • Lustre source file, e.g. /mnt/lustre/compfile, which means the new file is created with the same layout as this Lustre file.

Here is an example of ior hint file:

IOR_HINT__MPI__romio_lustre_comp_layout=/root/ior/src/C/yaml_temp
#IOR_HINT__MPI__romio_lustre_comp_layout=/mnt/lustre/testfile
#IOR_HINT__MPI__romio_lustre_comp_layout=-E 4M -c 2 -S 512K -E 8M -c 4 -S 1M -E -1 -S 256K

The latter two formats are provided in case YAML is not present. The patch has been updated at https://review.whamcloud.com/#/c/27869/9/
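To make the command option string format concrete, here is a small sketch that parses such a string into a list of components. It is a hypothetical parser written for this illustration and handles only the three options that appear in the example (-E, -c, -S); it does not reflect how the patch or "lfs setstripe" actually parses options, and values that are not given are simply left as None.

```python
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text):
    """Convert '512K', '4M' or a plain byte count to bytes; '-1' means EOF."""
    if text == "-1":
        return None  # open-ended extent (EOF)
    suffix = text[-1].upper()
    if suffix in UNITS:
        return int(text[:-1]) * UNITS[suffix]
    return int(text)


def parse_comp_layout(options):
    """Parse an option string into a list of component dictionaries."""
    components = []
    tokens = options.split()
    start = 0
    # Every supported option takes exactly one value, so pair tokens up.
    for flag, value in zip(tokens[0::2], tokens[1::2]):
        if flag == "-E":
            # Each -E opens a new component that starts where the last ended.
            components.append({"start": start, "end": parse_size(value),
                               "stripe_count": None, "stripe_size": None})
            start = components[-1]["end"]
        elif flag == "-c":
            components[-1]["stripe_count"] = int(value)
        elif flag == "-S":
            components[-1]["stripe_size"] = parse_size(value)
        else:
            raise ValueError("unsupported option: " + flag)
    return components


hint = "-E 4M -c 2 -S 512K -E 8M -c 4 -S 1M -E -1 -S 256K"
for component in parse_comp_layout(hint):
    print(component)
```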

Comment by Gerrit Updater [ 28/Aug/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27865/
Subject: LU-9657 pfl: llapi_layout_comp_usei should handle non-pfl file
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4f349ddde3567c19a522c707fc5ef2271016bfe6

Comment by Robert Latham [ 18/Jul/22 ]

I think it's correct to close this as fixed/resolved on your end, and the ball is in MPICH's court. I did not merge work into MPICH 4 years ago because we didn't have any PFL lustre to test on. I am sure I can find some PFL lustre nowadays and will revisit https://github.com/pmodels/mpich/pull/3290
