[LU-15788] lazystatfs + FOFB + mpich problems Created: 27/Apr/22 Updated: 15/Jun/22 Resolved: 11/Jun/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.0 |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Alexander Boyko | Assignee: | Alexander Boyko |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
During FOFB tests with IOR and mpich we are observing the following errors. I've created a timeline for the issue.

Using Time Stamp 1648109998 (0x623c29ae) for Data Signature (03:19:58)
delaying 15 seconds . . .
Commencing write performance test.
Thu Mar 24 03:21:10 2022
write 717.93 1048576 1024.00 0.113480 91.17 0.010149 91.28 3 XXCEL
Verifying contents of the file(s) just written.
Thu Mar 24 03:22:41 2022
delaying 15 seconds . . .
[RANK 000] open for reading file /lus/kjcf05/disk/ostest.vers/alsorun.20220324030303.12286.walleye-p5/CL_IOR_sel_32ovs_mpiio_wr_8iter_n64_1m.1.LKuc9T.1648109355/CL_IOR_sel_32ovs_mpiio_wr_8iter_n64_1m/IORfile_1m XXCEL
Commencing read performance test.
Thu Mar 24 03:23:27 2022
read 2698.93 1048576 1024.00 0.030882 24.25 0.005629 24.28 3 XXCEL
Using Time Stamp 1648110232 (0x623c2a98) for Data Signature (03:24:42)
delaying 15 seconds . . . (~03:24:57)
Mar 24 03:24:51 kjcf05n03 kernel: Lustre: Failing over kjcf05-MDT0000
** error **
** error **
ADIO_RESOLVEFILETYPE_FNCALL(387): Invalid file name /lus/kjcf05/disk/ostest.vers/alsorun.20220324030303.12286.walleye-p5/CL_IOR_sel_32ovs_mpiio_wr_8iter_n64_1m.1.LKuc9T.1648109355/CL_IOR_sel_32ovs_mpiio_wr_8iter_n64_1m/IORfile_1m, mpi_check_status: 939600165, mpi_check_status_errno: 107
MPI File does not exist, error stack:
(unknown)(): Invalid file name, mpi_check_status: 939600165, mpi_check_status_errno: 2
Rank 0 [Thu Mar 24 03:25:00 2022] [c3-0c0s12n0] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
Mar 24 03:25:46 kjcf05n03 kernel: Lustre: server umount kjcf05-MDT0000 complete
Mar 24 03:25:46 kjcf05n03 kernel: md65: detected capacity change from 21009999921152 to 0
Mar 24 03:25:46 kjcf05n03 kernel: md: md65 stopped.
Mar 24 03:25:48 kjcf05n02 kernel: md: md65 stopped.
00000020:00000001:22.0:1648110350.625691:0:512728:0:(obd_mount_server.c:1352:server_start_targets()) Process entered
Mar 24 03:25:51 kjcf05n02 kernel: Lustre: kjcf05-MDT0000: Will be in recovery for at least 15:00, or until 24 clients reconnect

The failure comes from the mpich code path shown below. The VFS statfs path does a lookup for the file and then calls ll_statfs. If the cluster loses the MDT between these two calls, ll_statfs fails with one of the following errors: EAGAIN, ENOTCONN, ENODEV. The exact error depends on the MDT failover stage. The error breaks the MPICH logic for detecting the FS type and fails the IOR run. The error does not happen with nolazystatfs, because ll_statfs is then blocking and waits for the MDT.

static void ADIO_FileSysType_fncall(const char *filename, int *fstype, int *error_code)
{
    int err;
    int64_t file_id;
    static char myname[] = "ADIO_RESOLVEFILETYPE_FNCALL";

    /* NFS can get stuck and end up returning ESTALE "forever" */
#define MAX_ESTALE_RETRY 10000
    int retry_cnt;

    *error_code = MPI_SUCCESS;

    retry_cnt = 0;
    do {
        err = romio_statfs(filename, &file_id);
    } while (err && (errno == ESTALE) && retry_cnt++ < MAX_ESTALE_RETRY);
I suggest masking these errors as ESTALE in ll_statfs. This would let MPICH work with the lazystatfs option during FOFB. |
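The error masking proposed in the description could look roughly like the sketch below. The helper name `statfs_mask_failover_error` is hypothetical and is not taken from the Lustre tree; it only assumes the kernel convention of returning negative errno values, and it covers exactly the three errors named above.

```c
#include <errno.h>

/*
 * Hypothetical helper (an assumption for illustration, not Lustre code):
 * map the transient errors that a lazy statfs can return during MDT
 * failover to -ESTALE, which the ROMIO loop quoted above retries on.
 */
static int statfs_mask_failover_error(int rc)
{
	switch (rc) {
	case -EAGAIN:
	case -ENOTCONN:
	case -ENODEV:
		return -ESTALE;
	default:
		/* success (0) and all other errors pass through unchanged */
		return rc;
	}
}
```

With such a mapping, a caller that sees ESTALE would stay in the retry loop instead of treating the file name as invalid. Whether ESTALE is an appropriate return value for statfs is questioned in the comments that follow.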
| Comments |
| Comment by Gerrit Updater [ 27/Apr/22 ] |
|
"Alexander Boyko <alexander.boyko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/47152 |
| Comment by Alexander Boyko [ 27/Apr/22 ] |
|
adilger, could you take a look at the description? I've pushed the patch for discussion only; we have no agreement about the fix. This could also be fixed in the mpich library. I also want to mention that Lustre returns errors that are not permitted for this syscall; however, ESTALE is also wrong according to the man pages. The whole user-mode concept of detecting the FS type with a statfs call, especially for a distributed FS, brings me to tears.
|
| Comment by Andreas Dilger [ 27/Apr/22 ] |
|
Probably the obd_statfs() call for MDT0000 should not be lazy, since MDT0000 is required for filesystem operation. That should also avoid this problem, and be "more correct" for users as well - they will get some valid return rather than an error. |
| Comment by Mikhail Pershin [ 28/Apr/22 ] |
|
Does this mean that turning lazystatfs off would remove the problem as well? |
| Comment by Alexander Boyko [ 28/Apr/22 ] |
|
Yes, turning lazystatfs off makes ll_statfs blocking. The ptlrpc layer handles errors and resends the statfs request when MDT0 finishes recovery. |
| Comment by Andreas Dilger [ 28/Apr/22 ] |
|
Mike, lazystatfs has been enabled by default for a long time. However, it should only apply to "lfs df" to return individual OST stats, not cause the whole statfs to fail. That is a bad interaction between STATFS_SUM (which only sends one RPC to one MDS) and lazystatfs (which allows individual RPCs to fail, but expects most of them to work). I think the current patch is a reasonable compromise. It retries the STATFS_SUM multiple times to different MDTs (which shouldn't all be failing at the same time), and should also block (loop retrying) if all MDTs are down. |
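The retry behaviour described in this comment can be sketched as follows. This is a minimal user-space illustration, not the code from the patch: the callback type `mdt_statfs_fn`, the function `statfs_try_each_mdt`, and the stub `fake_mdt_statfs` are all hypothetical names introduced here.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback: query one MDT, return 0 or a negative errno. */
typedef int (*mdt_statfs_fn)(unsigned int mdt_idx, void *arg);

/*
 * Try each MDT in turn, starting from `start`, and return on the first
 * success. If every MDT fails, return the last error so that the caller
 * can decide whether to loop again (block) or give up.
 */
static int statfs_try_each_mdt(unsigned int mdt_count, unsigned int start,
			       mdt_statfs_fn query, void *arg)
{
	int rc = -ENODEV;	/* returned when there are no MDTs at all */
	unsigned int i;

	for (i = 0; i < mdt_count; i++) {
		unsigned int idx = (start + i) % mdt_count;

		rc = query(idx, arg);
		if (rc == 0)
			return 0;
	}
	return rc;
}

/* Demo stub: MDTs with an index below *(unsigned int *)arg are "down". */
static int fake_mdt_statfs(unsigned int mdt_idx, void *arg)
{
	unsigned int first_up = *(unsigned int *)arg;

	return mdt_idx < first_up ? -ENOTCONN : 0;
}
```

With the stub, a call such as `statfs_try_each_mdt(4, 0, fake_mdt_statfs, &first_up)` succeeds as long as at least one of the four MDTs is up. A blocking caller would wrap this in an outer loop that retries while all MDTs are down.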
| Comment by Gerrit Updater [ 11/Jun/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47152/ |
| Comment by Peter Jones [ 11/Jun/22 ] |
|
Landed for 2.15 |
| Comment by Alex Zhuravlev [ 15/Jun/22 ] |
|
With this patch landed, I hit the following failure in almost 100% of runs:
PASS 150 (9s)
== recovery-small test complete, duration 4839 sec ======= 04:02:47 (1655265767)
rm: cannot remove '/mnt/lustre/d110h.recovery-small/target_dir/tgt_file': Input/output error
recovery-small : @@@@@@ FAIL: remove sub-test dirs failed
Trace dump:
= ./../tests/test-framework.sh:6522:error()
= ./../tests/test-framework.sh:6006:check_and_cleanup_lustre()
= recovery-small.sh:3306:main()
bisection:
COMMIT      TESTED  PASSED  FAILED        COMMIT DESCRIPTION
a3cba2ead7  1       0       1       BAD   LU-13547 tests: remove ea_inode from mkfs MDT options
4c47900889  5       4       1       BAD   LU-12186 ec: add necessary structure member for EC file
b762319d5a  5       4       1       BAD   LU-14195 libcfs: test for nla_strscpy
57f3262baa  2       1       1       BAD   LU-15788 lmv: try another MDT if statfs failed
b00ac5f703  5       5       0       GOOD  LU-12756 lnet: Avoid redundant peer NI lookups
23028efcae  5       5       0       GOOD  LU-6864 osp: manage number of modify RPCs in flight
7f157f8ef3  5       5       0       GOOD  LU-15841 lod: iterate component to collect avoid array
eb71aec27e  5       5       0       GOOD  LU-15786 tests: get maxage param on mds1 properly
9523e99046  5       5       0       GOOD  LU-15754 lfsck: skip an inode if iget() returns -ENOMEM |
| Comment by Andreas Dilger [ 15/Jun/22 ] |
|
Oleg had problems with v2 of this patch:
|