[LU-15788] lazystatfs + FOFB + mpich problems Created: 27/Apr/22  Updated: 15/Jun/22  Resolved: 11/Jun/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Major
Reporter: Alexander Boyko Assignee: Alexander Boyko
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Duplicate
is duplicated by LU-15457 IOR MPIIO job abort - file handling i... Closed
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

During FOFB tests with IOR and mpich we observed the following errors. I've created a timeline for the issue.

Using Time Stamp 1648109998 (0x623c29ae) for Data Signature  (03:19:58)
delaying 15 seconds . . .
 Commencing write performance test.
 Thu Mar 24 03:21:10 2022

 write     717.93     1048576    1024.00    0.113480   91.17      0.010149   91.28      3    XXCEL
 Verifying contents of the file(s) just written.
 Thu Mar 24 03:22:41 2022

 delaying 15 seconds . . .
 [RANK 000] open for reading file /lus/kjcf05/disk/ostest.vers/alsorun.20220324030303.12286.walleye-p5/CL_IOR_sel_32ovs_mpiio_wr_8iter_n64_1m.1.LKuc9T.1648109355/CL_IOR_sel_32ovs_mpiio_wr_8iter_n64_1m/IORfile_1m XXCEL
 Commencing read performance test.
 Thu Mar 24 03:23:27 2022

 read      2698.93    1048576    1024.00    0.030882   24.25      0.005629   24.28      3    XXCEL
 Using Time Stamp 1648110232 (0x623c2a98) for Data Signature (03:24:42)
 delaying 15 seconds . . . (~03:24:57)

Mar 24 03:24:51 kjcf05n03 kernel: Lustre: Failing over kjcf05-MDT0000

 ** error **
 ** error **
 ADIO_RESOLVEFILETYPE_FNCALL(387): Invalid file name /lus/kjcf05/disk/ostest.vers/alsorun.20220324030303.12286.walleye-p5/CL_IOR_sel_32ovs_mpiio_wr_8iter_n64_1m.1.LKuc9T.1648109355/CL_IOR_sel_32ovs_mpiio_wr_8iter_n64_1m/IORfile_1m, mpi_check_status: 939600165, mpi_check_status_errno: 107
 MPI File does not exist, error stack:
 (unknown)(): Invalid file name, mpi_check_status: 939600165, mpi_check_status_errno: 2

Rank 0 [Thu Mar 24 03:25:00 2022] [c3-0c0s12n0] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0


Mar 24 03:25:46 kjcf05n03 kernel: Lustre: server umount kjcf05-MDT0000 complete
Mar 24 03:25:46 kjcf05n03 kernel: md65: detected capacity change from 21009999921152 to 0
Mar 24 03:25:46 kjcf05n03 kernel: md: md65 stopped.
Mar 24 03:25:48 kjcf05n02 kernel: md: md65 stopped.
00000020:00000001:22.0:1648110350.625691:0:512728:0:(obd_mount_server.c:1352:server_start_targets()) Process entered
Mar 24 03:25:51 kjcf05n02 kernel: Lustre: kjcf05-MDT0000: Will be in recovery for at least 15:00, or until 24 clients reconnect

The failure comes from the following mpich code path:
MPI_File_open() -> ADIO_ResolveFileType() -> ADIO_FileSysType_fncall() -> statfs()

The VFS statfs path does a lookup for the file and then calls ll_statfs(). If the cluster loses the MDT between these two calls, ll_statfs() fails with one of EAGAIN, ENOTCONN, or ENODEV; the exact errno depends on the MDT failover stage. That error breaks the MPICH logic for detecting the FS type and fails the IOR run. The error does not happen with nolazystatfs, because ll_statfs() then blocks and waits for the MDT.
Lazystatfs was designed so that statfs does not block. However, OST failover does not produce an ll_statfs() error, because statfs then returns only MDT data and rc 0.
MPICH also has a workaround for the ESTALE error from NFS:

static void ADIO_FileSysType_fncall(const char *filename, int *fstype, int *error_code)
{
    int err;
    int64_t file_id;
    static char myname[] = "ADIO_RESOLVEFILETYPE_FNCALL";


/* NFS can get stuck and end up returning ESTALE "forever" */
#define MAX_ESTALE_RETRY 10000
    int retry_cnt;

    *error_code = MPI_SUCCESS;

    retry_cnt = 0;
    do {
        err = romio_statfs(filename, &file_id);
    } while (err && (errno == ESTALE) && retry_cnt++ < MAX_ESTALE_RETRY);

I suggest masking these errors as ESTALE in ll_statfs(). This would make MPICH happy with the lazystatfs option during FOFB.
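A minimal sketch of that masking idea (the helper name is hypothetical; the actual change was proposed in https://review.whamcloud.com/47152):

#include <errno.h>

/* Hypothetical sketch, not the real llite patch: remap the transient
 * MDT-failover errors returned by ll_statfs() to -ESTALE, so that
 * ROMIO's ESTALE retry loop shown above keeps retrying instead of
 * aborting the IOR job. */
static int statfs_mask_failover_error(int rc)
{
        if (rc == -EAGAIN || rc == -ENOTCONN || rc == -ENODEV)
                return -ESTALE;
        return rc;
}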



 Comments   
Comment by Gerrit Updater [ 27/Apr/22 ]

"Alexander Boyko <alexander.boyko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/47152
Subject: LU-15788 llite: statfs error masking
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e33e50b695eb77a877b65c1070df08398fc76a8d

Comment by Alexander Boyko [ 27/Apr/22 ]

adilger, could you take a look at the description? I've pushed the patch for discussion only; we have no agreement about the fix yet. This could also be fixed in the mpich library. I also want to mention that Lustre returns non-approved errors from the syscall, though ESTALE is also wrong based on the man pages. The whole usermode concept of detecting the FS type with a statfs call, especially for a distributed FS, brings me to tears.

 

Comment by Andreas Dilger [ 27/Apr/22 ]

Probably the obd_statfs() call for MDT0000 should not be lazy, since MDT0000 is required for filesystem operation. That should also avoid this problem, and be "more correct" for users as well - they will get some valid return rather than an error.
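A rough illustration of that suggestion (stub types and helpers below are hypothetical, not the real llite/lmv code): keep lazy statfs for the other targets, but always treat MDT0000 as non-lazy so its request blocks and is retried through failover.

struct tgt { int idx; };                        /* stub target descriptor */

static int tgt_statfs(struct tgt *t, int lazy)
{
        (void)t; (void)lazy;
        return 0;                               /* stub: would send a STATFS RPC */
}

/* Hypothetical sketch: MDT0000 (index 0) is required for filesystem
 * operation, so never query it lazily; a blocking request gets resent
 * by the ptlrpc layer through failover instead of returning an error. */
static int client_statfs_sketch(struct tgt *tgts, int ntgts, int lazy_opt)
{
        int i, rc;

        for (i = 0; i < ntgts; i++) {
                int lazy = lazy_opt && tgts[i].idx != 0;

                rc = tgt_statfs(&tgts[i], lazy);
                if (rc && !lazy)
                        return rc;
        }
        return 0;
}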

Comment by Mikhail Pershin [ 28/Apr/22 ]

Does it mean that turning lazystatfs off would remove the problem as well?

Comment by Alexander Boyko [ 28/Apr/22 ]

Yes, turning lazystatfs off makes ll_statfs() blocking. The ptlrpc layer handles the errors and resends the statfs request when MDT0 finishes recovery.

Comment by Andreas Dilger [ 28/Apr/22 ]

Mike, lazystatfs has been enabled by default for a long time. However, it should only apply to "lfs df" to return individual OST stats, not cause the whole statfs to fail. That is a bad interaction between STATFS_SUM (which only sends one RPC to one MDS) and lazystatfs (which allows individual RPCs to fail, but expects most of them to work).

I think the current patch is a reasonable compromise. It retries the STATFS_SUM multiple times to different MDTs (which shouldn't all be failing at the same time), and should also block (loop retrying) if all MDTs are down.
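A rough sketch of that retry idea (the stub descriptor and RPC helper are assumed, not the actual lmv change merged below):

#include <errno.h>

struct mdt { int index; };                      /* stub MDT descriptor */

static int mdt_statfs_rpc(struct mdt *m)        /* stub for the STATFS_SUM RPC */
{
        (void)m;
        return 0;
}

/* If the STATFS_SUM request to one MDT fails with a failover/recovery
 * error, try the next MDT before giving up; all MDTs are unlikely to
 * be failing over at the same time. */
static int statfs_sum_try_mdts(struct mdt *mdts, int count)
{
        int i, rc = -ENODEV;

        for (i = 0; i < count; i++) {
                rc = mdt_statfs_rpc(&mdts[i]);
                if (rc != -EAGAIN && rc != -ENOTCONN && rc != -ENODEV)
                        break;                  /* success or a non-failover error */
        }
        return rc;
}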

Comment by Gerrit Updater [ 11/Jun/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47152/
Subject: LU-15788 lmv: try another MDT if statfs failed
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 57f3262baa7d8931176a81cde05bc057facfc3b6

Comment by Peter Jones [ 11/Jun/22 ]

Landed for 2.15

Comment by Alex Zhuravlev [ 15/Jun/22 ]

With this patch landed I hit this failure almost 100% of the time:

PASS 150 (9s)
== recovery-small test complete, duration 4839 sec ======= 04:02:47 (1655265767)
rm: cannot remove '/mnt/lustre/d110h.recovery-small/target_dir/tgt_file': Input/output error
 recovery-small : @@@@@@ FAIL: remove sub-test dirs failed 
  Trace dump:
  = ./../tests/test-framework.sh:6522:error()
  = ./../tests/test-framework.sh:6006:check_and_cleanup_lustre()
  = recovery-small.sh:3306:main()

bisection:

COMMIT          TESTED  PASSED  FAILED  STATUS  COMMIT DESCRIPTION
a3cba2ead7      1       0       1       BAD     LU-13547 tests: remove ea_inode from mkfs MDT options
4c47900889      5       4       1       BAD     LU-12186 ec: add necessary structure member for EC file
b762319d5a      5       4       1       BAD     LU-14195 libcfs: test for nla_strscpy
57f3262baa      2       1       1       BAD     LU-15788 lmv: try another MDT if statfs failed
b00ac5f703      5       5       0       GOOD    LU-12756 lnet: Avoid redundant peer NI lookups
23028efcae      5       5       0       GOOD    LU-6864 osp: manage number of modify RPCs in flight
7f157f8ef3      5       5       0       GOOD    LU-15841 lod: iterate component to collect avoid array
eb71aec27e      5       5       0       GOOD    LU-15786 tests: get maxage param on mds1 properly
9523e99046      5       5       0       GOOD    LU-15754 lfsck: skip an inode if iget() returns -ENOMEM

Comment by Andreas Dilger [ 15/Jun/22 ]

Oleg had problems with v2 of this patch:

Oleg Drokin                                                                                05-29 20:21

Patch Set 2: Verified-1
This seems to introduce a 100% recovery-small timeout in janitor testing.
