    Description

      During FOFB (failover/failback) tests with IOR and MPICH we observed the following errors. I've put together a timeline for the issue.

      Using Time Stamp 1648109998 (0x623c29ae) for Data Signature  (03:19:58)
      delaying 15 seconds . . .
       Commencing write performance test.
       Thu Mar 24 03:21:10 2022
      
       write     717.93     1048576    1024.00    0.113480   91.17      0.010149   91.28      3    XXCEL
       Verifying contents of the file(s) just written.
       Thu Mar 24 03:22:41 2022
      
       delaying 15 seconds . . .
       [RANK 000] open for reading file /lus/kjcf05/disk/ostest.vers/alsorun.20220324030303.12286.walleye-p5/CL_IOR_sel_32ovs_mpiio_wr_8iter_n64_1m.1.LKuc9T.1648109355/CL_IOR_sel_32ovs_mpiio_wr_8iter_n64_1m/IORfile_1m XXCEL
       Commencing read performance test.
       Thu Mar 24 03:23:27 2022
      
       read      2698.93    1048576    1024.00    0.030882   24.25      0.005629   24.28      3    XXCEL
       Using Time Stamp 1648110232 (0x623c2a98) for Data Signature (03:24:42)
       delaying 15 seconds . . . (~03:24:57)
      
      Mar 24 03:24:51 kjcf05n03 kernel: Lustre: Failing over kjcf05-MDT0000
      
       ** error **
       ** error **
       ADIO_RESOLVEFILETYPE_FNCALL(387): Invalid file name /lus/kjcf05/disk/ostest.vers/alsorun.20220324030303.12286.walleye-p5/CL_IOR_sel_32ovs_mpiio_wr_8iter_n64_1m.1.LKuc9T.1648109355/CL_IOR_sel_32ovs_mpiio_wr_8iter_n64_1m/IORfile_1m, mpi_check_status: 939600165, mpi_check_status_errno: 107
       MPI File does not exist, error stack:
       (unknown)(): Invalid file name, mpi_check_status: 939600165, mpi_check_status_errno: 2
      
      Rank 0 [Thu Mar 24 03:25:00 2022] [c3-0c0s12n0] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
      
      
      Mar 24 03:25:46 kjcf05n03 kernel: Lustre: server umount kjcf05-MDT0000 complete
      Mar 24 03:25:46 kjcf05n03 kernel: md65: detected capacity change from 21009999921152 to 0
      Mar 24 03:25:46 kjcf05n03 kernel: md: md65 stopped.
      Mar 24 03:25:48 kjcf05n02 kernel: md: md65 stopped.
      00000020:00000001:22.0:1648110350.625691:0:512728:0:(obd_mount_server.c:1352:server_start_targets()) Process entered
      Mar 24 03:25:51 kjcf05n02 kernel: Lustre: kjcf05-MDT0000: Will be in recovery for at least 15:00, or until 24 clients reconnect
      

      The failure comes from the following MPICH code path:
      MPI_File_open() -> ADIO_ResolveFileType() -> ADIO_FileSysType_fncall() -> statfs()

      The VFS statfs path first does a lookup of the file and then calls ll_statfs(). If the cluster loses the MDT between these two calls, ll_statfs() fails with one of EAGAIN, ENOTCONN, or ENODEV; the exact errno depends on the stage of the MDT failover. The error breaks MPICH's filesystem-type detection logic and fails the IOR run. The error does not happen with nolazystatfs, because then ll_statfs() blocks and waits for the MDT.
      Lazystatfs was designed so that statfs does not block. Note that OST failover does not produce an ll_statfs() error, because statfs returns only the MDT data with rc 0.
      MPICH already has a workaround for the ESTALE error from NFS:

      static void ADIO_FileSysType_fncall(const char *filename, int *fstype, int *error_code)
      {
          int err;
          int64_t file_id;
          static char myname[] = "ADIO_RESOLVEFILETYPE_FNCALL";
      
      
      /* NFS can get stuck and end up returning ESTALE "forever" */
      #define MAX_ESTALE_RETRY 10000
          int retry_cnt;
      
          *error_code = MPI_SUCCESS;
      
          retry_cnt = 0;
          do {
              err = romio_statfs(filename, &file_id);
          } while (err && (errno == ESTALE) && retry_cnt++ < MAX_ESTALE_RETRY);
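          /* ... error handling and file-system type selection follow;
           * excerpt truncated (MPICH ROMIO, adio/common/ad_fstype.c) ... */
      }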
      

      I'm suggesting masking these errors to ESTALE in ll_statfs(). That would make MPICH happy with the lazystatfs option under FOFB.
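
      For illustration only, a minimal sketch of that masking (the helper name and its placement are assumptions, for discussion, not an actual patch):

      #include <linux/errno.h>

      /*
       * Illustration: map the transient MDT-failover errors seen by
       * ll_statfs() to ESTALE, so that MPICH's existing NFS retry loop
       * shown above absorbs them.  The helper is hypothetical.
       */
      static int ll_statfs_mask_failover_error(int rc)
      {
          switch (rc) {
          case -EAGAIN:   /* import is reconnecting */
          case -ENOTCONN: /* connection to the MDT was lost */
          case -ENODEV:   /* target went away during failover */
              return -ESTALE; /* retried by MPICH's ESTALE loop */
          default:
              return rc;
          }
      }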

          Activity

            [LU-15788] lazystatfs + FOFB + mpich problems
            adilger Andreas Dilger added a comment - - edited

            Oleg had problems with v2 of this patch:

            Oleg Drokin  05-29 20:21

            Patch Set 2: Verified-1
            This seems to introduce a 100% recovery-small timeout in janitor testing.


            bzzz Alex Zhuravlev added a comment -

            with this patch landed I hit almost 100%:

            PASS 150 (9s)
            == recovery-small test complete, duration 4839 sec ======= 04:02:47 (1655265767)
            rm: cannot remove '/mnt/lustre/d110h.recovery-small/target_dir/tgt_file': Input/output error
             recovery-small : @@@@@@ FAIL: remove sub-test dirs failed 
              Trace dump:
              = ./../tests/test-framework.sh:6522:error()
              = ./../tests/test-framework.sh:6006:check_and_cleanup_lustre()
              = recovery-small.sh:3306:main()
            

            bisection:

            COMMIT          TESTED  PASSED  FAILED  STATUS  COMMIT DESCRIPTION
            a3cba2ead7      1       0       1       BAD     LU-13547 tests: remove ea_inode from mkfs MDT options
            4c47900889      5       4       1       BAD     LU-12186 ec: add necessary structure member for EC file
            b762319d5a      5       4       1       BAD     LU-14195 libcfs: test for nla_strscpy
            57f3262baa      2       1       1       BAD     LU-15788 lmv: try another MDT if statfs failed
            b00ac5f703      5       5       0       GOOD    LU-12756 lnet: Avoid redundant peer NI lookups
            23028efcae      5       5       0       GOOD    LU-6864 osp: manage number of modify RPCs in flight
            7f157f8ef3      5       5       0       GOOD    LU-15841 lod: iterate component to collect avoid array
            eb71aec27e      5       5       0       GOOD    LU-15786 tests: get maxage param on mds1 properly
            9523e99046      5       5       0       GOOD    LU-15754 lfsck: skip an inode if iget() returns -ENOMEM
            
            pjones Peter Jones added a comment -

            Landed for 2.15


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47152/
            Subject: LU-15788 lmv: try another MDT if statfs failed
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 57f3262baa7d8931176a81cde05bc057facfc3b6

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47152/ Subject: LU-15788 lmv: try another MDT if statfs failed Project: fs/lustre-release Branch: master Current Patch Set: Commit: 57f3262baa7d8931176a81cde05bc057facfc3b6

            adilger Andreas Dilger added a comment -

            Mike, lazystatfs has been enabled by default for a long time. However, it should only apply to "lfs df" returning individual OST stats, not cause the whole statfs to fail. That is a bad interaction between STATFS_SUM (which only sends one RPC to one MDS) and lazystatfs (which allows individual RPCs to fail, but expects most of them to work).

            I think the current patch is a reasonable compromise. It retries the STATFS_SUM multiple times to different MDTs (which shouldn't all be failing at the same time), and should also block (loop retrying) if all MDTs are down.
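
            A rough sketch of that retry behavior, for illustration (the helper and structure are assumptions, not the actual lmv code):

            #include <errno.h>
            #include <unistd.h>

            /* Hypothetical stand-in for one STATFS_SUM RPC to MDT 'idx'. */
            int statfs_one_mdt(int idx);

            /*
             * Try each MDT in turn; keep looping (i.e. block) while all of
             * them fail, so a single MDT failover never surfaces an error
             * to statfs() callers.
             */
            int statfs_sum_with_retry(int mdt_count)
            {
                for (;;) {
                    for (int idx = 0; idx < mdt_count; idx++) {
                        int rc = statfs_one_mdt(idx);
                        if (rc == 0)
                            return 0;   /* one healthy MDT is enough */
                        if (rc != -EAGAIN && rc != -ENOTCONN &&
                            rc != -ENODEV && rc != -ESTALE)
                            return rc;  /* a real error, do not mask it */
                    }
                    sleep(1);           /* all MDTs down: wait and retry */
                }
            }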


            aboyko Alexander Boyko added a comment -

            Yeap, lazystatfs off makes ll_statfs() blocking. The ptlrpc layer handles the errors and resends the statfs request once MDT0 finishes recovery.


            tappro Mikhail Pershin added a comment -

            Does it mean that turning lazystatfs off would remove the problem as well?


            adilger Andreas Dilger added a comment -

            Probably the obd_statfs() call for MDT0000 should not be lazy, since MDT0000 is required for filesystem operation. That would also avoid this problem, and be "more correct" for users as well: they will get some valid return rather than an error.
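
            For illustration, that idea might look roughly like this (the helper is hypothetical; OBD_STATFS_NODELAY is the flag the lazystatfs path uses, though its value here is assumed):

            #define OBD_STATFS_NODELAY 0x0001   /* value assumed for sketch */

            /* Hypothetical per-target helper standing in for obd_statfs(). */
            int statfs_one_mdt_flags(int idx, unsigned int flags);

            /*
             * Send lazy (nodelay) statfs to every MDT except MDT0000, which
             * the filesystem cannot run without, so its statfs blocks
             * through failover instead of returning an error.
             */
            int statfs_all_mdts(int mdt_count, int lazystatfs)
            {
                int rc = 0;

                for (int idx = 0; idx < mdt_count; idx++) {
                    unsigned int flags =
                        (lazystatfs && idx != 0) ? OBD_STATFS_NODELAY : 0;

                    rc = statfs_one_mdt_flags(idx, flags);
                    if (rc != 0 && idx == 0)
                        return rc;      /* MDT0000 failure is fatal */
                }
                return rc;
            }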


            aboyko Alexander Boyko added a comment -

            adilger, could you take a look at the description? I've pushed the patch for discussion only; we have no agreement on the fix yet. This could also be fixed in the MPICH library. I also want to mention that Lustre returns errors that are not sanctioned for the statfs syscall, though ESTALE is also wrong based on the man pages. The whole usermode concept of detecting the FS type with a statfs call, especially for a distributed FS, brings me to tears.


            "Alexander Boyko <alexander.boyko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/47152
            Subject: LU-15788 llite: statfs error masking
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e33e50b695eb77a877b65c1070df08398fc76a8d

            gerrit Gerrit Updater added a comment - "Alexander Boyko <alexander.boyko@hpe.com>" uploaded a new patch: https://review.whamcloud.com/47152 Subject: LU-15788 llite: statfs error masking Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: e33e50b695eb77a877b65c1070df08398fc76a8d

            People

              Assignee: Alexander Boyko (aboyko)
              Reporter: Alexander Boyko (aboyko)
              Votes: 0
              Watchers: 6
