  Lustre / LU-18416

Data corruption/miscompare observed during 48hr FOFB

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Affects Version/s: Lustre 2.16.0
    • Fix Version/s: None
    • Severity: 3

    Description

      Data corruption / miscompare observed during a 48-hour Lustre FOFB (failover/failback) test.

       [63] FAILED comparison of buffer containing 8-byte ints:
       [63]   File name = /lus/kjcf05/flash/ostest.vers/alsorun.20241024070601.93268.jupiter-p1/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.1.kNcCgv.1729810355/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m/IORfile_1m
       [63]   In transfer 129, 131072 errors between buffer indices 0 and 131071.
       [63]   File byte offset = 33957085184:
       [63]     Expected: 0x0000003f671ad176 0000000000000008 0000003f671ad176 0000000000000018 
       [63]     Actual:   0x6161616161616161 6161616161616161 6161616161616161 6161616161616161 
       ** error **
       ERROR in IOR.c (line 448): data check error, aborting execution.
       ERROR: No such file or directory
       ** exiting **
       Rank 63 [Thu Oct 24 18:07:48 2024] [c0-0c2s11n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 63
       _pmiu_daemon(SIGCHLD): [NID 00173] [c0-0c2s11n1] [Thu Oct 24 18:07:49 2024] PE RANK 63 exit signal Aborted
       [NID 00173] 2024-10-24 18:07:49 Apid 8319126: initiated application termination 

      IOR output around the first occurrence of the issue, with timestamps:

       read      5759       524288     1024.00    0.042503   5.65       0.000407   5.69       6    XXCEL
       Using Time Stamp 1729810806 (0x671ad176) for Data Signature
       delaying 15 seconds . . .
       Commencing write performance test.
       Thu Oct 24 18:00:25 2024
       
       WARNING: Expected aggregate file size       = 34359738368.
       WARNING: Stat() of aggregate file size      = 33957085184.
       WARNING: Using actual aggregate bytes moved = 34359738368.
       write     74.34      524288     1024.00    0.010731   440.78     0.002005   440.79     7    XXCEL
       Verifying contents of the file(s) just written.
       Thu Oct 24 18:07:45 2024
       
       [63] At transfer buffer #128, index #0 (file byte offset 33957085184):
       [63] Expected: 0x0000003f671ad176
       [63] Actual:   0x6161616161616161
       [63] At transfer buffer #128, index #1 (file byte offset 33957085192):
       [63] Expected: 0x0000000000000008
       [63] Actual:   0x6161616161616161
       [63] At transfer buffer #128, index #2 (file byte offset 33957085200): 

      FOFB operations around the time of the corruption:

      2024-10-24t17:45:15 Test 17  -- operation kill failover of kjcf05n03
      2024-10-24t18:11:14 Test 17  -- operation kill failback of kjcf05n03 

      Recovery after kjcf05n03 Failover:

      Oct 24 18:04:49 kjcf05n06 kernel: Lustre: kjcf05-MDT0001-lwp-OST0002: Connection restored to 10.16.100.52@o2ib (at 10.16.100.52@o2ib)
      Oct 24 18:05:31 kjcf05n02 kernel: LustreError: 1970850:0:(mdt_open.c:1442:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid  req@00000000cb79bd48 x1813706872336512/t0(6532655383690) o101->133a053d-2a7a-4972-bd0d-89152276f0a7@35@gni:341/0 lens 520/4576 e 0 to 0 dl 1729811181 ref 1 fl Interpret:/204/0 rc 0/0 job:'' uid:1356 gid:11121
      Oct 24 18:05:31 kjcf05n02 kernel: LustreError: 1970850:0:(mdt_handler.c:5119:mdt_intent_open()) @@@ Replay open failed with -14  req@00000000cb79bd48 x1813706872336512/t0(6532655383690) o101->133a053d-2a7a-4972-bd0d-89152276f0a7@35@gni:341/0 lens 520/608 e 0 to 0 dl 1729811181 ref 1 fl Interpret:/204/0 rc 0/0 job:'' uid:1356 gid:11121
      Oct 24 18:05:33 kjcf05n00 kernel: Lustre: kjcf05-MDT0001-mdc-ffff9734e40bd800: Connection restored to 10.16.100.52@o2ib (at 10.16.100.52@o2ib) 

      Recovery after kjcf05n03 Failback:

      Oct 24 18:27:05 kjcf05n03 kernel: LustreError: 52805:0:(mdt_open.c:1442:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid  req@00000000b48954bd x1813706872336512/t0(6532655383690) o101->133a053d-2a7a-4972-bd0d-89152276f0a7@35@gni:119/0 lens 520/4176 e 0 to 0 dl 1729812474 ref 1 fl Interpret:/204/0 rc 0/0 job:'' uid:1356 gid:11121
      Oct 24 18:27:05 kjcf05n03 kernel: LustreError: 52805:0:(mdt_handler.c:5119:mdt_intent_open()) @@@ Replay open failed with -14  req@00000000b48954bd x1813706872336512/t0(6532655383690) o101->133a053d-2a7a-4972-bd0d-89152276f0a7@35@gni:119/0 lens 520/608 e 0 to 0 dl 1729812474 ref 1 fl Interpret:/204/0 rc 0/0 job:'' uid:1356 gid:11121
      Oct 24 18:27:14 kjcf05n00 kernel: Lustre: kjcf05-MDT0001-mdc-ffff9734e40bd800: Connection restored to 10.16.100.53@o2ib (at 10.16.100.53@o2ib)
      Oct 24 18:27:14 kjcf05n02 kernel: Lustre: kjcf05-MDT0001-osp-MDT0000: Connection restored to 10.16.100.53@o2ib (at 10.16.100.53@o2ib) 

      Timestamps for the IOR job:

      <150>1 2024-10-24T17:52:40.433638-05:00 c0-0c0s1n1 aprun 25706 p1-20241023t024851 [alps_msgs@34] apid=8319126, Starting, user=1356, cmd_line="aprun -n 64 /hpcdc/project/craytest/ostest/binaries/xt/rel.70up03.aries.cray-sp2/xtcnl/ostest/ROOT.latest/tests/gold/ioperf/IOR/IOR -o /lus/kjcf05/flash/ostest.vers/alsorun.20241024070601.93268.jupiter-p1/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.1.kNcCgv.1729810355/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m/IORfile_1m -a MPIIO -i 8 -w -r -W -t 1m -b 512m -C -k -vv -q -d 15 ", num_nodes=2, node_list=172-173, cwd="/lus/kjcf05/flash/ostest.vers/alsorun.20241024070601.93268.jupiter-p1/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.1.kNcCgv.1729810355/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.25633", wdir="/lus/kjcf05/flash/ostest.vers/alsorun.20241024070601.93268.jupiter-p1/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.1.kNcCgv.1729810355/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.25633"
      <150>1 2024-10-24T18:07:53.597430-05:00 c0-0c0s1n1 apsys 25844 p1-20241023t024851 [alps_msgs@34] apid=8319126, Finishing, user=1356, exit_code=137, exitcode_array=134:0, exitsignal_array=9:0
      <142>1 2024-10-24T18:07:55.386375-05:00 c0-0c0s1n1 RUR 29745 p1-20241023t024851 [RUR@34] uid: 1356, apid: 8319126, jobid: 0, cmdname: /hpcdc/project/craytest/ostest/binaries/xt/rel.70up03.aries.cray-sp2/xtcnl/ostest/ROOT.latest/tests/gold/ioperf/IOR/IOR, plugin: taskstats {"core": 0, "exitcode:signal": 0, "rur_errors": 0, "rur_truncated_record_errors": 0, "max_rss": 0, "stime": 0, "wchar": 0, "rchar": 0, "utime": 0}
      <142>1 2024-10-24T18:07:55.386441-05:00 c0-0c0s1n1 RUR 29745 p1-20241023t024851 [RUR@34] uid: 1356, apid: 8319126, jobid: 0, cmdname: /hpcdc/project/craytest/ostest/binaries/xt/rel.70up03.aries.cray-sp2/xtcnl/ostest/ROOT.latest/tests/gold/ioperf/IOR/IOR, plugin: timestamp APP_START 2024-10-24T17:52:40CDT APP_STOP 2024-10-24T18:07:54CDT 
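
      For reference, the key IOR options in the aprun command above, interpreted per standard IOR usage (the binary here is a Cray-built IOR, so this is an assumption about its option handling), together with the size arithmetic behind the "Expected aggregate file size" warning:

       # -a MPIIO   : use the MPI-IO backend (the POSIX variant is discussed in the comments below)
       # -i 8       : 8 iterations
       # -w -r -W   : write, read, and verify (read back and check) each iteration
       # -t 1m      : 1 MiB transfer size
       # -b 512m    : 512 MiB block per task
       # -C         : reorder tasks so data is read back by a different task than wrote it
       # -k         : keep the test file
       # -d 15      : delay 15 seconds between tests (matches "delaying 15 seconds" above)
       # 64 tasks x 512 MiB per task = 32 GiB, matching the expected aggregate size:
       echo $((64 * 512 * 1024 * 1024))    # 34359738368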

      lfs getstripe output:

      ./CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m/IORfile_1m
         lcm_layout_gen:    7
         lcm_mirror_count:  1
         lcm_entry_count:   4
           lcme_id:             1
           lcme_mirror_id:      0
           lcme_flags:          init
           lcme_extent.e_start: 0
           lcme_extent.e_end:   131072
             lmm_stripe_count:  0
             lmm_stripe_size:   131072
             lmm_pattern:       mdt
             lmm_layout_gen:    65535
             lmm_stripe_offset: 0
       
           lcme_id:             2
           lcme_mirror_id:      0
           lcme_flags:          init
           lcme_extent.e_start: 131072
           lcme_extent.e_end:   134217728
             lmm_stripe_count:  1
             lmm_stripe_size:   131072
             lmm_pattern:       raid0
             lmm_layout_gen:    65535
             lmm_stripe_offset: 2
             lmm_pool:          flash
             lmm_objects:
             -   0: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44f1:0x0] }
       
           lcme_id:             4
           lcme_mirror_id:      0
           lcme_flags:          init
           lcme_extent.e_start: 134217728
           lcme_extent.e_end:   34359738368
             lmm_stripe_count:  32
             lmm_stripe_size:   131072
             lmm_pattern:       raid0,overstriped
             lmm_layout_gen:    65535
             lmm_stripe_offset: 2
             lmm_pool:          flash
             lmm_objects:
             -   0: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44f2:0x0] }
             -   1: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c414:0x0] }
             -   2: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44f3:0x0] }
             -   3: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c415:0x0] }
             -   4: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44f4:0x0] }
             -   5: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c416:0x0] }
             -   6: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44f5:0x0] }
             -   7: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c417:0x0] }
             -   8: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44f6:0x0] }
             -   9: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c418:0x0] }
             -  10: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44f7:0x0] }
             -  11: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c419:0x0] }
             -  12: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44f8:0x0] }
             -  13: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c41a:0x0] }
             -  14: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44f9:0x0] }
             -  15: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c41b:0x0] }
             -  16: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44fa:0x0] }
             -  17: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c41c:0x0] }
             -  18: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44fb:0x0] }
             -  19: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c41d:0x0] }
             -  20: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44fc:0x0] }
             -  21: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c41e:0x0] }
             -  22: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44fd:0x0] }
             -  23: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c41f:0x0] }
             -  24: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44fe:0x0] }
             -  25: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c420:0x0] }
             -  26: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44ff:0x0] }
             -  27: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c421:0x0] }
             -  28: { l_ost_idx:   2, l_fid: [0x300007930:0xfa4500:0x0] }
             -  29: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c422:0x0] }
             -  30: { l_ost_idx:   2, l_fid: [0x300007930:0xfa4501:0x0] }
             -  31: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c423:0x0] }
       
           lcme_id:             5
           lcme_mirror_id:      0
           lcme_flags:          extension
           lcme_extent.e_start: 34359738368
           lcme_extent.e_end:   EOF
             lmm_stripe_count:  0
             lmm_extension_size: 67108864
             lmm_pattern:       raid0,overstriped
             lmm_layout_gen:    0
             lmm_stripe_offset: -1
             lmm_pool:          flash 

      debugfs stat:

      [root@kjcf05n03 ~]# debugfs -c -R "stat <481925412>" /dev/md66
      debugfs 1.46.5.cr1 (16-Jan-2023)
      Inode: 481925412  Type: regular  Mode: 0644  Flags: 0x80000
      Generation: 2621827007  Version: 0x000005f9:19d97c78
      User: 1356  Group: 11121  Project:   0  Size: 131072
      File ACL: 620866789
      Links: 1  Blockcount: 264
      Fragment: Address: 0  Number: 0  Size: 0
       ctime: 0x671ad2fa:00000000 -- Thu Oct 24 18:06:34 2024
       atime: 0x671f95bc:00000000 -- Mon Oct 28 08:46:36 2024
       mtime: 0x671ad2fa:00000000 -- Thu Oct 24 18:06:34 2024
      crtime: 0x671ad189:07bed478 -- Thu Oct 24 18:00:25 2024
      Size of extra inode fields: 32
      Extended attributes:
       lma: fid=[0x2403375e4:0x14053:0x0] compat=0 incompat=0
       linkea: idx=0 parent=[0x240337296:0x454:0x0] name='IORfile_1m'
       trusted.dataver (8) = 78 7c d9 19 f9 05 00 00 
       trusted.som (24) = 04 00 00 00 00 00 00 00 00 00 00 e8 07 00 00 00 10 01 f4 03 00 00 00 00 
       trusted.lov (1216)
      EXTENTS:
      (0-31):671298939-671298970
      
      [root@kjcf05n03 ~]#
      jupiter-p1:/lus/kjcf05/flash/ostest.vers/alsorun.20241024070601.93268.jupiter-p1/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.1.kNcCgv.1729810355/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1
      m # lfs path2fid /lus/kjcf05/flash/ostest.vers/alsorun.20241024070601.93268.jupiter-p1/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.1.kNcCgv.1729810355/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m/IORfile_1m
      [0x2403375e4:0x14053:0x0]
      jupiter-p1:/lus/kjcf05/flash/ostest.vers/alsorun.20241024070601.93268.jupiter-p1/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.1.kNcCgv.1729810355/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1
      m # 
      
      [root@kjcf05n03 ~]# debugfs -c -R " dump <481925412> /tmp/mdt.IORfile.1024" /dev/md66
      debugfs 1.46.5.cr1 (16-Jan-2023)
      [root@kjcf05n03 ~]# hexdump -Cv /tmp/mdt.IORfile.1024 | head -n3
      00000000  76 d1 1a 67 00 00 00 00  08 00 00 00 00 00 00 00  |v..g............|
      00000010  76 d1 1a 67 00 00 00 00  18 00 00 00 00 00 00 00  |v..g............|
      00000020  76 d1 1a 67 00 00 00 00  28 00 00 00 00 00 00 00  |v..g....(.......|
      [root@kjcf05n03 ~]# hexdump -Cv /tmp/mdt.IORfile.1024 | grep -v '76 d1 1a 67'
      00020000
      [root@kjcf05n03 ~]#
      

      Every data line of the dump carries the expected signature bytes (76 d1 1a 67); the only non-matching line is hexdump's final offset marker, 0x00020000 = 131072 bytes = 128 KiB, i.e. the full size of the DoM component.
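
      The offset and signature arithmetic can be double-checked with standard shell tools (a quick sketch, nothing Lustre-specific):

       # hexdump's end marker 0x20000 is the MDT object size: 131072 bytes = 128 KiB,
       # matching the 0-128K DoM component and the inode Size shown by debugfs above.
       echo $((0x20000))           # 131072
       echo $((0x20000 / 1024))    # 128

       # The IOR data signature 0x671ad176 is the write-phase timestamp
       # ("Using Time Stamp 1729810806 (0x671ad176) for Data Signature" above);
       # stored little-endian it appears in the hexdump as the byte sequence 76 d1 1a 67.
       echo $((0x671ad176))        # 1729810806
       date -u -d @1729810806      # 2024-10-24 23:00:06 UTC, i.e. 18:00:06 CDT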

      Attachments

        1. alternate_posix_signature_Command.out15537.txt
          1.39 MB
        2. bad_CL_IOR_b256m_sel_all_fpp_mpiio_wr_8iter_1m_LUS_12584_20241107005243.txt
          286 kB
        3. bad_dklog.nid00037.20241107005243.llog.gz
          57.30 MB
        4. bad_dklog.nid00037.20241107005243.llog-1.gz
          57.30 MB
        5. bad_dklog.nid00038.20241107005243.llog.gz
          46.37 MB
        6. bad_dklog.nid00038.20241107005243.llog-1.gz
          46.37 MB
        7. bad_dklog.nid00039.20241107005243.llog.gz
          37.98 MB
        8. bad_dklog.nid00039.20241107005243.llog-1.gz
          37.98 MB
        9. CL_IOR_b256m_sel_all_fpp_mpiio_wr_8iter_1m_LUS_12584.txt
          297 kB
        10. CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.txt
          207 kB
        11. CL_IOR_sel_32ovs_all_mpiio_wr_8iter_1m.txt
          206 kB
        12. client_dk.tar.gz
          153.44 MB
        13. client_dk.tar-1.gz
          153.44 MB
        14. client_dmesg.tar.gz
          54 kB
        15. client_dmesg.tar-1.gz
          54 kB
        16. client_logs_20241111.tar.gz
          185.82 MB
        17. data.20241108.tar.gz
          369.31 MB
        18. dmesg.log
          150 kB
        19. good_CL_IOR_b256m_sel_all_fpp_mpiio_wr_8iter_1m_LUS_12584_20241106181830_output.txt
          201 kB
        20. good_CL_IOR_b256m_sel_all_fpp_mpiio_wr_8iter_1m_LUS_12584_202411077072444.txt
          7 kB
        21. good_dklog.nid00172_202411077072444.log.tar.gz
          91.56 MB
        22. good_dklog.nid00172.20241106181830.llog.gz
          82.55 MB
        23. good_dklog.nid00173_202411077072444.log.tar.gz
          91.21 MB
        24. good_dklog.nid00173.20241106181830.llog.gz
          83.21 MB
        25. kjcf05_llog.tar.gz
          581 kB
        26. kjcf05n02_llog.tar.gz
          68 kB
        27. kjcf05n03_llog.tar.gz
          56 kB
        28. mds_mgs_dklog.tar.gz
          3.03 MB
        29. mds_mgs_dklog.tar-1.gz
          3.03 MB
        30. mds_mgs_dmesg.tar.gz
          70 kB
        31. mgs_mds_logs_20241111.tar.gz
          11.69 MB
        32. README.lu18416
          41 kB
        33. test_output_20241111.txt
          298 kB

        Issue Links

          Activity

            [LU-18416] Data corruption/miscompare observed during 48hr FOFB
            spitzcor Cory Spitz added a comment -

            > It looks like D_HA is not enabled on those clients - it is by default in master/2.16 but I guess HPE turns it off?
            paf, no, bzzz asked for a debug mask without D_HA:

            please try to collect again with debug=vfstrace (not +vfstrace)

            Sure, we can add D_HA to clients and +layout on the MDS.

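            For reference, the debug settings being discussed here map onto lctl set_param roughly as follows (a minimal sketch; the exact masks to use are whatever is agreed in this thread):

            # Clients: replace the debug mask entirely (what bzzz asked for) ...
            lctl set_param debug=vfstrace
            # ... or add D_HA on top of the existing mask (what paf asked for);
            # '+flag' adds to the mask, '=flag' replaces it:
            lctl set_param debug=+ha

            # MDS: add layout debugging so SEL/PFL layout changes are logged:
            lctl set_param debug=+layout

            # Dump the kernel debug buffer to a file for attachment:
            lctl dk /tmp/lustre-debug.$(hostname).log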
            bzzz Alex Zhuravlev added a comment - - edited

            > It looks like D_HA is not enabled on those clients - it is by default in master/2.16 but I guess HPE turns it off? - so Alex's additional debug from lov_io.c was not captured.

            I tried to enable full debugging on the condition in the patch, but it didn't help, given that the test needs a few runs to reproduce.


            paf Patrick Farrell (Inactive) added a comment -

            Can we please run this with +layout debug enabled on the MDS so we can see what's happening in SEL?

            paf Patrick Farrell (Inactive) added a comment -

            It looks like D_HA is not enabled on those clients - it is by default in master/2.16 but I guess HPE turns it off? - so Alex's additional debug from lov_io.c was not captured. Could you please enable +ha in the debug on the client?

            bzzz Alex Zhuravlev added a comment -

            I can add an LASSERT for the case, but I'm not sure whether you'll be able to fetch Lustre logs afterwards; that would be an exact catch, though.
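
            On the "fetch lustre logs after" concern: whether the debug buffer survives an LASSERT depends on panic behaviour. One option, sketched here with standard lctl knobs (not something prescribed anywhere in this ticket), is to disable the panic and/or stream the debug log to disk:

            # Keep the node from panicking on LBUG/LASSERT so the debug buffer
            # can still be dumped afterwards:
            lctl set_param panic_on_lbug=0

            # Or stream the debug log continuously to a file (size in MB) so it
            # survives even if the node goes down:
            lctl debug_daemon start /var/tmp/lustre-debug.bin 1024
            # ... reproduce ...
            lctl debug_daemon stop
            lctl debug_file /var/tmp/lustre-debug.bin /var/tmp/lustre-debug.txt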

            "
            so this might also provide another avenue for debugging the issue if it's easier to debug the POSIX than the MPIIO. This is probably a little earlier / less misleading since the root cause of the MPIIO "miscompare" may just be this partial write that retries many times and then returns 0 as you noted causing IOR to think it was successful and then discover it later during a read and declare it a miscompare.
            "
            IOR wouldn't consider a return of 0 to be success - that's what it's complaining about elsewhere here.

            paf Patrick Farrell (Inactive) added a comment - " so this might also provide another avenue for debugging the issue if it's easier to debug the POSIX than the MPIIO. This is probably a little earlier / less misleading since the root cause of the MPIIO "miscompare" may just be this partial write that retries many times and then returns 0 as you noted causing IOR to think it was successful and then discover it later during a read and declare it a miscompare. " IOR wouldn't consider a return of 0 to be success - that's what it's complaining about elsewhere here.
            prasannakumar Prasannakumar Nagasubramani added a comment - - edited

            bzzz Reproduced the issue with the https://review.whamcloud.com/c/fs/lustre-release/+/56926 patch on the clients. Attaching logs collected with the recommended debug settings (vfstrace): mds_mgs_dmesg.tar.gz mds_mgs_dklog.tar.gz client_dmesg.tar.gz client_dk.tar.gz kjcf05n02_llog.tar.gz kjcf05n03_llog.tar.gz CL_IOR_b256m_sel_all_fpp_mpiio_wr_8iter_1m_LUS_12584.txt

            bzzz Alex Zhuravlev added a comment - - edited

            > However, when we just removed the setstripe entirely and used the default layout, e.g., we can no longer reproduce the problem.

            Exactly, this is what I asked for above when I mentioned a "more regular layout" - a good sign.

            > Prior to the bisect we tried a few other things that people have suggested and one of those was just to use the default layout; so far we have not been able to reproduce the problem that way, so that leads me to suspect that PFL/SEL is related somehow.

            Yes, the retries I described are due to the changing PFL layout.

            Please try to reproduce with https://review.whamcloud.com/c/fs/lustre-release/+/56926 and collect debug logs from the client. debug=vfstrace is fine; the patch raises the debugging level internally during retries, so hopefully we will get detailed info.
            jschwartz Josh Schwartz added a comment - - edited

            I think the following looks extremely suspicious:

            > ...

            > basically we try to restart write 1000 times with no success (because of losing layout lock, but this is file per process and no cancel/enqueue for LL has happened) and just exit with result=0 finally.
            > I'm not very familiar with this code, still digging.. any advice is welcome.

            This caught my attention because one of the other things we tried was running this same test with POSIX rather than MPIIO. We were unable to reproduce the miscompare signature and concluded that the problem was MPIIO-specific, but we have also been dealing with a number of other (presumably) unrelated failures in these tests that generate a lot of noise. Going back through the prior results in more detail, I noticed that the POSIX runs had failures that were not miscompares but a different case, where IOR failed after exhausting retries much like you describe above, which I wanted to point out:

             WARNING: Task 63 requested transfer of 1048576 bytes,
                      but transferred 0 bytes at offset 33957085184
             WARNING: This file system requires support of partial write()s.
             WARNING: Requested xfer of 1048576 bytes, but xferred 0 bytes
             Only transferred 0 of 1048576 bytes
            

            and it eventually fails with:

            WARNING: Task 63 requested transfer of 1048576 bytes,
                      but transferred 0 bytes at offset 33957085184
             ** error **
             ERROR in aiori-POSIX.c (line 266): too many retries -- aborting.
             ERROR: No such file or directory
             ** exiting **
            

            See the attached complete log: alternate_posix_signature_Command.out15537.txt

            So this might also provide another avenue for debugging the issue if it's easier to debug the POSIX than the MPIIO. This is probably a little earlier / less misleading, since the root cause of the MPIIO "miscompare" may just be this partial write that retries many times and then returns 0 as you noted, causing IOR to think it was successful, only to discover it later during a read and declare it a miscompare.

            jschwartz Josh Schwartz added a comment - - edited

            Prior to the bisect we tried a few other things that people have suggested and one of those was just to use the default layout; so far we have not been able to reproduce the problem that way, so that leads me to suspect that PFL/SEL is related somehow.

            Here are the layouts we have tried:

            original failure (DoM+SEL+OVS):

            lfs setstripe --component-end 128K -L mdt --component-end 128M --extension-size 64M --component-end -1 --overstripe-count 32 --extension-size 64M $dirname 

            removed DoM, still failed:

            lfs setstripe --component-end 128M --extension-size 64M --component-end -1 --overstripe-count 32 --extension-size 64M $dirname 

            removed OVS, still failed:

            lfs setstripe --component-end 128M --extension-size 64M --component-end -1 --extension-size 64M $dirname 

            However, when we just removed the setstripe entirely and used the default layout, e.g.

             vers@jupiter-p1:/lus/kjcf05/flash/jbs/rerun/CL_IOR_b256m_1s_fpp_mpiio_wr_8iter_1m_LUS_12584> lfs getstripe .
            .
            stripe_count:  1 stripe_size:   4194304 pattern:       raid0 stripe_offset: -1 pool:          flash
            
            ./IORfile_1m.00000059
            lmm_stripe_count:  1
            lmm_stripe_size:   4194304
            lmm_pattern:       raid0
            lmm_layout_gen:    0
            lmm_stripe_offset: 3
            lmm_pool:          flash
                    obdidx           objid           objid           group
                         3        14734531       0xe0d4c3      0x3400088d1
            
            ...

            we can no longer reproduce the problem.

             

            You requested:

            2) with a simpler layout like a single component, or a few components, but no extension?
            thanks in advance.

            Can you provide a specific layout you would like to see that you suspect would still hit the problem but be sufficiently simple to debug?

            We can use up to four OSTs: two are flash and two are disk. The problem seems to occur more often/readily on the disk OSTs, so a layout limited to those two OSTs will likely yield the best reproducibility:

            Pool: kjcf05.all
            kjcf05-OST0000_UUID
            kjcf05-OST0001_UUID
            kjcf05-OST0002_UUID
            kjcf05-OST0003_UUID
            Pool: kjcf05.disk
            kjcf05-OST0000_UUID
            kjcf05-OST0001_UUID
            ...
            Pool: kjcf05.flash
            kjcf05-OST0002_UUID
            kjcf05-OST0003_UUID
            
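            Purely for illustration (this is not a layout proposed anywhere in the ticket): a simplified two-component PFL layout pinned to the disk pool, with no DoM and no SEL extension components, could be expressed as follows; the pool name is taken from the listing above and the component boundary is arbitrary:

            # Plain PFL, no DoM, no SEL, restricted to the two disk OSTs:
            lfs setstripe -E 128M -c 1 -p kjcf05.disk \
                          -E -1   -c 2 -p kjcf05.disk \
                          $dirname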

            bzzz Alex Zhuravlev added a comment -

            prasannakumar: next time, please also grab dmesg logs from the MDSs.

            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: prasannakumar Prasannakumar Nagasubramani
              Votes: 0
              Watchers: 13

              Dates

                Created:
                Updated:
                Resolved: