Lustre / LU-18416

Data corruption/miscompare observed during 48hr FOFB

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Affects Version/s: Lustre 2.16.0

    Description

      Data corruption/miscompare observed during a 48-hour Lustre FOFB (failover/failback) test.

       [63] FAILED comparison of buffer containing 8-byte ints:
       [63]   File name = /lus/kjcf05/flash/ostest.vers/alsorun.20241024070601.93268.jupiter-p1/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.1.kNcCgv.1729810355/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m/IORfile_1m
       [63]   In transfer 129, 131072 errors between buffer indices 0 and 131071.
       [63]   File byte offset = 33957085184:
       [63]     Expected: 0x0000003f671ad176 0000000000000008 0000003f671ad176 0000000000000018 
       [63]     Actual:   0x6161616161616161 6161616161616161 6161616161616161 6161616161616161 
       ** error **
       ERROR in IOR.c (line 448): data check error, aborting execution.
       ERROR: No such file or directory
       ** exiting **
       Rank 63 [Thu Oct 24 18:07:48 2024] [c0-0c2s11n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 63
       _pmiu_daemon(SIGCHLD): [NID 00173] [c0-0c2s11n1] [Thu Oct 24 18:07:49 2024] PE RANK 63 exit signal Aborted
       [NID 00173] 2024-10-24 18:07:49 Apid 8319126: initiated application termination 
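      For context, the miscompare can be reproduced by hand from the log values. A minimal sketch (the even/odd word pattern below is inferred from the Expected values printed in this log, not taken from the IOR source):

```python
import struct

TIMESTAMP = 0x671ad176   # "Using Time Stamp 1729810806 (0x671ad176)" below
RANK = 63                # the rank that reported the miscompare

def expected_word(rank, timestamp, index):
    """Pattern inferred from this log's Expected values: even 8-byte
    words carry (rank << 32) | timestamp, odd words carry the byte
    offset of the word within the transfer buffer."""
    if index % 2 == 0:
        return (rank << 32) | timestamp
    return index * 8

# Reproduce the first four Expected values from the failure output.
print([f"0x{expected_word(RANK, TIMESTAMP, i):016x}" for i in range(4)])
# → ['0x0000003f671ad176', '0x0000000000000008',
#    '0x0000003f671ad176', '0x0000000000000018']

# The Actual values were 0x6161616161616161: a buffer full of ASCII 'a'.
print(struct.pack("<Q", 0x6161616161616161))  # b'aaaaaaaa'
```

Note that the timestamp word 0x671ad176 is simply the Unix time 1729810806 reported by IOR as the data signature.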

      First occurrence of the issue, with the data-signature timestamp:

       read      5759       524288     1024.00    0.042503   5.65       0.000407   5.69       6    XXCEL
       Using Time Stamp 1729810806 (0x671ad176) for Data Signature
       delaying 15 seconds . . .
       Commencing write performance test.
       Thu Oct 24 18:00:25 2024
       
       WARNING: Expected aggregate file size       = 34359738368.
       WARNING: Stat() of aggregate file size      = 33957085184.
       WARNING: Using actual aggregate bytes moved = 34359738368.
       write     74.34      524288     1024.00    0.010731   440.78     0.002005   440.79     7    XXCEL
       Verifying contents of the file(s) just written.
       Thu Oct 24 18:07:45 2024
       
       [63] At transfer buffer #128, index #0 (file byte offset 33957085184):
       [63] Expected: 0x0000003f671ad176
       [63] Actual:   0x6161616161616161
       [63] At transfer buffer #128, index #1 (file byte offset 33957085192):
       [63] Expected: 0x0000000000000008
       [63] Actual:   0x6161616161616161
       [63] At transfer buffer #128, index #2 (file byte offset 33957085200): 
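      The numbers in the warnings above are self-consistent: the stat()'d size is exactly 384 MiB short of the expected aggregate, and the first miscompare lands exactly at the stat()'d size, i.e. in the region past the short write. A quick check of the arithmetic:

```python
expected_size = 34359738368   # "Expected aggregate file size"
stat_size     = 33957085184   # "Stat() of aggregate file size"
first_bad     = 33957085184   # file byte offset of the first miscompare

shortfall = expected_size - stat_size
print(shortfall, "bytes =", shortfall // (1 << 20), "MiB")  # 402653184 bytes = 384 MiB
assert first_bad == stat_size  # corruption begins exactly where the file ends
```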

      FOFB operations around the corruption window:

      2024-10-24t17:45:15 Test 17  -- operation kill failover of kjcf05n03
      2024-10-24t18:11:14 Test 17  -- operation kill failback of kjcf05n03 

      Recovery after kjcf05n03 Failover:

      Oct 24 18:04:49 kjcf05n06 kernel: Lustre: kjcf05-MDT0001-lwp-OST0002: Connection restored to 10.16.100.52@o2ib (at 10.16.100.52@o2ib)
      Oct 24 18:05:31 kjcf05n02 kernel: LustreError: 1970850:0:(mdt_open.c:1442:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid  req@00000000cb79bd48 x1813706872336512/t0(6532655383690) o101->133a053d-2a7a-4972-bd0d-89152276f0a7@35@gni:341/0 lens 520/4576 e 0 to 0 dl 1729811181 ref 1 fl Interpret:/204/0 rc 0/0 job:'' uid:1356 gid:11121
      Oct 24 18:05:31 kjcf05n02 kernel: LustreError: 1970850:0:(mdt_handler.c:5119:mdt_intent_open()) @@@ Replay open failed with -14  req@00000000cb79bd48 x1813706872336512/t0(6532655383690) o101->133a053d-2a7a-4972-bd0d-89152276f0a7@35@gni:341/0 lens 520/608 e 0 to 0 dl 1729811181 ref 1 fl Interpret:/204/0 rc 0/0 job:'' uid:1356 gid:11121
      Oct 24 18:05:33 kjcf05n00 kernel: Lustre: kjcf05-MDT0001-mdc-ffff9734e40bd800: Connection restored to 10.16.100.52@o2ib (at 10.16.100.52@o2ib) 
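      The rc of -14 in the "Replay open failed" messages above is -EFAULT ("Bad address"); for reference:

```python
import errno
import os

# Lustre kernel messages report negative errno values; -14 is -EFAULT.
print(errno.errorcode[14], "-", os.strerror(14))  # EFAULT - Bad address
```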

      Recovery after kjcf05n03 Failback:

      Oct 24 18:27:05 kjcf05n03 kernel: LustreError: 52805:0:(mdt_open.c:1442:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid  req@00000000b48954bd x1813706872336512/t0(6532655383690) o101->133a053d-2a7a-4972-bd0d-89152276f0a7@35@gni:119/0 lens 520/4176 e 0 to 0 dl 1729812474 ref 1 fl Interpret:/204/0 rc 0/0 job:'' uid:1356 gid:11121
      Oct 24 18:27:05 kjcf05n03 kernel: LustreError: 52805:0:(mdt_handler.c:5119:mdt_intent_open()) @@@ Replay open failed with -14  req@00000000b48954bd x1813706872336512/t0(6532655383690) o101->133a053d-2a7a-4972-bd0d-89152276f0a7@35@gni:119/0 lens 520/608 e 0 to 0 dl 1729812474 ref 1 fl Interpret:/204/0 rc 0/0 job:'' uid:1356 gid:11121
      Oct 24 18:27:14 kjcf05n00 kernel: Lustre: kjcf05-MDT0001-mdc-ffff9734e40bd800: Connection restored to 10.16.100.53@o2ib (at 10.16.100.53@o2ib)
      Oct 24 18:27:14 kjcf05n02 kernel: Lustre: kjcf05-MDT0001-osp-MDT0000: Connection restored to 10.16.100.53@o2ib (at 10.16.100.53@o2ib) 

      Timestamps for the IOR job:

      <150>1 2024-10-24T17:52:40.433638-05:00 c0-0c0s1n1 aprun 25706 p1-20241023t024851 [alps_msgs@34] apid=8319126, Starting, user=1356, cmd_line="aprun -n 64 /hpcdc/project/craytest/ostest/binaries/xt/rel.70up03.aries.cray-sp2/xtcnl/ostest/ROOT.latest/tests/gold/ioperf/IOR/IOR -o /lus/kjcf05/flash/ostest.vers/alsorun.20241024070601.93268.jupiter-p1/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.1.kNcCgv.1729810355/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m/IORfile_1m -a MPIIO -i 8 -w -r -W -t 1m -b 512m -C -k -vv -q -d 15 ", num_nodes=2, node_list=172-173, cwd="/lus/kjcf05/flash/ostest.vers/alsorun.20241024070601.93268.jupiter-p1/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.1.kNcCgv.1729810355/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.25633", wdir="/lus/kjcf05/flash/ostest.vers/alsorun.20241024070601.93268.jupiter-p1/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.1.kNcCgv.1729810355/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.25633"
      <150>1 2024-10-24T18:07:53.597430-05:00 c0-0c0s1n1 apsys 25844 p1-20241023t024851 [alps_msgs@34] apid=8319126, Finishing, user=1356, exit_code=137, exitcode_array=134:0, exitsignal_array=9:0
      <142>1 2024-10-24T18:07:55.386375-05:00 c0-0c0s1n1 RUR 29745 p1-20241023t024851 [RUR@34] uid: 1356, apid: 8319126, jobid: 0, cmdname: /hpcdc/project/craytest/ostest/binaries/xt/rel.70up03.aries.cray-sp2/xtcnl/ostest/ROOT.latest/tests/gold/ioperf/IOR/IOR, plugin: taskstats {"core": 0, "exitcode:signal": 0, "rur_errors": 0, "rur_truncated_record_errors": 0, "max_rss": 0, "stime": 0, "wchar": 0, "rchar": 0, "utime": 0}
      <142>1 2024-10-24T18:07:55.386441-05:00 c0-0c0s1n1 RUR 29745 p1-20241023t024851 [RUR@34] uid: 1356, apid: 8319126, jobid: 0, cmdname: /hpcdc/project/craytest/ostest/binaries/xt/rel.70up03.aries.cray-sp2/xtcnl/ostest/ROOT.latest/tests/gold/ioperf/IOR/IOR, plugin: timestamp APP_START 2024-10-24T17:52:40CDT APP_STOP 2024-10-24T18:07:54CDT 
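      The aprun command line above pins down the expected file size: 64 ranks (-n 64), each writing a 512 MiB block (-b 512m) in 1 MiB transfers (-t 1m), which matches the 34359738368-byte "Expected aggregate file size" in the IOR output:

```python
ranks = 64                    # aprun -n 64
block_size = 512 * (1 << 20)  # IOR -b 512m
transfer_size = 1 << 20       # IOR -t 1m

print(ranks * block_size)              # 34359738368 (32 GiB aggregate)
print(block_size // transfer_size)     # 512 transfers per rank per iteration
```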

      lfs getstripe output:

      ./CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m/IORfile_1m
         lcm_layout_gen:    7
         lcm_mirror_count:  1
         lcm_entry_count:   4
           lcme_id:             1
           lcme_mirror_id:      0
           lcme_flags:          init
           lcme_extent.e_start: 0
           lcme_extent.e_end:   131072
             lmm_stripe_count:  0
             lmm_stripe_size:   131072
             lmm_pattern:       mdt
             lmm_layout_gen:    65535
             lmm_stripe_offset: 0
       
           lcme_id:             2
           lcme_mirror_id:      0
           lcme_flags:          init
           lcme_extent.e_start: 131072
           lcme_extent.e_end:   134217728
             lmm_stripe_count:  1
             lmm_stripe_size:   131072
             lmm_pattern:       raid0
             lmm_layout_gen:    65535
             lmm_stripe_offset: 2
             lmm_pool:          flash
             lmm_objects:
             -   0: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44f1:0x0] }
       
           lcme_id:             4
           lcme_mirror_id:      0
           lcme_flags:          init
           lcme_extent.e_start: 134217728
           lcme_extent.e_end:   34359738368
             lmm_stripe_count:  32
             lmm_stripe_size:   131072
             lmm_pattern:       raid0,overstriped
             lmm_layout_gen:    65535
             lmm_stripe_offset: 2
             lmm_pool:          flash
             lmm_objects:
             -   0: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44f2:0x0] }
             -   1: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c414:0x0] }
             -   2: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44f3:0x0] }
             -   3: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c415:0x0] }
             -   4: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44f4:0x0] }
             -   5: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c416:0x0] }
             -   6: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44f5:0x0] }
             -   7: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c417:0x0] }
             -   8: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44f6:0x0] }
             -   9: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c418:0x0] }
             -  10: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44f7:0x0] }
             -  11: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c419:0x0] }
             -  12: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44f8:0x0] }
             -  13: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c41a:0x0] }
             -  14: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44f9:0x0] }
             -  15: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c41b:0x0] }
             -  16: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44fa:0x0] }
             -  17: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c41c:0x0] }
             -  18: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44fb:0x0] }
             -  19: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c41d:0x0] }
             -  20: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44fc:0x0] }
             -  21: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c41e:0x0] }
             -  22: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44fd:0x0] }
             -  23: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c41f:0x0] }
             -  24: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44fe:0x0] }
             -  25: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c420:0x0] }
             -  26: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44ff:0x0] }
             -  27: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c421:0x0] }
             -  28: { l_ost_idx:   2, l_fid: [0x300007930:0xfa4500:0x0] }
             -  29: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c422:0x0] }
             -  30: { l_ost_idx:   2, l_fid: [0x300007930:0xfa4501:0x0] }
             -  31: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c423:0x0] }
       
           lcme_id:             5
           lcme_mirror_id:      0
           lcme_flags:          extension
           lcme_extent.e_start: 34359738368
           lcme_extent.e_end:   EOF
             lmm_stripe_count:  0
             lmm_extension_size: 67108864
             lmm_pattern:       raid0,overstriped
             lmm_layout_gen:    0
             lmm_stripe_offset: -1
             lmm_pool:          flash 
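      Given the layout above, the corrupted offset can be mapped to a stripe and hence to an OST object. A sketch of that mapping (assumes plain round-robin RAID0 placement across the component's stripes, which is how Lustre RAID0 components distribute data):

```python
# Layout values taken from the lfs getstripe output above (component id 4).
COMP_START = 134217728    # lcme_extent.e_start
STRIPE_SIZE = 131072      # lmm_stripe_size
STRIPE_COUNT = 32         # lmm_stripe_count (overstriped across OSTs 2 and 3)

def stripe_of(offset):
    """Map a file offset inside component id 4 to its stripe index."""
    rel = offset - COMP_START
    return (rel // STRIPE_SIZE) % STRIPE_COUNT

bad = 33957085184         # first miscompared file byte offset
print(stripe_of(bad))     # → 0, i.e. object 0 (l_ost_idx 2) in lmm_objects
```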

      debugfs stat:

      [root@kjcf05n03 ~]# debugfs -c -R "stat <481925412>" /dev/md66
      debugfs 1.46.5.cr1 (16-Jan-2023)
      Inode: 481925412  Type: regular  Mode: 0644  Flags: 0x80000
      Generation: 2621827007  Version: 0x000005f9:19d97c78
      User: 1356  Group: 11121  Project:   0  Size: 131072
      File ACL: 620866789
      Links: 1  Blockcount: 264
      Fragment: Address: 0  Number: 0  Size: 0
       ctime: 0x671ad2fa:00000000 -- Thu Oct 24 18:06:34 2024
       atime: 0x671f95bc:00000000 -- Mon Oct 28 08:46:36 2024
       mtime: 0x671ad2fa:00000000 -- Thu Oct 24 18:06:34 2024
      crtime: 0x671ad189:07bed478 -- Thu Oct 24 18:00:25 2024
      Size of extra inode fields: 32
      Extended attributes:
       lma: fid=[0x2403375e4:0x14053:0x0] compat=0 incompat=0
       linkea: idx=0 parent=[0x240337296:0x454:0x0] name='IORfile_1m'
       trusted.dataver (8) = 78 7c d9 19 f9 05 00 00 
       trusted.som (24) = 04 00 00 00 00 00 00 00 00 00 00 e8 07 00 00 00 10 01 f4 03 00 00 00 00 
       trusted.lov (1216)
      EXTENTS:
      (0-31):671298939-671298970
      
      [root@kjcf05n03 ~]#
      jupiter-p1:/lus/kjcf05/flash/ostest.vers/alsorun.20241024070601.93268.jupiter-p1/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.1.kNcCgv.1729810355/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1
      m # lfs path2fid /lus/kjcf05/flash/ostest.vers/alsorun.20241024070601.93268.jupiter-p1/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.1.kNcCgv.1729810355/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m/IORfile_1m
      [0x2403375e4:0x14053:0x0]
      jupiter-p1:/lus/kjcf05/flash/ostest.vers/alsorun.20241024070601.93268.jupiter-p1/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.1.kNcCgv.1729810355/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1
      m # 
      
      [root@kjcf05n03 ~]# debugfs -c -R " dump <481925412> /tmp/mdt.IORfile.1024" /dev/md66
      debugfs 1.46.5.cr1 (16-Jan-2023)
      [root@kjcf05n03 ~]# hexdump -Cv /tmp/mdt.IORfile.1024 | head -n3
      00000000  76 d1 1a 67 00 00 00 00  08 00 00 00 00 00 00 00  |v..g............|
      00000010  76 d1 1a 67 00 00 00 00  18 00 00 00 00 00 00 00  |v..g............|
      00000020  76 d1 1a 67 00 00 00 00  28 00 00 00 00 00 00 00  |v..g....(.......|
      [root@kjcf05n03 ~]# hexdump -Cv /tmp/mdt.IORfile.1024 | grep -v '76 d1 1a 67'
      00020000
      [root@kjcf05n03 ~]#
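      Two sanity checks fall out of the debugfs session above: the trusted.dataver xattr bytes are just the little-endian inode Version, and the start of the MDT object dump decodes to the expected IOR signature (values copied from the output above):

```python
import struct

# trusted.dataver (8) = 78 7c d9 19 f9 05 00 00
dataver = bytes([0x78, 0x7c, 0xd9, 0x19, 0xf9, 0x05, 0x00, 0x00])
print(hex(struct.unpack("<Q", dataver)[0]))
# → 0x5f919d97c78, matching inode Version: 0x000005f9:19d97c78

# First 16 bytes of the hexdump: 76 d1 1a 67 00 00 00 00  08 00 00 00 00 00 00 00
head = bytes([0x76, 0xd1, 0x1a, 0x67, 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0])
w0, w1 = struct.unpack("<QQ", head)
print(hex(w0), hex(w1))   # → 0x671ad176 (timestamp signature), 0x8
```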
      

      The dump ends at offset 0x00020000 = 131072 bytes = 128 KiB, matching the DoM component size, and grep -v found no line lacking the 0x671ad176 signature: the first 128 KiB held on the MDT contains the expected data.

      Attachments

        1. alternate_posix_signature_Command.out15537.txt
          1.39 MB
        2. bad_CL_IOR_b256m_sel_all_fpp_mpiio_wr_8iter_1m_LUS_12584_20241107005243.txt
          286 kB
        3. bad_dklog.nid00037.20241107005243.llog.gz
          57.30 MB
        4. bad_dklog.nid00037.20241107005243.llog-1.gz
          57.30 MB
        5. bad_dklog.nid00038.20241107005243.llog.gz
          46.37 MB
        6. bad_dklog.nid00038.20241107005243.llog-1.gz
          46.37 MB
        7. bad_dklog.nid00039.20241107005243.llog.gz
          37.98 MB
        8. bad_dklog.nid00039.20241107005243.llog-1.gz
          37.98 MB
        9. CL_IOR_b256m_sel_all_fpp_mpiio_wr_8iter_1m_LUS_12584.txt
          297 kB
        10. CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.txt
          207 kB
        11. CL_IOR_sel_32ovs_all_mpiio_wr_8iter_1m.txt
          206 kB
        12. client_dk.tar.gz
          153.44 MB
        13. client_dk.tar-1.gz
          153.44 MB
        14. client_dmesg.tar.gz
          54 kB
        15. client_dmesg.tar-1.gz
          54 kB
        16. client_logs_20241111.tar.gz
          185.82 MB
        17. data.20241108.tar.gz
          369.31 MB
        18. dmesg.log
          150 kB
        19. good_CL_IOR_b256m_sel_all_fpp_mpiio_wr_8iter_1m_LUS_12584_20241106181830_output.txt
          201 kB
        20. good_CL_IOR_b256m_sel_all_fpp_mpiio_wr_8iter_1m_LUS_12584_202411077072444.txt
          7 kB
        21. good_dklog.nid00172_202411077072444.log.tar.gz
          91.56 MB
        22. good_dklog.nid00172.20241106181830.llog.gz
          82.55 MB
        23. good_dklog.nid00173_202411077072444.log.tar.gz
          91.21 MB
        24. good_dklog.nid00173.20241106181830.llog.gz
          83.21 MB
        25. kjcf05_llog.tar.gz
          581 kB
        26. kjcf05n02_llog.tar.gz
          68 kB
        27. kjcf05n03_llog.tar.gz
          56 kB
        28. mds_mgs_dklog.tar.gz
          3.03 MB
        29. mds_mgs_dklog.tar-1.gz
          3.03 MB
        30. mds_mgs_dmesg.tar.gz
          70 kB
        31. mgs_mds_logs_20241111.tar.gz
          11.69 MB
        32. README.lu18416
          41 kB
        33. test_output_20241111.txt
          298 kB

        Issue Links

          Activity

            [LU-18416] Data corruption/miscompare observed during 48hr FOFB
            spitzcor Cory Spitz added a comment -

            > So, is there anything further to track under this ticket or can we close it as a duplicate of LU-18435?
            I think we're all done here. Let's resolve it as a duplicate.

            spitzcor Cory Spitz added a comment - - edited

            Thanks for clarifying, Alex. Right, we were talking about the same thing. I think that latest/greatest canonical IOR operates differently and is better in that regard. If not we'll file a ticket with the upstream IOR/mdtest project.

            pjones Peter Jones added a comment -

            Very good news - thanks for the update. So, is there anything further to track under this ticket or can we close it as a duplicate of LU-18435?


            prasannakumar Prasannakumar Nagasubramani added a comment -

            Completed the 48-hour reproducer run with the fix. The issue was not reproduced.

            bzzz Alex Zhuravlev added a comment -

            > Can you clarify what you meant there?

            IMHO, IOR shouldn't proceed if it cannot write some data (short writes), and there is no point in verifying the full dataset after short writes.
            pjones Peter Jones added a comment -

            Cory

            The patch is already in Gerrit for testing/reviews, just tracked under LU-18435 instead of under this ticket number.

            Peter

            spitzcor Cory Spitz added a comment -

            Hi, Alex.
            > probably makes sense to ask IOR authors to check for write(2) results?
            Can you clarify what you meant there? We know that the IOR version we've used [we're not using the latest https://github.com/hpc/ior] misinterprets the failed writes, and the resulting correct but shortened file size, as a miscompare. But are you referring to something else?

            p.s., thanks for proposing your fix; can you post it for formal review?


            prasannakumar Prasannakumar Nagasubramani added a comment -

            The reproducer run with the fix patch crossed the halfway mark (24 hours) without corruption.

            Running: 1d 15m 19s. Run ends: 23h 44m 41s.

            bzzz Alex Zhuravlev added a comment -

            It probably makes sense to ask the IOR authors to check write(2) results?

            prasannakumar Prasannakumar Nagasubramani added a comment - edited

            bzzz I started the 48-hour reproducer run with the patch on the server. The run has crossed the 4-hour mark without corruption; we used to hit the issue within 90 minutes.

            People

              bzzz Alex Zhuravlev
              prasannakumar Prasannakumar Nagasubramani
              Votes: 0
              Watchers: 13
