Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-18416

Data corruption/miscompare observed during 48hr FOFB

    XMLWordPrintable

Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • None
    • Lustre 2.16.0
    • None
    • 3
    • 9223372036854775807

    Description

      Data corruption/miscompare issue observed during 48 hours lustre FOFB.

       [63] FAILED comparison of buffer containing 8-byte ints:
       [63]   File name = /lus/kjcf05/flash/ostest.vers/alsorun.20241024070601.93268.jupiter-p1/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.1.kNcCgv.1729810355/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m/IORfile_1m
       [63]   In transfer 129, 131072 errors between buffer indices 0 and 131071.
       [63]   File byte offset = 33957085184:
       [63]     Expected: 0x0000003f671ad176 0000000000000008 0000003f671ad176 0000000000000018 
       [63]     Actual:   0x6161616161616161 6161616161616161 6161616161616161 6161616161616161 
       ** error **
       ERROR in IOR.c (line 448): data check error, aborting execution.
       ERROR: No such file or directory
       ** exiting **
       Rank 63 [Thu Oct 24 18:07:48 2024] [c0-0c2s11n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 63
       _pmiu_daemon(SIGCHLD): [NID 00173] [c0-0c2s11n1] [Thu Oct 24 18:07:49 2024] PE RANK 63 exit signal Aborted
       [NID 00173] 2024-10-24 18:07:49 Apid 8319126: initiated application termination 

      First issue with timestamp :

       read      5759       524288     1024.00    0.042503   5.65       0.000407   5.69       6    XXCEL
       Using Time Stamp 1729810806 (0x671ad176) for Data Signature
       delaying 15 seconds . . .
       Commencing write performance test.
       Thu Oct 24 18:00:25 2024
       
       WARNING: Expected aggregate file size       = 34359738368.
       WARNING: Stat() of aggregate file size      = 33957085184.
       WARNING: Using actual aggregate bytes moved = 34359738368.
       write     74.34      524288     1024.00    0.010731   440.78     0.002005   440.79     7    XXCEL
       Verifying contents of the file(s) just written.
       Thu Oct 24 18:07:45 2024
       
       [63] At transfer buffer #128, index #0 (file byte offset 33957085184):
       [63] Expected: 0x0000003f671ad176
       [63] Actual:   0x6161616161616161
       [63] At transfer buffer #128, index #1 (file byte offset 33957085192):
       [63] Expected: 0x0000000000000008
       [63] Actual:   0x6161616161616161
       [63] At transfer buffer #128, index #2 (file byte offset 33957085200): 

      FOFB operation during the corruption time :

      2024-10-24t17:45:15 Test 17  -- operation kill failover of kjcf05n03
      2024-10-24t18:11:14 Test 17  -- operation kill failback of kjcf05n03 

      Recovery after kjcf05n03 Failover:

      Oct 24 18:04:49 kjcf05n06 kernel: Lustre: kjcf05-MDT0001-lwp-OST0002: Connection restored to 10.16.100.52@o2ib (at 10.16.100.52@o2ib)
      Oct 24 18:05:31 kjcf05n02 kernel: LustreError: 1970850:0:(mdt_open.c:1442:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid  req@00000000cb79bd48 x1813706872336512/t0(6532655383690) o101->133a053d-2a7a-4972-bd0d-89152276f0a7@35@gni:341/0 lens 520/4576 e 0 to 0 dl 1729811181 ref 1 fl Interpret:/204/0 rc 0/0 job:'' uid:1356 gid:11121
      Oct 24 18:05:31 kjcf05n02 kernel: LustreError: 1970850:0:(mdt_handler.c:5119:mdt_intent_open()) @@@ Replay open failed with -14  req@00000000cb79bd48 x1813706872336512/t0(6532655383690) o101->133a053d-2a7a-4972-bd0d-89152276f0a7@35@gni:341/0 lens 520/608 e 0 to 0 dl 1729811181 ref 1 fl Interpret:/204/0 rc 0/0 job:'' uid:1356 gid:11121
      Oct 24 18:05:33 kjcf05n00 kernel: Lustre: kjcf05-MDT0001-mdc-ffff9734e40bd800: Connection restored to 10.16.100.52@o2ib (at 10.16.100.52@o2ib) 

      Recovery after kjcf05n03 Failback :

      Oct 24 18:27:05 kjcf05n03 kernel: LustreError: 52805:0:(mdt_open.c:1442:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid  req@00000000b48954bd x1813706872336512/t0(6532655383690) o101->133a053d-2a7a-4972-bd0d-89152276f0a7@35@gni:119/0 lens 520/4176 e 0 to 0 dl 1729812474 ref 1 fl Interpret:/204/0 rc 0/0 job:'' uid:1356 gid:11121
      Oct 24 18:27:05 kjcf05n03 kernel: LustreError: 52805:0:(mdt_handler.c:5119:mdt_intent_open()) @@@ Replay open failed with -14  req@00000000b48954bd x1813706872336512/t0(6532655383690) o101->133a053d-2a7a-4972-bd0d-89152276f0a7@35@gni:119/0 lens 520/608 e 0 to 0 dl 1729812474 ref 1 fl Interpret:/204/0 rc 0/0 job:'' uid:1356 gid:11121
      Oct 24 18:27:14 kjcf05n00 kernel: Lustre: kjcf05-MDT0001-mdc-ffff9734e40bd800: Connection restored to 10.16.100.53@o2ib (at 10.16.100.53@o2ib)
      Oct 24 18:27:14 kjcf05n02 kernel: Lustre: kjcf05-MDT0001-osp-MDT0000: Connection restored to 10.16.100.53@o2ib (at 10.16.100.53@o2ib) 

      Timestamp for the IOR job :

      <150>1 2024-10-24T17:52:40.433638-05:00 c0-0c0s1n1 aprun 25706 p1-20241023t024851 [alps_msgs@34] apid=8319126, Starting, user=1356, cmd_line="aprun -n 64 /hpcdc/project/craytest/ostest/binaries/xt/rel.70up03.aries.cray-sp2/xtcnl/ostest/ROOT.latest/tests/gold/ioperf/IOR/IOR -o /lus/kjcf05/flash/ostest.vers/alsorun.20241024070601.93268.jupiter-p1/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.1.kNcCgv.1729810355/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m/IORfile_1m -a MPIIO -i 8 -w -r -W -t 1m -b 512m -C -k -vv -q -d 15 ", num_nodes=2, node_list=172-173, cwd="/lus/kjcf05/flash/ostest.vers/alsorun.20241024070601.93268.jupiter-p1/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.1.kNcCgv.1729810355/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.25633", wdir="/lus/kjcf05/flash/ostest.vers/alsorun.20241024070601.93268.jupiter-p1/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.1.kNcCgv.1729810355/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.25633"
      <150>1 2024-10-24T18:07:53.597430-05:00 c0-0c0s1n1 apsys 25844 p1-20241023t024851 [alps_msgs@34] apid=8319126, Finishing, user=1356, exit_code=137, exitcode_array=134:0, exitsignal_array=9:0
      <142>1 2024-10-24T18:07:55.386375-05:00 c0-0c0s1n1 RUR 29745 p1-20241023t024851 [RUR@34] uid: 1356, apid: 8319126, jobid: 0, cmdname: /hpcdc/project/craytest/ostest/binaries/xt/rel.70up03.aries.cray-sp2/xtcnl/ostest/ROOT.latest/tests/gold/ioperf/IOR/IOR, plugin: taskstats {"core": 0, "exitcode:signal": 0, "rur_errors": 0, "rur_truncated_record_errors": 0, "max_rss": 0, "stime": 0, "wchar": 0, "rchar": 0, "utime": 0}
      <142>1 2024-10-24T18:07:55.386441-05:00 c0-0c0s1n1 RUR 29745 p1-20241023t024851 [RUR@34] uid: 1356, apid: 8319126, jobid: 0, cmdname: /hpcdc/project/craytest/ostest/binaries/xt/rel.70up03.aries.cray-sp2/xtcnl/ostest/ROOT.latest/tests/gold/ioperf/IOR/IOR, plugin: timestamp APP_START 2024-10-24T17:52:40CDT APP_STOP 2024-10-24T18:07:54CDT 

      lfs getstripe O/P :

      ./CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m/IORfile_1m
         lcm_layout_gen:    7
         lcm_mirror_count:  1
         lcm_entry_count:   4
           lcme_id:             1
           lcme_mirror_id:      0
           lcme_flags:          init
           lcme_extent.e_start: 0
           lcme_extent.e_end:   131072
             lmm_stripe_count:  0
             lmm_stripe_size:   131072
             lmm_pattern:       mdt
             lmm_layout_gen:    65535
             lmm_stripe_offset: 0
       
           lcme_id:             2
           lcme_mirror_id:      0
           lcme_flags:          init
           lcme_extent.e_start: 131072
           lcme_extent.e_end:   134217728
             lmm_stripe_count:  1
             lmm_stripe_size:   131072
             lmm_pattern:       raid0
             lmm_layout_gen:    65535
             lmm_stripe_offset: 2
             lmm_pool:          flash
             lmm_objects:
             -   0: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44f1:0x0] }
       
           lcme_id:             4
           lcme_mirror_id:      0
           lcme_flags:          init
           lcme_extent.e_start: 134217728
           lcme_extent.e_end:   34359738368
             lmm_stripe_count:  32
             lmm_stripe_size:   131072
             lmm_pattern:       raid0,overstriped
             lmm_layout_gen:    65535
             lmm_stripe_offset: 2
             lmm_pool:          flash
             lmm_objects:
             -   0: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44f2:0x0] }
             -   1: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c414:0x0] }
             -   2: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44f3:0x0] }
             -   3: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c415:0x0] }
             -   4: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44f4:0x0] }
             -   5: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c416:0x0] }
             -   6: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44f5:0x0] }
             -   7: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c417:0x0] }
             -   8: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44f6:0x0] }
             -   9: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c418:0x0] }
             -  10: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44f7:0x0] }
             -  11: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c419:0x0] }
             -  12: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44f8:0x0] }
             -  13: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c41a:0x0] }
             -  14: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44f9:0x0] }
             -  15: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c41b:0x0] }
             -  16: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44fa:0x0] }
             -  17: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c41c:0x0] }
             -  18: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44fb:0x0] }
             -  19: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c41d:0x0] }
             -  20: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44fc:0x0] }
             -  21: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c41e:0x0] }
             -  22: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44fd:0x0] }
             -  23: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c41f:0x0] }
             -  24: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44fe:0x0] }
             -  25: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c420:0x0] }
             -  26: { l_ost_idx:   2, l_fid: [0x300007930:0xfa44ff:0x0] }
             -  27: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c421:0x0] }
             -  28: { l_ost_idx:   2, l_fid: [0x300007930:0xfa4500:0x0] }
             -  29: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c422:0x0] }
             -  30: { l_ost_idx:   2, l_fid: [0x300007930:0xfa4501:0x0] }
             -  31: { l_ost_idx:   3, l_fid: [0x3400061c0:0xf4c423:0x0] }
       
           lcme_id:             5
           lcme_mirror_id:      0
           lcme_flags:          extension
           lcme_extent.e_start: 34359738368
           lcme_extent.e_end:   EOF
             lmm_stripe_count:  0
             lmm_extension_size: 67108864
             lmm_pattern:       raid0,overstriped
             lmm_layout_gen:    0
             lmm_stripe_offset: -1
             lmm_pool:          flash 

      debugfs stat:

      [root@kjcf05n03 ~]# debugfs -c -R "stat <481925412>" /dev/md66
      debugfs 1.46.5.cr1 (16-Jan-2023)
      Inode: 481925412  Type: regular  Mode: 0644  Flags: 0x80000
      Generation: 2621827007  Version: 0x000005f9:19d97c78
      User: 1356  Group: 11121  Project:   0  Size: 131072
      File ACL: 620866789
      Links: 1  Blockcount: 264
      Fragment: Address: 0  Number: 0  Size: 0
       ctime: 0x671ad2fa:00000000 -- Thu Oct 24 18:06:34 2024
       atime: 0x671f95bc:00000000 -- Mon Oct 28 08:46:36 2024
       mtime: 0x671ad2fa:00000000 -- Thu Oct 24 18:06:34 2024
      crtime: 0x671ad189:07bed478 -- Thu Oct 24 18:00:25 2024
      Size of extra inode fields: 32
      Extended attributes:
       lma: fid=[0x2403375e4:0x14053:0x0] compat=0 incompat=0
       linkea: idx=0 parent=[0x240337296:0x454:0x0] name='IORfile_1m'
       trusted.dataver (8) = 78 7c d9 19 f9 05 00 00 
       trusted.som (24) = 04 00 00 00 00 00 00 00 00 00 00 e8 07 00 00 00 10 01 f4 03 00 00 00 00 
       trusted.lov (1216)
      EXTENTS:
      (0-31):671298939-671298970
      
      [root@kjcf05n03 ~]#
      jupiter-p1:/lus/kjcf05/flash/ostest.vers/alsorun.20241024070601.93268.jupiter-p1/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.1.kNcCgv.1729810355/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1
      m # lfs path2fid /lus/kjcf05/flash/ostest.vers/alsorun.20241024070601.93268.jupiter-p1/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.1.kNcCgv.1729810355/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m/IORfile_1m
      [0x2403375e4:0x14053:0x0]
      jupiter-p1:/lus/kjcf05/flash/ostest.vers/alsorun.20241024070601.93268.jupiter-p1/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.1.kNcCgv.1729810355/CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1
      m # 
      
      [root@kjcf05n03 ~]# debugfs -c -R " dump <481925412> /tmp/mdt.IORfile.1024" /dev/md66
      debugfs 1.46.5.cr1 (16-Jan-2023)
      [root@kjcf05n03 ~]# hexdump -Cv /tmp/mdt.IORfile.1024 | head -n3
      00000000  76 d1 1a 67 00 00 00 00  08 00 00 00 00 00 00 00  |v..g............|
      00000010  76 d1 1a 67 00 00 00 00  18 00 00 00 00 00 00 00  |v..g............|
      00000020  76 d1 1a 67 00 00 00 00  28 00 00 00 00 00 00 00  |v..g....(.......|
      [root@kjcf05n03 ~]# hexdump -Cv /tmp/mdt.IORfile.1024 | grep -v '76 d1 1a 67'
      00020000
      [root@kjcf05n03 ~]#
      

      00020000 > 131072/1024 = 128k.

      Attachments

        1. CL_IOR_dom_sel_32ovs_all_mpiio_wr_8iter_1m.txt
          207 kB
        2. CL_IOR_sel_32ovs_all_mpiio_wr_8iter_1m.txt
          206 kB
        3. dmesg.log
          150 kB
        4. README.lu18416
          41 kB
        5. good_dklog.nid00172.20241106181830.llog.gz
          82.55 MB
        6. good_dklog.nid00173.20241106181830.llog.gz
          83.21 MB
        7. good_CL_IOR_b256m_sel_all_fpp_mpiio_wr_8iter_1m_LUS_12584_20241106181830_output.txt
          201 kB
        8. bad_CL_IOR_b256m_sel_all_fpp_mpiio_wr_8iter_1m_LUS_12584_20241107005243.txt
          286 kB
        9. bad_dklog.nid00038.20241107005243.llog.gz
          46.37 MB
        10. bad_dklog.nid00039.20241107005243.llog.gz
          37.98 MB
        11. bad_dklog.nid00037.20241107005243.llog.gz
          57.30 MB
        12. bad_dklog.nid00037.20241107005243.llog-1.gz
          57.30 MB
        13. bad_dklog.nid00039.20241107005243.llog-1.gz
          37.98 MB
        14. bad_dklog.nid00038.20241107005243.llog-1.gz
          46.37 MB
        15. kjcf05_llog.tar.gz
          581 kB
        16. good_CL_IOR_b256m_sel_all_fpp_mpiio_wr_8iter_1m_LUS_12584_202411077072444.txt
          7 kB
        17. good_dklog.nid00172_202411077072444.log.tar.gz
          91.56 MB
        18. good_dklog.nid00173_202411077072444.log.tar.gz
          91.21 MB
        19. alternate_posix_signature_Command.out15537.txt
          1.39 MB
        20. client_dmesg.tar.gz
          54 kB
        21. mds_mgs_dmesg.tar.gz
          70 kB
        22. mds_mgs_dklog.tar.gz
          3.03 MB
        23. mds_mgs_dklog.tar-1.gz
          3.03 MB
        24. client_dmesg.tar-1.gz
          54 kB
        25. client_dk.tar.gz
          153.44 MB
        26. client_dk.tar-1.gz
          153.44 MB
        27. kjcf05n02_llog.tar.gz
          68 kB
        28. kjcf05n03_llog.tar.gz
          56 kB
        29. CL_IOR_b256m_sel_all_fpp_mpiio_wr_8iter_1m_LUS_12584.txt
          297 kB
        30. data.20241108.tar.gz
          369.31 MB
        31. test_output_20241111.txt
          298 kB
        32. mgs_mds_logs_20241111.tar.gz
          11.69 MB
        33. client_logs_20241111.tar.gz
          185.82 MB

        Issue Links

          Activity

            People

              bzzz Alex Zhuravlev
              prasannakumar Prasannakumar Nagasubramani
              Votes:
              0 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: