
[LU-13899] IOR data corruption detected during automated lnet fofb testing

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version: Lustre 2.14.0

    Description

      Seeing data corruption in two IOR jobs while running the lnet failover test with "Regression write/verify DNE2 PFL". Both failures show data from the immediately preceding run: the Expected/Actual words in the verify errors decode to the two consecutive data-signature time stamps (see the decoding sketch after the log excerpt below). This is likely a cache-related issue; fsync-related warnings are also logged.

      CL_IOR_all_wr_20iter_1666K_rand

       access             = file-per-process
       	pattern            = strided (33 segments)
       	ordering in a file = random offsets
       	ordering inter file=constant task offsets = 1
       	clients            = 48 (4 per node)
       	repetitions        = 20
       	xfersize           = 1.63 MiB
       	blocksize          = 27.66 MiB
       	aggregate filesize = 42.78 GiB
      
       read      1316.54    28322      1666.00    0.001729   33.27      0.010429   33.28      16   XXCEL
       Using Time Stamp 1594461626 (0x5f098dba) for Data Signature
       Commencing write performance test.
       Sat Jul 11 05:00:26 2020
       read      1852.87    28322      1666.00    0.001450   23.64      0.004490   23.64      17   XXCEL
       Using Time Stamp 1594461701 (0x5f098e05) for Data Signature
       Commencing write performance test.
       Sat Jul 11 05:01:41 2020
       WARNING: cannot perform fsync on file.
       WARNING: cannot perform fsync on file.
       write     197.50     28322      1666.00    0.001447   221.82     0.421875   221.82     18   XXCEL
       Verifying contents of the file(s) just written.
       Sat Jul 11 05:05:23 2020 [6] At transfer buffer #3, index #0 (file byte offset 235425792):
       [6] Expected: 0x0000000e5f098e05
       [6] Actual:   0x0000000e5f098dba
       [6] At transfer buffer #3, index #2 (file byte offset 235425808):
       [6] Expected: 0x0000000e5f098e05
       [6] Actual:   0x0000000e5f098dba

       IOR job: -a POSIX -i 20 -w -r -W -t 1666K -b 28322K -C -e -k -vv -E -F -q -s 33 -x -z
      
      
      File in question: /lus/snx11205/ostest.vers/alsorun.20200711032404.2543.pollux-p4/CL_IOR_all_wr_20iter_1666K_rand.2.9t3jzC.1594459731/CL_IOR_all_wr_20iter_1666K_rand/IORfile.00000004
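
      The Expected/Actual words in the verify error are consistent with the "data from the previous run" observation: the low 32 bits of each word match the two data-signature time stamps printed above (1594461701 for the current write pass, 1594461626 for the previous one). A minimal decoding sketch, assuming IOR keeps the 32-bit data-signature time stamp in the low word of each 64-bit data element; this helper is illustrative only and not part of IOR:

       #include <inttypes.h>
       #include <stdint.h>
       #include <stdio.h>

       int main(void)
       {
               uint64_t expected = 0x0000000e5f098e05ULL; /* from the verify error */
               uint64_t actual   = 0x0000000e5f098dbaULL;

               /* Low 32 bits: 1594461701 vs 1594461626, i.e. the current and the
                * previous "Using Time Stamp ... for Data Signature" values above. */
               printf("expected signature %" PRIu32 "\n", (uint32_t)(expected & 0xffffffffULL));
               printf("actual   signature %" PRIu32 "\n", (uint32_t)(actual & 0xffffffffULL));
               return 0;
       }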
      

      Attachments

        Activity

          [LU-13899] IOR data corruption detected during automated lnet fofb testing
          pjones Peter Jones added a comment -

          Landed for 2.14


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39612/
          Subject: LU-13899 tgt: drop old epoch request
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 4f192768293364c65411015de7531f62fdfb754c


          zam Alexander Zarochentsev added a comment -

          > shouldn't this data integrity issue carry a higher priority? I think we ought to target landing for 2.14.0.

          agreed
          eaujames Etienne Aujames added a comment - edited

          Hello,
          Is a backport planned for b2_12?

          spitzcor Cory Spitz added a comment -

          zam, shouldn't this data integrity issue carry a higher priority? I think we ought to target landing for 2.14.0.


          gerrit Gerrit Updater added a comment -

          Alexander Zarochentsev (alexander.zarochentsev@hpe.com) uploaded a new patch: https://review.whamcloud.com/39612
          Subject: LU-13899 tgt: drop old epoch request
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 2af903484430def99448b0e0ba8685aec720dda6


          zam Alexander Zarochentsev added a comment -

          Further debugging showed that:
          1. the failed fsync() was due to an AS_EIO flag set on the address space, and 492 pages with the PG_error flag set (how this surfaces to the application is sketched after this comment):

          crash-7.2.8> kmem -p | grep ffff8803f2c10c50|awk '{print $7}'|sort|uniq -c
              491 error,referenced,uptodate,lru,private
                1 error,uptodate,lru,active,private
           232875 referenced,uptodate,lru,private
               11 referenced,uptodate,private
              279 uptodate,lru,active,private
          crash-7.2.8>
          

          2. A debug patch gave us error code -ESTALE:

          18282:0:3564:0:(vvp_page.c:304:vvp_page_completion_write()) page@ffff88017556f200[2 ffff8801f86630e8 0 1           (null)]
          00000020:00000001:28.0:1595864091.118282:0:3566:0:(cl_page.c:840:cl_req_type_state()) Process leaving (rc=2 : 2 : 2)
          00000080:00008000:25.0:1595864091.118283:0:3564:0:(vvp_page.c:304:vvp_page_completion_write()) completing WRITE with -116
          00000020:00000001:28.0:1595864091.118283:0:3566:0:(cl_page.c:925:cl_page_completion()) Process entered
          00000080:00040000:25.0:1595864091.118284:0:3564:0:(vvp_page.c:246:vvp_vmpage_error()) LBUG
          00000020:00000001:28.0:1595864091.118285:0:3566:0:(cl_page.c:926:cl_page_completion()) page@ffff8801ff17fc00[2 ffff8801f86630e8 2 1 
                    (null)]
          00000020:00000001:28.0:1595864091.118286:0:3566:0:(cl_page.c:926:cl_page_completion()) 1 0
          00000020:00000001:28.0:1595864091.118287:0:3566:0:(cl_page.c:341:cl_page_state_set0()) Process entered
          00000020:00000001:28.0:1595864091.118288:0:3566:0:(cl_page.c:344:cl_page_state_set0()) page@ffff8801ff17fc00[2 ffff8801f86630e8 2 1 
                    (null)]
          

          3. and finally, the -ESTALE comes from process_req_last_xid():

          00000100:00000200:15.0:1596013191.408113:0:11072:0:(service.c:2233:ptlrpc_server_handle_request()) got req 1673498761985920
          00000020:00000001:15.0:1596013191.408114:0:11072:0:(tgt_handler.c:706:tgt_request_handle()) Process entered
          00000020:00000001:15.0:1596013191.408114:0:11072:0:(tgt_handler.c:637:process_req_last_xid()) Process entered
          00000020:00000001:15.0:1596013191.408115:0:11072:0:(tgt_handler.c:679:process_req_last_xid()) Process leaving via out (rc=18446744073709551500 : -116 : 0xffffffffffffff8c)
          00000020:00000001:15.0:1596013191.408116:0:11072:0:(tgt_handler.c:691:process_req_last_xid()) Process leaving (rc=18446744073709551500 : -116 : ffffffffffffff8c)
          

          It appears the -ESTALE is not handled properly on the client (the rc values above decode to -116, i.e. -ESTALE; see below).
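
          For reference, the rc values in the process_req_last_xid() trace are two renderings of the same negative errno: 18446744073709551500 is the unsigned form of -116, 0xffffffffffffff8c is its hex form, and errno 116 is ESTALE on Linux. A small illustrative check, not Lustre code:

           #include <inttypes.h>
           #include <stdint.h>
           #include <stdio.h>

           int main(void)
           {
                   /* The value logged above, reinterpreted as a signed 64-bit
                    * integer on a two's-complement machine. */
                   int64_t rc = (int64_t)0xffffffffffffff8cULL;

                   printf("rc = %" PRId64 "\n", rc);   /* prints -116; ESTALE is 116 on Linux */
                   return 0;
           }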

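          The application-visible side of the above is worth noting: the buffered write()s succeed against the client page cache, and the write-back failure only surfaces when fsync() returns an error, which IOR reports as the "WARNING: cannot perform fsync on file." seen in the description, after which the verify pass reads back the previous iteration's data. A minimal user-space sketch of that failure mode (generic write/fsync behaviour, not Lustre code):

           #include <errno.h>
           #include <stdio.h>
           #include <string.h>
           #include <unistd.h>

           /* Illustrative only: a buffered write can report success even though the
            * data never reached stable storage; the error saved on the file's
            * mapping (AS_EIO above) is handed back by the next fsync(). */
           static int write_and_sync(int fd, const void *buf, size_t len)
           {
                   if (write(fd, buf, len) != (ssize_t)len)
                           return -1;              /* immediate error or short write */

                   if (fsync(fd) < 0) {
                           /* This is where the corruption in this ticket first became
                            * visible; treating it as fatal rather than a warning would
                            * stop the job before the read-back verification. */
                           fprintf(stderr, "fsync: %s\n", strerror(errno));
                           return -1;
                   }
                   return 0;                       /* data reported durable */
           }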

          People

            Assignee: zam Alexander Zarochentsev
            Reporter: zam Alexander Zarochentsev
            Votes: 0
            Watchers: 6
