
Data corruption during MDT<>OST recovery

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.12.0
    • Affects Version/s: Lustre 2.5.0, Lustre 2.11.0
    • Labels: None
    • Environment: found in IEEL3 testing, but the bug appears to exist from 2.5.0 through master
    • Severity: 3
    • 9223372036854775807

    Description

      Data corruption was found during an mdtest run with MDT failover testing in parallel.
      The visible effect is that some files lack their own OST objects after the delete-orphan phase.
      This can be the result of unreplayed requests or of a wrong read on the OST after failover.
      Originally testing started with power-loss failover tests, but the problem has been replicated with a normal shutdown + failover, which allowed grabbing logs during the failover / recovery.
      Excerpts from the logs follow.
      The server takes a request from the client:

      00000100:00100000:3.0:1524592151.996038:0:18109:0:(service.c:1955:ptlrpc_server_handle_req_in()) got req x1598582492486880
      00000100:00100000:3.0:1524592151.996042:0:18109:0:(nrs_fifo.c:179:nrs_fifo_req_get()) NRS start fifo request from 12345-54@gni, seq: 1564278792
      00000100:00100000:3.0:1524592151.996043:0:18109:0:(service.c:2103:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt01_003:061cd4a0-cd22-5048-946a-d8f4ec53917a+6881:15972:x1598582492486880:12345-54@gni:101
      

      The server starts generating ENODEV once obd_fail is set during unmount.

      00000100:00080000:1.0:1524592151.996485:0:2924:0:(niobuf.c:583:ptlrpc_send_reply()) sending ENODEV from failed obd 16
      

      The server issues a sync as the next step of failover, and last_committed is updated.

      00010000:00080000:3.0:1524592151.996589:0:5591:0:(ldlm_lib.c:2492:target_committed_to_req()) 0c6ce748-6618-ff9d-ad84-6e6f761558b2 last_committed 2937774893859, transno 2937774968804, xid 1598584386635888
      

      — this create is on disk:

      00000004:00080000:13.0:1524592151.996630:0:8166:0:(osp_precreate.c:1476:osp_precreate_get_fid()) next [0x240000400:0x91561aae:0x0] last created [0x240000400:0x91562c9d:0x0]
      00000004:00080000:13.0:1524592151.996632:0:8166:0:(osp_object.c:1547:osp_object_create()) last fid for osp_object is [0x240000400:0x91561aae:0x0]
      00000004:00080000:13.0:1524592151.996867:0:8166:0:(osp_object.c:1601:osp_object_create()) snx11205-OST0000-osc-MDT0001: Wrote last used FID: [0x240000400:0x91561aae:0x0], index 0: 0
      

      – but this one is not:

      00000004:00080000:3.0:1524592151.996637:0:18109:0:(osp_precreate.c:1476:osp_precreate_get_fid()) next [0x240000400:0x91561aaf:0x0] last created [0x240000400:0x91562c9d:0x0]
      00000004:00080000:3.0:1524592151.996639:0:18109:0:(osp_object.c:1547:osp_object_create()) last fid for osp_object is [0x240000400:0x91561aaf:0x0]
      00000004:00080000:3.0:1524592151.996641:0:18109:0:(osp_object.c:1601:osp_object_create()) snx11205-OST0000-osc-MDT0001: Wrote last used FID: [0x240000400:0x91561aaf:0x0], index 0: 0
      

      Looking at the timestamps of the osp_object_create() calls, I found they completed in reverse order: the first write finished second, which caused the most recent write to be overwritten with stale data.

      00000100:00100000:3.0:1524592151.996703:0:18109:0:(service.c:2153:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt01_003:061cd4a0-cd22-5048-946a-d8f4ec53917a+6888:15972:x1598582492486880:12345-54@gni:101 Request procesed in 660us (669us total) trans 2937774968811 rc -19/-19
      

      Since the affected create was processed after obd_fail had been set, it finished with ENODEV, which is handled as a ptlrpc recoverable error, so it was resent after the connection was restored and recovery finished.

      Attachments

        Activity

          [LU-11020] Data corruption during MDT<>OST recovery
          pjones Peter Jones added a comment -

          Landed for 2.12


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32867/
          Subject: LU-11020 osp: fix race during lov_objids update
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 8cd4760536d7f423db87c67bdc8214f13ede3ca8


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32866/
          Subject: LU-11020 osd: use right sync
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: e2b4a521e260cb7b121dc51d4c29d4d47b7c1e1e


          gerrit Gerrit Updater added a comment -

          Alexey Lyashkov (c17817@cray.com) uploaded a new patch: https://review.whamcloud.com/32867
          Subject: LU-11020 osp: fix race when lov objid update
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: fc4c31ff634fd10a1d5a805426a1879e4b27bd3c


          gerrit Gerrit Updater added a comment -

          Alexey Lyashkov (c17817@cray.com) uploaded a new patch: https://review.whamcloud.com/32866
          Subject: LU-11020 osd: use right sync
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: b62e159963b02957f18e35cb9cb7997ed9d4dede


          shadow Alexey Lyashkov added a comment -

          Alex,

          Yes, osp_precreate_get_fid() assigns the right FID to the OSP object, but what about storing the last used FID to disk? That is done at the bottom of the osp_object_create() (osp_create) function.

                  /* Only need update last_used oid file, seq file will only be update
                   * during seq rollover */
                  if (fid_is_idif((last_fid)))
                          osi->osi_id = fid_idif_id(fid_seq(last_fid),
                                                    fid_oid(last_fid), fid_ver(last_fid));
                  else
                          osi->osi_id = fid_oid(last_fid);
                  osp_objid_buf_prep(&osi->osi_lb, &osi->osi_off,
                                     &osi->osi_id, d->opd_index);
          
                  rc = dt_record_write(env, d->opd_last_used_oid_file, &osi->osi_lb,
                                       &osi->osi_off, th);
          

          This code takes last_fid and stores it into a buffer, but the dt_record_write() calls can be reordered for different reasons - for example the fast / slow path in block reading, or the get_write_access code. A toy illustration of the reordering is sketched below.
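
          A minimal userspace sketch of the race, with hypothetical names (this is not Lustre code): oid assignment is serialized, as opd_pre_lock does in osp_precreate_get_fid(), but the "on-disk" update of the last used oid is not, so the thread that was assigned the smaller oid can finish its write last and leave a stale value behind.

          /* gcc -pthread race_demo.c; models the reordered dt_record_write() */
          #include <pthread.h>
          #include <stdint.h>
          #include <stdio.h>
          #include <unistd.h>

          static pthread_mutex_t assign_lock = PTHREAD_MUTEX_INITIALIZER;
          static unsigned long long next_oid = 0x91561aae;  /* next oid to hand out */
          static unsigned long long oid_on_disk;            /* models the last used oid file */

          static void *create_thread(void *arg)
          {
                  long delay_us = (long)(intptr_t)arg;
                  unsigned long long my_oid;

                  /* oid assignment is serialized, like osp_precreate_get_fid() */
                  pthread_mutex_lock(&assign_lock);
                  my_oid = next_oid++;
                  pthread_mutex_unlock(&assign_lock);

                  /* the on-disk update is NOT serialized; model a slow path
                   * (block read / get_write_access) with a delay */
                  usleep(delay_us);
                  oid_on_disk = my_oid;                     /* models dt_record_write() */

                  printf("thread wrote last oid 0x%llx\n", my_oid);
                  return NULL;
          }

          int main(void)
          {
                  pthread_t t1, t2;

                  /* first thread (smaller oid) hits the slow path, second the fast one */
                  pthread_create(&t1, NULL, create_thread, (void *)(intptr_t)100000);
                  usleep(1000);                             /* let t1 grab the smaller oid first */
                  pthread_create(&t2, NULL, create_thread, (void *)(intptr_t)0);
                  pthread_join(t1, NULL);
                  pthread_join(t2, NULL);

                  /* the stale, smaller oid ends up "on disk": objects above it
                   * will be destroyed as orphans after failover */
                  printf("last oid on disk: 0x%llx (expected 0x%llx)\n",
                         oid_on_disk, next_oid - 1);
                  return 0;
          }

          With these delays the run ends with 0x91561aae "on disk" even though 0x91561aaf was assigned and used later, which is exactly the pattern in the logs above.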


          bzzz Alex Zhuravlev added a comment -

          > I think the problem is that osp_object_create() has no protection against parallel execution

          Not sure what exactly you mean, but osp_precreate_get_fid() does serialise FID assignment using opd_pre_lock.


          shadow Alexey Lyashkov added a comment -

          I think the problem is that osp_object_create() has no protection against parallel execution: it allows the first thread to be delayed waiting for a read from disk while the second thread takes the fast path for that read.

          The solution can be easy - use a mutex to protect the whole write, the way i_mutex does, but that is a problem from a performance perspective. The other solution is to move the write buffer into the osd layer and provide a callback to fill the buffer. That allows less locking (especially for the ldiskfs case); a mutex will be needed for ZFS anyway, but it requires more work and a dt_record_write() prototype change. A minimal sketch of the mutex variant follows.
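
          Continuing the sketch above, the "mutex like i_mutex" variant keeps the oid assignment and the on-disk update in one critical section, so the writes hit disk in assignment order. This only models the idea described here, not the patch that actually landed; the names are the same hypothetical ones as in the earlier sketch.

          /* drop-in replacement for create_thread() in the sketch above */
          static void *create_thread_serialized(void *arg)
          {
                  long delay_us = (long)(intptr_t)arg;
                  unsigned long long my_oid;

                  pthread_mutex_lock(&assign_lock);
                  my_oid = next_oid++;
                  usleep(delay_us);            /* the slow path no longer matters... */
                  oid_on_disk = my_oid;        /* ...the update happens in assignment order */
                  pthread_mutex_unlock(&assign_lock);

                  printf("thread wrote last oid 0x%llx\n", my_oid);
                  return NULL;
          }

          The obvious cost is that every create now serializes on the on-disk update, which is why the osd-layer callback variant is more attractive: the buffer fill could happen under the locking the osd already takes for the write.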


          shadow Alexey Lyashkov added a comment -

          Some additional info:

          [root@snx11205n000 #test-dir.0]# lfs getstripe mdtest_tree.19.0/file.mdtest.19.2004
          mdtest_tree.19.0/file.mdtest.19.2004
          lmm_stripe_count: 4
          lmm_stripe_size: 1048576
          lmm_pattern: 1
          lmm_layout_gen: 0
          lmm_stripe_offset: 0
          obdidx objid objid group
          0 2438339247 0x91561aaf 0x240000400
          1 2443066678 0x919e3d36 0x2c0000400
          3 2429947051 0x90d60cab 0x280000400
          2 2428369630 0x90bdfade 0x200000400

          from kern log:

          Apr 24 12:52:23 snx11205n007 kernel: Lustre: snx11205-OST0003: deleting orphan objects from 0x280000400:2429947060 to 0x280000400:2429962999
          Apr 24 12:52:23 snx11205n005 kernel: Lustre: snx11205-OST0001: deleting orphan objects from 0x2c0000400:2443066688 to 0x2c0000400:2443071739
          Apr 24 12:52:23 snx11205n004 kernel: Lustre: snx11205-OST0000: deleting orphan objects from 0x240000400:2438339247 to 0x240000400:2438353837
          Apr 24 12:52:23 snx11205n006 kernel: Lustre: snx11205-OST0002: deleting orphan objects from 0x200000400:2428369639 to 0x200000400:2428385256

          Objects of the problem files:
          ost0 objects 2438339247 - 2438339257 <<< these objects were deleted

          $ ./a.out mdt1.pre_mnt.lov_objid
          91561aae
          919e3d3f
          90bdfae6
          90d60cb3
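
          For reference, a minimal sketch of such a dumper, assuming lov_objid is simply an array of little-endian 64-bit object ids indexed by OST position (that format matches the four values above, but treat it as an assumption, not a spec):

          /* gcc -o lov_objid_dump lov_objid_dump.c */
          #include <endian.h>
          #include <stdint.h>
          #include <stdio.h>

          int main(int argc, char **argv)
          {
                  if (argc != 2) {
                          fprintf(stderr, "usage: %s <lov_objid file>\n", argv[0]);
                          return 1;
                  }

                  FILE *f = fopen(argv[1], "rb");
                  if (!f) {
                          perror(argv[1]);
                          return 1;
                  }

                  uint64_t id;
                  /* one 64-bit objid per OST index */
                  while (fread(&id, sizeof(id), 1, f) == 1)
                          printf("%llx\n", (unsigned long long)le64toh(id));

                  fclose(f);
                  return 0;
          }

          The first entry, 0x91561aae, is one less than the 0x91561aaf object that file.mdtest.19.2004 references on OST0000, which is why orphan cleanup started deleting from 2438339247.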


          People

            Assignee: shadow Alexey Lyashkov
            Reporter: shadow Alexey Lyashkov
            Votes: 0
            Watchers: 4
