[LU-11020] Data corruption during MDT<>OST recovery Created: 15/May/18 Updated: 15/Oct/19 Resolved: 17/Nov/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0, Lustre 2.11.0 |
| Fix Version/s: | Lustre 2.12.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Alexey Lyashkov | Assignee: | Alexey Lyashkov |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
found in IEEL3 testing, but the bug appears to exist from 2.5.0 through master |
||
| Attachments: |
|
||||
| Issue Links: |
|
||||
| Severity: | 3 | ||||
| Rank (Obsolete): | 9223372036854775807 | ||||
| Description |
|
Data corruption was found during an mdtest run with MDT failover testing running in parallel.

00000100:00100000:3.0:1524592151.996038:0:18109:0:(service.c:1955:ptlrpc_server_handle_req_in()) got req x1598582492486880
00000100:00100000:3.0:1524592151.996042:0:18109:0:(nrs_fifo.c:179:nrs_fifo_req_get()) NRS start fifo request from 12345-54@gni, seq: 1564278792
00000100:00100000:3.0:1524592151.996043:0:18109:0:(service.c:2103:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt01_003:061cd4a0-cd22-5048-946a-d8f4ec53917a+6881:15972:x1598582492486880:12345-54@gni:101

The server starts generating ENODEV once obd_fail is set during unmount:

00000100:00080000:1.0:1524592151.996485:0:2924:0:(niobuf.c:583:ptlrpc_send_reply()) sending ENODEV from failed obd 16

The server issues a sync as the next step of failover, and last_committed is updated:

00010000:00080000:3.0:1524592151.996589:0:5591:0:(ldlm_lib.c:2492:target_committed_to_req()) 0c6ce748-6618-ff9d-ad84-6e6f761558b2 last_committed 2937774893859, transno 2937774968804, xid 1598584386635888

This create is on disk:

00000004:00080000:13.0:1524592151.996630:0:8166:0:(osp_precreate.c:1476:osp_precreate_get_fid()) next [0x240000400:0x91561aae:0x0] last created [0x240000400:0x91562c9d:0x0]
00000004:00080000:13.0:1524592151.996632:0:8166:0:(osp_object.c:1547:osp_object_create()) last fid for osp_object is [0x240000400:0x91561aae:0x0]
00000004:00080000:13.0:1524592151.996867:0:8166:0:(osp_object.c:1601:osp_object_create()) snx11205-OST0000-osc-MDT0001: Wrote last used FID: [0x240000400:0x91561aae:0x0], index 0: 0

But this one is not:

00000004:00080000:3.0:1524592151.996637:0:18109:0:(osp_precreate.c:1476:osp_precreate_get_fid()) next [0x240000400:0x91561aaf:0x0] last created [0x240000400:0x91562c9d:0x0]
00000004:00080000:3.0:1524592151.996639:0:18109:0:(osp_object.c:1547:osp_object_create()) last fid for osp_object is [0x240000400:0x91561aaf:0x0]
00000004:00080000:3.0:1524592151.996641:0:18109:0:(osp_object.c:1601:osp_object_create()) snx11205-OST0000-osc-MDT0001: Wrote last used FID: [0x240000400:0x91561aaf:0x0], index 0: 0

Looking at the timestamps of the osp_object_create() calls, I found they were executed in reverse order: the first write finished second, which caused the most recent write to be overwritten with stale data.

00000100:00100000:3.0:1524592151.996703:0:18109:0:(service.c:2153:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt01_003:061cd4a0-cd22-5048-946a-d8f4ec53917a+6888:15972:x1598582492486880:12345-54@gni:101 Request procesed in 660us (669us total) trans 2937774968811 rc -19/-19

Since the affected create was processed after obd_fail was set, it finished with ENODEV, which is handled as a ptlrpc-recoverable error, so the request was resent after the connection was restored and recovery finished. |
| Comments |
| Comment by Alexey Lyashkov [ 15/May/18 ] |
|
Some additional info:

    [root@snx11205n000 #test-dir.0]# lfs getstripe mdtest_tree.19.0/file.mdtest.19.2004

From the kernel log:

    Apr 24 12:52:23 snx11205n007 kernel: Lustre: snx11205-OST0003: deleting orphan objects from 0x280000400:2429947060 to 0x280000400:2429962999

Problem file objects:

    $ ./a.out mdt1.pre_mnt.lov_objid
 |
| Comment by Alexey Lyashkov [ 15/May/18 ] |
|
I think the problem is that osp_object_create() has no protection against parallel execution: the first thread can be delayed waiting for a read from disk while a second thread takes the fast path for the same read. An easy solution would be to protect the whole write with a mutex, like i_mutex does, but that is a problem from a performance perspective. Another solution is to move the write buffer down into the osd layer and provide a callback to fill the buffer. |
| Comment by Alex Zhuravlev [ 15/May/18 ] |
|
> I think problem is osp_create_object isn't have protection against parallel execution

Not sure what exactly you mean, but osp_precreate_get_fid() does serialise FID assignment using opd_pre_lock.
|
| Comment by Alexey Lyashkov [ 15/May/18 ] |
|
Alex, yes, osp_precreate_get_fid() does assign the right FID to the OSP object. But what about storing the last used FID to disk? That happens at the bottom of osp_object_create() (osp_create):

    /* Only need update last_used oid file, seq file will only be update
     * during seq rollover */
    if (fid_is_idif(last_fid))
        osi->osi_id = fid_idif_id(fid_seq(last_fid),
                                  fid_oid(last_fid), fid_ver(last_fid));
    else
        osi->osi_id = fid_oid(last_fid);
    osp_objid_buf_prep(&osi->osi_lb, &osi->osi_off,
                       &osi->osi_id, d->opd_index);
    rc = dt_record_write(env, d->opd_last_used_oid_file, &osi->osi_lb,
                         &osi->osi_off, th);

This code takes last_fid and stores it into a buffer, but the dt_record_write() calls from different threads can be reordered for various reasons, such as the fast vs. slow path in block reading or the get_write_access code. |
| Comment by Gerrit Updater [ 24/Jul/18 ] |
|
Alexey Lyashkov (c17817@cray.com) uploaded a new patch: https://review.whamcloud.com/32866 |
| Comment by Gerrit Updater [ 24/Jul/18 ] |
|
Alexey Lyashkov (c17817@cray.com) uploaded a new patch: https://review.whamcloud.com/32867 |
| Comment by Gerrit Updater [ 01/Oct/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32866/ |
| Comment by Gerrit Updater [ 17/Nov/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32867/ |
| Comment by Peter Jones [ 17/Nov/18 ] |
|
Landed for 2.12 |