[LU-11020] Data corruption during MDT<>OST recovery Created: 15/May/18  Updated: 15/Oct/19  Resolved: 17/Nov/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0, Lustre 2.11.0
Fix Version/s: Lustre 2.12.0

Type: Bug Priority: Critical
Reporter: Alexey Lyashkov Assignee: Alexey Lyashkov
Resolution: Fixed Votes: 0
Labels: None
Environment:

Found in IEEL3 testing, but the bug appears to exist from 2.5.0 through master.


Attachments: HTML File lost1     File mdt1.pre_mnt.lov_objid    
Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Data corruption was found during an mdtest run with MDT failover testing in parallel.
The visible effect is that some files lack their own OST objects after the delete-orphan phase.
This can be the result of unreplayed requests or a wrong OST read after failover.
Originally testing started with power-loss failover tests, but the problem was also replicated with a normal shutdown + failover, which allowed grabbing logs during failover / recovery.
Cut from the logs.
The server takes a request from the client:

00000100:00100000:3.0:1524592151.996038:0:18109:0:(service.c:1955:ptlrpc_server_handle_req_in()) got req x1598582492486880
00000100:00100000:3.0:1524592151.996042:0:18109:0:(nrs_fifo.c:179:nrs_fifo_req_get()) NRS start fifo request from 12345-54@gni, seq: 1564278792
00000100:00100000:3.0:1524592151.996043:0:18109:0:(service.c:2103:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt01_003:061cd4a0-cd22-5048-946a-d8f4ec53917a+6881:15972:x1598582492486880:12345-54@gni:101

The server starts generating ENODEV once obd_fail is set during unmount.

00000100:00080000:1.0:1524592151.996485:0:2924:0:(niobuf.c:583:ptlrpc_send_reply()) sending ENODEV from failed obd 16

The server issues a sync as the next step of the failover, and last_committed is updated.

00010000:00080000:3.0:1524592151.996589:0:5591:0:(ldlm_lib.c:2492:target_committed_to_req()) 0c6ce748-6618-ff9d-ad84-6e6f761558b2 last_committed 2937774893859, transno 2937774968804, xid 1598584386635888

— this create is on disk:

00000004:00080000:13.0:1524592151.996630:0:8166:0:(osp_precreate.c:1476:osp_precreate_get_fid()) next [0x240000400:0x91561aae:0x0] last created [0x240000400:0x91562c9d:0x0]
00000004:00080000:13.0:1524592151.996632:0:8166:0:(osp_object.c:1547:osp_object_create()) last fid for osp_object is [0x240000400:0x91561aae:0x0]
00000004:00080000:13.0:1524592151.996867:0:8166:0:(osp_object.c:1601:osp_object_create()) snx11205-OST0000-osc-MDT0001: Wrote last used FID: [0x240000400:0x91561aae:0x0], index 0: 0

— but this one is not:

00000004:00080000:3.0:1524592151.996637:0:18109:0:(osp_precreate.c:1476:osp_precreate_get_fid()) next [0x240000400:0x91561aaf:0x0] last created [0x240000400:0x91562c9d:0x0]
00000004:00080000:3.0:1524592151.996639:0:18109:0:(osp_object.c:1547:osp_object_create()) last fid for osp_object is [0x240000400:0x91561aaf:0x0]
00000004:00080000:3.0:1524592151.996641:0:18109:0:(osp_object.c:1601:osp_object_create()) snx11205-OST0000-osc-MDT0001: Wrote last used FID: [0x240000400:0x91561aaf:0x0], index 0: 0

Looking at the timestamps of the osp_object_create() calls, I found they completed in reverse order: the write that started first finished second, which caused the newest value to be overwritten with stale data.

00000100:00100000:3.0:1524592151.996703:0:18109:0:(service.c:2153:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt01_003:061cd4a0-cd22-5048-946a-d8f4ec53917a+6888:15972:x1598582492486880:12345-54@gni:101 Request procesed in 660us (669us total) trans 2937774968811 rc -19/-19

As the affected create was processed after obd_fail was set, it finished with ENODEV, which is handled as a ptlrpc-recoverable error, so it was resent after the connection was restored and recovery finished.
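
To make the sequence easier to follow, here is a minimal standalone sketch of the race (hypothetical names, pthreads in place of kernel primitives - not Lustre code): FID assignment is serialized, but the following "last used FID" write is not, so the thread that gets delayed (e.g. waiting on a block read) overwrites the record with an older value.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t pre_lock = PTHREAD_MUTEX_INITIALIZER; /* stands in for opd_pre_lock */
static unsigned long next_fid = 0x91561aae;   /* next object id to hand out */
static unsigned long last_used_on_disk;       /* stands in for the last-used-FID record */
static int first_fid_taken;                   /* protected by pre_lock */

static void *slow_create(void *arg)           /* plays the role of thread 8166 in the logs */
{
        pthread_mutex_lock(&pre_lock);
        unsigned long fid = next_fid++;       /* gets 0x91561aae first... */
        first_fid_taken = 1;
        pthread_mutex_unlock(&pre_lock);

        usleep(1000);                         /* ...is delayed, e.g. by a slow block read... */
        last_used_on_disk = fid;              /* ...and writes the record last: stale value wins */
        return NULL;
}

static void *fast_create(void *arg)           /* plays the role of thread 18109 in the logs */
{
        unsigned long fid = 0;

        for (;;) {                            /* wait until the slow thread took its FID */
                pthread_mutex_lock(&pre_lock);
                if (first_fid_taken) {
                        fid = next_fid++;     /* gets 0x91561aaf... */
                        pthread_mutex_unlock(&pre_lock);
                        break;
                }
                pthread_mutex_unlock(&pre_lock);
        }
        last_used_on_disk = fid;              /* ...and writes it to the record first */
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;

        pthread_create(&t1, NULL, slow_create, NULL);
        pthread_create(&t2, NULL, fast_create, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        /* prints 0x91561aae: the record no longer covers 0x91561aaf, so after
         * failover that object looks like an orphan and gets deleted */
        printf("last used on disk: 0x%lx\n", last_used_on_disk);
        return 0;
}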



 Comments   
Comment by Alexey Lyashkov [ 15/May/18 ]

some additional info

[root@snx11205n000 #test-dir.0]# lfs getstripe mdtest_tree.19.0/file.mdtest.19.2004
mdtest_tree.19.0/file.mdtest.19.2004
lmm_stripe_count: 4
lmm_stripe_size: 1048576
lmm_pattern: 1
lmm_layout_gen: 0
lmm_stripe_offset: 0
obdidx objid objid group
0 2438339247 0x91561aaf 0x240000400
1 2443066678 0x919e3d36 0x2c0000400
3 2429947051 0x90d60cab 0x280000400
2 2428369630 0x90bdfade 0x200000400

From the kernel log:

Apr 24 12:52:23 snx11205n007 kernel: Lustre: snx11205-OST0003: deleting orphan objects from 0x280000400:2429947060 to 0x280000400:2429962999
Apr 24 12:52:23 snx11205n005 kernel: Lustre: snx11205-OST0001: deleting orphan objects from 0x2c0000400:2443066688 to 0x2c0000400:2443071739
Apr 24 12:52:23 snx11205n004 kernel: Lustre: snx11205-OST0000: deleting orphan objects from 0x240000400:2438339247 to 0x240000400:2438353837
Apr 24 12:52:23 snx11205n006 kernel: Lustre: snx11205-OST0002: deleting orphan objects from 0x200000400:2428369639 to 0x200000400:2428385256

The problem file's objects:
ost0 objects 2438339247 - 2438339257 <<< these objects were deleted

$ ./a.out mdt1.pre_mnt.lov_objid
91561aae
919e3d3f
90bdfae6
90d60cb3
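
(Tying the numbers together: 0x91561aae from the lov_objid dump is 2438339246, one less than the object that stripe 0 of file.mdtest.19.2004 actually uses, 2438339247 = 0x91561aaf. After failover the MDT therefore reports 2438339246 as the last used id on OST0000, and the OST deletes everything above it as orphans, starting exactly at 2438339247, as seen in the kernel log above.)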

Comment by Alexey Lyashkov [ 15/May/18 ]

I think the problem is that osp_object_create() has no protection against parallel execution: the first thread can be delayed waiting for a read from disk, while the second thread takes the fast path for that read.

The solution could be simple - use a mutex to protect the whole write, as i_mutex does - but that is a problem from a performance perspective. The other solution is to move the write buffer down to the osd layer and have a callback fill the buffer that is provided.
That allows less locking (especially for the ldiskfs case); a mutex will still be needed for ZFS anyway, but it requires more work and a change to the dt_record_write prototype.
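
A rough sketch of the first (mutex) option, with hypothetical names and pthreads standing in for kernel locking - not the patch that eventually landed: keeping the FID assignment and the last-used-FID write inside one critical section means the on-disk record can only move forward.

#include <pthread.h>

/* Hypothetical per-device state; not the real osp/osd structures. */
struct osp_dev_sketch {
        pthread_mutex_t  last_fid_lock;   /* one lock covering assignment + write */
        unsigned long    next_fid;
        unsigned long    last_used_on_disk;
};

void create_object_sketch(struct osp_dev_sketch *d)
{
        pthread_mutex_lock(&d->last_fid_lock);
        unsigned long fid = d->next_fid++;     /* the osp_precreate_get_fid() step */
        d->last_used_on_disk = fid;            /* the dt_record_write() step, still locked,
                                                * so a delayed creator can no longer write a
                                                * stale value over a newer one */
        pthread_mutex_unlock(&d->last_fid_lock);
}

The downside is exactly the performance concern above: the lock is now held across the record write, which is why the second option (preparing the buffer in the osd layer and filling it via a callback at write time) looks attractive despite needing a dt_record_write prototype change.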

Comment by Alex Zhuravlev [ 15/May/18 ]

>  I think the problem is that osp_object_create() has no protection against parallel execution

not sure what exactly you mean, but osp_precreate_get_fid() does serialise FID assignment using opd_pre_lock.

 

Comment by Alexey Lyashkov [ 15/May/18 ]

Alex,

Yes, osp_precreate_get_fid() gets the right FID assignment for the osp object. But what about storing the last used FID to disk? That is at the bottom of the osp_object_create (osp_create) function:

        /* Only need update last_used oid file, seq file will only be update
         * during seq rollover */
        if (fid_is_idif((last_fid)))
                osi->osi_id = fid_idif_id(fid_seq(last_fid),
                                          fid_oid(last_fid), fid_ver(last_fid));
        else
                osi->osi_id = fid_oid(last_fid);
        osp_objid_buf_prep(&osi->osi_lb, &osi->osi_off,
                           &osi->osi_id, d->opd_index);

        rc = dt_record_write(env, d->opd_last_used_oid_file, &osi->osi_lb,
                             &osi->osi_off, th);

This code takes last_fid and stores it into a buffer, but the dt_record_write() calls can be reordered for different reasons - such as the fast vs. slow path in block reading, or the get_write_access code.
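
The same window, compressed into hypothetical stand-ins for the calls quoted above (not the real osd API): the value is fixed when the buffer is prepared, but nothing orders the record writes of two concurrent creators.

#include <unistd.h>

/* Hypothetical stand-ins for osp_objid_buf_prep() / dt_record_write(). */
struct objid_buf_sketch { unsigned long id; };

static unsigned long disk_record;                    /* the lov_objid-style record */

static void store_last_used_sketch(unsigned long last_fid, int delayed)
{
        struct objid_buf_sketch lb = { .id = last_fid };  /* value captured at prep time */

        /* <-- the window: a delay here (slow block read, get_write_access) lets
         *     another thread prep a newer FID and finish its write first */
        if (delayed)
                usleep(1000);

        disk_record = lb.id;                         /* this write is not ordered against
                                                      * the other creator's write */
}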

Comment by Gerrit Updater [ 24/Jul/18 ]

Alexey Lyashkov (c17817@cray.com) uploaded a new patch: https://review.whamcloud.com/32866
Subject: LU-11020 osd: use right sync
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b62e159963b02957f18e35cb9cb7997ed9d4dede

Comment by Gerrit Updater [ 24/Jul/18 ]

Alexey Lyashkov (c17817@cray.com) uploaded a new patch: https://review.whamcloud.com/32867
Subject: LU-11020 osp: fix race when lov objid update
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: fc4c31ff634fd10a1d5a805426a1879e4b27bd3c

Comment by Gerrit Updater [ 01/Oct/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32866/
Subject: LU-11020 osd: use right sync
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e2b4a521e260cb7b121dc51d4c29d4d47b7c1e1e

Comment by Gerrit Updater [ 17/Nov/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32867/
Subject: LU-11020 osp: fix race during lov_objids update
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8cd4760536d7f423db87c67bdc8214f13ede3ca8

Comment by Peter Jones [ 17/Nov/18 ]

Landed for 2.12
