
Data corruption during MDT<>OST recovery

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.12.0
    • Affects Version/s: Lustre 2.5.0, Lustre 2.11.0
    • Labels: None
    • Environment: found in IEEL3 testing, but the bug appears to exist from 2.5.0 through master
    • Severity: 3
    • 9223372036854775807

    Description

      Data corruption was found during an mdtest run with MDT failover testing in parallel.
      The visible effect is that some files lack their own OST objects after the delete-orphan phase.
      This can be the result of unreplayed requests or of a wrong read on the OST after failover.
      Originally testing started with power-loss failover tests, but the problem has been replicated with a normal shutdown + failover, which allowed grabbing logs during the failover / recovery.
      Excerpts from the logs follow.
      The server takes a request from the client:

      00000100:00100000:3.0:1524592151.996038:0:18109:0:(service.c:1955:ptlrpc_server_handle_req_in()) got req x1598582492486880
      00000100:00100000:3.0:1524592151.996042:0:18109:0:(nrs_fifo.c:179:nrs_fifo_req_get()) NRS start fifo request from 12345-54@gni, seq: 1564278792
      00000100:00100000:3.0:1524592151.996043:0:18109:0:(service.c:2103:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt01_003:061cd4a0-cd22-5048-946a-d8f4ec53917a+6881:15972:x1598582492486880:12345-54@gni:101
      

      The server starts generating ENODEV once obd_fail is set during unmount.

      00000100:00080000:1.0:1524592151.996485:0:2924:0:(niobuf.c:583:ptlrpc_send_reply()) sending ENODEV from failed obd 16
      

      The server issues a sync as the next step of failover, and last_committed is updated.

      00010000:00080000:3.0:1524592151.996589:0:5591:0:(ldlm_lib.c:2492:target_committed_to_req()) 0c6ce748-6618-ff9d-ad84-6e6f761558b2 last_committed 2937774893859, transno 2937774968804, xid 1598584386635888
      

      — this create is on disk:

      00000004:00080000:13.0:1524592151.996630:0:8166:0:(osp_precreate.c:1476:osp_precreate_get_fid()) next [0x240000400:0x91561aae:0x0] last created [0x240000400:0x91562c9d:0x0]
      00000004:00080000:13.0:1524592151.996632:0:8166:0:(osp_object.c:1547:osp_object_create()) last fid for osp_object is [0x240000400:0x91561aae:0x0]
      00000004:00080000:13.0:1524592151.996867:0:8166:0:(osp_object.c:1601:osp_object_create()) snx11205-OST0000-osc-MDT0001: Wrote last used FID: [0x240000400:0x91561aae:0x0], index 0: 0
      

      – but this one is not:

      00000004:00080000:3.0:1524592151.996637:0:18109:0:(osp_precreate.c:1476:osp_precreate_get_fid()) next [0x240000400:0x91561aaf:0x0] last created [0x240000400:0x91562c9d:0x0]
      00000004:00080000:3.0:1524592151.996639:0:18109:0:(osp_object.c:1547:osp_object_create()) last fid for osp_object is [0x240000400:0x91561aaf:0x0]
      00000004:00080000:3.0:1524592151.996641:0:18109:0:(osp_object.c:1601:osp_object_create()) snx11205-OST0000-osc-MDT0001: Wrote last used FID: [0x240000400:0x91561aaf:0x0], index 0: 0
      

      Looking at the timestamps of the osp_object_create() calls, I found they completed in reverse order: the first write finished second, which caused the most recent write to be overwritten with stale data.

      00000100:00100000:3.0:1524592151.996703:0:18109:0:(service.c:2153:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt01_003:061cd4a0-cd22-5048-946a-d8f4ec53917a+6888:15972:x1598582492486880:12345-54@gni:101 Request procesed in 660us (669us total) trans 2937774968811 rc -19/-19
      

      Since the affected create was processed after obd_fail had been set, it finished with ENODEV, which is handled as a ptlrpc recoverable error, so it was resent after the connection was restored and recovery finished.

      Attachments

        Activity

          [LU-11020] Data corruption during MDT<>OST recovery
          pjones Peter Jones added a comment -

          Landed for 2.12


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32867/
          Subject: LU-11020 osp: fix race during lov_objids update
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 8cd4760536d7f423db87c67bdc8214f13ede3ca8


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32866/
          Subject: LU-11020 osd: use right sync
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: e2b4a521e260cb7b121dc51d4c29d4d47b7c1e1e


          gerrit Gerrit Updater added a comment -

          Alexey Lyashkov (c17817@cray.com) uploaded a new patch: https://review.whamcloud.com/32867
          Subject: LU-11020 osp: fix race when lov objid update
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: fc4c31ff634fd10a1d5a805426a1879e4b27bd3c


          gerrit Gerrit Updater added a comment -

          Alexey Lyashkov (c17817@cray.com) uploaded a new patch: https://review.whamcloud.com/32866
          Subject: LU-11020 osd: use right sync
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: b62e159963b02957f18e35cb9cb7997ed9d4dede


          shadow Alexey Lyashkov added a comment -

          Alex,

          Yes, osp_precreate_get_fid() assigns the right FID to the OSP object, but what about storing the last used FID to disk? That is done at the bottom of the osp_object_create() (osp_create) function.

                  /* Only need update last_used oid file, seq file will only be update
                   * during seq rollover */
                  if (fid_is_idif((last_fid)))
                          osi->osi_id = fid_idif_id(fid_seq(last_fid),
                                                    fid_oid(last_fid), fid_ver(last_fid));
                  else
                          osi->osi_id = fid_oid(last_fid);
                  osp_objid_buf_prep(&osi->osi_lb, &osi->osi_off,
                                     &osi->osi_id, d->opd_index);
          
                  rc = dt_record_write(env, d->opd_last_used_oid_file, &osi->osi_lb,
                                       &osi->osi_off, th);
          

          This code takes last_fid and stores it into a buffer, but the dt_record_write() calls can be reordered for different reasons - for example the fast / slow path in block reading, or the get_write_access code. A toy illustration of the reordering is sketched below.
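
          A minimal userspace sketch of the race, with hypothetical names (this is not Lustre code): oid assignment is serialized, as opd_pre_lock does in osp_precreate_get_fid(), but the "on-disk" update of the last used oid is not, so the thread that was assigned the smaller oid can finish its write last and leave a stale value behind.

          /* gcc -pthread race_demo.c; models the reordered dt_record_write() */
          #include <pthread.h>
          #include <stdint.h>
          #include <stdio.h>
          #include <unistd.h>

          static pthread_mutex_t assign_lock = PTHREAD_MUTEX_INITIALIZER;
          static unsigned long long next_oid = 0x91561aae;  /* next oid to hand out */
          static unsigned long long oid_on_disk;            /* models the last used oid file */

          static void *create_thread(void *arg)
          {
                  long delay_us = (long)(intptr_t)arg;
                  unsigned long long my_oid;

                  /* oid assignment is serialized, like osp_precreate_get_fid() */
                  pthread_mutex_lock(&assign_lock);
                  my_oid = next_oid++;
                  pthread_mutex_unlock(&assign_lock);

                  /* the on-disk update is NOT serialized; model a slow path
                   * (block read / get_write_access) with a delay */
                  usleep(delay_us);
                  oid_on_disk = my_oid;                     /* models dt_record_write() */

                  printf("thread wrote last oid 0x%llx\n", my_oid);
                  return NULL;
          }

          int main(void)
          {
                  pthread_t t1, t2;

                  /* first thread (smaller oid) hits the slow path, second the fast one */
                  pthread_create(&t1, NULL, create_thread, (void *)(intptr_t)100000);
                  usleep(1000);                             /* let t1 grab the smaller oid first */
                  pthread_create(&t2, NULL, create_thread, (void *)(intptr_t)0);
                  pthread_join(t1, NULL);
                  pthread_join(t2, NULL);

                  /* the stale, smaller oid ends up "on disk": objects above it
                   * will be destroyed as orphans after failover */
                  printf("last oid on disk: 0x%llx (expected 0x%llx)\n",
                         oid_on_disk, next_oid - 1);
                  return 0;
          }

          With these delays the run ends with 0x91561aae "on disk" even though 0x91561aaf was assigned and used later, which is exactly the pattern in the logs above.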


          bzzz Alex Zhuravlev added a comment -

          > I think the problem is that osp_object_create() has no protection against parallel execution

          Not sure what exactly you mean, but osp_precreate_get_fid() does serialise FID assignment using opd_pre_lock.


          shadow Alexey Lyashkov added a comment -

          I think the problem is that osp_object_create() has no protection against parallel execution: it allows the first thread to be delayed waiting for a read from disk while the second thread takes the fast path for that read.

          The solution can be easy - use a mutex to protect the whole write, the way i_mutex does, but that is a problem from a performance perspective. The other solution is to move the write buffer into the osd layer and provide a callback to fill the buffer. That allows less locking (especially for the ldiskfs case); a mutex will be needed for ZFS anyway, but it requires more work and a dt_record_write() prototype change. A minimal sketch of the mutex variant follows.
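
          Continuing the sketch above, the "mutex like i_mutex" variant keeps the oid assignment and the on-disk update in one critical section, so the writes hit disk in assignment order. This only models the idea described here, not the patch that actually landed; the names are the same hypothetical ones as in the earlier sketch.

          /* drop-in replacement for create_thread() in the sketch above */
          static void *create_thread_serialized(void *arg)
          {
                  long delay_us = (long)(intptr_t)arg;
                  unsigned long long my_oid;

                  pthread_mutex_lock(&assign_lock);
                  my_oid = next_oid++;
                  usleep(delay_us);            /* the slow path no longer matters... */
                  oid_on_disk = my_oid;        /* ...the update happens in assignment order */
                  pthread_mutex_unlock(&assign_lock);

                  printf("thread wrote last oid 0x%llx\n", my_oid);
                  return NULL;
          }

          The obvious cost is that every create now serializes on the on-disk update, which is why the osd-layer callback variant is more attractive: the buffer fill could happen under the locking the osd already takes for the write.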


          shadow Alexey Lyashkov added a comment -

          Some additional info:

          [root@snx11205n000 #test-dir.0]# lfs getstripe mdtest_tree.19.0/file.mdtest.19.2004
          mdtest_tree.19.0/file.mdtest.19.2004
          lmm_stripe_count: 4
          lmm_stripe_size: 1048576
          lmm_pattern: 1
          lmm_layout_gen: 0
          lmm_stripe_offset: 0
          obdidx objid objid group
          0 2438339247 0x91561aaf 0x240000400
          1 2443066678 0x919e3d36 0x2c0000400
          3 2429947051 0x90d60cab 0x280000400
          2 2428369630 0x90bdfade 0x200000400

          from kern log:

          Apr 24 12:52:23 snx11205n007 kernel: Lustre: snx11205-OST0003: deleting orphan objects from 0x280000400:2429947060 to 0x280000400:2429962999
          Apr 24 12:52:23 snx11205n005 kernel: Lustre: snx11205-OST0001: deleting orphan objects from 0x2c0000400:2443066688 to 0x2c0000400:2443071739
          Apr 24 12:52:23 snx11205n004 kernel: Lustre: snx11205-OST0000: deleting orphan objects from 0x240000400:2438339247 to 0x240000400:2438353837
          Apr 24 12:52:23 snx11205n006 kernel: Lustre: snx11205-OST0002: deleting orphan objects from 0x200000400:2428369639 to 0x200000400:2428385256

          Objects of the problem files:
          ost0 objects 2438339247 - 2438339257 <<< these objects were deleted

          $ ./a.out mdt1.pre_mnt.lov_objid
          91561aae
          919e3d3f
          90bdfae6
          90d60cb3
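
          For reference, a minimal sketch of such a dumper, assuming lov_objid is simply an array of little-endian 64-bit object ids indexed by OST position (that format matches the four values above, but treat it as an assumption, not a spec):

          /* gcc -o lov_objid_dump lov_objid_dump.c */
          #include <endian.h>
          #include <stdint.h>
          #include <stdio.h>

          int main(int argc, char **argv)
          {
                  if (argc != 2) {
                          fprintf(stderr, "usage: %s <lov_objid file>\n", argv[0]);
                          return 1;
                  }

                  FILE *f = fopen(argv[1], "rb");
                  if (!f) {
                          perror(argv[1]);
                          return 1;
                  }

                  uint64_t id;
                  /* one 64-bit objid per OST index */
                  while (fread(&id, sizeof(id), 1, f) == 1)
                          printf("%llx\n", (unsigned long long)le64toh(id));

                  fclose(f);
                  return 0;
          }

          The first entry, 0x91561aae, is one less than the 0x91561aaf object that file.mdtest.19.2004 references on OST0000, which is why orphan cleanup started deleting from 2438339247.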


          People

            Assignee: shadow Alexey Lyashkov
            Reporter: shadow Alexey Lyashkov
            Votes: 0
            Watchers: 4
