[LU-15300] mirror resync can cause EIO to unrelated applications Created: 01/Dec/21 Updated: 01/Dec/23 Resolved: 22/Apr/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.3 |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Alex Zhuravlev | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | failing_tests, flr-improvement | ||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||
| Epic Link: | FLR tech debt review | ||||||||||||
| Description |
|
I noticed that sometimes sanity-flr/200 hits "checksum error"; here are some findings. First of all, the checksum error is caused by an incomplete preceding lfs mirror resync command (which doesn't return an error in some cases). In turn, the EIO that lfs hits is caused by the AS_EIO flag on the corresponding mapping. AS_EIO is set because OST_WRITE gets ESTALE due to an incorrect layout version (the client's version is smaller than the one on the OST). So far I've traced all this to a race between two processes:
I will cite the logs in a subsequent comment. |
| Comments |
| Comment by Alex Zhuravlev [ 01/Dec/21 ] |
|
on behalf of lfs:
00000004:80000000:1.0:1638375473.821528:0:5874:0:(lod_object.c:7859:lod_declare_update_write_pending()) [0x200000401:0x9:0x0]: found primary 1
00000004:80000000:1.0:1638375473.821530:0:5874:0:(lod_object.c:7641:lod_prepare_resync()) [0x200000401:0x9:0x0]: instantiate all stale components in [0x0, 0x1000000)
00000004:00080000:1.0:1638375473.821975:0:5874:0:(osp_sync.c:1652:osp_sync_add_commit_cb()) lustre-OST0001-osc-MDT0000: add commit cb at 384577134151ns, next at 327591490501ns, rc = 0
00010000:00010000:1.0:1638375473.822823:0:5874:0:(ldlm_lock.c:675:ldlm_add_bl_work_item()) ### lock incompatible; sending blocking AST. ns: mdt-lustre-MDT0000_UUID lock: 00000000e2a9dd24/0x1c7d5bd8e17632f1 lrc: 2/0,1 mode: EX/EX res: [0x200000401:0x9:0x0].0x0 bits 0x8/0x0 rrc: 7 type: IBT gid 0 flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 5874 timeout: 0 lvb_type: 0
00010000:00010000:1.0:1638375473.822833:0:5874:0:(ldlm_request.c:373:ldlm_blocking_ast_nocheck()) ### Lock still has references, will be cancelled later ns: mdt-lustre-MDT0000_UUID lock: 00000000e2a9dd24/0x1c7d5bd8e17632f1 lrc: 3/0,1 mode: EX/EX res: [0x200000401:0x9:0x0].0x0 bits 0x8/0x0 rrc: 7 type: IBT gid 0 flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 5874 timeout: 0 lvb_type: 0
then multiop gets the lock and proceeds:
00010000:00010000:0.0:1638375473.854743:0:7394:0:(ldlm_lock.c:766:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(CW) ns: mdt-lustre-MDT0000_UUID lock: 00000000b7dfcfcc/0x1c7d5bd8e176333e lrc: 3/0,1 mode: --/CW res: [0x200000401:0x9:0x0].0x0 bits 0x0/0x0 rrc: 8 type: IBT gid 0 flags: 0x40000000000000 nid: local remote: 0x0 expref: -99 pid: 7394 timeout: 0 lvb_type: 0
00010000:00010000:0.0:1638375473.854748:0:7394:0:(ldlm_lock.c:1064:ldlm_granted_list_add_lock()) ### About to add lock: ns: mdt-lustre-MDT0000_UUID lock: 00000000b7dfcfcc/0x1c7d5bd8e176333e lrc: 3/0,1 mode: CW/CW res: [0x200000401:0x9:0x0].0x0 bits 0xd/0x0 rrc: 8 type: IBT gid 0 flags: 0x50210000000000 nid: local remote: 0x0 expref: -99 pid: 7394 timeout: 0 lvb_type: 0
00010000:00010000:0.0:1638375473.854752:0:7394:0:(ldlm_request.c:530:ldlm_cli_enqueue_local()) ### client-side local enqueue handler, new lock created ns: mdt-lustre-MDT0000_UUID lock: 00000000b7dfcfcc/0x1c7d5bd8e176333e lrc: 3/0,1 mode: CW/CW res: [0x200000401:0x9:0x0].0x0 bits 0xd/0x0 rrc: 8 type: IBT gid 0 flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 7394 timeout: 0 lvb_type: 0
00000004:00200000:0.0:1638375473.854762:0:7394:0:(mdt_handler.c:912:mdt_pack_attr2body()) [0x200000401:0x9:0x0]: returning size 9159104
but for some reason it's still the old generation 70:
00000002:00010000:0.0:1638375473.856175:0:21772:0:(mdc_locks.c:852:mdc_finish_enqueue()) ### layout lock returned by: open, lvb_len: 1704 ns: lustre-MDT0000-mdc-ffff987d0c6e0000 lock: 000000002dd382c3/0x1c7d5bd8e17632f8 lrc: 3/0,1 mode: CW/CW res: [0x200000401:0x9:0x0].0x0 bits 0xd/0x0 rrc: 3 type: IBT gid 0 flags: 0x0 nid: local remote: 0x1c7d5bd8e176333e expref: -99 pid: 21772 timeout: 0 lvb_type: 0
...
00000080:00200000:0.0:1638375473.856255:0:21772:0:(file.c:5744:ll_layout_conf()) [0x200000401:0x9:0x0]: layout version change: 4294967294 -> 70 |
| Comment by Alex Zhuravlev [ 01/Dec/21 ] |
|
given that in some scenarios the LVB is used to return the layout, shouldn't we maintain the LVB properly? |
| Comment by Zhenyu Xu [ 02/Dec/21 ] |
|
Will https://review.whamcloud.com/45663 (Subject: LU-15278 lod: protect LOD layout cache in layout change) help this issue? In that patch, a layout change is reflected in a lod_object flag. When the enqueued lock2 gets its turn to be granted, ldlm_handle_enqueue0()->ldlm_lvbo_fill() fills the LVB, which retrieves the trusted.lov EA from LOD; with that patch, lod_xattr_get() waits for the changed layout to be set in the EA, so lock2 can be returned with the updated layout. |
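As I read it, the synchronization pattern in that patch is roughly the following; this is a standalone analogy with made-up names (struct layout_ea, layout_change_begin/end, layout_get), not the actual lod_object code: the thread changing the layout marks the change in flight, and any reader of the layout EA waits until the change is committed, so the LVB filled for lock2 already describes the new layout.
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

/* made-up stand-in for the per-object layout state the patch protects */
struct layout_ea {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        bool            changing;   /* layout change in flight */
        int             version;    /* what trusted.lov currently describes */
};

/* writer: mark the change in flight before rewriting the EA */
static void layout_change_begin(struct layout_ea *ea)
{
        pthread_mutex_lock(&ea->lock);
        ea->changing = true;
        pthread_mutex_unlock(&ea->lock);
}

/* writer: the new trusted.lov is stored, wake up any waiting readers */
static void layout_change_end(struct layout_ea *ea, int new_version)
{
        pthread_mutex_lock(&ea->lock);
        ea->version = new_version;
        ea->changing = false;
        pthread_cond_broadcast(&ea->cond);
        pthread_mutex_unlock(&ea->lock);
}

/* reader: like lod_xattr_get() in the patch, wait out an in-flight change
 * before returning the layout that will be used to fill the LVB */
static int layout_get(struct layout_ea *ea)
{
        int v;

        pthread_mutex_lock(&ea->lock);
        while (ea->changing)
                pthread_cond_wait(&ea->cond, &ea->lock);
        v = ea->version;
        pthread_mutex_unlock(&ea->lock);
        return v;
}

int main(void)
{
        struct layout_ea ea = {
                PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER,
                false, 70,
        };

        layout_change_begin(&ea);
        /* ... the resync rewrites trusted.lov here ... */
        layout_change_end(&ea, 71);

        /* a reader filling the LVB now sees 71, never the stale 70 */
        printf("LVB layout version: %d\n", layout_get(&ea));
        return 0;
}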
| Comment by Alex Zhuravlev [ 06/Dec/21 ] |
|
well, further investigation led me to the following scenario:
as discussed before, in this case (at least) we don't really need to write pages right away: only the generation has changed, nothing else. I tried to skip cl_object_prune(), but that doesn't work because OSC links the pages to the osc_object structure, which is released when the new layout is instantiated. This could probably be changed to just update the generation fields in the lov_stripe_md struct, but that looks very tricky. It's also still not clear to me when we need to drop the pages from the writeback cache: they've been reported to userspace as written and no hardware failures have happened to the cluster, so they must be saved in the end. |
| Comment by Alex Zhuravlev [ 27/Jan/22 ] |
|
there are a few scenarios with the same result. one of those is when ll_intent_file_open() is used, which returns a new layout with a new layout version but doesn't pass the new layout to LOV, so the stack still operates with the previous layout version. |
| Comment by Alex Zhuravlev [ 28/Jan/22 ] |
|
sometimes the new layout comes as an LVB, but the MDT doesn't update the resource's LVB after layout changes the way the OFD does. |
| Comment by Alex Zhuravlev [ 31/Jan/22 ] |
|
here is another (IMO) very important finding: |
| Comment by Alex Zhuravlev [ 01/Feb/22 ] |
|
with a trivial workaround to reload the LOVEA under the layout lock it's much harder to fail sanity-flr/200, but it is still possible due to a race on the client side:
00020000:00080000:1.0:1643710303.583904:0:79300:0:(lov_object.c:1305:lov_layout_change()) [0x200000401:0x3:0x0]Apply new layout lov 00000000860b79c3, type 2
00020000:00080000:1.0:1643710303.584023:0:79300:0:(lov_object.c:1440:lov_conf_set()) lsm lv = 76
...
00020000:00080000:1.0:1643710303.584256:0:79305:0:(lov_object.c:1305:lov_layout_change()) [0x200000401:0x3:0x0]Apply new layout lov 00000000860b79c3, type 2
00020000:00080000:1.0:1643710303.584342:0:79305:0:(lov_object.c:1440:lov_conf_set()) lsm lv = 75
i.e. two processes race applying the new layouts and the one with the smaller layout version wins. |
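A minimal standalone sketch of the kind of guard that avoids this race; the names (struct layout_state, apply_layout) are hypothetical and this is not the actual lov_conf_set() code: each applier takes the configuration lock, and a layout whose version is older than the one already installed is simply ignored, so the "smaller one wins" outcome above cannot happen.
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* hypothetical stand-in for the client-side layout state */
struct layout_state {
        pthread_mutex_t lock;
        uint32_t        version;   /* currently installed layout version */
};

/* apply a freshly fetched layout only if it is not older than what is
 * already installed; returns true if the new layout was taken */
static bool apply_layout(struct layout_state *ls, uint32_t new_version)
{
        bool applied = false;

        pthread_mutex_lock(&ls->lock);
        if (new_version >= ls->version) {
                ls->version = new_version;   /* install the newer layout */
                applied = true;
        }
        pthread_mutex_unlock(&ls->lock);
        return applied;
}

int main(void)
{
        struct layout_state ls = { PTHREAD_MUTEX_INITIALIZER, 0 };

        /* the two racing appliers from the log above: versions 76 and 75 */
        printf("apply 76: %d\n", apply_layout(&ls, 76));   /* taken */
        printf("apply 75: %d\n", apply_layout(&ls, 75));   /* ignored */
        printf("final version: %u\n", (unsigned)ls.version); /* 76 */
        return 0;
}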
| Comment by Gerrit Updater [ 01/Feb/22 ] |
|
"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46413 |
| Comment by Alex Zhuravlev [ 07/Feb/22 ] |
|
got stuck a bit.. currently the MDT fetches the LOVEA, then decides whether the layout lock (LL) is needed. for example, if the LOVEA doesn't exist (empty layout), the MDT enqueues the LL to create a new layout, so I can't just move the LOVEA fetch to after the LL enqueue.. the simplest fix would be to re-fetch the LOVEA under the LL, but that does sound like a performance penalty. |
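A rough sketch of the ordering problem and the "double fetch" workaround described above; every helper name here (fetch_lovea, enqueue_layout_lock, struct obj) is made up for illustration and this is not the real mdt open path: the LOVEA must be read once because its content decides whether the LL is needed at all, and then read again under the lock because it may have changed in between.
#include <stdbool.h>
#include <stdio.h>

/* made-up stand-in for the MDT object and its layout EA */
struct obj { bool has_lovea; int layout_version; };

static int fetch_lovea(const struct obj *o)       /* models dt_xattr_get() */
{
        if (!o->has_lovea)
                return -1;                        /* models -ENODATA */
        printf("LOVEA fetched, layout version %d\n", o->layout_version);
        return 0;
}

static void enqueue_layout_lock(void)             /* models the LL enqueue */
{
        printf("layout lock granted\n");
}

int main(void)
{
        struct obj o = { .has_lovea = true, .layout_version = 70 };

        /* 1st fetch: its result decides whether and how the LL is taken
         * (an empty layout means the LL must create a new one), which is
         * why the fetch cannot simply be moved after the enqueue */
        int rc = fetch_lovea(&o);

        enqueue_layout_lock();

        /* another client may have changed the layout before the lock was
         * granted ... */
        o.layout_version = 71;

        /* ... so the LOVEA has to be re-read under the LL: correct, but it
         * is the second getxattr this comment worries about */
        if (rc == 0)
                rc = fetch_lovea(&o);
        return rc;
}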
| Comment by Alex Zhuravlev [ 14/Feb/22 ] |
|
with the patch above I can't reproduce the failure in sanity-flr/200 (while it's still trivial with the master branch), though I'm still thinking about how to avoid (or at least make cheaper) the double dt_xattr_get() needed to fetch the LOVEA. |
| Comment by Alex Zhuravlev [ 16/Feb/22 ] |
|
mdt_object_open_lock() uses the LOVEA to check for DoM and then requests the appropriate lock. |
| Comment by Alex Zhuravlev [ 21/Feb/22 ] |
|
benchmarked with sanity-benchmark:
before: open 259215 samples [usecs] 21 121174 47616701 490111842189
after: open 257128 samples [usecs] 23 154632 52546372 733013412240
looks like open takes 11% more time with the patch. |
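For reference, the 11% comes from dividing the cumulative time by the sample count (assuming the usual Lustre stats columns: samples, then min, max, sum and sum of squares, all in usecs):
before: 47616701 / 259215 ≈ 184 usec per open
after:  52546372 / 257128 ≈ 204 usec per open
204 / 184 ≈ 1.11, i.e. about 11% slower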
| Comment by Alex Zhuravlev [ 21/Feb/22 ] |
|
Mike suggested always taking INODELOCK_DOM and then dropping this bit if no DoM component is present, like mdt_getattr_name_lock() does (look for ldlm_inodebits_drop()) |
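A minimal sketch of that idea; the bit names and struct layout_info here are made up (in Lustre the real pieces are the MDS_INODELOCK_* bits and ldlm_inodebits_drop()): the DoM bit is requested unconditionally, so the LOVEA no longer has to be fetched before the lock request is built, and the bit is dropped afterwards if the layout turns out to have no DoM component.
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* hypothetical inodebits, standing in for the MDS_INODELOCK_* flags */
#define IBIT_LOOKUP 0x01ULL
#define IBIT_OPEN   0x02ULL
#define IBIT_DOM    0x04ULL

/* made-up layout summary, filled from the LOVEA once the lock is held */
struct layout_info { bool has_dom_component; };

int main(void)
{
        /* step 1: ask for the DoM bit unconditionally, before the layout
         * is known */
        uint64_t bits = IBIT_LOOKUP | IBIT_OPEN | IBIT_DOM;

        /* step 2: read the LOVEA under the lock */
        struct layout_info li = { .has_dom_component = false };

        /* step 3: no DoM component, so drop the bit from the granted lock,
         * the way mdt_getattr_name_lock() uses ldlm_inodebits_drop() */
        if (!li.has_dom_component)
                bits &= ~IBIT_DOM;

        printf("kept bits: 0x%llx\n", (unsigned long long)bits);
        return 0;
}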
| Comment by Alex Zhuravlev [ 22/Feb/22 ] |
|
got a prototype implementing Mike's idea:
clean master:
open 259215 samples [usecs] 21 121174 47616701 490111842189
open 262119 samples [usecs] 23 187783 49285028 637234901782
= 183 and 188 usec/open
double getxattr:
open 257224 samples [usecs] 24 178664 54603761 775008064791
open 253302 samples [usecs] 23 151484 48624675 554342411475
= 212 and 191 usec/open
late xattr + dropping DOM bit from ldlm lock:
open 250363 samples [usecs] 23 123033 48744046 613086712776
open 256359 samples [usecs] 22 180023 47671587 620789426793
= 194 and 185 usec/open
the latter can be improved if we change the LDLM API to return a pointer to the ldlm_lock and stop using ldlm_handle2lock() |
| Comment by Gerrit Updater [ 22/Feb/22 ] |
|
|
| Comment by Nikitas Angelinas [ 19/May/22 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/32deb408-a813-480f-a6f3-1ae41c34ab56 |
| Comment by Colin Faber [ 25/Jul/22 ] |
|
Hi bzzz, what's next for this issue? |
| Comment by Alex Zhuravlev [ 26/Jul/22 ] |
review and landing, hopefully; the patch has been in local testing for months.. |
| Comment by Andreas Dilger [ 29/Mar/23 ] |
|
"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46413 |
| Comment by Gerrit Updater [ 22/Apr/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/46413/ |
| Comment by Peter Jones [ 22/Apr/23 ] |
|
Landed for 2.16 |