[LU-15300] mirror resync can cause EIO to unrelated applications Created: 01/Dec/21  Updated: 01/Dec/23  Resolved: 22/Apr/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.3
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Major
Reporter: Alex Zhuravlev Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: failing_tests, flr-improvement

Issue Links:
Related
is related to LU-12656 sanity-flr test 200 fails with 'faile... Resolved
is related to LU-15269 sanity-flr/200 to generate new tmp fi... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807
Epic Link: FLR tech debt review

 Description   

I noticed that sanity-flr test 200 sometimes hits a "checksum error"; here are some findings.

First of all, the checksum error is caused by an incomplete preceding lfs mirror resync command (which in some cases does not return an error).

In turn, the EIO that lfs hits is caused by the AS_EIO flag on the corresponding mapping.

AS_EIO is set because an OST_WRITE fails with ESTALE due to an incorrect layout version (the client's version is smaller than the one on the OST).
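For context, this EIO propagation is the generic Linux writeback-error mechanism rather than anything Lustre-specific. A minimal sketch of that mechanism using the standard kernel helpers (illustration only, not the actual Lustre client code):

/* When an async OST_WRITE fails, the error is recorded on the file's
 * address_space; any process that later flushes or fsyncs the same file
 * gets -EIO back, even if its own writes succeeded - hence "EIO to
 * unrelated applications". */
#include <linux/fs.h>
#include <linux/pagemap.h>

static void record_write_error(struct address_space *mapping, int rc)
{
        if (rc)
                mapping_set_error(mapping, rc); /* sets AS_EIO (or AS_ENOSPC) */
}

static int report_on_sync(struct file *file)
{
        /* filemap_write_and_wait() tests and clears AS_EIO, so the stored
         * error surfaces here as -EIO for whichever process syncs first */
        return filemap_write_and_wait(file->f_mapping);
}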

So far I have traced all of this to a race between two processes:

  • lfs doing resync and changing layout generation
  • another process (say, multiop) doing a regular write

I will cite the logs in a subsequent comment.



 Comments   
Comment by Alex Zhuravlev [ 01/Dec/21 ]

On behalf of lfs:

00000004:80000000:1.0:1638375473.821528:0:5874:0:(lod_object.c:7859:lod_declare_update_write_pending()) [0x200000401:0x9:0x0]: found primary 1
00000004:80000000:1.0:1638375473.821530:0:5874:0:(lod_object.c:7641:lod_prepare_resync()) [0x200000401:0x9:0x0]: instantiate all stale components in [0x0, 0x1000000)
00000004:00080000:1.0:1638375473.821975:0:5874:0:(osp_sync.c:1652:osp_sync_add_commit_cb()) lustre-OST0001-osc-MDT0000: add commit cb at 384577134151ns, next at 327591490501ns, rc = 0
00010000:00010000:1.0:1638375473.822823:0:5874:0:(ldlm_lock.c:675:ldlm_add_bl_work_item()) ### lock incompatible; sending blocking AST. ns: mdt-lustre-MDT0000_UUID lock: 00000000e2a9dd24/0x1c7d5bd8e17632f1 lrc: 2/0,1 mode: EX/EX res: [0x200000401:0x9:0x0].0x0 bits 0x8/0x0 rrc: 7 type: IBT gid 0 flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 5874 timeout: 0 lvb_type: 0
00010000:00010000:1.0:1638375473.822833:0:5874:0:(ldlm_request.c:373:ldlm_blocking_ast_nocheck()) ### Lock still has references, will be cancelled later ns: mdt-lustre-MDT0000_UUID lock: 00000000e2a9dd24/0x1c7d5bd8e17632f1 lrc: 3/0,1 mode: EX/EX res: [0x200000401:0x9:0x0].0x0 bits 0x8/0x0 rrc: 7 type: IBT gid 0 flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 5874 timeout: 0 lvb_type: 0

Then multiop gets the lock and proceeds:

00010000:00010000:0.0:1638375473.854743:0:7394:0:(ldlm_lock.c:766:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(CW) ns: mdt-lustre-MDT0000_UUID lock: 00000000b7dfcfcc/0x1c7d5bd8e176333e lrc: 3/0,1 mode: --/CW res: [0x200000401:0x9:0x0].0x0 bits 0x0/0x0 rrc: 8 type: IBT gid 0 flags: 0x40000000000000 nid: local remote: 0x0 expref: -99 pid: 7394 timeout: 0 lvb_type: 0
00010000:00010000:0.0:1638375473.854748:0:7394:0:(ldlm_lock.c:1064:ldlm_granted_list_add_lock()) ### About to add lock: ns: mdt-lustre-MDT0000_UUID lock: 00000000b7dfcfcc/0x1c7d5bd8e176333e lrc: 3/0,1 mode: CW/CW res: [0x200000401:0x9:0x0].0x0 bits 0xd/0x0 rrc: 8 type: IBT gid 0 flags: 0x50210000000000 nid: local remote: 0x0 expref: -99 pid: 7394 timeout: 0 lvb_type: 0
00010000:00010000:0.0:1638375473.854752:0:7394:0:(ldlm_request.c:530:ldlm_cli_enqueue_local()) ### client-side local enqueue handler, new lock created ns: mdt-lustre-MDT0000_UUID lock: 00000000b7dfcfcc/0x1c7d5bd8e176333e lrc: 3/0,1 mode: CW/CW res: [0x200000401:0x9:0x0].0x0 bits 0xd/0x0 rrc: 8 type: IBT gid 0 flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 7394 timeout: 0 lvb_type: 0
00000004:00200000:0.0:1638375473.854762:0:7394:0:(mdt_handler.c:912:mdt_pack_attr2body()) [0x200000401:0x9:0x0]: returning size 9159104

But for some reason it is still the old generation, 70:

00000002:00010000:0.0:1638375473.856175:0:21772:0:(mdc_locks.c:852:mdc_finish_enqueue()) ### layout lock returned by: open, lvb_len: 1704 ns: lustre-MDT0000-mdc-ffff987d0c6e0000 lock: 000000002dd382c3/0x1c7d5bd8e17632f8 lrc: 3/0,1 mode: CW/CW res: [0x200000401:0x9:0x0].0x0 bits 0xd/0x0 rrc: 3 type: IBT gid 0 flags: 0x0 nid: local remote: 0x1c7d5bd8e176333e expref: -99 pid: 21772 timeout: 0 lvb_type: 0
...
00000080:00200000:0.0:1638375473.856255:0:21772:0:(file.c:5744:ll_layout_conf()) [0x200000401:0x9:0x0]: layout version change: 4294967294 -> 70
Comment by Alex Zhuravlev [ 01/Dec/21 ]

Given that in some scenarios the LVB is used to return the layout, shouldn't we maintain the LVB properly?
For example, if lock1 is taken to modify the layout and lock2 is enqueued waiting for lock1 to complete, can ->lvb_init() for lock2 be called before the layout itself has been changed, so that lock2 is returned with a stale layout?

Comment by Zhenyu Xu [ 02/Dec/21 ]

Will https://review.whamcloud.com/45663 (Subject: LU-15278 lod: protect LOD layout cache in layout change) help this issue?

In that patch, a layout change is reflected in a lod_object flag. When the enqueued lock2 gets its turn to be granted, ldlm_handle_enqueue0()->ldlm_lvbo_fill() fills the LVB, which is retrieved from the trusted.lov EA in LOD. With that patch, lod_xattr_get() waits for the changed layout to be set in the EA, so lock2 can be returned with the updated layout.

Comment by Alex Zhuravlev [ 06/Dec/21 ]

Well, further investigation led me to the following scenario:

  • lfs (mirror resync) flushes pages from all the clients
  • another client C opens the file for write and does a few writes to the local cache (with layout generation=X)
  • lfs enqueues LL lock, changes layout generation to X+1 and sends OST_SETATTR to update layout generation on OSTs
  • client C opens the file again and gets the new layout with generation=X+1; lov_layout_change() wants to get rid of the cached pages (via cl_object_prune()), which are dirty and need to be written, so layout generation=X is used for this I/O
  • OST_WRITE gets -ESTALE (see the sketch at the end of this comment)

As discussed before, in this case (at least) we do not really need to write the pages right away: only the generation has changed, nothing else.

I tried to skip cl_object_prune(), but that does not work, as OSC links the pages to the osc_object structure, which is released when the new layout is instantiated. Probably this could be changed to just update the generation fields in the lov_stripe_md struct, but that looks very tricky.

And so far it is still not clear to me when we actually need to drop the pages from the writeback cache: they have been reported to userspace as written and no hardware failures have happened on the cluster, so in the end they must be saved.
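To make the last step of the scenario concrete: conceptually the OST rejects the write because it compares the layout version carried by the request against the version already stored on the object. A sketch of that comparison, with hypothetical structure and function names (the real check lives in the OFD write path):

#include <linux/errno.h>
#include <linux/types.h>

/* hypothetical: version stored on the OST object after the resync bump */
struct ost_obj_state {
        __u32 oos_layout_version;
};

static int check_layout_version(const struct ost_obj_state *obj,
                                __u32 client_layout_version)
{
        /* the dirty pages are flushed with the pre-resync generation X,
         * while the object was already bumped to X+1 by OST_SETATTR */
        if (client_layout_version < obj->oos_layout_version)
                return -ESTALE;
        return 0;
}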

Comment by Alex Zhuravlev [ 27/Jan/22 ]

There are a few scenarios with the same result. One of them is when ll_intent_file_open() is used: it returns a new layout with a new layout version, but does not pass the new layout to LOV, so the stack still operates with the previous layout version.

Comment by Alex Zhuravlev [ 28/Jan/22 ]

Sometimes the new layout comes as an LVB, but the MDT does not update the resource's LVB after changes to the layout, like the OFD does.

Comment by Alex Zhuravlev [ 31/Jan/22 ]

Here is another (IMO) very important finding:
mdt_reint_open() calls into mdt_get_attr_complex() to get the LOVEA, and only then grabs the LDLM (including LAYOUT) lock on the object.
Obviously this races, and sometimes the client gets a stale layout together with a layout lock.

Comment by Alex Zhuravlev [ 01/Feb/22 ]

With a trivial workaround to reload the LOVEA under the layout lock it is much harder to make sanity-flr/200 fail, but it is still possible due to a race on the client side:

00020000:00080000:1.0:1643710303.583904:0:79300:0:(lov_object.c:1305:lov_layout_change()) [0x200000401:0x3:0x0]Apply new layout lov 00000000860b79c3, type 2
00020000:00080000:1.0:1643710303.584023:0:79300:0:(lov_object.c:1440:lov_conf_set()) lsm lv = 76
...
00020000:00080000:1.0:1643710303.584256:0:79305:0:(lov_object.c:1305:lov_layout_change()) [0x200000401:0x3:0x0]Apply new layout lov 00000000860b79c3, type 2
00020000:00080000:1.0:1643710303.584342:0:79305:0:(lov_object.c:1440:lov_conf_set()) lsm lv = 75

i.e. two processes race applying the new layouts and the one with the smaller version wins.
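To illustrate: the race is a plain last-writer-wins update of the cached layout, so the thread applying version 75 can overwrite the thread that already applied 76. A purely illustrative sketch (hypothetical names, not the actual lov_conf_set() code) of the kind of monotonic check that would prevent rolling the version back:

#include <linux/errno.h>
#include <linux/types.h>

/* hypothetical per-object layout state, updated under a configuration lock */
struct obj_layout_state {
        __u32 ols_layout_version;
};

static int apply_layout(struct obj_layout_state *o, __u32 new_version)
{
        /* refuse to replace a newer layout with an older one */
        if (new_version < o->ols_layout_version)
                return -ESTALE;
        o->ols_layout_version = new_version;
        return 0;
}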

Comment by Gerrit Updater [ 01/Feb/22 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46413
Subject: LU-15300 mdt: do not return LOVEA
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6a04b6146bc7bd4c7eb195ab3a8f4311e618f573

Comment by Alex Zhuravlev [ 07/Feb/22 ]

Got stuck a bit... Currently the MDT fetches the LOVEA, then decides whether the LL is needed. For example, if the LOVEA does not exist (empty layout), then the MDT enqueues the LL to create a new layout, so I cannot just move the LOVEA fetching after the LL enqueue. The simplest approach would be to re-fetch the LOVEA once the LL is held, but that does sound like a performance penalty.
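In outline, the ordering problem and the "double fetch" option look like this (all names below are hypothetical stand-ins used only to show the structure of the open path, not the real MDT functions):

#include <linux/errno.h>
#include <linux/types.h>

/* hypothetical object state, only to model the decision structure */
struct open_obj {
        bool    oo_has_lovea;
        bool    oo_ll_held;
};

static int fetch_lovea(struct open_obj *obj)
{
        return obj->oo_has_lovea ? 0 : -ENODATA;      /* dt_xattr_get()-like */
}

static int enqueue_layout_lock(struct open_obj *obj, bool create_layout)
{
        obj->oo_ll_held = true;
        if (create_layout)
                obj->oo_has_lovea = true;             /* new layout instantiated */
        return 0;
}

static int open_path_sketch(struct open_obj *obj)
{
        int rc = fetch_lovea(obj);   /* 1st fetch: needed to pick the lock mode */

        /* an empty layout (-ENODATA) means the LL must be taken so that a
         * new layout can be created, which is why this fetch cannot simply
         * be moved after the enqueue */
        rc = enqueue_layout_lock(obj, rc == -ENODATA);
        if (rc)
                return rc;

        /* 2nd fetch, now under the granted LL, so the layout returned to
         * the client cannot be stale; this re-read is the performance cost */
        return fetch_lovea(obj);
}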

Comment by Alex Zhuravlev [ 14/Feb/22 ]

With the patch above I cannot reproduce the failure in sanity-flr/200 (while it is still trivial to reproduce with the master branch), though I am still thinking about how to avoid (or at least make cheaper) the double dt_xattr_get() to fetch the LOVEA.

Comment by Alex Zhuravlev [ 16/Feb/22 ]

mdt_object_open_lock() uses the LOVEA to check for DoM and then requests the appropriate lock.

Comment by Alex Zhuravlev [ 21/Feb/22 ]

benchmarked with sanity-benchmark:

before:
open                      259215 samples [usecs] 21 121174 47616701 490111842189

after:
open                      257128 samples [usecs] 23 154632 52546372 733013412240

looks like open takes about 11% more time with the patch (47616701 / 259215 ≈ 184 usec/open before vs 52546372 / 257128 ≈ 204 usec/open after).

Comment by Alex Zhuravlev [ 21/Feb/22 ]

Mike suggested always taking INODELOCK_DOM and then dropping this bit if no DoM component is present, like mdt_getattr_name_lock() does (look for ldlm_inodebits_drop()).
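A rough sketch of that idea; MDS_INODELOCK_DOM and ldlm_inodebits_drop() are the real names mentioned above, but the surrounding helper and exact signatures are assumptions:

/* Request the DOM bit unconditionally, so no LOVEA lookup is needed before
 * the enqueue, then give the bit back once the layout (read under the
 * granted lock) shows there is no DoM component. This is a fragment of the
 * open-lock path, not a complete function from the tree. */
static void open_lock_dom_sketch(struct ldlm_lock *lock, bool has_dom_component)
{
        /* the lock was enqueued with MDS_INODELOCK_DOM included up front */
        if (!has_dom_component)
                ldlm_inodebits_drop(lock, MDS_INODELOCK_DOM);
}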

Comment by Alex Zhuravlev [ 22/Feb/22 ]

got a prototype implementing Mike's idea:

clean master:
open                      259215 samples [usecs] 21 121174 47616701 490111842189
open                      262119 samples [usecs] 23 187783 49285028 637234901782
= 183 and 188 usec/open

double getxattr:
open                      257224 samples [usecs] 24 178664 54603761 775008064791
open                      253302 samples [usecs] 23 151484 48624675 554342411475
= 212 and 191 usec/open

late xattr + dropping DOM bit from ldlm lock:
open                      250363 samples [usecs] 23 123033 48744046 613086712776
open                      256359 samples [usecs] 22 180023 47671587 620789426793
= 194 and 185 usec/open

The latter can be improved if we change the LDLM API to return a pointer to the ldlm_lock and stop using ldlm_handle2lock().
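For reference, the cost being referred to is the extra handle-to-lock lookup. The current pattern is roughly the following (drop_dom_bit() is a hypothetical wrapper; the LDLM calls themselves are real):

/* After a local enqueue only the lustre_handle is available, so reaching
 * the lock to drop the DOM bit costs an ldlm_handle2lock() lookup plus a
 * reference; if the enqueue path returned the struct ldlm_lock pointer
 * directly, this lookup would go away. */
static void drop_dom_bit(const struct lustre_handle *lockh)
{
        struct ldlm_lock *lock = ldlm_handle2lock(lockh);  /* lookup + ref */

        if (lock == NULL)
                return;
        ldlm_inodebits_drop(lock, MDS_INODELOCK_DOM);
        LDLM_LOCK_PUT(lock);       /* drop the reference taken above */
}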

Comment by Gerrit Updater [ 22/Feb/22 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46580
Subject: LU-15300 mdt: fetch LOVEA after LL
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2f868486c9ad6f884b682173705e93d68ab6385a

Comment by Nikitas Angelinas [ 19/May/22 ]

+1 on master: https://testing.whamcloud.com/test_sets/32deb408-a813-480f-a6f3-1ae41c34ab56

Comment by Colin Faber [ 25/Jul/22 ]

Hi bzzz 

What's next for this issue?

Comment by Alex Zhuravlev [ 26/Jul/22 ]

What's next for this issue?

Review and landing, hopefully; the patch has been in local testing for months.

Comment by Andreas Dilger [ 29/Mar/23 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46413
Subject: LU-15300 mdt: refresh LOVEA with LL granted
Project: fs/lustre-release
Branch: master
Current Patch Set: 69
Commit: efbe0f63eff8a9a7b192607382f6859e3b0088b8

Comment by Gerrit Updater [ 22/Apr/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/46413/
Subject: LU-15300 mdt: refresh LOVEA with LL granted
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 13557aa86904376e48a5e43256d5c1ab32c1c2d6

Comment by Peter Jones [ 22/Apr/23 ]

Landed for 2.16
