[LU-15300] mirror resync can cause EIO to unrelated applications - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: Lustre 2.16.0, Lustre 2.15.7
Affects Version/s: Lustre 2.15.3
Labels:
- failing_tests
- flr-improvement

Severity: 3
Rank (Obsolete): 9223372036854775807
Epic Link: FLR tech debt review

Description

I noticed that sanity-flr/200 sometimes hits "checksum error"; here are some findings.

First, the checksum error is caused by a preceding lfs mirror resync command that did not complete (and which, in some cases, does not return an error).

In turn, the EIO that lfs hits is caused by the AS_EIO flag set on the corresponding mapping.

AS_EIO is set because an OST_WRITE carrying an incorrect layout version (the client's version is smaller than the one on the OST) fails with ESTALE.

So far I've traced all of this to a race between two processes:

- lfs doing the resync and changing the layout generation
- another process (say, multiop) doing a regular write

I will cite the logs in a subsequent comment.
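
The chain described above (stale layout version, then ESTALE, then a sticky AS_EIO flag, then EIO seen by a different caller) can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyOST`, `ToyMapping`, and `simulate_race` are hypothetical names invented for illustration and do not correspond to real Lustre code paths.

```python
import errno


class ToyOST:
    """Hypothetical stand-in for an OST object that tracks a layout version."""

    def __init__(self):
        self.layout_version = 1

    def write(self, client_version):
        # Per the description: a write whose layout version is smaller
        # than the OST's version is rejected with ESTALE.
        if client_version < self.layout_version:
            return -errno.ESTALE
        return 0


class ToyMapping:
    """Hypothetical stand-in for a file mapping with a sticky error flag."""

    def __init__(self):
        self.as_eio = False

    def check_and_clear_error(self):
        # The next caller to check the mapping sees the error,
        # even if the failed write belonged to another process.
        if self.as_eio:
            self.as_eio = False
            return -errno.EIO
        return 0


def simulate_race():
    ost = ToyOST()
    mapping = ToyMapping()

    # The regular writer caches the layout version before the resync starts.
    writer_version = ost.layout_version

    # The resync changes the layout generation in the meantime.
    ost.layout_version += 1

    # The writer's request now carries a stale version.
    write_rc = ost.write(writer_version)
    if write_rc == -errno.ESTALE:
        mapping.as_eio = True

    # A different caller on the same mapping picks up the error.
    other_rc = mapping.check_and_clear_error()
    return write_rc, other_rc


write_rc, other_rc = simulate_race()
print(errno.errorcode[-write_rc], errno.errorcode[-other_rc])
```

The model is sequential on purpose: it fixes one interleaving of the two processes so the outcome is reproducible, whereas the real failure depends on timing.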


Issue Links

is duplicated by
- LU-14966 sanity-flr test_200: FAIL: checksum error for mirror 2: lfs mirror: '/mnt/lustre/f200.sanity-flr' llapi_mirror_resync_many: Input/output error (Resolved)
- LU-17320 sanity-flr test_200: read failed (-ESTALE) (Resolved)

is related to
- LU-12656 sanity-flr test 200 fails with 'failed writing to *:*' (Resolved)
- LU-17070 sanity-flr test_200b: vvp_vmpage_error()) LBUG (Resolved)
- LU-18476 interop replay-single test_202: FAIL: layout gen changed: 2 -> 0 (Resolved)
- LU-15269 sanity-flr/200 to generate new tmp files each time (Resolved)

is related to
- LU-18416 Data corruption/miscompare observed during 48hr FOFB (Resolved)


Activity

Gerrit Updater added a comment - 22/Jan/25 6:50 PM

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/55464/
Subject: LU-15300 mdt: refresh LOVEA with LL granted
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 9287b2c34d3c7c4d94d9db3a5a622d89be31ec6b


Gerrit Updater added a comment - 18/Jun/24 4:52 PM

"Frederick Dilger <fdilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55464
Subject: LU-15300 mdt: refresh LOVEA with LL granted
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 272180003988ecd0d786392cd0fee1a800d5e336


Peter Jones added a comment - 22/Apr/23 6:32 PM

Landed for 2.16


Gerrit Updater added a comment - 22/Apr/23 5:27 PM

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/46413/
Subject: LU-15300 mdt: refresh LOVEA with LL granted
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 13557aa86904376e48a5e43256d5c1ab32c1c2d6


Andreas Dilger added a comment - 29/Mar/23 4:21 AM

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46413
Subject: LU-15300 mdt: refresh LOVEA with LL granted
Project: fs/lustre-release
Branch: master
Current Patch Set: 69
Commit: efbe0f63eff8a9a7b192607382f6859e3b0088b8


Alex Zhuravlev added a comment - 26/Jul/22 9:07 AM

> What's next for this issue?

Review and landing, hopefully; the patch has been in local testing for months.


Colin Faber added a comment - 25/Jul/22 2:47 PM

Hi bzzz

What's next for this issue?


Nikitas Angelinas added a comment - 19/May/22 7:55 PM

+1 on master: https://testing.whamcloud.com/test_sets/32deb408-a813-480f-a6f3-1ae41c34ab56


Gerrit Updater added a comment - 22/Feb/22 3:36 PM - edited

~~"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch:~~ https://review.whamcloud.com/46580
~~Subject: LU-15300 mdt: fetch LOVEA after LL~~
~~Project: fs/lustre-release~~
~~Branch: master~~
~~Current Patch Set: 1~~
~~Commit: 2f868486c9ad6f884b682173705e93d68ab6385a~~


Alex Zhuravlev added a comment - 22/Feb/22 2:36 PM

got a prototype implementing Mike's idea:

clean master:
open                      259215 samples [usecs] 21 121174 47616701 490111842189
open                      262119 samples [usecs] 23 187783 49285028 637234901782
= 183 and 188 usec/open

double getxattr:
open                      257224 samples [usecs] 24 178664 54603761 775008064791
open                      253302 samples [usecs] 23 151484 48624675 554342411475
= 212 and 191 usec/open

late xattr + dropping DOM bit from ldlm lock:
open                      250363 samples [usecs] 23 123033 48744046 613086712776
open                      256359 samples [usecs] 22 180023 47671587 620789426793
= 194 and 185 usec/open

the latter can be improved if we change LDLM API to return a pointer to ldlm_lock and stop using ldlm_handle2lock()
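
The usec/open figures above appear to be the sum column divided by the sample count, truncated to an integer. A minimal sketch of that arithmetic, assuming the stats lines use the field order name, count, "samples", unit, min, max, sum, sum of squares; the helper name `mean_per_sample` is hypothetical:

```python
def mean_per_sample(stats_line):
    """Return the truncated mean (sum // count) for one stats line.

    Assumed field order: name count 'samples' [unit] min max sum sumsq.
    """
    fields = stats_line.split()
    count = int(fields[1])   # number of samples
    total = int(fields[6])   # sum of all sample values, in usecs here
    return total // count


clean_master = [
    "open 259215 samples [usecs] 21 121174 47616701 490111842189",
    "open 262119 samples [usecs] 23 187783 49285028 637234901782",
]

for line in clean_master:
    print(mean_per_sample(line))  # prints 183, then 188
```

Applied to the other two configurations, the same division reproduces the 212/191 and 194/185 figures quoted in the comment.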


People

Assignee: Alex Zhuravlev

Reporter: Alex Zhuravlev

Votes: 0

Watchers: 10

Dates

Created: 01/Dec/21 4:51 PM

Updated: 15/May/25 2:04 PM

Resolved: 22/Apr/23 6:32 PM