[LU-13131] Partial writes on multi-client strided files - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: Lustre 2.14.0, Lustre 2.12.5
Affects Version/s: Lustre 2.10.4, Lustre 2.10.6, Lustre 2.12.2
Labels:
- LTS12
- zfs
Environment:
CentOS 7.6, Lustre/ZFS 2.10.4, 2.10.6, 2.12.2, socklnd and o2iblnd

Severity:
3

Description

Issue:

Clients see partial writes during multi-node N:1 strided writes (2MB per write) to striped files. The problem occurs at scale: it starts to trigger at around 50 concurrent files, where each file receives 2MB strided writes from 128 jobs spread across two Lustre client nodes.

Sampling of client side errors:

08-Jan 08:36:26 WARNING: partial write   1048576 != 4194304 at offset 543099453440 - skywalker09
08-Jan 08:36:26 WARNING: write of   4194304 failed at offset 318150541312 Input/output error [retry 0] - skywalker09
08-Jan 08:37:13 WARNING: partial write   2097152 != 4194304 at offset 144921591808 - skywalker28
08-Jan 08:37:13 WARNING: write of   4194304 failed at offset 356448731136 Input/output error [retry 0] - skywalker28 
08-Jan 08:35:48 WARNING: partial write   1048576 != 4194304 at offset 356121575424 - skywalker19
08-Jan 08:35:48 WARNING: write of   4194304 failed at offset 281496518656 Input/output error [retry 0] - skywalker19 
08-Jan 08:39:59 WARNING: close failed - Input/output error - skywalker37
08-Jan 08:34:32 WARNING: partial write   2097152 != 4194304 at offset 464540139520 - skywalker87
08-Jan 08:34:32 WARNING: write of   4194304 failed at offset 506952941568 Input/output error [retry 0] - skywalker74
08-Jan 08:38:06 WARNING: write of   2097152 failed at offset 183297376256 Input/output error [retry 0] - skywalker74 
08-Jan 08:38:06 WARNING: partial write   2097152 != 4194304 at offset 138403643392 - skywalker63 
08-Jan 08:40:10 WARNING: close failed - Input/output error - skywalker63
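
The tool that produced these messages is not named in this ticket. As a rough illustration only, the sketch below shows how a writer can tell a short write (fewer bytes accepted than requested) apart from a failed write (an error such as EIO). The function names and the message wording are hypothetical; only `os.pwrite` is a real call.

```python
import os
from typing import Optional, Tuple


def describe_short_write(requested: int, written: int, offset: int) -> Optional[str]:
    """Return a warning string if fewer bytes were written than requested."""
    if written == requested:
        return None
    return f"partial write {written} != {requested} at offset {offset}"


def checked_pwrite(fd: int, buf: bytes, offset: int) -> Tuple[int, Optional[str]]:
    """Hypothetical pwrite() wrapper.

    Returns (bytes_written, warning_or_None). A hard failure such as EIO
    is raised by os.pwrite as OSError and propagates to the caller.
    """
    written = os.pwrite(fd, buf, offset)
    return written, describe_short_write(len(buf), written, offset)
```

A short write is reported through the return value, so the caller decides whether to retry the remainder; a hard failure surfaces as an exception instead.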

Lustre Servers:

Lustre + ZFS. 2.10.4, 2.10.6, 2.12.2

Lustre Clients:

2.10.4, 2.10.6, 2.12.2

Tested fabrics:

Socklnd (bonded 40GbE), o2iblnd (FDR & EDR), direct connect and routed o2ib->tcp. 1MB, 4MB and 16MB RPC sizes tested.

References to other JIRA tickets that are or may be related:

~~LU-6389~~

HP-259

Details:

A single 600GB file is written to by 128 jobs in a 2MB strided write format. The 128 jobs are spread across two Lustre client nodes.

	Start 1	Start 2	Start 3	Start 4	Start Last
job0	0MB	256MB	512MB	768MB	556794MB
job1	2MB	258MB	514MB	770MB	556796MB
job2	4MB	260MB	516MB	772MB	556798MB
...
job127	254MB	510MB	766MB	1022MB	556800MB
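
The access pattern described above can be sketched as follows. This is a minimal model, assuming "MB" means MiB and that each job's k-th write starts at job × 2MB + k × 256MB; the function names are illustrative and not taken from the reproducer. It reproduces the first four start offsets listed for job0, job1, and job127.

```python
MIB = 1024 * 1024
WRITE_SIZE = 2 * MIB             # size of each write, per the description
NUM_JOBS = 128                   # writers per file, per the description
STRIDE = NUM_JOBS * WRITE_SIZE   # 256 MiB between successive writes of one job


def write_offset(job: int, k: int) -> int:
    """Byte offset of the k-th write (0-based) issued by `job` (assumed model)."""
    return job * WRITE_SIZE + k * STRIDE


def starts_mib(job: int, count: int) -> list:
    """First `count` start offsets of `job`, expressed in MiB."""
    return [write_offset(job, k) // MIB for k in range(count)]


if __name__ == "__main__":
    for job in (0, 1, 127):
        print(f"job{job}", starts_mib(job, 4))
```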

The file is written with Lustre stripe options of 4-8 OSTs and a stripe_size of 1MB.

When the total number of concurrently written files is ≤ 30, no partial writes occur. With 30 files, this corresponds to 60 Lustre clients grouped in pairs, each client running 64 jobs (threads) and each pair writing a different strided file. The same scenario scaled up to ≥ 60 concurrent files, which equates to 120 or more Lustre clients working in pairs on unique files, triggers the partial writes.

One observation: the combination of 2MB strided writes and a 1MB stripe size results in a significant number of objects per file, and that number scales up as the number of files and concurrent writers increases. Perhaps this is a trigger (I’m spitballing).
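
To make the interaction between the 2MB writes and the 1MB stripe size concrete, here is a small sketch that splits a write into per-stripe chunks. It assumes the plain round-robin (RAID-0) layout, where chunk number `offset // stripe_size` goes to stripe index `chunk % stripe_count`; the function names are illustrative, and "MB" is taken to mean MiB.

```python
MIB = 1024 * 1024
WRITE_SIZE = 2 * MIB
STRIDE = 128 * WRITE_SIZE  # 128 jobs per file, 2 MiB each


def stripe_chunks(offset, length, stripe_size=MIB, stripe_count=4):
    """Split [offset, offset + length) into (stripe_index, object_offset, nbytes).

    Assumes a round-robin RAID-0 layout; stripe_index is the position within
    the file's layout, not an OST number.
    """
    chunks = []
    end = offset + length
    while offset < end:
        stripe_no = offset // stripe_size
        within = offset % stripe_size
        nbytes = min(stripe_size - within, end - offset)
        object_offset = (stripe_no // stripe_count) * stripe_size + within
        chunks.append((stripe_no % stripe_count, object_offset, nbytes))
        offset += nbytes
    return chunks


def stripes_touched(job, writes, stripe_count):
    """Set of stripe indices hit by the first `writes` writes of `job`."""
    touched = set()
    for k in range(writes):
        offset = job * WRITE_SIZE + k * STRIDE
        for index, _, _ in stripe_chunks(offset, WRITE_SIZE, MIB, stripe_count):
            touched.add(index)
    return touched
```

Under this model each 2MB write is split into two 1MB chunks on adjacent stripes. With a stripe count of 4, even-numbered jobs always land on stripe indices 0 and 1 and odd-numbered jobs on 2 and 3, because the 256MB stride is a multiple of stripe_size × stripe_count. Whether this concentration is related to the failures is not established in this ticket.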

Error messages witnessed:

There have been randomly occurring client side error messages. It is not known whether they are resulting in the partial write events or occurring as a result of the partial writes.

tgt_grant_check lfs-OSTxxxx cli xxxxxxxxx claims X GRANT, real grant Y
ldlm_cancel from xx.xx.xx.xx@tcp
(lib-move.c:4183:lnet_parse()) xx.xx.xx.xx@o2ib, src xx.xx.xx.xx@o2ib: Bad dest nid xx.xx.xx.xx@tcp (it's my nid but on a different network)

No notable server-side (MDS / OSS) error messages coincide with the partial write errors. The OSS nodes are in an HA configuration and no failovers have occurred as a result of the issue; if either OSS were becoming overloaded, corosync messages would likely be missed and trigger a failover.

Additional information:

With 2MB strided writes and 1MB OST stripe size it would seem that certain larger RPC sizes / brw_size / ZFS recordsize combinations could trigger significant write amplification. Perhaps enough to exacerbate the issue.

I have tried to trigger the issue on a testbed filesystem and have run through permutations of server-side and client-side versions: socklnd and o2iblnd direct, o2iblnd LNet-routed to socklnd, and 1 / 4 / 16MB RPC sizes. I cannot trigger the partial write events. Sixteen 64-core AMD Rome clients (eight concurrent files written) are not enough to trigger it.

Attachments

client.txt	9 kB	03/Feb/20 8:12 AM
lustrelog_mds00_158033093x.txt.gz	18 kB	30/Jan/20 8:03 AM
lustrelog_oss00_158033093x.txt.gz	40.32 MB	30/Jan/20 8:04 AM
lustrelog_oss01_158033093x_A.txt.gz	1.25 MB	30/Jan/20 8:03 AM
lustrelog_oss01_158033093x_B.txt.gz	38.98 MB	30/Jan/20 8:04 AM
lustrelog_r4i5n12_158033093x.txt.gz	1.19 MB	30/Jan/20 8:03 AM
lustrelog_r4i6n3_158033093x.txt.gz	910 kB	30/Jan/20 8:03 AM
lustrelog_r4i6n5_158033093x.txt.gz	556 kB	30/Jan/20 8:03 AM
messages_oss00_158033093x.txt.gz	0.4 kB	30/Jan/20 8:03 AM
messages_oss01_158033093x.txt.gz	101 kB	30/Jan/20 8:03 AM
messages_r4i5n12_158033093x.txt.gz	1 kB	30/Jan/20 8:03 AM
messages_r4i6n3_158033093x.txt.gz	2 kB	30/Jan/20 8:03 AM
messages_r4i6n5_158033093x.txt.gz	1 kB	30/Jan/20 8:03 AM
oss.txt	4 kB	03/Feb/20 8:12 AM
oss00_stat.log.gz	0.9 kB	30/Jan/20 8:03 AM
oss01_stat.log.gz	0.9 kB	30/Jan/20 8:03 AM
palpatine20-lustredbg.txt.gz	121 kB	16/Jan/20 10:09 PM
testbed01-oss00-lustredbg.txt.gz	46 kB	15/Jan/20 11:31 PM
testbed01-oss00-lustredbg-FULL.txt.gz	2.66 MB	16/Jan/20 10:09 PM

Issue Links

is related to

LU-12832 soft lockup in ldlm_bl_xx threads at read for a single shared strided file

Resolved

Activity

Gerrit Updater added a comment - 23/May/20 7:57 PM

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38610/
Subject: LU-13131 osc: Ensure immediate departure of sync write pages
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 0c6454503e1fede795d9b094ee92c91f4290924b

Gerrit Updater added a comment - 23/May/20 7:57 PM

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38672/
Subject: LU-13131 osc: Do not wait for grants for too long
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 0dda74eec7a29c98c7b6ee9a99e54c7dbefcabca

Peter Jones added a comment - 21/May/20 4:17 PM

Nice!

Jeff Johnson added a comment - 21/May/20 4:15 PM

Overnight runs of 300+ client nodes. Never failed. Stick a fork in it, this one's done. #bravo

Gerrit Updater added a comment - 21/May/20 6:08 AM

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37468/
Subject: LU-13131 osc: Always send all HP RPCs requests
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 05dc36b07fc1ac40988f7994cd37cd239d9a8986

Oleg Drokin added a comment - 20/May/20 5:06 PM

Thank you for the update, great to hear that.

Peter Jones added a comment - 20/May/20 2:23 PM

Thanks for the update Jeff - this is encouraging news.

Jeff Johnson added a comment - 20/May/20 2:22 PM

So far the reproducer has run very stably on the latest patch set over two days. First runs at 70 clients, subsequent runs at 200+ clients. Have never seen this level of stability running the reproducer. Gathering more nodes to scale up higher but this is looking good.

Peter Jones added a comment - 20/May/20 1:56 PM

All patches landed for 2.14. aeonjeff we're still keen to get your report back on the effectiveness of the fixes in your testing

Gerrit Updater added a comment - 20/May/20 8:28 AM

Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38672
Subject: LU-13131 osc: Do not wait for grants for too long
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 319b13d947a6c7a5f8d77813804b6a0721e80b27

People

Assignee: Oleg Drokin

Reporter: Jeff Johnson

Votes: 0

Watchers: 19

Dates

Created: 14/Jan/20 3:30 AM

Updated: 01/Aug/20 12:42 AM

Resolved: 20/May/20 1:56 PM