Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
- Affects Version/s: Lustre 2.10.4, Lustre 2.10.6, Lustre 2.12.2
- Environment: CentOS 7.6, Lustre/ZFS 2.10.4, 2.10.6, 2.12.2, socklnd and o2ibd
- Severity: 3
Description
Issue:
Clients see partial writes during multi-node N:1 2MB strided writes to striped files. The problem occurs at scale; triggering starts at around 50 concurrent files, where each file receives 2MB strided writes from 128 jobs spread across two Lustre client nodes.
Sampling of client side errors:
08-Jan 08:36:26 WARNING: partial write 1048576 != 4194304 at offset 543099453440 - skywalker09
08-Jan 08:36:26 WARNING: write of 4194304 failed at offset 318150541312 Input/output error [retry 0] - skywalker09
08-Jan 08:37:13 WARNING: partial write 2097152 != 4194304 at offset 144921591808 - skywalker28
08-Jan 08:37:13 WARNING: write of 4194304 failed at offset 356448731136 Input/output error [retry 0] - skywalker28
08-Jan 08:35:48 WARNING: partial write 1048576 != 4194304 at offset 356121575424 - skywalker19
08-Jan 08:35:48 WARNING: write of 4194304 failed at offset 281496518656 Input/output error [retry 0] - skywalker19
08-Jan 08:39:59 WARNING: close failed - Input/output error - skywalker37
08-Jan 08:34:32 WARNING: partial write 2097152 != 4194304 at offset 464540139520 - skywalker87
08-Jan 08:34:32 WARNING: write of 4194304 failed at offset 506952941568 Input/output error [retry 0] - skywalker74
08-Jan 08:38:06 WARNING: write of 2097152 failed at offset 183297376256 Input/output error [retry 0] - skywalker74
08-Jan 08:38:06 WARNING: partial write 2097152 != 4194304 at offset 138403643392 - skywalker63
08-Jan 08:40:10 WARNING: close failed - Input/output error - skywalker63
Lustre Servers:
Lustre + ZFS. 2.10.4, 2.10.6, 2.12.2
Lustre Clients:
2.10.4, 2.10.6, 2.12.2
Tested fabrics:
socklnd (bonded 40GbE) and o2iblnd (FDR & EDR), both direct-connect and routed o2ib->tcp. 1MB, 4MB and 16MB RPC sizes tested.
References to other JIRA tickets that are or may be related:
Details:
A single 600GB file is written to by 128 jobs in a 2MB strided write format. The 128 jobs are spread across two Lustre client nodes.
|        | Start 1 | Start 2 | Start 3 | Start 4 | Start Last |
| job0   | 0MB     | 256MB   | 512MB   | 768MB   | 556794MB   |
| job1   | 2MB     | 258MB   | 514MB   | 770MB   | 556796MB   |
| job2   | 4MB     | 260MB   | 516MB   | 772MB   | 556798MB   |
| ...    | ...     | ...     | ...     | ...     | ...        |
| job127 | 254MB   | 510MB   | 766MB   | 1022MB  | 556800MB   |
The file is written with Lustre stripe options of 4-8 OSTs and a stripe_size of 1MB.
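For clarity, here is a minimal sketch of the write pattern described above (my own illustration, not the actual benchmark code; names such as JOB_COUNT and STRIDE are assumptions):

```python
# Sketch of the strided write pattern. Each of the 128 jobs writes 2MB chunks
# spaced 256MB apart (128 jobs x 2MB) into one shared ~600GB file that has been
# pre-created with a 1MB stripe size across 4-8 OSTs
# (e.g. "lfs setstripe -c 8 -S 1M <file>").
import os

JOB_COUNT = 128                    # jobs, spread across two client nodes
STRIDE = 2 * 1024 * 1024           # 2MB written by each job per round
ROUND = JOB_COUNT * STRIDE         # 256MB between consecutive writes of one job
FILE_SIZE = 600 * 1024**3          # ~600GB shared file

def job_offsets(job_id):
    """Yield the file offsets written by one job (job0: 0, 256MB, 512MB, ...)."""
    offset = job_id * STRIDE
    while offset < FILE_SIZE:
        yield offset
        offset += ROUND

def run_job(path, job_id):
    buf = b"\0" * STRIDE
    fd = os.open(path, os.O_WRONLY)
    try:
        for offset in job_offsets(job_id):
            written = os.pwrite(fd, buf, offset)
            if written != STRIDE:
                print(f"WARNING: partial write {written} != {STRIDE} at offset {offset}")
    finally:
        os.close(fd)
```

The offsets produced for job0, job1, job2 and job127 match the table above.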
With a total of ≤ 30 concurrent files being written, the partial writes are not experienced. In the 30-file case this corresponds to 60 Lustre clients grouped in pairs, each client running 64 jobs (threads), with each pair writing a different strided file. Scaling the same scenario up to ≥ 60 concurrent files, which equates to 120 (or more) Lustre clients in pairs performing strided writes to unique files, triggers the partial writes.
One observation: the combination of the 2MB strided writes and the 1MB OST stripe size means every write crosses a stripe boundary and is split across OST objects, and that fragmentation scales up as the number of files and concurrent writers increases. Perhaps there is a threshold being crossed (speculation on my part).
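To make that observation concrete, here is a small sketch of plain RAID-0 style Lustre striping arithmetic (the stripe_size and stripe_count values are assumptions for illustration) showing that each 2MB strided write lands on two different 1MB stripe chunks, and hence two different OST objects:

```python
# RAID-0 striping arithmetic sketch: which stripe (OST object) indexes does a
# write of a given offset/length touch when stripe_size is 1MB?
STRIPE_SIZE = 1 * 1024 * 1024   # 1MB stripe size used on these files
STRIPE_COUNT = 8                # files use 4-8 OSTs; 8 assumed here

def stripes_touched(offset, length):
    """Return the sorted OST object indexes covered by [offset, offset+length)."""
    first = offset // STRIPE_SIZE
    last = (offset + length - 1) // STRIPE_SIZE
    return sorted({chunk % STRIPE_COUNT for chunk in range(first, last + 1)})

print(stripes_touched(0, 2 * 1024 * 1024))                # [0, 1]
print(stripes_touched(2 * 1024 * 1024, 2 * 1024 * 1024))  # [2, 3]
```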
Error messages witnessed:
There have been randomly occurring client-side error messages. It is not known whether they are causing the partial write events or occurring as a result of the partial writes.
tgt_grant_check lfs-OSTxxxx cli xxxxxxxxx claims X GRANT, real grant Y
ldlm_cancel from xx.xx.xx.xx@tcp
(lib-move.c:4183:lnet_parse()) xx.xx.xx.xx@o2ib, src xx.xx.xx.xx@o2ib: Bad dest nid xx.xx.xx.xx@tcp (it's my nid but on a different network)
No notable server-side (MDS / OSS) error messages are observed that coincide with the partial write errors. The OSS nodes are in an HA configuration and no failovers have occurred as a result of the issue; if either OSS were becoming overloaded, corosync heartbeats would likely be missed and trigger a failover.
Additional information:
With 2MB strided writes and a 1MB OST stripe size, it would seem that certain larger RPC size / brw_size / ZFS recordsize combinations could trigger significant write amplification, perhaps enough to exacerbate the issue.
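As a rough back-of-the-envelope illustration of that concern (my own assumption, not measured data: a sub-recordsize write is taken to force ZFS to read-modify-write one full record):

```python
# Illustrative worst-case write amplification estimate, assuming each write
# smaller than the ZFS recordsize causes a read-modify-write of a full record.
def amplification(write_size_mb, recordsize_mb):
    """Approximate on-disk bytes written per byte submitted by the client."""
    return max(1.0, recordsize_mb / write_size_mb)

for rpc_mb in (1, 4, 16):
    for rec_mb in (1, 4, 16):
        print(f"RPC {rpc_mb}MB, recordsize {rec_mb}MB -> ~{amplification(rpc_mb, rec_mb):.0f}x")
```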
I have tried to trigger the issue on a testbed filesystem and have run through permutations of server-side / client-side versions, socklnd and o2iblnd direct, o2iblnd LNet-routed to socklnd, and 1 / 4 / 16MB RPC sizes, and I cannot trigger the partial write events. Sixteen 64-core AMD Rome clients (eight concurrent files written) are not enough to trigger it.
Attachments
Issue Links
- is related to LU-12832: soft lockup in ldlm_bl_xx threads at read for a single shared strided file (Resolved)