[LU-1380] Potential problem with compiles using Lustre file systems Created: 05/May/12  Updated: 29/May/17  Resolved: 29/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.x (1.8.0 - 1.8.5)
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Joe Mervini Assignee: Oleg Drokin
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Redsky (Sun Blade) cluster using both mdadm-based and DDN 9900-based Lustre file systems.


Severity: 3
Rank (Obsolete): 10248

 Description   

One of our users is making the following claim. I would like to know if there is any validity to his claim:

A newly created file on a Lustre file system has its write lock removed before the contents of the file have appeared (likely because Lustre, for performance reasons, delays writing the file contents in order to service other, higher-priority requests). We've seen this erroneous behavior in multiple contexts.

In the example problem I sent, the Intel compiler had just finished generating an object file (ichos.o) on /gscratch2, and the next step was to invoke the linker to link an executable (ichos.exe) from that object file. The Intel compiler waited until the file system reported that ichos.o had been written to disk and the write lock had been removed. The linker then started immediately and tried to read ichos.o, but the file was still being written to the file system while the linker was reading it, hence the internal error.
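One way to test this claim is a small harness, sketched here under the assumption that the compile-then-link sequence reduces to "one process writes a file and exits, then another opens and reads it immediately". A writer subprocess stands in for the compiler and the parent's read stands in for the linker; the function names, file names, trial count, and sizes are hypothetical. With no argument it runs against a local temporary directory as a baseline; passing a scratch directory on the affected Lustre file system exercises Lustre instead.

```python
import os
import subprocess
import sys
import tempfile
from typing import Optional

# Stand-in for the compiler: a separate process writes the "object file" and exits.
WRITER = (
    "import sys\n"
    "path, size = sys.argv[1], int(sys.argv[2])\n"
    "with open(path, 'wb') as f:\n"
    "    f.write(bytes(i % 251 for i in range(size)))\n"
)


def expected_payload(size: int) -> bytes:
    """The exact bytes WRITER produces for a given size."""
    return bytes(i % 251 for i in range(size))


def check_write_then_read(directory: Optional[str] = None, trials: int = 20,
                          size: int = 64 * 1024) -> int:
    """Return the number of trials in which the immediate read did not match the write."""
    if directory is None:
        # Baseline run in a local temporary directory.
        with tempfile.TemporaryDirectory() as tmp:
            return check_write_then_read(tmp, trials, size)
    want = expected_payload(size)
    failures = 0
    for n in range(trials):
        path = os.path.join(directory, f"write_read_{n}.o")
        # Block until the writer process has exited, as a compiler driver does before linking.
        subprocess.run([sys.executable, "-c", WRITER, path, str(size)], check=True)
        # Stand-in for the linker: open and read the file immediately.
        with open(path, "rb") as f:
            got = f.read()
        if got != want:
            failures += 1
        os.remove(path)
    return failures
```

A nonzero count on the Lustre directory would support the user's claim; a count of zero on a single node does not rule out cross-node effects, since this sketch assumes the compiler and linker run on the same client.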

Another context where we've seen this exact problem is copying large numbers of small files (on the order of 1 KB each) across file systems, writing to a Lustre file system: bizarrely, one or two files will be created with size zero. The source copy of the file was fine, and the target file was created, but the data was never written into it. The only difference is that here the file contents never show up, whereas in the Intel compiler case above the contents are written some time after the file's write lock has been released.
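The zero-size symptom could be checked mechanically with a copy-and-verify pass. This is a sketch under stated assumptions: the source files are synthetic (random bytes generated in a local temporary directory), the default count and 1 KB size are illustrative, and `run_small_file_check`, `copy_and_verify`, and `sha256_of` are hypothetical helper names. Passing a directory on the Lustre file system as `dst_dir` copies into Lustre; otherwise a second local temporary directory is used.

```python
import hashlib
import os
import shutil
import tempfile
from typing import Optional


def sha256_of(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def copy_and_verify(src_dir: str, dst_dir: str) -> list[str]:
    """Copy each regular file from src_dir into dst_dir; return the names of bad copies."""
    bad = []
    for name in sorted(os.listdir(src_dir)):
        src = os.path.join(src_dir, name)
        if not os.path.isfile(src):
            continue
        dst = os.path.join(dst_dir, name)
        shutil.copyfile(src, dst)
        # Catches the reported symptom (target created with size 0) and any content mismatch.
        if os.path.getsize(dst) != os.path.getsize(src) or sha256_of(dst) != sha256_of(src):
            bad.append(name)
    return bad


def run_small_file_check(dst_dir: Optional[str] = None, count: int = 1000,
                         size: int = 1024) -> list[str]:
    """Create `count` local files of `size` random bytes, copy them to dst_dir, and verify."""
    with tempfile.TemporaryDirectory() as src_dir:
        for n in range(count):
            with open(os.path.join(src_dir, f"small_{n:05d}"), "wb") as f:
                f.write(os.urandom(size))
        if dst_dir is not None:
            return copy_and_verify(src_dir, dst_dir)
        with tempfile.TemporaryDirectory() as local_dst:
            return copy_and_verify(src_dir, local_dst)
```

Because the verification reads back from the same client that did the copy, it may be served from that client's cache; re-running only the comparison from a different node would be a stronger check.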



 Comments   
Comment by Andreas Dilger [ 05/May/12 ]

The behaviour described here should never happen with Lustre. When a client is writing to a file, it holds a write lock on the file, and that lock cannot be dropped while there is dirty data in the client cache. Also, unless there is contention on this file from another node (e.g. a parallel compile on multiple nodes?), there is little incentive for the client to drop the lock and its cached pages. The client should still be able to read newly written file data.
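To make the locking rule in this comment concrete, here is a toy model in Python. It is purely conceptual: the class, its fields, and its methods are hypothetical and are not Lustre's LDLM or client interfaces. It captures the two properties described: a client reads its own cached writes, and a write lock is released only after its dirty pages have been written back.

```python
from dataclasses import dataclass, field


@dataclass
class ToyWriteLock:
    """Conceptual model only; names are hypothetical, not Lustre's actual API."""
    server: dict = field(default_factory=dict)  # stand-in for data stored on the servers
    cache: dict = field(default_factory=dict)   # client page cache: offset -> bytes
    dirty: set = field(default_factory=set)     # cached offsets not yet written back
    held: bool = True

    def write(self, offset: int, data: bytes) -> None:
        assert self.held, "writes require the lock"
        self.cache[offset] = data
        self.dirty.add(offset)

    def read(self, offset: int) -> bytes:
        # A local reader is served from the client cache, so it sees its own writes.
        return self.cache.get(offset, self.server.get(offset, b""))

    def cancel(self) -> None:
        # Before the lock is released (e.g. on a conflicting request from another node),
        # every dirty page is flushed; only then are the cached pages dropped.
        for offset in sorted(self.dirty):
            self.server[offset] = self.cache[offset]
        self.dirty.clear()
        self.cache.clear()
        self.held = False
```

In this model there is no state in which the lock is gone but the written data is missing: either the pages are still cached and readable locally, or they have already reached the server.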

Comment by Peter Jones [ 06/May/12 ]

Oleg

Could you please comment on this one?

Thanks

Peter

Comment by Andreas Dilger [ 29/May/17 ]

Close old ticket.
