[LU-1974] File corruptions when running with LU-1442 patch, LU-1703 patch is also required. Created: 18/Sep/12  Updated: 19/Nov/12  Resolved: 19/Sep/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Alexandre Louvet Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 6320

 Description   

This bug is opened mainly for information and to make the community aware, just in case ...

So, here is the story: running our/Bull Lustre 2.1.2 version, a customer started to report file corruptions where a Client can successfully create/write/re-read files until another Client tries to access the same file. From that point on, the file content becomes corrupted for everyone (missing blocks of data, zero file size ...) !!

On the other hand, the corruptions have been identified to occur only on OSTs/OSCs where the file-creator Client had no grant left ("/proc/fs/lustre/osc/<OST-import>/cur_{dirty|grant|lost_grant}_bytes" are all zero). They also seem never to be recovered automatically, but only when we run a small program doing O_DIRECT writes to these OSTs ...
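A minimal sketch of how one might scan a Client for such "zero-grant" OSCs, assuming the three counters named above are plain integer files under each OSC directory. The helper names (`read_counters`, `zero_grant_oscs`) and the `base` parameter are hypothetical and only here for illustration; they are not part of Lustre.

```python
from pathlib import Path

# Counter files named in the description above.
COUNTERS = ("cur_dirty_bytes", "cur_grant_bytes", "cur_lost_grant_bytes")


def read_counters(osc_dir):
    """Return {counter_name: int or None} for one OSC directory."""
    values = {}
    for name in COUNTERS:
        try:
            text = (Path(osc_dir) / name).read_text()
            values[name] = int(text.split()[0])
        except (OSError, ValueError, IndexError):
            values[name] = None  # missing or unparsable counter
    return values


def zero_grant_oscs(base="/proc/fs/lustre/osc"):
    """List OSC names whose dirty, grant and lost_grant counters are all 0."""
    base = Path(base)
    if not base.is_dir():
        return []
    flagged = []
    for osc_dir in sorted(base.iterdir()):
        if not osc_dir.is_dir():
            continue
        values = read_counters(osc_dir)
        if all(v == 0 for v in values.values()):
            flagged.append(osc_dir.name)
    return flagged


if __name__ == "__main__":
    for name in zero_grant_oscs():
        print("zero-grant OSC:", name)
```

Run on a Client, this prints one line per OSC whose three counters are all zero; an OSC with any unreadable counter is deliberately not flagged.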

Finally, by taking a full Lustre trace of a program/command on a Client writing to these "zero-grant" OSTs, we found that -EDQUOT was returned in the standard cached-write path; the direct-IO path was then attempted but ended with -EALREADY, which was finally replaced by a successful 0 return value, with no page written/flushed to the Server and no new grant obtained !!!

Looking at the source code concerned, this can only occur if the written page(s) were not marked Dirty ...

Finally we found that this behavior/bug (a missing "set_page_dirty()" in vvp_io_commit_write() when cl_page_add_cache() returns -EDQUOT and the code implicitly switches to the direct-IO path/vvp_page_sync_io()) is not in the Lustre v2.1.2 base but was introduced by the patch from LU-1442, which our R&D had included because of its high criticality ...
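The failure sequence described above can be pictured with a toy user-space model. This is only a sketch under stated assumptions, not Lustre code: `Page`, `Server`, `add_to_cache`, `sync_io` and `commit_write` are hypothetical stand-ins for the real structures and for cl_page_add_cache(), vvp_page_sync_io() and vvp_io_commit_write(), and the `mark_dirty_on_edquot` switch simply models whether the page is marked Dirty before the fallback.

```python
import errno
from dataclasses import dataclass, field


@dataclass
class Page:
    data: bytes
    dirty: bool = False


@dataclass
class Server:
    received: list = field(default_factory=list)


def add_to_cache(page, grant):
    """Stand-in for the cached-write path: caching needs available grant."""
    if grant <= 0:
        return -errno.EDQUOT
    page.dirty = True  # page stays in the client cache, to be flushed later
    return 0


def sync_io(page, server):
    """Stand-in for the direct/sync path: only a Dirty page is sent."""
    if not page.dirty:
        return -errno.EALREADY  # nothing to write from this path's view
    server.received.append(page.data)
    page.dirty = False
    return 0


def commit_write(page, grant, server, mark_dirty_on_edquot):
    """Toy model of the commit-write flow reported in this ticket."""
    rc = add_to_cache(page, grant)
    if rc == -errno.EDQUOT:
        if mark_dirty_on_edquot:
            page.dirty = True  # the step reported as missing
        rc = sync_io(page, server)
        if rc == -errno.EALREADY:
            rc = 0  # error replaced by success, as seen in the trace
    return rc


if __name__ == "__main__":
    for fixed in (False, True):
        srv = Server()
        rc = commit_write(Page(b"block"), grant=0, server=srv,
                          mark_dirty_on_edquot=fixed)
        print("dirty marked:", fixed, "rc:", rc, "pages sent:", len(srv.received))
```

In this model both variants return 0 to the caller, but only the variant that marks the page Dirty actually sends it, which matches the symptom of a "successful" write with no data reaching the Server.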

This bug has since been fixed by LU-1703, which we (I mean Bull R&D) need to integrate asap !!!!

Also, I think an explicit link/comment should be added to LU-1442 describing its runtime dependency on the LU-1703 patch.



 Comments   
Comment by Jinshan Xiong (Inactive) [ 18/Sep/12 ]

glad you have found the root cause.

Comment by Peter Jones [ 18/Sep/12 ]

Thanks Bruno. I have added a link between the tickets. Both of these fixes are included in 2.1.3 so hopefully this is not a widespread issue. Do you need any further action or can we close this ticket?

Comment by Bruno Faccini (Inactive) [ 19/Sep/12 ]

Yes, for sure the ticket can be closed, I don't think there is anything else to be done.

Generated at Sat Feb 10 01:21:18 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.