[LU-2238] Client "unstable" pages potentially lost on unmount (?) Created: 25/Oct/12  Updated: 07/Jun/16

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Prakash Surya (Inactive) Assignee: Alex Zhuravlev
Resolution: Unresolved Votes: 0
Labels: llnl

Rank (Obsolete): 5296

 Description   

What happens when a client has dirty and/or "unstable" pages in its cache, but unmount is run? I'm doing some single-node testing, and it appears as though the unmount does not wait for these pages to reach stable storage. Thus, the client unmounts with unstable pages in its cache, leaving the door open for these pages to be lost if the server fails. Replay is no longer possible because the client has already disconnected, right? This seems like a data integrity issue.



 Comments   
Comment by Peter Jones [ 26/Oct/12 ]

Alex

Who should look into this one?

Peter

Comment by Prakash Surya (Inactive) [ 26/Oct/12 ]

After talking with Andreas and Alex: data can indeed be lost at unmount time, because with async commits an RPC can complete before its data has been committed to stable storage.

The "right" fix is for the client to send a sync call to the server and wait for it to complete, ensuring the pages are safe on disk. The problem with a "hard" sync is that it can cause a "sync storm" on the servers when a large number of clients all send the sync at the same time during a coordinated unmount. Thus, a "soft" sync should be used instead: the client waits for last_committed to be piggybacked on a ping RPC before disconnecting.
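
A rough sketch of the "soft" sync idea, as a toy Python model rather than actual Lustre code. Everything named here (ClientImport, soft_sync_before_disconnect, the ping callable, max_pings) is hypothetical and made up for illustration. The only assumption taken from the discussion is that a ping reply carries the server's last_committed value and that the client should not disconnect until that value covers its newest acknowledged write.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ClientImport:
    """Hypothetical per-server client state; not Lustre's real structures."""
    highest_replied_transno: int = 0  # newest write the server has acknowledged
    last_committed: int = 0           # newest transno the server reports as on disk

    def on_write_reply(self, transno: int) -> None:
        # The RPC completed, but its pages stay "unstable" until committed.
        self.highest_replied_transno = max(self.highest_replied_transno, transno)

    def on_ping_reply(self, last_committed: int) -> None:
        # last_committed arrives piggybacked on the ping reply.
        self.last_committed = max(self.last_committed, last_committed)

    def has_unstable_pages(self) -> bool:
        return self.last_committed < self.highest_replied_transno


def soft_sync_before_disconnect(imp: ClientImport,
                                ping: Callable[[], int],
                                max_pings: int) -> bool:
    """Ping until the server has committed everything, or give up.

    Returns True if it is safe to disconnect, False if max_pings was
    exhausted while unstable pages remained.
    """
    for _ in range(max_pings):
        if not imp.has_unstable_pages():
            return True
        imp.on_ping_reply(ping())
    return not imp.has_unstable_pages()


if __name__ == "__main__":
    imp = ClientImport()
    imp.on_write_reply(7)
    # Simulated server: each ping reports a slightly newer last_committed.
    replies = iter([3, 5, 7, 9])
    safe = soft_sync_before_disconnect(imp, lambda: next(replies), max_pings=10)
    print("safe to disconnect:", safe)
```

No client forces a commit in this model. Each one only waits for the server's normal commit cycle to catch up, which is what would avoid the sync storm. What to do when the wait times out (fall back to a hard sync, or fail the unmount) is left open here.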

Comment by Alex Zhuravlev [ 24/Dec/12 ]

some work to be done in CLIO, I think. then a bit of code implementing "soft" sync on OSS.
