[LU-2238] Client "unstable" pages potentially lost on unmount (?) Created: 25/Oct/12 Updated: 07/Jun/16 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Prakash Surya (Inactive) | Assignee: | Alex Zhuravlev |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | llnl |
| Rank (Obsolete): | 5296 |
| Description |
|
What happens when a client has dirty and/or "unstable" pages in its cache and unmount is run? I'm doing some single-node testing, and it appears as though the unmount does not wait for these pages to reach stable storage. Thus, the client unmounts with unstable pages still in its cache, leaving the door open for these pages to be dropped if the server fails. Replay is no longer possible because the client has already disconnected, right? This seems like a data integrity issue? |
| Comments |
| Comment by Peter Jones [ 26/Oct/12 ] |
|
Alex, who should look into this one? Peter |
| Comment by Prakash Surya (Inactive) [ 26/Oct/12 ] |
|
After talking with Andreas and Alex: data can be lost at unmount time, because with async commits an RPC can complete before its data has been committed to stable storage. The "right" fix is to send a sync call to the server and wait for it to complete, to ensure the pages are safe on disk. The problem with a "hard" sync is that a coordinated unmount makes a large number of clients send the sync at the same time, causing a "sync storm" on the servers. Thus, a "soft" sync should be used instead: the client waits for last_committed to be piggybacked on a ping RPC before disconnecting. |
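To make the "soft" sync idea concrete, here is a minimal sketch in Python. It is a toy model only, not Lustre code: the class `SoftSyncClient`, the function `soft_sync_before_disconnect`, the `ping` callable, and the `max_pings` limit are all hypothetical names invented for illustration. It assumes each completed write carries a transaction number and that every ping returns the server's current last_committed value.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SoftSyncClient:
    """Toy client state (hypothetical, for illustration only)."""
    last_committed: int = 0       # highest transno the server reported as on disk
    max_replied_transno: int = 0  # highest transno of a completed write RPC

    def on_write_reply(self, transno: int) -> None:
        # The RPC completed, but its data may not be on stable storage yet.
        self.max_replied_transno = max(self.max_replied_transno, transno)

    def on_ping_reply(self, last_committed: int) -> None:
        # last_committed arrives piggybacked on the ping.
        self.last_committed = max(self.last_committed, last_committed)

    def has_unstable(self) -> bool:
        # True while some completed write is not yet known to be committed.
        return self.max_replied_transno > self.last_committed


def soft_sync_before_disconnect(client: SoftSyncClient,
                                ping: Callable[[], int],
                                max_pings: int = 10) -> bool:
    """Ping until every completed write is committed.

    Returns True if it is safe to disconnect, False if the ping
    budget ran out while unstable pages remained.
    """
    for _ in range(max_pings):
        if not client.has_unstable():
            return True
        client.on_ping_reply(ping())
    return not client.has_unstable()
```

In this model the client never forces the server to flush; it only delays the disconnect until the server's normal commit cycle has covered its writes. That is what avoids the sync storm, at the cost of a slower unmount.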
| Comment by Alex Zhuravlev [ 24/Dec/12 ] |
|
Some work needs to be done in CLIO, I think, and then a bit of code implementing the "soft" sync on the OSS.