[LU-8396] ll_dirty_page_discard_warn dropping pages during re-connection. Created: 14/Jul/16 Updated: 14/Jul/16 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Evan Felix | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Servers: Clients: ipoib network |
||
| Epic/Theme: | dataloss |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
During some client I/O performed by a GridFTP transfer, some files end up shorter than they should be. In the log files we see this set of messages:
Jul 13 03:17:30 gridftp01 kernel: Lustre: 2937:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Using that FID, you can usually trace the file and see that it is shorter than it should be. We know this because the file size at creation time is stored in an external database, and errors are reported when the file is read back: the file checksums are incorrect and the file size is wrong. File sizes are in the 30-50 MB range. I've tried working back from the dirty_page error in the code to determine why this specific I/O does not get retried after the reconnection, which happens almost immediately. We are trying to figure out why the reconnections are happening, but it would be nice to know why these I/Os don't complete properly. |
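The check described above (pull the FID out of the client log, then compare the file against the size and checksum recorded at creation) could be scripted roughly as follows. This is only a sketch under stated assumptions: `extract_fids` and `check_file` are hypothetical helper names, the FID pattern assumes the usual bracketed `[0xSEQ:0xOID:0xVER]` form, and the checksum algorithm (`sha256` here) is a placeholder, since the ticket does not say which one the external database stores.

```python
import hashlib
import os
import re

# Assumption: FIDs appear in the log in the bracketed form [0xSEQ:0xOID:0xVER].
FID_RE = re.compile(r"\[0x[0-9a-fA-F]+:0x[0-9a-fA-F]+:0x[0-9a-fA-F]+\]")


def extract_fids(log_lines):
    """Return the unique FIDs found in the log lines, in first-seen order."""
    seen = []
    for line in log_lines:
        for fid in FID_RE.findall(line):
            if fid not in seen:
                seen.append(fid)
    return seen


def check_file(path, expected_size, expected_digest=None, algorithm="sha256"):
    """Compare a file with the size (and optionally checksum) recorded at creation.

    Returns a list of problem descriptions; an empty list means the file matches.
    The algorithm name is an assumption and must match whatever the database stores.
    """
    problems = []
    actual_size = os.path.getsize(path)
    if actual_size != expected_size:
        problems.append(f"size {actual_size} != expected {expected_size}")
    if expected_digest is not None:
        h = hashlib.new(algorithm)
        with open(path, "rb") as f:
            # Read in 1 MiB chunks so large files are not loaded into memory at once.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        if h.hexdigest() != expected_digest.lower():
            problems.append(f"{algorithm} checksum mismatch")
    return problems
```

Each FID returned by `extract_fids` can then be resolved to a path on a client with `lfs fid2path <mountpoint> <fid>`, and the resulting path passed to `check_file` together with the values from the external database.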
| Comments |
| Comment by Andreas Dilger [ 14/Jul/16 ] |
|
The message from ll_dirty_page_discard_warn() is a symptom, not a cause. The root cause is that the client was evicted, and it is unsafe for these pages to be written after the client has lost the lock; the message just tells the admin/user that this has happened so you are aware of it. You need to look at the OST to see why the client was evicted: either it was unresponsive due to load, or there were network problems. |
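As a starting point for the OST-side check suggested in the comment above, a small filter over the server's kernel log can pull out the lines that mention an eviction. This is a minimal sketch, not a Lustre tool: `find_eviction_lines` is a hypothetical helper, it assumes only that eviction messages contain the substring "evict" (the exact wording varies between Lustre versions), and `client_hint` is an optional substring such as the client's NID.

```python
import re

# Assumption: eviction-related messages contain "evict" (e.g. "evicting", "evicted").
EVICT_RE = re.compile(r"evict", re.IGNORECASE)


def find_eviction_lines(log_lines, client_hint=None):
    """Return log lines that mention an eviction.

    If client_hint is given, keep only lines that also contain that substring.
    """
    matches = []
    for line in log_lines:
        if not EVICT_RE.search(line):
            continue
        if client_hint is not None and client_hint not in line:
            continue
        matches.append(line.rstrip("\n"))
    return matches
```

The function accepts any iterable of lines, so an open file handle on the OST's syslog can be passed directly; the timestamps of the matching lines can then be compared with the client-side messages to see which eviction corresponds to the discarded pages.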
| Comment by Evan Felix [ 14/Jul/16 ] |
|
I recreated this error overnight, corrupting about 30 files. Messages on an OST: sar shows that CPU load on the OST was about 10% system, 4% user during the time the issues were seen. |