[LU-274] Client delayed file status (cache meta-data) causing job failures Created: 04/May/11 Updated: 06/Nov/13 Resolved: 30/May/11 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.0.0, Lustre 2.1.0, Lustre 1.8.6 |
| Fix Version/s: | Lustre 2.0.0, Lustre 2.1.0, Lustre 1.8.6 |
| Type: | Bug | Priority: | Major |
| Reporter: | Steven Woods | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
servers running 1.8.5.54 and 1.8.5.55 both also have a patch to increase the number of MDS threads |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 5021 | ||||||||
| Description |
|
A site key user while running a weather forcasting model that as part of its post processing and gathering of information from lustre noticed that sometimes his job would fail or get errors from tar because of zero length files. Unpon further investigation it was noticed that you could see basically what was going on from a simple "ls -l" command. The first instance showed files with zero length. After a short period of time (10-15 seconds and it varies) repeating the "ls -l" would show the correct size. Needless to say this gives the impression of data corruption since the tar file can't be trusted. The data in reality is there but not appearant to tar. If you place a "sync" command between steps 2 and 3 you appear to be able to avoid the problem.However, this is not acceptable because since we can't be expected to tell all users they need to issue a sync before certain operations or they may not get all their data. An example of the ls -l output >> ls -ltr Wait an undetermined amount of time repeat: >> ls -ltr Data size is now correct. I have made a couple of attempts at a local test using lustre 1.8.4 and so far am unable to reproduce the problem but not convinced this is conclusive. Will be running more tests. On the Oracle bugzilla basically bug 24501 is the same problem and also searching their DB it appears that bug 24458 might also be basically the same issue. I will attach a test case that has the basic commands used. The actual test data is available but is extremely large and I would need to be pointed to a ftp for that. |
| Comments |
| Comment by Johann Lombardi (Inactive) [ 04/May/11 ] |
|
hm, we landed the patch from bug 23174 in 1.8.5. filter_intent_policy() used to update the LVB from disk when the AST failed (whatever the error is), while it is done only for -EINVAL now .. it might be related. Could you please give a try to your reproducer with this patch reverted? |
| Comment by Steven Woods [ 04/May/11 ] |
|
the site did try to drop back to lustre 1.8.4 plus the following patches Their initial tests seemed to show a similar issue. Still gathering information to see the validity of it. |
| Comment by Johann Lombardi (Inactive) [ 04/May/11 ] |
|
Did they downgrade all the OSSs? |
| Comment by Andreas Dilger [ 04/May/11 ] |
|
The problem description mentions "Execute a mppcombine program that takes this data and creates a set of joined output data". Just to clarify, this does not actually use the experimental Lustre "file join" functionality, I hope? Would it be possible to get the source code for the mppcombine program. Unfortunately, the binary doesn't really help me debug the problem. |
| Comment by Steven Woods [ 04/May/11 ] |
|
Will try and get the src.. |
| Comment by Johann Lombardi (Inactive) [ 04/May/11 ] |
|
Looker closer at the patch from bug 23174, it seems that this patch breaks glimpse for server-side /*
* ->l_glimpse_ast() for DLM extent locks acquired on the server-side. See
* comment in filter_intent_policy() on why you may need this.
*/
int ldlm_glimpse_ast(struct ldlm_lock *lock, void *reqp)
{
/*
* Returning -ELDLM_NO_LOCK_DATA actually works, but the reason for
* that is rather subtle: with OST-side locking, it may so happen that
* _all_ extent locks are held by the OST. If client wants to obtain
* current file size it calls ll{,u}_glimpse_size(), and (as locks are
* on the server), dummy glimpse callback fires and does
* nothing. Client still receives correct file size due to the
* following fragment in filter_intent_policy():
*
* rc = l->l_glimpse_ast(l, NULL); // this will update the LVB
* if (rc != 0 && res->lr_namespace->ns_lvbo &&
* res->lr_namespace->ns_lvbo->lvbo_update) {
* res->lr_namespace->ns_lvbo->lvbo_update(res, NULL, 0, 1);
* }
*
* that is, after glimpse_ast() fails, filter_lvbo_update() runs, and
* returns correct file size to the client.
*/
return -ELDLM_NO_LOCK_DATA;
}
However, the patch from bug 23174 removed the LVB update in filter_intent_policy() That said, the server-side lock is released just after i/o completion and the LVB is |
| Comment by Andreas Dilger [ 04/May/11 ] |
|
For consistency sake, it probably is better to add a "local glimpse" callback that the OST thread registers when it enqueues the lock. That will make the glimpse mechanism work correctly without special-casing the caller. |
| Comment by Peter Kjellstrom (Inactive) [ 05/May/11 ] |
|
[EDIT: we now doubt the usefulness of our initial reproducer] This sounds very much like bz#24458. We've contributed a way to reproduce this (at least in our env.) in that bz entry. By coincidence the problem was at our site first seen by weather forecasting people running tar... |
| Comment by Peter Jones [ 05/May/11 ] |
|
Niu Could you please look into this one? Johann suggests trying to reproduce this issue on Toro as a first step. Thanks Peter |
| Comment by Steven Woods [ 05/May/11 ] |
|
Peter let me know if you need the complete dataset to help reproduce it. Just need to know where to ftp it to. |
| Comment by Peter Jones [ 05/May/11 ] |
|
Thanks Steve. Will do. – |
| Comment by Steven Woods [ 05/May/11 ] |
|
Here is the src for the mppcombine program. |
| Comment by Andreas Dilger [ 05/May/11 ] |
|
I was able to hit this problem on my home Lustre filesystem. It client1$ create files (just normal writes) {wait}client2$ find /myth/tv -size 0 -ls {list of zero-length files}client2$ ls -l {files}some of the files now returned a non-zero size. I confirmed by The problem definitely appears to be glimpse related, since the I think for the short term (maybe still 1.8.6 if possible) the |
| Comment by Niu Yawei (Inactive) [ 05/May/11 ] |
|
The comment #11 of bz23766 looks interesting to me, it said that file size will always be updated on MDS when close file, but in 1.8, the file size is not updated on every close, it's only updated along with ctime udpate. I think that could probably be the reason that we can't see this problem in 1.6, maybe the updating size on every close just hide the problem. |
| Comment by Niu Yawei (Inactive) [ 05/May/11 ] |
|
It looks to me there is a race in the size glimpse: When a client received glimpse callback, it get the lock by ldlm_handle2lock_ns() in ldlm_callback_handler(), and just before calling the ll_glimpse_callback(), this lock is happen to be canceled locally, and the kms is shrinked on cancel, then ll_glimpse_callback() will tell server a wrong size (less than actual size). Did I miss anything? |
| Comment by Johann Lombardi (Inactive) [ 05/May/11 ] |
|
> The comment #11 of bz23766 looks interesting to me, it said that file size will always The file size on the MDS is just a hint for e2scan, it is not used anywhere in Lustre. > I think for the short term (maybe still 1.8.6 if possible) the Andreas, i'm all for this (although i suspect that re-adding the LVB refresh in |
| Comment by Niu Yawei (Inactive) [ 06/May/11 ] |
|
> The file size on the MDS is just a hint for e2scan, it is not used anywhere in Lustre. Hi, Johann |
| Comment by Johann Lombardi (Inactive) [ 06/May/11 ] |
|
> When a client received glimpse callback, it get the lock by ldlm_handle2lock_ns() It seems that you are right. ll_glimpse_callback() can race with ll_extent_lock_cancel_cb(). This race has been around for a while and is probably not the root cause of the problem reported here since it appeared in 1.8.5. Let's file a new bug for this. Good catch!!! Actually, while looking at ll_glimpse_callback(), I have realized that beyond the lockless i/o case, clients can return ELDLM_NO_LOCK_DATA to glimpse in many cases. Since we no longer update LVB from disk for this error, that's likely related to our problem. |
| Comment by Johann Lombardi (Inactive) [ 06/May/11 ] |
|
> It seems that you are right. ll_glimpse_callback() can race with ll_extent_lock_cancel_cb(). Actually, we already have a flag (i.e. LDLM_FL_KMS_IGNORE) to notify that kms is being updated because the lock is being cancelled. We could just check for this flag in ll_glimpse_callback()and return EINVAL to the glimpse. The server should then update the LVB from disk and get the right object size since kms is updated on the client once all dirty data has been pushed to the server already. |
| Comment by Niu Yawei (Inactive) [ 06/May/11 ] |
|
> Actually, while looking at ll_glimpse_callback(), I have realized that beyond the lockless i/o case, clients can return Right, I just reproduced the problem in my virtual machine by Andreas's way. I saw client return -ELDLM_NO_LOCK_DATA on glimpse callback. (inode *inode = ll_inode_from_lock(lock), got NULL inode in ll_glimpse_callback(), because the inode is already be clear now) |
| Comment by James A Simmons [ 06/May/11 ] |
|
Will the patch be posted here or under |
| Comment by Niu Yawei (Inactive) [ 06/May/11 ] |
|
Hi, James The patch is posted in Gerrit. http://review.whamcloud.com/507 |
| Comment by Johann Lombardi (Inactive) [ 09/May/11 ] |
|
FYI, I've just merged the fix in b1_8. |
| Comment by Build Master (Inactive) [ 09/May/11 ] |
|
Integrated in Johann Lombardi : 8cf8ac002a32b872677b8d9064e35f8eda436a28
|
| Comment by Build Master (Inactive) [ 09/May/11 ] |
|
Integrated in Johann Lombardi : 8cf8ac002a32b872677b8d9064e35f8eda436a28
|
| Comment by Build Master (Inactive) [ 09/May/11 ] |
|
Integrated in Johann Lombardi : 8cf8ac002a32b872677b8d9064e35f8eda436a28
|
| Comment by Build Master (Inactive) [ 09/May/11 ] |
|
Integrated in Johann Lombardi : 8cf8ac002a32b872677b8d9064e35f8eda436a28
|
| Comment by Build Master (Inactive) [ 09/May/11 ] |
|
Integrated in Johann Lombardi : 8cf8ac002a32b872677b8d9064e35f8eda436a28
|
| Comment by Build Master (Inactive) [ 09/May/11 ] |
|
Integrated in Johann Lombardi : 8cf8ac002a32b872677b8d9064e35f8eda436a28
|
| Comment by Build Master (Inactive) [ 09/May/11 ] |
|
Integrated in Johann Lombardi : 8cf8ac002a32b872677b8d9064e35f8eda436a28
|
| Comment by Build Master (Inactive) [ 09/May/11 ] |
|
Integrated in Johann Lombardi : 8cf8ac002a32b872677b8d9064e35f8eda436a28
|
| Comment by Build Master (Inactive) [ 09/May/11 ] |
|
Integrated in Johann Lombardi : 8cf8ac002a32b872677b8d9064e35f8eda436a28
|
| Comment by Build Master (Inactive) [ 09/May/11 ] |
|
Integrated in Johann Lombardi : 8cf8ac002a32b872677b8d9064e35f8eda436a28
|
| Comment by Build Master (Inactive) [ 09/May/11 ] |
|
Integrated in Johann Lombardi : 8cf8ac002a32b872677b8d9064e35f8eda436a28
|
| Comment by Peter Kjellstrom (Inactive) [ 11/May/11 ] |
|
I've updated bz#24458 with a new reproducer (a lot simpler only two clients and a few seconds) and "lctl dk" logs. |
| Comment by Johann Lombardi (Inactive) [ 11/May/11 ] |
|
Peter, could you please check that your test case fails to reproduce the problem with the patch we just landed? |
| Comment by Peter Kjellstrom (Inactive) [ 12/May/11 ] |
|
We're working on that but changing server side stuff is a bit more time consuming (ie. can't be done on production system). |
| Comment by Peter Kjellstrom (Inactive) [ 12/May/11 ] |
|
Johann: We've now built and tested 1.8 from git The next question is how should we deploy this fix... Try to apply the "
|
| Comment by Johann Lombardi (Inactive) [ 12/May/11 ] |
|
Hi Peter. I would recommend that you run with your patched 1.8.5 until there is a 1.8.x release containing this fix. |
| Comment by Peter Kjellstrom (Inactive) [ 13/May/11 ] |
|
Johann: we don't have a patched 1.8.5 (unless you're referring to git b1_8 2ecce620a67478365c2d0c57ca580c32914c09fe). Except for that build we're considering backporting "LVB from disk when glimpse callback return error" onto either 1.8.5-release or 1.8.5.55-wc. |
| Comment by Johann Lombardi (Inactive) [ 13/May/11 ] |
|
Since our b1_8 branch has not gone through a full test cycle yet, i think it is more reasonable to patch 1.8.5 for now. |
| Comment by Build Master (Inactive) [ 25/May/11 ] |
|
Integrated in Oleg Drokin : 0533e7bcd47e31257f09ac630e75c86462747361
|
| Comment by Build Master (Inactive) [ 25/May/11 ] |
|
Integrated in Oleg Drokin : 0533e7bcd47e31257f09ac630e75c86462747361
|
| Comment by Build Master (Inactive) [ 25/May/11 ] |
|
Integrated in Oleg Drokin : 0533e7bcd47e31257f09ac630e75c86462747361
|
| Comment by Build Master (Inactive) [ 25/May/11 ] |
|
Integrated in Oleg Drokin : 0533e7bcd47e31257f09ac630e75c86462747361
|
| Comment by Build Master (Inactive) [ 25/May/11 ] |
|
Integrated in Oleg Drokin : 0533e7bcd47e31257f09ac630e75c86462747361
|
| Comment by Build Master (Inactive) [ 25/May/11 ] |
|
Integrated in Oleg Drokin : 0533e7bcd47e31257f09ac630e75c86462747361
|
| Comment by Build Master (Inactive) [ 25/May/11 ] |
|
Integrated in Oleg Drokin : 0533e7bcd47e31257f09ac630e75c86462747361
|
| Comment by Build Master (Inactive) [ 25/May/11 ] |
|
Integrated in Oleg Drokin : 0533e7bcd47e31257f09ac630e75c86462747361
|
| Comment by Build Master (Inactive) [ 25/May/11 ] |
|
Integrated in Oleg Drokin : 0533e7bcd47e31257f09ac630e75c86462747361
|
| Comment by Build Master (Inactive) [ 25/May/11 ] |
|
Integrated in Oleg Drokin : 0533e7bcd47e31257f09ac630e75c86462747361
|
| Comment by Build Master (Inactive) [ 25/May/11 ] |
|
Integrated in Oleg Drokin : 0533e7bcd47e31257f09ac630e75c86462747361
|
| Comment by Build Master (Inactive) [ 25/May/11 ] |
|
Integrated in Oleg Drokin : 0533e7bcd47e31257f09ac630e75c86462747361
|
| Comment by Build Master (Inactive) [ 25/May/11 ] |
|
Integrated in Oleg Drokin : 0533e7bcd47e31257f09ac630e75c86462747361
|
| Comment by Build Master (Inactive) [ 25/May/11 ] |
|
Integrated in Oleg Drokin : 0533e7bcd47e31257f09ac630e75c86462747361
|
| Comment by Build Master (Inactive) [ 25/May/11 ] |
|
Integrated in Oleg Drokin : 0533e7bcd47e31257f09ac630e75c86462747361
|
| Comment by Niu Yawei (Inactive) [ 30/May/11 ] |
|
Landed in b1_8 and master. |