[LU-274] Client delayed file status (cache meta-data) causing job failures Created: 04/May/11  Updated: 06/Nov/13  Resolved: 30/May/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.0.0, Lustre 2.1.0, Lustre 1.8.6
Fix Version/s: Lustre 2.0.0, Lustre 2.1.0, Lustre 1.8.6

Type: Bug Priority: Major
Reporter: Steven Woods Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

servers running 1.8.5.54 and 1.8.5.55 both also have a patch to increase the number of MDS threads
OS version CentOS 5.5
Clients experiencing the problem are running 1.8.5 release from Oracle.


Attachments: Text File mppnccombine.c     File tar_fail.tar    
Issue Links:
Duplicate
is duplicated by LU-2347 lustre client sometimes reports wrong... Closed
Severity: 3
Rank (Obsolete): 5021

 Description   

A site key user while running a weather forcasting model that as part of its post processing and gathering of information from lustre noticed that sometimes his job would fail or get errors from tar because of zero length files. Unpon further investigation it was noticed that you could see basically what was going on from a simple "ls -l" command. The first instance showed files with zero length. After a short period of time (10-15 seconds and it varies) repeating the "ls -l" would show the correct size.
So the user basic sequence that is run is:
1. Base data is in directory from completed run
2. Execute a mppcombine program that takes this data and creates a set of joined output data
3. create a list of output files from mppcombine run to feed to tar using "ls -l" of this newly created data
4. Take that list and begin tarring up the data
5. If you see "xt5-widow2-mpp-test.o406112:tar: atmos_scalar.nc: file changed as we read it" bingo you hit the problem.

Needless to say this gives the impression of data corruption since the tar file can't be trusted. The data in reality is there but not appearant to tar. If you place a "sync" command between steps 2 and 3 you appear to be able to avoid the problem.However, this is not acceptable because since we can't be expected to tell all users they need to issue a sync before certain operations or they may not get all their data.

An example of the ls -l output

>> ls -ltr
.
.
rw------ 1 XX.XXXX user 130991 2011-04-12 20:57 time_FS.o564793
rw------ 1 XX.XXXX user 130988 2011-04-12 21:28 time_FS.o564868
rw------ 1 XX.XXXX user 0 2011-04-12 22:01 time_FS.o564927
rw------ 1 XX.XXXX user 0 2011-04-12 22:36 time_FS.o565015

Wait an undetermined amount of time repeat:

>> ls -ltr
.
.
rw------ 1 XX.XXXX user 130991 2011-04-12 20:57 time_FS.o564793
rw------ 1 XX.XXXX user 130988 2011-04-12 21:28 time_FS.o564868
rw------ 1 XX.XXXX user 130991 2011-04-12 22:01 time_FS.o564927
rw------ 1 XX.XXXX user 130956 2011-04-12 22:36 time_FS.o565015

Data size is now correct.

I have made a couple of attempts at a local test using lustre 1.8.4 and so far am unable to reproduce the problem but not convinced this is conclusive. Will be running more tests.

On the Oracle bugzilla basically bug 24501 is the same problem and also searching their DB it appears that bug 24458 might also be basically the same issue.

I will attach a test case that has the basic commands used. The actual test data is available but is extremely large and I would need to be pointed to a ftp for that.



 Comments   
Comment by Johann Lombardi (Inactive) [ 04/May/11 ]

hm, we landed the patch from bug 23174 in 1.8.5. filter_intent_policy() used to update the LVB from disk when the AST failed (whatever the error is), while it is done only for -EINVAL now .. it might be related.

Could you please give a try to your reproducer with this patch reverted?

Comment by Steven Woods [ 04/May/11 ]

the site did try to drop back to lustre 1.8.4 plus the following patches
patches included are:
bz-23549-att-31345.patch
bz-23766-att-32153.patch
bz-23766-att-32236.patch
bz-23766-att-32323.patch
bz-24226-att-32486.patch
bz-24226-att-32582.patch
lnet_talk_to_gnilnd.diff

Their initial tests seemed to show a similar issue. Still gathering information to see the validity of it.

Comment by Johann Lombardi (Inactive) [ 04/May/11 ]

Did they downgrade all the OSSs?

Comment by Andreas Dilger [ 04/May/11 ]

The problem description mentions "Execute a mppcombine program that takes this data and creates a set of joined output data". Just to clarify, this does not actually use the experimental Lustre "file join" functionality, I hope?

Would it be possible to get the source code for the mppcombine program. Unfortunately, the binary doesn't really help me debug the problem.

Comment by Steven Woods [ 04/May/11 ]

Will try and get the src..

Comment by Johann Lombardi (Inactive) [ 04/May/11 ]

Looker closer at the patch from bug 23174, it seems that this patch breaks glimpse for server-side
locks (e.g. used in lockless i/o including direct i/o).

/*
 * ->l_glimpse_ast() for DLM extent locks acquired on the server-side. See
 * comment in filter_intent_policy() on why you may need this.
 */
int ldlm_glimpse_ast(struct ldlm_lock *lock, void *reqp)
{
        /*   
         * Returning -ELDLM_NO_LOCK_DATA actually works, but the reason for
         * that is rather subtle: with OST-side locking, it may so happen that
         * _all_ extent locks are held by the OST. If client wants to obtain
         * current file size it calls ll{,u}_glimpse_size(), and (as locks are
         * on the server), dummy glimpse callback fires and does
         * nothing. Client still receives correct file size due to the
         * following fragment in filter_intent_policy():
         *
         * rc = l->l_glimpse_ast(l, NULL); // this will update the LVB
         * if (rc != 0 && res->lr_namespace->ns_lvbo &&
         *     res->lr_namespace->ns_lvbo->lvbo_update) {
         *         res->lr_namespace->ns_lvbo->lvbo_update(res, NULL, 0, 1);
         * }
         *
         * that is, after glimpse_ast() fails, filter_lvbo_update() runs, and
         * returns correct file size to the client.
         */
        return -ELDLM_NO_LOCK_DATA;
}

However, the patch from bug 23174 removed the LVB update in filter_intent_policy()

That said, the server-side lock is released just after i/o completion and the LVB is
updated in ost_brw_lock_put(). So the window is very short and the symptoms described
in this bug looks different.

Comment by Andreas Dilger [ 04/May/11 ]

For consistency sake, it probably is better to add a "local glimpse" callback that the OST thread registers when it enqueues the lock. That will make the glimpse mechanism work correctly without special-casing the caller.

Comment by Peter Kjellstrom (Inactive) [ 05/May/11 ]

[EDIT: we now doubt the usefulness of our initial reproducer]

This sounds very much like bz#24458. We've contributed a way to reproduce this (at least in our env.) in that bz entry. By coincidence the problem was at our site first seen by weather forecasting people running tar...

Comment by Peter Jones [ 05/May/11 ]

Niu

Could you please look into this one? Johann suggests trying to reproduce this issue on Toro as a first step.

Thanks

Peter

Comment by Steven Woods [ 05/May/11 ]

Peter let me know if you need the complete dataset to help reproduce it. Just need to know where to ftp it to.

Comment by Peter Jones [ 05/May/11 ]

Thanks Steve. Will do.


Peter Jones
Whamcloud, Inc.
www.whamcloud.com

Comment by Steven Woods [ 05/May/11 ]

Here is the src for the mppcombine program.

Comment by Andreas Dilger [ 05/May/11 ]

I was able to hit this problem on my home Lustre filesystem. It
wasn't a strictly controlled experiment, but something like:

client1$ create files (just normal writes)

{wait}

client2$ find /myth/tv -size 0 -ls

{list of zero-length files}

client2$ ls -l

{files}

some of the files now returned a non-zero size. I confirmed by
inspecting the objects that the files that were still zero bytes
had zero-sized objects.

The problem definitely appears to be glimpse related, since the
size returned on the subsequent "ls -l" was correct.

I think for the short term (maybe still 1.8.6 if possible) the
LVB refresh in filter_intent_policy() and ldlm_cb_interpret()
should be restored.

Comment by Niu Yawei (Inactive) [ 05/May/11 ]

The comment #11 of bz23766 looks interesting to me, it said that file size will always be updated on MDS when close file, but in 1.8, the file size is not updated on every close, it's only updated along with ctime udpate. I think that could probably be the reason that we can't see this problem in 1.6, maybe the updating size on every close just hide the problem.

Comment by Niu Yawei (Inactive) [ 05/May/11 ]

It looks to me there is a race in the size glimpse:

When a client received glimpse callback, it get the lock by ldlm_handle2lock_ns() in ldlm_callback_handler(), and just before calling the ll_glimpse_callback(), this lock is happen to be canceled locally, and the kms is shrinked on cancel, then ll_glimpse_callback() will tell server a wrong size (less than actual size). Did I miss anything?

Comment by Johann Lombardi (Inactive) [ 05/May/11 ]

> The comment #11 of bz23766 looks interesting to me, it said that file size will always
> be updated on MDS when close file, but in 1.8, the file size is not updated on every
> close

The file size on the MDS is just a hint for e2scan, it is not used anywhere in Lustre.
Niu, could you please try to reproduce the problem and to re-add the LVB update in
filter_intent_policy()? If this fixes the problem, I am curious to know what what
kind of error is hit during the glimpse.

> I think for the short term (maybe still 1.8.6 if possible) the
> LVB refresh in filter_intent_policy() and ldlm_cb_interpret()
> should be restored

Andreas, i'm all for this (although i suspect that re-adding the LVB refresh in
filter_intent_policy() might be enough). I just would like someone to give this
a try and to confirm that it indeed fixes the problem.

Comment by Niu Yawei (Inactive) [ 06/May/11 ]

> The file size on the MDS is just a hint for e2scan, it is not used anywhere in Lustre.
> Niu, could you please try to reproduce the problem and to re-add the LVB update in
> filter_intent_policy()? If this fixes the problem, I am curious to know what what
> kind of error is hit during the glimpse.

Hi, Johann
Ah, you are right, the OBD_MD_FLSIZE isn't set on getattr. There isn't any free node on Toro at this moment, I will try to reproduce the problem as soon as there is any nodes available.

Comment by Johann Lombardi (Inactive) [ 06/May/11 ]

> When a client received glimpse callback, it get the lock by ldlm_handle2lock_ns()
> in ldlm_callback_handler(), and just before calling the ll_glimpse_callback(),
> this lock is happen to be canceled locally, and the kms is shrinked on cancel,
> then ll_glimpse_callback() will tell server a wrong size (less than actual size).
> Did I miss anything?

It seems that you are right. ll_glimpse_callback() can race with ll_extent_lock_cancel_cb(). This race has been around for a while and is probably not the root cause of the problem reported here since it appeared in 1.8.5. Let's file a new bug for this. Good catch!!!

Actually, while looking at ll_glimpse_callback(), I have realized that beyond the lockless i/o case, clients can return ELDLM_NO_LOCK_DATA to glimpse in many cases. Since we no longer update LVB from disk for this error, that's likely related to our problem.

Comment by Johann Lombardi (Inactive) [ 06/May/11 ]

> It seems that you are right. ll_glimpse_callback() can race with ll_extent_lock_cancel_cb().
> This race has been around for a while and is probably not the root cause of the problem
> reported here since it appeared in 1.8.5. Let's file a new bug for this. Good catch!!!

Actually, we already have a flag (i.e. LDLM_FL_KMS_IGNORE) to notify that kms is being updated because the lock is being cancelled. We could just check for this flag in ll_glimpse_callback()and return EINVAL to the glimpse. The server should then update the LVB from disk and get the right object size since kms is updated on the client once all dirty data has been pushed to the server already.

Comment by Niu Yawei (Inactive) [ 06/May/11 ]

> Actually, while looking at ll_glimpse_callback(), I have realized that beyond the lockless i/o case, clients can return
> ELDLM_NO_LOCK_DATA to glimpse in many cases. Since we no longer update LVB from disk for this error, that's likely related > to our problem.

Right, I just reproduced the problem in my virtual machine by Andreas's way. I saw client return -ELDLM_NO_LOCK_DATA on glimpse callback. (inode *inode = ll_inode_from_lock(lock), got NULL inode in ll_glimpse_callback(), because the inode is already be clear now)
After add the lvb update back in filter_intent_policy(), the problem is gone. Will provide the patch for review soon.

Comment by James A Simmons [ 06/May/11 ]

Will the patch be posted here or under LU-287?

Comment by Niu Yawei (Inactive) [ 06/May/11 ]

Hi, James

The patch is posted in Gerrit. http://review.whamcloud.com/507

Comment by Johann Lombardi (Inactive) [ 09/May/11 ]

FYI, I've just merged the fix in b1_8.

Comment by Build Master (Inactive) [ 09/May/11 ]

Integrated in lustre-b1_8 » x86_64,client,el5,ofa #46
LU-274 Update LVB from disk when glimpse callback return error

Johann Lombardi : 8cf8ac002a32b872677b8d9064e35f8eda436a28
Files :

  • lustre/obdfilter/filter.c
Comment by Build Master (Inactive) [ 09/May/11 ]

Integrated in lustre-b1_8 » x86_64,client,ubuntu1004,inkernel #46
LU-274 Update LVB from disk when glimpse callback return error

Johann Lombardi : 8cf8ac002a32b872677b8d9064e35f8eda436a28
Files :

  • lustre/obdfilter/filter.c
Comment by Build Master (Inactive) [ 09/May/11 ]

Integrated in lustre-b1_8 » x86_64,client,el6,inkernel #46
LU-274 Update LVB from disk when glimpse callback return error

Johann Lombardi : 8cf8ac002a32b872677b8d9064e35f8eda436a28
Files :

  • lustre/obdfilter/filter.c
Comment by Build Master (Inactive) [ 09/May/11 ]

Integrated in lustre-b1_8 » i686,client,el5,inkernel #46
LU-274 Update LVB from disk when glimpse callback return error

Johann Lombardi : 8cf8ac002a32b872677b8d9064e35f8eda436a28
Files :

  • lustre/obdfilter/filter.c
Comment by Build Master (Inactive) [ 09/May/11 ]

Integrated in lustre-b1_8 » i686,client,el6,inkernel #46
LU-274 Update LVB from disk when glimpse callback return error

Johann Lombardi : 8cf8ac002a32b872677b8d9064e35f8eda436a28
Files :

  • lustre/obdfilter/filter.c
Comment by Build Master (Inactive) [ 09/May/11 ]

Integrated in lustre-b1_8 » i686,client,el5,ofa #46
LU-274 Update LVB from disk when glimpse callback return error

Johann Lombardi : 8cf8ac002a32b872677b8d9064e35f8eda436a28
Files :

  • lustre/obdfilter/filter.c
Comment by Build Master (Inactive) [ 09/May/11 ]

Integrated in lustre-b1_8 » x86_64,server,el5,ofa #46
LU-274 Update LVB from disk when glimpse callback return error

Johann Lombardi : 8cf8ac002a32b872677b8d9064e35f8eda436a28
Files :

  • lustre/obdfilter/filter.c
Comment by Build Master (Inactive) [ 09/May/11 ]

Integrated in lustre-b1_8 » i686,server,el5,inkernel #46
LU-274 Update LVB from disk when glimpse callback return error

Johann Lombardi : 8cf8ac002a32b872677b8d9064e35f8eda436a28
Files :

  • lustre/obdfilter/filter.c
Comment by Build Master (Inactive) [ 09/May/11 ]

Integrated in lustre-b1_8 » x86_64,client,el5,inkernel #46
LU-274 Update LVB from disk when glimpse callback return error

Johann Lombardi : 8cf8ac002a32b872677b8d9064e35f8eda436a28
Files :

  • lustre/obdfilter/filter.c
Comment by Build Master (Inactive) [ 09/May/11 ]

Integrated in lustre-b1_8 » x86_64,server,el5,inkernel #46
LU-274 Update LVB from disk when glimpse callback return error

Johann Lombardi : 8cf8ac002a32b872677b8d9064e35f8eda436a28
Files :

  • lustre/obdfilter/filter.c
Comment by Build Master (Inactive) [ 09/May/11 ]

Integrated in lustre-b1_8 » i686,server,el5,ofa #46
LU-274 Update LVB from disk when glimpse callback return error

Johann Lombardi : 8cf8ac002a32b872677b8d9064e35f8eda436a28
Files :

  • lustre/obdfilter/filter.c
Comment by Peter Kjellstrom (Inactive) [ 11/May/11 ]

I've updated bz#24458 with a new reproducer (a lot simpler only two clients and a few seconds) and "lctl dk" logs.

Comment by Johann Lombardi (Inactive) [ 11/May/11 ]

Peter, could you please check that your test case fails to reproduce the problem with the patch we just landed?

Comment by Peter Kjellstrom (Inactive) [ 12/May/11 ]

We're working on that but changing server side stuff is a bit more time consuming (ie. can't be done on production system).

Comment by Peter Kjellstrom (Inactive) [ 12/May/11 ]

Johann: We've now built and tested 1.8 from git on the servers (using 2.6.18-238.9.1.el5 kernel). We are not able to reproduce the problem anymore. This is on our test file system still but it has survived around 100 times the previous mean time between failure.

The next question is how should we deploy this fix... Try to apply the "LU-274 Update LVB from disk when glimpse callback return error"-patch on 1.8.5-release or run a git-snapshot in production.

commit 2ecce620a67478365c2d0c57ca580c32914c09fe
LU-231 sanity_test_170: FAIL: expected 248 bad lines, but got 249

Comment by Johann Lombardi (Inactive) [ 12/May/11 ]

Hi Peter. I would recommend that you run with your patched 1.8.5 until there is a 1.8.x release containing this fix.

Comment by Peter Kjellstrom (Inactive) [ 13/May/11 ]

Johann: we don't have a patched 1.8.5 (unless you're referring to git b1_8 2ecce620a67478365c2d0c57ca580c32914c09fe). Except for that build we're considering backporting "LVB from disk when glimpse callback return error" onto either 1.8.5-release or 1.8.5.55-wc.

Comment by Johann Lombardi (Inactive) [ 13/May/11 ]

Since our b1_8 branch has not gone through a full test cycle yet, i think it is more reasonable to patch 1.8.5 for now.

Comment by Build Master (Inactive) [ 25/May/11 ]

Integrated in lustre-master » x86_64,client,el5,inkernel #136
LU-274 Update LVB from disk when glimpse callback return error

Oleg Drokin : 0533e7bcd47e31257f09ac630e75c86462747361
Files :

  • lustre/ldlm/ldlm_lockd.c
Comment by Build Master (Inactive) [ 25/May/11 ]

Integrated in lustre-master » x86_64,client,el6,inkernel #136
LU-274 Update LVB from disk when glimpse callback return error

Oleg Drokin : 0533e7bcd47e31257f09ac630e75c86462747361
Files :

  • lustre/ldlm/ldlm_lockd.c
Comment by Build Master (Inactive) [ 25/May/11 ]

Integrated in lustre-master » x86_64,client,sles11,inkernel #136
LU-274 Update LVB from disk when glimpse callback return error

Oleg Drokin : 0533e7bcd47e31257f09ac630e75c86462747361
Files :

  • lustre/ldlm/ldlm_lockd.c
Comment by Build Master (Inactive) [ 25/May/11 ]

Integrated in lustre-master » i686,client,el5,inkernel #136
LU-274 Update LVB from disk when glimpse callback return error

Oleg Drokin : 0533e7bcd47e31257f09ac630e75c86462747361
Files :

  • lustre/ldlm/ldlm_lockd.c
Comment by Build Master (Inactive) [ 25/May/11 ]

Integrated in lustre-master » i686,client,el6,inkernel #136
LU-274 Update LVB from disk when glimpse callback return error

Oleg Drokin : 0533e7bcd47e31257f09ac630e75c86462747361
Files :

  • lustre/ldlm/ldlm_lockd.c
Comment by Build Master (Inactive) [ 25/May/11 ]

Integrated in lustre-master » i686,client,el5,ofa #136
LU-274 Update LVB from disk when glimpse callback return error

Oleg Drokin : 0533e7bcd47e31257f09ac630e75c86462747361
Files :

  • lustre/ldlm/ldlm_lockd.c
Comment by Build Master (Inactive) [ 25/May/11 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #136
LU-274 Update LVB from disk when glimpse callback return error

Oleg Drokin : 0533e7bcd47e31257f09ac630e75c86462747361
Files :

  • lustre/ldlm/ldlm_lockd.c
Comment by Build Master (Inactive) [ 25/May/11 ]

Integrated in lustre-master » x86_64,client,el5,ofa #136
LU-274 Update LVB from disk when glimpse callback return error

Oleg Drokin : 0533e7bcd47e31257f09ac630e75c86462747361
Files :

  • lustre/ldlm/ldlm_lockd.c
Comment by Build Master (Inactive) [ 25/May/11 ]

Integrated in lustre-master » x86_64,server,el6,inkernel #136
LU-274 Update LVB from disk when glimpse callback return error

Oleg Drokin : 0533e7bcd47e31257f09ac630e75c86462747361
Files :

  • lustre/ldlm/ldlm_lockd.c
Comment by Build Master (Inactive) [ 25/May/11 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,ofa #136
LU-274 Update LVB from disk when glimpse callback return error

Oleg Drokin : 0533e7bcd47e31257f09ac630e75c86462747361
Files :

  • lustre/ldlm/ldlm_lockd.c
Comment by Build Master (Inactive) [ 25/May/11 ]

Integrated in lustre-master » x86_64,server,el5,inkernel #136
LU-274 Update LVB from disk when glimpse callback return error

Oleg Drokin : 0533e7bcd47e31257f09ac630e75c86462747361
Files :

  • lustre/ldlm/ldlm_lockd.c
Comment by Build Master (Inactive) [ 25/May/11 ]

Integrated in lustre-master » i686,server,el5,inkernel #136
LU-274 Update LVB from disk when glimpse callback return error

Oleg Drokin : 0533e7bcd47e31257f09ac630e75c86462747361
Files :

  • lustre/ldlm/ldlm_lockd.c
Comment by Build Master (Inactive) [ 25/May/11 ]

Integrated in lustre-master » i686,server,el5,ofa #136
LU-274 Update LVB from disk when glimpse callback return error

Oleg Drokin : 0533e7bcd47e31257f09ac630e75c86462747361
Files :

  • lustre/ldlm/ldlm_lockd.c
Comment by Build Master (Inactive) [ 25/May/11 ]

Integrated in lustre-master » i686,server,el6,inkernel #136
LU-274 Update LVB from disk when glimpse callback return error

Oleg Drokin : 0533e7bcd47e31257f09ac630e75c86462747361
Files :

  • lustre/ldlm/ldlm_lockd.c
Comment by Build Master (Inactive) [ 25/May/11 ]

Integrated in lustre-master » x86_64,server,el5,ofa #136
LU-274 Update LVB from disk when glimpse callback return error

Oleg Drokin : 0533e7bcd47e31257f09ac630e75c86462747361
Files :

  • lustre/ldlm/ldlm_lockd.c
Comment by Niu Yawei (Inactive) [ 30/May/11 ]

Landed in b1_8 and master.

Generated at Sat Feb 10 01:05:26 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.