[LU-68] write_disjoint: invalid file size Created: 09/Feb/11  Updated: 29/Mar/11  Resolved: 29/Mar/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.0.0, Lustre 2.1.0
Fix Version/s: Lustre 2.1.0

Type: Bug Priority: Blocker
Reporter: Oleg Drokin Assignee: Oleg Drokin
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Bugzilla ID: 23175
Rank (Obsolete): 5096

 Description   

+ su mpiuser sh -c "/opt/mpich/ch-p4/bin/mpirun -np 12 -machinefile /tmp/parallel-scale.machines
/usr/lib64/lustre/tests/write_disjoint -f /mnt/lustre/d0.write_disjoint/file -n 10000 "
[0] MPI Abort by user Aborting program !
loop 0: chunk_size 103399
loop 544: chunk_size 113838, file size was 1366056
rank 0, loop 545: invalid file size 528737 instead of 576804 = 48067 * 12
[0] Aborting program!
p4_error: latest msg from perror: Resource temporarily unavailable

Reproduced at Oracle, and I have also seen similar failures locally.
Could be related to the LU-67 issue.
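
The failing check is write_disjoint's per-loop size verification: every rank writes a chunk_size-byte chunk at its own disjoint offset, and rank 0 then expects the resulting file size to equal chunk_size * nranks (576804 = 48067 * 12 in the log above). The following is a minimal, simplified sketch of that check, not the actual lustre/tests/write_disjoint.c source; the function name verify_size and the hard-coded chunk size are illustrative only.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Sketch of one write_disjoint loop iteration: every rank writes a
     * disjoint chunk_size-byte chunk, then rank 0 verifies the file size.
     * The real test varies chunk_size on each loop. */
    static int verify_size(const char *fname, int rank, int nranks,
                           off_t chunk_size, int loop)
    {
            char *buf = malloc(chunk_size);
            int fd = open(fname, O_WRONLY | O_CREAT, 0644);

            memset(buf, 'a' + rank, chunk_size);
            pwrite(fd, buf, chunk_size, rank * chunk_size); /* disjoint offsets */
            close(fd);
            free(buf);

            MPI_Barrier(MPI_COMM_WORLD); /* all writes finish before the check */

            if (rank == 0) {
                    struct stat st;

                    stat(fname, &st);
                    if (st.st_size != chunk_size * nranks) {
                            fprintf(stderr, "rank %d, loop %d: invalid file size "
                                    "%lld instead of %lld = %lld * %d\n",
                                    rank, loop, (long long)st.st_size,
                                    (long long)(chunk_size * nranks),
                                    (long long)chunk_size, nranks);
                            return -1;
                    }
            }
            return 0;
    }

    int main(int argc, char **argv)
    {
            int rank, nranks;

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &nranks);

            if (verify_size("/mnt/lustre/d0.write_disjoint/file", rank, nranks,
                            48067, 0) != 0)
                    MPI_Abort(MPI_COMM_WORLD, 1);

            MPI_Finalize();
            return 0;
    }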



 Comments   
Comment by Andreas Dilger [ 09/Feb/11 ]

This looks very similar to https://bugzilla.lustre.org/show_bug.cgi?id=3523.

Comment by Oleg Drokin [ 10/Feb/11 ]

I now have a log for this one from my local testing.
This is a case of a page not being sent to the server, which can cause either this failure or LU-67, depending on whether the missing page was at the end of the file or in the middle.
It seems to be something to do with an incorrect kms; still digging.

Comment by Oleg Drokin [ 14/Mar/11 ]

I think I have finally found what the problem is. In osc_lock_detach:
               /* Update the kms. Need to loop all granted locks.
                * Not a problem for the client */
               attr->cat_kms = ldlm_extent_shift_kms(dlmlock, old_kms);
               unlock_res_and_lock(dlmlock);
<<HERE>>        
               cl_object_attr_lock(obj);
               cl_object_attr_set(env, obj, attr, CAT_KMS);
               cl_object_attr_unlock(obj);

For the case discussed here, ldlm_extent_shift_kms() found an existing granted lock covering a higher offset and therefore returned the old kms. As soon as we call unlock_res_and_lock(), another thread can come in and update the kms (in our case a write updating the size in commit_write); the original thread then proceeds to store the stale kms, so the last page of the write is not reflected in the kms and is lost.

The problem did not happen in 1.8 because there ldlm_extent_shift_kms() was called under the lov lock; that is no longer the case.
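
A minimal sketch of one way to close that window, not necessarily the exact change in the patch that landed, and assuming the lock ordering allows nesting the cl object attribute lock inside the ldlm resource lock, is to publish the new kms before unlock_res_and_lock() rather than after it:

               /* Update the kms. Need to loop all granted locks.
                * Not a problem for the client */
               attr->cat_kms = ldlm_extent_shift_kms(dlmlock, old_kms);

               /* Sketch: set the kms while the resource lock is still held,
                * so a concurrent writer (e.g. commit_write growing the size)
                * cannot update it in between and then have its value
                * overwritten by the stale kms computed above. */
               cl_object_attr_lock(obj);
               cl_object_attr_set(env, obj, attr, CAT_KMS);
               cl_object_attr_unlock(obj);

               unlock_res_and_lock(dlmlock);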

Comment by Build Master (Inactive) [ 14/Mar/11 ]

Integrated in reviews-centos5 #448
LU-68 Fix a race between lock cancel and write

Oleg Drokin : 186df50693a7a0fd9e20b4ac0ac08d523f5473be
Files :

  • lustre/osc/osc_lock.c
Comment by Build Master (Inactive) [ 16/Mar/11 ]

Integrated in lustre-master-centos5 #151
LU-68 Fix a race between lock cancel and write

Oleg Drokin : d2dbff42e78d7ebca4db534df7e1c19f6b674a22
Files :

  • lustre/osc/osc_lock.c
Comment by Build Master (Inactive) [ 16/Mar/11 ]

Integrated in reviews-rhel6 #33
LU-68 Fix a race between lock cancel and write

Oleg Drokin : d2dbff42e78d7ebca4db534df7e1c19f6b674a22
Files :

  • lustre/osc/osc_lock.c
Comment by Build Master (Inactive) [ 16/Mar/11 ]

Integrated in reviews-centos5 #483
LU-68 Fix a race between lock cancel and write

Oleg Drokin : d2dbff42e78d7ebca4db534df7e1c19f6b674a22
Files :

  • lustre/osc/osc_lock.c
Comment by Peter Jones [ 24/Mar/11 ]

James,

When do you think you might be able to try out your reproducer with the latest code?

Please advise.

Peter

Comment by Peter Jones [ 29/Mar/11 ]

Believed resolved. ORNL will reopen this ticket or open a new one if their reproducer still has issues.
