[LU-68] write_disjoint: invalid file size Created: 09/Feb/11 Updated: 29/Mar/11 Resolved: 29/Mar/11 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.0.0, Lustre 2.1.0 |
| Fix Version/s: | Lustre 2.1.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Oleg Drokin | Assignee: | Oleg Drokin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Bugzilla ID: | 23,175 |
| Rank (Obsolete): | 5096 |
| Description |
|
+ su mpiuser sh -c "/opt/mpich/ch-p4/bin/mpirun -np 12 -machinefile /tmp/parallel-scale.machines
Reproduced at Oracle, and I have also seen similar failures locally. |
| Comments |
| Comment by Andreas Dilger [ 09/Feb/11 ] |
|
This looks very similar to https://bugzilla.lustre.org/show_bug.cgi?id=3523. |
| Comment by Oleg Drokin [ 10/Feb/11 ] |
|
I have a log for this one now from my local testing. |
| Comment by Oleg Drokin [ 14/Mar/11 ] |
|
I think I have finally found the problem. In the case discussed, ldlm_extent_shift_kms found an existing lock with a bigger offset and returned the old kms. As soon as we unlock, another thread comes in and updates kms (in our case, a write updating the size in commit_write). The original thread then proceeds to write the stale kms, so the last page of the write is not reflected in kms and is lost. The problem did not happen in 1.8 because there ldlm_extent_shift_kms was called under the lov lock, which is no longer the case. |
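The interleaving described in this comment can be sketched as a small deterministic model. This is only an illustration in Python, not Lustre code: `ToyObject`, `shift_kms`, `racy_interleaving`, and `serialized_interleaving` are hypothetical names, the offsets are made up, and the "serialized" variant shows just one way to close the window, not necessarily what the integrated patch does.

```python
class ToyObject:
    """Hypothetical stand-in for a client object tracking a known minimum size (kms)."""
    def __init__(self, kms):
        self.kms = kms


def shift_kms(other_lock_ends, old_kms):
    """Toy model of a shift-kms step run when a lock goes away.

    If any remaining lock reaches old_kms or beyond, old_kms is kept;
    otherwise kms shrinks to just past the furthest remaining lock.
    """
    kms = 0
    for end in other_lock_ends:
        if end >= old_kms:
            return old_kms
        kms = max(kms, end + 1)
    return kms


def racy_interleaving(obj, other_lock_ends, writer_new_size):
    # Thread A: compute the new kms while protected.
    stale = shift_kms(other_lock_ends, obj.kms)
    # Protection is dropped here, before the value is stored.
    # Thread B: a write extends the file and raises kms.
    obj.kms = max(obj.kms, writer_new_size)
    # Thread A: stores the value it computed earlier, which is now stale.
    obj.kms = stale
    return obj.kms


def serialized_interleaving(obj, other_lock_ends, writer_new_size):
    # Thread A: compute AND store as one step (assumed to be under one lock).
    obj.kms = shift_kms(other_lock_ends, obj.kms)
    # Thread B: the write's update can only land before or after that step.
    obj.kms = max(obj.kms, writer_new_size)
    return obj.kms


# One page already known, another lock covering up to offset 8191,
# and a writer that extends the file to 8192 bytes.
lost = racy_interleaving(ToyObject(4096), [8191], 8192)
kept = serialized_interleaving(ToyObject(4096), [8191], 8192)
print(lost, kept)
```

In the racy ordering the final kms falls back to the old value, so the writer's last page is no longer reflected in it. When the compute and store are serialized against the writer's update, the larger value survives.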
| Comment by Build Master (Inactive) [ 14/Mar/11 ] |
|
Integrated in Oleg Drokin : 186df50693a7a0fd9e20b4ac0ac08d523f5473be
|
| Comment by Build Master (Inactive) [ 16/Mar/11 ] |
|
Integrated in Oleg Drokin : d2dbff42e78d7ebca4db534df7e1c19f6b674a22
|
| Comment by Peter Jones [ 24/Mar/11 ] |
|
James, when do you think you might be able to try out your reproducer with the latest code? Please advise. Peter |
| Comment by Peter Jones [ 29/Mar/11 ] |
|
Believed resolved. ORNL will reopen this ticket or open a new one if their reproducer still has issues.