[LU-2277] Text file busy Created: 05/Nov/12 Updated: 17/Mar/14 Resolved: 11/Mar/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.3 |
| Fix Version/s: | Lustre 2.1.5 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Mahmoud Hanafi | Assignee: | Nathaniel Clark |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
server-2.1.3-1nas (centos6.3) |
||
| Attachments: |
|
||||||||||||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 5443 | ||||||||||||
| Description |
see attached client and server debug traces. |
| Comments |
| Comment by Oleg Drokin [ 05/Nov/12 ] |
|
This error is expected when somebody holds the file open for write. How was a.out produced? Was somebody still holding it open? |
| Comment by Mahmoud Hanafi [ 06/Nov/12 ] |
|
a.out was compiled in place. It seems that the error only occurs on some of the clients. |
| Comment by Oleg Drokin [ 06/Nov/12 ] |
|
Is there any NFS re-export going on by any chance? I also see that you have standard debug enabled. If you can still reproduce this and there is no NFS involved, please increase the debug level on your MDS as follows:
Try to minimize all other activity going on on the cluster.
echo -1 >/proc/sys/lnet/debug ; echo -trace >/proc/sys/lnet/debug
Run the reproducer.
echo "YOUR SAVED VALUE FROM ABOVE" >/proc/sys/lnet/debug
Thanks |
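The raise-and-restore sequence in the comment above can be wrapped in a small shell function. This is only a sketch under stated assumptions: the name `with_full_debug` and its arguments are hypothetical, the debug control file is passed in as a parameter (on the MDS in this ticket it would be /proc/sys/lnet/debug), and it assumes that writing back the value read at the start restores the original mask.

```shell
# Hypothetical helper: run a command with the debug mask raised,
# then put the previous mask back.
#   $1        debug control file (e.g. /proc/sys/lnet/debug on the MDS)
#   $2...     reproducer command and its arguments
with_full_debug() {
    debug_file=$1
    shift
    saved=$(cat "$debug_file") || return 1    # remember the current mask
    printf '%s\n' -1 > "$debug_file"          # enable every debug flag
    printf '%s\n' -trace > "$debug_file"      # then drop the trace flag
    "$@"                                      # run the reproducer
    rc=$?
    printf '%s\n' "$saved" > "$debug_file"    # restore the saved mask
    return "$rc"
}
```

A call might look like `with_full_debug /proc/sys/lnet/debug icpc -O test.c++`, run while other activity on the cluster is kept to a minimum. The function returns the reproducer's exit status, so a failing compile or exec is still visible to the caller.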
| Comment by Mahmoud Hanafi [ 07/Nov/12 ] |
|
The Lustre filesystem is not re-exported via NFS. There are NFS mounts on the client for home directories, etc. I was able to track this issue to the stripe width of the directory. If the stripe width is > 54 it will reproduce the error.
How it was reproduced:
mhanafi@pfe1:/nobackupp5/mhanafi/60> icpc -O test.c++
See attached debug logs. |
| Comment by Nathaniel Clark [ 07/Dec/12 ] |
|
Have reproduction case:
All TCP interconnect
In mounted FS:
Expected Results:
Actual Results:
Other Test Results:
With 2 Clients (both CentOS 6.3 - Lustre 2.1.3-2.6.32_279.2.1.el6.x86_64)
WAG of cause: |
| Comment by Nathaniel Clark [ 10/Dec/12 ] |
|
Attached is a C file that can recreate this bug in directories with a default stripe width greater than 53, simply by doing a modified copy of an executable. The executable ends up in a bad state after the following order of events: |
| Comment by Nathaniel Clark [ 13/Dec/12 ] |
|
After more careful inspection of the logs, it appears that in the failing cases the second open is being processed twice. These are the log lines that match the grep expression '((mdt_close(|mdt_mfd_open).*leaving|incoming)', taken from logs of a double-open copy to a directory with a given stripe size:
Notice the repeated x1421167733816588 in the 54-stripe case, with an accompanying mdt_mfd_open but no mdt_close(); mdt_close is skipped. This means that a request is being processed twice. |
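The check for a repeated request ID can be automated rather than done by eye. The sketch below is hypothetical and rests on assumptions not confirmed in the ticket: the function name `repeated_xids` is invented, it assumes the debug log has been dumped to a text file, and it assumes each "incoming" line carries the request ID as a token of the form x followed by digits (as in x1421167733816588). It narrows the ticket's filter to the "incoming" lines only, and relies on `grep -o`, which GNU and BSD grep provide.

```shell
# Hypothetical helper: print request IDs that show up on more than one
# "incoming" line of a text debug log.
#   $1   path to the dumped debug log
repeated_xids() {
    grep -E 'incoming' "$1" |      # keep only request-arrival lines
        grep -oE 'x[0-9]{8,}' |    # pull out tokens that look like an xid
        sort |
        uniq -d                    # keep IDs seen at least twice
}
```

An empty result would suggest that no request arrived twice in the captured window. A printed ID is a candidate for the double processing described above and should still be checked against the surrounding mdt_mfd_open and mdt_close() lines.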
| Comment by Nathaniel Clark [ 17/Dec/12 ] |
|
Note on what goes over the wire:
For stripe sizes 50-53: (example is stripe 50)
For stripe sizes 54-59: (example is stripe 54) |
| Comment by Nathaniel Clark [ 17/Dec/12 ] |
|
The client seems to be re-sending the open/create because the response data is large, so reply_in_callback() registers the reply as truncated and resends. The solution I'm going to look to implement is to correctly handle the replay of the open/create for write/exec in mdt_mfd_open(). |
| Comment by Nathaniel Clark [ 18/Dec/12 ] |
| Comment by Nathaniel Clark [ 11/Jan/13 ] |
|
The original commit is now just testing updates: The following are cherry-picked commits that depend on 4848; 5002 is unchanged, 5003 needed a merge conflict resolved. |
| Comment by Nathaniel Clark [ 11/Mar/13 ] |
|
Patches picked to branch |