[LU-4520] Text file busy error -- mainline 3.12 client Created: 21/Jan/14 Updated: 22/Sep/19 Resolved: 22/Sep/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Roland Fehrenbacher | Assignee: | WC Triage |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None | ||
| Environment: |
|
||
| Attachments: |
|
||||||||||||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 12363 | ||||||||||||
| Description |
|
When executing the following simple script, the in-kernel client fails with error "Text file busy", while the out-of-kernel 2.4.2 client compiled against 2.6.32 vanilla works. ----------- rm -f ./test.sh The following script works in all cases: ----------- rm -f ./test.sh |
| Comments |
| Comment by Roland Fehrenbacher [ 22/Jan/14 ] |
|
Update: The issue is also present in current 3.14 kernel.org git |
| Comment by Cédric Dufour [ 23/Jan/14 ] |
|
Hello, I've been running tcpdump on that issue (see attached files). Shortly put:
Could that be the reason why the MDT gets confused ? (hope it helps) Best, Cédric |
| Comment by Roland Fehrenbacher [ 23/Jan/14 ] |
|
Might be interesting to geta tcpdump of the working 2.6.32/2.4.2 client. |
| Comment by Cédric Dufour [ 24/Jan/14 ] |
|
Hello, Attached the packets captured from the working 2.6.32/2.4.2 client. As you can see, the read pattern corresponding to the "./test.sh" is quite different, with consecutive "Concurrent read" separated by "MDS_CLOSE". (hope it helps), Cédric |
| Comment by Cédric Dufour [ 24/Jan/14 ] |
|
Ha, yes, maybe I should have said that attached packets captures corresponds only to the "./test.sh; echo 'echo foo' > ./test.sh" part of the test case. |
| Comment by Cédric Dufour [ 28/Jan/14 ] |
|
Hello, I've attached the 'strace' and corresponding 'lctl dk ...' for the failing "./test.sh; echo 'echo foo' > ./test.sh" part of the test case. Can someone make sense of it (iow. why we bump into that "(file.c:436:ll_intent_file_open()) lock enqueue: err: -26" error) ? Thanks in advance, Cédric |
| Comment by Cédric Dufour [ 29/Jan/14 ] |
|
Hello John, Thank you for for links. $ LUSTRE_DIR=/idiap/temp/cdufour/bug9615-LU4520
$ rm -f "${LUSTRE_DIR}"/test.sh
$ touch "${LUSTRE_DIR}"/test.sh
$ chmod a+x "${LUSTRE_DIR}"/test.sh
$ echo 'echo foo' > "${LUSTRE_DIR}"/test.sh
$ "${LUSTRE_DIR}"/test.sh
foo
$ echo 'echo foo' > "${LUSTRE_DIR}"/test.sh
$ "${LUSTRE_DIR}"/test.sh
bash: /idiap/temp/cdufour/bug9615-LU4520/test.sh: Text file busy
$ "${LUSTRE_DIR}"/test.sh
foo
Is server-side patch mentioned in @Roland: I checked GIT 2.4.2: for-mentioned patch is not applied Best, Cédric |
| Comment by Cédric Dufour [ 29/Jan/14 ] |
|
ADDENDUM: if two write operations are performed before the script execution, no more ETXTBSY: $ touch "${LUSTRE_DIR}"/test.sh
$ "${LUSTRE_DIR}"/test.sh
bash: /idiap/temp/cdufour/bug9615-LU4520/test.sh: Text file busy
$ "${LUSTRE_DIR}"/test.sh
foo
$ touch "${LUSTRE_DIR}"/test.sh
$ touch "${LUSTRE_DIR}"/test.sh
$ "${LUSTRE_DIR}"/test.sh
foo
$ "${LUSTRE_DIR}"/test.sh
foo
|
| Comment by Cédric Dufour [ 30/Jan/14 ] |
|
Hello Again, I've picked up test case from Patch from I'm no LDLM guru but I'm surprised that the 'touch' steps leaves a CR lock behind, which is "picked up" by the 'cp' step afterwards. Shouldn't all locks concerning a given file be cleared after a process is done with it (e.g. 'touch') ? Cédric |
| Comment by John Hammond [ 30/Jan/14 ] |
|
Hello Cédric, You might try the patch from LDLM locks are intended to be cached so the fact that these locks are still held after close does not indicate an error. Can you print the CR lock left behind and describe the 'cp' step. Thanks, John |
| Comment by Roland Fehrenbacher [ 30/Jan/14 ] |
|
Hi John, your patch is for the MDT. Is it plausible that the problem originates from there, given the fact Roland |
| Comment by John Hammond [ 30/Jan/14 ] |
|
Indeed. There are some changes to lookup (with the introduction of atomic_open) between 2.6.32 and 3.12 which may account for the difference. But I have not checked that this is in fact the case here. |
| Comment by Cédric Dufour [ 30/Jan/14 ] |
|
(to answer John's question above) 'error' test case (the 'success' test case is identical, save for the 'touch' step which is ommitted) rm -f "${LUSTRE_DIR}"/echo
rm -f "${LUSTRE_DIR}"/echo # Needed to make sure everything is cleaned-up
rm -f "${LUSTRE_DIR}"/echo # Needed to make really sure everything is cleaned-up
touch "${LUSTRE_DIR}"/echo # 'touch' step
cp -p /bin/echo "${LUSTRE_DIR}"/echo # 'cp' step
"${LUSTRE_DIR}"/echo 'Hello World!' # 'echo' step -> ETXTBSY
As for the "left-behind" CR lock, your explanation about caching makes sense; it is picked up during the 'cp' step: (ldlm_resource.c:1406:ldlm_resource_dump()) ### ### ns: lustre-2-MDT0000-mdc-ffff880138d22800 lock: ffff880034f11e00/0x251fdce27c789bc6 lrc: 1/0,0 mode: CR/CR res: [0x200000e36:0x4:0x0].0 bits 0x9 rrc: 2 type: IBT flags: 0x20000000000 nid: local remote: 0x3ce47269da805bef expref: -99 pid: 14005 timeout: 0 lvb_type: 0 and - if I get correctly - "re"-used for futher CR locking. Now, the other notorious difference between the 'error' and the 'success' test case is regarding the CR lock requested for the initial 'getattr' operation in the 'cp' step; In the 'error' case: (lmv_intent.c:304:lmv_intent_lock()) INTENT LOCK 'getattr' for 'echo' on [0x200000d85:0xf:0x0] (lmv_intent.c:263:lmv_intent_lookup()) LOOKUP_INTENT with fid1=[0x200000d85:0xf:0x0], fid2=[0x0:0x0:0x0], name='echo' -> mds #0 (ldlm_lock.c:758:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(CR) ns: lustre-2-MDT0000-mdc-ffff880138d22800 lock: ffff88018cf1cc00/0x251fdce27c789bcd lrc: 3/1,0 mode: --/CR res: [0x200000d85:0xf:0x0].0 bits 0x0 rrc: 2 type: IBT flags: 0x10000000000000 nid: local remote: 0x0 expref: -99 pid: 14007 timeout: 0 lvb_type: 0 (ldlm_request.c:898:ldlm_cli_enqueue()) ### client-side enqueue START, flags 1000 (ldlm_request.c:960:ldlm_cli_enqueue()) ### sending request ns: lustre-2-MDT0000-mdc-ffff880138d22800 lock: ffff88018cf1cc00/0x251fdce27c789bcd lrc: 3/1,0 mode: --/CR res: [0x200000d85:0xf:0x0].0 bits 0x2 rrc: 2 type: IBT flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 14007 timeout: 0 lvb_type: 0 (ldlm_request.c:606:ldlm_cli_enqueue_fini()) ### server returned different mode PR ns: lustre-2-MDT0000-mdc-ffff880138d22800 lock: ffff88018cf1cc00/0x251fdce27c789bcd lrc: 4/1,0 mode: --/CR res: [0x200000d85:0xf:0x0].0 bits 0x2 rrc: 2 type: IBT flags: 0x0 nid: local remote: 0x3ce47269da805cdd expref: -99 pid: 14007 timeout: 0 lvb_type: 0 While in the 'success' case: (lmv_intent.c:304:lmv_intent_lock()) INTENT LOCK 'getattr' for 'echo' on [0x200000d85:0xf:0x0] (lmv_intent.c:263:lmv_intent_lookup()) LOOKUP_INTENT with fid1=[0x200000d85:0xf:0x0], fid2=[0x0:0x0:0x0], name='echo' -> mds #0 (ldlm_lock.c:758:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(CR) ns: lustre-2-MDT0000-mdc-ffff880138d22800 lock: ffff880034f11200/0x251fdce27c789c21 lrc: 3/1,0 mode: --/CR res: [0x200000d85:0xf:0x0].0 bits 0x0 rrc: 2 type: IBT flags: 0x10000000000000 nid: local remote: 0x0 expref: -99 pid: 14015 timeout: 0 lvb_type: 0 (ldlm_request.c:898:ldlm_cli_enqueue()) ### client-side enqueue START, flags 1000 (ldlm_request.c:960:ldlm_cli_enqueue()) ### sending request ns: lustre-2-MDT0000-mdc-ffff880138d22800 lock: ffff880034f11200/0x251fdce27c789c21 lrc: 3/1,0 mode: --/CR res: [0x200000d85:0xf:0x0].0 bits 0x2 rrc: 2 type: IBT flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 14015 timeout: 0 lvb_type: 0 (ldlm_request.c:535:ldlm_cli_enqueue_fini()) ### client-side enqueue END (ABORTED) ns: lustre-2-MDT0000-mdc-ffff880138d22800 lock: ffff880034f11200/0x251fdce27c789c21 lrc: 4/1,0 mode: --/CR res: [0x200000d85:0xf:0x0].0 bits 0x2 rrc: 2 type: IBT flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 14015 timeout: 0 lvb_type: 0 (this leads to no ETXTBSY; but is the "ABORTED" lock attempt normal?) NOTE: the attached If Roland is to apply the |
| Comment by Andreas Dilger [ 10/Feb/14 ] |
|
What about the patch for |
| Comment by Roland Fehrenbacher [ 30/Mar/14 ] |
|
Hi John, your patch ( Roland |
| Comment by Roland Fehrenbacher [ 02/Apr/14 ] |
|
I should have added, that it is indeed also necessary to include |
| Comment by Peter Jones [ 02/Apr/14 ] |
|
Roland Wasn't the Peter |
| Comment by Roland Fehrenbacher [ 02/Apr/14 ] |
|
Peter, yes, it is. But we're working with the in-kernel client based on 3.12 and for that we needed to make a backport. The Roland |
| Comment by Peter Jones [ 02/Apr/14 ] |
|
Ah yes - thanks for clarifying! |
| Comment by James A Simmons [ 22/Sep/19 ] |
|
Fixed a long time ago |