[LU-4520] Text file busy error -- mainline 3.12 client Created: 21/Jan/14  Updated: 22/Sep/19  Resolved: 22/Sep/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.2
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Roland Fehrenbacher Assignee: WC Triage
Resolution: Fixed Votes: 1
Labels: None
Environment:
  • Client mainline kernel 3.12.5 with patches (mentioned in LU-4400) / lustre-utils 2.4.2
  • Servers lustre 2.4.2/ZFS OSDs and 2.4.2/ldiskfs OSDs
  • ko2iblnd an ksocklnd

Attachments: File LU-4398.tar.bz2     Text File LU-4520.ETXTBSY.lctl-dk.txt     Text File LU-4520.ETXTBSY.strace.txt     Text File LU-4520.ok-vanilla.pcap.txt     Text File LU-4520.pcap.txt     File LU4429+4520.patch    
Issue Links:
Related
is related to LU-4398 mdt_object_open_lock() may not flush ... Resolved
is related to LU-4429 clients leaking open handles/bad lock... Resolved
Severity: 3
Rank (Obsolete): 12363

 Description   

When executing the following simple script, the in-kernel client fails with error "Text file busy", while the out-of-kernel 2.4.2 client compiled against 2.6.32 vanilla works.

-----------
#!/bin/bash

rm -f ./test.sh
touch ./test.sh && chmod a+x ./test.sh
echo "echo foo" > ./test.sh
./test.sh
echo "echo foo" > ./test.sh
------------

The following script works in all cases:

-----------
#!/bin/bash

rm -f ./test.sh
touch ./test.sh && chmod a+x ./test.sh
echo "echo foo" > ./test.sh
bash ./test.sh
echo "echo foo" > ./test.sh
------------



 Comments   
Comment by Roland Fehrenbacher [ 22/Jan/14 ]

Update: The issue is also present in current 3.14 kernel.org git

Comment by Cédric Dufour [ 23/Jan/14 ]

Hello,

I've been running tcpdump on that issue (see attached files). Shortly put:

  • erroneous (ETXTBSY) case shows two consecutive read of the 'test.sh' file, one "Protected read" followed by a "Concurrent read"
  • while the working case shows only a single "Concurrent read"

Could that be the reason why the MDT gets confused ?

(hope it helps)

Best,

Cédric

Comment by Roland Fehrenbacher [ 23/Jan/14 ]

Might be interesting to geta tcpdump of the working 2.6.32/2.4.2 client.

Comment by Cédric Dufour [ 24/Jan/14 ]

Hello,

Attached the packets captured from the working 2.6.32/2.4.2 client.

As you can see, the read pattern corresponding to the "./test.sh" is quite different, with consecutive "Concurrent read" separated by "MDS_CLOSE".

(hope it helps),

Cédric

Comment by Cédric Dufour [ 24/Jan/14 ]

Ha, yes, maybe I should have said that attached packets captures corresponds only to the "./test.sh; echo 'echo foo' > ./test.sh" part of the test case.

Comment by Cédric Dufour [ 28/Jan/14 ]

Hello,

I've attached the 'strace' and corresponding 'lctl dk ...' for the failing "./test.sh; echo 'echo foo' > ./test.sh" part of the test case.

Can someone make sense of it (iow. why we bump into that "(file.c:436:ll_intent_file_open()) lock enqueue: err: -26" error) ?

Thanks in advance,

Cédric

Comment by Cédric Dufour [ 29/Jan/14 ]

Hello John,

Thank you for for links.
I have backported your patch from LU-4429 to the mainline kernel (see attachment).
It helps, but a ETXTBSY is still spawned (at a different step), though it can now be recovered without 'rm'-ing the file:

$ LUSTRE_DIR=/idiap/temp/cdufour/bug9615-LU4520
$ rm -f "${LUSTRE_DIR}"/test.sh
$ touch "${LUSTRE_DIR}"/test.sh
$ chmod a+x "${LUSTRE_DIR}"/test.sh
$ echo 'echo foo' > "${LUSTRE_DIR}"/test.sh
$ "${LUSTRE_DIR}"/test.sh
foo
$ echo 'echo foo' > "${LUSTRE_DIR}"/test.sh
$ "${LUSTRE_DIR}"/test.sh
bash: /idiap/temp/cdufour/bug9615-LU4520/test.sh: Text file busy
$ "${LUSTRE_DIR}"/test.sh
foo

Is server-side patch mentioned in LU-4152 also required to entirely fix this issue ?

@Roland: I checked GIT 2.4.2: for-mentioned patch is not applied

Best,

Cédric

Comment by Cédric Dufour [ 29/Jan/14 ]

ADDENDUM: if two write operations are performed before the script execution, no more ETXTBSY:

$ touch "${LUSTRE_DIR}"/test.sh
$ "${LUSTRE_DIR}"/test.sh
bash: /idiap/temp/cdufour/bug9615-LU4520/test.sh: Text file busy
$ "${LUSTRE_DIR}"/test.sh
foo
$ touch "${LUSTRE_DIR}"/test.sh
$ touch "${LUSTRE_DIR}"/test.sh
$ "${LUSTRE_DIR}"/test.sh
foo
$ "${LUSTRE_DIR}"/test.sh
foo
Comment by Cédric Dufour [ 30/Jan/14 ]

Hello Again,

I've picked up test case from LU-4398 and 'lctl dk'-ed it (without patch from LU-4429).

Patch from LU-4429 does not help in any way (it just allows to get rid of the additional 'rm' in the test case).

I'm no LDLM guru but I'm surprised that the 'touch' steps leaves a CR lock behind, which is "picked up" by the 'cp' step afterwards. Shouldn't all locks concerning a given file be cleared after a process is done with it (e.g. 'touch') ?

Cédric

Comment by John Hammond [ 30/Jan/14 ]

Hello Cédric,

You might try the patch from LU-4398, see http://review.whamcloud.com/#/c/9063/. If you have the time then please do report any findings.

LDLM locks are intended to be cached so the fact that these locks are still held after close does not indicate an error. Can you print the CR lock left behind and describe the 'cp' step.

Thanks, John

Comment by Roland Fehrenbacher [ 30/Jan/14 ]

Hi John,

your patch is for the MDT. Is it plausible that the problem originates from there, given the fact
that the stock 2.4.2 client works flawlessly in this regard (see the first message in this thread)?

Roland

Comment by John Hammond [ 30/Jan/14 ]

Indeed. There are some changes to lookup (with the introduction of atomic_open) between 2.6.32 and 3.12 which may account for the difference. But I have not checked that this is in fact the case here.

Comment by Cédric Dufour [ 30/Jan/14 ]

(to answer John's question above)

'error' test case (the 'success' test case is identical, save for the 'touch' step which is ommitted)

rm -f "${LUSTRE_DIR}"/echo
rm -f "${LUSTRE_DIR}"/echo            # Needed to make sure everything is cleaned-up
rm -f "${LUSTRE_DIR}"/echo            # Needed to make really sure everything is cleaned-up
touch "${LUSTRE_DIR}"/echo            # 'touch' step
cp -p /bin/echo "${LUSTRE_DIR}"/echo  # 'cp' step
"${LUSTRE_DIR}"/echo 'Hello World!'   # 'echo' step -> ETXTBSY

As for the "left-behind" CR lock, your explanation about caching makes sense; it is picked up during the 'cp' step:

(ldlm_resource.c:1406:ldlm_resource_dump()) ### ### ns: lustre-2-MDT0000-mdc-ffff880138d22800 lock: ffff880034f11e00/0x251fdce27c789bc6 lrc: 1/0,0 mode: CR/CR res: [0x200000e36:0x4:0x0].0 bits 0x9 rrc: 2 type: IBT flags: 0x20000000000 nid: local remote: 0x3ce47269da805bef expref: -99 pid: 14005 timeout: 0 lvb_type: 0

and - if I get correctly - "re"-used for futher CR locking.

Now, the other notorious difference between the 'error' and the 'success' test case is regarding the CR lock requested for the initial 'getattr' operation in the 'cp' step;

In the 'error' case:

(lmv_intent.c:304:lmv_intent_lock()) INTENT LOCK 'getattr' for 'echo' on [0x200000d85:0xf:0x0]
(lmv_intent.c:263:lmv_intent_lookup()) LOOKUP_INTENT with fid1=[0x200000d85:0xf:0x0], fid2=[0x0:0x0:0x0], name='echo' -> mds #0
(ldlm_lock.c:758:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(CR) ns: lustre-2-MDT0000-mdc-ffff880138d22800 lock: ffff88018cf1cc00/0x251fdce27c789bcd lrc: 3/1,0 mode: --/CR res: [0x200000d85:0xf:0x0].0 bits 0x0 rrc: 2 type: IBT flags: 0x10000000000000 nid: local remote: 0x0 expref: -99 pid: 14007 timeout: 0 lvb_type: 0
(ldlm_request.c:898:ldlm_cli_enqueue()) ### client-side enqueue START, flags 1000
(ldlm_request.c:960:ldlm_cli_enqueue()) ### sending request ns: lustre-2-MDT0000-mdc-ffff880138d22800 lock: ffff88018cf1cc00/0x251fdce27c789bcd lrc: 3/1,0 mode: --/CR res: [0x200000d85:0xf:0x0].0 bits 0x2 rrc: 2 type: IBT flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 14007 timeout: 0 lvb_type: 0
(ldlm_request.c:606:ldlm_cli_enqueue_fini()) ### server returned different mode PR ns: lustre-2-MDT0000-mdc-ffff880138d22800 lock: ffff88018cf1cc00/0x251fdce27c789bcd lrc: 4/1,0 mode: --/CR res: [0x200000d85:0xf:0x0].0 bits 0x2 rrc: 2 type: IBT flags: 0x0 nid: local remote: 0x3ce47269da805cdd expref: -99 pid: 14007 timeout: 0 lvb_type: 0

While in the 'success' case:

(lmv_intent.c:304:lmv_intent_lock()) INTENT LOCK 'getattr' for 'echo' on [0x200000d85:0xf:0x0]
(lmv_intent.c:263:lmv_intent_lookup()) LOOKUP_INTENT with fid1=[0x200000d85:0xf:0x0], fid2=[0x0:0x0:0x0], name='echo' -> mds #0
(ldlm_lock.c:758:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(CR) ns: lustre-2-MDT0000-mdc-ffff880138d22800 lock: ffff880034f11200/0x251fdce27c789c21 lrc: 3/1,0 mode: --/CR res: [0x200000d85:0xf:0x0].0 bits 0x0 rrc: 2 type: IBT flags: 0x10000000000000 nid: local remote: 0x0 expref: -99 pid: 14015 timeout: 0 lvb_type: 0
(ldlm_request.c:898:ldlm_cli_enqueue()) ### client-side enqueue START, flags 1000
(ldlm_request.c:960:ldlm_cli_enqueue()) ### sending request ns: lustre-2-MDT0000-mdc-ffff880138d22800 lock: ffff880034f11200/0x251fdce27c789c21 lrc: 3/1,0 mode: --/CR res: [0x200000d85:0xf:0x0].0 bits 0x2 rrc: 2 type: IBT flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 14015 timeout: 0 lvb_type: 0
(ldlm_request.c:535:ldlm_cli_enqueue_fini()) ### client-side enqueue END (ABORTED) ns: lustre-2-MDT0000-mdc-ffff880138d22800 lock: ffff880034f11200/0x251fdce27c789c21 lrc: 4/1,0 mode: --/CR res: [0x200000d85:0xf:0x0].0 bits 0x2 rrc: 2 type: IBT flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 14015 timeout: 0 lvb_type: 0

(this leads to no ETXTBSY; but is the "ABORTED" lock attempt normal?)

NOTE: the attached LU-4398.tar.bz2 file contains the details of the test cases and the corresponding 'lctl dk' outputs (for each step)

If Roland is to apply the LU-4398 patch to the server, must it be done jointly with LU-4429 patch on the client ?

Comment by Andreas Dilger [ 10/Feb/14 ]

What about the patch for LU-4430 "mdt: check for MDS_FMODE_EXEC in mdt_mfd_open()" (http://review.whamcloud.com/8719)? It could be the error handling of the kernel trying to open a file for execute is leaking the reference on the file?

Comment by Roland Fehrenbacher [ 30/Mar/14 ]

Hi John,

your patch (LU-4398, http://review.whamcloud.com/#/c/9063/) applied to 2.5.1 indeed fixes the issue. The problem still exists in 2.5.1 without the patch. Great job.

Roland

Comment by Roland Fehrenbacher [ 02/Apr/14 ]

I should have added, that it is indeed also necessary to include LU-4429 on the client.

Comment by Peter Jones [ 02/Apr/14 ]

Roland

Wasn't the LU-4429 fix already in 2.5.1?

Peter

Comment by Roland Fehrenbacher [ 02/Apr/14 ]

Peter,

yes, it is. But we're working with the in-kernel client based on 3.12 and for that we needed to make a backport. The LU-4429 fix is also included in linux-next, so will probably be in 3.15.

Roland

Comment by Peter Jones [ 02/Apr/14 ]

Ah yes - thanks for clarifying!

Comment by James A Simmons [ 22/Sep/19 ]

Fixed a long time ago

Generated at Sat Feb 10 01:43:27 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.