[LU-4520] Text file busy error -- mainline 3.12 client

Details

    • Bug
    • Resolution: Fixed
    • Major
    • None
    • Lustre 2.4.2
    • None
    • 3
    • 12363

    Description

      When executing the following simple script, the in-kernel client fails with the error "Text file busy", while the out-of-kernel 2.4.2 client compiled against a vanilla 2.6.32 kernel works.

      -----------
      #!/bin/bash

      rm -f ./test.sh
      touch ./test.sh && chmod a+x ./test.sh
      echo "echo foo" > ./test.sh
      ./test.sh
      echo "echo foo" > ./test.sh
      ------------

      The following script works in all cases:

      -----------
      #!/bin/bash

      rm -f ./test.sh
      touch ./test.sh && chmod a+x ./test.sh
      echo "echo foo" > ./test.sh
      bash ./test.sh
      echo "echo foo" > ./test.sh
      ------------
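
The two scripts above differ only in how test.sh is launched, and the report does not say which line prints the error. The following is a minimal sketch, not part of the original report, for finding out which step fails with "Text file busy". The names `check_step` and `write_script` are hypothetical helpers introduced here, and the match on the English error string assumes the C locale.

```shell
#!/bin/bash
# check_step LABEL CMD [ARGS...]
# Hypothetical helper: runs CMD, discards its stdout, captures its stderr,
# and reports whether the failure message mentions "Text file busy" (ETXTBSY).
check_step() {
    local label=$1; shift
    local err
    # Force the C locale inside the subshell so the error text is in English.
    if err=$(export LC_ALL=C; "$@" 2>&1 >/dev/null); then
        echo "${label}: ok"
    elif [[ $err == *"Text file busy"* ]]; then
        echo "${label}: ETXTBSY"
    else
        echo "${label}: failed (${err})"
    fi
}

# Usage sketch (run inside the Lustre directory under test):
#   write_script() { echo "echo foo" > ./test.sh; }
#   rm -f ./test.sh; touch ./test.sh && chmod a+x ./test.sh
#   check_step "first write"  write_script
#   check_step "execute"      ./test.sh
#   check_step "second write" write_script
```

Wrapping the redirection in a function lets the helper capture the shell's own error message when opening test.sh for writing fails.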

      Attachments

        1. LU-4398.tar.bz2
          118 kB
          Cédric Dufour
        2. LU4429+4520.patch
          3 kB
          Cédric Dufour
        3. LU-4520.ETXTBSY.lctl-dk.txt
          436 kB
          Cédric Dufour
        4. LU-4520.ETXTBSY.strace.txt
          20 kB
          Cédric Dufour
        5. LU-4520.ok-vanilla.pcap.txt
          5 kB
          Cédric Dufour
        6. LU-4520.pcap.txt
          4 kB
          Cédric Dufour

          Activity

            simmonsja James A Simmons added a comment -

            Fixed a long time ago

            pjones Peter Jones added a comment -

            Ah yes - thanks for clarifying!

            rfehren Roland Fehrenbacher added a comment - - edited

            Peter,

            yes, it is. But we're working with the in-kernel client based on 3.12 and for that we needed to make a backport. The LU-4429 fix is also included in linux-next, so will probably be in 3.15.

            Roland

            pjones Peter Jones added a comment -

            Roland

            Wasn't the LU-4429 fix already in 2.5.1?

            Peter


            rfehren Roland Fehrenbacher added a comment -

            I should have added that it is indeed also necessary to include the LU-4429 fix on the client.

            rfehren Roland Fehrenbacher added a comment -

            Hi John,

            your patch (LU-4398, http://review.whamcloud.com/#/c/9063/) applied to 2.5.1 indeed fixes the issue. The problem still exists in 2.5.1 without the patch. Great job.

            Roland

            adilger Andreas Dilger added a comment -

            What about the patch for LU-4430 "mdt: check for MDS_FMODE_EXEC in mdt_mfd_open()" (http://review.whamcloud.com/8719)? Could it be that the error handling, when the kernel tries to open a file for execution, is leaking the reference on the file?

            cdufour Cédric Dufour added a comment -

            (to answer John's question above)

            'error' test case (the 'success' test case is identical, save for the 'touch' step, which is omitted)

            rm -f "${LUSTRE_DIR}"/echo
            rm -f "${LUSTRE_DIR}"/echo            # Needed to make sure everything is cleaned-up
            rm -f "${LUSTRE_DIR}"/echo            # Needed to make really sure everything is cleaned-up
            touch "${LUSTRE_DIR}"/echo            # 'touch' step
            cp -p /bin/echo "${LUSTRE_DIR}"/echo  # 'cp' step
            "${LUSTRE_DIR}"/echo 'Hello World!'   # 'echo' step -> ETXTBSY

            As for the "left-behind" CR lock, your explanation about caching makes sense; it is picked up during the 'cp' step:

            (ldlm_resource.c:1406:ldlm_resource_dump()) ### ### ns: lustre-2-MDT0000-mdc-ffff880138d22800 lock: ffff880034f11e00/0x251fdce27c789bc6 lrc: 1/0,0 mode: CR/CR res: [0x200000e36:0x4:0x0].0 bits 0x9 rrc: 2 type: IBT flags: 0x20000000000 nid: local remote: 0x3ce47269da805bef expref: -99 pid: 14005 timeout: 0 lvb_type: 0

            and, if I understand correctly, "re"-used for further CR locking.

            Now, the other notable difference between the 'error' and the 'success' test cases concerns the CR lock requested for the initial 'getattr' operation in the 'cp' step.

            In the 'error' case:

            (lmv_intent.c:304:lmv_intent_lock()) INTENT LOCK 'getattr' for 'echo' on [0x200000d85:0xf:0x0]
            (lmv_intent.c:263:lmv_intent_lookup()) LOOKUP_INTENT with fid1=[0x200000d85:0xf:0x0], fid2=[0x0:0x0:0x0], name='echo' -> mds #0
            (ldlm_lock.c:758:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(CR) ns: lustre-2-MDT0000-mdc-ffff880138d22800 lock: ffff88018cf1cc00/0x251fdce27c789bcd lrc: 3/1,0 mode: --/CR res: [0x200000d85:0xf:0x0].0 bits 0x0 rrc: 2 type: IBT flags: 0x10000000000000 nid: local remote: 0x0 expref: -99 pid: 14007 timeout: 0 lvb_type: 0
            (ldlm_request.c:898:ldlm_cli_enqueue()) ### client-side enqueue START, flags 1000
            (ldlm_request.c:960:ldlm_cli_enqueue()) ### sending request ns: lustre-2-MDT0000-mdc-ffff880138d22800 lock: ffff88018cf1cc00/0x251fdce27c789bcd lrc: 3/1,0 mode: --/CR res: [0x200000d85:0xf:0x0].0 bits 0x2 rrc: 2 type: IBT flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 14007 timeout: 0 lvb_type: 0
            (ldlm_request.c:606:ldlm_cli_enqueue_fini()) ### server returned different mode PR ns: lustre-2-MDT0000-mdc-ffff880138d22800 lock: ffff88018cf1cc00/0x251fdce27c789bcd lrc: 4/1,0 mode: --/CR res: [0x200000d85:0xf:0x0].0 bits 0x2 rrc: 2 type: IBT flags: 0x0 nid: local remote: 0x3ce47269da805cdd expref: -99 pid: 14007 timeout: 0 lvb_type: 0

            While in the 'success' case:

            (lmv_intent.c:304:lmv_intent_lock()) INTENT LOCK 'getattr' for 'echo' on [0x200000d85:0xf:0x0]
            (lmv_intent.c:263:lmv_intent_lookup()) LOOKUP_INTENT with fid1=[0x200000d85:0xf:0x0], fid2=[0x0:0x0:0x0], name='echo' -> mds #0
            (ldlm_lock.c:758:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(CR) ns: lustre-2-MDT0000-mdc-ffff880138d22800 lock: ffff880034f11200/0x251fdce27c789c21 lrc: 3/1,0 mode: --/CR res: [0x200000d85:0xf:0x0].0 bits 0x0 rrc: 2 type: IBT flags: 0x10000000000000 nid: local remote: 0x0 expref: -99 pid: 14015 timeout: 0 lvb_type: 0
            (ldlm_request.c:898:ldlm_cli_enqueue()) ### client-side enqueue START, flags 1000
            (ldlm_request.c:960:ldlm_cli_enqueue()) ### sending request ns: lustre-2-MDT0000-mdc-ffff880138d22800 lock: ffff880034f11200/0x251fdce27c789c21 lrc: 3/1,0 mode: --/CR res: [0x200000d85:0xf:0x0].0 bits 0x2 rrc: 2 type: IBT flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 14015 timeout: 0 lvb_type: 0
            (ldlm_request.c:535:ldlm_cli_enqueue_fini()) ### client-side enqueue END (ABORTED) ns: lustre-2-MDT0000-mdc-ffff880138d22800 lock: ffff880034f11200/0x251fdce27c789c21 lrc: 4/1,0 mode: --/CR res: [0x200000d85:0xf:0x0].0 bits 0x2 rrc: 2 type: IBT flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 14015 timeout: 0 lvb_type: 0

            (this leads to no ETXTBSY; but is the "ABORTED" lock attempt normal?)

            NOTE: the attached LU-4398.tar.bz2 file contains the details of the test cases and the corresponding 'lctl dk' outputs (for each step)

            If Roland is to apply the LU-4398 patch to the server, must it be done jointly with the LU-4429 patch on the client?

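
The comparison in the comment above rests on two log messages in the 'lctl dk' output of the 'cp' step: "server returned different mode" in the 'error' case and "client-side enqueue END (ABORTED)" in the 'success' case. As a rough sketch, not something posted in this thread, a saved dump could be sorted into one of the two patterns as follows. The function name `classify_enqueue` and the dump path are hypothetical, and the mapping of message to outcome reflects only what was observed in these two test cases.

```shell
#!/bin/bash
# classify_enqueue FILE
# Hypothetical helper (not part of Lustre): looks for the two messages quoted
# in the comment above inside a saved 'lctl dk' dump and prints which one it
# found. The strings are matched literally (grep -F).
classify_enqueue() {
    local dump=$1
    if grep -qF 'server returned different mode' "$dump"; then
        echo "different-mode"   # seen in the 'error' case of this thread
    elif grep -qF 'client-side enqueue END (ABORTED)' "$dump"; then
        echo "aborted"          # seen in the 'success' case of this thread
    else
        echo "unknown"
    fi
}

# Usage sketch (the path is an assumption):
#   lctl dk /tmp/cp-step.dk.txt
#   classify_enqueue /tmp/cp-step.dk.txt
```

A full dump covers every lock the client touched, so the result only means something when the dump is limited to a single step, as in the per-step outputs in the attachment.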
            jhammond John Hammond added a comment -

            Indeed. There are some changes to lookup (with the introduction of atomic_open) between 2.6.32 and 3.12 which may account for the difference. But I have not checked that this is in fact the case here.


            rfehren Roland Fehrenbacher added a comment -

            Hi John,

            your patch is for the MDT. Is it plausible that the problem originates from there, given the fact
            that the stock 2.4.2 client works flawlessly in this regard (see the first message in this thread)?

            Roland

            People

              wc-triage WC Triage
              rfehren Roland Fehrenbacher
              Votes: 1
              Watchers: 9