Lustre / LU-10457

open_by_handle_at() in write mode triggers ETXTBSY

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major

    Description

      If open_by_handle_at() is called with O_WRONLY or O_RDWR and the resulting file descriptor is then closed, other Lustre clients will still report ETXTBSY when they try to execute the file.

      Example:

      On cn16
      =======
      bschubert@cn16 ~>sudo ~/src/test/open-test /mnt/lustre_client-ES24/bschubert/ime/test7 1
      Opened /mnt/lustre_client-ES24/bschubert/ime/test7/test7, fd: 4
      Closed d: 4

      Now on cn41
      =========
      bschubert@cn41 ~>/mnt/lustre_client-ES24/bschubert/ime//test7
      -bash: /mnt/lustre_client-ES24/bschubert/ime//test7: Text file busy

      test7 is just an arbitrary file which has the execution bit set.
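
      For illustration, here is a minimal sketch of what such an open-test style reproducer can look like (the actual open-test source is not attached; the program name, argument handling, and the numeric mode mapping are assumptions). It encodes the file into a handle with name_to_handle_at() and reopens it in write mode with open_by_handle_at(), the same path an NFS re-export or IME server would use:

      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>

      /* Hypothetical reproducer sketch: open a file by handle in the requested
       * mode, then close it again. Assumed mode mapping: 0 = O_RDONLY,
       * 1 = O_WRONLY, 2 = O_RDWR. */
      int main(int argc, char **argv)
      {
              int modes[] = { O_RDONLY, O_WRONLY, O_RDWR };
              struct file_handle *fh;
              int mount_id, mount_fd, fd, flags;

              if (argc < 3) {
                      fprintf(stderr, "usage: %s <path> <0|1|2>\n", argv[0]);
                      return 1;
              }
              flags = modes[atoi(argv[2]) % 3];

              /* Encode the file into an opaque handle, as an NFS server or IME would. */
              fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
              fh->handle_bytes = MAX_HANDLE_SZ;
              if (name_to_handle_at(AT_FDCWD, argv[1], fh, &mount_id, 0) < 0) {
                      perror("name_to_handle_at");
                      return 1;
              }

              /* Any descriptor on the same mount serves as the mount reference;
               * open_by_handle_at() itself needs CAP_DAC_READ_SEARCH, hence sudo. */
              mount_fd = open(argv[1], O_PATH);
              if (mount_fd < 0) {
                      perror("open");
                      return 1;
              }

              fd = open_by_handle_at(mount_fd, fh, flags);
              if (fd < 0) {
                      perror("open_by_handle_at");
                      return 1;
              }
              printf("Opened %s, fd: %d\n", argv[1], fd);

              close(fd);
              printf("Closed fd: %d\n", fd);

              /* Even after the close, executing the file from a second Lustre
               * client may fail with ETXTBSY, as shown above. */
              return 0;
      }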

          Activity

            [LU-10457] open_by_handle_at() in write mode triggers ETXTBSY

            aakef Bernd Schubert added a comment -

            Hi all, I think there is another implication of this issue. Our customer is complaining that quotas are not correctly released. We have mostly worked around the ETXTBSY issue, but I don't think we can do anything about the quotas on our side.
            Looking at the patches, I think https://review.whamcloud.com/32020 will not help, as it will try to release conflicting locks on an O_EXEC attempt. The alternative patch from Patrick, https://review.whamcloud.com/#/c/31304/, should work, as it always sends an MDS close from the client if the file was opened in write mode. Is there any side effect? It should just remove an NFS optimization?

            cengku9660 Gu Zheng (Inactive) added a comment -

            Hi all, I just resubmitted the LU-4398 patch (https://review.whamcloud.com/32020) as Jinshan suggested. With it applied the problem is gone, and in some simple tests no significant regression was found, but please feel free to test it further, thanks.

            paf Patrick Farrell (Inactive) added a comment -

            Oleg pointed me at this; I reported a duplicate and contributed a patch and test case:
            https://review.whamcloud.com/#/c/31304/

            If we limited my patch to executable files as Oleg suggested, that might fit the bill. Curious what others think. I'll refresh it tomorrow.

            cengku9660 Gu Zheng (Inactive) added a comment -

            It seems https://review.whamcloud.com/#/c/9063/ (LU-4398 mdt: acquire an open lock for write or execute) can resolve the problem; after applying it back onto the latest master, I never reproduced the issue.

            [root@vm3 ~]# ./open_test /mnt/lustre/file 1
            Opened /mnt/lustre/file/file, fd: 4
            Closed d: 4

            [root@vm6 ~]# /mnt/lustre/file
            hello lustre
            [root@vm6 ~]# /mnt/lustre/file
            hello lustre
            [root@vm6 ~]# /mnt/lustre/file
            hello lustre
            [root@vm6 ~]# /mnt/lustre/file
            hello lustre
            [root@vm6 ~]# /mnt/lustre/file
            hello lustre

            lixi Li Xi (Inactive) added a comment -

            > Which Lustre version is this with?

            I was testing on the master branch. I guess you are using IEEL3 (2.7)? Something might have changed between them.

            jhammond John Hammond added a comment -

            This is resolved by https://review.whamcloud.com/#/c/9063/ (LU-4398 mdt: acquire an open lock for write or execute). But that change was reverted from master due to the metadata performance impact described by DDN in LU-5197.

            Perhaps the existing functionality of open leases could be used in the open-by-handle path to address this issue without incurring the performance drop.

            green Oleg Drokin added a comment -

            Please note that /mnt/lustre and /mnt/lustre2 are different mountpoints, so they act the same as two different nodes, just more convenient for testing.

            aakef Bernd Schubert added a comment -

            Ah, actually Li was also using two different nodes. Sorry, I only saw server17-el7 and didn't notice the differentiation between -vm1 and -vm3.

            Which Lustre version is this with? On the systems I tested with, executing the file on the other node would never succeed until either

            • the node that had opened the file executed it itself,
            • the node that had opened the file unmounted Lustre, or
            • I was patient and waited for a very long time (> 30 min).

            aakef Bernd Schubert added a comment -

            Hmm, I can't imagine how this works as it is supposed to, even in the NFS case. Maybe I should have pointed this out in more detail, but in my initial example I used two different nodes.
            For NFS or any other overlay file system, one can expect that there are multiple nodes involved. With NFS, users typically create or modify files on their desktop and later execute them natively on Lustre.
            For the IME use case, the file is opened in RW mode on the IME server for multiple reasons, but users also later want to use the files natively on Lustre.

            green Oleg Drokin added a comment -

            I guess I did not read it far enough; yes, there's one ETXTBSY report due to the open lock.

            It appears that name_to_handle_at/open_by_handle_at use the NFS-encoded export operations, which causes the NFS-detection logic to trigger, so the system sort of operates as designed.

            It's going to be tricky to separate real NFS from these users, I guess, and we don't want the extra lock hit when opening the file. The new downgrade logic might help us here to get a bigger lock and then just drop the unneeded bit.

            green Oleg Drokin added a comment -

            Hm, I tested this on the latest master on RHEL 7.2 (disregard the centos6 in the hostname) and don't see any problems. What version are you testing on, and what kernel?

            [root@centos6-16 tests]# cat /tmp/test.sh
            #!/bin/bash
            
            cp /bin/ls /mnt/lustre
            /mnt/lustre/ls -ld .
            /tmp/open-test /mnt/lustre/ls 2
            
            TIME=0
            while ! /mnt/lustre2/ls -ld . ; do echo nope ; TIME=$((TIME + 1)) ; sleep 1 ; done
            
            echo Waited $TIME seconds for the open to clear
            [root@centos6-16 tests]# bash /tmp/test.sh
            drwxrwxr-x 12 green green 12288 Jan  4 01:37 .
            Opened /mnt/lustre/ls/ls, fd: 4
            Closed d: 4
            /tmp/test.sh: line 8: /mnt/lustre2/ls: Text file busy
            nope
            drwxrwxr-x 12 green green 12288 Jan  4 01:37 .
            Waited 1 seconds for the open to clear
            

            People

              Assignee: green Oleg Drokin
              Reporter: diegom Diego Moreno (Inactive)
              Votes: 0
              Watchers: 13