Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • None
    • None
    • http://github.com/chaos/lustre, version 2.1.1-11chaos
    • 3
    • 6399

    Description

      I've been getting widespread reports that with 2.1 clients users are seeing random ENOENT errors on opens (and maybe stats?).

      Sometimes the file is written, closed, and reopened on the same client node. But the open will report that the file does not exist. A few minutes later the file is definitely there, so the problem is transitory.

      We have also had instances of this where the ENOENT occurs on a node other than where the file was created. One node will create, write, and close the file, and then another will attempt to open it only to get ENOENT.
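      The single-node case above (write, close, reopen on the same client) can be probed with a small script. This is only a sketch under stated assumptions: the names `reopen_check` and `wait_until_visible` are hypothetical and not part of simul or Lustre, and the directory passed in is assumed to be on the affected Lustre mount. It records how long a file stays invisible after an ENOENT, since the problem is reported as transitory.

```python
import contextlib
import os
import time


def wait_until_visible(path, timeout=300.0, interval=1.0):
    """Poll open() on path; return seconds until it succeeds, or None on timeout."""
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        try:
            os.close(os.open(path, os.O_RDONLY))
            return time.monotonic() - start
        except FileNotFoundError:
            pass  # still ENOENT, keep polling
        time.sleep(interval)
    return None


def reopen_check(directory, attempts=100, payload=b"x" * 4096, timeout=300.0):
    """Create, write, close, then immediately reopen files in directory.

    Returns a list of (path, seconds_until_visible) for every reopen that
    failed with ENOENT. Any other error is raised unchanged.
    """
    failures = []
    for i in range(attempts):
        # Hypothetical file naming, chosen only to avoid collisions.
        path = os.path.join(directory, f"enoent_probe_{os.getpid()}_{i}")
        with open(path, "wb") as f:
            f.write(payload)
        try:
            os.close(os.open(path, os.O_RDONLY))
        except FileNotFoundError:
            failures.append((path, wait_until_visible(path, timeout=timeout)))
        finally:
            with contextlib.suppress(FileNotFoundError):
                os.unlink(path)
    return failures
```

      On a healthy filesystem `reopen_check` should return an empty list. The cross-node case would need a second process on another client and is not covered here.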

      Here is an example failure from a simul test:

      09:04:12: Set iteration 4
      09:04:12: Running test #0(iter 0): open, shared mode.
      09:04:12:       Beginning setup
      09:04:12:       Finished setup          (0.001 sec)
      09:04:12:       Beginning test
      09:04:12: Process 177(hype338): FAILED in simul_open, open failed: No such file or directory
      

      There tend not to be any obvious messages in the console logs associated with these events.
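      Since the console logs are quiet, the test output itself is the main record of which rank and node hit the error. The following is a rough sketch for pulling failure lines out of simul output. The line format is inferred only from the single example above, so the pattern is an assumption, and `parse_failures` is a hypothetical helper name.

```python
import re

# Pattern inferred from one sample line in this ticket, e.g.:
# 09:04:12: Process 177(hype338): FAILED in simul_open, open failed: No such file or directory
FAILURE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}): Process (?P<rank>\d+)\((?P<host>[^)]+)\): "
    r"FAILED in (?P<func>\w+), (?P<message>.*)$"
)


def parse_failures(lines):
    """Return one dict per failure line: time, rank, host, func, message."""
    failures = []
    for line in lines:
        match = FAILURE_RE.match(line.strip())
        if match:
            record = match.groupdict()
            record["rank"] = int(record["rank"])
            failures.append(record)
    return failures
```

      Grouping the results by `host` would show whether the failures cluster on particular client nodes.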

      Attachments

        1. hype336-lu1397-1337981358181.llog.gz
          5.60 MB
        2. ior-lustre_debug.diff
          1 kB
        3. open.stp
          0.9 kB
        4. open-v2.stp
          2 kB

          Activity

            [LU-1397] ENOENT on open()
            pjones Peter Jones added a comment -

            Thanks Prakash. We will track landing this code under LU-506 so I am closing this ticket as a duplicate of that. As to whether this fix will also address the instances you may have observed prior to applying the initially flawed LU-1234 patch, it may well do because the cache mechanism has been altered by this change.

            prakash Prakash Surya (Inactive) added a comment -

            Thanks for the fix, Lai. We haven't seen any ENOENT failures in the past few days of testing, with patch set 4 applied. This can be marked resolved.

            prakash Prakash Surya (Inactive) added a comment -

            OK, thanks Lai. We should start testing with the new patch set in place today.

            laisiyao Lai Siyao added a comment -

            Prakash, yes, the ENOENT failure found this time was introduced by the earlier patch for LU-1234. It's best to enable the debug log trigger as well in your verification test, thanks.

            prakash Prakash Surya (Inactive) added a comment -

            Lai, was this issue introduced by one of the earlier versions of that patch? I ask because we have a vague recollection that we saw ENOENT issues prior to applying it, albeit less frequently than we currently do. So, if this specific case was introduced (and now fixed?) by http://review.whamcloud.com/2400, there still may be other issues lurking.

            Either way, I am going to try and get the new revision of the patch applied and tested today.

            And thanks for the detailed explanation Fan Yong! It's very helpful.

            laisiyao Lai Siyao added a comment -

            Patch is updated: http://review.whamcloud.com/#change,2400.

            Please revert the previous patch in your code base and apply this one.

            laisiyao Lai Siyao added a comment -

            Yong, you're right, I think it's better to remove d_rehash(). This looks to be a bug from an old kernel (from d_splice_alias(), and this line is removed in the latest code).

            I'll commit a patch later.

            People

              laisiyao Lai Siyao
              morrone Christopher Morrone (Inactive)