[LU-12209] cannot create stripe dir: Stale file handle

Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • None
    • Lustre 2.10.7
    • None
    • CentOS 7.6, servers 2.10.7, clients 2.12 or 2.10
    • 3

    Description

      I'm facing a new issue on Oak (2.10.7 servers); I tried with both 2.10 and 2.12 clients:

      As root:

      # cd /oak/stanford/groups/
      # lfs mkdir -i 1 caiwei
      lfs mkdir: dirstripe error on 'caiwei': Stale file handle
      lfs setdirstripe: cannot create stripe dir 'caiwei': Stale file handle
      
      # lfs getdirstripe .
      lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none
      

      Does that ring a bell? Oak is only using DNE v1 with statically striped directories. I have never seen this before 2.10.7 (we recently upgraded Oak).

      A basic lctl dk doesn't show anything on the MDS, but I may have to enable specific debug flags to see more. No other traces found so far.
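
For reference, a minimal sketch of how a more verbose debug log might be captured around the failing command. The helper name `capture_dk`, the choice of the `trace` and `info` masks, and the output path are assumptions for illustration only; they are not taken from this ticket, and the masks that actually expose the problem may differ.

```shell
#!/bin/sh
# Hypothetical helper (not part of Lustre): run one command with extra
# debug masks enabled and dump the kernel debug log afterwards.
# Usage: capture_dk OUTPUT_FILE COMMAND [ARGS...]
capture_dk() {
    out=$1
    shift
    # Remember the current debug mask so it can be restored afterwards.
    old_mask=$(lctl get_param -n debug) || return 1
    # Assumed masks; pick whichever ones are relevant to the problem.
    lctl set_param debug=+trace
    lctl set_param debug=+info
    # Empty the kernel debug buffer so the dump only covers this run.
    lctl clear
    "$@"
    rc=$?
    # Dump the kernel debug buffer to the requested file.
    lctl dk "$out"
    lctl set_param debug="$old_mask"
    return "$rc"
}

# Example (as root, on the node being debugged):
#   capture_dk /tmp/mkdir.dk lfs mkdir -i 1 caiwei
```

The helper returns the exit status of the wrapped command, so it can be used in place of the original invocation. Since the failure involves more than one MDT, the same masks would probably need to be enabled on each MDS before reproducing.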

      Tried with 2.10 and 2.12 clients, with or without idmap.

      Thanks!

      Attachments

        1. oak-md1-s1-MDT1.dk.gz
          2.95 MB
        2. oak-md1-s2-MDT0.dk.gz
          10.15 MB
        3. sh-101-60.dk.gz
          363 kB

          Activity

            pjones Peter Jones added a comment -

            Stephane

            You are correct. Yet another illustration of why it is confusing to have multiple patches tracked under the same Jira ticket spanning release boundaries.

            Peter

            sthiell Stephane Thiell added a comment -

            Peter, this patch (https://review.whamcloud.com/#/c/33401/ - LU-11418 llog: refresh remote llog upon -ESTALE) is already available in 2.12.0:

            commit 71f409c9b31b90fa432f1f46ad4e612fb65c7fcc
            Author: Lai Siyao <lai.siyao@intel.com>
            Date:   Wed Oct 17 13:29:53 2018 +0800

                LU-11418 llog: refresh remote llog upon -ESTALE

            But it's not included in 2.10.7 (that we're running on our Oak servers).
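
As an illustration of the check described above, here is a small sketch that tests whether a fix is reachable from a given release tag or branch in a local clone of the source tree. It searches by commit subject rather than by hash, because a backported (cherry-picked) commit gets a different hash on the maintenance branch. The helper name `has_fix` is hypothetical, and the ref names in the usage comment are assumptions about how the releases are tagged.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a commit whose message contains SUBJECT
# is reachable from REF in the current git repository.
# Usage: has_fix REF SUBJECT
has_fix() {
    ref=$1
    subject=$2
    # --fixed-strings treats SUBJECT literally instead of as a regex;
    # the trailing "--" keeps REF from being mistaken for a path.
    [ -n "$(git log --oneline --fixed-strings --grep="$subject" "$ref" --)" ]
}

# Example (ref names are assumptions; list the real ones with `git tag`):
#   has_fix v2_12_0 'LU-11418 llog: refresh remote llog upon -ESTALE' && echo present
#   has_fix v2_10_7 'LU-11418 llog: refresh remote llog upon -ESTALE' || echo missing
```

An unknown ref makes `git log` fail and print nothing to standard output, so the helper reports the fix as missing in that case as well; it is worth confirming the ref exists before trusting a negative result.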
            pjones Peter Jones added a comment -

            Nice! Thanks all. sthiell, note that this fix is included in the upcoming 2.12.1.

            sthiell Stephane Thiell added a comment -

            Hi Lai,

            We restarted the servers with the patch this morning and the problem is now gone. Thanks!
            laisiyao Lai Siyao added a comment -

            This looks like the same issue that was fixed by https://review.whamcloud.com/#/c/33401/. Can you apply this patch on all MDSs and try again?

            sthiell Stephane Thiell added a comment -

            Thanks Patrick for this analysis. I see that obj->opo_stale = 1; only in osp_invalidate()...

            Because this is not impacting production, only new group creation, we won't fail over the MDT today (new groups can wait a bit). We have some interactive jobs running, but I'll try to find a good time during the weekend to do so. Let me know if you want me to grab more debug info before then.

            People

              Assignee: laisiyao Lai Siyao
              Reporter: sthiell Stephane Thiell