
mirror resync can cause EIO to unrelated applications

Details

    Description

      I noticed that sanity-flr/200 sometimes hits "checksum error". Here are some findings.

      First, the checksum error is caused by an incomplete preceding lfs mirror resync command (which in some cases does not return an error).

      In turn, the EIO that lfs hits is caused by the AS_EIO flag set on the corresponding mapping.

      AS_EIO is set because an OST_WRITE sent with an outdated layout version gets ESTALE (the client's version is smaller than the one on the OST).

      So far I have traced all of this to a race between two processes:

      • lfs doing resync and changing layout generation
      • another process (say, multiop) doing regular write

      I will cite the logs in a subsequent comment.
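
The chain described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Lustre code: `Ost`, `Client`, and `simulate_race` are hypothetical names. The only behavior taken from the description is that a write carrying a layout version smaller than the OST's gets ESTALE, and that the failure leaves an error flag (standing in for AS_EIO) on the mapping.

```python
import errno


class Ost:
    """Toy OST that tracks the layout version it currently expects (hypothetical)."""

    def __init__(self, layout_version):
        self.layout_version = layout_version

    def write(self, client_layout_version):
        # Per the description: a write with an older layout version
        # than the OST's is rejected with ESTALE.
        if client_layout_version < self.layout_version:
            return -errno.ESTALE
        return 0


class Client:
    """Toy client with a cached layout version and a per-mapping error flag."""

    def __init__(self, layout_version):
        self.layout_version = layout_version
        self.mapping_eio = False  # stands in for AS_EIO on the mapping

    def write(self, ost):
        rc = ost.write(self.layout_version)
        if rc == -errno.ESTALE:
            # The failed write marks the mapping; later I/O on it sees EIO.
            self.mapping_eio = True
        return rc


def simulate_race():
    ost = Ost(layout_version=1)
    writer = Client(layout_version=1)  # e.g. multiop doing a regular write
    assert writer.write(ost) == 0      # versions match, the write succeeds

    # A concurrent resync changes the layout generation on the OST side.
    ost.layout_version = 2

    # The writer still uses its cached, now older, version.
    rc = writer.write(ost)
    return rc, writer.mapping_eio
```

In this model the second write returns `-errno.ESTALE` and leaves the error flag set, which matches the sequence in the description: the process that trips over the flag is not necessarily the one that changed the layout.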

          Activity

            [LU-15300] mirror resync can cause EIO to unrelated applications

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/55464/
            Subject: LU-15300 mdt: refresh LOVEA with LL granted
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: 9287b2c34d3c7c4d94d9db3a5a622d89be31ec6b

            gerrit Gerrit Updater added a comment -

            "Frederick Dilger <fdilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55464
            Subject: LU-15300 mdt: refresh LOVEA with LL granted
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: 272180003988ecd0d786392cd0fee1a800d5e336

            pjones Peter Jones added a comment -

            Landed for 2.16

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/46413/
            Subject: LU-15300 mdt: refresh LOVEA with LL granted
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 13557aa86904376e48a5e43256d5c1ab32c1c2d6

            adilger Andreas Dilger added a comment -

            "Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46413
            Subject: LU-15300 mdt: refresh LOVEA with LL granted
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 69
            Commit: efbe0f63eff8a9a7b192607382f6859e3b0088b8

            bzzz Alex Zhuravlev added a comment -

            What's next for this issue?

            Review and landing, hopefully. The patch has been in local testing for months.

            cfaber Colin Faber added a comment -

            Hi bzzz 

            What's next for this issue?

            nangelinas Nikitas Angelinas added a comment - +1 on master: https://testing.whamcloud.com/test_sets/32deb408-a813-480f-a6f3-1ae41c34ab56
            gerrit Gerrit Updater added a comment - - edited

            "Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46580
            Subject: LU-15300 mdt: fetch LOVEA after LL
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 2f868486c9ad6f884b682173705e93d68ab6385a

            bzzz Alex Zhuravlev added a comment -

            Got a prototype implementing Mike's idea:

            clean master:
            open                      259215 samples [usecs] 21 121174 47616701 490111842189
            open                      262119 samples [usecs] 23 187783 49285028 637234901782
            = 183 and 188 usec/open

            double getxattr:
            open                      257224 samples [usecs] 24 178664 54603761 775008064791
            open                      253302 samples [usecs] 23 151484 48624675 554342411475
            = 212 and 191 usec/open

            late xattr + dropping DOM bit from ldlm lock:
            open                      250363 samples [usecs] 23 123033 48744046 613086712776
            open                      256359 samples [usecs] 22 180023 47671587 620789426793
            = 194 and 185 usec/open

            The latter can be improved if we change the LDLM API to return a pointer to ldlm_lock and stop using ldlm_handle2lock().
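
The usec/open figures quoted in the benchmark comment can be reproduced from the raw stats lines. The sketch below assumes the usual field layout of such lines (name, sample count, the word "samples", unit, min, max, sum, sum of squares). The comment does not state that layout, so treat it as an assumption. `mean_usec` is a hypothetical helper, and the quoted figures match the sum divided by the count, truncated to an integer.

```python
def mean_usec(stats_line):
    """Return the truncated mean latency from one stats line.

    Assumed field layout (not stated in the comment):
    name, count, "samples", unit, min, max, sum, sum of squares.
    """
    fields = stats_line.split()
    count = int(fields[1])
    total = int(fields[6])
    return total // count


LINES = {
    "clean master": [
        "open 259215 samples [usecs] 21 121174 47616701 490111842189",
        "open 262119 samples [usecs] 23 187783 49285028 637234901782",
    ],
    "double getxattr": [
        "open 257224 samples [usecs] 24 178664 54603761 775008064791",
        "open 253302 samples [usecs] 23 151484 48624675 554342411475",
    ],
    "late xattr + dropping DOM bit": [
        "open 250363 samples [usecs] 23 123033 48744046 613086712776",
        "open 256359 samples [usecs] 22 180023 47671587 620789426793",
    ],
}

# Two runs per configuration, as in the comment.
means = {name: [mean_usec(line) for line in lines] for name, lines in LINES.items()}

for name, values in means.items():
    print(f"{name}: {values[0]} and {values[1]} usec/open")
```

Note that the spread between two runs of the same configuration (for example 212 versus 191) is about as large as the difference between configurations, so these numbers are probably best read as a rough comparison.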

            People

              bzzz Alex Zhuravlev