Memory reclaim caused a stale data read

    Description

      1. Cray does not have full logs describing this problem, but the big picture looks clear.
        A client node starts memory reclaim and enters ll_releasepage, which sees that the page is not busy and has 3 vmpage references. This triggers a cl_page_delete call, which removes the page from its own page tree and moves it to the CPS_FREEING state. That is fine for kernels < 2.6.37.
        But 2.6.37 introduced a different way to free a page: the ->freepage callback.
        >>
        commit 6072d13c429373c5d63b69dadbbef40a9b035552
        Author: Linus Torvalds <torvalds@linux-foundation.org>
        Date: Wed Dec 1 13:35:19 2010 -0500

      Call the filesystem back whenever a page is removed from the page cache
      >>
      It was introduced because remove_mapping() can refuse to remove a page from the page cache when the page refcount != 2 or the page is dirty (PageDirty). Because the page is already in the CPS_FREEING state, cl_page_own fails to own it in the blocking AST, and the code expects someone else to free the page, but no one does. The result is a stale page with the uptodate flag set left in the page cache, where it can be read through the fast read code path.
      Some of the existing logs:
      >>>
      00000008:00100000:10.0:1615300198.692889:0:4147:0:(osc_cache.c:3288:osc_page_gang_lookup()) vvp-page@ffff8800310524e0(1:1) vm@ffffea000119bdd0 10000000000002c 4:0 0 82094 lru
      bad
      00000008:00100000:10.0:1615300198.692873:0:4147:0:(osc_cache.c:3279:osc_page_gang_lookup()) vvp-page@ffff8800310520e0(1:1) vm@ffffea000119be08 10000000000002c 3:0 0 82095 lru
      good
      >>>
      Other logs show that this is a race between lock cancel (osc_page_gang_lookup) and kswapd.

      So one extra vmpage reference most likely caused the failure.
      According to a crash dump taken a second later, the page has two references,
      so we likely have a race with page access.
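
The sequence described above can be modeled in user space as a rough sketch. This is not Lustre or kernel code: every toy_* name, the struct layout, and the way the reference count is passed in are assumptions made for illustration. It only mirrors the reported ordering: the cl_page is moved to the freeing state first, remove_mapping() then refuses because the refcount is not 2, and the later lock cancel skips the page because it cannot own it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the cl_page states named in the report. */
enum toy_cl_state { TOY_CPS_CACHED, TOY_CPS_FREEING };

struct toy_page {
    int refcount;             /* vmpage reference count */
    bool in_page_cache;       /* still reachable by page cache lookups */
    bool uptodate;            /* uptodate flag */
    enum toy_cl_state state;  /* cl_page state */
};

/* Models remove_mapping(): refuses unless exactly 2 references are held. */
static bool toy_remove_mapping(struct toy_page *p)
{
    if (p->refcount != 2)
        return false;
    p->in_page_cache = false;
    p->uptodate = false;
    return true;
}

/* Models reclaim: the cl_page is deleted before the vmpage is removed. */
static void toy_reclaim(struct toy_page *p)
{
    assert(p->state == TOY_CPS_CACHED);
    p->state = TOY_CPS_FREEING;   /* effect of cl_page_delete */
    (void)toy_remove_mapping(p);  /* may fail and leave the vmpage cached */
}

/* Models lock cancel: a page in the freeing state cannot be owned,
 * so it is skipped on the assumption that someone else frees it. */
static void toy_lock_cancel(struct toy_page *p)
{
    if (p->state == TOY_CPS_FREEING)
        return;
    p->in_page_cache = false;
    p->uptodate = false;
}

/* Models the fast read path: any cached uptodate page is served. */
static bool toy_fast_read_hits(const struct toy_page *p)
{
    return p->in_page_cache && p->uptodate;
}

/* Runs reclaim, then lock cancel, and reports whether a fast read
 * would still be served from the (now stale) page. */
bool toy_stale_after_race(int refcount_at_reclaim)
{
    struct toy_page p = { refcount_at_reclaim, true, true, TOY_CPS_CACHED };

    toy_reclaim(&p);
    toy_lock_cancel(&p);
    return toy_fast_read_hits(&p);
}
```

In this model, calling toy_stale_after_race() with a reference count of 2 lets reclaim remove the page, while any other count leaves an uptodate page that neither reclaim nor lock cancel cleans up.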

          Activity

            [LU-14541] Memory reclaim caused a stale data read
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.16.0 [ 15190 ]
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.15.3 [ 15998 ]

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50599/
            Subject: Revert "LU-14541 llite: Check vmpage in releasepage"
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: 72b5be5ccc1c58ae6edc968fa9106d53578aeccb

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50598/
            Subject: LU-14541 llite: Check for page deletion after fault
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: 4a134425bd9d03f5e40e489fd7e9acf4788e9da1

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49654/
            Subject: Revert "LU-14541 llite: Check vmpage in releasepage"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: e3cfb688ed7116a57b2c7f89a3e4f28291a0b69f

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49653/
            Subject: LU-14541 llite: Check for page deletion after fault
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: b3d2114e538cf95a7e036f8313e9095fe821da79

            "Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50599
            Subject: LU-14541 revert: "llite: Check vmpage in releasepage"
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: 0926e0b3b61c8180024df55223f654080f789a6d

            "Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50598
            Subject: LU-14541 llite: Check for page deletion after fault
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: 9cf02b117d1e51d8ad367bae88f8fda4c0a95f49

            paf0186 Patrick Farrell made changes -
            Link New: This issue is related to NVDA-131 [ NVDA-131 ]

            "Patrick Farrell <farr0186@gmail.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49654
            Subject: Revert "LU-14541 llite: Check vmpage in releasepage"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 95fb01cb020295531839d8c758320262db6b7bef

            People

              paf0186 Patrick Farrell
              shadow Alexey Lyashkov