LU-17364: osc_page_delete LBUG - trying to delete a page under write

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.16.0
    • Affects Version/s: None
    • Severity: 3

Description

      Client crashed in osc_page_delete() while the page was still waiting for write:

      [2281253.531369] LustreError: 81367:0:(osc_cache.c:2558:osc_teardown_async_page()) extent ffff883c74c7b810@{[28680 -> 28680/32767], [2|0|-|cache|wi|ffff883ba0bbde00], [28672|1|+|-|ffff884a03601b00|4096| (null)]} trunc at 28680.
      [2281253.553678] LustreError: 81367:0:(osc_cache.c:2558:osc_teardown_async_page()) ### extent: ffff883c74c7b810 ns: euscrat-OST0004-osc-ffff887a7da55000 lock: ffff884a03601b00/0xb3e57269e5f70d90 lrc: 12/0,1 mode: PW/PW res: [0x480000402:0x1beede4b:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x800020000020000 nid: local remote: 0x93336ee709e18f0b expref: -99 pid: 81367 timeout: 0 lvb_type: 1 l_ast_data: 0000000000000000
      [2281253.597632] LustreError: 81367:0:(osc_page.c:191:osc_page_delete()) page@ffff8841ea116400[3 ffff883e49200d70 4 1 (null)]
      [2281253.613495] LustreError: 81367:0:(osc_page.c:191:osc_page_delete()) vvp-page@ffff8841ea116458(1:0) vm@fffff967dd5ace00 2fffff00000835 2:0 ffff8841ea116400 28680 lru
      [2281253.613499] LustreError: 81367:0:(osc_page.c:191:osc_page_delete()) lov-page@ffff8841ea116498, gen: 0
      [2281253.613511] LustreError: 81367:0:(osc_page.c:191:osc_page_delete()) osc-page@ffff8841ea1164d0 28680: 1< 0x845fed 2 0 + - > 2< 117473280 0 4096 0x0 0x420 | (null) ffff88622eda49b0 ffff883ba0bbde00 > 3< 0 0 0 > 4< 0 0 16 242233344 - | - - + - > 5< - - + - | 0 - | 3648 - ->
      [2281253.613512] LustreError: 81367:0:(osc_page.c:191:osc_page_delete()) end page@ffff8841ea116400
      [2281253.613514] LustreError: 81367:0:(osc_page.c:191:osc_page_delete()) Trying to teardown failed: -16
      [2281253.613515] LustreError: 81367:0:(osc_page.c:192:osc_page_delete()) ASSERTION( 0 ) failed:
      [2281253.613516] LustreError: 81367:0:(osc_page.c:192:osc_page_delete()) LBUG
      [2281253.613518] Pid: 81367, comm: julia 3.10.0-1160.88.1.el7.x86_64 #1 SMP Tue Mar 7 15:41:52 UTC 2023
      [2281253.613518] Call Trace:
      [2281253.613549] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
      [2281253.613560] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [2281253.613568] [<0>] osc_page_delete+0x47e/0x4b0 [osc]
      [2281253.613592] [<0>] cl_page_delete0+0x80/0x220 [obdclass]
      [2281253.613602] [<0>] cl_page_delete+0x33/0x110 [obdclass]
      [2281253.613618] [<0>] ll_invalidatepage+0x87/0x180 [lustre]
      [2281253.613634] [<0>] do_invalidatepage_range+0x7d/0x90
      [2281253.613642] [<0>] truncate_inode_page+0x7f/0x90
      [2281253.613643] [<0>] generic_error_remove_page+0x2a/0x40
      [2281253.613652] [<0>] vvp_page_discard+0x5e/0xd0 [lustre]
      [2281253.613663] [<0>] cl_page_discard+0x4b/0x70 [obdclass]
      [2281253.613675] [<0>] cl_page_list_discard+0x56/0x160 [obdclass]
      [2281253.613682] [<0>] ll_io_read_page+0x3f5/0x890 [lustre]
      [2281253.613688] [<0>] ll_readpage+0xe6/0x820 [lustre]
      [2281253.613693] [<0>] filemap_fault+0x1f8/0x420
      [2281253.613699] [<0>] ll_filemap_fault+0x39/0x70 [lustre]
      [2281253.613706] [<0>] vvp_io_fault_start+0x5fa/0xe50 [lustre]
      [2281253.613718] [<0>] cl_io_start+0x70/0x140 [obdclass]
      [2281253.613729] [<0>] cl_io_loop+0x9f/0x200 [obdclass]
      [2281253.613735] [<0>] ll_fault+0x52d/0x8a0 [lustre]
      [2281253.613746] [<0>] __do_fault.isra.61+0x8a/0x100
      [2281253.613754] [<0>] do_shared_fault.isra.64+0x4c/0x280
      [2281253.613758] [<0>] handle_mm_fault+0x459/0x1190
      [2281253.613765] [<0>] __do_page_fault+0x213/0x510
      [2281253.613766] [<0>] do_page_fault+0x35/0x90
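
      The trace shows the read fault path (ll_readpage() -> ll_io_read_page() -> cl_page_list_discard()) discarding a page that the async write path still holds in an osc cache extent, so osc_teardown_async_page() fails with -EBUSY (-16) and osc_page_delete() hits ASSERTION( 0 ). The following minimal C sketch models that failure shape; it is not the Lustre source, and every name and type in it is invented for illustration:

      #include <assert.h>
      #include <errno.h>
      #include <stdbool.h>
      #include <stdio.h>

      /* Invented stand-in for the per-page state osc tracks. */
      struct page_state {
              bool under_async_write;         /* still queued in a cache extent */
      };

      /* Stand-in for osc_teardown_async_page(): refuses a busy page. */
      static int teardown_async_page(struct page_state *pg)
      {
              return pg->under_async_write ? -EBUSY : 0;      /* -EBUSY == -16 */
      }

      /* Stand-in for osc_page_delete(): a failed teardown is treated as fatal. */
      static void page_delete(struct page_state *pg)
      {
              int rc = teardown_async_page(pg);

              if (rc != 0) {
                      fprintf(stderr, "Trying to teardown failed: %d\n", rc);
                      assert(0);              /* the ASSERTION( 0 )/LBUG in the log */
              }
      }

      int main(void)
      {
              /* The read fault path reaches page_delete() while a write is live. */
              struct page_state pg = { .under_async_write = true };

              page_delete(&pg);               /* aborts, mirroring the client LBUG */
              return 0;
      }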
      

Activity


            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57682
            Subject: LU-17364 llite: don't use stale page.
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: 34afcf181c90433ffbe388ef116bf9b8e037bad3

            gerrit Gerrit Updater added a comment - "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57682 Subject: LU-17364 llite: don't use stale page. Project: fs/lustre-release Branch: b2_15 Current Patch Set: 1 Commit: 34afcf181c90433ffbe388ef116bf9b8e037bad3
scadmin SC Admin added a comment -

We just saw this with a 2.15.4 client on RHEL 9.4 against 2.12.9 servers on RHEL 7.

This is the first time we've seen it; we've been running with this combination for months. There was memory pressure from NUMA leading to some mild swapping on the node.

Can we please have this backported to 2.15?

In the meantime we'll cherry-pick this into our client tree.
bodgerer Mark Dixon added a comment -

Just noting our experience of this issue.

We also saw it with a Julia code: the user running it managed to get a 100% kill rate on the Lustre clients they were assigned. x86_64 clients on Lustre 2.15.4 / Rocky 8, servers on 2.12.x.

After cherry-picking the https://review.whamcloud.com/c/fs/lustre-release/+/53550/ patchset onto 2.15.4, we've had at least one run where the Lustre client did not crash/reboot. We're doing more tests before adding it to our local tree.

Thanks!
pjones Peter Jones added a comment -

Landed for 2.16

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53550/
            Subject: LU-17364 llite: don't use stale page.
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: dad3bed7617fba895db169facde91856e89c2b08

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53550/ Subject: LU-17364 llite: don't use stale page. Project: fs/lustre-release Branch: master Current Patch Set: Commit: dad3bed7617fba895db169facde91856e89c2b08

shadow Alexey Lyashkov added a comment -

@Patrick,

This issue exists in any Lustre version where the up2date bit is cleared in vvp_page_delete(); it is related to page reclaim.
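
To make the ordering concrete, here is a minimal C sketch of the race described above. It is not the Lustre source and all names in it are invented: reclaim deletes the page and clears its up-to-date state, so a reader that picked up a reference earlier must notice the page is stale and refetch it rather than discard it (discarding a page still under write is what trips osc_page_delete()):

      #include <stdbool.h>
      #include <stdio.h>

      /* Invented stand-in for client-side page state. */
      struct cl_page_model {
              bool uptodate;
              bool deleted;
              bool under_write;
      };

      /* Stand-in for vvp_page_delete() on the reclaim path. */
      static void reclaim_delete(struct cl_page_model *pg)
      {
              pg->uptodate = false;           /* the cleared up2date bit */
              pg->deleted = true;
      }

      /* Stand-in for a read path that grabbed the page before reclaim ran. */
      static void read_path(struct cl_page_model *pg)
      {
              if (pg->deleted) {
                      /* Fixed behaviour ("don't use stale page"): drop the
                       * stale reference and look the page up again. */
                      printf("stale page, refetching\n");
                      return;
              }
              /* Buggy behaviour: discarding here while under_write is set
               * would reach osc_page_delete() and LBUG as in the log. */
              printf("discarding page\n");
      }

      int main(void)
      {
              struct cl_page_model pg = { .uptodate = true, .under_write = true };

              reclaim_delete(&pg);            /* memory pressure evicts the page */
              read_path(&pg);                 /* must re-check, not trust stale state */
              return 0;
      }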
wanat Dominika Wanat added a comment - edited

@Patrick, next week we will prepare a complete history of updates, patches, and bug occurrences from the client and server sides. This bug is nothing new for us either: we have been experiencing it on client nodes since the beginning of April 2022 (on kernel 4.18.0-348.20.1.el8_5.x86_64). Now we are using clients with kernel 4.18.0-477.27.1.el8_8.x86_64, and the issue seems to occur more often despite recently applying LU-16043. Interestingly, the presented LBUG is always triggered by ams.exe, a part of ADF, but not every job using ADF causes the LBUG...

paf0186 Patrick Farrell added a comment -

Dominika,

Do you have any details on your setup? Was there any particular change you made that seemed linked to the problem occurring, for example a kernel update? I ask because the underlying bug has been present for a while, and we're trying to track down whether there's a reason it's now happening more often.

wanat Dominika Wanat added a comment -

Hi,

We are stuck with the same problem on Lustre based on the b2_15 branch. Are you planning to backport the above-mentioned patches in the near future?

Best,
Dominika Wanat

            "Alexey Lyashkov <alexey.lyashkov@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53550
            Subject: LU-17364 llite: don't use stale page.
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: c056724ef52cf68b7680a20e8e9a6da0848b3ef5

            gerrit Gerrit Updater added a comment - "Alexey Lyashkov <alexey.lyashkov@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53550 Subject: LU-17364 llite: don't use stale page. Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: c056724ef52cf68b7680a20e8e9a6da0848b3ef5

            People

              shadow Alexey Lyashkov
              bobijam Zhenyu Xu
              Votes:
              0 Vote for this issue
              Watchers:
              14 Start watching this issue
