Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-15815

fast_read/stale data/reclaim workround causes SIGBUS

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.15.0
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      The fast_read stale data workaround from LU-14541 can cause applications to receive a spurious SIGBUS when reclaim runs concurrently with page fault handler for mmaped files.

      Attachments

        Issue Links

          Activity

            [LU-15815] fast_read/stale data/reclaim workround causes SIGBUS

            Resolved by revert of LU-14541.  See LU-14541 for further discussion.

            paf0186 Patrick Farrell added a comment - Resolved by revert of LU-14541 .  See LU-14541 for further discussion.

            So, we landed a patch from John here to resolve the SIGBUS issue by removing the clearpageuptodate() call in vvp_page_delete, ie, reverting LU-14541.  I have more comments on the SIGBUS issue, etc, which I'll put there, but basically, I think this is correct - It's clear from the page fault code in the kernel that we can't unset pageuptodate() without causing problems.  (I was wrong about this previously.)

            So, that means the SIGBUS issue is resolved, but we have to find another way to solve LU-14541.

            So, I'm going to close this ticket as resolved, and let's move the discussion to LU-14541.  panda, I've added you as a watcher there.  I have a few thoughts on it but I'm still hoping you and/or Shadow have a good idea...

            paf0186 Patrick Farrell added a comment - So, we landed a patch from John here to resolve the SIGBUS issue by removing the clearpageuptodate() call in vvp_page_delete, ie, reverting LU-14541 .  I have more comments on the SIGBUS issue, etc, which I'll put there, but basically, I think this is correct - It's clear from the page fault code in the kernel that we can't unset pageuptodate() without causing problems.  (I was wrong about this previously.) So, that means the SIGBUS issue is resolved, but we have to find another way to solve LU-14541 . So, I'm going to close this ticket as resolved, and let's move the discussion to LU-14541 .   panda , I've added you as a watcher there.  I have a few thoughts on it but I'm still hoping you and/or Shadow have a good idea...
            spiechurski Sebastien Piechurski added a comment - - edited

            >> The revert of LU-14541 had already been done several months ago (due to the SIGBUS errors),

            >Where? On what branch?

            The site is ECMWF, and it was done on a 2.12.6 basis. I unfortunately don't have the whole history as I was not involved in these at the time and this was handled between the local team and DDN directly.

            >> but we recently found that it was causing the corruptions on mmap'ed pages.

            >Do you mean that LU-14541 causes corruption? Or that reverting it causes corruption? Did you open an issue?

            Reverting it caused the corruption, or at least we get the corruptions with it reverted. As we only understood recently that the crash/hang issues we observed were related to in-memory data corruption, it is actually only an assumption that the revert caused it.

            There is a DDN support case open for this.

            >> So, yes please can we find another solution ?

            >There seems to be a lot of hopeful use of "we" in this ticket.

            Obviously, this last "we" was probably more a "you". As much as I would like to provide solutions, my understanding of the memory management subsystem is too limited for this. 

             

            Regards,

             

            Sebastien.

            spiechurski Sebastien Piechurski added a comment - - edited >> The revert of LU-14541  had already been done several months ago (due to the SIGBUS errors), >Where? On what branch? The site is ECMWF, and it was done on a 2.12.6 basis. I unfortunately don't have the whole history as I was not involved in these at the time and this was handled between the local team and DDN directly. >> but we recently found that it was causing the corruptions on mmap'ed pages. >Do you mean that LU-14541  causes corruption? Or that reverting it causes corruption? Did you open an issue? Reverting it caused the corruption, or at least we get the corruptions with it reverted. As we only understood recently that the crash/hang issues we observed were related to in-memory data corruption, it is actually only an assumption that the revert caused it. There is a DDN support case open for this. >> So, yes please can we find another solution ? >There seems to be a lot of hopeful use of "we" in this ticket. Obviously, this last "we" was probably more a "you". As much as I would like to provide solutions, my understanding of the memory management subsystem is too limited for this.    Regards,   Sebastien.
            jhammond John Hammond added a comment -

            With LU-14541 in place we have the following:

            [  179.609120] mmap_fault_vs_d (11006): drop_caches: 3
            [  179.613492] mmap_fault_vs_d (11006): drop_caches: 3
            [  179.618134] mmap_fault_vs_d (11006): drop_caches: 3
            [  179.621452] LustreError: 11007:0:(osc_cache.c:2495:osc_teardown_async_page()) extent 000000009717064a@{[3 -> 3/1023], [2|0|-|cache|wi|00000000b2cc94b9], [28672|1|+|-|00000000d41a65d9|1024|00000000baa546d7]} trunc at 3.
            [  179.622752] LustreError: 11007:0:(osc_cache.c:2495:osc_teardown_async_page()) ### extent: 000000009717064a ns: lustre-OST0000-osc-ffff8ba637eb9000 lock: 00000000d41a65d9/0xa4961a626f53297d lrc: 4/0,1 mode: PW/PW res: [0x19:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 12288->16383) gid 0 flags: 0x800020000000000 nid: local remote: 0xa4961a626f532984 expref: -99 pid: 11007 timeout: 0 lvb_type: 1
            [  179.622988] mmap_fault_vs_d (11006): drop_caches: 3
            [  179.625126] LustreError: 11007:0:(osc_page.c:182:osc_page_delete()) page@00000000285d5729[3 00000000214d642c 5 1 00000000baa546d7]
            [  179.625126] 
            [  179.626325] LustreError: 11007:0:(osc_page.c:182:osc_page_delete()) vvp-page@000000002564fdc0(1:0) vm@00000000c9fa97b2 17ffffc0001035 2:0 ffff8ba5f1774140 3 lru
            [  179.626325] 
            [  179.627359] LustreError: 11007:0:(osc_page.c:182:osc_page_delete()) lov-page@00000000b1b9d4b5
            [  179.627359] 
            [  179.628033] LustreError: 11007:0:(osc_page.c:182:osc_page_delete()) osc-page@00000000a3522c1f 3: 1< 0x845fed 2 + - > 2< 12288 0 4096 0x0 0x40420 | 00000000baa546d7 00000000cd5d95e4 00000000b2cc94b9 > 3< 0 0 0 > 4< 0 0 8 667238400 - | - - + - > 5< - - + - | 0 - | 1 - ->
            [  179.628033] 
            [  179.629656] LustreError: 11007:0:(osc_page.c:182:osc_page_delete()) end page@00000000285d5729
            [  179.629656] 
            [  179.629826] mmap_fault_vs_d (11006): drop_caches: 3
            [  179.630301] LustreError: 11007:0:(osc_page.c:182:osc_page_delete()) Trying to teardown failed: -16
            [  179.630302] LustreError: 11007:0:(osc_page.c:183:osc_page_delete()) ASSERTION( 0 ) failed: 
            [  179.630304] LustreError: 11007:0:(osc_page.c:183:osc_page_delete()) LBUG
            [  179.632209] Pid: 11007, comm: mmap_fault_vs_d 4.18.0-348.7.1.el8.x86_64 #1 SMP Thu Mar 3 10:39:00 CST 2022
            [  179.632823] Call Trace TBD:
            [  179.633101] [<0>] libcfs_call_trace+0x8f/0xe0 [libcfs]
            [  179.633455] [<0>] lbug_with_loc+0x53/0xb0 [libcfs]
            [  179.633774] [<0>] osc_page_delete+0x666/0x670 [osc]
            [  179.634124] [<0>] cl_page_delete0+0x9d/0x2d0 [obdclass]
            [  179.634491] [<0>] cl_page_delete+0x3e/0x130 [obdclass]
            [  179.634516] mmap_fault_vs_d (11006): drop_caches: 3
            [  179.634854] [<0>] ll_invalidatepage+0xc0/0x230 [lustre]
            [  179.634862] [<0>] truncate_cleanup_page+0x8d/0x170
            [  179.635861] [<0>] generic_error_remove_page+0x31/0x80
            [  179.636274] [<0>] vvp_page_discard+0x49/0x120 [lustre]
            [  179.636640] [<0>] cl_page_discard+0x5e/0xc0 [obdclass]
            [  179.636993] [<0>] cl_page_list_discard+0x59/0x160 [obdclass]
            [  179.637394] [<0>] ll_io_read_page+0x27c/0xd00 [lustre]
            [  179.637749] [<0>] ll_readpage+0x133/0xb70 [lustre]
            [  179.638076] [<0>] filemap_fault+0x84c/0xa60
            [  179.638369] [<0>] vvp_io_fault_start+0x53f/0x12b0 [lustre]
            [  179.638753] [<0>] cl_io_start+0x66/0x190 [obdclass]
            [  179.638928] mmap_fault_vs_d (11006): drop_caches: 3
            [  179.639107] [<0>] cl_io_loop+0xde/0x2a0 [obdclass]
            [  179.639755] [<0>] ll_fault+0x916/0xc60 [lustre]
            [  179.640066] [<0>] __do_fault+0x48/0x130
            [  179.640323] [<0>] do_fault+0x110/0x660
            [  179.640572] [<0>] __handle_mm_fault+0xa0f/0xdb0
            [  179.640873] [<0>] handle_mm_fault+0x11e/0x310
            [  179.641168] [<0>] __do_page_fault+0x317/0x900
            [  179.641461] [<0>] do_page_fault+0x87/0x300
            [  179.641737] [<0>] page_fault+0x1e/0x30
            

            See mmap_fault_vs_drop_caches3.c to reproduce.

            jhammond John Hammond added a comment - With LU-14541 in place we have the following: [ 179.609120] mmap_fault_vs_d (11006): drop_caches: 3 [ 179.613492] mmap_fault_vs_d (11006): drop_caches: 3 [ 179.618134] mmap_fault_vs_d (11006): drop_caches: 3 [ 179.621452] LustreError: 11007:0:(osc_cache.c:2495:osc_teardown_async_page()) extent 000000009717064a@{[3 -> 3/1023], [2|0|-|cache|wi|00000000b2cc94b9], [28672|1|+|-|00000000d41a65d9|1024|00000000baa546d7]} trunc at 3. [ 179.622752] LustreError: 11007:0:(osc_cache.c:2495:osc_teardown_async_page()) ### extent: 000000009717064a ns: lustre-OST0000-osc-ffff8ba637eb9000 lock: 00000000d41a65d9/0xa4961a626f53297d lrc: 4/0,1 mode: PW/PW res: [0x19:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 12288->16383) gid 0 flags: 0x800020000000000 nid: local remote: 0xa4961a626f532984 expref: -99 pid: 11007 timeout: 0 lvb_type: 1 [ 179.622988] mmap_fault_vs_d (11006): drop_caches: 3 [ 179.625126] LustreError: 11007:0:(osc_page.c:182:osc_page_delete()) page@00000000285d5729[3 00000000214d642c 5 1 00000000baa546d7] [ 179.625126] [ 179.626325] LustreError: 11007:0:(osc_page.c:182:osc_page_delete()) vvp-page@000000002564fdc0(1:0) vm@00000000c9fa97b2 17ffffc0001035 2:0 ffff8ba5f1774140 3 lru [ 179.626325] [ 179.627359] LustreError: 11007:0:(osc_page.c:182:osc_page_delete()) lov-page@00000000b1b9d4b5 [ 179.627359] [ 179.628033] LustreError: 11007:0:(osc_page.c:182:osc_page_delete()) osc-page@00000000a3522c1f 3: 1< 0x845fed 2 + - > 2< 12288 0 4096 0x0 0x40420 | 00000000baa546d7 00000000cd5d95e4 00000000b2cc94b9 > 3< 0 0 0 > 4< 0 0 8 667238400 - | - - + - > 5< - - + - | 0 - | 1 - -> [ 179.628033] [ 179.629656] LustreError: 11007:0:(osc_page.c:182:osc_page_delete()) end page@00000000285d5729 [ 179.629656] [ 179.629826] mmap_fault_vs_d (11006): drop_caches: 3 [ 179.630301] LustreError: 11007:0:(osc_page.c:182:osc_page_delete()) Trying to teardown failed: -16 [ 179.630302] LustreError: 11007:0:(osc_page.c:183:osc_page_delete()) ASSERTION( 0 ) failed: [ 179.630304] LustreError: 11007:0:(osc_page.c:183:osc_page_delete()) LBUG [ 179.632209] Pid: 11007, comm: mmap_fault_vs_d 4.18.0-348.7.1.el8.x86_64 #1 SMP Thu Mar 3 10:39:00 CST 2022 [ 179.632823] Call Trace TBD: [ 179.633101] [<0>] libcfs_call_trace+0x8f/0xe0 [libcfs] [ 179.633455] [<0>] lbug_with_loc+0x53/0xb0 [libcfs] [ 179.633774] [<0>] osc_page_delete+0x666/0x670 [osc] [ 179.634124] [<0>] cl_page_delete0+0x9d/0x2d0 [obdclass] [ 179.634491] [<0>] cl_page_delete+0x3e/0x130 [obdclass] [ 179.634516] mmap_fault_vs_d (11006): drop_caches: 3 [ 179.634854] [<0>] ll_invalidatepage+0xc0/0x230 [lustre] [ 179.634862] [<0>] truncate_cleanup_page+0x8d/0x170 [ 179.635861] [<0>] generic_error_remove_page+0x31/0x80 [ 179.636274] [<0>] vvp_page_discard+0x49/0x120 [lustre] [ 179.636640] [<0>] cl_page_discard+0x5e/0xc0 [obdclass] [ 179.636993] [<0>] cl_page_list_discard+0x59/0x160 [obdclass] [ 179.637394] [<0>] ll_io_read_page+0x27c/0xd00 [lustre] [ 179.637749] [<0>] ll_readpage+0x133/0xb70 [lustre] [ 179.638076] [<0>] filemap_fault+0x84c/0xa60 [ 179.638369] [<0>] vvp_io_fault_start+0x53f/0x12b0 [lustre] [ 179.638753] [<0>] cl_io_start+0x66/0x190 [obdclass] [ 179.638928] mmap_fault_vs_d (11006): drop_caches: 3 [ 179.639107] [<0>] cl_io_loop+0xde/0x2a0 [obdclass] [ 179.639755] [<0>] ll_fault+0x916/0xc60 [lustre] [ 179.640066] [<0>] __do_fault+0x48/0x130 [ 179.640323] [<0>] do_fault+0x110/0x660 [ 179.640572] [<0>] __handle_mm_fault+0xa0f/0xdb0 [ 179.640873] [<0>] handle_mm_fault+0x11e/0x310 [ 179.641168] [<0>] __do_page_fault+0x317/0x900 [ 179.641461] [<0>] do_page_fault+0x87/0x300 [ 179.641737] [<0>] page_fault+0x1e/0x30 See mmap_fault_vs_drop_caches3.c to reproduce.
            jhammond John Hammond added a comment -

            > The revert of LU-14541 had already been done several months ago (due to the SIGBUS errors),

            Where? On what branch?

            > but we recently found that it was causing the corruptions on mmap'ed pages.

            Do you mean that LU-14541 causes corruption? Or that reverting it causes corruption? Did you open an issue?

            > So, yes please can we find another solution ?

            There seems to be a lot of hopeful use of "we" in this ticket.

            jhammond John Hammond added a comment - > The revert of LU-14541 had already been done several months ago (due to the SIGBUS errors), Where? On what branch? > but we recently found that it was causing the corruptions on mmap'ed pages. Do you mean that LU-14541 causes corruption? Or that reverting it causes corruption? Did you open an issue? > So, yes please can we find another solution ? There seems to be a lot of hopeful use of "we" in this ticket.

            We tested this patch on a 2.12.6 basis, and disabling fast_read is actually catastrophic in terms of performance on our customer code.

            The revert of LU-14541 had already been done several months ago (due to the SIGBUS errors), but we recently found that it was causing the corruptions on mmap'ed pages. Disabling fast_read on top of it is causing a x5 slowdown on the application.

            So, yes please can we find another solution ?

            spiechurski Sebastien Piechurski added a comment - We tested this patch on a 2.12.6 basis, and disabling fast_read is actually catastrophic in terms of performance on our customer code. The revert of LU-14541 had already been done several months ago (due to the SIGBUS errors), but we recently found that it was causing the corruptions on mmap'ed pages. Disabling fast_read on top of it is causing a x5 slowdown on the application. So, yes please can we find another solution ?

            Is that work in progress?

            Yes, it is.

            panda Andrew Perepechko added a comment - Is that work in progress? Yes, it is.
            jhammond John Hammond added a comment -

            > Can we find another solution please?

            panda shared some ideas on LU-15819. Is that work in progress?

            jhammond John Hammond added a comment - > Can we find another solution please? panda shared some ideas on LU-15819 . Is that work in progress?
            pjones Peter Jones added a comment -

            Reopening until this discussion is settled

            pjones Peter Jones added a comment - Reopening until this discussion is settled
            spitzcor Cory Spitz added a comment -

            jhammond, can we re-open this ticket? It makes no sense to me to revert LU-14541 when you've confirmed will reintroduce a data corruption. Can we find another solution please?

            spitzcor Cory Spitz added a comment - jhammond , can we re-open this ticket? It makes no sense to me to revert LU-14541 when you've confirmed will reintroduce a data corruption. Can we find another solution please?

            People

              panda Andrew Perepechko
              jhammond John Hammond
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: