LU-8435

LBUG (osc_cache.c:1290:osc_completion()) ASSERTION( equi(page->cp_state == CPS_PAGEIN, cmd == OBD_BRW_READ) )

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.11.0
    • Affects Version/s: Lustre 2.7.0
    • Environment: Bull Lustre distribution based on Lustre 2.7.2
    • Severity: 3

    Description

      In the last month, one of our customers has hit a crash more than 100 times with the following signature:

      [506626.555125] SLUB: Unable to allocate memory on node -1 (gfp=0x80c0)
      [506626.562216]   cache: kvm_mmu_page_header(22:step_batch), object size: 168, buffer size: 168, default order: 1, min order: 0
      [506626.574729]   node 0: slabs: 0, objs: 0, free: 0
      [506626.579974]   node 1: slabs: 0, objs: 0, free: 0
      [506626.585219]   node 2: slabs: 60, objs: 2880, free: 0
      [506626.590852]   node 3: slabs: 0, objs: 0, free: 0
      [506626.596112] LustreError: 41604:0:(osc_cache.c:1290:osc_completion()) ASSERTION( equi(page->cp_state == CPS_PAGEIN, cmd == OBD_BRW_READ) ) failed: cp_state:0, cmd:1
      [506626.612512] LustreError: 41604:0:(osc_cache.c:1290:osc_completion()) LBUG
      [506626.620186] Pid: 41604, comm: cat
      [506626.623978] Call Trace:
      [506626.628573]  [<ffffffffa05eb853>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
      [506626.636448]  [<ffffffffa05ebdf5>] lbug_with_loc+0x45/0xc0 [libcfs]
      [506626.643456]  [<ffffffffa0dea859>] osc_ap_completion.isra.30+0x4d9/0x5b0 [osc]
      [506626.651526]  [<ffffffffa0df558d>] osc_queue_sync_pages+0x2dd/0x350 [osc]
      [506626.659108]  [<ffffffffa0de750f>] osc_io_submit+0x42f/0x530 [osc]
      [506626.666037]  [<ffffffffa086fbd6>] cl_io_submit_rw+0x66/0x170 [obdclass]
      [506626.673531]  [<ffffffffa0b8d257>] lov_io_submit+0x2a7/0x420 [lov]
      [506626.680450]  [<ffffffffa086fbd6>] cl_io_submit_rw+0x66/0x170 [obdclass]
      [506626.687961]  [<ffffffffa0c67f70>] ll_readpage+0x2d0/0x560 [lustre]
      [506626.694964]  [<ffffffff8116af87>] generic_file_aio_read+0x3b7/0x750
      [506626.702078]  [<ffffffffa0c98485>] vvp_io_read_start+0x3c5/0x470 [lustre]
      [506626.709674]  [<ffffffffa086f965>] cl_io_start+0x65/0x130 [obdclass]
      [506626.716785]  [<ffffffffa0872f85>] cl_io_loop+0xa5/0x190 [obdclass]
      [506626.723797]  [<ffffffffa0c34e8c>] ll_file_io_generic+0x5fc/0xae0 [lustre]
      [506626.731477]  [<ffffffffa0c35db2>] ll_file_aio_read+0x192/0x530 [lustre]
      [506626.738962]  [<ffffffffa0c3621b>] ll_file_read+0xcb/0x1e0 [lustre]
      [506626.745962]  [<ffffffff811dea1c>] vfs_read+0x9c/0x170
      [506626.751700]  [<ffffffff811df56f>] SyS_read+0x7f/0xe0
      [506626.757345]  [<ffffffff81646889>] system_call_fastpath+0x16/0x1b
      [506626.765990] Kernel panic - not syncing: LBUG
      [506626.770850] CPU: 53 PID: 41604 Comm: cat Tainted: G           OE ------------   3.10.0-327.22.2.el7.x86_64 #1
      [506626.782104] Hardware name: BULL bullx blade/CHPU, BIOS BIOSX07.037.01.003 10/23/2015
      [506626.790838]  ffffffffa0610ced 000000000f6a3070 ffff8817799eb8c0 ffffffff816360f4
      [506626.799228]  ffff8817799eb940 ffffffff8162f96a ffffffff00000008 ffff8817799eb950
      [506626.807618]  ffff8817799eb8f0 000000000f6a3070 ffffffffa0e01466 0000000000000246
      [506626.816005] Call Trace:
      [506626.818839]  [<ffffffff816360f4>] dump_stack+0x19/0x1b
      [506626.824668]  [<ffffffff8162f96a>] panic+0xd8/0x1e7
      [506626.830128]  [<ffffffffa05ebe5b>] lbug_with_loc+0xab/0xc0 [libcfs]
      [506626.837129]  [<ffffffffa0dea859>] osc_ap_completion.isra.30+0x4d9/0x5b0 [osc]
      [506626.845192]  [<ffffffffa0df558d>] osc_queue_sync_pages+0x2dd/0x350 [osc]
      [506626.852766]  [<ffffffffa0de750f>] osc_io_submit+0x42f/0x530 [osc]
      [506626.859702]  [<ffffffffa086fbd6>] cl_io_submit_rw+0x66/0x170 [obdclass]
      [506626.867184]  [<ffffffffa0b8d257>] lov_io_submit+0x2a7/0x420 [lov]
      [506626.874099]  [<ffffffffa086fbd6>] cl_io_submit_rw+0x66/0x170 [obdclass]
      [506626.881611]  [<ffffffffa0c67f70>] ll_readpage+0x2d0/0x560 [lustre]
      [506626.888609]  [<ffffffff8116af87>] generic_file_aio_read+0x3b7/0x750
      [506626.895721]  [<ffffffffa0c98485>] vvp_io_read_start+0x3c5/0x470 [lustre]
      [506626.903322]  [<ffffffffa086f965>] cl_io_start+0x65/0x130 [obdclass]
      [506626.910418]  [<ffffffffa0872f85>] cl_io_loop+0xa5/0x190 [obdclass]
      [506626.917420]  [<ffffffffa0c34e8c>] ll_file_io_generic+0x5fc/0xae0 [lustre]
      [506626.925091]  [<ffffffffa0c35db2>] ll_file_aio_read+0x192/0x530 [lustre]
      [506626.932575]  [<ffffffffa0c3621b>] ll_file_read+0xcb/0x1e0 [lustre]
      [506626.939569]  [<ffffffff811dea1c>] vfs_read+0x9c/0x170
      [506626.945300]  [<ffffffff811df56f>] SyS_read+0x7f/0xe0
      [506626.950938]  [<ffffffff81646889>] system_call_fastpath+0x16/0x1b

      Since the customer runs a classified ("black") site, we cannot provide the crash dump, but we will happily provide any text output you would find useful.
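
      For reference, the failed check asserts a logical equivalence ("if and only if"): a page must be in CPS_PAGEIN state exactly when the command is a read. Below is a minimal user-space sketch of the failing condition; the equi() macro body and the CPS_PAGEIN value are assumptions for illustration, not taken from this report.

      #include <stdio.h>

      /* Assumed libcfs-style definition: both sides true, or both false. */
      #define equi(a, b) (!!(a) == !!(b))

      #define CPS_PAGEIN   4   /* hypothetical enum value, for illustration */
      #define OBD_BRW_READ 1

      int main(void)
      {
              int cp_state = 0;        /* "cp_state:0" from the LBUG message */
              int cmd = OBD_BRW_READ;  /* "cmd:1" */

              /* cmd is a read but the page is not in CPS_PAGEIN, so the
               * equivalence fails and the kernel code would LBUG here. */
              printf("assertion holds: %d\n",
                     equi(cp_state == CPS_PAGEIN, cmd == OBD_BRW_READ));
              return 0;
      }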

      Attachments

        1. crash_output.txt
          24 kB
        2. foreach_bt_merge.txt
          152 kB
        3. struct_analyze1.txt
          50 kB

        Issue Links

          Activity

            [LU-8435] LBUG (osc_cache.c:1290:osc_completion()) ASSERTION( equi(page->cp_state == CPS_PAGEIN, cmd == OBD_BRW_READ) )

            adilger Andreas Dilger added a comment

            I think it makes sense to disable cgroup testing if the client is running a kernel version < 4.4, until we know there is a fix in RHEL and which version it is in.


            bfaccini Bruno Faccini (Inactive) added a comment

            And here is the answer I got for BZ #197075:

            Kernel memory accounting is totally unusable in 3.10, because it lacks
            dcache reclaim and there are a lot of implementation bugs. The one you
            caught is just one of many that have been fixed since 3.10. That's why
            it stayed disabled by default until 4.x.
            
            So IMHO the best you can do if you really want to use kernel memory
            accounting is upgrade to 4.x or backport all related patches.
            
            bfaccini Bruno Faccini (Inactive) added a comment (edited)

            Hmm, still not much of an answer to my report at http://www.spinics.net/lists/cgroups/msg18187.html.

            So, I have submitted bug report #197075 to bugzilla.kernel.org: https://bugzilla.kernel.org/show_bug.cgi?id=197075.


            bfaccini Bruno Faccini (Inactive) added a comment

            A problem description and two patch proposals have been posted to cgroups@vger.kernel.org.


            bfaccini Bruno Faccini (Inactive) added a comment

            Hmm, in fact I think the kernel patch ("memcg: don't call memcg_update_all_caches if new cache id fits") that I indicated in my previous comment is unlikely to be the final fix.

            After browsing more recent kernel source code again, the problem appears to be fixed in 4.x kernels because memcg_params is no longer a pointer but an embedded struct memcg_cache_params inside the kmem_cache! But it looks like the problem is still present in 3.x.

            BTW, I have found a possible alternate way to fix it with the following patch:

            [root@onyx-68 linux-3.10.0-327.22.2.el7-debug]# diff -urN mm/memcontrol.c mm/memcontrol.c.bfi+
            --- mm/memcontrol.c     2016-06-09 06:31:12.000000000 -0700
            +++ mm/memcontrol.c.bfi+        2017-09-12 09:24:53.235452071 -0700
            @@ -3213,7 +3213,7 @@ int memcg_register_cache(struct mem_cgroup *memcg, struct kmem_cache *s,
             {
                    size_t size = sizeof(struct memcg_cache_params);
             
            -       if (!memcg_kmem_enabled())
            +       if (!memcg_kmem_enabled() && memcg_limited_groups_array_size == 0)
                            return 0;
             
                    if (!memcg)
            [root@onyx-68 linux-3.10.0-327.22.2.el7-debug]# 
            

            Note that the memcg_register_cache() routine has been replaced by memcg_alloc_cache_params() in some 3.x versions, but the same fix applies there.

            I will try to propose one of these patches on the kernel mailing list.
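
            To make the effect of the one-line condition change concrete, here is a toy user-space model of the failure mode under the scenario described above (a kmem cgroup existed before, so memcg_limited_groups_array_size != 0, but none is enabled now); toy_register_cache() and the two globals are hypothetical stand-ins, not the real 3.10 kernel code.

            #include <stdio.h>
            #include <stdlib.h>

            struct memcg_cache_params { int is_root_cache; };
            struct kmem_cache { struct memcg_cache_params *memcg_params; };

            static int memcg_kmem_enabled;                  /* no kmem-limited cgroup active now */
            static int memcg_limited_groups_array_size = 1; /* ...but one existed before */

            /* Mirrors the patched condition: also allocate memcg_params when a kmem
             * cgroup has existed previously, even if none is enabled right now. */
            static int toy_register_cache(struct kmem_cache *s)
            {
                    if (!memcg_kmem_enabled && memcg_limited_groups_array_size == 0)
                            return 0; /* the unpatched code would return here */
                    s->memcg_params = calloc(1, sizeof(*s->memcg_params));
                    return s->memcg_params ? 0 : -1;
            }

            int main(void)
            {
                    struct kmem_cache cache = { NULL };

                    toy_register_cache(&cache);
                    /* Without the patch, memcg_params stays NULL here and a later
                     * lookup in __memcg_kmem_get_cache() dereferences NULL. */
                    printf("memcg_params %s\n", cache.memcg_params ? "allocated" : "NULL");
                    return 0;
            }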


            bfaccini Bruno Faccini (Inactive) added a comment

            After browsing the kernel sources at kernel.org, I think this issue is fixed in 4.x and in more recent 3.x kernels, mainly by the changes introduced in patch/commit "memcg: don't call memcg_update_all_caches if new cache id fits", available at
            https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/patch/mm/memcontrol.c?id=f3bb3043a092368a255bca5d1c6f4352c96a3b2d, where ALL kmem_caches are systematically walked to resize/initialize a new memcg_params when a kmem memory cgroup is started.

            And again, I need to insist that this bug is very unlikely to be triggered if we keep running with SLUB configured instead of switching back to SLAB.


            bfaccini Bruno Faccini (Inactive) added a comment

            This problem should already have been addressed by the patch at https://patchwork.kernel.org/patch/2850152/.
            But since the RHEL7 kernel version we experienced these crashes with already includes it, after digging into the related kernel source code, I think the following patch is also required to fully fix the issue:

            [root@onyx-68 linux-3.10.0-327.22.2.el7-debug]# diff -urN mm/memcontrol.c mm/memcontrol.c.bfi
            --- mm/memcontrol.c     2016-06-09 06:31:12.000000000 -0700
            +++ mm/memcontrol.c.bfi 2017-09-08 07:37:18.647281366 -0700
            @@ -3163,7 +3163,16 @@ int memcg_update_cache_size(struct kmem_cache *s, int num_groups)
             
                    VM_BUG_ON(s->memcg_params && !s->memcg_params->is_root_cache);
             
            -       if (num_groups > memcg_limited_groups_array_size) {
            +       /* resize/grow existing memcg_params, or allocate it if this had not
            +        * already been done during kmem_cache creation because none of the
            +        * previously used kmem memcgs were present at that time (i.e.
            +        * memcg_limited_groups_array_size != 0 but memcg_kmem_enabled()
            +        * returned false). This would not be necessary if
            +        * memcg_caches_array_size() did not anticipate more slots than
            +        * required and if memcg_limited_groups_array_size were simply
            +        * incremented upon each new kmem memcg creation.
            +        */
            +       if (num_groups > memcg_limited_groups_array_size || !s->memcg_params) {
                            int i;
                            ssize_t size = memcg_caches_array_size(num_groups);
             
            @@ -3203,7 +3212,8 @@
                             * bigger than the others. And all updates will reset this
                             * anyway.
                             */
            -               kfree(cur_params);
            +               if (cur_params)
            +                       kfree(cur_params);
                    }
                    return 0;
             }
            [root@onyx-68 linux-3.10.0-327.22.2.el7-debug]# 
            

            Oleg, you may want to try it? On my side, I will check whether this problem is still present in recent kernels and report it upstream if it has not been fixed.

            bfaccini Bruno Faccini (Inactive) added a comment (edited)

            Well, I have been quite puzzled by these crash dumps. According to the kernel source code, the only case where memcg_params can be NULL is when a kmem_cache is created while !memcg_kmem_enabled() (tested in memcg_register_cache(); this is when no memory cgroup with a kmem limit is active yet), but then ALL kmem_caches should have memcg_params == NULL, not only the Lustre ones!! I have verified this on a fresh system after reboot with no memory cgroup configured: all kmem_caches, even the Lustre ones after a delayed Lustre start and module load, have memcg_params == NULL.

            Finally, I found a possible scenario where this can happen, and it is confirmed by the kernel logs in all the crash dumps: the sanity/test_411 test (likely the only one to create and use a memory cgroup) has to be run at least twice, with a Lustre module unload in between (causing the Lustre kmem_caches to be destroyed and re-created).
            Then, trying to reproduce this way on a test system, I was unable to trigger the bug, because !memcg_kmem_enabled() could never be satisfied: the "child/memcg" kmem_caches (created during the first sub-test run and memory cgroup creation) were still present, even though fully empty/freed. This could be where the SLAB vs. SLUB config makes the difference: the SLAB allocator performs a regular kmem_cache shrink/drain cleanup that could garbage-collect such leaked kmem_caches, whereas with SLUB this never happens (or perhaps only much later).

            But in any case, the fact that memcg_params can be NULL for a recently created kmem_cache when no kmem memory cgroup is currently active (but some existed before!) needs to be handled by the current code. I will check whether this is already the case in recent kernels, and push a fix if not.

            Oleg, this looks like one more argument confirming the need to test kernels with "unsupported" configs!

            green Oleg Drokin added a comment

            Note that I am running an unsupported Red Hat kernel config with a lot of debugging enabled, and I also enable SLAB, whereas Red Hat defaults to SLUB.

            This does not make the bugs any less real, of course, but at times it takes a while to convince the RH people that it's OK to run an unsupported config, because it helps to detect bugs earlier rather than as random crashes elsewhere later.

            bfaccini Bruno Faccini (Inactive) added a comment (edited)

            Aurelien, I know that you posted this new test to ensure that a memcg limit does not cause crashes in Lustre code, but given this new kind of crash in the kernel/memcg layer, it seems you should also propose it for the kernel regression test suite!!

            My first crash-dump analysis points to a possible race between the lazy registration of a memcg with the current kmem_caches and concurrent slab allocations, which triggers the unexpected situation in __memcg_kmem_get_cache() where memcg_params has still not been initialized in the "ptlrpc_cache" kmem_cache.

            It is also interesting to note that recent auto-test results for sanity/test_411 are all successes, while all these crashes occurred during single-node sessions, and that the only kmem_caches in the system without an initialized memcg_params are those created in Lustre code.

            More to come.
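
            To illustrate the crash site implied by this analysis, here is a small user-space sketch of how an unallocated memcg_params leads to a NULL dereference during the per-memcg cache lookup; the structures and toy_memcg_kmem_get_cache() are simplified, hypothetical stand-ins for the 3.10 code, not the actual kernel implementation.

            #include <stdio.h>

            struct kmem_cache;

            struct memcg_cache_params {
                    struct kmem_cache *memcg_caches[8]; /* per-memcg child caches */
            };

            struct kmem_cache {
                    const char *name;
                    struct memcg_cache_params *memcg_params;
            };

            static struct kmem_cache *
            toy_memcg_kmem_get_cache(struct kmem_cache *cachep, int memcg_id)
            {
                    /* The 3.10-era lookup assumed memcg_params was always allocated
                     * once kmem accounting had ever been enabled; in the scenario
                     * above it is still NULL, so this dereference oopses. */
                    return cachep->memcg_params->memcg_caches[memcg_id];
            }

            int main(void)
            {
                    /* Mirrors a Lustre cache created while no kmem cgroup was
                     * active: memcg_params was never allocated. */
                    struct kmem_cache ptlrpc_cache = { "ptlrpc_cache", NULL };

                    if (!ptlrpc_cache.memcg_params) {
                            fprintf(stderr, "memcg_params is NULL: the real code "
                                    "would dereference it here and crash\n");
                            return 1;
                    }
                    (void)toy_memcg_kmem_get_cache(&ptlrpc_cache, 0);
                    return 0;
            }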


            People

              Assignee: bfaccini Bruno Faccini (Inactive)
              Reporter: spiechurski Sebastien Piechurski
              Votes: 0
              Watchers: 16

              Dates

                Created:
                Updated:
                Resolved: