Lustre / LU-12525

sanity-flr test 200 and others: assertion in osc_page_delete

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • Fix Version: Lustre 2.13.0

    Description

      There's a flurry of failures. In maloo it mostly fails in sanity-flr test 200, but I also saw it in racer at least once with a partially similar backtrace.

      Lustre: DEBUG MARKER: mkdir -p /mnt/lustre2
      Lustre: DEBUG MARKER: mount -t lustre -o user_xattr,flock trevis-49vm6@tcp:/lustre /mnt/lustre2
      Lustre: Mounted lustre-client
      Lustre: DEBUG MARKER: mkdir -p /mnt/lustre3
      Lustre: DEBUG MARKER: mount -t lustre -o user_xattr,flock trevis-49vm6@tcp:/lustre /mnt/lustre3
      LustreError: 16418:0:(osc_cache.c:2553:osc_teardown_async_page()) extent ffff89076293af20@{[768 -> 1023/1023], [2|0|-|cache|wi|ffff890756aca8c0], [1703936|123|+|-|ffff890762821d40|256| (null)]} trunc at 848.
      LustreError: 16418:0:(osc_cache.c:2553:osc_teardown_async_page()) ### extent: ffff89076293af20 ns: lustre-OST0001-osc-ffff890767ec5800 lock: ffff890762821d40/0x10751cafee0649b3 lrc: 2/0,0 mode: PW/PW res: [0x175d:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 3473408->3977215) flags: 0x20000020000 nid: local remote: 0x70d1a9e0ebb537b3 expref: -99 pid: 16413 timeout: 0 lvb_type: 1
      LustreError: 16418:0:(osc_page.c:192:osc_page_delete()) page@ffff89075dd84a00[2 ffff890736741ab0 4 1 (null)]
      LustreError: 16418:0:(osc_page.c:192:osc_page_delete()) vvp-page@ffff89075dd84a50(0:0) vm@ffffe55201ca5c80 1fffff0000083d 3:0 ffff89075dd84a00 1872 lru
      LustreError: 16418:0:(osc_page.c:192:osc_page_delete()) lov-page@ffff89075dd84a90, comp index: 70001, gen: 8
      LustreError: 16418:0:(osc_page.c:192:osc_page_delete()) osc-page@ffff89075dd84ac8 848: 1< 0x845fed 258 0 + - > 2< 3473408 0 4096 0x0 0x520 | (null) ffff89076b15c6c0 ffff890756aca8c0 > 3< 0 0 0 > 4< 0 0 8 120979456 - | - - + - > 5< - - + - | 0 - | 123 - ->
      LustreError: 16418:0:(osc_page.c:192:osc_page_delete()) end page@ffff89075dd84a00
      LustreError: 16418:0:(osc_page.c:192:osc_page_delete()) Trying to teardown failed: -16
      LustreError: 16418:0:(osc_page.c:193:osc_page_delete()) ASSERTION( 0 ) failed:
      LustreError: 16418:0:(osc_page.c:193:osc_page_delete()) LBUG
      Pid: 16418, comm: cat 3.10.0-957.12.2.el7.x86_64 #1 SMP Tue May 14 21:24:32 UTC 2019
      Call Trace:
      [<ffffffffc05b78bc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      [<ffffffffc05b796c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [<ffffffffc0b8086f>] osc_page_delete+0x48f/0x500 [osc]
      [<ffffffffc073cdf0>] cl_page_delete0+0x80/0x220 [obdclass]
      [<ffffffffc073cfc3>] cl_page_delete+0x33/0x110 [obdclass]
      [<ffffffffc0c8c98f>] ll_invalidatepage+0x7f/0x170 [lustre]
      [<ffffffff8ffc6c8d>] do_invalidatepage_range+0x7d/0x90
      [<ffffffff8ffc6d37>] truncate_inode_page+0x77/0x80
      [<ffffffff8ffc6f6a>] truncate_inode_pages_range+0x1ea/0x700
      [<ffffffff8ffc7495>] truncate_inode_pages+0x15/0x20
      [<ffffffffc0ca06db>] vvp_prune+0x7b/0x2b0 [lustre]
      [<ffffffffc073a638>] cl_object_prune+0x58/0x110 [obdclass]
      [<ffffffffc0beecc9>] lov_conf_set+0x509/0xac0 [lov]
      [<ffffffffc073a753>] cl_conf_set+0x63/0x120 [obdclass]
      [<ffffffffc0c598f1>] ll_layout_conf+0x81/0x3f0 [lustre]
      [<ffffffffc0c5a1ad>] ll_layout_refresh+0x54d/0x900 [lustre]
      [<ffffffffc0c9fc97>] vvp_io_init+0x347/0x460 [lustre]
      [<ffffffffc07426ab>] cl_io_init0.isra.15+0x8b/0x160 [obdclass]
      [<ffffffffc0742843>] cl_io_init+0x43/0x80 [obdclass]
      [<ffffffffc07428f7>] cl_io_rw_init+0x77/0x210 [obdclass]
      [<ffffffffc0c502be>] ll_file_io_generic+0x17e/0xaf0 [lustre]
      [<ffffffffc0c51a52>] ll_file_aio_read+0x3a2/0x470 [lustre]
      [<ffffffffc0c51c20>] ll_file_read+0x100/0x1c0 [lustre]
      [<ffffffff9004167f>] vfs_read+0x9f/0x170
      [<ffffffff9004253f>] SyS_read+0x7f/0xf0
      [<ffffffff90575ddb>] system_call_fastpath+0x22/0x27
      [<ffffffffffffffff>] 0xffffffffffffffff
      

      example maloo report: https://testing.whamcloud.com/test_sets/4bc4f606-a1e3-11e9-8dbe-52540065bddc

      racer:

      LustreError: 12527:0:(osc_cache.c:2553:osc_teardown_async_page()) extent ffff8800a297e230@{[672 -> 1023/1023], [2|0|-|cache|wi|ffff8800a0bcee60], [1466368|352|+|-|ffff8800b81c0d80|1024| (null)]} trunc at 672.
      LustreError: 12527:0:(osc_cache.c:2553:osc_teardown_async_page()) ### extent: ffff8800a297e230 ns: lustre-OST0001-osc-ffff88010a173800 lock: ffff8800b81c0d80/0xfcc9cb8b08e86ca9 lrc: 2/0,0 mode: PW/PW res: [0x1e9:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 2752512->18446744073709551615) flags: 0x20000000000 nid: local remote: 0x4b2ddd8a9f2ac3e expref: -99 pid: 12139 timeout: 0 lvb_type: 1
      LustreError: 12527:0:(osc_page.c:192:osc_page_delete()) page@ffff8800a3ce2e00[2 ffff8800a231ac58 4 1 (null)]
      LustreError: 12527:0:(osc_page.c:192:osc_page_delete()) vvp-page@ffff8800a3ce2e50(0:0) vm@ffffea00028cfb48 1fffff0000083d 3:0 ffff8800a3ce2e00 1696 lru
      LustreError: 12527:0:(osc_page.c:192:osc_page_delete()) lov-page@ffff8800a3ce2e90, comp index: 10001, gen: 5
      LustreError: 12527:0:(osc_page.c:192:osc_page_delete()) osc-page@ffff8800a3ce2ec8 672: 1< 0x845fed 258 0 + - > 2< 2752512 0 4096 0x0 0x520 | (null) ffff8800c8118798 ffff8800a0bcee60 > 3< 0 0 0 > 4< 0 0 8 424652800 - | - - + - > 5< - - + - | 0 - | 352 - ->
      LustreError: 12527:0:(osc_page.c:192:osc_page_delete()) end page@ffff8800a3ce2e00
      LustreError: 12527:0:(osc_page.c:192:osc_page_delete()) Trying to teardown failed: -16
      LustreError: 12527:0:(osc_page.c:193:osc_page_delete()) ASSERTION( 0 ) failed:
      LustreError: 12527:0:(osc_page.c:193:osc_page_delete()) LBUG
      Pid: 12527, comm: touch 3.10.0-7.6-debug #5 SMP Tue Jun 11 21:26:20 EDT 2019
      Call Trace:
      [<ffffffffa017e8cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      [<ffffffffa017e97c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [<ffffffffa08133d9>] osc_page_delete+0x489/0x4f0 [osc]
      [<ffffffffa032dcad>] cl_page_delete0+0x7d/0x210 [obdclass]
      [<ffffffffa032de73>] cl_page_delete+0x33/0x110 [obdclass]
      [<ffffffffa0d6551f>] ll_invalidatepage+0x7f/0x170 [lustre]
      [<ffffffff811bc961>] do_invalidatepage_range+0x71/0x80
      [<ffffffff811bca07>] truncate_inode_page+0x77/0x80
      [<ffffffff811bcc42>] truncate_inode_pages_range+0x1f2/0x770
      [<ffffffff811bd1d5>] truncate_inode_pages+0x15/0x20
      [<ffffffffa0d78abb>] vvp_prune+0x7b/0x2b0 [lustre]
      [<ffffffffa032b595>] cl_object_prune+0x55/0x110 [obdclass]
      [<ffffffffa08d5121>] lov_conf_set+0x501/0xa90 [lov]
      [<ffffffffa032b6b0>] cl_conf_set+0x60/0x120 [obdclass]
      [<ffffffffa0d6f0cb>] cl_file_inode_init+0x12b/0x390 [lustre]
      [<ffffffffa0d44f35>] ll_update_inode+0x2e5/0x610 [lustre]
      [<ffffffffa0d56363>] ll_iget+0x253/0x350 [lustre]
      [<ffffffffa0d4955d>] ll_prep_inode+0x20d/0x9a0 [lustre]
      [<ffffffffa0d5849b>] ll_lookup_it+0xc0b/0x1e90 [lustre]
      [<ffffffffa0d5ad25>] ll_lookup_nd+0x85/0x190 [lustre]
      [<ffffffff8124147d>] lookup_real+0x1d/0x50
      [<ffffffff81241f62>] __lookup_hash+0x42/0x60
      [<ffffffff817b31f4>] lookup_slow+0x42/0xa7
      [<ffffffff8124677f>] link_path_walk+0x81f/0x8c0
      [<ffffffff812478fe>] path_openat+0xae/0x650
      [<ffffffff812492cd>] do_filp_open+0x4d/0xb0
      [<ffffffff81235167>] do_sys_open+0x137/0x240
      [<ffffffff8123528e>] SyS_open+0x1e/0x20
      [<ffffffff817cae15>] system_call_fastpath+0x1c/0x21
      [<ffffffffffffffff>] 0xffffffffffffffff
      

      sample report: http://testing.linuxhacker.ru:3333/lustre-reports/1051/testresults/racer-ldiskfs-DNE-centos7_x86_64-centos7_x86_64 and in maloo: https://testing.whamcloud.com/test_sets/58251f94-a11a-11e9-8dbe-52540065bddc

      This seems to be a very recent crash scenario too, appearing just two days ago.

      Coincidentally, we just landed this patch in that timeframe: https://review.whamcloud.com/35111
      So I guess it's the prime suspect now.

      Attachments

      Issue Links

      Activity

            [LU-12525] sanity-flr test 200 and others: assertion in osc_page_delete

            adilger Andreas Dilger added a comment -

            Closing this since the offending patch was reverted.

            pfarrell Patrick Farrell (Inactive) added a comment -

            I looked at this yesterday and couldn't figure it out, but I noticed that the crash is always because there are still busy pages when we do the purge step of conf_set. Looking in the logs, it seems like the sync that is supposed to happen before the invalidate call is not happening (talking about vvp_prune here). There is a cl_sync_file_range, and looking in the debug logs for Oleg's instance, I cannot see it happening.

            But as to why - I'm getting lost in FLR details. This does seem likely to be exposing an existing bug rather than strictly introducing one.
            green Oleg Drokin added a comment -

            I am going to revert the patch causing this. Not sure if Jinshan has any time to look into this, so Bobi, please take a look here as well, since that patch is also important and we do want to fix that use case, just without the crashing, hopefully.

            bzzz Alex Zhuravlev added a comment -

            hitting this quite often with ldiskfs
            green Oleg Drokin added a comment -

            In fact, as further proof - this was first seen in reviews for that patch: http://testing.linuxhacker.ru:3333/lustre-reports/522/results.html (racer-zfs crash)

            People

              Assignee: wc-triage WC Triage
              Reporter: green Oleg Drokin
              Votes: 1
              Watchers: 7
