[LU-12525] sanity-flr test 200 and others: assertion in osc_page_delete Created: 09/Jul/19  Updated: 30/Aug/19  Resolved: 30/Aug/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Oleg Drokin Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 1
Labels: None

Issue Links:
Related
is related to LU-12328 FLR mirroring on 2.12.1-1 not usable ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

There's a flurry of failures. In maloo it mostly fails in sanity-flr test 200, but I also saw it in racer at least once with a partially similar backtrace.

Lustre: DEBUG MARKER: mkdir -p /mnt/lustre2
Lustre: DEBUG MARKER: mount -t lustre -o user_xattr,flock trevis-49vm6@tcp:/lustre /mnt/lustre2
Lustre: Mounted lustre-client
Lustre: DEBUG MARKER: mkdir -p /mnt/lustre3
Lustre: DEBUG MARKER: mount -t lustre -o user_xattr,flock trevis-49vm6@tcp:/lustre /mnt/lustre3
LustreError: 16418:0:(osc_cache.c:2553:osc_teardown_async_page()) extent ffff89076293af20@{[768 -> 1023/1023], [2|0|-|cache|wi|ffff890756aca8c0], [1703936|123|+|-|ffff890762821d40|256| (null)]} trunc at 848.
LustreError: 16418:0:(osc_cache.c:2553:osc_teardown_async_page()) ### extent: ffff89076293af20 ns: lustre-OST0001-osc-ffff890767ec5800 lock: ffff890762821d40/0x10751cafee0649b3 lrc: 2/0,0 mode: PW/PW res: [0x175d:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 3473408->3977215) flags: 0x20000020000 nid: local remote: 0x70d1a9e0ebb537b3 expref: -99 pid: 16413 timeout: 0 lvb_type: 1
LustreError: 16418:0:(osc_page.c:192:osc_page_delete()) page@ffff89075dd84a00[2 ffff890736741ab0 4 1 (null)]
LustreError: 16418:0:(osc_page.c:192:osc_page_delete()) vvp-page@ffff89075dd84a50(0:0) vm@ffffe55201ca5c80 1fffff0000083d 3:0 ffff89075dd84a00 1872 lru
LustreError: 16418:0:(osc_page.c:192:osc_page_delete()) lov-page@ffff89075dd84a90, comp index: 70001, gen: 8
LustreError: 16418:0:(osc_page.c:192:osc_page_delete()) osc-page@ffff89075dd84ac8 848: 1< 0x845fed 258 0 + - > 2< 3473408 0 4096 0x0 0x520 | (null) ffff89076b15c6c0 ffff890756aca8c0 > 3< 0 0 0 > 4< 0 0 8 120979456 - | - - + - > 5< - - + - | 0 - | 123 - ->
LustreError: 16418:0:(osc_page.c:192:osc_page_delete()) end page@ffff89075dd84a00
LustreError: 16418:0:(osc_page.c:192:osc_page_delete()) Trying to teardown failed: -16
LustreError: 16418:0:(osc_page.c:193:osc_page_delete()) ASSERTION( 0 ) failed:
LustreError: 16418:0:(osc_page.c:193:osc_page_delete()) LBUG
Pid: 16418, comm: cat 3.10.0-957.12.2.el7.x86_64 #1 SMP Tue May 14 21:24:32 UTC 2019
Call Trace:
[<ffffffffc05b78bc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[<ffffffffc05b796c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[<ffffffffc0b8086f>] osc_page_delete+0x48f/0x500 [osc]
[<ffffffffc073cdf0>] cl_page_delete0+0x80/0x220 [obdclass]
[<ffffffffc073cfc3>] cl_page_delete+0x33/0x110 [obdclass]
[<ffffffffc0c8c98f>] ll_invalidatepage+0x7f/0x170 [lustre]
[<ffffffff8ffc6c8d>] do_invalidatepage_range+0x7d/0x90
[<ffffffff8ffc6d37>] truncate_inode_page+0x77/0x80
[<ffffffff8ffc6f6a>] truncate_inode_pages_range+0x1ea/0x700
[<ffffffff8ffc7495>] truncate_inode_pages+0x15/0x20
[<ffffffffc0ca06db>] vvp_prune+0x7b/0x2b0 [lustre]
[<ffffffffc073a638>] cl_object_prune+0x58/0x110 [obdclass]
[<ffffffffc0beecc9>] lov_conf_set+0x509/0xac0 [lov]
[<ffffffffc073a753>] cl_conf_set+0x63/0x120 [obdclass]
[<ffffffffc0c598f1>] ll_layout_conf+0x81/0x3f0 [lustre]
[<ffffffffc0c5a1ad>] ll_layout_refresh+0x54d/0x900 [lustre]
[<ffffffffc0c9fc97>] vvp_io_init+0x347/0x460 [lustre]
[<ffffffffc07426ab>] cl_io_init0.isra.15+0x8b/0x160 [obdclass]
[<ffffffffc0742843>] cl_io_init+0x43/0x80 [obdclass]
[<ffffffffc07428f7>] cl_io_rw_init+0x77/0x210 [obdclass]
[<ffffffffc0c502be>] ll_file_io_generic+0x17e/0xaf0 [lustre]
[<ffffffffc0c51a52>] ll_file_aio_read+0x3a2/0x470 [lustre]
[<ffffffffc0c51c20>] ll_file_read+0x100/0x1c0 [lustre]
[<ffffffff9004167f>] vfs_read+0x9f/0x170
[<ffffffff9004253f>] SyS_read+0x7f/0xf0
[<ffffffff90575ddb>] system_call_fastpath+0x22/0x27
[<ffffffffffffffff>] 0xffffffffffffffff

example maloo report: https://testing.whamcloud.com/test_sets/4bc4f606-a1e3-11e9-8dbe-52540065bddc
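
For context on the LBUG: osc_teardown_async_page() returns -EBUSY (-16) because the extent covering the page is still busy (queued for cache writeback, per the "trunc at 848" extent message above), and osc_page_delete() treats any teardown failure as fatal. A rough sketch of the failing path in osc_page.c, paraphrased from the 2.13 code rather than quoted verbatim:

static void osc_page_delete(const struct lu_env *env,
                            const struct cl_page_slice *slice)
{
        struct osc_page *opg = cl2osc_page(slice);
        struct osc_object *obj = cl2osc(opg->ops_cl.cpl_obj);
        int rc;

        osc_page_transfer_put(env, opg);

        /* returns -EBUSY (-16) when the extent covering this page is
         * still in use, e.g. held by cache writeback, which is exactly
         * what the extent messages above show */
        rc = osc_teardown_async_page(env, obj, opg);
        if (rc != 0) {
                CL_PAGE_DEBUG(D_ERROR, env, slice->cpl_page,
                              "Trying to teardown failed: %d\n", rc);
                LASSERT(0);     /* the osc_page.c:193 LBUG in the traces */
        }

        /* ... remainder elided: drop the page from the object's
         * bookkeeping ... */
}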

racer:

LustreError: 12527:0:(osc_cache.c:2553:osc_teardown_async_page()) extent ffff8800a297e230@{[672 -> 1023/1023], [2|0|-|cache|wi|ffff8800a0bcee60], [1466368|352|+|-|ffff8800b81c0d80|1024| (null)]} trunc at 672.
LustreError: 12527:0:(osc_cache.c:2553:osc_teardown_async_page()) ### extent: ffff8800a297e230 ns: lustre-OST0001-osc-ffff88010a173800 lock: ffff8800b81c0d80/0xfcc9cb8b08e86ca9 lrc: 2/0,0 mode: PW/PW res: [0x1e9:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 2752512->18446744073709551615) flags: 0x20000000000 nid: local remote: 0x4b2ddd8a9f2ac3e expref: -99 pid: 12139 timeout: 0 lvb_type: 1
LustreError: 12527:0:(osc_page.c:192:osc_page_delete()) page@ffff8800a3ce2e00[2 ffff8800a231ac58 4 1 (null)]
LustreError: 12527:0:(osc_page.c:192:osc_page_delete()) vvp-page@ffff8800a3ce2e50(0:0) vm@ffffea00028cfb48 1fffff0000083d 3:0 ffff8800a3ce2e00 1696 lru
LustreError: 12527:0:(osc_page.c:192:osc_page_delete()) lov-page@ffff8800a3ce2e90, comp index: 10001, gen: 5
LustreError: 12527:0:(osc_page.c:192:osc_page_delete()) osc-page@ffff8800a3ce2ec8 672: 1< 0x845fed 258 0 + - > 2< 2752512 0 4096 0x0 0x520 | (null) ffff8800c8118798 ffff8800a0bcee60 > 3< 0 0 0 > 4< 0 0 8 424652800 - | - - + - > 5< - - + - | 0 - | 352 - ->
LustreError: 12527:0:(osc_page.c:192:osc_page_delete()) end page@ffff8800a3ce2e00
LustreError: 12527:0:(osc_page.c:192:osc_page_delete()) Trying to teardown failed: -16
LustreError: 12527:0:(osc_page.c:193:osc_page_delete()) ASSERTION( 0 ) failed:
LustreError: 12527:0:(osc_page.c:193:osc_page_delete()) LBUG
Pid: 12527, comm: touch 3.10.0-7.6-debug #5 SMP Tue Jun 11 21:26:20 EDT 2019
Call Trace:
[<ffffffffa017e8cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[<ffffffffa017e97c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[<ffffffffa08133d9>] osc_page_delete+0x489/0x4f0 [osc]
[<ffffffffa032dcad>] cl_page_delete0+0x7d/0x210 [obdclass]
[<ffffffffa032de73>] cl_page_delete+0x33/0x110 [obdclass]
[<ffffffffa0d6551f>] ll_invalidatepage+0x7f/0x170 [lustre]
[<ffffffff811bc961>] do_invalidatepage_range+0x71/0x80
[<ffffffff811bca07>] truncate_inode_page+0x77/0x80
[<ffffffff811bcc42>] truncate_inode_pages_range+0x1f2/0x770
[<ffffffff811bd1d5>] truncate_inode_pages+0x15/0x20
[<ffffffffa0d78abb>] vvp_prune+0x7b/0x2b0 [lustre]
[<ffffffffa032b595>] cl_object_prune+0x55/0x110 [obdclass]
[<ffffffffa08d5121>] lov_conf_set+0x501/0xa90 [lov]
[<ffffffffa032b6b0>] cl_conf_set+0x60/0x120 [obdclass]
[<ffffffffa0d6f0cb>] cl_file_inode_init+0x12b/0x390 [lustre]
[<ffffffffa0d44f35>] ll_update_inode+0x2e5/0x610 [lustre]
[<ffffffffa0d56363>] ll_iget+0x253/0x350 [lustre]
[<ffffffffa0d4955d>] ll_prep_inode+0x20d/0x9a0 [lustre]
[<ffffffffa0d5849b>] ll_lookup_it+0xc0b/0x1e90 [lustre]
[<ffffffffa0d5ad25>] ll_lookup_nd+0x85/0x190 [lustre]
[<ffffffff8124147d>] lookup_real+0x1d/0x50
[<ffffffff81241f62>] __lookup_hash+0x42/0x60
[<ffffffff817b31f4>] lookup_slow+0x42/0xa7
[<ffffffff8124677f>] link_path_walk+0x81f/0x8c0
[<ffffffff812478fe>] path_openat+0xae/0x650
[<ffffffff812492cd>] do_filp_open+0x4d/0xb0
[<ffffffff81235167>] do_sys_open+0x137/0x240
[<ffffffff8123528e>] SyS_open+0x1e/0x20
[<ffffffff817cae15>] system_call_fastpath+0x1c/0x21
[<ffffffffffffffff>] 0xffffffffffffffff

sample report: http://testing.linuxhacker.ru:3333/lustre-reports/1051/testresults/racer-ldiskfs-DNE-centos7_x86_64-centos7_x86_64 and in maloo: https://testing.whamcloud.com/test_sets/58251f94-a11a-11e9-8dbe-52540065bddc

This seems to be a very recent crash scenario, first appearing just two days ago.

Coincidentally, we just landed this patch in that timeframe: https://review.whamcloud.com/35111
So I guess it's the prime suspect now.



 Comments   
Comment by Oleg Drokin [ 09/Jul/19 ]

In fact, as further proof, this was first seen in reviews for that patch: http://testing.linuxhacker.ru:3333/lustre-reports/522/results.html (racer-zfs crash)

Comment by Alex Zhuravlev [ 09/Jul/19 ]

hitting this quite often with ldiskfs

Comment by Oleg Drokin [ 09/Jul/19 ]

I am going to revert the patch causing this. I'm not sure if Jinshan has any time to look into this, so Bobi, please take a look here as well, since that patch is also important and we do want to fix that use case, just without the crashing, hopefully.

Comment by Patrick Farrell (Inactive) [ 10/Jul/19 ]

I looked at this yesterday and couldn't figure it out, but I noticed that the crash always happens because there are still busy pages when we do the purge step of conf_set. Looking in the logs, it seems the sync that is supposed to happen before the invalidate call is not happening (talking about vvp_prune here). There is a cl_sync_file_range, and looking in the debug logs for Oleg's instance, I cannot see it happening.

But as to why, I'm getting lost in the FLR details. This does seem likely to be exposing an existing bug rather than strictly introducing one.
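
For reference, the flow described above: vvp_prune() is expected to flush dirty pages via cl_sync_file_range() before truncate_inode_pages() invalidates them; if that sync is skipped or races with new dirtying, the truncate hits pages still attached to a busy OSC extent. A rough sketch of that sequence, paraphrased (not verbatim 2.13 code):

static int vvp_prune(const struct lu_env *env, struct cl_object *obj)
{
        struct inode *inode = vvp_object_inode(obj);
        int rc;

        /* flush dirty pages first so the truncate below only sees
         * clean pages; if this step does not run (as the debug logs
         * suggest), osc_page_delete() trips over busy extents */
        rc = cl_sync_file_range(inode, 0, OBD_OBJECT_EOF,
                                CL_FSYNC_LOCAL, 1);
        if (rc < 0)
                return rc;

        /* drop every page of the object; each page goes through
         * ll_invalidatepage() -> cl_page_delete() -> osc_page_delete(),
         * which is the path in both backtraces above */
        truncate_inode_pages(inode->i_mapping, 0);
        return 0;
}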

Comment by Andreas Dilger [ 30/Aug/19 ]

Closing this since the offending patch was reverted.
