[LU-12462] osc_cache_writeback_range(): ASSERTION( ext->oe_start >= start && ext->oe_end <= end ) failed Created: 21/Jun/19 Updated: 03/Aug/20 Resolved: 21/Aug/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0, Lustre 2.12.4 |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.12.4 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Oleg Drokin | Assignee: | Patrick Farrell (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||
| Description |
|
This is a new failure in racer happening only in maloo full testning. First crash recorded was on June 1st right after landing this below bunch of patches. On both patched and patchless master builds: This does not appear to happen in any of my testing, with major difference being in my testing I have DOM disabled, but then again the backtrace is not in DOM: Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=4 OSTCOUNT=8 RACER_ENABLE_REMOTE_DIRS=true RACER_ENABLE_STRIPED_DIRS=true RACER_ENABLE_MIGRATION=false RACER_ENABLE_PFL=true RACER_ENABLE_DOM=true RACER_ENABLE_FLR=true LFS=/usr/bin/lfs LCTL=/usr/sbin/lctl
Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=4 OSTCOUNT=8 RACER_ENABLE_REMOTE_DIRS=true RACER_ENABLE_STRIPED_DIRS=true RACER_ENABLE_MIGRATION=false RACER_ENABLE_PFL=true RACER_ENABLE_DOM=true RACER_ENABLE_FLR=true LFS=/usr/bin/lfs LCTL=/usr/sbin/lctl
Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=4 OSTCOUNT=8 RACER_ENABLE_REMOTE_DIRS=true RACER_ENABLE_STRIPED_DIRS=true RACER_ENABLE_MIGRATION=false RACER_ENABLE_PFL=true RACER_ENABLE_DOM=true RACER_ENABLE_FLR=true LFS=/usr/bin/lfs LCTL=/usr/sbin/lctl
Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=4 OSTCOUNT=8 RACER_ENABLE_REMOTE_DIRS=true RACER_ENABLE_STRIPED_DIRS=true RACER_ENABLE_MIGRATION=false RACER_ENABLE_PFL=true RACER_ENABLE_DOM=true RACER_ENABLE_FLR=true LFS=/usr/bin/lfs LCTL=/usr/sbin/lctl
Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=4 OSTCOUNT=8 RACER_ENABLE_REMOTE_DIRS=true RACER_ENABLE_STRIPED_DIRS=true RACER_ENABLE_MIGRATION=false RACER_ENABLE_PFL=true RACER_ENABLE_DOM=true RACER_ENABLE_FLR=true LFS=/usr/bin/lfs LCTL=/usr/sbin/lctl
Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=4 OSTCOUNT=8 RACER_ENABLE_REMOTE_DIRS=true RACER_ENABLE_STRIPED_DIRS=true RACER_ENABLE_MIGRATION=false RACER_ENABLE_PFL=true RACER_ENABLE_DOM=true RACER_ENABLE_FLR=true LFS=/usr/bin/lfs LCTL=/usr/sbin/lctl
Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=4 OSTCOUNT=8 RACER_ENABLE_REMOTE_DIRS=true RACER_ENABLE_STRIPED_DIRS=true RACER_ENABLE_MIGRATION=false RACER_ENABLE_PFL=true RACER_ENABLE_DOM=true RACER_ENABLE_FLR=true LFS=/usr/bin/lfs LCTL=/usr/sbin/lctl
Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=4 OSTCOUNT=8 RACER_ENABLE_REMOTE_DIRS=true RACER_ENABLE_STRIPED_DIRS=true RACER_ENABLE_MIGRATION=false RACER_ENABLE_PFL=true RACER_ENABLE_DOM=true RACER_ENABLE_FLR=true LFS=/usr/bin/lfs LCTL=/usr/sbin/lctl
15[30843]: segfault at 0 ip (null) sp 00007fff2f60f558 error 14 in 15[400000+6000]
19[11605]: segfault at 8 ip 00007fb7cd1f2958 sp 00007fffbcc378f0 error 4 in ld-2.17.so[7fb7cd1e7000+22000]
5[24620]: segfault at 8 ip 00007f432a0fe958 sp 00007fff4e77a210 error 4 in ld-2.17.so[7f432a0f3000+22000]
16[23010]: segfault at 8 ip 00007f4ab1b40958 sp 00007ffebe439030 error 4 in ld-2.17.so[7f4ab1b35000+22000]
LustreError: 3977:0:(osc_cache.c:3035:osc_cache_writeback_range()) extent ffff8eb4a553f4d0@{[0 -> 255/255], [1|0|-|cache|wiY|ffff8eb48a5f3000], [1703936|128|+|-|ffff8eb498770d80|256| (null)]}
LustreError: 3977:0:(osc_cache.c:3035:osc_cache_writeback_range()) ### extent: ffff8eb4a553f4d0 ns: lustre-OST0003-osc-ffff8eb4b8348800 lock: ffff8eb498770d80/0x127dda6c42633b5a lrc: 2/0,0 mode: PW/PW res: [0xb1c:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 524288->18446744073709551615) flags: 0x20080020000 nid: local remote: 0xf6ccd1090fb739f3 expref: -99 pid: 3963 timeout: 0 lvb_type: 1
LustreError: 3977:0:(osc_cache.c:1246:osc_extent_tree_dump0()) Dump object ffff8eb48a5f3000 extents at osc_cache_writeback_range:3035, mppr: 256.
LustreError: 3977:0:(osc_cache.c:1251:osc_extent_tree_dump0()) extent ffff8eb4a553f4d0@{[0 -> 255/255], [1|0|-|cache|wiY|ffff8eb48a5f3000], [1703936|128|+|-|ffff8eb498770d80|256| (null)]} in tree 1.
LustreError: 3977:0:(osc_cache.c:1251:osc_extent_tree_dump0()) ### extent: ffff8eb4a553f4d0 ns: lustre-OST0003-osc-ffff8eb4b8348800 lock: ffff8eb498770d80/0x127dda6c42633b5a lrc: 2/0,0 mode: PW/PW res: [0xb1c:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 524288->18446744073709551615) flags: 0x20080020000 nid: local remote: 0xf6ccd1090fb739f3 expref: -99 pid: 3963 timeout: 0 lvb_type: 1
LustreError: 3977:0:(osc_cache.c:3035:osc_cache_writeback_range()) ASSERTION( ext->oe_start >= start && ext->oe_end <= end ) failed:
LustreError: 3977:0:(osc_cache.c:3035:osc_cache_writeback_range()) LBUG
Pid: 3977, comm: cat 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018
Call Trace:
[<ffffffffc06e47cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[<ffffffffc06e487c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[<ffffffffc0cd0cad>] osc_cache_writeback_range+0xacd/0x1260 [osc]
[<ffffffffc0cbf765>] osc_io_fsync_start+0x85/0x1a0 [osc]
[<ffffffffc087b4c8>] cl_io_start+0x68/0x130 [obdclass]
[<ffffffffc0d10cb7>] lov_io_call.isra.7+0x87/0x140 [lov]
[<ffffffffc0d10e76>] lov_io_start+0x56/0x150 [lov]
[<ffffffffc087b4c8>] cl_io_start+0x68/0x130 [obdclass]
[<ffffffffc087d69c>] cl_io_loop+0xcc/0x1c0 [obdclass]
[<ffffffffc0d77c2b>] cl_sync_file_range+0x2db/0x380 [lustre]
[<ffffffffc0d8d090>] ll_delete_inode+0x150/0x220 [lustre]
[<ffffffff9423c504>] evict+0xb4/0x180
[<ffffffff9423ce0c>] iput+0xfc/0x190
[<ffffffff942376f0>] __dentry_kill+0x120/0x180
[<ffffffff942377f9>] dput+0xa9/0x160
[<ffffffff9422158e>] __fput+0x17e/0x260
[<ffffffff9422175e>] ____fput+0xe/0x10
[<ffffffff940bab8b>] task_work_run+0xbb/0xe0
[<ffffffff9402bc55>] do_notify_resume+0xa5/0xc0
[<ffffffff94725ae4>] int_signal+0x12/0x17
[<ffffffffffffffff>] 0xffffffffffffffff
Very first crash: last crash as of the time of this writing: as of now 13 crashes like this were recorded, all in delete_inode path. |
| Comments |
| Comment by Patrick Farrell (Inactive) [ 24/Jun/19 ] |
|
OK, I've figured this out - There's a signed vs unsigned issue that only comes up with discards and OBD_OBJECT_EOF. Working on a patch. |
| Comment by Gerrit Updater [ 24/Jun/19 ] |
|
Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35302 |
| Comment by Patrick Farrell (Inactive) [ 24/Jun/19 ] |
|
green, jamesanunez : This failure has only been seen in "full" testing, and I'm not sure how to invoke that for my patch. Can one of you do that? |
| Comment by Gerrit Updater [ 27/Jun/19 ] |
|
Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35339 |
| Comment by Gerrit Updater [ 12/Jul/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35339/ |
| Comment by Andriy Skulysh [ 15/Jul/19 ] |
|
The failure isn't limited by signed vs unsigned issue. crash> osc_extent.oe_start,oe_end ffff90895c8b7d10 oe_start = 0 oe_end = 255 ci_fsync = {
fi_start = 262144,
fi_end = -1,
fi_fid = 0xffff908956911b18,
fi_mode = CL_FSYNC_DISCARD,
fi_nr_written = 0
},
00000080:00200000:0.0:1560698103.488527:0:6682:0:(file.c:357:ll_file_release()) VFS Op:inode=[0x200000403:0x1350:0x0](ffff908956911b90)
00010000:00010000:0.0:1560698103.490992:0:12953:0:(ldlm_lockd.c:1715:ldlm_handle_bl_callback()) ### client blocking AST callback handler ns: lustre-MDT0000-mdc-ffff90897640f000 lock: ffff9089ec0ee000/0xeb72e01621e5d8a8 lrc: 3/0,0 mode: PW/PW res: [0x200000403:0x1350:0x0].0x0 bits 0x40/0x0 rrc: 3 type: IBT flags: 0x400080010000 nid: local remote: 0x76683b3d6040c693 expref: -99 pid: 6682 timeout: 0 lvb_type: 0
00010000:00010000:0.0:1560698103.490997:0:12953:0:(ldlm_request.c:1318:ldlm_cli_cancel_local()) ### client-side cancel ns: lustre-MDT0000-mdc-ffff90897640f000 lock: ffff9089ec0ee000/0xeb72e01621e5d8a8 lrc: 4/0,0 mode: PW/PW res: [0x200000403:0x1350:0x0].0x0 bits 0x40/0x40 rrc: 3 type: IBT flags: 0x408480010000 nid: local remote: 0x76683b3d6040c693 expref: -99 pid: 6682 timeout: 0 lvb_type: 0
00000008:00010000:0.0:1560698103.491006:0:12953:0:(osc_cache.c:1113:osc_extent_make_ready()) ### extent: ffff9089f89e2a50 ns: lustre-MDT0000-mdc-ffff90897640f000 lock: ffff9089ec0ee000/0xeb72e01621e5d8a8 lrc: 4/0,0 mode: PW/PW res: [0x200000403:0x1350:0x0].0x0 bits 0x40/0x40 rrc: 3 type: IBT flags: 0x449480010000 nid: local remote: 0x76683b3d6040c693 expref: -99 pid: 6682 timeout: 0 lvb_type: 0
00000008:00010000:0.0:1560698103.491062:0:12953:0:(osc_cache.c:867:osc_extent_finish()) ### extent: ffff9089f89e2a50 ns: lustre-MDT0000-mdc-ffff90897640f000 lock: ffff9089ec0ee000/0xeb72e01621e5d8a8 lrc: 4/0,0 mode: PW/PW res: [0x200000403:0x1350:0x0].0x0 bits 0x40/0x40 rrc: 3 type: IBT flags: 0x449480010000 nid: local remote: 0x76683b3d6040c693 expref: -99 pid: 6682 timeout: 0 lvb_type: 0
00000008:00010000:0.0:1560698103.491085:0:12953:0:(osc_cache.c:489:__osc_extent_remove()) ### extent: ffff9089f89e2a50 ns: lustre-MDT0000-mdc-ffff90897640f000 lock: ffff9089ec0ee000/0xeb72e01621e5d8a8 lrc: 4/0,0 mode: PW/PW res: [0x200000403:0x1350:0x0].0x0 bits 0x40/0x40 rrc: 3 type: IBT flags: 0x449480010000 nid: local remote: 0x76683b3d6040c693 expref: -99 pid: 6682 timeout: 0 lvb_type: 0
00010000:00010000:0.0:1560698103.491227:0:12953:0:(ldlm_lockd.c:1761:ldlm_handle_bl_callback()) ### client blocking callback handler END ns: lustre-MDT0000-mdc-ffff90897640f000 lock: ffff9089ec0ee000/0xeb72e01621e5d8a8 lrc: 1/0,0 mode: --/PW res: [0x200000403:0x1350:0x0].0x0 bits 0x40/0x40 rrc: 3 type: IBT flags: 0x4c49480010000 nid: local remote: 0x76683b3d6040c693 expref: -99 pid: 6682 timeout: 0 lvb_type: 0
00000008:00020000:0.0:1560698103.551596:0:6682:0:(osc_cache.c:3074:osc_cache_writeback_range()) extent ffff90895c8b7d10@\{[0 -> 255/255], [1|0|-|cache|wiY|ffff90895ea34500], [1703936|192|+|-|ffff90895de79440|256| (null)]}
00000008:00020000:0.0:1560698103.563094:0:6682:0:(osc_cache.c:3074:osc_cache_writeback_range()) ### extent: ffff90895c8b7d10 ns: lustre-OST0000-osc-ffff90897640f000 lock: ffff90895de79440/0xeb72e01621e5d8fc lrc: 2/0,0 mode: PW/PW res: [0x511:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 262144->18446744073709551615) flags: 0x20000020000 nid: local remote: 0x5d533c3e80fa6ceb expref: -99 pid: 6682 timeout: 0 lvb_type: 1
00000008:00020000:0.0:1560698103.579593:0:6682:0:(osc_cache.c:1216:osc_extent_tree_dump0()) Dump object ffff90895ea34500 extents at osc_cache_writeback_range:3074, mppr: 256.
00000008:00020000:0.0:1560698103.586774:0:6682:0:(osc_cache.c:1221:osc_extent_tree_dump0()) extent ffff90895c8b7d10@\{[0 -> 255/255], [1|0|-|cache|wiY|ffff90895ea34500], [1703936|192|+|-|ffff90895de79440|256| (null)]} in tree 1.
00000008:00020000:0.0:1560698103.595536:0:6682:0:(osc_cache.c:1221:osc_extent_tree_dump0()) ### extent: ffff90895c8b7d10 ns: lustre-OST0000-osc-ffff90897640f000 lock: ffff90895de79440/0xeb72e01621e5d8fc lrc: 2/0,0 mode: PW/PW res: [0x511:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 262144->18446744073709551615) flags: 0x20000020000 nid: local remote: 0x5d533c3e80fa6ceb expref: -99 pid: 6682 timeout: 0 lvb_type: 1
00000008:00040000:0.0:1560698103.611993:0:6682:0:(osc_cache.c:3074:osc_cache_writeback_range()) ASSERTION( ext->oe_start >= start && ext->oe_end <= end ) failed:
|
| Comment by Patrick Farrell (Inactive) [ 15/Jul/19 ] |
|
Ah, good catch, askulysh. Do you have a thought on how to fix this? I can think of several approaches, but I'm struggling to think what's best. 1. It's not really necessary for the start of the first OSC extent to be chunk aligned, as long as it's smaller than one chunk. (I think?) I like (1), but it would require making the OSC layer aware that it's the first extent. So that would require passing some info down, might be complicated. tappro, if you have an opinion too... |
| Comment by Mikhail Pershin [ 15/Jul/19 ] |
|
Ah, it is not in chunks but aligned to that, right? OK, what if we will align start/end inside osc_cache_writeback_range() also by cl_chunkbits before using them? |
| Comment by Patrick Farrell (Inactive) [ 15/Jul/19 ] |
|
Oh, that's a good idea! I like that, I'll give it a try. |
| Comment by Gerrit Updater [ 15/Jul/19 ] |
|
Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35525 |
| Comment by Gerrit Updater [ 24/Jul/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35525/ |
| Comment by Gerrit Updater [ 21/Aug/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35733/ |
| Comment by Peter Jones [ 21/Aug/19 ] |
|
Seems like everything has landed for 2.13 |
| Comment by Gerrit Updater [ 26/Nov/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36875 |
| Comment by Gerrit Updater [ 26/Nov/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36876 |
| Comment by Gerrit Updater [ 26/Nov/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36877 |
| Comment by Gerrit Updater [ 20/Dec/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36876/ |
| Comment by Gerrit Updater [ 20/Dec/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36877/ |