Details
Bug
Resolution: Fixed
Critical
Lustre 2.13.0, Lustre 2.12.4
None
3
9223372036854775807
Description
This is a new failure in racer that happens only in Maloo full testing, on both patched and patchless master builds. The first crash was recorded on June 1st, right after the following batch of patches landed:
LU-12034 obdclass: put all service's env on the list
LU-11359 mdt: fix mdt_dom_discard_data() timeouts
LU-9846 lod: Add overstriping support
LU-11893 o2iblnd: add secondary IP address handling
LU-11526 rpc: support maximum 64MB I/O RPC
LU-11872 utils: don't follow link files in default
LU-10894 dom: per-resource ELC for WRITE lock enqueue
LU-10894 dom: mdc_lock_flush() improvement
LU-11946 build: no zlib check during configure --enable-dist
LU-11946 build: no yaml check during configure --enable-dist
LU-12302 lnet: Fix NI status in proc for loopback ni
LU-12333 ptlrpc: Add more flags to DEBUG_REQ_FLAGS macro
LU-12342 spec: mark lsvcgss as a config file in the rpm
LU-12350 tests: Do not use background failover
LU-11089 obd: use wait_event_var() in lu_context_key_degister()
LU-11089 obd: remove lock from key register/degister
LU-6142 ldlm: Fix style issues for ldlm_plain.c
LU-12323 libcfs: check if save_stack_trace_tsk is exported
LU-10467 ptlrpc: discard SVC_SIGNAL and related functions
LU-12345 ldiskfs: optimize nodelalloc mode
LU-11041 kernel: Enable tons of kernel debug options
LU-10948 llite: Revalidate dentries in ll_intent_file_open
This does not appear to happen in any of my own testing. The major difference is that I have DOM disabled in my testing, but then again the backtrace is not in DOM code:
Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=4 OSTCOUNT=8 RACER_ENABLE_REMOTE_DIRS=true RACER_ENABLE_STRIPED_DIRS=true RACER_ENABLE_MIGRATION=false RACER_ENABLE_PFL=true RACER_ENABLE_DOM=true RACER_ENABLE_FLR=true LFS=/usr/bin/lfs LCTL=/usr/sbin/lctl
Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=4 OSTCOUNT=8 RACER_ENABLE_REMOTE_DIRS=true RACER_ENABLE_STRIPED_DIRS=true RACER_ENABLE_MIGRATION=false RACER_ENABLE_PFL=true RACER_ENABLE_DOM=true RACER_ENABLE_FLR=true LFS=/usr/bin/lfs LCTL=/usr/sbin/lctl
Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=4 OSTCOUNT=8 RACER_ENABLE_REMOTE_DIRS=true RACER_ENABLE_STRIPED_DIRS=true RACER_ENABLE_MIGRATION=false RACER_ENABLE_PFL=true RACER_ENABLE_DOM=true RACER_ENABLE_FLR=true LFS=/usr/bin/lfs LCTL=/usr/sbin/lctl
Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=4 OSTCOUNT=8 RACER_ENABLE_REMOTE_DIRS=true RACER_ENABLE_STRIPED_DIRS=true RACER_ENABLE_MIGRATION=false RACER_ENABLE_PFL=true RACER_ENABLE_DOM=true RACER_ENABLE_FLR=true LFS=/usr/bin/lfs LCTL=/usr/sbin/lctl
Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=4 OSTCOUNT=8 RACER_ENABLE_REMOTE_DIRS=true RACER_ENABLE_STRIPED_DIRS=true RACER_ENABLE_MIGRATION=false RACER_ENABLE_PFL=true RACER_ENABLE_DOM=true RACER_ENABLE_FLR=true LFS=/usr/bin/lfs LCTL=/usr/sbin/lctl
Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=4 OSTCOUNT=8 RACER_ENABLE_REMOTE_DIRS=true RACER_ENABLE_STRIPED_DIRS=true RACER_ENABLE_MIGRATION=false RACER_ENABLE_PFL=true RACER_ENABLE_DOM=true RACER_ENABLE_FLR=true LFS=/usr/bin/lfs LCTL=/usr/sbin/lctl
Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=4 OSTCOUNT=8 RACER_ENABLE_REMOTE_DIRS=true RACER_ENABLE_STRIPED_DIRS=true RACER_ENABLE_MIGRATION=false RACER_ENABLE_PFL=true RACER_ENABLE_DOM=true RACER_ENABLE_FLR=true LFS=/usr/bin/lfs LCTL=/usr/sbin/lctl
Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=4 OSTCOUNT=8 RACER_ENABLE_REMOTE_DIRS=true RACER_ENABLE_STRIPED_DIRS=true RACER_ENABLE_MIGRATION=false RACER_ENABLE_PFL=true RACER_ENABLE_DOM=true RACER_ENABLE_FLR=true LFS=/usr/bin/lfs LCTL=/usr/sbin/lctl
15[30843]: segfault at 0 ip (null) sp 00007fff2f60f558 error 14 in 15[400000+6000]
19[11605]: segfault at 8 ip 00007fb7cd1f2958 sp 00007fffbcc378f0 error 4 in ld-2.17.so[7fb7cd1e7000+22000]
5[24620]: segfault at 8 ip 00007f432a0fe958 sp 00007fff4e77a210 error 4 in ld-2.17.so[7f432a0f3000+22000]
16[23010]: segfault at 8 ip 00007f4ab1b40958 sp 00007ffebe439030 error 4 in ld-2.17.so[7f4ab1b35000+22000]
LustreError: 3977:0:(osc_cache.c:3035:osc_cache_writeback_range()) extent ffff8eb4a553f4d0@{[0 -> 255/255], [1|0|-|cache|wiY|ffff8eb48a5f3000], [1703936|128|+|-|ffff8eb498770d80|256| (null)]}
LustreError: 3977:0:(osc_cache.c:3035:osc_cache_writeback_range()) ### extent: ffff8eb4a553f4d0 ns: lustre-OST0003-osc-ffff8eb4b8348800 lock: ffff8eb498770d80/0x127dda6c42633b5a lrc: 2/0,0 mode: PW/PW res: [0xb1c:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 524288->18446744073709551615) flags: 0x20080020000 nid: local remote: 0xf6ccd1090fb739f3 expref: -99 pid: 3963 timeout: 0 lvb_type: 1
LustreError: 3977:0:(osc_cache.c:1246:osc_extent_tree_dump0()) Dump object ffff8eb48a5f3000 extents at osc_cache_writeback_range:3035, mppr: 256.
LustreError: 3977:0:(osc_cache.c:1251:osc_extent_tree_dump0()) extent ffff8eb4a553f4d0@{[0 -> 255/255], [1|0|-|cache|wiY|ffff8eb48a5f3000], [1703936|128|+|-|ffff8eb498770d80|256| (null)]} in tree 1.
LustreError: 3977:0:(osc_cache.c:1251:osc_extent_tree_dump0()) ### extent: ffff8eb4a553f4d0 ns: lustre-OST0003-osc-ffff8eb4b8348800 lock: ffff8eb498770d80/0x127dda6c42633b5a lrc: 2/0,0 mode: PW/PW res: [0xb1c:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 524288->18446744073709551615) flags: 0x20080020000 nid: local remote: 0xf6ccd1090fb739f3 expref: -99 pid: 3963 timeout: 0 lvb_type: 1
LustreError: 3977:0:(osc_cache.c:3035:osc_cache_writeback_range()) ASSERTION( ext->oe_start >= start && ext->oe_end <= end ) failed:
LustreError: 3977:0:(osc_cache.c:3035:osc_cache_writeback_range()) LBUG
Pid: 3977, comm: cat 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018
Call Trace:
[<ffffffffc06e47cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[<ffffffffc06e487c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[<ffffffffc0cd0cad>] osc_cache_writeback_range+0xacd/0x1260 [osc]
[<ffffffffc0cbf765>] osc_io_fsync_start+0x85/0x1a0 [osc]
[<ffffffffc087b4c8>] cl_io_start+0x68/0x130 [obdclass]
[<ffffffffc0d10cb7>] lov_io_call.isra.7+0x87/0x140 [lov]
[<ffffffffc0d10e76>] lov_io_start+0x56/0x150 [lov]
[<ffffffffc087b4c8>] cl_io_start+0x68/0x130 [obdclass]
[<ffffffffc087d69c>] cl_io_loop+0xcc/0x1c0 [obdclass]
[<ffffffffc0d77c2b>] cl_sync_file_range+0x2db/0x380 [lustre]
[<ffffffffc0d8d090>] ll_delete_inode+0x150/0x220 [lustre]
[<ffffffff9423c504>] evict+0xb4/0x180
[<ffffffff9423ce0c>] iput+0xfc/0x190
[<ffffffff942376f0>] __dentry_kill+0x120/0x180
[<ffffffff942377f9>] dput+0xa9/0x160
[<ffffffff9422158e>] __fput+0x17e/0x260
[<ffffffff9422175e>] ____fput+0xe/0x10
[<ffffffff940bab8b>] task_work_run+0xbb/0xe0
[<ffffffff9402bc55>] do_notify_resume+0xa5/0xc0
[<ffffffff94725ae4>] int_signal+0x12/0x17
[<ffffffffffffffff>] 0xffffffffffffffff
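For readers unfamiliar with the failed check, here is a minimal Python sketch of the condition in the ASSERTION line, ext->oe_start >= start && ext->oe_end <= end. The extent bounds 0 and 255 are taken from the extent dump in the log; the start/end values of the requested writeback range are not visible in the log, so the two ranges below are purely hypothetical, and the helper name extent_within_range is made up for illustration.

```python
def extent_within_range(oe_start, oe_end, start, end):
    """Mirror of the asserted condition:
    ext->oe_start >= start && ext->oe_end <= end
    """
    return oe_start >= start and oe_end <= end


# Extent bounds as printed in the log: [0 -> 255/255]
EXT_START, EXT_END = 0, 255

# Hypothetical writeback ranges (the real start/end are not in the log).
whole_range = (0, 2**64 - 1)   # range covering everything
partial_range = (0, 127)       # range that ends inside the extent

print(extent_within_range(EXT_START, EXT_END, *whole_range))    # True
print(extent_within_range(EXT_START, EXT_END, *partial_range))  # False
```

The assertion fires in the second kind of situation: the extent found in the tree is not fully contained in the range being written back.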
Very first crash:
https://testing.whamcloud.com/test_sessions/2d05c3e5-2252-4056-8082-3c9183b4aef4
Last crash as of this writing:
https://testing.whamcloud.com/test_sets/35838dd8-93da-11e9-a1a3-52540065bddc
As of now, 13 crashes like this have been recorded, all in the delete_inode path.
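When triaging further crash logs, a small script can confirm whether a backtrace goes through the same path. The following is only a sketch, assuming frames are formatted like the ones in the trace above ("[<address>] symbol+0xoff/0xsize [module]"); the names FRAME_RE, trace_frames and in_delete_inode_path are hypothetical helpers, not part of any Lustre or Maloo tooling.

```python
import re

# One frame looks like: [<ffffffffc0d8d090>] ll_delete_inode+0x150/0x220 [lustre]
# The trailing [module] is absent for core kernel symbols.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\S+?)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def trace_frames(log_text):
    """Return (symbol, module) pairs in the order they appear in the log."""
    return [(m.group(1), m.group(2)) for m in FRAME_RE.finditer(log_text)]


def in_delete_inode_path(log_text):
    """True if the trace hits osc_cache_writeback_range via ll_delete_inode."""
    names = [name for name, _module in trace_frames(log_text)]
    return "osc_cache_writeback_range" in names and "ll_delete_inode" in names


# A few frames copied from the backtrace in this ticket.
SAMPLE = (
    "[<ffffffffc0cd0cad>] osc_cache_writeback_range+0xacd/0x1260 [osc] "
    "[<ffffffffc0d77c2b>] cl_sync_file_range+0x2db/0x380 [lustre] "
    "[<ffffffffc0d8d090>] ll_delete_inode+0x150/0x220 [lustre] "
    "[<ffffffff9423c504>] evict+0xb4/0x180"
)

print(in_delete_inode_path(SAMPLE))
```

The regular expression matches frames whether they are on separate lines or run together on one line, as they are in some console log captures.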