Details
Bug
Resolution: Fixed
Minor
None
3
Description
A site has encountered multiple crashes with the same signature, i.e. the following stack and messages:
LustreError: 89879:0:(osp_precreate.c:1222:osp_object_truncate()) can't punch object: -11
Lustre: composit-OST0009-osc-MDT0000: Connection to composit-OST0009 (at 10.0.14.31@o2ib) was lost; in progress operations using this service will wait for recovery to complete
LustreError: 89879:0:(lod_object.c:700:lod_ah_init()) ASSERTION( lc->ldo_stripenr == 0 ) failed:
LustreError: 89879:0:(lod_object.c:700:lod_ah_init()) LBUG
Pid: 89879, comm: mdt01_006
Call Trace:
[<ffffffffa057e895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa057ee97>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa266c0af>] lod_ah_init+0x58f/0x5d0 [lod]
[<ffffffffa26c7ad3>] mdd_object_make_hint+0x83/0xa0 [mdd]
[<ffffffffa26d4502>] mdd_create_data+0x332/0x7d0 [mdd]
[<ffffffffa25a93f0>] mdt_finish_open+0x1350/0x19a0 [mdt]
[<ffffffffa257e5f4>] ? mdt_object_lock+0x14/0x20 [mdt]
[<ffffffffa25a9fbd>] mdt_open_by_fid_lock+0x57d/0x910 [mdt]
[<ffffffffa25aabac>] mdt_reint_open+0x56c/0x21a0 [mdt]
[<ffffffffa059b14c>] ? upcall_cache_get_entry+0x29c/0x890 [libcfs]
[<ffffffffa0983930>] ? lu_ucred+0x20/0x30 [obdclass]
[<ffffffffa2572945>] ? mdt_ucred+0x15/0x20 [mdt]
[<ffffffffa258f8ec>] ? mdt_root_squash+0x2c/0x410 [mdt]
[<ffffffffa123bad6>] ? __req_capsule_get+0x166/0x710 [ptlrpc]
[<ffffffffa2593ab1>] mdt_reint_rec+0x41/0xe0 [mdt]
[<ffffffffa2578f83>] mdt_reint_internal+0x4c3/0x780 [mdt]
[<ffffffffa257950e>] mdt_intent_reint+0x1ee/0x520 [mdt]
[<ffffffffa2576cee>] mdt_intent_policy+0x3ae/0x770 [mdt]
[<ffffffffa11ca2f5>] ldlm_lock_enqueue+0x135/0x980 [ptlrpc]
[<ffffffffa11f43fb>] ldlm_handle_enqueue0+0x51b/0x10c0 [ptlrpc]
[<ffffffffa25771b6>] mdt_enqueue+0x46/0xe0 [mdt]
[<ffffffffa257c84a>] mdt_handle_common+0x52a/0x1470 [mdt]
[<ffffffffa25b98f5>] mds_regular_handle+0x15/0x20 [mdt]
[<ffffffffa12238d5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
[<ffffffffa05904fa>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
[<ffffffffa121c289>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
[<ffffffff81057849>] ? __wake_up_common+0x59/0x90
[<ffffffffa122605d>] ptlrpc_main+0xaed/0x1780 [ptlrpc]
[<ffffffffa1225570>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
[<ffffffff8109e78e>] kthread+0x9e/0xc0
[<ffffffff8100c28a>] child_rip+0xa/0x20
[<ffffffff8109e6f0>] ? kthread+0x0/0xc0
[<ffffffff8100c280>] ? child_rip+0x0/0x20
Searching existing tickets, I found that this kind of problem has already been (partially?) addressed in LU-4260, LU-4791 and LU-5346.
Since the fixes for both LU-4260 and LU-4791 are already integrated, we must be hitting a new situation during OST object pre-creation. It is likely caused by a specific file metadata pattern, which I have identified as use of the "deferred layout" feature, i.e. open(, ...|O_LOV_DELAY_CREATE|...,) together with a non-zero truncate() that triggers object preallocation. This leads to a case similar to the one described in LU-5346, on an error return path that is still not fixed.
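For reference, the client-side access pattern suspected above can be sketched as follows. This is only an illustration, not a confirmed reproducer: the helper name delayed_create_then_truncate() is hypothetical, the ticket does not say whether the site used truncate() on the path or ftruncate() on the descriptor (ftruncate() is assumed here), and O_LOV_DELAY_CREATE is assumed to come from the Lustre user headers, with a fallback of 0 so the sketch still compiles elsewhere.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* O_LOV_DELAY_CREATE is normally provided by the Lustre user headers.
 * ASSUMPTION: without them we fall back to 0, which means a plain open()
 * with no layout deferral; the sketch then only shows the call sequence. */
#ifndef O_LOV_DELAY_CREATE
#define O_LOV_DELAY_CREATE 0
#endif

/* Hypothetical helper: create a file with deferred layout, then truncate it
 * to a non-zero size, which forces the OST objects to be allocated.
 * Returns the resulting file size, or -1 on any error. */
off_t delayed_create_then_truncate(const char *path, off_t size)
{
    struct stat st;
    int fd;

    assert(path != NULL);

    /* Step 1: open with the layout (and OST objects) left uninstantiated. */
    fd = open(path, O_CREAT | O_WRONLY | O_LOV_DELAY_CREATE, 0644);
    if (fd < 0)
        return -1;

    /* Step 2: non-zero truncate on the file that has no objects yet. */
    if (ftruncate(fd, size) < 0 || fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }

    if (close(fd) < 0)
        return -1;
    return st.st_size;
}
```

Calling, for example, delayed_create_then_truncate("/mnt/lustre/f", 4096) on a Lustre client (the mount path is an assumption) exercises the two steps in order. Per the observations in this ticket, the crash was only seen when an OSS had just failed.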
BTW, I have also determined that these MDT asserts always occur just after an OSS crash, hence the -EAGAIN/EWOULDBLOCK error in the "(osp_precreate.c:1222:osp_object_truncate()) can't punch object: -11" message that immediately precedes the assert.
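To make the -11 reading explicit: kernel code returns negated errno values, and on Linux errno 11 is EAGAIN (the same value as EWOULDBLOCK). The small sketch below pulls the trailing return code out of such a console message and checks it. Both helper names are hypothetical and are not part of Lustre.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return the number that follows the last ':' in a
 * console message such as "... can't punch object: -11".
 * Returns 0 when the message has no ':' or no number after it. */
int trailing_rc(const char *msg)
{
    const char *colon;

    assert(msg != NULL);
    colon = strrchr(msg, ':');
    if (colon == NULL)
        return 0;
    /* strtol skips the leading space and parses the sign. */
    return (int)strtol(colon + 1, NULL, 10);
}

/* Kernel-style return codes are negated errno values, so negate the
 * code before comparing it with EAGAIN / EWOULDBLOCK. */
int is_eagain(int rc)
{
    return -rc == EAGAIN || -rc == EWOULDBLOCK;
}
```

Applied to the message quoted above, trailing_rc() yields -11 and is_eagain(-11) is true on Linux, which matches the "try again later" condition expected while the OST is unreachable.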