Description
If mdd_create_data() fails after calling mdd_object_make_hint() then the lod object will have ldo_stripenr set. A second call to mdd_create_data() will then trigger the failed assertion in lod_ah_init().
This was found by memory allocation fault injection.
llmount.sh cd /mnt/lustre sys_mknod f0 echo /root/lustre-release/lustre/osd-ldiskfs/osd_handler.c:945 > /proc/fs/lustre/alloc_fail # fail to allocate the transaction in osd_trans_create() sys_open f0 w # open("f0", O_WRONLY) sys_open f0 w # open("f0", O_WRONLY)
[ 338.633638] LustreError: 5233:0:(lod_object.c:2606:lod_ah_init()) ASSERTION( lc->ldo_stripenr == 0 ) failed: [ 338.635476] LustreError: 5233:0:(lod_object.c:2606:lod_ah_init()) LBUG [ 338.636652] Pid: 5233, comm: mdt00_005 [ 338.637354] [ 338.637355] Call Trace: [ 338.638094] [<ffffffffa02be8c5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] [ 338.639359] [<ffffffffa02beec7>] lbug_with_loc+0x47/0xb0 [libcfs] [ 338.640490] [<ffffffffa0d5e43c>] lod_ah_init+0xadc/0xb40 [lod] [ 338.641591] [<ffffffffa0817339>] mdd_object_make_hint+0x139/0x180 [mdd] [ 338.642804] [<ffffffffa08062a9>] mdd_create_data+0x359/0x7f0 [mdd] [ 338.644047] [<ffffffffa0cc67dd>] mdt_mfd_open+0xc1d/0xf40 [mdt] [ 338.645182] [<ffffffffa0cc6d00>] mdt_finish_open+0x200/0xc50 [mdt] [ 338.646351] [<ffffffffa0cc1255>] ? mdt_object_open_lock+0x345/0x9d0 [mdt] [ 338.647603] [<ffffffffa0cc7cce>] mdt_open_by_fid_lock+0x57e/0x900 [mdt] [ 338.648821] [<ffffffffa0cc8c28>] mdt_reint_open+0x8c8/0x20b0 [mdt] [ 338.649974] [<ffffffff815547cb>] ? _spin_unlock+0x2b/0x40 [ 338.650985] [<ffffffffa02dc60c>] ? upcall_cache_get_entry+0x3dc/0x8a0 [libcfs] [ 338.652347] [<ffffffffa0475890>] ? lu_ucred+0x20/0x30 [obdclass] [ 338.653490] [<ffffffffa0c93b85>] ? mdt_ucred+0x15/0x20 [mdt] [ 338.654580] [<ffffffffa0cac90c>] ? mdt_root_squash+0x2c/0x3f0 [mdt] [ 338.655786] [<ffffffffa06b4236>] ? __req_capsule_get+0x166/0x6e0 [ptlrpc] [ 338.657037] [<ffffffffa0cb07a1>] mdt_reint_rec+0x41/0xe0 [mdt] [ 338.658132] [<ffffffffa0c9baf3>] mdt_reint_internal+0x4c3/0x7c0 [mdt] [ 338.659314] [<ffffffffa0c9bfe6>] mdt_intent_reint+0x1f6/0x520 [mdt] [ 338.660473] [<ffffffffa0c9a6c9>] mdt_intent_policy+0x499/0xca0 [mdt] [ 338.661691] [<ffffffffa0645422>] ldlm_lock_enqueue+0x302/0x920 [ptlrpc] [ 338.662931] [<ffffffffa066d651>] ldlm_handle_enqueue0+0x341/0x11e0 [ptlrpc] [ 338.664249] [<ffffffffa06ec9a2>] tgt_enqueue+0x62/0x1d0 [ptlrpc] [ 338.665403] [<ffffffffa06ebc35>] tgt_request_handle+0x245/0xad0 [ptlrpc] [ 338.666663] [<ffffffffa069ed91>] ptlrpc_main+0xcf1/0x1880 [ptlrpc] [ 338.667841] [<ffffffffa069e0a0>] ? ptlrpc_main+0x0/0x1880 [ptlrpc] [ 338.668982] [<ffffffff8109eab6>] kthread+0x96/0xa0 [ 338.669902] [<ffffffff8100c30a>] child_rip+0xa/0x20 [ 338.670806] [<ffffffff81554710>] ? _spin_unlock_irq+0x30/0x40 [ 338.671868] [<ffffffff8100bb10>] ? restore_args+0x0/0x30 [ 338.672848] [<ffffffff8109ea20>] ? kthread+0x0/0xa0 [ 338.673761] [<ffffffff8100c300>] ? child_rip+0x0/0x20 [ 338.674690]
Attachments
Issue Links
- mentioned in
-
Page No Confluence page found with the given URL.
iirc, we discussed a related issue in the past - currently locks are supposed to be taken "inside" transaction (still an open question how good is that, but benefits are known). this mean that intermediate state between "declare" and "execute" is exposed to other threads. which isn't good, of course. Ideally "declare" should be working with own copy of data (say, stored somewhere in transaction which is private), while "execute" should use that private data and apply changes atomically. this would solve this issue as well because that private data would be released automatically with ->do_trans_stop(). I'll think of that (again, sorry).