[LU-4260] ASSERTION( lc->ldo_stripenr == 0 ) failed: Created: 15/Nov/13 Updated: 29/Apr/14 Resolved: 11/Feb/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.1 |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.5.1 |
| Type: | Bug | Priority: | Major |
| Reporter: | Minh Diep | Assignee: | Di Wang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | mn4, sdsc |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 11624 |
| Description |
|
Servers and clients are 2.4.1, configured with active-active failover; one of the two clients is 1.8.9. As soon as I wrote a file to a remote directory, the second MDS crashed.

Nov 15 09:00:06 lustre-mds-0-1 kernel: LustreError: 20726:0:(lod_lov.c:554:lod_generate_and_set_lovea()) rhino-MDT0001-mdtlov: Can not locate [0x640000bd0:0x22:0x0]: rc = -5 |
| Comments |
| Comment by Minh Diep [ 15/Nov/13 ] |
|
is this similar to |
| Comment by Peter Jones [ 15/Nov/13 ] |
|
Di, could you please assist with this one? Thanks, Peter |
| Comment by Oleg Drokin [ 16/Nov/13 ] |
|
I suspect this is Or, after another look, it might not be, but there are certainly several crashes logged in there with different backtraces. |
| Comment by Andreas Dilger [ 16/Nov/13 ] |
|
Minh, could you please try the patches referenced by |
| Comment by Andreas Dilger [ 16/Nov/13 ] |
|
Minh, it isn't possible to use clients < 2.4.0 with multiple MDTs. There definitely shouldn't be an LASSERT() failure on the MDS; instead it should return an error to the client. |
| Comment by Di Wang [ 16/Nov/13 ] |
|
Hmm, I checked the debug log (which Minh collected for me). It seems the newly created OST sequence is somehow not being inserted into the FLDB, which leaves garbage stripe_info (of the lod object) in memory, and then the LBUG is hit. So we need to clean up the stripe_info of the lod object when an error happens. |
| Comment by Minh Diep [ 18/Nov/13 ] |
|
Andreas, clients < 2.4.0 can still mount and use MDT0; only 2.4.0+ clients can access all MDTs. This was hit when a 2.4.1 client accessed the remote directory. |
| Comment by Di Wang [ 19/Nov/13 ] |
|
Those OST sequences seem to have been added accidentally during a strange upgrade process (though I do not know how to reproduce it). Regardless, we still need to clean up the stripe_info of the lod object when an error happens, to avoid the LBUG. Here is the patch: http://review.whamcloud.com/8325 (b2_4) |
| Comment by Peter Jones [ 11/Feb/14 ] |
|
Landed for 2.6 |
| Comment by Jinshan Xiong (Inactive) [ 29/Apr/14 ] |
|
Not sure if this is the same issue, but I'm still seeing the crash with exactly the same call stack on latest master.

LustreError: 20580:0:(lod_object.c:1475:lod_ah_init()) ASSERTION( lc->ldo_stripenr == 0 ) failed:
LustreError: 20580:0:(lod_object.c:1475:lod_ah_init()) LBUG
Pid: 20580, comm: mdt01_008
Call Trace:
[<ffffffffa03a3895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa03a3e97>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa0e098de>] lod_ah_init+0xaae/0xb80 [lod]
[<ffffffffa0ce1b09>] mdd_object_make_hint+0x139/0x180 [mdd]
[<ffffffffa0cd12f9>] mdd_create_data+0x359/0x7f0 [mdd]
[<ffffffffa075c4e0>] ? lustre_swab_mdt_body+0x0/0x140 [ptlrpc]
[<ffffffffa0d4afdb>] mdt_mfd_open+0xc8b/0xf10 [mdt]
[<ffffffffa0e033a3>] ? lod_xattr_get+0x153/0x420 [lod]
[<ffffffffa0d4c253>] mdt_finish_open+0x553/0xc20 [mdt]
[<ffffffffa0d46383>] ? mdt_object_open_lock+0x2f3/0x9c0 [mdt]
[<ffffffffa0d4e76f>] mdt_reint_open+0x12af/0x2130 [mdt]
[<ffffffffa03c11c6>] ? upcall_cache_get_entry+0x296/0x880 [libcfs]
[<ffffffffa0550310>] ? lu_ucred+0x20/0x30 [obdclass]
[<ffffffffa0d36851>] mdt_reint_rec+0x41/0xe0 [mdt]
[<ffffffffa0d1be13>] mdt_reint_internal+0x4c3/0x7c0 [mdt]
[<ffffffffa0d1c308>] mdt_intent_reint+0x1f8/0x520 [mdt]
[<ffffffffa0d1a9e9>] mdt_intent_policy+0x499/0xca0 [mdt]
[<ffffffff81168742>] ? kmem_cache_alloc+0x182/0x190
[<ffffffffa070f809>] ldlm_lock_enqueue+0x359/0x920 [ptlrpc]
[<ffffffffa0738c6f>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
[<ffffffff8103c7d8>] ? pvclock_clocksource_read+0x58/0xd0
[<ffffffffa07bb022>] tgt_enqueue+0x62/0x1d0 [ptlrpc]
[<ffffffffa07bb3cc>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
[<ffffffffa075a01c>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
[<ffffffffa076a5ca>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
[<ffffffff810096f0>] ? __switch_to+0xd0/0x320
[<ffffffff8150e600>] ? thread_return+0x4e/0x76e
[<ffffffffa07698b0>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
[<ffffffff81096a36>] kthread+0x96/0xa0
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffff810969a0>] ? kthread+0x0/0xa0
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20

I have a core dump so feel free to ask for it. |