[LU-4260] ASSERTION( lc->ldo_stripenr == 0 ) failed: Created: 15/Nov/13  Updated: 29/Apr/14  Resolved: 11/Feb/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1
Fix Version/s: Lustre 2.6.0, Lustre 2.5.1

Type: Bug Priority: Major
Reporter: Minh Diep Assignee: Di Wang
Resolution: Fixed Votes: 0
Labels: mn4, sdsc

Issue Links:
Related
is related to LU-2789 lod_load_striping()) ASSERTION( lo->l... Closed
is related to LU-4623 creating file stripe > 167 fails Resolved
Severity: 3
Rank (Obsolete): 11624

 Description   

Servers and clients are 2.4.1 configured active-active failover

one out of two clients is 1.8.9

As soon as I wrote a file to a remote directory, the second MDS crashed.

Nov 15 09:00:06 lustre-mds-0-1 kernel: LustreError: 20726:0:(lod_lov.c:554:lod_generate_and_set_lovea()) rhino-MDT0001-mdtlov: Can not locate [0x640000bd0:0x22:0x0]: rc = -5
Nov 15 09:00:06 lustre-mds-0-1 kernel: LustreError: 20726:0:(lod_object.c:704:lod_ah_init()) ASSERTION( lc->ldo_stripenr == 0 ) failed:
Nov 15 09:00:06 lustre-mds-0-1 kernel: LustreError: 20726:0:(lod_object.c:704:lod_ah_init()) LBUG
Nov 15 09:00:06 lustre-mds-0-1 kernel: Pid: 20726, comm: mdt03_000
Nov 15 09:00:06 lustre-mds-0-1 kernel:
Nov 15 09:00:06 lustre-mds-0-1 kernel: Call Trace:
Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0349895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0349e97>] lbug_with_loc+0x47/0xb0 [libcfs]
Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0e3f78f>] lod_ah_init+0x57f/0x5c0 [lod]
Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0b73a83>] mdd_object_make_hint+0x83/0xa0 [mdd]
Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0b7feb2>] mdd_create_data+0x332/0x7d0 [mdd]
Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0d9cc2c>] mdt_finish_open+0x125c/0x18a0 [mdt]
Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0d984f8>] ? mdt_object_open_lock+0x1c8/0x510 [mdt]
Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0d9ee8d>] mdt_reint_open+0x115d/0x20c0 [mdt]
Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa036682e>] ? upcall_cache_get_entry+0x28e/0x860 [libcfs]
Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa071fdcc>] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc]
Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0d89911>] mdt_reint_rec+0x41/0xe0 [mdt]
Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0d6eae3>] mdt_reint_internal+0x4c3/0x780 [mdt]
Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0d6f06d>] mdt_intent_reint+0x1ed/0x520 [mdt]
Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0d6cf1e>] mdt_intent_policy+0x39e/0x720 [mdt]
Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa06d7831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa06fe1ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0d6d3a6>] mdt_enqueue+0x46/0xe0 [mdt]
Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0d73a97>] mdt_handle_common+0x647/0x16d0 [mdt]
Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0720bac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0dad3f5>] mds_regular_handle+0x15/0x20 [mdt]
Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa07303c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa034a5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa035bd9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0727729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa073175e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0730c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0730c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0730c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Nov 15 09:00:06 lustre-mds-0-1 kernel:



 Comments   
Comment by Minh Diep [ 15/Nov/13 ]

is this similar to LU-4226?

Comment by Peter Jones [ 15/Nov/13 ]

Di

Could you please assist with this one?

Thanks

Peter

Comment by Oleg Drokin [ 16/Nov/13 ]

I suspect this is LU-2789 or closely related.

Or, on another look, maybe not; but there are certainly several crashes logged in there with different backtraces.

Comment by Andreas Dilger [ 16/Nov/13 ]

Minh, could you please try the patches referenced by LU-2789? We aren't sure if this is a duplicate or not. That one relates to a race condition, which I don't think is the case here.

Comment by Andreas Dilger [ 16/Nov/13 ]

Minh, it isn't possible to use clients < 2.4.0 with multiple MDTs. There definitely shouldn't be an LASSERT() failure on the MDS, but it should return an error to the client.

Comment by Di Wang [ 16/Nov/13 ]

Hmm, I checked the debug log (which Minh collected for me); it seems the newly created OST sequence is somehow not being inserted into the FLDB, which leaves some garbage stripe_info (of the lod object) in memory, and then the LBUG is hit. So we need to:

1. Clean up the stripe_info of the lod object when an error happens.
2. Figure out why the OST sequence is not being inserted into the FLDB during the upgrade process.

Comment by Minh Diep [ 18/Nov/13 ]

Andreas, clients < 2.4.0 can still mount and use MDT0. Only 2.4.0+ clients can access all MDTs. This was hit when a 2.4.1 client accessed the remote directory.

Comment by Di Wang [ 19/Nov/13 ]

Those OST sequences seem to have been added accidentally during a broken upgrade process (though I do not know how to reproduce it). After we removed seq_srv on the OST (i.e. the local sequence file on the OST, used only for DNE) and restarted the OST, the OST re-acquired a new metadata sequence from MDT0, and everything works fine.

We still need to clean up the stripe_info of the lod object when an error happens, to avoid the LBUG. Here are the patches:

http://review.whamcloud.com/8325 (b2_4)
http://review.whamcloud.com/8324 (master)

Comment by Peter Jones [ 11/Feb/14 ]

Landed for 2.6

Comment by Jinshan Xiong (Inactive) [ 29/Apr/14 ]

Not sure if this is the same issue, but I'm still seeing the crash with exactly the same call stack on the latest master.

LustreError: 20580:0:(lod_object.c:1475:lod_ah_init()) ASSERTION( lc->ldo_stripenr == 0 ) failed: 
LustreError: 20580:0:(lod_object.c:1475:lod_ah_init()) LBUG
Pid: 20580, comm: mdt01_008

Call Trace:
 [<ffffffffa03a3895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa03a3e97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa0e098de>] lod_ah_init+0xaae/0xb80 [lod]
 [<ffffffffa0ce1b09>] mdd_object_make_hint+0x139/0x180 [mdd]
 [<ffffffffa0cd12f9>] mdd_create_data+0x359/0x7f0 [mdd]
 [<ffffffffa075c4e0>] ? lustre_swab_mdt_body+0x0/0x140 [ptlrpc]
 [<ffffffffa0d4afdb>] mdt_mfd_open+0xc8b/0xf10 [mdt]
 [<ffffffffa0e033a3>] ? lod_xattr_get+0x153/0x420 [lod]
 [<ffffffffa0d4c253>] mdt_finish_open+0x553/0xc20 [mdt]
 [<ffffffffa0d46383>] ? mdt_object_open_lock+0x2f3/0x9c0 [mdt]
 [<ffffffffa0d4e76f>] mdt_reint_open+0x12af/0x2130 [mdt]
 [<ffffffffa03c11c6>] ? upcall_cache_get_entry+0x296/0x880 [libcfs]
 [<ffffffffa0550310>] ? lu_ucred+0x20/0x30 [obdclass]
 [<ffffffffa0d36851>] mdt_reint_rec+0x41/0xe0 [mdt]
 [<ffffffffa0d1be13>] mdt_reint_internal+0x4c3/0x7c0 [mdt]
 [<ffffffffa0d1c308>] mdt_intent_reint+0x1f8/0x520 [mdt]
 [<ffffffffa0d1a9e9>] mdt_intent_policy+0x499/0xca0 [mdt]
 [<ffffffff81168742>] ? kmem_cache_alloc+0x182/0x190
 [<ffffffffa070f809>] ldlm_lock_enqueue+0x359/0x920 [ptlrpc]
 [<ffffffffa0738c6f>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
 [<ffffffff8103c7d8>] ? pvclock_clocksource_read+0x58/0xd0
 [<ffffffffa07bb022>] tgt_enqueue+0x62/0x1d0 [ptlrpc]
 [<ffffffffa07bb3cc>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
 [<ffffffffa075a01c>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
 [<ffffffffa076a5ca>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
 [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
 [<ffffffff8150e600>] ? thread_return+0x4e/0x76e
 [<ffffffffa07698b0>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
 [<ffffffff81096a36>] kthread+0x96/0xa0
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffff810969a0>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

I have a core dump so feel free to ask for it.

Generated at Sat Feb 10 01:41:07 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.