[LU-4083] lod_lov.c:824:lod_load_striping()) ASSERTION( lo->ldo_stripenr == 0 ) failed Created: 10/Oct/13  Updated: 06/Mar/14  Resolved: 02/Dec/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: Lustre 2.6.0, Lustre 2.5.1

Type: Bug Priority: Blocker
Reporter: Jinshan Xiong (Inactive) Assignee: James Nunez (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-2789 lod_load_striping()) ASSERTION( lo->l... Closed
Severity: 3
Rank (Obsolete): 10975

 Description   

I saw this crash when I was running racer.

LustreError: 17343:0:(lod_lov.c:824:lod_load_striping()) ASSERTION( lo->ldo_stripenr == 0 ) failed:
LustreError: 17343:0:(lod_lov.c:824:lod_load_striping()) LBUG
Pid: 17343, comm: mdt03_007

Call Trace:
[<ffffffffa04c2895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa04c2e97>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa0efded3>] lod_load_striping+0x383/0x4b0 [lod]
[<ffffffffa0f08bab>] lod_declare_object_destroy+0x16b/0x390 [lod]
[<ffffffffa0c972a0>] mdd_declare_finish_unlink+0x90/0x170 [mdd]
[<ffffffffa0ca0579>] mdd_rename+0x1eb9/0x2390 [mdd]
[<ffffffffa0e10143>] mdt_reint_rename+0x1383/0x1bf0 [mdt]
[<ffffffffa066ad60>] ? lu_ucred+0x20/0x30 [obdclass]
[<ffffffffa0e0aea1>] mdt_reint_rec+0x41/0xe0 [mdt]
[<ffffffffa0df2c93>] mdt_reint_internal+0x4c3/0x780 [mdt]
[<ffffffffa0df2f94>] mdt_reint+0x44/0xe0 [mdt]
[<ffffffffa0df5a8a>] mdt_handle_common+0x52a/0x1470 [mdt]
[<ffffffffa0e2fc45>] mds_regular_handle+0x15/0x20 [mdt]
[<ffffffffa07d9e25>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
[<ffffffffa04d427f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
[<ffffffffa07d14c9>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
[<ffffffffa07db18d>] ptlrpc_main+0xaed/0x1740 [ptlrpc]
[<ffffffffa07da6a0>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
[<ffffffff81096a36>] kthread+0x96/0xa0
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffff810969a0>] ? kthread+0x0/0xa0
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20

LustreError: dumping log to /tmp/lustre-log.1381356940.17343



 Comments   
Comment by Jinshan Xiong (Inactive) [ 10/Oct/13 ]

After applying this patch, the issue went away:

diff --git a/lustre/lod/lod_qos.c b/lustre/lod/lod_qos.c
index e7b1de0..49575b7 100644
--- a/lustre/lod/lod_qos.c
+++ b/lustre/lod/lod_qos.c
@@ -813,6 +813,7 @@ repeat_find:
                rc = 0;
        } else {
                /* nobody provided us with a single object */
+               lo->ldo_stripenr = 0;
                rc = -ENOSPC;
        }
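
The effect of the one-line change can be sketched outside the kernel. This is a toy model, not the Lustre code: `struct demo_obj`, `demo_alloc()`, `demo_can_load()` and the `reset_on_fail` flag are hypothetical names introduced for illustration, and it assumes the stripe count was already set to a non-zero value before the allocation failed. Only the `stripenr == 0` precondition and the `-ENOSPC` return mirror the assertion and the patch above.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the LOD object; keeps only the field
 * named in the failed assertion (lo->ldo_stripenr). */
struct demo_obj {
	int stripenr;
};

/*
 * Toy allocation: ask for `want` stripes when `avail` targets can
 * provide an object. Assumption: the count is recorded before we
 * know whether any object was obtained.
 */
int demo_alloc(struct demo_obj *o, int want, int avail, int reset_on_fail)
{
	o->stripenr = want;
	if (avail > 0) {
		/* keep as many stripes as we actually obtained */
		o->stripenr = avail < want ? avail : want;
		return 0;
	}
	/* nobody provided us with a single object */
	if (reset_on_fail)
		o->stripenr = 0;	/* the line the patch adds */
	return -ENOSPC;
}

/* Models the precondition asserted in lod_load_striping():
 * striping may only be loaded when no stripe count is cached. */
int demo_can_load(const struct demo_obj *o)
{
	return o->stripenr == 0;
}
```

With `reset_on_fail` set to 0, a failed `demo_alloc()` leaves a stale non-zero count and `demo_can_load()` returns 0, which corresponds to the LBUG in the trace. With it set to 1, the failure path leaves the object in a state a later load can accept.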
 
Comment by Peter Jones [ 10/Oct/13 ]

James

Could you please upload this patch into gerrit on behalf of Jinshan?

Peter

Comment by James Nunez (Inactive) [ 10/Oct/13 ]

Patch at: http://review.whamcloud.com/7919

Comment by Oleg Drokin [ 06/Nov/13 ]

I just want to add that I also hit this pretty frequently and it disrupts my testing. As such I am increasing priority to critical.

Comment by James Nunez (Inactive) [ 02/Dec/13 ]

Patch landed to master.

Comment by Patrick Farrell (Inactive) [ 04/Dec/13 ]

Is this a duplicate of https://jira.hpdd.intel.com/browse/LU-2789 ?

From the fixes, they tentatively appear to take the same lock, but around different operations. Is it the same race condition or a different one?

Comment by Stuart Midgley [ 19/Feb/14 ]

FWIW we have hit this with a production Lustre system running 2.5.0.

Will apply the patch and LU-2789 and move on.
