[LU-2093] sanity test 27m 118m ASSERTION( nfound == stripe_cnt ) failed Out of space Created: 04/Oct/12  Updated: 19/Apr/13  Resolved: 19/Apr/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Oleg Drokin Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None

Attachments: File log-lu2093.txt.gz    
Issue Links:
Related
Rank (Obsolete): 4372

 Description   

This is the second time I have seen a crash like this. The first time was in test 118m, now in test 27m, but with the same assertion and stack.
It apparently has to do with running out of space (OOS):

[10419.559628] Lustre: DEBUG MARKER: == sanity test 27m: create file while OST0 was full ==================== 22:00:08 (1349402408)
[10437.623359] LustreError: 16640:0:(vvp_io.c:1038:vvp_io_commit_write()) Write page 37435 of inode ffff88016e63bb20 failed -28
[10437.775336] LustreError: 12752:0:(osp_precreate.c:275:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
[10439.841663] LustreError: 12937:0:(lod_qos.c:1147:lod_alloc_qos()) can't declare new object on #0: -28
[10439.843142] LustreError: 12937:0:(lod_qos.c:1159:lod_alloc_qos()) Didn't find any OSTs?
[10439.844380] LustreError: 12937:0:(lod_qos.c:1163:lod_alloc_qos()) ASSERTION( nfound == stripe_cnt ) failed:
[10439.845971] LustreError: 12937:0:(lod_qos.c:1163:lod_alloc_qos()) LBUG
[10439.847156] Pid: 12937, comm: mdt00_004
[10439.848067]
[10439.848067] Call Trace:
[10439.848838]  [<ffffffffa074d915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[10439.850378]  [<ffffffffa074df27>] lbug_with_loc+0x47/0xb0 [libcfs]
[10439.851372]  [<ffffffffa0c39217>] lod_alloc_qos.clone.0+0x8e7/0x1170 [lod]
[10439.852475]  [<ffffffffa0c3b303>] lod_qos_prep_create+0x693/0x18e4 [lod]
[10439.853557]  [<ffffffffa0c36a8b>] lod_declare_striped_object+0x14b/0x920 [lod]
[10439.854839]  [<ffffffffa0c37568>] lod_declare_object_create+0x308/0x4f0 [lod]
[10439.855969]  [<ffffffffa06ffc4f>] mdd_declare_object_create_internal+0xaf/0x1d0 [mdd]
[10439.857238]  [<ffffffffa0710aca>] mdd_create+0x39a/0x1550 [mdd]
[10439.858190]  [<ffffffffa0b7bbc9>] mdt_reint_open+0x1079/0x1860 [mdt]
[10439.859199]  [<ffffffffa071686e>] ? md_ucred+0x1e/0x60 [mdd]
[10439.860114]  [<ffffffffa0b46655>] ? mdt_ucred+0x15/0x20 [mdt]
[10439.861049]  [<ffffffffa0b660a1>] mdt_reint_rec+0x41/0xe0 [mdt]
[10439.862006]  [<ffffffffa0b5f483>] mdt_reint_internal+0x4e3/0x7e0 [mdt]
[10439.863046]  [<ffffffffa0b5fa4d>] mdt_intent_reint+0x1ed/0x500 [mdt]
[10439.864064]  [<ffffffffa0b5b3fe>] mdt_intent_policy+0x38e/0x770 [mdt]
[10439.865125]  [<ffffffffa022ddda>] ldlm_lock_enqueue+0x2ea/0x890 [ptlrpc]
[10439.866208]  [<ffffffffa0254fc7>] ldlm_handle_enqueue0+0x4e7/0x1010 [ptlrpc]
[10439.867328]  [<ffffffffa0b5b936>] mdt_enqueue+0x46/0x130 [mdt]
[10439.868292]  [<ffffffffa0b4f1f2>] mdt_handle_common+0x932/0x1740 [mdt]
[10439.869535]  [<ffffffffa0b500d5>] mdt_regular_handle+0x15/0x20 [mdt]
[10439.870577]  [<ffffffffa0283743>] ptlrpc_server_handle_request+0x463/0xe70 [ptlrpc]
[10439.871790]  [<ffffffffa074e66e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[10439.873067]  [<ffffffffa027c431>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
[10439.874154]  [<ffffffff81051f73>] ? __wake_up+0x53/0x70
[10439.875000]  [<ffffffffa02862ce>] ptlrpc_main+0xb8e/0x1960 [ptlrpc]
[10439.876019]  [<ffffffffa0285740>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[10439.877234]  [<ffffffff8100c14a>] child_rip+0xa/0x20
[10439.878049]  [<ffffffffa0285740>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[10439.879049]  [<ffffffffa0285740>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[10439.880047]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
[10439.880889]
[10439.881508] Kernel panic - not syncing: LBUG
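
The sequence in the log (declare fails with -28 on the only OST, "Didn't find any OSTs?", then the assertion fires) can be illustrated with a toy model. This is only a sketch under stated assumptions: it is not the Lustre code, and it is not the change that was later proposed in review. The names `NoSpace`, `declare_object` and `alloc_stripes` are hypothetical. It shows why checking `nfound == stripe_cnt` with an assertion is fragile when "all targets are full" is a reachable runtime state, and what returning an error instead would look like.

```python
import errno


class NoSpace(Exception):
    """Hypothetical stand-in for a target failing a declare with -ENOSPC."""


def declare_object(ost_free_blocks, ost_idx):
    # Hypothetical helper: a target with no free blocks refuses the declare.
    if ost_free_blocks[ost_idx] <= 0:
        raise NoSpace(ost_idx)


def alloc_stripes(ost_free_blocks, stripe_cnt):
    """Pick stripe_cnt targets; return their indices or a negative errno."""
    found = []
    for idx in range(len(ost_free_blocks)):
        if len(found) == stripe_cnt:
            break
        try:
            declare_object(ost_free_blocks, idx)
        except NoSpace:
            # Skip full targets, as the "can't declare new object" line suggests.
            continue
        found.append(idx)

    if len(found) != stripe_cnt:
        # An assertion here would abort on a condition that a full
        # filesystem can legitimately produce; report the error instead.
        return -errno.ENOSPC
    return found


# One full target and one stripe requested: the case from sanity test 27m.
print(alloc_stripes([0], 1))
# Three targets, the middle one full, two stripes requested.
print(alloc_stripes([10, 0, 5], 2))
```

In this model the single-OST, out-of-space case yields a negative error code for the caller to propagate rather than a crash.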


 Comments   
Comment by Oleg Drokin [ 05/Oct/12 ]

When running with USE_OFD=yes, I seem to hit this issue much more frequently.

Anyway, here's the debug log dumped during an OFD run.

Comment by Oleg Drokin [ 05/Oct/12 ]

Apparently ORI-213 is the same or a very similar issue.

Comment by Alex Zhuravlev [ 06/Oct/12 ]

Please try http://review.whamcloud.com/4210
Also, could you run sanity with full debug enabled?

Comment by Oleg Drokin [ 08/Oct/12 ]

Just hit it again running racer on a very fresh master (with this particular patch included).

[17138.598512] LustreError: 9189:0:(vvp_io.c:1038:vvp_io_commit_write()) Write page 975 of inode ffff88007df36b20 failed -28
[17138.599687] LustreError: 9189:0:(vvp_io.c:1038:vvp_io_commit_write()) Skipped 8192 previous similar messages
[17162.396113] LustreError: 16540:0:(file.c:2333:ll_inode_revalidate_fini()) failure -116 inode 144115205322946107
[17162.396697] LustreError: 16540:0:(file.c:2333:ll_inode_revalidate_fini()) Skipped 6 previous similar messages
[17170.436387] LustreError: 28970:0:(lod_qos.c:1147:lod_alloc_qos()) can't declare new object on #0: -28
[17170.437162] LustreError: 28970:0:(lod_qos.c:1147:lod_alloc_qos()) Skipped 4 previous similar messages
[17180.616649] LustreError: 25274:0:(ofd_obd.c:1222:ofd_create()) unable to precreate: -28
[17180.617596] LustreError: 25274:0:(ofd_obd.c:1222:ofd_create()) Skipped 339 previous similar messages
[17180.618456] LustreError: 24438:0:(osp_precreate.c:275:osp_precreate_send()) lustre-OST0000-osc-MDT0000: can't precreate: rc = -28
[17180.619540] LustreError: 24438:0:(osp_precreate.c:275:osp_precreate_send()) Skipped 339 previous similar messages
[17205.641098] LustreError: 24130:0:(lod_qos.c:1159:lod_alloc_qos()) Didn't find any OSTs?
[17205.642043] LustreError: 24130:0:(lod_qos.c:1163:lod_alloc_qos()) ASSERTION( nfound == stripe_cnt ) failed: 
[17205.643006] LustreError: 24130:0:(lod_qos.c:1163:lod_alloc_qos()) LBUG
[17205.643845] Pid: 24130, comm: mdt01_002
[17205.644498] 
[17205.644500] Call Trace:
[17205.645703]  [<ffffffffa0f14915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[17205.646309]  [<ffffffffa0f14f27>] lbug_with_loc+0x47/0xb0 [libcfs]
[17205.646872]  [<ffffffffa097f217>] lod_alloc_qos.clone.0+0x8e7/0x1170 [lod]
[17205.647511]  [<ffffffffa0981303>] lod_qos_prep_create+0x693/0x18e4 [lod]
[17205.648144]  [<ffffffffa0832d92>] ? osd_declare_inode_qid+0x1a2/0x270 [osd_ldiskfs]
[17205.649115]  [<ffffffffa097ca8b>] lod_declare_striped_object+0x14b/0x920 [lod]
[17205.649994]  [<ffffffffa097d568>] lod_declare_object_create+0x308/0x4f0 [lod]
[17205.650602]  [<ffffffffa072bc4f>] mdd_declare_object_create_internal+0xaf/0x1d0 [mdd]
[17205.651497]  [<ffffffffa073caca>] mdd_create+0x39a/0x1550 [mdd]
[17205.652060]  [<ffffffffa08c1bc9>] mdt_reint_open+0x1079/0x1860 [mdt]
[17205.652646]  [<ffffffffa074286e>] ? md_ucred+0x1e/0x60 [mdd]
[17205.653248]  [<ffffffffa088c655>] ? mdt_ucred+0x15/0x20 [mdt]
[17205.653807]  [<ffffffffa08ac0a1>] mdt_reint_rec+0x41/0xe0 [mdt]
[17205.654443]  [<ffffffffa08a5483>] mdt_reint_internal+0x4e3/0x7e0 [mdt]
[17205.655038]  [<ffffffffa08a5a4d>] mdt_intent_reint+0x1ed/0x500 [mdt]
[17205.655630]  [<ffffffffa08a13fe>] mdt_intent_policy+0x38e/0x770 [mdt]
[17205.656247]  [<ffffffffa11d0dda>] ldlm_lock_enqueue+0x2ea/0x890 [ptlrpc]
[17205.657122]  [<ffffffffa11f80d7>] ldlm_handle_enqueue0+0x4e7/0x1010 [ptlrpc]
[17205.657748]  [<ffffffffa08a1936>] mdt_enqueue+0x46/0x130 [mdt]
[17205.658413]  [<ffffffffa08951f2>] mdt_handle_common+0x932/0x1740 [mdt]
[17205.658681]  [<ffffffffa08960d5>] mdt_regular_handle+0x15/0x20 [mdt]
[17205.659050]  [<ffffffffa1226853>] ptlrpc_server_handle_request+0x463/0xe70 [ptlrpc]
[17205.659479]  [<ffffffffa0f1566e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[17205.659755]  [<ffffffffa121f541>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
[17205.660018]  [<ffffffff81051f73>] ? __wake_up+0x53/0x70
[17205.660288]  [<ffffffffa12293ea>] ptlrpc_main+0xb9a/0x1960 [ptlrpc]
[17205.660557]  [<ffffffffa1228850>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[17205.660815]  [<ffffffff8100c14a>] child_rip+0xa/0x20
[17205.661094]  [<ffffffffa1228850>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[17205.662229]  [<ffffffffa1228850>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[17205.662490]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
[17205.662727] 
[17205.666053] Kernel panic - not syncing: LBUG
[17205.666056] Pid: 24130, comm: mdt01_002 Not tainted 2.6.32-debug #6
[17205.666058] Call Trace:
[17205.666066]  [<ffffffff814f75e4>] ? panic+0xa0/0x168
[17205.666094]  [<ffffffffa0f14f7b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
[17205.666115]  [<ffffffffa097f217>] ? lod_alloc_qos.clone.0+0x8e7/0x1170 [lod]
[17205.666132]  [<ffffffffa0981303>] ? lod_qos_prep_create+0x693/0x18e4 [lod]
[17205.666160]  [<ffffffffa0832d92>] ? osd_declare_inode_qid+0x1a2/0x270 [osd_ldiskfs]
[17205.666177]  [<ffffffffa097ca8b>] ? lod_declare_striped_object+0x14b/0x920 [lod]
[17205.666203]  [<ffffffffa097d568>] ? lod_declare_object_create+0x308/0x4f0 [lod]
[17205.666222]  [<ffffffffa072bc4f>] ? mdd_declare_object_create_internal+0xaf/0x1d0 [mdd]
[17205.666240]  [<ffffffffa073caca>] ? mdd_create+0x39a/0x1550 [mdd]
[17205.666272]  [<ffffffffa08c1bc9>] ? mdt_reint_open+0x1079/0x1860 [mdt]
[17205.666288]  [<ffffffffa074286e>] ? md_ucred+0x1e/0x60 [mdd]
[17205.666309]  [<ffffffffa088c655>] ? mdt_ucred+0x15/0x20 [mdt]
[17205.666332]  [<ffffffffa08ac0a1>] ? mdt_reint_rec+0x41/0xe0 [mdt]
[17205.666355]  [<ffffffffa08a5483>] ? mdt_reint_internal+0x4e3/0x7e0 [mdt]
[17205.666378]  [<ffffffffa08a5a4d>] ? mdt_intent_reint+0x1ed/0x500 [mdt]
[17205.666400]  [<ffffffffa08a13fe>] ? mdt_intent_policy+0x38e/0x770 [mdt]
[17205.666462]  [<ffffffffa11d0dda>] ? ldlm_lock_enqueue+0x2ea/0x890 [ptlrpc]
[17205.666521]  [<ffffffffa11f80d7>] ? ldlm_handle_enqueue0+0x4e7/0x1010 [ptlrpc]
[17205.666543]  [<ffffffffa08a1936>] ? mdt_enqueue+0x46/0x130 [mdt]
[17205.666564]  [<ffffffffa08951f2>] ? mdt_handle_common+0x932/0x1740 [mdt]
[17205.666585]  [<ffffffffa08960d5>] ? mdt_regular_handle+0x15/0x20 [mdt]
[17205.666649]  [<ffffffffa1226853>] ? ptlrpc_server_handle_request+0x463/0xe70 [ptlrpc]
[17205.666671]  [<ffffffffa0f1566e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[17205.666732]  [<ffffffffa121f541>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
[17205.666739]  [<ffffffff81051f73>] ? __wake_up+0x53/0x70
[17205.666797]  [<ffffffffa12293ea>] ? ptlrpc_main+0xb9a/0x1960 [ptlrpc]
[17205.666856]  [<ffffffffa1228850>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[17205.666862]  [<ffffffff8100c14a>] ? child_rip+0xa/0x20
[17205.666921]  [<ffffffffa1228850>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[17205.666981]  [<ffffffffa1228850>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[17205.666987]  [<ffffffff8100c140>] ? child_rip+0x0/0x20

Crashdump is at /exports/crashdumps/192.168.10.211-2012-10-08-21\:08\:23 (modules included)
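
For readers scanning these logs, the negative numbers at the end of the LustreError lines are negated errno values. The following is a minimal sketch for decoding them, assuming a Linux host where the numbering matches the kernel that produced the log; `decode_rc` is a hypothetical helper name, not part of any Lustre tool. It only handles lines where the code is the last token, so a line such as the `ll_inode_revalidate_fini()` one, where -116 appears mid-line, is not matched.

```python
import errno
import os
import re

# A negative integer at the very end of the line, e.g. "rc = -28" or "failed -28".
RC_PATTERN = re.compile(r"(-\d+)\s*$")


def decode_rc(line):
    """Return (code, symbolic name, message) for a trailing negative rc, or None."""
    match = RC_PATTERN.search(line)
    if match is None:
        return None
    code = -int(match.group(1))
    # errno.errorcode maps numbers to names for the platform Python runs on.
    name = errno.errorcode.get(code, "UNKNOWN")
    return code, name, os.strerror(code)


line = ("LustreError: 24438:0:(osp_precreate.c:275:osp_precreate_send()) "
        "lustre-OST0000-osc-MDT0000: can't precreate: rc = -28")
print(decode_rc(line))
```

On Linux, 28 decodes to ENOSPC, which is consistent with the out-of-space condition described in this ticket.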

Comment by Alex Zhuravlev [ 10/Oct/12 ]

please try with http://review.whamcloud.com/#change,4241

Comment by Alex Zhuravlev [ 14/Oct/12 ]

landed
