[LU-4623] creating file stripe > 167 fails Created: 12/Feb/14  Updated: 18/Apr/14  Resolved: 13/Feb/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Mahmoud Hanafi Assignee: Emoly Liu
Resolution: Duplicate Votes: 0
Labels: None

Attachments: File setstripedb.out.gz    
Issue Links:
Related
is related to LU-4260 ASSERTION( lc->ldo_stripenr == 0 ) fa... Resolved
Severity: 3
Rank (Obsolete): 12649

 Description   

Creating a file with a stripe count greater than 167 fails. If a second setstripe is attempted on the same file, the MDT hits an LBUG.

mhanafi@pfe20:/nobackupp9/mhanafi/teststripe> cat /proc/fs/lustre/version 
lustre: 2.4.1
kernel: ../lustre/scripts
build:  3nasC_ofed154
mhanafi@pfe20:/nobackupp9/mhanafi/teststripe> lfs setstripe -c 166 test169
mhanafi@pfe20:/nobackupp9/mhanafi/teststripe> lfs setstripe -c 167 test167
mhanafi@pfe20:/nobackupp9/mhanafi/teststripe> lfs setstripe -c 168 test168
error on ioctl 0x4008669a for 'test168' (3): No space left on device
error: setstripe: create stripe file 'test168' failed
mhanafi@pfe20:/nobackupp9/mhanafi/teststripe> lfs getstripe test168
test168 has no stripe info
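
The manual steps above can be scripted to locate the first failing stripe count. This is only a sketch under assumptions, not part of the original report: the `LFS` variable, the `probe_stripe_limit` function, and the `test<N>` file naming are illustrative. It uses only `lfs setstripe -c`, as in the transcript.

```shell
#!/bin/sh
# Sketch: find the smallest stripe count for which `lfs setstripe -c N` fails.
# LFS, probe_stripe_limit and the test<N> file names are illustrative
# assumptions, not taken from the report.
LFS=${LFS:-lfs}

probe_stripe_limit() {
    # $1 = first stripe count to try, $2 = last stripe count to try
    n=$1
    while [ "$n" -le "$2" ]; do
        # Use a fresh file name per attempt: per the description, a second
        # setstripe on the same file is what triggers the MDT LBUG.
        if ! $LFS setstripe -c "$n" "test$n" 2>/dev/null; then
            echo "$n"      # first stripe count that was rejected
            return 0
        fi
        n=$((n + 1))
    done
    return 1               # every count in the range succeeded
}
```

Run from a scratch directory on the Lustre client, for example `probe_stripe_limit 160 170`. Given the transcript above, the first rejected count on this system would presumably be 168. Note that a failed attempt still leaves a file with no stripe info behind, so the probe files should be removed rather than reused.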

LBUG OUTPUT

LNet: 1919:0:(o2iblnd_cb.c:2348:kiblnd_passive_connect()) Skipped 11 previous similar messages
LustreError: 4699:0:(lod_object.c:704:lod_ah_init()) ASSERTION( lc->ldo_stripenr == 0 ) failed: 
LustreError: 4699:0:(lod_object.c:704:lod_ah_init()) LBUG
Pid: 4699, comm: mdt03_002

Call Trace:
 [<ffffffffa050c895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa050ce97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa0faa78f>] lod_ah_init+0x57f/0x5c0 [lod]
 [<ffffffffa0c62a83>] mdd_object_make_hint+0x83/0xa0 [mdd]
 [<ffffffffa0c6eeb2>] mdd_create_data+0x332/0x7d0 [mdd]
 [<ffffffffa0f08d8c>] mdt_finish_open+0x125c/0x1950 [mdt]
 [<ffffffffa0f04658>] ? mdt_object_open_lock+0x1c8/0x510 [mdt]
  23 out of 24 cpus in kdb, waiting for the rest, timeout in 10 second(s)
 [<ffffffffa0f0af26>] mdt_reint_open+0xfe6/0x20e0 [mdt]
.All cpus are now in kdb

MDT VERSION

nbp9-mds ~ # cat /proc/fs/lustre/version 
lustre: 2.4.1
kernel: 2.6.32-358.23.2.el6.20140115.x86_64.lustre241
build:  5.2nasS_ofed154


 Comments   
Comment by Peter Jones [ 12/Feb/14 ]

Emoly

Could you please look into this one?

Thanks

Peter

Comment by Emoly Liu [ 13/Feb/14 ]

I can reproduce it and will investigate it.

Comment by Oleg Drokin [ 13/Feb/14 ]

I think this is a duplicate of LU-4260

Comment by Jay Lan (Inactive) [ 13/Feb/14 ]

I think it was a mistake that this bug was marked as a duplicate of LU-4620 and closed. It should be LU-4260 instead. Please fix it.

Comment by Peter Jones [ 13/Feb/14 ]

OK, I have fixed this, Jay.

Comment by Mahmoud Hanafi [ 24/Mar/14 ]

We still hit this, even with the LU-4260 patch applied.

Running lustre-2.4.1-6nas source at https://github.com/jlan/lustre-nas

<6>Lustre: nbp9-MDT0000: Recovery over after 2:04, of 11086 clients 11086 recovered and 0 were evicted.
<0>LustreError: 5367:0:(lod_object.c:704:lod_ah_init()) ASSERTION( lc->ldo_stripenr == 0 ) failed:
<0>LustreError: 5367:0:(lod_object.c:704:lod_ah_init()) LBUG
<4>Pid: 5367, comm: mdt00_009
<4>
<4>Call Trace:
<4> [<ffffffffa0511895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa0511e97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa0faa77f>] lod_ah_init+0x57f/0x5c0 [lod]
<4> [<ffffffffa0c67a83>] mdd_object_make_hint+0x83/0xa0 [mdd]
<4> [<ffffffffa0c73ec2>] mdd_create_data+0x332/0x7d0 [mdd]
<4> [<ffffffffa0f08d8c>] mdt_finish_open+0x125c/0x1950 [mdt]
<4> [<ffffffffa0f04658>] ? mdt_object_open_lock+0x1c8/0x510 [mdt]
<4> [<ffffffffa0f0af26>] mdt_reint_open+0xfe6/0x20e0 [mdt]
<4> [<ffffffffa052e85e>] ? upcall_cache_get_entry+0x28e/0x860 [libcfs]
<4> [<ffffffffa07f7ddc>] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc]
<4> [<ffffffffa0ef5981>] mdt_reint_rec+0x41/0xe0 [mdt]
<4> [<ffffffffa0edab03>] mdt_reint_internal+0x4c3/0x780 [mdt]
<4> [<ffffffffa0edb090>] mdt_intent_reint+0x1f0/0x530 [mdt]
<4> [<ffffffffa0ed8f3e>] mdt_intent_policy+0x39e/0x720 [mdt]
<4> [<ffffffffa07af831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
<4> [<ffffffffa07d61ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
<4> [<ffffffffa0ed93c6>] mdt_enqueue+0x46/0xe0 [mdt]
<4> [<ffffffffa0edfad7>] mdt_handle_common+0x647/0x16d0 [mdt]
<4> [<ffffffffa0f19615>] mds_regular_handle+0x15/0x20 [mdt]
<4> [<ffffffffa08083d8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
<4> [<ffffffffa05125de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
<4> [<ffffffffa0523d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
<4> [<ffffffffa07ff739>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
<4> [<ffffffff81055813>] ? __wake_up+0x53/0x70
<4> [<ffffffffa080976e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
<4> [<ffffffffa0808ca0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
<4> [<ffffffff8100c0ca>] child_rip+0xa/0x20
<4> [<ffffffffa0808ca0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
<4> [<ffffffffa0808ca0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
<4> [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

Comment by Jay Lan (Inactive) [ 18/Apr/14 ]

I think LU-4791 supersedes LU-4260.
I want to document that here since we filed this one.

Generated at Sat Feb 10 01:44:24 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.