[LU-2246] failure on sanity.sh test_132: ASSERTION( env->le_ses != ((void *)0) ) failed Created: 29/Oct/12  Updated: 07/Jan/16  Resolved: 07/Jan/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug
Priority: Major
Reporter: Maloo
Assignee: Hongchao Zhang
Resolution: Fixed
Votes: 0
Labels: None
Environment:

Both server and client are RHEL6


Issue Links:
Related
is related to LU-1403 ucred code cleanup Resolved
is related to LU-1942 2.1.3<->2.3 Test failure on test suit... Resolved
Severity: 3
Rank (Obsolete): 5318

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/34cdc80e-21c2-11e2-b552-52540035b04c.

The sub-test test_132 failed with the following error:

test failed to respond and timed out

From MDS console log:

00:20:08:Lustre: lustre-MDT0000: haven't heard from client 5be596f3-4c31-36f1-0f15-cf978ff44af7 (at 192.168.4.23@o2ib) in 55 seconds. I think it's dead, and I am evicting it. exp ffff88030ddd2000, cur 1351495207 expire 1351495177 last 1351495152
00:20:08:LustreError: 25221:0:(mdd_device.c:1426:md_ucred()) ASSERTION( env->le_ses != ((void *)0) ) failed: 
00:20:08:LustreError: 25221:0:(mdd_device.c:1426:md_ucred()) LBUG
00:20:08:Pid: 25221, comm: ll_evictor
00:20:08:
00:20:08:Call Trace:
00:20:08: [<ffffffffa03f2905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
00:20:08: [<ffffffffa03f2f17>] lbug_with_loc+0x47/0xb0 [libcfs]
00:20:08: [<ffffffffa0be7c2c>] md_ucred+0x5c/0x60 [mdd]
00:20:08: [<ffffffffa0bd03a6>] mdd_xattr_sanity_check+0x36/0x1f0 [mdd]
00:20:08: [<ffffffffa0bd71db>] mdd_xattr_set+0x17b/0x620 [mdd]
00:20:08: [<ffffffffa0bf0426>] ? mdd_read_unlock+0x26/0x30 [mdd]
00:20:08: [<ffffffffa0bd513b>] ? mdd_xattr_get+0x13b/0x340 [mdd]
00:20:08: [<ffffffffa0c4de73>] mdt_som_attr_set+0x1b3/0x440 [mdt]
00:20:08: [<ffffffffa0c4e24c>] mdt_ioepoch_close_on_eviction+0x14c/0x170 [mdt]
00:20:08: [<ffffffffa0f48e89>] ? osd_key_init+0x119/0x680 [osd_ldiskfs]
00:20:08: [<ffffffffa0c4eccb>] mdt_ioepoch_close+0x2ab/0x3d0 [mdt]
00:20:08: [<ffffffffa0c4f272>] mdt_mfd_close+0x482/0x700 [mdt]
00:20:08: [<ffffffffa0c1e01e>] mdt_obd_disconnect+0x3ae/0x4f0 [mdt]
00:20:08: [<ffffffffa056ed88>] class_fail_export+0x248/0x580 [obdclass]
00:20:08: [<ffffffffa0765e69>] ping_evictor_main+0x249/0x640 [ptlrpc]
00:20:08: [<ffffffff81060250>] ? default_wake_function+0x0/0x20
00:20:08: [<ffffffffa0765c20>] ? ping_evictor_main+0x0/0x640 [ptlrpc]
00:20:08: [<ffffffff8100c14a>] child_rip+0xa/0x20
00:20:08: [<ffffffffa0765c20>] ? ping_evictor_main+0x0/0x640 [ptlrpc]
00:20:08: [<ffffffffa0765c20>] ? ping_evictor_main+0x0/0x640 [ptlrpc]
00:20:08: [<ffffffff8100c140>] ? child_rip+0x0/0x20
00:20:08:
00:20:08:Kernel panic - not syncing: LBUG
00:20:08:Pid: 25221, comm: ll_evictor Not tainted 2.6.32-279.5.1.el6_lustre.x86_64 #1
00:20:08:Call Trace:
00:20:08: [<ffffffff814fd58a>] ? panic+0xa0/0x168
00:20:08: [<ffffffffa03f2f6b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
00:20:08: [<ffffffffa0be7c2c>] ? md_ucred+0x5c/0x60 [mdd]
00:20:08: [<ffffffffa0bd03a6>] ? mdd_xattr_sanity_check+0x36/0x1f0 [mdd]
00:20:08: [<ffffffffa0bd71db>] ? mdd_xattr_set+0x17b/0x620 [mdd]
00:20:08: [<ffffffffa0bf0426>] ? mdd_read_unlock+0x26/0x30 [mdd]
00:20:08: [<ffffffffa0bd513b>] ? mdd_xattr_get+0x13b/0x340 [mdd]
00:20:08: [<ffffffffa0c4de73>] ? mdt_som_attr_set+0x1b3/0x440 [mdt]
00:20:08: [<ffffffffa0c4e24c>] ? mdt_ioepoch_close_on_eviction+0x14c/0x170 [mdt]
00:20:08: [<ffffffffa0f48e89>] ? osd_key_init+0x119/0x680 [osd_ldiskfs]
00:20:08: [<ffffffffa0c4eccb>] ? mdt_ioepoch_close+0x2ab/0x3d0 [mdt]
00:20:08: [<ffffffffa0c4f272>] ? mdt_mfd_close+0x482/0x700 [mdt]
00:20:08: [<ffffffffa0c1e01e>] ? mdt_obd_disconnect+0x3ae/0x4f0 [mdt]
00:20:08: [<ffffffffa056ed88>] ? class_fail_export+0x248/0x580 [obdclass]
00:20:08: [<ffffffffa0765e69>] ? ping_evictor_main+0x249/0x640 [ptlrpc]
00:20:08: [<ffffffff81060250>] ? default_wake_function+0x0/0x20
00:20:08: [<ffffffffa0765c20>] ? ping_evictor_main+0x0/0x640 [ptlrpc]
00:20:08: [<ffffffff8100c14a>] ? child_rip+0xa/0x20
00:20:08: [<ffffffffa0765c20>] ? ping_evictor_main+0x0/0x640 [ptlrpc]
00:20:08: [<ffffffffa0765c20>] ? ping_evictor_main+0x0/0x640 [ptlrpc]
00:20:08: [<ffffffff8100c140>] ? child_rip+0x0/0x20


 Comments   
Comment by Hongchao Zhang [ 08/Nov/12 ]

This issue is related to SOM and is easy to reproduce: it occurs whenever a client is evicted while SOM is enabled,
because ping_evictor threads have no lu_context "le_ses", which is designed to be per-request.

In this case, the client hit the ASSERTION of LU-1527 in test_132 and was evicted by the MDT, which triggered this ASSERTION.
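
The failure mode described above can be illustrated with a minimal, self-contained sketch. All `sketch_*` names are hypothetical stand-ins and not the real Lustre definitions of lu_env, lu_context or md_ucred; the NULL-tolerant variant is only one possible way to handle a missing session and is not taken from the actual patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the Lustre types; not the real definitions. */
struct sketch_ucred { unsigned int uid; };
struct sketch_ctx   { struct sketch_ucred *uc; };
struct sketch_env {
    struct sketch_ctx  le_ctx;   /* per-thread context, always present */
    struct sketch_ctx *le_ses;   /* per-request session; NULL in an evictor thread */
};

/* Mirrors the failing pattern: assumes a session always exists. */
static struct sketch_ucred *sketch_ucred_strict(const struct sketch_env *env)
{
    assert(env->le_ses != NULL);
    return env->le_ses->uc;
}

/* NULL-tolerant variant (an assumption): report "no credentials"
 * instead of asserting when there is no request session. */
static struct sketch_ucred *sketch_ucred_check(const struct sketch_env *env)
{
    if (env->le_ses == NULL)
        return NULL;
    return env->le_ses->uc;
}

/* Caller pattern: an internal caller such as eviction has no requester,
 * so the toy permission check is skipped. */
static int sketch_xattr_sanity_check(const struct sketch_env *env)
{
    struct sketch_ucred *uc = sketch_ucred_check(env);

    if (uc == NULL)
        return 0;                       /* no session: nothing to check */
    return uc->uid == 0 ? 0 : -EPERM;   /* toy rule: only uid 0 allowed */
}
```

With an environment whose `le_ses` is NULL, `sketch_ucred_strict` would trip its assertion (the analogue of the LBUG in the log), while `sketch_xattr_sanity_check` returns 0.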

Comment by Hongchao Zhang [ 09/Nov/12 ]

the initial patch is under test and will be attached soon

Comment by Hongchao Zhang [ 12/Nov/12 ]

the patch is tracked at http://review.whamcloud.com/#change,4512

This problem is related to SOM, which is not a mature feature, and the patch fixes some bugs in SOM.
Here are some notes on the issues fixed by the patch:

1, the lu_env may not contain the lu_context tied to a specific request (lu_env->le_ses), for example in the eviction case,
so the SOM-related operations must take that into account.
2, the AU (ll_som_update) RPC can encounter a destroyed object (marked with LU_OBJECT_HEARD_BANSHEE) when the object
was unlinked/closed but an mfd is still held on the MDT waiting for the AU RPC. This causes a deadlock, because the AU RPC
cannot get the object via lu_object_find_at once it is marked as dead.
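
The second point can be pictured with a toy model. Everything named `sketch_*` is hypothetical and stands in for the real object cache, LU_OBJECT_HEARD_BANSHEE flag and mfd; the sketch returns NULL where, per the comment above, the real lookup cannot make progress. Reusing the object already pinned by the open-file handle is shown only as one conceivable way around the lookup, not as what the patch does.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for LU_OBJECT_HEARD_BANSHEE (hypothetical value). */
enum { SKETCH_HEARD_BANSHEE = 1 << 0 };

struct sketch_obj {
    int      id;
    unsigned flags;
    int      refcount;
};

/* Open-file handle that pins an object (stand-in for an mfd). */
struct sketch_mfd { struct sketch_obj *obj; };

/* Toy cache lookup: an object marked as dying is never handed out. */
static struct sketch_obj *sketch_find(struct sketch_obj *table, size_t n, int id)
{
    for (size_t i = 0; i < n; i++) {
        if (table[i].id != id)
            continue;
        if (table[i].flags & SKETCH_HEARD_BANSHEE)
            return NULL;            /* dead object: lookup cannot succeed */
        table[i].refcount++;
        return &table[i];
    }
    return NULL;
}

/* Alternative (an assumption): take the object from the handle that
 * already holds a reference instead of looking it up again. */
static struct sketch_obj *sketch_obj_from_mfd(struct sketch_mfd *mfd)
{
    assert(mfd->obj != NULL);
    mfd->obj->refcount++;
    return mfd->obj;
}
```

In this model the dying object can never be found by id while the handle still pins it, which is the shape of the deadlock described in point 2.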

This issue can be dropped as a blocker, because it is caused by the blocker LU-1527 and only affects the unstable SOM feature.

Comment by Hongchao Zhang [ 09/Dec/12 ]

The patch has been updated again; it tries to fix the following SOM-related problems. The ASSERTION issue
"env->le_ses != ((void *)0)" will be fixed by http://review.whamcloud.com/2733

1, if there is no LSM, ll_som_update does not need to update the SOM attributes on the MDS (test_206 in sanity.sh)

2, if the file is not a regular file, ll_setattr_raw should not open the file (MF_EPOCH_OPEN); otherwise the set_attr
request is left behind and the corresponding obd_import and obd_device are not released (several MDC obd_device entries remain in the ST state)

3, mdt_som_au_close can be called during eviction, in which case there is no ptlrpc_request associated with it.

4, when closing a file, the MDT should not require the client to send an AU if the file has been unlinked.
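
The four conditions above can be summarized as small guard functions. This is only a sketch under stated assumptions: the `sketch_*` types and functions are hypothetical, model each condition as a plain predicate, and do not reproduce the code of ll_som_update, ll_setattr_raw or mdt_som_au_close.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical file description; not a Lustre structure. */
enum sketch_ftype { SKETCH_REG, SKETCH_DIR };

struct sketch_file {
    enum sketch_ftype type;
    bool has_lsm;    /* file has a striping layout (LSM) */
    bool unlinked;   /* file has already been unlinked */
};

/* Stand-in for a request that may be absent during eviction. */
struct sketch_req { int replied; };

/* Point 1: without an LSM there are no SOM attributes to update. */
static bool sketch_needs_som_update(const struct sketch_file *f)
{
    assert(f != NULL);
    return f->has_lsm;
}

/* Point 2: only regular files should open an IO epoch on setattr. */
static bool sketch_may_open_epoch(const struct sketch_file *f)
{
    return f->type == SKETCH_REG;
}

/* Point 3: the close path must cope with a missing request. */
static int sketch_som_au_close(struct sketch_req *req)
{
    if (req != NULL)
        req->replied = 1;   /* only touch the request when one exists */
    return 0;
}

/* Point 4: do not ask the client for an AU once the file is unlinked. */
static bool sketch_should_request_au(const struct sketch_file *f)
{
    return !f->unlinked;
}
```

Calling `sketch_som_au_close(NULL)` models the eviction path from point 3: it completes without touching any request.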

Comment by John Fuchs-Chesney (Inactive) [ 07/Jan/16 ]

Patch was merged.
~ jfc.
