
[LU-2788] Sanity test_132 (lu_object.c:1982:lu_ucred_assert()) ASSERTION( uc != ((void *)0) ) failed: Created: 08/Feb/13  Updated: 22/Feb/13  Resolved: 22/Feb/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Technical task Priority: Major
Reporter: Keith Mannthey (Inactive) Assignee: Keith Mannthey (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

A patch pushed via git.


Rank (Obsolete): 6754

 Description   

While http://review.whamcloud.com/5222 was being tested, an assertion was triggered. So far this error has not been seen outside of this patch.

The test is sanity test_132.

As seen in the MDS logs on every occasion:

14:06:06:Lustre: DEBUG MARKER: mkdir -p /mnt/mds1
14:06:06:Lustre: DEBUG MARKER: test -b /dev/lvm-MDS/P1
14:06:06:Lustre: DEBUG MARKER: mkdir -p /mnt/mds1; mount -t lustre -o user_xattr,acl  		                   /dev/lvm-MDS/P1 /mnt/mds1
14:06:06:LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. quota=on. Opts: 
14:06:06:LustreError: 17859:0:(fld_index.c:201:fld_index_create()) srv-lustre-MDT0000: insert range [0x000000000000000c-0x0000000100000000):0:mdt failed: rc = -17
14:06:06:Lustre: lustre-MDT0000: used disk, loading
14:06:06:Lustre: 17859:0:(mdt_lproc.c:383:lprocfs_wr_identity_upcall()) lustre-MDT0000: identity upcall set to /usr/sbin/l_getidentity
14:06:06:Lustre: Increasing default stripe size to min 1048576
14:06:06:Lustre: Enabling SOM
14:06:06:LustreError: 11-0: lustre-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
14:06:06:Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
14:06:06:LNet: 18036:0:(debug.c:324:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
14:06:06:LNet: 18036:0:(debug.c:324:libcfs_debug_str2mask()) Skipped 1 previous similar message
14:06:06:Lustre: DEBUG MARKER: e2label /dev/lvm-MDS/P1 2>/dev/null
14:06:06:Lustre: 3644:0:(client.c:1845:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1360015559/real 1360015559]  req@ffff88002f949800 x1426077333348683/t0(0) o8->lustre-OST0006-osc-MDT0000@10.10.4.195@tcp:28/4 lens 400/544 e 0 to 1 dl 1360015564 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
14:06:17:LustreError: 11-0: lustre-OST0004-osc-MDT0000: Communicating with 10.10.4.195@tcp, operation ost_connect failed with -19.
14:06:18:Lustre: DEBUG MARKER: lctl get_param -n timeout
14:06:19:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Using TIMEOUT=20
14:06:19:Lustre: DEBUG MARKER: Using TIMEOUT=20
14:06:19:Lustre: DEBUG MARKER: lctl dl | grep ' IN osc ' 2>/dev/null | wc -l
14:06:19:Lustre: DEBUG MARKER: /usr/sbin/lctl conf_param lustre.sys.jobid_var=procname_uid
14:07:12:Lustre: MGS: haven't heard from client 2b7f8516-fc0a-afb9-790c-1965aaaa46c2 (at 10.10.4.197@tcp) in 50 seconds. I think it's dead, and I am evicting it. exp ffff880078f2e800, cur 1360015629 expire 1360015599 last 1360015579
14:07:23:Lustre: lustre-MDT0000: haven't heard from client 5fcc94dc-d9c0-7c5c-7665-6b8afe791bb0 (at 10.10.4.197@tcp) in 50 seconds. I think it's dead, and I am evicting it. exp ffff88007832ec00, cur 1360015634 expire 1360015604 last 1360015584
14:07:23:LustreError: 17820:0:(lu_object.c:1982:lu_ucred_assert()) ASSERTION( uc != ((void *)0) ) failed: 
14:07:23:LustreError: 17820:0:(lu_object.c:1982:lu_ucred_assert()) LBUG
14:07:23:Pid: 17820, comm: ll_evictor
14:07:23:
14:07:23:Call Trace:
14:07:23: [<ffffffffa04d7895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
14:07:23: [<ffffffffa04d7e97>] lbug_with_loc+0x47/0xb0 [libcfs]
14:07:23: [<ffffffffa0664755>] lu_ucred_assert+0x45/0x50 [obdclass]
14:07:23: [<ffffffffa0c52c66>] mdd_xattr_sanity_check+0x36/0x1f0 [mdd]
14:07:23: [<ffffffffa0c58221>] mdd_xattr_del+0xf1/0x540 [mdd]
14:07:23: [<ffffffffa0e3fe0a>] mdt_som_attr_set+0xfa/0x390 [mdt]
14:07:23: [<ffffffffa0e401ec>] mdt_ioepoch_close_on_eviction+0x14c/0x170 [mdt]
14:07:23: [<ffffffffa0f100c9>] ? osp_key_init+0x59/0x1a0 [osp]
14:07:23: [<ffffffffa0e40c4b>] mdt_ioepoch_close+0x2ab/0x3b0 [mdt]
14:07:23: [<ffffffffa0e411fe>] mdt_mfd_close+0x4ae/0x6e0 [mdt]
14:07:23: [<ffffffffa0e1297e>] mdt_obd_disconnect+0x3ae/0x4d0 [mdt]
14:07:23: [<ffffffffa061cd78>] class_fail_export+0x248/0x580 [obdclass]
14:07:23: [<ffffffffa07f9079>] ping_evictor_main+0x249/0x640 [ptlrpc]
14:07:23: [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
14:07:23: [<ffffffffa07f8e30>] ? ping_evictor_main+0x0/0x640 [ptlrpc]
14:07:23: [<ffffffff8100c0ca>] child_rip+0xa/0x20
14:07:23: [<ffffffffa07f8e30>] ? ping_evictor_main+0x0/0x640 [ptlrpc]
14:07:23: [<ffffffffa07f8e30>] ? ping_evictor_main+0x0/0x640 [ptlrpc]
14:07:23: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

The root cause looks to be the client locking up, but even if the client was in a death spin it should not bring the MDS down.
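
To make the MDS side concrete, here is a minimal userspace sketch of the pattern that fires; the real check is lu_ucred_assert() in lu_object.c, and the struct layouts and evictor_close() helper below are simplified stand-ins rather than the actual Lustre types. The ping evictor closes the dead client's ioepoch with an environment that carries no user credential, so the NULL check LBUGs:

#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct lu_ucred { unsigned int uc_uid; };          /* simplified credential  */
struct lu_env   { struct lu_ucred *le_ucred; };    /* simplified environment */

/* Mirrors the shape of lu_ucred_assert(): return the credential, LBUG if
 * none is attached to the environment. */
static struct lu_ucred *ucred_assert(const struct lu_env *env)
{
        assert(env->le_ucred != NULL);             /* ASSERTION( uc != NULL ) */
        return env->le_ucred;
}

/* Stand-in for the eviction close path seen in the stack above:
 * mdt_obd_disconnect -> mdt_mfd_close -> mdt_ioepoch_close_on_eviction
 *   -> mdt_som_attr_set -> mdd_xattr_del -> mdd_xattr_sanity_check */
static void evictor_close(const struct lu_env *env)
{
        struct lu_ucred *uc = ucred_assert(env);   /* fires for ll_evictor */
        printf("closing ioepoch as uid %u\n", uc->uc_uid);
}

int main(void)
{
        struct lu_env evictor_env = { .le_ucred = NULL }; /* evictor sets no creds */
        evictor_close(&evictor_env);                      /* assertion triggers    */
        return 0;
}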



 Comments   
Comment by Keith Mannthey (Inactive) [ 08/Feb/13 ]

The client was crashing with the following, and I believe this was part of the code changed by the patch (feel free to correct me if this is wrong).

14:06:24:Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n jobid_var
14:06:24:Lustre: DEBUG MARKER: cancel_lru_locks osc start
14:06:24:LustreError: 7850:0:(cl_lock.c:1111:cl_use_try()) ASSERTION( lock->cll_state == CLS_CACHED ) failed: 
14:06:24:LustreError: 7850:0:(cl_lock.c:1111:cl_use_try()) LBUG
14:06:24:Pid: 7850, comm: ll_close
14:06:24:
14:06:24:Call Trace:
14:06:24: [<ffffffffa042f895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
14:06:24: [<ffffffffa042fe97>] lbug_with_loc+0x47/0xb0 [libcfs]
14:06:24: [<ffffffffa0605c34>] cl_use_try+0x274/0x2e0 [obdclass]
14:06:24: [<ffffffffa0607580>] cl_lock_peek+0x110/0x180 [obdclass]
14:06:24: [<ffffffffa0b4d806>] cl_local_size+0x266/0x2e0 [lustre]
14:06:24: [<ffffffffa0b12a06>] ll_done_writing_attr+0x46/0x190 [lustre]
14:06:24: [<ffffffffa0b12e49>] ll_ioepoch_close+0x2f9/0x5b0 [lustre]
14:06:24: [<ffffffffa0b14710>] ll_close_thread+0x2a0/0x1030 [lustre]
14:06:24: [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
14:06:24: [<ffffffffa0b14470>] ? ll_close_thread+0x0/0x1030 [lustre]
14:06:24: [<ffffffff8100c0ca>] child_rip+0xa/0x20
14:06:24: [<ffffffffa0b14470>] ? ll_close_thread+0x0/0x1030 [lustre]
14:06:24: [<ffffffffa0b14470>] ? ll_close_thread+0x0/0x1030 [lustre]
14:06:24: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
14:06:24:
14:06:24:Kernel panic - not syncing: LBUG

So basically a client crashed, and one timeout later the MDS tripped over an assert and rebooted.
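
For reference, a minimal sketch of the client-side precondition that tripped first; the enum values and the lock_use_try() helper below are simplified stand-ins for the cl_lock state machine in cl_lock.c, not the real definitions. cl_use_try() may only reactivate a lock that is currently cached, so handing it a lock in any other state hits the assertion and panics the node:

#include <assert.h>

/* Illustrative subset of the client lock states. */
enum cll_state { CLS_NEW, CLS_ENQUEUED, CLS_HELD, CLS_CACHED, CLS_FREEING };

struct cl_lock { enum cll_state cll_state; };

/* Stand-in for cl_use_try(): reactivate a lock that is sitting in the cache. */
static void lock_use_try(struct cl_lock *lock)
{
        /* ASSERTION( lock->cll_state == CLS_CACHED ) */
        assert(lock->cll_state == CLS_CACHED);
        lock->cll_state = CLS_HELD;
}

int main(void)
{
        struct cl_lock lock = { .cll_state = CLS_ENQUEUED }; /* unexpected state  */
        lock_use_try(&lock);                                 /* panics the client */
        return 0;
}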

Comment by Keith Mannthey (Inactive) [ 22/Feb/13 ]

A do-while loop was removed from the cl_lock_peek code in the following version of the patch. This error should not be hit again; it was related to the earlier patch set.
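
Purely as an illustration of the kind of loop described above (the actual change is in the later patch sets on http://review.whamcloud.com/5222 and is not reproduced here): a peek helper that retries its use_try call in a do-while can re-enter the CLS_CACHED precondition after the lock's state has moved on. The types are the same simplified stand-ins as in the previous sketch:

#include <assert.h>

enum cll_state  { CLS_ENQUEUED, CLS_HELD, CLS_CACHED };
enum try_result { TRY_OK, TRY_REPEAT };

struct cl_lock {
        enum cll_state cll_state;
        int            cll_contended;  /* someone else holds the lock right now */
};

static enum try_result lock_use_try(struct cl_lock *lock)
{
        if (lock->cll_contended)
                return TRY_REPEAT;                 /* caller is asked to retry */
        assert(lock->cll_state == CLS_CACHED);     /* the precondition above   */
        lock->cll_state = CLS_HELD;
        return TRY_OK;
}

/* Every pass through the do-while re-enters lock_use_try() and re-checks the
 * CLS_CACHED precondition; if the state changed between iterations, the
 * assertion fires. Dropping the loop removes that window. */
static void lock_peek(struct cl_lock *lock)
{
        enum try_result rc;

        do {
                rc = lock_use_try(lock);
        } while (rc == TRY_REPEAT);
}

int main(void)
{
        struct cl_lock lock = { .cll_state = CLS_CACHED, .cll_contended = 0 };
        lock_peek(&lock);
        return 0;
}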

Comment by Keith Mannthey (Inactive) [ 22/Feb/13 ]

Newer versions of the patch do not have this issue.
