Details
- Technical task
- Resolution: Cannot Reproduce
- Major
- None
- None
- None
- A patch pushed via git.
- 6754
Description
While http://review.whamcloud.com/5222 was being tested, an assertion was triggered. So far this error has not been seen outside of this patch.
The failing test is sanity test 132.
The following was seen in the MDS logs on every occurrence:
14:06:06:Lustre: DEBUG MARKER: mkdir -p /mnt/mds1
14:06:06:Lustre: DEBUG MARKER: test -b /dev/lvm-MDS/P1
14:06:06:Lustre: DEBUG MARKER: mkdir -p /mnt/mds1; mount -t lustre -o user_xattr,acl /dev/lvm-MDS/P1 /mnt/mds1
14:06:06:LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. quota=on. Opts:
14:06:06:LustreError: 17859:0:(fld_index.c:201:fld_index_create()) srv-lustre-MDT0000: insert range [0x000000000000000c-0x0000000100000000):0:mdt failed: rc = -17
14:06:06:Lustre: lustre-MDT0000: used disk, loading
14:06:06:Lustre: 17859:0:(mdt_lproc.c:383:lprocfs_wr_identity_upcall()) lustre-MDT0000: identity upcall set to /usr/sbin/l_getidentity
14:06:06:Lustre: Increasing default stripe size to min 1048576
14:06:06:Lustre: Enabling SOM
14:06:06:LustreError: 11-0: lustre-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
14:06:06:Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
14:06:06:LNet: 18036:0:(debug.c:324:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
14:06:06:LNet: 18036:0:(debug.c:324:libcfs_debug_str2mask()) Skipped 1 previous similar message
14:06:06:Lustre: DEBUG MARKER: e2label /dev/lvm-MDS/P1 2>/dev/null
14:06:06:Lustre: 3644:0:(client.c:1845:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1360015559/real 1360015559] req@ffff88002f949800 x1426077333348683/t0(0) o8->lustre-OST0006-osc-MDT0000@10.10.4.195@tcp:28/4 lens 400/544 e 0 to 1 dl 1360015564 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
14:06:17:LustreError: 11-0: lustre-OST0004-osc-MDT0000: Communicating with 10.10.4.195@tcp, operation ost_connect failed with -19.
14:06:18:Lustre: DEBUG MARKER: lctl get_param -n timeout
14:06:19:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Using TIMEOUT=20
14:06:19:Lustre: DEBUG MARKER: Using TIMEOUT=20
14:06:19:Lustre: DEBUG MARKER: lctl dl | grep ' IN osc ' 2>/dev/null | wc -l
14:06:19:Lustre: DEBUG MARKER: /usr/sbin/lctl conf_param lustre.sys.jobid_var=procname_uid
14:07:12:Lustre: MGS: haven't heard from client 2b7f8516-fc0a-afb9-790c-1965aaaa46c2 (at 10.10.4.197@tcp) in 50 seconds. I think it's dead, and I am evicting it. exp ffff880078f2e800, cur 1360015629 expire 1360015599 last 1360015579
14:07:23:Lustre: lustre-MDT0000: haven't heard from client 5fcc94dc-d9c0-7c5c-7665-6b8afe791bb0 (at 10.10.4.197@tcp) in 50 seconds. I think it's dead, and I am evicting it. exp ffff88007832ec00, cur 1360015634 expire 1360015604 last 1360015584
14:07:23:LustreError: 17820:0:(lu_object.c:1982:lu_ucred_assert()) ASSERTION( uc != ((void *)0) ) failed:
14:07:23:LustreError: 17820:0:(lu_object.c:1982:lu_ucred_assert()) LBUG
14:07:23:Pid: 17820, comm: ll_evictor
14:07:23:
14:07:23:Call Trace:
14:07:23: [<ffffffffa04d7895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
14:07:23: [<ffffffffa04d7e97>] lbug_with_loc+0x47/0xb0 [libcfs]
14:07:23: [<ffffffffa0664755>] lu_ucred_assert+0x45/0x50 [obdclass]
14:07:23: [<ffffffffa0c52c66>] mdd_xattr_sanity_check+0x36/0x1f0 [mdd]
14:07:23: [<ffffffffa0c58221>] mdd_xattr_del+0xf1/0x540 [mdd]
14:07:23: [<ffffffffa0e3fe0a>] mdt_som_attr_set+0xfa/0x390 [mdt]
14:07:23: [<ffffffffa0e401ec>] mdt_ioepoch_close_on_eviction+0x14c/0x170 [mdt]
14:07:23: [<ffffffffa0f100c9>] ? osp_key_init+0x59/0x1a0 [osp]
14:07:23: [<ffffffffa0e40c4b>] mdt_ioepoch_close+0x2ab/0x3b0 [mdt]
14:07:23: [<ffffffffa0e411fe>] mdt_mfd_close+0x4ae/0x6e0 [mdt]
14:07:23: [<ffffffffa0e1297e>] mdt_obd_disconnect+0x3ae/0x4d0 [mdt]
14:07:23: [<ffffffffa061cd78>] class_fail_export+0x248/0x580 [obdclass]
14:07:23: [<ffffffffa07f9079>] ping_evictor_main+0x249/0x640 [ptlrpc]
14:07:23: [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
14:07:23: [<ffffffffa07f8e30>] ? ping_evictor_main+0x0/0x640 [ptlrpc]
14:07:23: [<ffffffff8100c0ca>] child_rip+0xa/0x20
14:07:23: [<ffffffffa07f8e30>] ? ping_evictor_main+0x0/0x640 [ptlrpc]
14:07:23: [<ffffffffa07f8e30>] ? ping_evictor_main+0x0/0x640 [ptlrpc]
14:07:23: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
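For reference, this is roughly what the failing check amounts to. The sketch below is simplified and illustrative, not the actual lu_object.c source: lu_ucred() fetches the user credential attached to the thread's lu_env, and lu_ucred_assert() LBUGs if none is present. Request-handler threads get a credential from the incoming RPC, but here the disconnect is driven locally by the ll_evictor (ping evictor) thread, whose env never had a credential stored in it.

        /* Simplified sketch of the failing assertion (not the actual
         * Lustre source; names and layout are illustrative). */
        struct lu_ucred *lu_ucred_assert(const struct lu_env *env)
        {
                /* credential attached to this thread's env, if any */
                struct lu_ucred *uc = lu_ucred(env);

                /* On the ll_evictor thread nothing ever stored a credential,
                 * so uc is NULL and the assertion fires:
                 *   ASSERTION( uc != ((void *)0) ) failed  ->  LBUG */
                LASSERT(uc != NULL);
                return uc;
        }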
The root cause appears to be the client locking up, but even if the client were in a death spin it should not bring the MDS down.
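Purely to illustrate that point (this is a hypothetical guard, not the change made in newer revisions of the patch, and mdd_xattr_sanity_check_safe is an invented name), the eviction path could check for a missing credential and fail the single operation instead of asserting:

        /* Hypothetical defensive check, for illustration only: if the env
         * carries no user credential (as on the ll_evictor thread), return
         * an error for this operation rather than LBUGging the MDS. */
        static int mdd_xattr_sanity_check_safe(const struct lu_env *env,
                                               struct mdd_object *obj)
        {
                struct lu_ucred *uc = lu_ucred(env);    /* may be NULL here */

                if (uc == NULL)
                        return -EPROTO; /* fail this request, keep the server up */

                /* ... normal ownership/permission checks using uc ... */
                return 0;
        }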
Newer versions of the patch do not have this issue.