
Sanity test_132 (lu_object.c:1982:lu_ucred_assert()) ASSERTION( uc != ((void *)0) ) failed:

Details

    • Technical task
    • Resolution: Cannot Reproduce
    • Major
    • None
    • None
    • None
    • A patch pushed via git.
    • 6754

    Description

      While http://review.whamcloud.com/5222 was being tested, an assertion was triggered. So far this error has not been seen outside of this patch.

      The failing test is sanity test_132.

      The following was seen in the MDS logs on all occasions:

      14:06:06:Lustre: DEBUG MARKER: mkdir -p /mnt/mds1
      14:06:06:Lustre: DEBUG MARKER: test -b /dev/lvm-MDS/P1
      14:06:06:Lustre: DEBUG MARKER: mkdir -p /mnt/mds1; mount -t lustre -o user_xattr,acl  		                   /dev/lvm-MDS/P1 /mnt/mds1
      14:06:06:LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. quota=on. Opts: 
      14:06:06:LustreError: 17859:0:(fld_index.c:201:fld_index_create()) srv-lustre-MDT0000: insert range [0x000000000000000c-0x0000000100000000):0:mdt failed: rc = -17
      14:06:06:Lustre: lustre-MDT0000: used disk, loading
      14:06:06:Lustre: 17859:0:(mdt_lproc.c:383:lprocfs_wr_identity_upcall()) lustre-MDT0000: identity upcall set to /usr/sbin/l_getidentity
      14:06:06:Lustre: Increasing default stripe size to min 1048576
      14:06:06:Lustre: Enabling SOM
      14:06:06:LustreError: 11-0: lustre-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
      14:06:06:Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
      14:06:06:LNet: 18036:0:(debug.c:324:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
      14:06:06:LNet: 18036:0:(debug.c:324:libcfs_debug_str2mask()) Skipped 1 previous similar message
      14:06:06:Lustre: DEBUG MARKER: e2label /dev/lvm-MDS/P1 2>/dev/null
      14:06:06:Lustre: 3644:0:(client.c:1845:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1360015559/real 1360015559]  req@ffff88002f949800 x1426077333348683/t0(0) o8->lustre-OST0006-osc-MDT0000@10.10.4.195@tcp:28/4 lens 400/544 e 0 to 1 dl 1360015564 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      14:06:17:LustreError: 11-0: lustre-OST0004-osc-MDT0000: Communicating with 10.10.4.195@tcp, operation ost_connect failed with -19.
      14:06:18:Lustre: DEBUG MARKER: lctl get_param -n timeout
      14:06:19:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Using TIMEOUT=20
      14:06:19:Lustre: DEBUG MARKER: Using TIMEOUT=20
      14:06:19:Lustre: DEBUG MARKER: lctl dl | grep ' IN osc ' 2>/dev/null | wc -l
      14:06:19:Lustre: DEBUG MARKER: /usr/sbin/lctl conf_param lustre.sys.jobid_var=procname_uid
      14:07:12:Lustre: MGS: haven't heard from client 2b7f8516-fc0a-afb9-790c-1965aaaa46c2 (at 10.10.4.197@tcp) in 50 seconds. I think it's dead, and I am evicting it. exp ffff880078f2e800, cur 1360015629 expire 1360015599 last 1360015579
      14:07:23:Lustre: lustre-MDT0000: haven't heard from client 5fcc94dc-d9c0-7c5c-7665-6b8afe791bb0 (at 10.10.4.197@tcp) in 50 seconds. I think it's dead, and I am evicting it. exp ffff88007832ec00, cur 1360015634 expire 1360015604 last 1360015584
      14:07:23:LustreError: 17820:0:(lu_object.c:1982:lu_ucred_assert()) ASSERTION( uc != ((void *)0) ) failed: 
      14:07:23:LustreError: 17820:0:(lu_object.c:1982:lu_ucred_assert()) LBUG
      14:07:23:Pid: 17820, comm: ll_evictor
      14:07:23:
      14:07:23:Call Trace:
      14:07:23: [<ffffffffa04d7895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      14:07:23: [<ffffffffa04d7e97>] lbug_with_loc+0x47/0xb0 [libcfs]
      14:07:23: [<ffffffffa0664755>] lu_ucred_assert+0x45/0x50 [obdclass]
      14:07:23: [<ffffffffa0c52c66>] mdd_xattr_sanity_check+0x36/0x1f0 [mdd]
      14:07:23: [<ffffffffa0c58221>] mdd_xattr_del+0xf1/0x540 [mdd]
      14:07:23: [<ffffffffa0e3fe0a>] mdt_som_attr_set+0xfa/0x390 [mdt]
      14:07:23: [<ffffffffa0e401ec>] mdt_ioepoch_close_on_eviction+0x14c/0x170 [mdt]
      14:07:23: [<ffffffffa0f100c9>] ? osp_key_init+0x59/0x1a0 [osp]
      14:07:23: [<ffffffffa0e40c4b>] mdt_ioepoch_close+0x2ab/0x3b0 [mdt]
      14:07:23: [<ffffffffa0e411fe>] mdt_mfd_close+0x4ae/0x6e0 [mdt]
      14:07:23: [<ffffffffa0e1297e>] mdt_obd_disconnect+0x3ae/0x4d0 [mdt]
      14:07:23: [<ffffffffa061cd78>] class_fail_export+0x248/0x580 [obdclass]
      14:07:23: [<ffffffffa07f9079>] ping_evictor_main+0x249/0x640 [ptlrpc]
      14:07:23: [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
      14:07:23: [<ffffffffa07f8e30>] ? ping_evictor_main+0x0/0x640 [ptlrpc]
      14:07:23: [<ffffffff8100c0ca>] child_rip+0xa/0x20
      14:07:23: [<ffffffffa07f8e30>] ? ping_evictor_main+0x0/0x640 [ptlrpc]
      14:07:23: [<ffffffffa07f8e30>] ? ping_evictor_main+0x0/0x640 [ptlrpc]
      14:07:23: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      

      The root cause looks to be the client locking up, but even if the client was in a death spin it should not bring the MDS down.
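      For context, the check that fires on the MDS is lu_ucred_assert(), which LBUGs when the per-thread user credential in the lu_env has never been initialized; the ll_evictor thread in the trace is an internal service thread, not a request-handling thread, so no credential was set before mdd_xattr_sanity_check() demanded one. The following is a minimal userland sketch of that pattern, not the kernel code: the structure layouts and helper names (ucred_check, ucred_assert, le_ucred, uc_valid) are hypothetical stand-ins.

      ```c
      #include <assert.h>
      #include <stdio.h>

      /* Hypothetical stand-ins for Lustre's lu_env / lu_ucred; the real
       * definitions live in the obdclass layer (lu_object.c). */
      struct lu_ucred {
          unsigned int uc_uid;
          int uc_valid;              /* set once the credential is filled in */
      };

      struct lu_env {
          struct lu_ucred *le_ucred; /* NULL until request handling sets it */
      };

      /* Mirrors the shape of a lu_ucred_check(): return the credential
       * only if it has been initialized, otherwise NULL. */
      static struct lu_ucred *ucred_check(const struct lu_env *env)
      {
          if (env->le_ucred != NULL && env->le_ucred->uc_valid)
              return env->le_ucred;
          return NULL;
      }

      /* Mirrors the shape of lu_ucred_assert(): callers such as
       * mdd_xattr_sanity_check() require an initialized credential and
       * crash if it is missing (kernel equivalent: LASSERT + LBUG). */
      static struct lu_ucred *ucred_assert(const struct lu_env *env)
      {
          struct lu_ucred *uc = ucred_check(env);
          assert(uc != NULL);
          return uc;
      }

      int main(void)
      {
          struct lu_ucred cred = { .uc_uid = 0, .uc_valid = 1 };
          struct lu_env env = { .le_ucred = &cred };

          /* A normal request thread has a credential, so this succeeds. */
          printf("uid=%u\n", ucred_assert(&env)->uc_uid);

          /* An internal thread like ll_evictor would have env.le_ucred
           * == NULL, and ucred_assert() would abort here instead. */
          return 0;
      }
      ```

      The sketch only illustrates why an eviction path that reaches the xattr code without setting up a credential trips the assertion, regardless of what the client did.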

      Attachments

        Activity

          [LU-2788] Sanity test_132 (lu_object.c:1982:lu_ucred_assert()) ASSERTION( uc != ((void *)0) ) failed:

          Newer versions of the patch do not have this issue.

          Keith Mannthey (Inactive) added a comment

          A do-while loop was removed from the cl_lock_peek code in the following version of the patch. This error should not be hit again; it was related to the earlier patch set.

          Keith Mannthey (Inactive) added a comment

          The client was crashing with the following, and I believe this was part of the code changed in the patch (feel free to correct me if this is wrong).

          14:06:24:Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n jobid_var
          14:06:24:Lustre: DEBUG MARKER: cancel_lru_locks osc start
          14:06:24:LustreError: 7850:0:(cl_lock.c:1111:cl_use_try()) ASSERTION( lock->cll_state == CLS_CACHED ) failed: 
          14:06:24:LustreError: 7850:0:(cl_lock.c:1111:cl_use_try()) LBUG
          14:06:24:Pid: 7850, comm: ll_close
          14:06:24:
          14:06:24:Call Trace:
          14:06:24: [<ffffffffa042f895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
          14:06:24: [<ffffffffa042fe97>] lbug_with_loc+0x47/0xb0 [libcfs]
          14:06:24: [<ffffffffa0605c34>] cl_use_try+0x274/0x2e0 [obdclass]
          14:06:24: [<ffffffffa0607580>] cl_lock_peek+0x110/0x180 [obdclass]
          14:06:24: [<ffffffffa0b4d806>] cl_local_size+0x266/0x2e0 [lustre]
          14:06:24: [<ffffffffa0b12a06>] ll_done_writing_attr+0x46/0x190 [lustre]
          14:06:24: [<ffffffffa0b12e49>] ll_ioepoch_close+0x2f9/0x5b0 [lustre]
          14:06:24: [<ffffffffa0b14710>] ll_close_thread+0x2a0/0x1030 [lustre]
          14:06:24: [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
          14:06:24: [<ffffffffa0b14470>] ? ll_close_thread+0x0/0x1030 [lustre]
          14:06:24: [<ffffffff8100c0ca>] child_rip+0xa/0x20
          14:06:24: [<ffffffffa0b14470>] ? ll_close_thread+0x0/0x1030 [lustre]
          14:06:24: [<ffffffffa0b14470>] ? ll_close_thread+0x0/0x1030 [lustre]
          14:06:24: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
          14:06:24:
          14:06:24:Kernel panic - not syncing: LBUG
          

          So basically a client crashed, and then one timeout later the MDS tripped over an assert and rebooted.
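          The client-side assertion in cl_use_try() is a cl_lock state-machine check: a lock may only be re-activated from the cached state, so a retry loop in cl_lock_peek() that calls cl_use_try() again on a lock that has already left CLS_CACHED trips the LBUG. A minimal sketch of that kind of check follows; the enum names echo Lustre's cl_lock states, but use_try() and the struct layout are simplified stand-ins, not the real client code.

          ```c
          #include <assert.h>
          #include <stdio.h>

          /* Named after Lustre's cl_lock states; values are illustrative. */
          enum cl_lock_state {
              CLS_NEW,
              CLS_QUEUING,
              CLS_ENQUEUED,
              CLS_HELD,
              CLS_CACHED,
              CLS_FREEING,
          };

          struct cl_lock {
              enum cl_lock_state cll_state;
          };

          /* Hypothetical stand-in for the check in cl_use_try(): only a
           * cached lock may be re-activated; anything else trips the
           * assertion (the kernel's LASSERT + LBUG). */
          static void use_try(struct cl_lock *lock)
          {
              assert(lock->cll_state == CLS_CACHED);
              lock->cll_state = CLS_HELD;
          }

          int main(void)
          {
              struct cl_lock lock = { .cll_state = CLS_CACHED };

              /* Safe path: the peek found a cached lock and re-uses it. */
              use_try(&lock);
              printf("state=%d\n", lock.cll_state == CLS_HELD);

              /* The hazard: a retry loop that called use_try() again after
               * the lock had left CLS_CACHED (e.g. while cancel_lru_locks
               * was running, as in the trace above) would abort here. */
              return 0;
          }
          ```

          This is consistent with the fix described above: removing the do-while retry from cl_lock_peek() removes the second call on a lock whose state may have changed underneath it.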

          Keith Mannthey (Inactive) added a comment

          People

            keith Keith Mannthey (Inactive)
            keith Keith Mannthey (Inactive)
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved: