coral-beta-combined build 134 (osd_object.c:745:osd_attr_get()) ASSERTION( obj->oo_db ) failed

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.10.0
    • Lustre 2.9.0
    • 1

    Description

      Running Lustre 2.9 + coral-beta-combined branch based on RC3:

      IOR tests:
      IOR-3.0.1: MPI Coordinated Test of Parallel I/O

      Began: Thu Mar 30 00:07:33 2017
      Command line used: /home/johnsali/wolf-3/ior/src/ior -a POSIX -F -N 4 -d 2 -i 1 -s 1024 -b 1m -t 1m
      Machine: Linux wolf-6.wolf.hpdd.intel.com

      Test 0 started: Thu Mar 30 00:07:33 2017
      Summary:
      api = POSIX
      test filename = testFile
      access = file-per-process
      ordering in a file = sequential offsets
      ordering inter file= no tasks offsets
      clients = 4 (1 per node)
      repetitions = 1
      xfersize = 1 MiB
      blocksize = 1 MiB
      aggregate filesize = 4 GiB

      access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
      ------ --------- ---------- --------- -------- -------- -------- -------- ----
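As a quick consistency check on the summary above, the aggregate file size follows from the command-line flags: 4 tasks (-N 4), each writing 1024 segments (-s 1024) of one 1 MiB block (-b 1m) in 1 MiB transfers (-t 1m). The sketch below only redoes that arithmetic; the variable names are illustrative and are not IOR's own.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

num_tasks = 4             # -N 4
segment_count = 1024      # -s 1024
block_size = 1 * MIB      # -b 1m
transfer_size = 1 * MIB   # -t 1m

# With file-per-process access, each task writes its own file.
per_file_bytes = segment_count * block_size
aggregate_bytes = num_tasks * per_file_bytes

# Number of write calls each task issues for its file.
transfers_per_file = per_file_bytes // transfer_size

print(aggregate_bytes // GIB, "GiB aggregate,", transfers_per_file, "transfers per file")
```

This matches the "aggregate filesize = 4 GiB" line that IOR reported.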

      While IOR was writing we hit the following error:

      [19744.556366] LustreError: 84625:0:(osd_object.c:597:osd_object_destroy()) lsdraid-OST0000: failed to remove [0x100000000:0x1c:0x0] from accounting ZAP for usr 0: rc = -5
      [19744.580303] LustreError: 84625:0:(osd_object.c:597:osd_object_destroy()) Skipped 1 previous similar message
      [19745.014350] LustreError: 84625:0:(osd_object.c:603:osd_object_destroy()) lsdraid-OST0000: failed to remove [0x100000000:0x1c:0x0] from accounting ZAP for grp 0: rc = -5
      [19745.037113] LustreError: 84625:0:(osd_object.c:603:osd_object_destroy()) Skipped 2 previous similar messages
      [19768.423554] LustreError: 84625:0:(osd_object.c:597:osd_object_destroy()) lsdraid-OST0000: failed to remove [0x100000000:0x1f:0x0] from accounting ZAP for usr 0: rc = -52
      [19768.586567] LustreError: 84625:0:(osd_object.c:603:osd_object_destroy()) lsdraid-OST0000: failed to remove [0x100000000:0x1f:0x0] from accounting ZAP for grp 0: rc = -52
      [19779.750997] LustreError: 52432:0:(osd_object.c:745:osd_attr_get()) ASSERTION( obj->oo_db ) failed: 
      [19779.751007] LustreError: 50225:0:(osd_object.c:745:osd_attr_get()) ASSERTION( obj->oo_db ) failed: 
      [19779.751010] LustreError: 50225:0:(osd_object.c:745:osd_attr_get()) LBUG
      [19779.751012] Pid: 50225, comm: ll_ost01_002
      [19779.751012] 
      Call Trace:
      [19779.751043]  [<ffffffffa0a1b7d3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
      [19779.751054]  [<ffffffffa0a1b841>] lbug_with_loc+0x41/0xb0 [libcfs]
      [19779.751072]  [<ffffffffa0968210>] osd_attr_set+0x0/0xce0 [osd_zfs]
      [19779.751096]  [<ffffffffa0f1b405>] ofd_attr_get+0xa5/0x230 [ofd]
      [19779.751111]  [<ffffffffa0f29bfd>] ofd_lvbo_init+0x42d/0xb02 [ofd]
      [19779.751248]  [<ffffffffa0cd22d9>] ldlm_handle_enqueue0+0x8f9/0x1680 [ptlrpc]
      [19779.751322]  [<ffffffffa0cfa0f0>] ? lustre_swab_ldlm_request+0x0/0x30 [ptlrpc]
      [19779.751407]  [<ffffffffa0d52dc2>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [19779.751483]  [<ffffffffa0d57225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
      [19779.751545]  [<ffffffffa0d031ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
      [19779.751563]  [<ffffffffa0a28128>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
      [19779.751621]  [<ffffffffa0d00d68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      [19779.751635]  [<ffffffff810b8952>] ? default_wake_function+0x12/0x20
      [19779.751639]  [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90
      [19779.751708]  [<ffffffffa0d07260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
      [19779.751765]  [<ffffffffa0d067c0>] ? ptlrpc_main+0x0/0x1de0 [ptlrpc]
      [19779.751775]  [<ffffffff810a5b8f>] kthread+0xcf/0xe0
      [19779.751779]  [<ffffffff810a5ac0>] ? kthread+0x0/0xe0
      [19779.751789]  [<ffffffff81646a98>] ret_from_fork+0x58/0x90
      [19779.751794]  [<ffffffff810a5ac0>] ? kthread+0x0/0xe0
      [19779.751795] 
      [19779.751797] Kernel panic - not syncing: LBUG
      [19779.751801] CPU: 26 PID: 50225 Comm: ll_ost01_002 Tainted: G          IOE  ------------   3.10.0-327.36.3.el7.x86_64 #1
      [19779.751803] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
      [19779.751813]  ffffffffa0a38d4c 00000000e5fc8e4d ffff880fe9b33a78 ffffffff81636431
      [19779.751820]  ffff880fe9b33af8 ffffffff8162fcc0 ffffffff00000008 ffff880fe9b33b08
      [19779.751827]  ffff880fe9b33aa8 00000000e5fc8e4d 00000000e5fc8e4d 0000000000000092
      [19779.751828] Call Trace:
      [19779.751843]  [<ffffffff81636431>] dump_stack+0x19/0x1b
      [19779.751847]  [<ffffffff8162fcc0>] panic+0xd8/0x1e7
      [19779.751859]  [<ffffffffa0a1b859>] lbug_with_loc+0x59/0xb0 [libcfs]
      [19779.751871]  [<ffffffffa0968210>] osd_attr_get+0x2d0/0x2d0 [osd_zfs]
      [19779.751885]  [<ffffffffa0f1b405>] ofd_attr_get+0xa5/0x230 [ofd]
      [19779.751898]  [<ffffffffa0f29bfd>] ofd_lvbo_init+0x42d/0xb02 [ofd]
      [19779.751952]  [<ffffffffa0cd22d9>] ldlm_handle_enqueue0+0x8f9/0x1680 [ptlrpc]
      [19779.752010]  [<ffffffffa0cfa0f0>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
      [19779.752084]  [<ffffffffa0d52dc2>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [19779.752165]  [<ffffffffa0d57225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
      [19779.752238]  [<ffffffffa0d031ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
      [19779.752255]  [<ffffffffa0a28128>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
      [19779.752326]  [<ffffffffa0d00d68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      [19779.752333]  [<ffffffff810b8952>] ? default_wake_function+0x12/0x20
      [19779.752337]  [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90
      [19779.752409]  [<ffffffffa0d07260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
      [19779.752482]  [<ffffffffa0d067c0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
      [19779.752489]  [<ffffffff810a5b8f>] kthread+0xcf/0xe0
      [19779.752494]  [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
      [19779.752500]  [<ffffffff81646a98>] ret_from_fork+0x58/0x90
      [19779.752505]  [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140 
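When triaging a console log like the one above, it can help to pull out the reporting function, the source line, and the return code from each LustreError line. The following is a minimal sketch, assuming the line layout seen in this log; the regular expressions and the `parse_line` helper are hypothetical and not part of any Lustre tool. The errno names are the Linux values (5 is EIO, 52 is EBADE). As an assumption worth checking against the ZFS build in use, ZFS on Linux has historically reported checksum errors as EBADE.

```python
import re

# Matches e.g. "[19744.556366] LustreError: 84625:0:(osd_object.c:597:osd_object_destroy()) msg"
LINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+LustreError:\s+"
    r"(?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s*"
    r"(?P<msg>.*)"
)
RC_RE = re.compile(r"rc = (-?\d+)\s*$")

# Only the two values that appear in this log; Linux numbering.
LINUX_ERRNO = {5: "EIO", 52: "EBADE"}


def parse_line(text):
    """Return a dict of fields for a LustreError line, or None if it does not match."""
    m = LINE_RE.search(text)
    if m is None:
        return None
    rec = m.groupdict()
    rec["ts"] = float(rec["ts"])
    rec["pid"] = int(rec["pid"])
    rec["line"] = int(rec["line"])
    rc = RC_RE.search(rec["msg"])
    rec["rc"] = int(rc.group(1)) if rc else None
    rec["errno"] = LINUX_ERRNO.get(-rec["rc"]) if rec["rc"] is not None else None
    return rec


SAMPLE = [
    "[19744.556366] LustreError: 84625:0:(osd_object.c:597:osd_object_destroy()) "
    "lsdraid-OST0000: failed to remove [0x100000000:0x1c:0x0] from accounting ZAP for usr 0: rc = -5",
    "[19768.423554] LustreError: 84625:0:(osd_object.c:597:osd_object_destroy()) "
    "lsdraid-OST0000: failed to remove [0x100000000:0x1f:0x0] from accounting ZAP for usr 0: rc = -52",
    "[19779.751007] LustreError: 50225:0:(osd_object.c:745:osd_attr_get()) ASSERTION( obj->oo_db ) failed: ",
]

records = [parse_line(s) for s in SAMPLE]
for r in records:
    print(r["ts"], r["func"], r["line"], r["rc"], r["errno"])
```

Read this way, the log shows two accounting-ZAP removal failures with different return codes in osd_object_destroy() before the assertion fires in osd_attr_get() on a different thread.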
      

      osd-zfs/osd_object.c:937

      932	 * dmu_tx_hold_bonus(tx, oid) called and then assigned
      933	 * to a transaction group.
      934	 */
      935	static int osd_attr_set(const struct lu_env *env, struct dt_object *dt,
      936				const struct lu_attr *la, struct thandle *handle)
      937	{
      938		struct osd_thread_info	*info = osd_oti_get(env);
      939		sa_bulk_attr_t		*bulk = osd_oti_get(env)->oti_attr_bulk;
      940		struct osd_object	*obj = osd_dt_obj(dt);
      941		struct osd_device	*osd = osd_obj2dev(obj);
      

      ofd/ofd_objects.c:780

      775	 * \retval		0 if successful
      776	 * \retval		negative value on error
      777	 */
      778	int ofd_attr_get(const struct lu_env *env, struct ofd_object *fo,
      779			 struct lu_attr *la)
      780	{
      781		int rc = 0;
      782	
      783		ENTRY;
      784
      

      Dump is at:
      /scratch/dumps/wolf-3.wolf.hpdd.intel.com/10.8.1.3-2017-03-30-00:08:02/

          Activity

            pjones Peter Jones added a comment - Landed for 2.10

            gerrit Gerrit Updater added a comment -
            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26617/
            Subject: LU-9280 osd-zfs: don't mark existing on failed creation
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 80c9ba8d4070c6c106afd0362d2503324c7d0e99

            niu Niu Yawei (Inactive) added a comment - ported to b2_9: https://review.whamcloud.com/26653

            gerrit Gerrit Updater added a comment -
            Niu Yawei (yawei.niu@intel.com) uploaded a new patch: https://review.whamcloud.com/26653
            Subject: LU-9280 osd-zfs: don't mark existing on failed creation
            Project: fs/lustre-release
            Branch: b2_9
            Current Patch Set: 1
            Commit: f35a9387825e97785a81f12378d6bae3283534d7

            niu Niu Yawei (Inactive) added a comment - There is a defect in osd_object_create() that is likely related to this bug. I pushed a patch to master for review; once it has passed review, I'll backport it to b2_9.
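
The patch subject, "don't mark existing on failed creation", suggests the following failure pattern: a create path flags the object as existing before the backend allocation has succeeded, so a failed create leaves behind an object that claims to exist but has no backing handle, and a later attribute read trips an assertion like the one in this ticket. The toy model below illustrates that pattern only. It is not the osd-zfs code, and every name in it (`ToyObject`, `backend_alloc`, `create_buggy`, `create_fixed`, `attr_get`) is hypothetical.

```python
class ToyObject:
    def __init__(self):
        self.exists = False  # stands in for an "object exists" flag
        self.db = None       # stands in for the backing handle (cf. oo_db)


def backend_alloc(fail):
    """Pretend to allocate backend storage; optionally fail like an I/O error."""
    if fail:
        raise OSError(5, "simulated I/O error")
    return object()


def create_buggy(obj, fail=False):
    obj.exists = True             # flag set before the allocation has succeeded
    obj.db = backend_alloc(fail)  # on failure, exists stays True and db stays None


def create_fixed(obj, fail=False):
    obj.db = backend_alloc(fail)  # may raise; the flag is untouched on failure
    obj.exists = True             # only mark existing once creation succeeded


def attr_get(obj):
    if not obj.exists:
        return None  # caller sees "no such object"
    assert obj.db is not None, "existing object must have a backing handle"
    return {"size": 0}


def try_create(create, fail):
    """Run a create function and report the resulting object and any error."""
    obj = ToyObject()
    try:
        create(obj, fail=fail)
        err = None
    except OSError as e:
        err = e
    return obj, err
```

In this model, a failed `create_buggy` followed by `attr_get` raises the assertion, while a failed `create_fixed` leaves the object marked as not existing, so `attr_get` returns `None` instead.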

            gerrit Gerrit Updater added a comment -
            Niu Yawei (yawei.niu@intel.com) uploaded a new patch: https://review.whamcloud.com/26617
            Subject: LU-9280 osd-zfs: don't mark existing on failed creation
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 25382ce1f38812f891575db2c12423f4f49420ae

            jsalians_intel John Salinas (Inactive) added a comment - No patches to 2.9.0. We are using 16MB RPCs from the Lustre client to the OSS and have the BRW size set to 16 as well.

            niu Niu Yawei (Inactive) added a comment - Yes, but I haven't found the root cause yet. Is this a clean Lustre 2.9.0, or are any patches applied?

            jsalians_intel John Salinas (Inactive) added a comment - Have you looked at the code? Do you have any questions we can get answered for you?

            niu Niu Yawei (Inactive) added a comment - From the stack trace, it seems unlikely to be related to checksums. I'll look into the coral changes to see if there is anything suspicious.

            People

              niu Niu Yawei (Inactive)
              jsalians_intel John Salinas (Inactive)