[LU-9280] coral-beta-combined build 134 (osd_object.c:745:osd_attr_get()) ASSERTION( obj->oo_db ) failed Created: 30/Mar/17  Updated: 14/Jun/18  Resolved: 23/Apr/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Critical
Reporter: John Salinas (Inactive) Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: LS_RZ, prod
Environment:

Running Lustre 2.9 + coral-beta-combined branch based on RC3:

kmod-lustre-tests-2.9.0_dirty-1.el7.centos.x86_64
lustre-tests-2.9.0_dirty-1.el7.centos.x86_64
lustre-osd-zfs-mount-2.9.0_dirty-1.el7.centos.x86_64
lustre-2.9.0_dirty-1.el7.centos.x86_64
lustre-iokit-2.9.0_dirty-1.el7.centos.x86_64
kmod-lustre-2.9.0_dirty-1.el7.centos.x86_64
kmod-lustre-osd-zfs-2.9.0_dirty-1.el7.centos.x86_64
[root@wolf-3 debug_info.20170330_034804_14535_wolf-3.wolf.hpdd.intel.com]# rpm -qa |grep -i zfs
libzfs2-0.7.0-rc3_28_g4661777.el7.centos.x86_64
kmod-zfs-0.7.0-rc3_28_g4661777.el7.centos.x86_64
zfs-kmod-debuginfo-0.7.0-rc3_28_g4661777.el7.centos.x86_64
lustre-osd-zfs-mount-2.9.0_dirty-1.el7.centos.x86_64
zfs-0.7.0-rc3_28_g4661777.el7.centos.x86_64
zfs-test-0.7.0-rc3_28_g4661777.el7.centos.x86_64
kmod-lustre-osd-zfs-2.9.0_dirty-1.el7.centos.x86_64
zfs-debuginfo-0.7.0-rc3_28_g4661777.el7.centos.x86_64

Pool configuration:
quick_oss1.sh:zpool create -f -o ashift=12 -o cachefile=none -O recordsize=16MB ost0 draid2 cfg=test_2_5_4_18_draidcfg.nvl mpathaa mpathab mpathac mpathad mpathae mpathaf mpathag mpathah mpathai mpathaj mpathak mpathal mpatham mpathan mpathao mpathap mpathaq mpathar
quick_oss1.sh:zpool status -v ost0
quick_oss1.sh:zpool feature@large_blocks=enabled ost0
quick_oss1.sh:zpool get all ost0 |grep large_blocks
quick_oss2.sh:zpool create -f -o ashift=12 -o cachefile=none -O recordsize=16MB ost1 draid2 cfg=test_2_5_4_18_draidcfg.nvl mpatha mpathb mpathc mpathd mpathe mpathf mpathg mpathh mpathi mpathj mpathk mpathl mpathm mpathn mpatho mpathp mpathq mpathr
quick_oss2.sh:zpool status -v ost1
quick_oss2.sh:zpool feature@large_blocks=enabled ost1
quick_oss2.sh:zpool get all ost1 |grep large_blocks

Example zpool status output from ost1:
  pool: ost1
 state: ONLINE
  scan: none requested
config:

        NAME            STATE     READ WRITE CKSUM
        ost1            ONLINE       0     0     0
          draid2-0      ONLINE       0     0     0
            mpatha      ONLINE       0     0     0
            mpathb      ONLINE       0     0     0
            mpathc      ONLINE       0     0     0
            mpathd      ONLINE       0     0     0
            mpathe      ONLINE       0     0     0
            mpathf      ONLINE       0     0     0
            mpathg      ONLINE       0     0     0
            mpathh      ONLINE       0     0     0
            mpathi      ONLINE       0     0     0
            mpathj      ONLINE       0     0     0
            mpathk      ONLINE       0     0     0
            mpathl      ONLINE       0     0     0
            mpathm      ONLINE       0     0     0
            mpathn      ONLINE       0     0     0
            mpatho      ONLINE       0     0     0
            mpathp      ONLINE       0     0     0
            mpathq      ONLINE       0     0     0
            mpathr      ONLINE       0     0     0
        spares
          $draid2-0-s0  AVAIL
          $draid2-0-s1  AVAIL
          $draid2-0-s2  AVAIL
          $draid2-0-s3  AVAIL


Issue Links:
Related
Severity: 1
Rank (Obsolete): 9223372036854775807

 Description   

Running Lustre 2.9 + coral-beta-combined branch based on RC3:

IOR tests:
IOR-3.0.1: MPI Coordinated Test of Parallel I/O

Began: Thu Mar 30 00:07:33 2017
Command line used: /home/johnsali/wolf-3/ior/src/ior -a POSIX -F -N 4 -d 2 -i 1 -s 1024 -b 1m -t 1m
Machine: Linux wolf-6.wolf.hpdd.intel.com

Test 0 started: Thu Mar 30 00:07:33 2017
Summary:
api = POSIX
test filename = testFile
access = file-per-process
ordering in a file = sequential offsets
ordering inter file= no tasks offsets
clients = 4 (1 per node)
repetitions = 1
xfersize = 1 MiB
blocksize = 1 MiB
aggregate filesize = 4 GiB

access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----

While IOR was writing, we hit the following errors:

[19744.556366] LustreError: 84625:0:(osd_object.c:597:osd_object_destroy()) lsdraid-OST0000: failed to remove [0x100000000:0x1c:0x0] from accounting ZAP for usr 0: rc = -5
[19744.580303] LustreError: 84625:0:(osd_object.c:597:osd_object_destroy()) Skipped 1 previous similar message
[19745.014350] LustreError: 84625:0:(osd_object.c:603:osd_object_destroy()) lsdraid-OST0000: failed to remove [0x100000000:0x1c:0x0] from accounting ZAP for grp 0: rc = -5
[19745.037113] LustreError: 84625:0:(osd_object.c:603:osd_object_destroy()) Skipped 2 previous similar messages
[19768.423554] LustreError: 84625:0:(osd_object.c:597:osd_object_destroy()) lsdraid-OST0000: failed to remove [0x100000000:0x1f:0x0] from accounting ZAP for usr 0: rc = -52
[19768.586567] LustreError: 84625:0:(osd_object.c:603:osd_object_destroy()) lsdraid-OST0000: failed to remove [0x100000000:0x1f:0x0] from accounting ZAP for grp 0: rc = -52
[19779.750997] LustreError: 52432:0:(osd_object.c:745:osd_attr_get()) ASSERTION( obj->oo_db ) failed: 
[19779.751007] LustreError: 50225:0:(osd_object.c:745:osd_attr_get()) ASSERTION( obj->oo_db ) failed: 
[19779.751010] LustreError: 50225:0:(osd_object.c:745:osd_attr_get()) LBUG
[19779.751012] Pid: 50225, comm: ll_ost01_002
[19779.751012] 
Call Trace:
[19779.751043]  [<ffffffffa0a1b7d3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
[19779.751054]  [<ffffffffa0a1b841>] lbug_with_loc+0x41/0xb0 [libcfs]
[19779.751072]  [<ffffffffa0968210>] osd_attr_set+0x0/0xce0 [osd_zfs]
[19779.751096]  [<ffffffffa0f1b405>] ofd_attr_get+0xa5/0x230 [ofd]
[19779.751111]  [<ffffffffa0f29bfd>] ofd_lvbo_init+0x42d/0xb02 [ofd]
[19779.751248]  [<ffffffffa0cd22d9>] ldlm_handle_enqueue0+0x8f9/0x1680 [ptlrpc]
[19779.751322]  [<ffffffffa0cfa0f0>] ? lustre_swab_ldlm_request+0x0/0x30 [ptlrpc]
[19779.751407]  [<ffffffffa0d52dc2>] tgt_enqueue+0x62/0x210 [ptlrpc]
[19779.751483]  [<ffffffffa0d57225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
[19779.751545]  [<ffffffffa0d031ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[19779.751563]  [<ffffffffa0a28128>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
[19779.751621]  [<ffffffffa0d00d68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[19779.751635]  [<ffffffff810b8952>] ? default_wake_function+0x12/0x20
[19779.751639]  [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90
[19779.751708]  [<ffffffffa0d07260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
[19779.751765]  [<ffffffffa0d067c0>] ? ptlrpc_main+0x0/0x1de0 [ptlrpc]
[19779.751775]  [<ffffffff810a5b8f>] kthread+0xcf/0xe0
[19779.751779]  [<ffffffff810a5ac0>] ? kthread+0x0/0xe0
[19779.751789]  [<ffffffff81646a98>] ret_from_fork+0x58/0x90
[19779.751794]  [<ffffffff810a5ac0>] ? kthread+0x0/0xe0
[19779.751795] 
[19779.751797] Kernel panic - not syncing: LBUG
[19779.751801] CPU: 26 PID: 50225 Comm: ll_ost01_002 Tainted: G          IOE  ------------   3.10.0-327.36.3.el7.x86_64 #1
[19779.751803] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
[19779.751813]  ffffffffa0a38d4c 00000000e5fc8e4d ffff880fe9b33a78 ffffffff81636431
[19779.751820]  ffff880fe9b33af8 ffffffff8162fcc0 ffffffff00000008 ffff880fe9b33b08
[19779.751827]  ffff880fe9b33aa8 00000000e5fc8e4d 00000000e5fc8e4d 0000000000000092
[19779.751828] Call Trace:
[19779.751843]  [<ffffffff81636431>] dump_stack+0x19/0x1b
[19779.751847]  [<ffffffff8162fcc0>] panic+0xd8/0x1e7
[19779.751859]  [<ffffffffa0a1b859>] lbug_with_loc+0x59/0xb0 [libcfs]
[19779.751871]  [<ffffffffa0968210>] osd_attr_get+0x2d0/0x2d0 [osd_zfs]
[19779.751885]  [<ffffffffa0f1b405>] ofd_attr_get+0xa5/0x230 [ofd]
[19779.751898]  [<ffffffffa0f29bfd>] ofd_lvbo_init+0x42d/0xb02 [ofd]
[19779.751952]  [<ffffffffa0cd22d9>] ldlm_handle_enqueue0+0x8f9/0x1680 [ptlrpc]
[19779.752010]  [<ffffffffa0cfa0f0>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
[19779.752084]  [<ffffffffa0d52dc2>] tgt_enqueue+0x62/0x210 [ptlrpc]
[19779.752165]  [<ffffffffa0d57225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
[19779.752238]  [<ffffffffa0d031ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[19779.752255]  [<ffffffffa0a28128>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
[19779.752326]  [<ffffffffa0d00d68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[19779.752333]  [<ffffffff810b8952>] ? default_wake_function+0x12/0x20
[19779.752337]  [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90
[19779.752409]  [<ffffffffa0d07260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
[19779.752482]  [<ffffffffa0d067c0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
[19779.752489]  [<ffffffff810a5b8f>] kthread+0xcf/0xe0
[19779.752494]  [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
[19779.752500]  [<ffffffff81646a98>] ret_from_fork+0x58/0x90
[19779.752505]  [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140 

osd-zfs/osd_object.c:937

932	 * dmu_tx_hold_bonus(tx, oid) called and then assigned
933	 * to a transaction group.
934	 */
935	static int osd_attr_set(const struct lu_env *env, struct dt_object *dt,
936				const struct lu_attr *la, struct thandle *handle)
937	{
938		struct osd_thread_info	*info = osd_oti_get(env);
939		sa_bulk_attr_t		*bulk = osd_oti_get(env)->oti_attr_bulk;
940		struct osd_object	*obj = osd_dt_obj(dt);
941		struct osd_device	*osd = osd_obj2dev(obj);

ofd/ofd_objects.c:780

775	 * \retval		0 if successful
776	 * \retval		negative value on error
777	 */
778	int ofd_attr_get(const struct lu_env *env, struct ofd_object *fo,
779			 struct lu_attr *la)
780	{
781		int rc = 0;
782	
783		ENTRY;
784
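
For reference, the assertion that fired is at osd_object.c:745 in osd_attr_get(). Below is a minimal sketch of the shape of that check, assuming it is a plain LASSERT on obj->oo_db as the console message reports; the body is simplified for illustration and is not the upstream source:

static int osd_attr_get(const struct lu_env *env, struct dt_object *dt,
			struct lu_attr *attr)
{
	struct osd_object *obj = osd_dt_obj(dt);

	/* The dnode buffer is expected to be attached before attributes
	 * are read; obj->oo_db being NULL here is exactly the condition
	 * the LBUG in this ticket reports. */
	LASSERT(obj->oo_db);

	/* ... copy of the cached attributes into *attr omitted ... */
	return 0;
}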

Dump is at:
/scratch/dumps/wolf-3.wolf.hpdd.intel.com/10.8.1.3-2017-03-30-00:08:02/



 Comments   
Comment by Peter Jones [ 31/Mar/17 ]

Niu

Could you please advise?

Thanks

Peter

Comment by Niu Yawei (Inactive) [ 01/Apr/17 ]

I see it's "Lustre 2.9 + coral-beta-combined branch based on RC3" — where can I get the source code?

Comment by John Salinas (Inactive) [ 01/Apr/17 ]

Lustre is stock Lustre 2.9.0
ZFS is: git clone ssh://jsalinas@review.whamcloud.com:29418/fs/zfs -b coral-beta-combined

Comment by John Salinas (Inactive) [ 04/Apr/17 ]

Should I retry this with checksums on? Aren't the LNet checksums turned off by default?

Comment by Niu Yawei (Inactive) [ 05/Apr/17 ]

From the stack trace, it seems unlikely to be related to checksums. I'll look into the coral changes to see if there is anything suspicious.

Comment by John Salinas (Inactive) [ 12/Apr/17 ]

Have you looked at the code? Do you have any questions we can get answered for you?

Comment by Niu Yawei (Inactive) [ 13/Apr/17 ]

Yes, but I haven't found the root cause yet. Is this a clean Lustre 2.9.0, or does it have any patches applied?

Comment by John Salinas (Inactive) [ 13/Apr/17 ]

No patches to 2.9.0. We are using 16MB RPCs from the Lustre client to the OSS and have the BRW size set to 16MB as well.

Comment by Gerrit Updater [ 14/Apr/17 ]

Niu Yawei (yawei.niu@intel.com) uploaded a new patch: https://review.whamcloud.com/26617
Subject: LU-9280 osd-zfs: don't mark existing on failed creation
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 25382ce1f38812f891575db2c12423f4f49420ae

Comment by Niu Yawei (Inactive) [ 14/Apr/17 ]

There is a defect in osd_object_create() that is likely related to this bug. I pushed a patch to master for review; once it has passed review, I'll backport it to b2_9.
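
The patch subject is "don't mark existing on failed creation". To illustrate the failure mode that title describes, here is a rough sketch of the pattern, with hypothetical helper names (__osd_object_create_sketch(), mark_object_existing()) standing in for the real code; it should not be read as the actual patch:

/* Illustrative sketch only -- not the LU-9280 patch itself. */
static int osd_object_create_sketch(const struct lu_env *env,
				    struct osd_object *obj,
				    struct thandle *th)
{
	int rc;

	/* Hypothetical helper that allocates the dnode and attaches
	 * obj->oo_db. */
	rc = __osd_object_create_sketch(env, obj, th);
	if (rc != 0)
		return rc;	/* do NOT mark the object as existing here */

	/* Only after the dbuf is attached is it safe to publish the object
	 * as existing; otherwise a later glimpse/enqueue can reach
	 * osd_attr_get() with obj->oo_db == NULL and hit the reported LBUG. */
	mark_object_existing(obj);
	return 0;
}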

Comment by Gerrit Updater [ 17/Apr/17 ]

Niu Yawei (yawei.niu@intel.com) uploaded a new patch: https://review.whamcloud.com/26653
Subject: LU-9280 osd-zfs: don't mark existing on failed creation
Project: fs/lustre-release
Branch: b2_9
Current Patch Set: 1
Commit: f35a9387825e97785a81f12378d6bae3283534d7

Comment by Niu Yawei (Inactive) [ 17/Apr/17 ]

Ported to b2_9: https://review.whamcloud.com/26653

Comment by Gerrit Updater [ 23/Apr/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26617/
Subject: LU-9280 osd-zfs: don't mark existing on failed creation
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 80c9ba8d4070c6c106afd0362d2503324c7d0e99

Comment by Peter Jones [ 23/Apr/17 ]

Landed for 2.10
