[LU-2991] (osd_internal.h:909:osd_trans_exec_op()) ASSERTION( oti->oti_declare_ops_rb[rb] > 0 ) failed: rb = 2 Created: 19/Mar/13  Updated: 15/Apr/13  Resolved: 15/Apr/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Minor
Reporter: Minh Diep Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: MB
Environment:

https://build.whamcloud.com/job/lustre-master/1329/


Severity: 3
Rank (Obsolete): 7290

 Description   

1 mds, 2 oss with 150 osts each
1 client mounted doing
for i in $(seq 1 300); do
lfs setstripe /mnt/lustre/dir$i
touch /mnt/lustre/dir$i/file
lfs getstripe /mnt/lustre/dir$i
done

rm -rf /mnt/lustre/dir*

LustreError: 3627:0:(osd_internal.h:909:osd_trans_exec_op()) ASSERTION( oti->oti_declare_ops_rb[rb] > 0 ) failed: rb = 2
LustreError: 3627:0:(osd_internal.h:909:osd_trans_exec_op()) LBUG
Pid: 3627, comm: mdt02_001

Call Trace:
[<ffffffffa04cd895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa04cde97>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa0d1c688>] osd_trans_exec_op+0x128/0x160 [osd_ldiskfs]
[<ffffffffa0d26878>] osd_xattr_set+0x278/0x380 [osd_ldiskfs]
[<ffffffffa0edef30>] lod_generate_and_set_lovea+0x350/0x720 [lod]
[<ffffffffa0eea6d6>] lod_striping_create+0x1e6/0x320 [lod]
[<ffffffffa0eea983>] lod_object_create+0x173/0x260 [lod]
[<ffffffffa0c3b540>] mdd_object_create_internal+0xa0/0x1c0 [mdd]
[<ffffffffa0c4e503>] mdd_create+0x963/0x1470 [mdd]
[<ffffffffa0ee7460>] ? lod_index_lookup+0x0/0x30 [lod]
[<ffffffffa0e4bf84>] mdt_reint_open+0x13d4/0x20d0 [mdt]
[<ffffffffa04ea7ae>] ? upcall_cache_get_entry+0x28e/0x860 [libcfs]
[<ffffffffa07bb7ec>] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc]
[<ffffffffa0638cf0>] ? lu_ucred+0x20/0x30 [obdclass]
[<ffffffffa0e368c1>] mdt_reint_rec+0x41/0xe0 [mdt]
[<ffffffffa0e2ff23>] mdt_reint_internal+0x4e3/0x7d0 [mdt]
[<ffffffffa0e304dd>] mdt_intent_reint+0x1ed/0x4f0 [mdt]
[<ffffffffa0e2c0ae>] mdt_intent_policy+0x3ae/0x750 [mdt]
[<ffffffffa0774351>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
[<ffffffffa079a3e7>] ldlm_handle_enqueue0+0x4f7/0x10b0 [ptlrpc]
[<ffffffffa0e2c586>] mdt_enqueue+0x46/0x110 [mdt]
[<ffffffffa0e21168>] mdt_handle_common+0x628/0x1620 [mdt]
[<ffffffffa0e5a285>] mds_regular_handle+0x15/0x20 [mdt]
[<ffffffffa07cc27c>] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc]
[<ffffffffa04ce5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[<ffffffffa04dfd8f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
[<ffffffffa07c38b9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
[<ffffffff81052223>] ? __wake_up+0x53/0x70
[<ffffffffa07cd7c5>] ptlrpc_main+0xb75/0x1870 [ptlrpc]
[<ffffffffa07ccc50>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffa07ccc50>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
[<ffffffffa07ccc50>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20

LustreError: dumping log to /tmp/lustre-log.1363711711.3627



 Comments   
Comment by Andreas Dilger [ 20/Mar/13 ]

Looks like the transaction credit debugging caught a case where we didn't declare any credits for the rollback or something. The http://review.whamcloud.com/5698 change will disable this accounting for 2.3.90 and beyond, but I suspect we are missing some error handling code?

Comment by Bruno Faccini (Inactive) [ 22/Mar/13 ]

Minh,
Where and how were you able to run such a "2 OSS with 150 OSTs each" configuration?
Is it easily reproducible?
Were you able to get a crash dump from the occurrence(s) you already hit?

Comment by Minh Diep [ 22/Mar/13 ]

I ran this in our lab, creating the OSTs on LVM. I'll see if I can get the crash dump. I'll ping you offline to see if you can access the system.

Comment by Bruno Faccini (Inactive) [ 22/Mar/13 ]

The assert comes from patch/commit 28fc8fcc8bfe092c9a1a8c192ab6fe22d92820e7 for LU-2640, which also introduced the transaction credit counter declarations.
Could it be that the number of credits exceeds/overflows the size of the entries (unsigned char) in oti_declare_ops_rb[]??

Comment by Alex Zhuravlev [ 22/Mar/13 ]

with wide striping - yes.

Comment by Bruno Faccini (Inactive) [ 25/Mar/13 ]

Thanks Alex; I assume this may have occurred with the 150 OSTs Minh was using!

I changed the size of the transaction credit counters from uchar to ushort in patch http://review.whamcloud.com/5830 for master, which I just pushed to Gerrit.

Minh, if there are no build failures (I still expect side effects during build+run when changing the size!), can you give it a try??

Comment by Bruno Faccini (Inactive) [ 26/Mar/13 ]

Humm, the test build for my patch failed in sanity/test_24v with the same LBUG:

04:13:52:Lustre: DEBUG MARKER: == sanity test 24v: list directory with large files (handle hash collision, bug: 17560) == 04:04:38 (1364209478)
04:13:52:LustreError: 11418:0:(osd_internal.h:909:osd_trans_exec_op()) ASSERTION( oti->oti_declare_ops_rb[rb] > 0 ) failed: rb = 5
04:13:52:LustreError: 11418:0:(osd_internal.h:909:osd_trans_exec_op()) LBUG
04:13:52:Pid: 11418, comm: mdt00_000
04:13:52:
04:13:53:Call Trace:
04:13:53: [<ffffffffa04d7895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
04:13:53: [<ffffffffa04d7e97>] lbug_with_loc+0x47/0xb0 [libcfs]
04:13:53: [<ffffffffa0d44360>] osd_object_ref_del+0x2d0/0x310 [osd_ldiskfs]
04:13:53: [<ffffffffa0f157fb>] lod_ref_del+0x3b/0xd0 [lod]
04:13:53: [<ffffffffa0c6acd5>] mdo_ref_del+0x35/0xc0 [mdd]
04:13:54: [<ffffffffa0c72446>] mdd_unlink+0x6c6/0xe20 [mdd]
04:13:54: [<ffffffffa0e61a18>] mdo_unlink+0x18/0x50 [mdt]
04:13:54: [<ffffffffa0e64cd9>] mdt_reint_unlink+0x739/0xfd0 [mdt]
04:13:54: [<ffffffffa0e616d1>] mdt_reint_rec+0x41/0xe0 [mdt]
04:13:54: [<ffffffffa0e5ad53>] mdt_reint_internal+0x4e3/0x7d0 [mdt]
04:13:55: [<ffffffffa0e5b084>] mdt_reint+0x44/0xe0 [mdt]
04:13:55: [<ffffffffa0e49078>] mdt_handle_common+0x648/0x1660 [mdt]
04:13:56: [<ffffffffa0e85125>] mds_regular_handle+0x15/0x20 [mdt]
04:13:56: [<ffffffffa07f91dc>] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc]
04:13:56: [<ffffffffa04d85de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
04:13:57: [<ffffffffa07f0819>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
04:13:57: [<ffffffff81052223>] ? __wake_up+0x53/0x70
04:13:57: [<ffffffffa07fa725>] ptlrpc_main+0xb75/0x1870 [ptlrpc]
04:13:58: [<ffffffffa07f9bb0>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
04:13:58: [<ffffffff8100c0ca>] child_rip+0xa/0x20
04:13:58: [<ffffffffa07f9bb0>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
04:13:58: [<ffffffffa07f9bb0>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
04:13:58: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
04:13:58:
04:13:59:Kernel panic - not syncing: LBUG
04:13:59:Pid: 11418, comm: mdt00_000 Not tainted 2.6.32-279.19.1.el6_lustre.gc4681d8.x86_64 #1
04:13:59:Call Trace:
04:14:00: [<ffffffff814e9811>] ? panic+0xa0/0x168
04:14:00: [<ffffffffa04d7eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
04:14:00: [<ffffffffa0d44360>] ? osd_object_ref_del+0x2d0/0x310 [osd_ldiskfs]
04:14:00: [<ffffffffa0f157fb>] ? lod_ref_del+0x3b/0xd0 [lod]
04:14:00: [<ffffffffa0c6acd5>] ? mdo_ref_del+0x35/0xc0 [mdd]
04:14:01: [<ffffffffa0c72446>] ? mdd_unlink+0x6c6/0xe20 [mdd]
04:14:01: [<ffffffffa0e61a18>] ? mdo_unlink+0x18/0x50 [mdt]
04:14:01: [<ffffffffa0e64cd9>] ? mdt_reint_unlink+0x739/0xfd0 [mdt]
04:14:01: [<ffffffffa0e616d1>] ? mdt_reint_rec+0x41/0xe0 [mdt]
04:14:01: [<ffffffffa0e5ad53>] ? mdt_reint_internal+0x4e3/0x7d0 [mdt]
04:14:01: [<ffffffffa0e5b084>] ? mdt_reint+0x44/0xe0 [mdt]
04:14:02: [<ffffffffa0e49078>] ? mdt_handle_common+0x648/0x1660 [mdt]
04:14:02: [<ffffffffa0e85125>] ? mds_regular_handle+0x15/0x20 [mdt]
04:14:02: [<ffffffffa07f91dc>] ? ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc]
04:14:02: [<ffffffffa04d85de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
04:14:02: [<ffffffffa07f0819>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
04:14:03: [<ffffffff81052223>] ? __wake_up+0x53/0x70
04:14:03: [<ffffffffa07fa725>] ? ptlrpc_main+0xb75/0x1870 [ptlrpc]
04:14:03: [<ffffffffa07f9bb0>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
04:14:03: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
04:14:03: [<ffffffffa07f9bb0>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
04:14:03: [<ffffffffa07f9bb0>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
04:14:03: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

Unless my very small change of the counter's size introduced a regression, I am afraid that the recently landed patches from LU-2640 may need some rework ...
I will try to run this build on Toro to reproduce and get a crash dump for analysis, but Minh, if you are already able to reproduce with your own test/configuration, just tell me.

Comment by Bruno Faccini (Inactive) [ 26/Mar/13 ]

I am trying to reproduce on a single Toro node with loop devices, but no success/LBUG so far ...
I added the "max_loop=164" kernel boot parameter and then ran the following script, based on your reproducer:

export OSTCOUNT=150
/usr/lib64/lustre/tests/llmount.sh
while true; do
    for i in $(seq 1 300); do
        mkdir /mnt/lustre/dir$i
        lfs setstripe -c 150 /mnt/lustre/dir$i
        touch /mnt/lustre/dir$i/file
        lfs getstripe /mnt/lustre/dir$i
    done
    rm -rf /mnt/lustre/dir*
done

Maybe I need more nodes/VMs to trigger it??

Comment by Minh Diep [ 26/Mar/13 ]

Yes, I think you might need more than one node to reproduce it. Did auto test provide a crash dump?

Comment by Bruno Faccini (Inactive) [ 26/Mar/13 ]

Yes, according to the MDS console log; but then how/where can I find it?

Comment by Minh Diep [ 26/Mar/13 ]

I hit this during testing of your patch. I am not sure if it's related; likely not. I will reproduce this bug on master and collect the crash dump.

-----------[ cut here ]-----------
kernel BUG at fs/jbd2/transaction.c:1033!
invalid opcode: 0000 1 SMP
last sysfs file: /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map
CPU 8
Modules linked in: osp(U) lod(U) mdt(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) mdd(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic libcfs(U) ldiskfs(U) jbd2 nfsd exportfs nfs lockd fscache nfs_acl auth_rpcgss sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa microcode serio_raw mlx4_ib ib_mad ib_core mlx4_en mlx4_core i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ioatdma i7core_edac edac_core ses enclosure sg igb dca ext3 jbd mbcache sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix mpt2sas scsi_transport_sas raid_class dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 674, comm: mdt00_003 Not tainted 2.6.32-279.19.1.el6_lustre.gc4681d8.x86_64 #1 Supermicro X8DTH-i/6/iF/6F/X8DTH
RIP: 0010:[<ffffffffa044c86d>] [<ffffffffa044c86d>] jbd2_journal_dirty_metadata+0x10d/0x150 [jbd2]
RSP: 0018:ffff88032d16d6b0 EFLAGS: 00010246
RAX: ffff88035d5bf380 RBX: ffff880359dd4678 RCX: ffff880f985172d0
RDX: 0000000000000000 RSI: ffff880f985172d0 RDI: 0000000000000000
RBP: ffff88032d16d6d0 R08: a010000000000000 R09: f049b2734454b402
R10: 0000000000000000 R11: ffff880eb0d34000 R12: ffff88038e527ba8
R13: ffff880f985172d0 R14: ffff8808298c2800 R15: 0000000000001000
FS: 00007fac03683700(0000) GS:ffff880028280000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000006dbf98 CR3: 0000000826eb6000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process mdt00_003 (pid: 674, threadinfo ffff88032d16c000, task ffff88032d213540)
Stack:
ffff880359dd4678 ffffffffa0d82b50 ffff880f985172d0 0000000000000000
<d> ffff88032d16d710 ffffffffa04721bb ffff88032d16d710 ffffffffa047f898
<d> 0000000000001000 0000000000001000 0000000000001000 ffff880f985172d0
Call Trace:
[<ffffffffa04721bb>] __ldiskfs_handle_dirty_metadata+0x7b/0x100 [ldiskfs]
[<ffffffffa047f898>] ? ldiskfs_bread+0x18/0x80 [ldiskfs]
[<ffffffffa0d6d1ec>] osd_ldiskfs_write_record+0xec/0x330 [osd_ldiskfs]
[<ffffffffa0d6e038>] osd_write+0x148/0x2a0 [osd_ldiskfs]
[<ffffffffa066aeb5>] dt_record_write+0x45/0x130 [obdclass]
[<ffffffffa0c8acd0>] ? md_capainfo+0x20/0x30 [mdd]
[<ffffffffa063788e>] llog_osd_write_blob+0x2fe/0x730 [obdclass]
[<ffffffffa063b157>] llog_osd_write_rec+0x6b7/0x1200 [obdclass]
[<ffffffffa0d679a5>] ? iam_path_fini+0x25/0x30 [osd_ldiskfs]
[<ffffffffa06073b8>] llog_write_rec+0xc8/0x290 [obdclass]
[<ffffffffa060f57d>] llog_cat_add_rec+0xad/0x480 [obdclass]
[<ffffffffa06071b1>] llog_add+0x91/0x1d0 [obdclass]
[<ffffffffa0f71ef7>] osp_sync_add_rec+0x247/0x8a0 [osp]
[<ffffffffa0d650af>] ? osd_oi_delete+0x2af/0x4b0 [osd_ldiskfs]
[<ffffffffa0f725fb>] osp_sync_add+0x7b/0x80 [osp]
[<ffffffffa0f664e6>] osp_object_destroy+0x106/0x150 [osp]
[<ffffffffa0f22347>] lod_object_destroy+0x1a7/0x350 [lod]
[<ffffffffa0c83a19>] mdd_finish_unlink+0x229/0x380 [mdd]
[<ffffffffa0c86748>] mdd_unlink+0x9c8/0xe20 [mdd]
[<ffffffffa0e6ba18>] mdo_unlink+0x18/0x50 [mdt]
[<ffffffffa0e6ecd9>] mdt_reint_unlink+0x739/0xfd0 [mdt]
[<ffffffffa0e6b6d1>] mdt_reint_rec+0x41/0xe0 [mdt]
[<ffffffffa0e64d53>] mdt_reint_internal+0x4e3/0x7d0 [mdt]
[<ffffffffa0e65084>] mdt_reint+0x44/0xe0 [mdt]
[<ffffffffa0e53078>] mdt_handle_common+0x648/0x1660 [mdt]
[<ffffffffa0e8f125>] mds_regular_handle+0x15/0x20 [mdt]
[<ffffffffa07fc1dc>] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc]
[<ffffffffa04f45de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[<ffffffffa0505d8f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
[<ffffffffa07f3819>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
[<ffffffff81052223>] ? __wake_up+0x53/0x70
[<ffffffffa07fd725>] ptlrpc_main+0xb75/0x1870 [ptlrpc]
[<ffffffffa07fcbb0>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffa07fcbb0>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
[<ffffffffa07fcbb0>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Code: c6 9c 03 00 00 4c 89 f7 e8 a1 ff 09 e1 48 8b 33 ba 01 00 00 00 4c 89 e7 e8 11 ec ff ff 4c 89 f0 66 ff 00 66 66 90 e9 73 ff ff ff <0f> 0b eb fe 0f 0b eb fe 0f 0b 66 0f 1f 84 00 00 00 00 00 eb f5
RIP [<ffffffffa044c86d>] jbd2_journal_dirty_metadata+0x10d/0x150 [jbd2]
RSP <ffff88032d16d6b0>

I will file a separate bug if I hit this on master

Comment by Bruno Faccini (Inactive) [ 27/Mar/13 ]

According to the stack trace, I don't think it is related, but as I already said, with item size changes some weird consequences can show up ...

Also, Chris confirmed that crash dumps generated during auto-test are currently not saved; fixing that is on their to-do list.

Thus I rely on your testing to reproduce the LBUG; in parallel, I am trying to build a configuration with more nodes/VMs too, like the one used in the failed auto-test.

Comment by Minh Diep [ 27/Mar/13 ]

I have reproduced it on master and got the debug log with debug=-1. I also took a crash dump, but since this is an LBUG, I am not sure a crash dump is useful.

Comment by Bruno Faccini (Inactive) [ 28/Mar/13 ]

The crash dump is not really useful in this case, since it can only confirm that oti->oti_declare_ops_rb[OSD_OT_XATTR_SET] == 0, which is what caused the assert to trigger.

But the Lustre debug logs are much more interesting, since they confirm that the thread that triggered the assert was gathering transaction credit stats for a workload using wide striping, each time near the "fatal" value of 256 (in fact 253 + ??). So I still suspect an overflow due to the uchar size ...

Minh, I know that for the crash dump you were running a genuine master build, but was that also the case for the full debug=-1 trace?

In fact, I would like to know whether you got this assert/LBUG when running with my patch??

Comment by Alex Zhuravlev [ 28/Mar/13 ]

To check the theory about the uchar size, you could put LASSERT(current_value < 255); just before the increment?

Comment by Bruno Faccini (Inactive) [ 28/Mar/13 ]

You are right, Alex. I thought I could find it by counting the relevant traces in the full debug log from Minh, but it seems those traces are in fact missing ...

Minh, I just submitted http://review.whamcloud.com/5869 to trigger if we reach UCHAR_MAX; can you give it a try and expose it to your reproducer+configuration??

Comment by Bruno Faccini (Inactive) [ 29/Mar/13 ]

Ok, the fact that we can reach UCHAR_MAX and overflow the oti_declare_ops counter has now been definitely proven.

Then, changing the size of both oti_declare_ops/oti_declare_ops_rb from uchar to ushort, as per the first patch/change 5830, seems to let us go further, but then we finally trigger the same assert, now because oti_rollback is true while oti_declare_ops_rb is zero!!

It seems we missed something in the rollback logic ... So we continue to debug by adding traces/prints to see how the counters progress and find where things break.

Comment by Alex Zhuravlev [ 29/Mar/13 ]

Notice that the checks in the rollback path are not quite correct. In short, from the ldiskfs/zfs point of view there is no point in declaring rollback changes: if you have an insert for a name declared, then you'll be able to remove that name as part of the rollback. This is why we plan to just disable these checks in production (until a better mechanism to track changes is implemented).

Comment by Minh Diep [ 01/Apr/13 ]

Alex, can you confirm that if we disable these checks in production, this ASSERTION will not happen?

Comment by Alex Zhuravlev [ 01/Apr/13 ]

The assertions in the OSD, yes. But I think we still need to understand the root cause of the kernel BUG at fs/jbd2/transaction.c:1033!

Comment by Bruno Faccini (Inactive) [ 01/Apr/13 ]

Minh,

Yes, undefining the compile-time OSD_TRACK_DECLARES flag will fully disable the transaction credit counter feature/code.
Concerning the "fs/jbd2/transaction.c:1033" kernel BUG()/Oops: am I wrong in thinking it was a one-shot that never re-occurred during the subsequent heavy testing you ran?
Also, did you get a crash dump at that time? And did you finally open a new ticket for it?

Comment by Minh Diep [ 01/Apr/13 ]

Disabling it by taking out the OSD_TRACK_DECLARES define results in a compile error:

/home/minh.diep/build/lu2991/lustre/osd-ldiskfs/osd_handler.c: In function ‘osd_trans_create’:
/home/minh.diep/build/lu2991/lustre/osd-ldiskfs/osd_handler.c:683: error: ‘struct osd_thread_info’ has no member named ‘oti_declare_ops’
/home/minh.diep/build/lu2991/lustre/osd-ldiskfs/osd_handler.c:684: error: ‘struct osd_thread_info’ has no member named ‘oti_declare_ops_rb’
/home/minh.diep/build/lu2991/lustre/osd-ldiskfs/osd_handler.c:685: error: ‘struct osd_thread_info’ has no member named ‘oti_declare_ops_cred’
/home/minh.diep/build/lu2991/lustre/osd-ldiskfs/osd_handler.c:686: error: ‘struct osd_thread_info’ has no member named ‘oti_rollback’
cc1: warnings being treated as errors
/home/minh.diep/build/lu2991/lustre/osd-ldiskfs/osd_handler.c: In function ‘osd_trans_start’:
/home/minh.diep/build/lu2991/lustre/osd-ldiskfs/osd_handler.c:717: error: unused variable ‘last_credits’
/home/minh.diep/build/lu2991/lustre/osd-ldiskfs/osd_handler.c:716: error: unused variable ‘last_printed’

Bruno, no bug was filed for the kernel BUG, and I haven't hit it again.

Comment by Bruno Faccini (Inactive) [ 01/Apr/13 ]

Humm, in the osd_trans_create() function, the memset()s/initializations of the oti_declare_ops[_rb,_cred] and oti_rollback fields need to be put under #ifdef OSD_TRACK_DECLARES.

Same for the last_credits/last_printed variable declarations in osd_trans_start().

Then we can run your tests again and see whether we trigger the Oops in jbd2 again.

Comment by Minh Diep [ 02/Apr/13 ]

OK, I fixed the compile issue and have been running the test for a couple of hours now. No ASSERTION. I will try to reproduce the Oops, but in the meantime this bug's priority can be lowered.

Comment by Bruno Faccini (Inactive) [ 03/Apr/13 ]

Patch-set #3 was submitted because patch-set #1 did not correctly initialize/memset() the resized fields.

If it still shows the same wrong behavior, we will definitely revert to unsetting/undefining OSD_TRACK_DECLARES.

Comment by Bruno Faccini (Inactive) [ 05/Apr/13 ]

I had to rebase patch-set #6 due to auto-test failures like racer/test_1 [LustreError: 31880:0:(file.c:930:ll_file_io_generic()) ASSERTION( io->u.ci_rw.crw_count == count ) failed: 21504 != 4194304], which has been fixed in LU-3035 ...

Comment by Bruno Faccini (Inactive) [ 07/Apr/13 ]

Too bad: patch-set #6 hit 2 auto-test failures (sanity/test_56w linked to LU-3092 and sanity/test_65ic linked to LU-3099), so a new rebase is required to integrate the LU-3057 fix, since both LU-3092/LU-3099 have been duped to it ...

So patch-set #7 was just submitted after the rebase ...

Comment by Bruno Faccini (Inactive) [ 08/Apr/13 ]

Again 4 auto-test failures, all different from the preceding ones! But all are in the review-zfs group, with at least 2 of them being known ZFS performance issues (LU-2872, LU-2547) still under investigation. Thus, considering that this bug+patch is pure osd-ldiskfs material, and since all review (pure ldiskfs ?) group tests passed, we may reasonably conclude that the patch is OK and verified ...

Generated at Sat Feb 10 06:21:23 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.