[LU-5040] kernel BUG at fs/jbd2/transaction.c:1033 Created: 09/May/14  Updated: 07/Jul/16  Resolved: 02/Oct/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1
Fix Version/s: Lustre 2.7.0, Lustre 2.5.4

Type: Bug Priority: Major
Reporter: Mahmoud Hanafi Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None
Environment:

lustre: 2.4.1
kernel: 2.6.32-358.23.2.el6.20140115.x86_64.lustre241
build: 7nasS_ofed154

Source at https://github.com/jlan/lustre-nas


Issue Links:
Related
is related to LU-5777 reserve enough credits for chown/chgr... Resolved
is related to LU-5392 kernel BUG at fs/jbd2/transaction.c:1... Open
is related to LU-5336 kernel BUG at fs/jbd2/transaction.c:1... Resolved
is related to LU-5640 mds crash after update Resolved
is related to LU-5250 OSSes with LU-4611: hitting J_ASSERT_... Resolved
Severity: 3
Rank (Obsolete): 13932

 Description   

mdt crashed with

<4>------------[ cut here ]------------
<2>kernel BUG at fs/jbd2/transaction.c:1033!
[1]kdb> sr 8
SysRq : Changing Loglevel
Loglevel set to 8
[1]kdb> sr p
SysRq : Show Regs
CPU 1
Modules linked in: osp(U) lod(U) mdt(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) ldiskfs(U) lquota(U) jbd2 mdd(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic crc32c_intel libcfs(U) dm_round_robin scsi_dh_rdac lpfc(U) scsi_transport_fc scsi_tgt nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc bonding 8021q garp stp llc ib_ucm(U) rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) ib_mad(U) ib_core(U) dm_multipath tcp_bic power_meter dcdbas microcode iTCO_wdt iTCO_vendor_support shpchp mlx4_core(U) memtrack(U) ses enclosure sg tg3 hwmon ext3 jbd sd_mod crc_t10dif wmi megaraid_sas dm_mirror dm_region_hash dm_log dm_mod gru [last unloaded: scsi_wait_scan]

Pid: 13917, comm: mdt_rdpg02_017 Not tainted 2.6.32-358.23.2.el6.20140115.x86_64.lustre241 #1 Dell Inc. PowerEdge R720/0VWT90
RIP: 0010:[<ffffffffa0bd88ad>]  [<ffffffffa0bd88ad>] jbd2_journal_dirty_metadata+0x10d/0x150 [jbd2]
RSP: 0018:ffff880f537198a0  EFLAGS: 00010246
RAX: ffff880f88da9cc0 RBX: ffff880eb8352d08 RCX: ffff880bf382b610
RDX: 0000000000000000 RSI: ffff880bf382b610 RDI: 0000000000000000
RBP: ffff880f537198c0 R08: 2010000000000000 R09: f3ee8046d0a58402
R10: 0000000000000001 R11: ffff880863dd6e10 R12: ffff880f4897f518
R13: ffff880bf382b610 R14: ffff881007dcc800 R15: 0000000000000008
FS:  00007fffedaf3700(0000) GS:ffff88084c400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000061c9b8 CR3: 0000000001a25000 CR4: 00000000000407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process mdt_rdpg02_017 (pid: 13917, threadinfo ffff880f53718000, task ffff880f5370aae0)
Stack:
 ffff880eb8352d08 ffffffffa0ca92d0 ffff880bf382b610 0000000000000000
<d> ffff880f53719900 ffffffffa0c680bb ffff880f537198f0 ffffffff810962ff
<d> ffff8810213f3350 ffff880eb8352d08 0000000000000018 ffff880bf382b610
Call Trace:
 [<ffffffffa0c680bb>] __ldiskfs_handle_dirty_metadata+0x7b/0x100 [ldiskfs]
 [<ffffffff810962ff>] ? wake_up_bit+0x2f/0x40
 [<ffffffffa0c9ea55>] ldiskfs_quota_write+0x165/0x210 [ldiskfs]
 [<ffffffff811e2221>] v2_write_file_info+0xa1/0xe0
 [<ffffffff811de328>] dquot_acquire+0x138/0x140
 [<ffffffffa0c9d5f6>] ldiskfs_acquire_dquot+0x66/0xb0 [ldiskfs]
 [<ffffffff811e029c>] dqget+0x2ac/0x390
 [<ffffffff811e0848>] dquot_initialize+0x98/0x240
 [<ffffffffa0c9d812>] ldiskfs_dquot_initialize+0x62/0xc0 [ldiskfs]
 [<ffffffffa0cf8d6f>] osd_attr_set+0x12f/0x540 [osd_ldiskfs]
 [<ffffffffa0eb15cb>] lod_attr_set+0x12b/0x450 [lod]
 [<ffffffffa0b6d411>] mdd_attr_set_internal+0x151/0x230 [mdd]
 [<ffffffffa0b706ea>] mdd_attr_set+0x107a/0x1390 [mdd]
 [<ffffffffa06fd011>] ? lustre_pack_reply_v2+0x1e1/0x280 [ptlrpc]
 [<ffffffffa0e0e182>] mdt_mfd_close+0x502/0x6e0 [mdt]
 [<ffffffffa0e0f73a>] mdt_close+0x67a/0xab0 [mdt]
 [<ffffffffa0de7ad7>] mdt_handle_common+0x647/0x16d0 [mdt]
 [<ffffffffa0e21635>] mds_readpage_handle+0x15/0x20 [mdt]
 [<ffffffffa070d3d8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
 [<ffffffffa04175de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
 [<ffffffffa0428d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
 [<ffffffffa0704739>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
 [<ffffffff81055813>] ? __wake_up+0x53/0x70
 [<ffffffffa070e76e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
 [<ffffffffa070dca0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffffa070dca0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffffa070dca0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Code: c6 9c 03 00 00 4c 89 f7 e8 11 97 96 e0 48 8b 33 ba 01 00 00 00 4c 89 e7 e8 11 ec ff ff 4c 89 f0 66 ff 00 66 66 90 e9 73 ff ff ff <0f> 0b eb fe 0f 0b eb fe 0f 0b 66 

After recovery it crashed again at the same place.

AFTER RECOVER

Lustre: nbp7-MDT0000: recovery is timed out, evict stale exports
Lustre: nbp7-MDT0000: disconnecting 30 stale clients
LustreError: 5667:0:(mdt_lvb.c:157:mdt_lvbo_fill()) nbp7-MDT0000: expected 56 actual 0.
Lustre: nbp7-MDT0000: Recovery over after 5:02, of 11832 clients 11802 recovered and 30 were evicted.
------------[ cut here ]------------
kernel BUG at fs/jbd2/transaction.c:1033!

Rebooted and ran fsck.

Ran recovery; it crashed again in the same place.

Rebooted and mounted with recovery aborted; no crash so far.
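
For reference, the check that fires here is the journal-credit assertion in jbd2_journal_dirty_metadata(). The following is a paraphrased sketch from the 2.6.32 (RHEL6) sources, quoted from memory, so treat the details as approximate:

/* Paraphrased from fs/jbd2/transaction.c in the 2.6.32 (RHEL6) kernel;
 * quoted from memory, so treat the details as approximate. The
 * assertion fires when a handle tries to dirty a metadata buffer after
 * it has used up all of the buffer credits reserved when the handle
 * was started, i.e. the transaction was declared with too few journal
 * credits for the work actually performed. */
int jbd2_journal_dirty_metadata(handle_t *handle, struct buffer_head *bh)
{
        struct journal_head *jh = bh2jh(bh);

        /* ... */
        J_ASSERT_JH(jh, handle->h_buffer_credits > 0);  /* transaction.c:1033 */
        /* ... */
}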



 Comments   
Comment by Peter Jones [ 09/May/14 ]

Bobijam

Is this related to LU-4382?

Peter

Comment by Zhenyu Xu [ 12/May/14 ]

Yes, I think so. Should we land the patch on the b2_4 branch as well?

Comment by Peter Jones [ 12/May/14 ]

I think that NASA would appreciate the option of applying a patch for this issue rather than waiting until they upgrade to 2.5.x, so could you please port the patch to b2_4? Thanks

Comment by Zhenyu Xu [ 12/May/14 ]

The LU-4382 patch port for b2_4 is tracked at http://review.whamcloud.com/10293

Comment by Jay Lan (Inactive) [ 16/Jun/14 ]

I figured out that the patch in #10293 is a kernel patch. Does that mean we need to rebuild the rhel6.3 kernel to pick up this patch? Are the ldiskfs RPMs built during the Lustre build alone not sufficient? Thanks for any advice.

Comment by Jay Lan (Inactive) [ 16/Jun/14 ]

Our production system crashed on this bug. My question comes down to this:
the Lustre build relies on ext4/* sources from the kernel. If I just change the two .c
files used by the Lustre build without rebuilding the kernel, would
the Lustre server built this way (i.e., on a running kernel that was not rebuilt) work correctly?

This would give me a quick turnaround.

Comment by Andreas Dilger [ 16/Jun/14 ]

Jay,
The change in the 10293 patch is against a patch that is applied on top of the ext4 code copied from the kernel into the ldiskfs directory of the Lustre source tree. I don't think it would be possible to apply it to the kernel's ext4 sources, since the affected code (in ext4_dquot_initialize() and ext4_dquot_drop()) does not exist in the upstream kernel. The changes to the core ext4 kernel code in ext4_delete_inode() could be applied to your kernel sources, but it would be better to apply the whole 10293 patch to your Lustre source tree and rebuild Lustre completely.

In theory just the ldiskfs.ko module would need to be replaced, since this patch shouldn't affect the binary API between any of the modules, but for safety it would be better to just install the whole package (with a new package version) so that it is clear that this updated version is installed everywhere.

Comment by Jay Lan (Inactive) [ 16/Jun/14 ]

Oops, my bad. Only lustre/kernel_patches/ needs to be patched into the kernel source. This patch is under ldiskfs/kernel_patches/...,
so I do not need to rebuild the Lustre server kernel.

Comment by Li Xi (Inactive) [ 01/Jul/14 ]

We saw a similar problem on a system that already has the LU-4382 patch applied.

2014-07-01 06:24:48 Stack:
2014-07-01 06:24:48  ffff882067153858 ffffffffa0c2e510 ffff880f812de9b8 0000000000000000
2014-07-01 06:24:48 <d> ffff880f67de5890 ffffffffa0bed0bb ffff880f67de5880 ffffffff81096d8f
2014-07-01 06:24:48 <d> ffff880f812cc5f0 ffff882067153858 0000000000000018 ffff880f812de9b8
2014-07-01 06:24:48 Call Trace:
2014-07-01 06:24:48 [<ffffffffa0bed0bb>] __ldiskfs_handle_dirty_metadata+0x7b/0x100 [ldiskfs]
2014-07-01 06:24:48 [<ffffffff81096d8f>] ? wake_up_bit+0x2f/0x40
2014-07-01 06:24:48 [<ffffffffa0c23c35>] ldiskfs_quota_write+0x165/0x210 [ldiskfs]
2014-07-01 06:24:48 [<ffffffff811e4a81>] v2_write_file_info+0xa1/0xe0
2014-07-01 06:24:48 [<ffffffff811e0b88>] dquot_acquire+0x138/0x140
2014-07-01 06:24:48 [<ffffffffa0c227a6>] ldiskfs_acquire_dquot+0x66/0xb0 [ldiskfs]
2014-07-01 06:24:48 [<ffffffff811e2afc>] dqget+0x2ac/0x390
2014-07-01 06:24:48 [<ffffffff811e30a8>] dquot_initialize+0x98/0x240
2014-07-01 06:24:48 [<ffffffffa0c22a03>] ldiskfs_dquot_initialize+0x83/0xd0 [ldiskfs]
2014-07-01 06:24:48 [<ffffffffa0c7ddcf>] osd_attr_set+0x12f/0x540 [osd_ldiskfs]
2014-07-01 06:24:48 [<ffffffffa0d55879>] dt_attr_set.clone.2+0x29/0xc0 [ofd]
2014-07-01 06:24:48 [<ffffffffa0d59362>] ofd_attr_set+0x522/0x6c0 [ofd]
2014-07-01 06:24:48 [<ffffffffa0d4ae2a>] ofd_setattr+0x69a/0xb80 [ofd]
2014-07-01 06:24:48 [<ffffffffa0d1bc1c>] ost_setattr+0x31c/0x990 [ost]
2014-07-01 06:24:48 [<ffffffffa0d1f746>] ost_handle+0x21e6/0x48e0 [ost]
2014-07-01 06:24:48 [<ffffffffa06cfbcb>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
2014-07-01 06:24:48 [<ffffffffa06d83a8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
2014-07-01 06:24:48 [<ffffffffa03765de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
2014-07-01 06:24:48 [<ffffffffa0387d3f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
2014-07-01 06:24:48 [<ffffffffa06cf709>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
2014-07-01 06:24:48 [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
2014-07-01 06:24:48 [<ffffffffa06d973e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
2014-07-01 06:24:49 [<ffffffffa06d8c70>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
2014-07-01 06:24:49 [<ffffffff8100c0ca>] child_rip+0xa/0x20
2014-07-01 06:24:49 [<ffffffffa06d8c70>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
2014-07-01 06:24:49 [<ffffffffa06d8c70>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
2014-07-01 06:24:49 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

Comment by Li Xi (Inactive) [ 01/Jul/14 ]

I am not sure, but I am wondering whether we always need to add an extra 2*LDISKFS_QUOTA_INIT_BLOCKS credits in osd_declare_attr_set(), so that ldiskfs_dquot_initialize() will have enough credits...
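
For illustration, the suggestion above would amount to something like the following in the declare phase. This is a minimal sketch only, not a landed change; the function signature and helpers (osd_dev(), osd_sb(), ot_credits) follow the osd-ldiskfs naming but may not match the tree exactly:

/* Minimal sketch of the suggestion above, NOT a landed patch.
 * Unconditionally reserve quota-initialization credits in
 * osd_declare_attr_set(), so that a later ldiskfs_dquot_initialize()
 * in osd_attr_set() cannot overrun the handle: one set of
 * LDISKFS_QUOTA_INIT_BLOCKS credits per quota type (user and group),
 * hence the factor of two. */
static int osd_declare_attr_set(const struct lu_env *env,
                                struct dt_object *dt,
                                const struct lu_attr *attr,
                                struct thandle *handle)
{
        struct osd_thandle *oh  = container_of(handle, struct osd_thandle,
                                               ot_super);
        struct osd_device  *osd = osd_dev(dt->do_lu.lo_dev);

        /* ... existing credit declarations ... */

        oh->ot_credits += 2 * LDISKFS_QUOTA_INIT_BLOCKS(osd_sb(osd));

        /* ... */
        return 0;
}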

Comment by Mahmoud Hanafi [ 13/Jul/14 ]

We have hit this on 2.4.3 with the LU-4382 patch applied.

Comment by Zhenyu Xu [ 14/Jul/14 ]

Niu,

This issue relates to a credit-counting deficiency in osd_declare_attr_set(); can you take a look?

Comment by Niu Yawei (Inactive) [ 14/Jul/14 ]

I am not sure, but I am wondering whether we always need to add an extra 2*LDISKFS_QUOTA_INIT_BLOCKS credits in osd_declare_attr_set(), so that ldiskfs_dquot_initialize() will have enough credits...

We do reserve LDISKFS_QUOTA_INIT_BLOCKS for each ID in the declare stage (see osd_declare_qid()).

This issue relates to credits counting deficiency in osd_declare_attr_set(), can you take a look?

The inconsistency comes from:

  • osd_declare_attr_set() calls osd_declare_qid() to reserve credits only when the original ID differs from the ID being set; however,
  • osd_attr_set() always calls ll_vfs_dq_init() to initialize the dquot for the inode.

I think we can resolve the problem by simply moving the ll_vfs_dq_init() call in osd_attr_set() into osd_quota_transfer(), as sketched below.
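
A rough sketch of that idea follows. It is paraphrased from this discussion, not the patch as landed (the Gerrit patch in the next comment is authoritative); the names follow the osd-ldiskfs code but details may differ:

/* Rough sketch of the proposed fix, paraphrased from this discussion.
 * The dquot initialization moves out of osd_attr_set(), where it ran
 * even when no quota credits had been declared, into
 * osd_quota_transfer(), which is only reached when the uid/gid
 * actually changes: the one case for which osd_declare_attr_set()
 * reserved quota credits via osd_declare_qid(). */
static int osd_quota_transfer(struct inode *inode, const struct lu_attr *attr)
{
        if ((attr->la_valid & LA_UID && attr->la_uid != inode->i_uid) ||
            (attr->la_valid & LA_GID && attr->la_gid != inode->i_gid)) {
                struct iattr iattr;

                /* Moved here from osd_attr_set(): initialize the dquot
                 * only on the ownership-change path, so the journal
                 * credits used match the credits declared. */
                ll_vfs_dq_init(inode);

                iattr.ia_valid = 0;
                if (attr->la_valid & LA_UID)
                        iattr.ia_valid |= ATTR_UID;
                if (attr->la_valid & LA_GID)
                        iattr.ia_valid |= ATTR_GID;
                iattr.ia_uid = attr->la_uid;
                iattr.ia_gid = attr->la_gid;

                if (ll_vfs_dq_transfer(inode, &iattr))
                        return -EDQUOT;
        }
        return 0;
}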

Comment by Zhenyu Xu [ 14/Jul/14 ]

Patch for the master branch: http://review.whamcloud.com/11085

Comment by Mahmoud Hanafi [ 14/Jul/14 ]

We will need a patch for 2.4.3.

Comment by Peter Jones [ 14/Jul/14 ]

Understood. Our usual practice is to finalize the form of the patch on master then to back port to older branches as needed

Comment by Zhenyu Xu [ 15/Jul/14 ]

Patch for b2_4: http://review.whamcloud.com/11096
Patch for b2_5: http://review.whamcloud.com/11097

Comment by Mahmoud Hanafi [ 15/Jul/14 ]

Just wanted to confirm that this patch will also fix LU-5336.

Comment by Jay Lan (Inactive) [ 16/Jul/14 ]

Mahmoud, this bug looks like LU-4382 "BUG at fs/jbd2/transaction.c:1033".

I have cherry-picked the LU-4382 patch to both 2.4.1-8nasS and 2.4.3-2nasS. Please check if we still hit this problem with any of the above builds.

The LU-4382 patch was landed on master and b2_5, but not on b2_4. Zhenyu's b2_4 patch conflicts with the LU-4382 patch, which I noticed when I tried to cherry-pick it. We may not need Zhenyu's patch if the LU-4382 patch addresses our problem.

Comment by Jay Lan (Inactive) [ 16/Jul/14 ]

Ah, I just saw Mahmoud's comment on July 13 that we hit the bug on a 2.4.3 server with the LU-4382 patch applied.

Can we land the LU-4382 patch on b2_4 first and create a new b2_4 patch for this LU? Thanks!

Comment by Peter Jones [ 16/Jul/14 ]

I believe that we can decouple the landing from the patch creation. I would think that simply rebasing http://review.whamcloud.com/#/c/11096/ to be dependent upon http://review.whamcloud.com/#/c/10293/ would meet your needs.

Comment by Jay Lan (Inactive) [ 16/Jul/14 ]

Peter, that would work for me. We have been cherry-picking patches before they land on the official branches.

Patch set #1 of #11096 conflicts with #10293, and that raised a concern for me. A newer patch certainly needs to take earlier patches into account when they address the same issue.

Comment by Zhenyu Xu [ 17/Jul/14 ]

http://review.whamcloud.com/#/c/10293/ and http://review.whamcloud.com/#/c/11096/ modify different files, so how could they conflict? I think #11096 must conflict with some other patch you have pushed into your code base.

Comment by Jay Lan (Inactive) [ 17/Jul/14 ]

Quite embarrassing...

The one that caused the conflict was http://review.whamcloud.com/9807 of LU-4611.

It was backported to b2_4 on March 27.

Sorry, I provided incorrect information.

Comment by Zhenyu Xu [ 17/Jul/14 ]

You can apply the b2_5 version (http://review.whamcloud.com/11097) on top of your code base, since b2_5 already has the LU-4611 patch.

Comment by Jay Lan (Inactive) [ 17/Jul/14 ]

Thanks, Zhenyu!

Comment by Mahmoud Hanafi [ 04/Aug/14 ]

We hit this bug on an OSS with the patch applied:

-----------[ cut here ]------------
kernel BUG at fs/jbd2/transaction.c:1033!
BUG: unable to handle kernel paging request at fffffffffffffff8
IP: [<ffffffff8145d81d>] kdb_bb+0x3bd/0x1290
PGD 1a27067 PUD 1a28067 PMD 0 
Oops: 0000 [#1] SMP 
crash> bt
PID: 8324   TASK: ffff880afaba4ae0  CPU: 11  COMMAND: "ll_ost03_000"
 #0 [ffff880afabaf340] machine_kexec at ffffffff81035e8b
 #1 [ffff880afabaf3a0] crash_kexec at ffffffff810c0492
 #2 [ffff880afabaf470] kdb_kdump_check at ffffffff812858d7
 #3 [ffff880afabaf480] kdb_main_loop at ffffffff81288ac7
 #4 [ffff880afabaf590] kdb_save_running at ffffffff81282c2e
 #5 [ffff880afabaf5a0] kdba_main_loop at ffffffff81463988
 #6 [ffff880afabaf5e0] kdb at ffffffff81285dc6
 #7 [ffff880afabaf650] report_bug at ffffffff812992b3
 #8 [ffff880afabaf680] die at ffffffff8100f2cf
 #9 [ffff880afabaf6b0] do_trap at ffffffff81542a34
#10 [ffff880afabaf710] do_invalid_op at ffffffff8100cea5
#11 [ffff880afabaf7b0] invalid_op at ffffffff8100be5b
    [exception RIP: jbd2_journal_dirty_metadata+269]
    RIP: ffffffffa0ca28ad  RSP: ffff880afabaf860  RFLAGS: 00010246
    RAX: ffff880bb027db80  RBX: ffff88072d37c468  RCX: ffff8805de35b748
    RDX: 0000000000000000  RSI: ffff8805de35b748  RDI: 0000000000000000
    RBP: ffff880afabaf880   R8: 9010000000000000   R9: fa03cbc04565d202
    R10: 0000000000000001  R11: 0000000000000000  R12: ffff8808062b9ba8
    R13: ffff8805de35b748  R14: ffff8805b6d0a800  R15: 0000000000000080
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#12 [ffff880afabaf888] __ldiskfs_handle_dirty_metadata at ffffffffa0d320bb [ldiskfs]
#13 [ffff880afabaf8c8] osd_ldiskfs_write_record at ffffffffa0dce92c [osd_ldiskfs]
#14 [ffff880afabaf958] osd_write at ffffffffa0dcf878 [osd_ldiskfs]
#15 [ffff880afabaf998] dt_record_write at ffffffffa0638415 [obdclass]
#16 [ffff880afabaf9c8] tgt_client_data_write at ffffffffa080dcac [ptlrpc]
#17 [ffff880afabafa08] ofd_txn_stop_cb at ffffffffa0e96ad5 [ofd]
#18 [ffff880afabafa68] dt_txn_hook_stop at ffffffffa0637f23 [obdclass]
#19 [ffff880afabafa98] osd_trans_stop at ffffffffa0db0ca7 [osd_ldiskfs]
#20 [ffff880afabafb18] ofd_trans_stop at ffffffffa0e96882 [ofd]
#21 [ffff880afabafb28] ofd_attr_set at ffffffffa0e9b225 [ofd]
#22 [ffff880afabafb88] ofd_setattr at ffffffffa0e8ce2a [ofd]
#23 [ffff880afabafc18] ost_setattr at ffffffffa0e5dc1c [ost]
#24 [ffff880afabafc78] ost_handle at ffffffffa0e61746 [ost]
#25 [ffff880afabafdb8] ptlrpc_server_handle_request at ffffffffa07cf3b8 [ptlrpc]
#26 [ffff880afabafeb8] ptlrpc_main at ffffffffa07d074e [ptlrpc]
#27 [ffff880afabaff48] kernel_thread at ffffffff8100c0ca

Comment by Mahmoud Hanafi [ 04/Aug/14 ]

And a second one crashed:

Pid: 17618, comm: ll_ost03_056 Not tainted 2.6.32-358.23.2.el6.20140115.x86_64.lustre243 #1 SGI.COM SUMMIT/S2600GZ
RIP: 0010:[<ffffffffa05038ad>]  [<ffffffffa05038ad>] jbd2_journal_dirty_metadata+0x10d/0x150 [jbd2]
RSP: 0018:ffff881f7a1a9530  EFLAGS: 00010246
RAX: ffff880b9a89d4c0 RBX: ffff881a9133aaf8 RCX: ffff882011fb7a20
RDX: 0000000000000000 RSI: ffff882011fb7a20 RDI: 0000000000000000
RBP: ffff881f7a1a9550 R08: 4010000000000000 R09: dfd00c8a5ed68802
R10: 0000000000000001 R11: 0000000000000000 R12: ffff88118ba5e208
R13: ffff882011fb7a20 R14: ffff881fe0bff800 R15: 0000000000001400
FS:  00007fffedaf0700(0000) GS:ffff881078880000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000006c9038 CR3: 0000000001a25000 CR4: 00000000000407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ll_ost03_056 (pid: 17618, threadinfo ffff881f7a1a8000, task ffff881f7a196aa0)
Stack:
 ffff881a9133aaf8 ffffffffa0bec510 ffff882011fb7a20 0000000000000000
<d> ffff881f7a1a9590 ffffffffa0bab0bb ffff881f7a1a9580 ffffffff810962ff
<d> ffff88201f19e250 ffff881a9133aaf8 0000000000000400 ffff882011fb7a20
Call Trace:
 [<ffffffffa0bab0bb>] __ldiskfs_handle_dirty_metadata+0x7b/0x100 [ldiskfs]
 [<ffffffff810962ff>] ? wake_up_bit+0x2f/0x40
 [<ffffffffa0be1c85>] ldiskfs_quota_write+0x165/0x210 [ldiskfs]
 [<ffffffff811e28ae>] write_blk+0x2e/0x30
 [<ffffffff811e2e5a>] remove_free_dqentry+0x8a/0x140
 [<ffffffff811e3807>] do_insert_tree+0x317/0x3d0
 [<ffffffff811e3775>] do_insert_tree+0x285/0x3d0
 [<ffffffff811e3775>] do_insert_tree+0x285/0x3d0
 [<ffffffff811e3775>] do_insert_tree+0x285/0x3d0
 [<ffffffff811e39b8>] qtree_write_dquot+0xf8/0x150
 [<ffffffff811e2c2e>] ? qtree_read_dquot+0x5e/0x200
 [<ffffffff811e2100>] v2_write_dquot+0x30/0x40
 [<ffffffff811de2b0>] dquot_acquire+0xc0/0x140
 [<ffffffffa0be07f6>] ldiskfs_acquire_dquot+0x66/0xb0 [ldiskfs]
 [<ffffffff811e029c>] dqget+0x2ac/0x390
 [<ffffffff811e1b86>] dquot_transfer+0x116/0x620
 [<ffffffff811e09ab>] ? dquot_initialize+0x1fb/0x240
 [<ffffffffa0be0558>] ? __ldiskfs_journal_stop+0x68/0xa0 [ldiskfs]
 [<ffffffff811de4bc>] vfs_dq_transfer+0x6c/0xd0
 [<ffffffffa0c12128>] osd_quota_transfer+0xa8/0x160 [osd_ldiskfs]
 [<ffffffffa05e63ab>] ? lu_context_init+0xab/0x260 [obdclass]
 [<ffffffffa0c1109e>] ? osd_trans_exec_op+0x1e/0x2e0 [osd_ldiskfs]
 [<ffffffffa0c23432>] osd_attr_set+0x102/0x4e0 [osd_ldiskfs]
 [<ffffffffa0cca879>] dt_attr_set.clone.2+0x29/0xc0 [ofd]
 [<ffffffffa0cce362>] ofd_attr_set+0x522/0x6c0 [ofd]
 [<ffffffffa0cbfe2a>] ofd_setattr+0x69a/0xb80 [ofd]
 [<ffffffffa0c9bc1c>] ost_setattr+0x31c/0x990 [ost]
 [<ffffffffa0c9f746>] ost_handle+0x21e6/0x48e0 [ost]
 [<ffffffffa0494124>] ? libcfs_id2str+0x74/0xb0 [libcfs]
 [<ffffffffa077e3b8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
 [<ffffffffa04885de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
 [<ffffffffa0499d6f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
 [<ffffffffa0775719>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
 [<ffffffff81063be0>] ? default_wake_function+0x0/0x20
 [<ffffffffa077f74e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
 [<ffffffffa077ec80>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffffa077ec80>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffffa077ec80>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

Comment by Jay Lan (Inactive) [ 04/Aug/14 ]

We picked up the patch on 7/17. There is a newer version of the patch from 7/25 that we were not aware of.

Comment by Oleg Drokin [ 04/Aug/14 ]

Yes, unfortunately the 7/17 version crashes in almost exactly the same way (a slightly different backtrace) in my testing, but the 7/25 version does not crash.
So please try to apply the newer patch.

Comment by Mahmoud Hanafi [ 05/Aug/14 ]

I think the LU should be updated whenever the provided patch is changed/updated.

Comment by Zhenyu Xu [ 08/Aug/14 ]

Sorry for that; I forgot to update here. I have just updated it in Gerrit.

Comment by Jay Lan (Inactive) [ 08/Aug/14 ]

That is fine, Zhenyu.

Peter mentioned that we used to have too much information in JIRA, and thus Intel no longer logs Gerrit messages to JIRA.

We do not need messages about Jenkins, Autotest, or Maloo. A simple "Patch Set # uploaded" message to JIRA for every new patch set would be sufficient, and I do not consider it noisy. I think it could be implemented in your system.

Comment by Zhenyu Xu [ 29/Aug/14 ]

The patch has been updated based on the review results.

Comment by Jay Lan (Inactive) [ 29/Aug/14 ]

Thank you, Zhenyu, for the update. I will pick up the new patch set.

Comment by Peter Jones [ 02/Oct/14 ]

Landed for 2.5.4 and 2.7

Comment by Jay Lan (Inactive) [ 25/Mar/15 ]

We hit this bug again in production. It hit an OSS. The backtrace looks exactly the same as the one in Mahmoud's comment on 04/Aug/14 10:43 AM.

We have patch set #5 from http://review.whamcloud.com/11097 in our git repo.
The difference between #5 and #6 was an extra empty line.
The difference between #6 and #7 was in the commit message.

Comment by Zhenyu Xu [ 26/Mar/15 ]

Hi Jay,

Does your repository have the LU-5777 patch? Its absence will also cause a credit deficiency; we should backport it too.

Comment by Jay Lan (Inactive) [ 26/Mar/15 ]

Hi Zhenyu,

We do not have the LU-5777 patch in our 2.4.3 repo. Hmm, it is not in our 2.5.3 repo either.

Comment by Jay Lan (Inactive) [ 26/Mar/15 ]

I saw that you backported the LU-5777 patch to b2_5, and my cherry-pick went cleanly.
I will wait for your patch to clear autotest before I build it. Thanks, Zhenyu!
