[LU-3071] kernel BUG at fs/jbd2/transaction.c:293 (jbd2_journal_start) Created: 29/Mar/13  Updated: 18/Jul/17  Resolved: 18/Jul/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.4
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Prakash Surya (Inactive) Assignee: Niu Yawei (Inactive)
Resolution: Won't Do Votes: 0
Labels: llnl
Environment:

Lustre 2.1.4 on RHEL 6.4


Issue Links:
Related
is related to LU-4529 kernel BUG at fs/jbd2/transaction.c:2... Resolved
Severity: 3
Rank (Obsolete): 7472

 Description   

A couple of OSS nodes went down with very similar stack traces. One of the stacks is below:

2013-03-28 11:01:49 ------------[ cut here ]------------
2013-03-28 11:01:49 kernel BUG at fs/jbd2/transaction.c:293!
2013-03-28 11:01:49 invalid opcode: 0000 [#1] SMP 
2013-03-28 11:01:49 last sysfs file: /sys/devices/system/edac/pci/pci_parity_count
2013-03-28 11:01:49 CPU 11 
2013-03-28 11:01:49 Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) exportfs ost(U) mgc(U) ldiskfs(U) mbcache jbd2 lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) acpi_cpufreq freq_table mperf ko2iblnd(U) lnet(U) libcfs(U) sd_mod crc_t10dif ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr mlx4_ib ib_sa ib_mad iw_cxgb4 iw_cxgb3 ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm_intel kvm sg sr_mod cdrom mpt2sas scsi_transport_sas raid_class serio_raw i2c_i801 i2c_core ata_generic pata_acpi ata_piix iTCO_wdt iTCO_vendor_support ioatdma i7core_edac edac_core shpchp nfs lockd fscache auth_rpcgss nfs_acl sunrpc mlx4_en mlx4_core igb dca ptp pps_core be2iscsi bnx2i cnic uio ipv6 cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: cpufreq_ondemand]
2013-03-28 11:01:50 
2013-03-28 11:01:50 Pid: 6814, comm: ll_ost_io_72 Not tainted 2.6.32-358.5chaos.ch5.1.x86_64 #1 Supermicro X8DTH-i/6/iF/6F/X8DTH
2013-03-28 11:01:50 RIP: 0010:[<ffffffffa0c2b5ac>]  [<ffffffffa0c2b5ac>] jbd2_journal_start+0x10c/0x110 [jbd2]
2013-03-28 11:01:50 RSP: 0018:ffff880107687190  EFLAGS: 00010287
2013-03-28 11:01:50 RAX: ffff8800add1f580 RBX: ffff880341bc2318 RCX: 0000000000000000
2013-03-28 11:01:50 RDX: 0000000000000000 RSI: 0000000000000012 RDI: ffff88010ed29000
2013-03-28 11:01:50 RBP: ffff8801076871d0 R08: ffff88039ec46b38 R09: fea40d252cae9407
2013-03-28 11:01:50 R10: 0000000000000000 R11: 0000000000000001 R12: ffff880341bc2318
2013-03-28 11:01:50 R13: ffff88010ed29000 R14: ffff8801237ae080 R15: 0000000000000012
2013-03-28 11:01:50 FS:  00007ffff7fdb700(0000) GS:ffff8800282e0000(0000) knlGS:0000000000000000
2013-03-28 11:01:50 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
2013-03-28 11:01:50 CR2: 000000000047f670 CR3: 0000000001a85000 CR4: 00000000000007e0
2013-03-28 11:01:50 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2013-03-28 11:01:50 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
2013-03-28 11:01:50 Process ll_ost_io_72 (pid: 6814, threadinfo ffff880107686000, task ffff8801237ae080)
2013-03-28 11:01:50 Stack:
2013-03-28 11:01:50  ffff8801076871a0 ffffffff810e684e ffff8801076871c0 ffff880129716c00
2013-03-28 11:01:50 <d> 0000000000000012 0000000000000000 ffffffff81fcb680 ffff88039ec46b38
2013-03-28 11:01:50 <d> ffff8801076871f0 ffffffffa0c83506 ffff88039ec46b28 ffff88039ec46c60
2013-03-28 11:01:50 Call Trace:
2013-03-28 11:01:50  [<ffffffff810e684e>] ? call_rcu+0xe/0x10
2013-03-28 11:01:50  [<ffffffffa0c83506>] ldiskfs_journal_start_sb+0x56/0xe0 [ldiskfs]
2013-03-28 11:01:50  [<ffffffffa0c83854>] ldiskfs_dquot_drop+0x34/0x80 [ldiskfs]
2013-03-28 11:01:50  [<ffffffff811e00a2>] vfs_dq_drop+0x52/0x60
2013-03-28 11:01:50  [<ffffffff8119cac3>] clear_inode+0x93/0x140
2013-03-28 11:01:50  [<ffffffff8119cbb0>] dispose_list+0x40/0x120
2013-03-28 11:01:50  [<ffffffff8119cf04>] shrink_icache_memory+0x274/0x2e0
2013-03-28 11:01:50  [<ffffffff8113196a>] shrink_slab+0x12a/0x1a0
2013-03-28 11:01:50  [<ffffffff81134439>] zone_reclaim+0x279/0x400
2013-03-28 11:01:50  [<ffffffff8112a08c>] get_page_from_freelist+0x69c/0x830
2013-03-28 11:01:50  [<ffffffff8112b623>] ? __alloc_pages_nodemask+0x113/0x8d0
2013-03-28 11:01:50  [<ffffffffa0c53456>] ? ldiskfs_ext_get_blocks+0x296/0x14b0 [ldiskfs]
2013-03-28 11:01:50  [<ffffffff8112b623>] __alloc_pages_nodemask+0x113/0x8d0
2013-03-28 11:01:50  [<ffffffff81166f78>] ? ____cache_alloc_node+0x108/0x160
2013-03-28 11:01:50  [<ffffffff81166572>] kmem_getpages+0x62/0x170
2013-03-28 11:01:50  [<ffffffff81166bdf>] cache_grow+0x2cf/0x320
2013-03-28 11:01:50  [<ffffffff81166e32>] cache_alloc_refill+0x202/0x240
2013-03-28 11:01:50  [<ffffffff81167ecf>] kmem_cache_alloc+0x15f/0x190
2013-03-28 11:01:50  [<ffffffff8109e6b6>] prepare_creds+0x26/0xb0
2013-03-28 11:01:50  [<ffffffffa052a865>] cfs_cap_raise+0x15/0x40 [libcfs]
2013-03-28 11:01:50  [<ffffffffa0d580aa>] filter_commitrw_write+0x11ca/0x2d28 [obdfilter]
2013-03-28 11:01:50  [<ffffffff8150db00>] ? thread_return+0x4e/0x76e
2013-03-28 11:01:50  [<ffffffff810812fc>] ? lock_timer_base+0x3c/0x70
2013-03-28 11:01:50  [<ffffffff81081e12>] ? del_timer_sync+0x22/0x30
2013-03-28 11:01:50  [<ffffffff8150e8fa>] ? schedule_timeout+0x19a/0x2e0
2013-03-28 11:01:50  [<ffffffff81081410>] ? process_timeout+0x0/0x10
2013-03-28 11:01:50  [<ffffffffa0d4b332>] filter_commitrw+0x272/0x290 [obdfilter]
2013-03-28 11:01:50  [<ffffffffa07c69a9>] ? target_bulk_io+0x619/0x940 [ptlrpc]
2013-03-28 11:01:50  [<ffffffffa0ce0bdd>] obd_commitrw+0x11d/0x3c0 [ost]
2013-03-28 11:01:50  [<ffffffffa0ce8994>] ost_brw_write+0xcc4/0x1600 [ost]
2013-03-28 11:01:50  [<ffffffffa07bb000>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
2013-03-28 11:01:50  [<ffffffffa0cee0a7>] ost_handle+0x2b77/0x4270 [ost]
2013-03-28 11:01:50  [<ffffffffa07f9e1c>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
2013-03-28 11:01:50  [<ffffffffa080141b>] ? ptlrpc_update_export_timer+0x4b/0x470 [ptlrpc]
2013-03-28 11:01:50  [<ffffffffa080900e>] ptlrpc_main+0xc4e/0x1a40 [ptlrpc]
2013-03-28 11:01:50  [<ffffffffa08083c0>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
2013-03-28 11:01:50  [<ffffffff8100c0ca>] child_rip+0xa/0x20
2013-03-28 11:01:50  [<ffffffffa08083c0>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
2013-03-28 11:01:50  [<ffffffffa08083c0>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
2013-03-28 11:01:50  [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
2013-03-28 11:01:50 Code: e8 6a cf 53 e0 49 c7 86 38 08 00 00 00 00 00 00 8b 45 c8 48 63 d8 e9 62 ff ff ff 0f 1f 40 00 48 c7 c3 f4 ff ff ff e9 52 ff ff ff <0f> 0b eb fe 55 48 89 e5 0f 1f 44 00 00 be 01 00 00 00 e8 dd fe 
2013-03-28 11:01:50 RIP  [<ffffffffa0c2b5ac>] jbd2_journal_start+0x10c/0x110 [jbd2]
2013-03-28 11:01:50  RSP <ffff880107687190>

From what I can tell, the root of the problem is that current->journal_info->h_transaction->t_journal does not match LDISKFS_SB(sb)->s_journal, causing us to trigger this assertion in jbd2_journal_start:

 292         if (handle) {                                                           
 293                 J_ASSERT(handle->h_transaction->t_journal == journal);          
 294                 handle->h_ref++;                                                
 295                 return handle;                                                  
 296         }
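
For reference, jbd2_journal_start() reuses whatever handle the running task already holds (current->journal_info, returned by journal_current_handle()), which is why a handle started on a different journal trips the assertion instead of simply nesting. Abridged from the RHEL6-era fs/jbd2/transaction.c (exact line numbers vary by kernel build):

 handle_t *handle = journal_current_handle();   /* current->journal_info */

 if (handle) {
         /* reuse is only legal if the outer handle was started on this
          * same journal; otherwise the assertion below fires */
         J_ASSERT(handle->h_transaction->t_journal == journal);
         handle->h_ref++;        /* same journal: just nest */
         return handle;
 }
 /* ... otherwise allocate a new handle and join this journal's
  *     running transaction ... */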


 Comments   
Comment by Peter Jones [ 29/Mar/13 ]

Niu

Could you please assist with this one?

Thanks

Peter

Comment by Niu Yawei (Inactive) [ 01/Apr/13 ]

What I can think of is a stack overflow. Prakash, do you have a dump file that can be used to verify whether the task_struct is sane? Thanks.

Comment by Oleg Drokin [ 01/Apr/13 ]

I actually had a string of mysterious jbd crashes in my testing too, with no clear explanation:

[ 4290.997885] ------------[ cut here ]------------
[ 4290.998129] kernel BUG at fs/jbd2/commit.c:353!
[ 4290.998386] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 4290.998690] last sysfs file: /sys/devices/system/cpu/possible
[ 4290.998974] CPU 1 
[ 4290.999540] Modules linked in: lustre ofd osp lod ost mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs mdd mgs lquota obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet libcfs exportfs jbd sha512_generic sha256_generic ext4 mbcache jbd2 virtio_balloon virtio_console i2c_piix4 i2c_core virtio_blk virtio_net virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache nfs_acl auth_rpcgss sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: libcfs]
[ 4291.001741] 
[ 4291.001741] Pid: 22965, comm: jbd2/loop1-8 Not tainted 2.6.32-debug #6 Bochs Bochs
[ 4291.001741] RIP: 0010:[<ffffffffa0391dab>]  [<ffffffffa0391dab>] jbd2_journal_commit_transaction+0x144b/0x16e0 [jbd2]
[ 4291.001741] RSP: 0018:ffff880064b3fd20  EFLAGS: 00010246
[ 4291.001741] RAX: 0000000000000008 RBX: ffff8800ae8b47f0 RCX: ffff8800ae8b4900
[ 4291.001741] RDX: ffff8800132aa000 RSI: 0000000000000282 RDI: ffff8800ae8b47f0
[ 4291.001741] RBP: ffff880064b3fe50 R08: ffff880078ea6388 R09: 0000000000000001
[ 4291.001741] R10: 0000000000000800 R11: 0000000000000001 R12: 0000000000000000
[ 4291.001741] R13: ffff8800ae8b47f0 R14: ffff8800a713e340 R15: ffff8800ae8b48e8
[ 4291.001741] FS:  0000000000000000(0000) GS:ffff880006240000(0000) knlGS:0000000000000000
[ 4291.001741] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 4291.001741] CR2: 00007f0dcf9ac0f0 CR3: 000000008d1be000 CR4: 00000000000006e0
[ 4291.001741] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4291.001741] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 4291.001741] Process jbd2/loop1-8 (pid: 22965, threadinfo ffff880064b3e000, task ffff8800a713e340)
[ 4291.001741] Stack:
[ 4291.001741]  ffff880064b3fd80 ffffffff81009310 ffff88005f66bdb0 ffff8800a56e1df0
[ 4291.001741] <d> ffff8800ae8b4c58 000003e712e100e5 ffff8800ae8b4c88 ffff8800b077e000
[ 4291.001741] <d> 0000000800000000 00000fd400000000 ffff8800ae8b4818 0000000000000282
[ 4291.001741] Call Trace:
[ 4291.001741]  [<ffffffff81009310>] ? __switch_to+0xd0/0x320
[ 4291.001741]  [<ffffffff814fb054>] ? _spin_lock_irqsave+0x24/0x30
[ 4291.001741]  [<ffffffff8107c65b>] ? try_to_del_timer_sync+0x7b/0xe0
[ 4291.001741]  [<ffffffffa0397627>] kjournald2+0xb7/0x210 [jbd2]
[ 4291.001741]  [<ffffffff8108fd60>] ? autoremove_wake_function+0x0/0x40
[ 4291.001741]  [<ffffffffa0397570>] ? kjournald2+0x0/0x210 [jbd2]
[ 4291.001741]  [<ffffffff8108fa16>] kthread+0x96/0xa0
[ 4291.001741]  [<ffffffff8100c14a>] child_rip+0xa/0x20
[ 4291.001741]  [<ffffffff8108f980>] ? kthread+0x0/0xa0
[ 4291.001741]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
[ 4291.001741] Code: 0b eb fe 49 3b 85 90 00 00 00 0f 84 f4 f9 ff ff 0f 0b eb fe 8b b5 48 ff ff ff 4c 89 ef e8 be 5b 00 00 e9 97 f7 ff ff 0f 0b eb fe <0f> 0b 0f 1f 00 eb fb 0f 0b eb fe 0f 0b 0f 1f 84 00 00 00 00 00 
[ 4291.001741] RIP  [<ffffffffa0391dab>] jbd2_journal_commit_transaction+0x144b/0x16e0 [jbd2]
[ 4291.001741]  RSP <ffff880064b3fd20>

The assertion is J_ASSERT(journal->j_running_transaction != NULL);
I poked into the crashdump and there I see that:

  j_running_transaction = 0xffff8800a4357df0,
  j_committing_transaction = 0x0,

Interested parties can dig into the crashdump at /exports/crashdumps/192.168.10.220-2013-03-05-13\:11\:23/

The other related crash I had was this:

[268042.643749] ------------[ cut here ]------------
[268042.643749] kernel BUG at fs/jbd2/commit.c:353!
[268042.643749] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[268042.643749] last sysfs file: /sys/devices/system/cpu/possible
[268042.643749] CPU 7 
[268042.643749] Modules linked in: lustre ofd osp lod ost mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs mdd mgs lquota obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet libcfs exportfs jbd sha512_generic sha256_generic ext4 mbcache jbd2 virtio_balloon virtio_console i2c_piix4 i2c_core virtio_blk virtio_net virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache nfs_acl auth_rpcgss sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: libcfs]
[268042.643749] 
[268042.643749] Pid: 15086, comm: jbd2/loop2-8 Not tainted 2.6.32-debug #6 Bochs Bochs
[268042.643749] RIP: 0010:[<ffffffffa03a3dab>]  [<ffffffffa03a3dab>] jbd2_journal_commit_transaction+0x144b/0x16e0 [jbd2]
[268042.643749] RSP: 0018:ffff880017355d20  EFLAGS: 00010246
[268042.643749] RAX: 0000000000000008 RBX: ffff8800afa307f0 RCX: ffff8800afa30900
[268042.643749] RDX: ffff880097422000 RSI: 0000000000000282 RDI: ffff8800afa307f0
[268042.643749] RBP: ffff880017355e50 R08: ffff8800301d2448 R09: 0000000000000001
[268042.643749] R10: 0000000000000800 R11: 0000000000000000 R12: 0000000000000000
[268042.643749] R13: ffff8800afa307f0 R14: ffff8800b3e82580 R15: ffff8800afa308e8
[268042.643749] FS:  0000000000000000(0000) GS:ffff8800063c0000(0000) knlGS:0000000000000000
[268042.643749] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[268042.643749] CR2: 00007f58a8242800 CR3: 0000000076b0b000 CR4: 00000000000006e0
[268042.643749] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[268042.643749] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[268042.643749] Process jbd2/loop2-8 (pid: 15086, threadinfo ffff880017354000, task ffff8800b3e82580)
[268042.643749] Stack:
[268042.643749]  ffff880017355d80 ffffffff81009310 ffff8800296d1cb0 ffff880081396df0
[268042.643749] <d> ffff8800afa30c58 0000f3c88a448fef ffff8800afa30c88 ffff880015dde000
[268042.643749] <d> 0000000800000000 00000fd400000000 ffff8800afa30818 0000000000000282
[268042.643749] Call Trace:
[268042.643749]  [<ffffffff81009310>] ? __switch_to+0xd0/0x320
[268042.643749]  [<ffffffff814fb054>] ? _spin_lock_irqsave+0x24/0x30
[268042.643749]  [<ffffffff8107c65b>] ? try_to_del_timer_sync+0x7b/0xe0
[268042.643749]  [<ffffffffa03a9627>] kjournald2+0xb7/0x210 [jbd2]
[268042.643749]  [<ffffffff8108fd60>] ? autoremove_wake_function+0x0/0x40
[268042.643749]  [<ffffffffa03a9570>] ? kjournald2+0x0/0x210 [jbd2]
[268042.643749]  [<ffffffff8108fa16>] kthread+0x96/0xa0
[268042.643749]  [<ffffffff8100c14a>] child_rip+0xa/0x20
[268042.643749]  [<ffffffff8108f980>] ? kthread+0x0/0xa0
[268042.643749]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
[268042.643749] Code: 0b eb fe 49 3b 85 90 00 00 00 0f 84 f4 f9 ff ff 0f 0b eb fe 8b b5 48 ff ff ff 4c 89 ef e8 be 5b 00 00 e9 97 f7 ff ff 0f 0b eb fe <0f> 0b 0f 1f 00 eb fb 0f 0b eb fe 0f 0b 0f 1f 84 00 00 00 00 00 
[268042.643749] RIP  [<ffffffffa03a3dab>] jbd2_journal_commit_transaction+0x144b/0x16e0 [jbd2]
[268042.643749]  RSP <ffff880017355d20>

Crashdump is in /exports/crashdumps/192.168.10.210-2013-03-16-16\:31\:57/vmcore

In addition, there was a string of random hangs in jbd2 commit too; for example, see LU-2938 on one of my systems, or LU-1663 where there is an unexplained hang in the middle of jbd2 commit as well, only with the 1.8 codebase.

Comment by Prakash Surya (Inactive) [ 02/Apr/13 ]

Niu, unfortunately I don't have any crash dumps for the two failures we experienced. Our HA stuff powered the node down as it was writing the vmcore to disk, so it never finished.

Comment by Prakash Surya (Inactive) [ 02/Apr/13 ]

What makes you think this might be a stack overflow?

Comment by Oleg Drokin [ 03/Apr/13 ]

I don't think I buy the stack overflow theory myself either.
There's 4.4k of stack left at this point, so it's unlikely that other callers dived this much deeper and corrupted the current pointer on the stack.

Comment by Niu Yawei (Inactive) [ 03/Apr/13 ]

Well, I was just thinking that the journal_info in the task_struct could have been corrupted somehow (stack overflow?). I'll keep investigating to see if there are any potential jbd2 bugs; please keep posting if you get any new information on this. Thanks.

Comment by Niu Yawei (Inactive) [ 03/Apr/13 ]

Oleg, I see there is a Red Hat bug report which looks the same as what you see on your test machine: https://bugzilla.redhat.com/show_bug.cgi?id=735768 . It says the patches are available in kernel-2.6.32-304.el6.

Comment by Prakash Surya (Inactive) [ 04/Apr/13 ]

Sorry, I forgot to mention this happened on an OSS running 2.1.4.

After spending some more time reading the code, I think the issue is a result of the memory reclaim logic trying to start a transaction while another one is already open. I'll try and outline what I think is causing this:

1. Enter filter_commitrw_write
2. Start transaction via fsfilt_brw_start
3. Enter memory reclaim logic
4. Start another transaction via ldiskfs_dquot_drop
5. ASSERT because two transactions opened on different journals

Does that sound plausible?
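
For reference, mapping those steps onto the stack trace above gives roughly the following (an annotated call chain, not code; the fsfilt_brw_start frame has already returned by the time of the crash, so it does not appear in the trace):

 filter_commitrw_write                              (1)
   fsfilt_brw_start                                 (2) handle opened, left in
                                                        current->journal_info
   cfs_cap_raise
     prepare_creds
       kmem_cache_alloc(GFP_KERNEL)
         zone_reclaim -> shrink_slab
           -> shrink_icache_memory                  (3) direct reclaim enters the fs
             clear_inode -> ldiskfs_dquot_drop      (4) inode may belong to another OST
               jbd2_journal_start                   (5) different journal, J_ASSERT fires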

Comment by Prakash Surya (Inactive) [ 04/Apr/13 ]

And if that is what's happening, we can't simply pass GFP_NOFS to the allocation in question, since it's hard-coded to GFP_KERNEL:

(gdb) l *prepare_creds+0x26
0xffffffff8109e6b6 is in prepare_creds (kernel/cred.c:292).
287             struct cred *new;
288
289             validate_process_creds();
290
291             new = kmem_cache_alloc(cred_jar, GFP_KERNEL);
292             if (!new)
293                     return NULL;
294
295             kdebug("prepare_creds() alloc %p", new);
296
Comment by Oleg Drokin [ 04/Apr/13 ]

Hm, this actually sounds pretty plausible. We had quite a bunch of such problems in reiser4 in the beginning.

Niu, thanks, indeed my issue seems to be a pretty separate one. Interesting that my search did not turn up anything like that.

Comment by Prakash Surya (Inactive) [ 04/Apr/13 ]

Oleg, out of curiosity, is the transaction start operation in fsfilt_brw_start (step 2) just part of the normal transaction handling mechanics for a BRW write on top of ldiskfs? Or are we doing some kind of layering violation here?

If this is "normal", the only fix I can imagine is to audit the code to ensure any memory allocations that occur between the transaction start and stop in fsfilt_commitrw_write pass in the GFP_NOFS flag. And any allocations that we can't pass in that flag need to happen outside of the start/stop calls.

With that said, if Lustre 2.4 doesn't suffer from this, it might be OK to simply mark this as "won't fix".

Comment by Niu Yawei (Inactive) [ 07/Apr/13 ]

1. Enter filter_commitrw_write
2. Start transaction via fsfilt_brw_start
3. Enter memory reclaim logic
4. Start another transaction via ldiskfs_dquot_drop
5. ASSERT because two transactions opened on different journals

I don't see why ldiskfs_dquot_drop would want to open a transaction on a different journal. I think it should just reuse the same 'handle' for the same 'transaction', and there is only one 'journal' for a single fs, isn't there?

Comment by Prakash Surya (Inactive) [ 08/Apr/13 ]

I don't see why ldiskfs_dquot_drop would want to open a transaction on a different journal.

This is confusing to me too; I would have thought the two journals would be the same. Does the Lustre journal use a separate one from the backend ldiskfs journal?

Comment by Oleg Drokin [ 08/Apr/13 ]

The transaction start in fsfilt_brw_start is the standard way of doing it in our case; I don't think it's a layering violation.

About the "two journals", the assertion will not trigger if the journal is the same, just a nested transaction would happen.
But I suspect you might be having other ext3/4 filesystems on those nodes? Or more than a single OST? Either of those cases could try to push out pages into a filesystem other than what we have a transaction open for. If you have a crashdump, it should be easy to verify.
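
To make the "two journals" point concrete (a sketch only; sb_ost_a and sb_ost_b are hypothetical superblock pointers for two OSTs on the same OSS):

 journal_t *ja = LDISKFS_SB(sb_ost_a)->s_journal;   /* journal of OST A */
 journal_t *jb = LDISKFS_SB(sb_ost_b)->s_journal;   /* journal of OST B */

 /* ja != jb: if current->journal_info already holds a handle on ja and
  * reclaim drops an inode living on sb_ost_b, jbd2_journal_start() is
  * called with journal == jb and the J_ASSERT fires; with the same
  * journal the handle would simply be nested (handle->h_ref++). */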

Comment by D. Marc Stearman (Inactive) [ 08/Apr/13 ]

These OSS nodes have 15 OSTs on them, and 30 OSTs if failover happens.

Comment by Prakash Surya (Inactive) [ 08/Apr/13 ]

Ah yes, I forgot that we run multiple OSTs per OSS with ldiskfs. Good catch, Oleg. We don't have a crash dump for this issue, but it makes perfect sense for this to happen if each ldiskfs FS (i.e. each OST) has a separate journal.

This doesn't really seem like a problem that is specific to Lustre, does it? I can imagine this cropping up for anybody running multiple ext4/ldiskfs filesystems concurrently, with or without Lustre. For example, a thread with a transaction open on FS A that enters reclaim and touches an inode of FS B would trigger this (assuming the two filesystems have separate journals on top of jbd2).

Comment by Oleg Drokin [ 09/Apr/13 ]

Yes, this does look like a pretty generic bug, except that it seems nobody else really calls these functions with a journal open.

So the fix for us is probably to call cfs_cap_raise before we open the journal with fsfilt_brw_start (we have enough data to decide whether we need to do it before we open the journal).
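
A rough sketch of that ordering (hand-written pseudocode; argument lists are elided and the placement is an assumption, not the actual change proposed for review):

 /* raise the capability, and with it take the prepare_creds() GFP_KERNEL
  * allocation, while no journal handle is held yet */
 if (need_cap_change)                       /* hypothetical predicate */
         cfs_cap_raise(cap);                /* may allocate and trigger reclaim */

 oti->oti_handle = fsfilt_brw_start(...);   /* only now open the journal handle */
 /* ... write-out under the handle, then commit ... */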

Comment by Niu Yawei (Inactive) [ 09/Apr/13 ]

cfs_cap_raise() is not only used when setting the uid/gid on first write; it's also used when enforcing quota (see filter_direct_io() -> quota_enforce()), and it could be cfs_cap_lower() as well.

What about just calling cfs_memory_pressure_set() to disable memory reclaim before starting the transaction?
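
Roughly along these lines (a sketch; the helper pair and its placement are assumptions based on the libcfs memory-pressure API, not a reviewed change):

 cfs_memory_pressure_set();                 /* hint allocations on this thread
                                             * to avoid fs-level reclaim */
 oti->oti_handle = fsfilt_brw_start(...);   /* open the journal handle */
 /* ... allocations and write-out under the open handle ... */
 fsfilt_commit(...);                        /* close the handle */
 cfs_memory_pressure_clr();                 /* restore normal reclaim behaviour */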

Comment by Oleg Drokin [ 09/Apr/13 ]

Hm... Indeed I see there's another possibility for the prepare_creds calls in the quota path.
But I see it's actually possible to call quota_enforce before we open the transaction in filter_commitrw_write as well?

The memory pressure hack might work, but I suspect it's just a hack and as such not super-desirable.

Comment by Niu Yawei (Inactive) [ 10/Apr/13 ]

http://review.whamcloud.com/6004

Comment by Sebastien Buisson (Inactive) [ 19/Sep/13 ]

Hi,

We recently hit this issue on a customer site running Lustre 2.1. A kernel panic was reported on an OSS with the message "kernel BUG at fs/jbd2/transaction.c:293!" and with the following backtrace:

crash> bt
PID: 10848 TASK: ffff880102a4e7d0 CPU: 1 COMMAND: "ll_ost_io_312"
#0 [ffff880108a5ad60] machine_kexec at ffffffff81027a4b
#1 [ffff880108a5adc0] crash_kexec at ffffffff810a2db2
#2 [ffff880108a5ae90] oops_end at ffffffff81481730
#3 [ffff880108a5aec0] die at ffffffff810071cb
#4 [ffff880108a5aef0] do_trap at ffffffff81481024
#5 [ffff880108a5af50] do_invalid_op at ffffffff81004d75
#6 [ffff880108a5aff0] invalid_op at ffffffff81003f3b
[exception RIP: jbd2_journal_start+268]
RIP: ffffffffa009056c RSP: ffff880108a5b0a0 RFLAGS: 00010206
RAX: ffff880075fc20c0 RBX: ffff8805be9573a8 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000012 RDI: ffff88061bbd6000
RBP: ffff880108a5b0e0 R8: 0000000000000001 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000004 R12: ffff8805be9573a8
R13: ffff88061bbd6000 R14: ffff880102a4e7d0 R15: 0000000000000012
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffff880108a5b0e8] ldiskfs_journal_start_sb at ffffffffa054faf8 [ldiskfs]
#8 [ffff880108a5b0f8] ldiskfs_dquot_drop at ffffffffa054fdf4 [ldiskfs]
#9 [ffff880108a5b118] vfs_dq_drop at ffffffff811b9422
#10 [ffff880108a5b128] clear_inode at ffffffff81178106
#11 [ffff880108a5b148] dispose_list at ffffffff811781d8
#12 [ffff880108a5b178] shrink_icache_memory at ffffffff811784e8
#13 [ffff880108a5b1d8] shrink_slab at ffffffff8111446a
#14 [ffff880108a5b238] do_try_to_free_pages at ffffffff8111601b
#15 [ffff880108a5b2c8] try_to_free_pages at ffffffff8111642f
#16 [ffff880108a5b378] __alloc_pages_nodemask at ffffffff8110dae1
#17 [ffff880108a5b498] kmem_getpages at ffffffff81144de2
#18 [ffff880108a5b4c8] fallback_alloc at ffffffff81145a1e
#19 [ffff880108a5b548] ____cache_alloc_node at ffffffff81145789
#20 [ffff880108a5b598] kmem_cache_alloc at ffffffff8114617b
#21 [ffff880108a5b5d8] prepare_creds at ffffffff81081876
#22 [ffff880108a5b5f8] cfs_cap_raise at ffffffffa05a46d5 [libcfs]
#23 [ffff880108a5b618] filter_quota_enforce at ffffffffa08db345 [lquota]
#24 [ffff880108a5b668] filter_do_bio at ffffffffa08459a1 [obdfilter]
#25 [ffff880108a5b718] filter_commitrw_write at ffffffffa08484be [obdfilter]
#26 [ffff880108a5b908] filter_commitrw at ffffffffa083a6fd [obdfilter]
#27 [ffff880108a5b9c8] obd_commitrw at ffffffffa06e3f5a [ost]
#28 [ffff880108a5ba48] ost_brw_write at ffffffffa06ed922 [ost]
#29 [ffff880108a5bbf8] ost_handle at ffffffffa06f1cd5 [ost]
#30 [ffff880108a5bd68] ptlrpc_main at ffffffffa076d459 [ptlrpc]
#31 [ffff880108a5bf48] kernel_thread at ffffffff810041aa

The patch mentioned above (http://review.whamcloud.com/6004) has been partly reviewed but has not landed since last April.
So what is the status of that patch? Is it ready for use in production with Lustre 2.1?

TIA,
Sebastien.

Comment by Niu Yawei (Inactive) [ 18/Jul/17 ]

The patch for 2.1 was ready but won't get a chance to land. Closing this old 2.1 issue.
