Details
- Bug
- Resolution: Unresolved
- Major
- None
- Lustre 2.4.3
- None
- RHEL 6 w/ patched kernel for Lustre
- 3
- 15008
Description
We hit the following bug on our MDS while starting the OSTs. We were registering all the targets after a writeconf, so the MDT had been started before the OSTs.
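For context, the sequence we followed looked roughly like the sketch below. This is a hypothetical reconstruction, not our exact commands: the device paths and mount points are placeholders, and only the ordering (MDT mounted before the OSTs) matters for this report.

```shell
# Hypothetical sketch of the writeconf/re-registration sequence;
# /dev/mdt_dev, /dev/ost0_dev and the mount points are placeholders.

# Regenerate the configuration logs on every target:
tunefs.lustre --writeconf /dev/mdt_dev     # on the MDS
tunefs.lustre --writeconf /dev/ost0_dev    # on each OSS, for each OST

# Restart the targets; the MDT came up first, the OSTs afterwards:
mount -t lustre /dev/mdt_dev /mnt/lustre-mdt
mount -t lustre /dev/ost0_dev /mnt/lustre-ost0
```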
kernel BUG at fs/jbd2/transaction.c:1030!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:80/0000:80:05.0/0000:85:00.0/host12/rport-12:0-0/target12:0:0/12:0:0:0/state
CPU 1
Modules linked in: osp(U) lod(U) mdt(U) mgs(U) mgc(U) osd_ldiskfs(U) mdd(U) lustre(U) lov(U) osc(U) mdc(U) lquota(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) fsfilt_ldiskfs(U) lvfs(U) libcfs(U) ldiskfs(U) sha512_generic sha256_generic crc32c_intel nfs lockd fscache auth_rpcgss nfs_acl sunrpc ipmi_devintf cpufreq_ondemand acpi_cpufreq freq_table mperf bonding 8021q garp stp llc rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) ib_mad(U) ib_core(U) mlx4_core(U) dm_round_robin scsi_dh_rdac dm_multipath uinput sg lpc_ich mfd_core ioatdma igb dca i2c_algo_bit i2c_core ptp pps_core lpfc scsi_transport_fc scsi_tgt ext4 jbd2 mbcache sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod megaraid_sas [last unloaded: libcfs]
Pid: 18550, comm: osp-syn-1 Tainted: G W --------------- 2.6.32-431.17.1.el6.Bull.50.x86_64 #1 BULL bullx super-node
RIP: 0010:[<ffffffffa006f79d>] [<ffffffffa006f79d>] jbd2_journal_dirty_metadata+0x10d/0x150 [jbd2]
RSP: 0018:ffff880c68cb74a0 EFLAGS: 00010246
RAX: ffff88087a587680 RBX: ffff88087a13d588 RCX: ffff881075405678
RDX: 0000000000000000 RSI: ffff881075405678 RDI: 0000000000000000
RBP: ffff880c68cb74c0 R08: f010000000000000 R09: ef6cc3c6d80d9e02
R10: 0000000000000001 R11: 0000000000000000 R12: ffff88107542cf28
R13: ffff881075405678 R14: ffff881060c97000 R15: 0000000000000008
FS: 00007f0798d0f700(0000) GS:ffff88048e400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000006d3aa8 CR3: 000000046dea3000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process osp-syn-1 (pid: 18550, threadinfo ffff880c68cb6000, task ffff880c78a1d500)
Stack:
 ffff88087a13d588 ffffffffa0eb5df0 ffff881075405678 0000000000000000
<d> ffff880c68cb7500 ffffffffa0e730eb ffff880c68cb74f0 ffffffff8109a1bf
<d> ffff88107543c310 ffff88087a13d588 0000000000000018 ffff881075405678
Call Trace:
 [<ffffffffa0e730eb>] __ldiskfs_handle_dirty_metadata+0x7b/0x100 [ldiskfs]
 [<ffffffff8109a1bf>] ? wake_up_bit+0x2f/0x40
 [<ffffffffa0ea9145>] ldiskfs_quota_write+0x165/0x210 [ldiskfs]
 [<ffffffff811eeb31>] v2_write_file_info+0xa1/0xe0
 [<ffffffff811eac38>] dquot_acquire+0x138/0x140
 [<ffffffffa0ea83e6>] ldiskfs_acquire_dquot+0x66/0xb0 [ldiskfs]
 [<ffffffff811ecbac>] dqget+0x2ac/0x390
 [<ffffffff811ed13b>] dquot_initialize+0x7b/0x240
 [<ffffffffa0ea8602>] ldiskfs_dquot_initialize+0x62/0xc0 [ldiskfs]
 [<ffffffffa0522d64>] osd_write+0x104/0x2a0 [osd_ldiskfs]
 [<ffffffffa1003945>] dt_record_write+0x45/0x130 [obdclass]
 [<ffffffffa0fdc45b>] llog_osd_write_blob+0x57b/0x850 [obdclass]
 [<ffffffffa0fdf7d4>] llog_osd_write_rec+0x274/0x1370 [obdclass]
 [<ffffffffa0fab438>] llog_write_rec+0xc8/0x290 [obdclass]
 [<ffffffffa0facbad>] llog_write+0x2ad/0x420 [obdclass]
 [<ffffffffa0face0c>] llog_cancel_rec+0xbc/0x7c0 [obdclass]
 [<ffffffffa0fb3447>] llog_cat_cancel_records+0x107/0x340 [obdclass]
 [<ffffffffa070a12b>] osp_sync_process_queues+0x12bb/0x15e0 [osp]
 [<ffffffffa050ac4b>] ? osd_object_read_unlock+0x8b/0xd0 [osd_ldiskfs]
 [<ffffffffa0fade0b>] llog_process_thread+0x8fb/0xe00 [obdclass]
 [<ffffffffa0708e70>] ? osp_sync_process_queues+0x0/0x15e0 [osp]
 [<ffffffffa0fafc7d>] llog_process_or_fork+0x12d/0x660 [obdclass]
 [<ffffffffa0fb261a>] llog_cat_process_cb+0x56a/0x620 [obdclass]
 [<ffffffff81054619>] ? __wake_up_common+0x59/0x90
 [<ffffffffa0708e70>] ? osp_sync_process_queues+0x0/0x15e0 [osp]
 [<ffffffffa0fb1509>] llog_cat_process+0x19/0x20 [obdclass]
 [<ffffffffa0edb75a>] ? cfs_waitq_signal+0x1a/0x20 [libcfs]
 [<ffffffffa070aa60>] osp_sync_thread+0x240/0x7e0 [osp]
 [<ffffffffa070a820>] ? osp_sync_thread+0x0/0x7e0 [osp]
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffffa070a820>] ? osp_sync_thread+0x0/0x7e0 [osp]
 [<ffffffffa070a820>] ? osp_sync_thread+0x0/0x7e0 [osp]
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
Code: c6 9c 03 00 00 4c 89 f7 e8 81 bf 4b e1 48 8b 33 ba 01 00 00 00 4c 89 e7 e8 b1 ec ff ff 4c 89 f0 66 ff 00 66 66 90 e9 73 ff ff ff <0f> 0b eb fe 0f 0b eb fe 0f 0b 66 0f 1f 84 00 00 00 00 00 eb f5
RIP [<ffffffffa006f79d>] jbd2_journal_dirty_metadata+0x10d/0x150 [jbd2]
RSP <ffff880c68cb74a0>
This bug looks like LU-5040.
Before the crash, we had a lot of VFS errors about quotas in the console:
VFS: find_free_dqentry(): Data block full but it shouldn't.
VFS: Error -5 occurred while creating quota.
After the crash, we tried to start the MDT and hit another BUG:
kernel BUG at fs/inode.c:1358!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:80/0000:80:05.0/0000:85:00.0/host12/rport-12:0-0/target12:0:0/12:0:0:0/state
CPU 1
Modules linked in: ldiskfs(U) sha512_generic sha256_generic crc32c_intel libcfs(U) nfs lockd fscache auth_rpcgss nfs_acl sunrpc ipmi_devintf cpufreq_ondemand acpi_cpufreq freq_table mperf bonding 8021q garp stp llc rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) ib_mad(U) ib_core(U) mlx4_core(U) dm_round_robin scsi_dh_rdac dm_multipath uinput sg lpc_ich mfd_core ioatdma lpfc scsi_transport_fc scsi_tgt igb dca i2c_algo_bit i2c_core ptp pps_core ext4 jbd2 mbcache sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod megaraid_sas [last unloaded: scsi_wait_scan]
Pid: 6003, comm: mount Tainted: G W --------------- 2.6.32-431.17.1.el6.Bull.50.x86_64 #1 BULL bullx super-node
RIP: 0010:[<ffffffff811a58d9>] [<ffffffff811a58d9>] iput+0x69/0x70
RSP: 0018:ffff880c78495c28 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88087a3eb528 RCX: 0000000000000034
RDX: 0000000000000001 RSI: ffff88087a348d80 RDI: ffff88087a3eb528
RBP: ffff880c78495c38 R08: 8038000000000000 R09: f767c3674f91d007
R10: ffff880c7a6cd000 R11: 7fffffffffffffff R12: ffff88087a663000
R13: ffff88087a3eb528 R14: 00000000ffffffea R15: 0000000000000000
FS: 00007f8051d2f7e0(0000) GS:ffff88048e400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f2e4552f00f CR3: 0000000878ca6000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process mount (pid: 6003, threadinfo ffff880c78494000, task ffff880c78608080)
Stack:
 ffff880879cabc00 ffff880879cabc00 ffff880c78495d58 ffffffffa056d947
<d> 00000004000000fd 78495d8800000004 ffff880877e32000 ffff88086ce2c000
<d> 0000000000000004 ffff88087a6630d0 00000000ffffffff ffff88087a6630f8
Call Trace:
 [<ffffffffa056d947>] ldiskfs_fill_super+0x1757/0x2ac0 [ldiskfs]
 [<ffffffff8128ceb4>] ? snprintf+0x34/0x40
 [<ffffffff8118c98e>] get_sb_bdev+0x18e/0x1d0
 [<ffffffffa056c1f0>] ? ldiskfs_fill_super+0x0/0x2ac0 [ldiskfs]
 [<ffffffffa0567018>] ldiskfs_get_sb+0x18/0x20 [ldiskfs]
 [<ffffffff8118be1b>] vfs_kern_mount+0x7b/0x1b0
 [<ffffffff8118bfc2>] do_kern_mount+0x52/0x130
 [<ffffffff811acf9b>] do_mount+0x2fb/0x930
 [<ffffffff81140d34>] ? strndup_user+0x64/0xc0
 [<ffffffff811ad660>] sys_mount+0x90/0xe0
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Code: 38 48 c7 c0 20 6a 1a 81 48 85 d2 74 12 48 8b 42 20 48 c7 c2 20 6a 1a 81 48 85 c0 48 0f 44 c2 48 89 df ff d0 48 83 c4 08 5b c9 c3 <0f> 0b eb fe 0f 1f 00 55 48 89 e5 41 55 41 54 53 48 83 ec 08 0f
RIP [<ffffffff811a58d9>] iput+0x69/0x70
RSP <ffff880c78495c28>
We then ran an fsck on the MDT and were able to start the MDS.
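For reference, the repair step was along these lines. This is a sketch only: the device path and mount point are placeholders, and the exact e2fsck options we used may have differed (see the attached fsck logs for the actual output).

```shell
# Hypothetical sketch; /dev/mdt_dev and /mnt/lustre-mdt are placeholders.
umount /mnt/lustre-mdt        # make sure the MDT is stopped first
e2fsck -f /dev/mdt_dev        # force a full check of the ldiskfs backend
```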
You can find attached a bundle of logs containing:
- the logs of the fscks
- the syslog of the MDS
- the console log of the MDS
- some traces from the two crashes (bt, foreach bt, ps -l, ps, log)
My main concern is the first crash, as I believe the second one is a consequence of the first.