Details
- Type: Bug
- Resolution: Cannot Reproduce
- Priority: Major
- None
- Affects Version/s: Lustre 2.3.0, Lustre 2.1.1, Lustre 2.1.3
- None
- 4
- 5821
Description
We get an OSS crash on any attempt to write data to our Lustre FS. The file system was created from scratch with the 2.1.3 package. We have tried all kernel versions listed in the Environment field.
Initially I thought this was a kernel bug fixed in the RHEL kernel 2.6.32-279.10.1.el6:
- [kernel] sched: fix divide by zero at {thread_group,task}_times (Stanislaw Gruszka) [856703 843771]
On a write attempt we get an OSS crash with the following console message:
divide error: 0000 1 SMP
last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
CPU 7
Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) nfs fscache xt_multiport nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables ip6t_REJECT ip6table_filter ip6_tables ipv6 power_meter dcdbas microcode serio_raw ixgbe dca mdio k10temp amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core sg ses enclosure bnx2 ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif ahci qla2xxx scsi_transport_fc scsi_tgt megaraid_sas dm_mirror dm_region_hash dm_log dm_mod [last unloaded: mperf]
Pid: 29280, comm: ll_ost_io_127 Not tainted 2.6.32-279.14.1.el6.x86_64 #1 Dell Inc. PowerEdge R715/0C5MMK
RIP: 0010:[<ffffffffa0bb9c24>] [<ffffffffa0bb9c24>] ldiskfs_mb_normalize_request+0xf4/0x3d0 [ldiskfs]
RSP: 0018:ffff8804141a73e0 EFLAGS: 00010246
RAX: 0000000000020000 RBX: ffff88041c783898 RCX: 0000000000000003
RDX: 0000000000000000 RSI: 0000000000020100 RDI: 0000000000000000
RBP: ffff8804141a7430 R08: 0000000000000000 R09: 0000000000020000
R10: ffff88041c00e540 R11: 0000000000000000 R12: 0000000000000100
R13: ffff8804141a7500 R14: ffff88041c09dc00 R15: ffff8803908729c8
FS: 00007f5a66dde700(0000) GS:ffff880323c20000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000006dbf98 CR3: 000000041c04f000 CR4: 00000000000406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ll_ost_io_127 (pid: 29280, threadinfo ffff8804141a6000, task ffff8804141a2aa0)
Stack:
ffff8804141a7400 ffffffffa0bb3c7f ffff8804141a7430 ffffffffa0bba212
<d> ffff8804141a7430 ffff8804141a7500 0000000000000100 ffff880418bfd000
<d> ffff88041c09dc00 ffff88041c783898 ffff8804141a74d0 ffffffffa0bc13aa
Call Trace:
[<ffffffffa0bb3c7f>] ? ldiskfs_dirty_inode+0x4f/0x60 [ldiskfs]
[<ffffffffa0bba212>] ? ldiskfs_mb_initialize_context+0x82/0x1f0 [ldiskfs]
[<ffffffffa0bc13aa>] ldiskfs_mb_new_blocks+0x42a/0x660 [ldiskfs]
[<ffffffff811adb49>] ? __find_get_block+0xa9/0x200
[<ffffffff811adccc>] ? __getblk+0x2c/0x2e0
[<ffffffff811639bc>] ? __kmalloc+0x20c/0x220
[<ffffffffa0c68fca>] ldiskfs_ext_new_extent_cb+0x59a/0x6d0 [fsfilt_ldiskfs]
[<ffffffffa0ba869f>] ldiskfs_ext_walk_space+0x14f/0x340 [ldiskfs]
[<ffffffffa0c68a30>] ? ldiskfs_ext_new_extent_cb+0x0/0x6d0 [fsfilt_ldiskfs]
[<ffffffffa0c68758>] fsfilt_map_nblocks+0xd8/0x100 [fsfilt_ldiskfs]
[<ffffffffa00fced5>] ? start_this_handle+0xe5/0x500 [jbd2]
[<ffffffffa0c68893>] fsfilt_ldiskfs_map_ext_inode_pages+0x113/0x220 [fsfilt_ldiskfs]
[<ffffffffa0c68a25>] fsfilt_ldiskfs_map_inode_pages+0x85/0x90 [fsfilt_ldiskfs]
[<ffffffffa0cae99b>] filter_do_bio+0xdcb/0x18f0 [obdfilter]
[<ffffffffa0c67580>] ? fsfilt_ldiskfs_brw_start+0x280/0x5a0 [fsfilt_ldiskfs]
[<ffffffffa0cb115e>] filter_commitrw_write+0x145e/0x2e78 [obdfilter]
[<ffffffffa04d8c1b>] ? lnet_send+0x29b/0xa60 [lnet]
[<ffffffff8107ebe2>] ? del_timer_sync+0x22/0x30
[<ffffffff814ff1ca>] ? schedule_timeout+0x19a/0x2e0
[<ffffffffa0ca4252>] filter_commitrw+0x272/0x290 [obdfilter]
[<ffffffffa0c35bdd>] obd_commitrw+0x11d/0x3c0 [ost]
[<ffffffffa0c3dd94>] ost_brw_write+0xcc4/0x1600 [ost]
[<ffffffffa06a2000>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
[<ffffffffa0c42e37>] ost_handle+0x2b77/0x4270 [ost]
[<ffffffffa06e077c>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
[<ffffffffa06e7bfb>] ? ptlrpc_update_export_timer+0x4b/0x470 [ptlrpc]
[<ffffffffa06ef7eb>] ptlrpc_main+0xc4b/0x1a40 [ptlrpc]
[<ffffffffa06eeba0>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffffa06eeba0>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffffa06eeba0>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffff8100c140>] ? child_rip+0x0/0x20
Code: 8b 04 ca 4c 39 c0 0f 86 2b 02 00 00 83 c7 01 48 63 cf 48 39 d1 72 e8 48 8d 04 cd f8 ff ff ff 4d 63 04 02 31 d2 44 89 c8 44 89 c7 <48> f7 f7 31 d2 89 c1 8d 46 ff 41 0f af c8 48 f7 f7 89 ca 48 83
RIP [<ffffffffa0bb9c24>] ldiskfs_mb_normalize_request+0xf4/0x3d0 [ldiskfs]
RSP <ffff8804141a73e0>
This error stops us from deploying our new Lustre setup. Any help is greatly appreciated.