Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Fix Version/s: Lustre 2.8.0
Description
An error occurred during soak testing of build '20160104' (see https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola&spaceKey=Releases#SoakTestingonLola-20160104). DNE is enabled. The MDTs were formatted with ldiskfs, the OSTs with zfs. The MDS nodes are configured in an active-active HA configuration.
(mds_restart means a hard reset of the MDS node and a remount of the MDTs (primary resources).)
Event sequence:
- 2016-01-06 06:36:33,402:fsmgmt.fsmgmt:INFO triggering fault mds_restart for lola-9
- 2016-01-06 06:46:35,601:fsmgmt.fsmgmt:INFO oss_restart just completed for lola-9
- lola-9 crashed before 06:46:40, as the last update of the collectl counters happened at 06:46:20 (sample frequency 20 s); no memory (slab) exhaustion occurred
- The error message reads:
<4>general protection fault: 0000 [#1] SMP
<4>last sysfs file: /sys/devices/system/cpu/online
<4>CPU 2
<4>Modules linked in: osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgc(U) osd_ldiskfs(U) ldiskfs(U) jbd2 lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) 8021q garp stp llc nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm scsi_dh_rdac dm_round_robin dm_multipath microcode iTCO_wdt iTCO_vendor_support zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) sb_edac edac_core lpc_ich mfd_core i2c_i801 ioatdma sg igb dca i2c_algo_bit i2c_core ptp pps_core ext3 jbd mbcache sd_mod crc_t10dif ahci isci libsas wmi mpt2sas scsi_transport_sas raid_class mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
<4>
<4>Pid: 5372, comm: lod0002_rec0004 Tainted: P --------------- 2.6.32-504.30.3.el6_lustre.g3f4572c.x86_64 #1 Intel Corporation S2600GZ ........../S2600GZ
<4>RIP: 0010:[<ffffffffa0b8ee8b>] [<ffffffffa0b8ee8b>] insert_update_records_to_replay_list+0xf6b/0x1b70 [ptlrpc]
<4>RSP: 0018:ffff880821d05a50 EFLAGS: 00010296
<4>RAX: 0000000000005a5a RBX: ffff880804003d78 RCX: ffff880434faa2e0
<4>RDX: 5a5a5a5a5a5a5a5a RSI: 0000000000000000 RDI: 0000000000000004
<4>RBP: ffff880821d05ac0 R08: 0000000000000000 R09: 0000000000000000
<4>R10: 000000000000004d R11: 0000000000000000 R12: ffff8803ec7afe40
<4>R13: 5a5a5a5a5a5a5a42 R14: ffff880804003d88 R15: ffff8803ec7afe58
<4>FS: 0000000000000000(0000) GS:ffff880038240000(0000) knlGS:0000000000000000
<4>CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<4>CR2: 00007f1cacb4f000 CR3: 0000000001a85000 CR4: 00000000000407e0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process lod0002_rec0004 (pid: 5372, threadinfo ffff880821d04000, task ffff880821f2c040)
<4>Stack:
<4> ffff8807fa7c40c0 ffff880804cc5078 ffff880821d05ac0 ffff880804cc50a8
<4><d> ffff8803ef8a72d8 0000000421d05ad0 ffff880804cc5088 ffff880804cc50a8
<4><d> 0000000000007fff ffff880804cc5078 ffff8803ef8a7000 ffff88041b9b2360
<4>Call Trace:
<4> [<ffffffffa1303b79>] lod_process_recovery_updates+0x1e9/0x420 [lod]
<4> [<ffffffffa089048a>] llog_process_thread+0x94a/0x1040 [obdclass]
<4> [<ffffffffa0890c3d>] llog_process_or_fork+0xbd/0x5d0 [obdclass]
<4> [<ffffffffa1303990>] ? lod_process_recovery_updates+0x0/0x420 [lod]
<4> [<ffffffffa0893e38>] llog_cat_process_cb+0x458/0x600 [obdclass]
<4> [<ffffffffa089048a>] llog_process_thread+0x94a/0x1040 [obdclass]
<4> [<ffffffffa08e02e4>] ? dt_read+0x14/0x50 [obdclass]
<4> [<ffffffffa0890c3d>] llog_process_or_fork+0xbd/0x5d0 [obdclass]
<4> [<ffffffffa08939e0>] ? llog_cat_process_cb+0x0/0x600 [obdclass]
<4> [<ffffffffa089269d>] llog_cat_process_or_fork+0x1ad/0x300 [obdclass]
<4> [<ffffffffa13301b9>] ? lod_sub_prep_llog+0x4f9/0x7a0 [lod]
<4> [<ffffffffa1303990>] ? lod_process_recovery_updates+0x0/0x420 [lod]
<4> [<ffffffffa0892809>] llog_cat_process+0x19/0x20 [obdclass]
<4> [<ffffffffa13096f3>] lod_sub_recovery_thread+0x4e3/0xcf0 [lod]
<4> [<ffffffffa1309210>] ? lod_sub_recovery_thread+0x0/0xcf0 [lod]
<4> [<ffffffff8109e78e>] kthread+0x9e/0xc0
<4> [<ffffffff8100c28a>] child_rip+0xa/0x20
<4> [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
<4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
<4>Code: 4d 89 7c 24 20 49 89 44 24 08 49 89 44 24 10 8b 55 bc 41 89 14 24 e8 b5 e9 99 e0 49 8b 55 38 48 39 d3 4c 8d 6a e8 74 1f 8b 7d bc <3b> 7a e8 74 6f 8b 4d bc eb 05 3b 48 e8 74 65 49 8b 45 18 48 39
<1>RIP [<ffffffffa0b8ee8b>] insert_update_records_to_replay_list+0xf6b/0x1b70 [ptlrpc]
<4> RSP <ffff880821d05a50>
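A note on the registers, for triage: RDX (5a5a5a5a5a5a5a5a) and R13 (5a5a5a5a5a5a5a42) carry the 0x5a poison byte that Lustre writes over freed allocations, which suggests the recovery thread is walking update records on the replay list after they have been freed. Below is a minimal user-space C sketch of that failure shape; it is not Lustre code, the struct and field names are hypothetical, and it only poisons (rather than frees) the node so the crash is deterministic:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an update record on the replay list. */
struct update_rec {
	struct update_rec *next;
	long transno;
};

int main(void)
{
	struct update_rec head = { .next = NULL, .transno = 0 };
	struct update_rec *rec = malloc(sizeof(*rec));

	if (rec == NULL)
		return 1;
	rec->next = NULL;
	rec->transno = 42;
	head.next = rec;

	/* Simulate the suspected bug: the record is "freed" (here only
	 * poisoned with 0x5a, as Lustre's free-side poisoning does)
	 * while the replay list still references it. */
	memset(rec, 0x5a, sizeof(*rec));

	/* The next traversal loads next == 0x5a5a5a5a5a5a5a5a, matching
	 * RDX in the oops above; dereferencing it faults (a general
	 * protection fault in the kernel, a SIGSEGV here). */
	for (struct update_rec *p = head.next; p != NULL; p = p->next)
		printf("transno %ld\n", p->transno);

	return 0;
}

Built with e.g. gcc -std=c99, this prints the poisoned transno once and then crashes on the second iteration, consistent with the poisoned pointer values in the registers above.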
Attached are the messages, console, and vmcore-dmesg log files of lola-9.
The crash file was saved to the crashdump directory of cluster Lola and can be uploaded on demand to a desired location. I'll list the exact path of the crash dump in the next comment.
Issue Links
- is related to LU-7430 "General protection fault: 0000 upon mounting MDT" (Resolved)