Lustre / LU-7638

general protection fault: 0000 after mounting MDTs


Details


    Description

      An error occurred during soak testing of build '20160104' (see https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola&spaceKey=Releases#SoakTestingonLola-20160104). DNE is enabled. The MDTs were formatted using ldiskfs, the OSTs using ZFS. The MDS nodes are configured in an active-active HA configuration.

      (mds_restart means a hard reset of the MDS node and a remount of its MDTs (primary resources).)
      Event sequence:

      • 2016-01-06 06:36:33,402:fsmgmt.fsmgmt:INFO triggering fault mds_restart for lola-9
      • 2016-01-06 06:46:35,601:fsmgmt.fsmgmt:INFO oss_restart just completed for lola-9
      • lola-9 crashed before 06:46:40, as the last update of the collectl counters
        happened at 06:46:20 (sampling interval 20 s). No memory (slab) exhaustion occurred.
      • The error message reads:
      <4>general protection fault: 0000 [#1] SMP 
      <4>last sysfs file: /sys/devices/system/cpu/online
      <4>CPU 2 
      <4>Modules linked in: osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgc(U) osd_ldiskfs(U) ldiskfs(U) jbd2 lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) 8021q garp stp llc nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm scsi_dh_rdac dm_round_robin dm_multipath microcode iTCO_wdt iTCO_vendor_support zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) sb_edac edac_core lpc_ich mfd_core i2c_i801 ioatdma sg igb dca i2c_algo_bit i2c_core ptp pps_core ext3 jbd mbcache sd_mod crc_t10dif ahci isci libsas wmi mpt2sas scsi_transport_sas raid_class mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
      <4>
      <4>Pid: 5372, comm: lod0002_rec0004 Tainted: P           ---------------    2.6.32-504.30.3.el6_lustre.g3f4572c.x86_64 #1 Intel Corporation S2600GZ ........../S2600GZ
      <4>RIP: 0010:[<ffffffffa0b8ee8b>]  [<ffffffffa0b8ee8b>] insert_update_records_to_replay_list+0xf6b/0x1b70 [ptlrpc]
      <4>RSP: 0018:ffff880821d05a50  EFLAGS: 00010296
      <4>RAX: 0000000000005a5a RBX: ffff880804003d78 RCX: ffff880434faa2e0
      <4>RDX: 5a5a5a5a5a5a5a5a RSI: 0000000000000000 RDI: 0000000000000004
      <4>RBP: ffff880821d05ac0 R08: 0000000000000000 R09: 0000000000000000
      <4>R10: 000000000000004d R11: 0000000000000000 R12: ffff8803ec7afe40
      <4>R13: 5a5a5a5a5a5a5a42 R14: ffff880804003d88 R15: ffff8803ec7afe58
      <4>FS:  0000000000000000(0000) GS:ffff880038240000(0000) knlGS:0000000000000000
      <4>CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      <4>CR2: 00007f1cacb4f000 CR3: 0000000001a85000 CR4: 00000000000407e0
      <4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      <4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      <4>Process lod0002_rec0004 (pid: 5372, threadinfo ffff880821d04000, task ffff880821f2c040)
      <4>Stack:
      <4> ffff8807fa7c40c0 ffff880804cc5078 ffff880821d05ac0 ffff880804cc50a8
      <4><d> ffff8803ef8a72d8 0000000421d05ad0 ffff880804cc5088 ffff880804cc50a8
      <4><d> 0000000000007fff ffff880804cc5078 ffff8803ef8a7000 ffff88041b9b2360
      <4>Call Trace:
      <4> [<ffffffffa1303b79>] lod_process_recovery_updates+0x1e9/0x420 [lod]
      <4> [<ffffffffa089048a>] llog_process_thread+0x94a/0x1040 [obdclass]
      <4> [<ffffffffa0890c3d>] llog_process_or_fork+0xbd/0x5d0 [obdclass]
      <4> [<ffffffffa1303990>] ? lod_process_recovery_updates+0x0/0x420 [lod]
      <4> [<ffffffffa0893e38>] llog_cat_process_cb+0x458/0x600 [obdclass]
      <4> [<ffffffffa089048a>] llog_process_thread+0x94a/0x1040 [obdclass]
      <4> [<ffffffffa08e02e4>] ? dt_read+0x14/0x50 [obdclass]
      <4> [<ffffffffa0890c3d>] llog_process_or_fork+0xbd/0x5d0 [obdclass]
      <4> [<ffffffffa08939e0>] ? llog_cat_process_cb+0x0/0x600 [obdclass]
      <4> [<ffffffffa089269d>] llog_cat_process_or_fork+0x1ad/0x300 [obdclass]
      <4> [<ffffffffa13301b9>] ? lod_sub_prep_llog+0x4f9/0x7a0 [lod]
      <4> [<ffffffffa1303990>] ? lod_process_recovery_updates+0x0/0x420 [lod]
      <4> [<ffffffffa0892809>] llog_cat_process+0x19/0x20 [obdclass]
      <4> [<ffffffffa13096f3>] lod_sub_recovery_thread+0x4e3/0xcf0 [lod]
      <4> [<ffffffffa1309210>] ? lod_sub_recovery_thread+0x0/0xcf0 [lod]
      <4> [<ffffffff8109e78e>] kthread+0x9e/0xc0
      <4> [<ffffffff8100c28a>] child_rip+0xa/0x20
      <4> [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
      <4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
      <4>Code: 4d 89 7c 24 20 49 89 44 24 08 49 89 44 24 10 8b 55 bc 41 89 14 24 e8 b5 e9 99 e0 49 8b 55 38 48 39 d3 4c 8d 6a e8 74 1f 8b 7d bc <3b> 7a e8 74 6f 8b 4d bc eb 05 3b 48 e8 74 65 49 8b 45 18 48 39 
      <1>RIP  [<ffffffffa0b8ee8b>] insert_update_records_to_replay_list+0xf6b/0x1b70 [ptlrpc]
      <4> RSP <ffff880821d05a50>
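      One clue in the register dump above: RDX (5a5a5a5a5a5a5a5a) and R13 (5a5a5a5a5a5a5a42) carry the 0x5a poison byte that debug allocators (including Lustre's libcfs/OBD allocation wrappers) write over freed memory. This makes the crash look like a use-after-free: a stale list link was read back as poison and then dereferenced. A minimal user-space sketch of the mechanism (hypothetical code, not Lustre's; `poison_free` stands in for the kernel's debug free path):

      ```c
      #include <assert.h>
      #include <stdint.h>
      #include <stdlib.h>
      #include <string.h>

      #define POISON_BYTE 0x5a  /* pattern debug allocators write over freed memory */

      /* Hypothetical stand-in for an update record linked into a replay list. */
      struct update_rec {
              struct update_rec *next;  /* stale link after the record is freed */
              uint64_t           transno;
      };

      /* "Free" that poisons the block instead of releasing it, mimicking
       * what a debug free path does before handing memory back. */
      static void poison_free(void *p, size_t len)
      {
              memset(p, POISON_BYTE, len);
      }

      int main(void)
      {
              struct update_rec *rec = malloc(sizeof(*rec));

              rec->next    = NULL;
              rec->transno = 42;

              poison_free(rec, sizeof(*rec));

              /* A use-after-free now reads the poison pattern: every
               * pointer-sized field comes back as 0x5a5a5a5a5a5a5a5a,
               * the value seen in RDX in the register dump. */
              assert(rec->transno == 0x5a5a5a5a5a5a5a5aULL);
              assert((uintptr_t)rec->next == 0x5a5a5a5a5a5a5a5aULL);

              free(rec);
              return 0;
      }
      ```

      On x86-64 a 0x5a5a… pointer is a non-canonical address, so dereferencing it is reported as `general protection fault: 0000` (as in `insert_update_records_to_replay_list` here) rather than a page fault.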
      

      Attached are the messages, console, and vmcore-dmesg log files of lola-9.
      The crash file was saved to the crashdump directory of cluster Lola and can be uploaded to a desired location on demand. I'll list the exact path of the crash dump in the next comment.

      Attachments

        1. console-lola-8.log.bz2
          64 kB
        2. console-lola-9.log.bz2
          177 kB
        3. messages-lola-8.log.bz2
          209 kB
        4. messages-lola-9.log.bz2
          99 kB
        5. vmcore-dmesg.txt.bz2
          24 kB
        6. vmcore-dmesg.txt.bz2
          29 kB


            People

              di.wang Di Wang
              heckes Frank Heckes (Inactive)
