[LU-7638] general protection fault: 0000 after mounting MDTs Created: 07/Jan/16  Updated: 28/Jan/16  Resolved: 28/Jan/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Critical
Reporter: Frank Heckes (Inactive) Assignee: Di Wang
Resolution: Fixed Votes: 0
Labels: soak
Environment:

lola
build: https://build.hpdd.intel.com/job/lustre-reviews/36569


Attachments: File console-lola-8.log.bz2     File console-lola-9.log.bz2     File messages-lola-8.log.bz2     File messages-lola-9.log.bz2     File vmcore-dmesg.txt.bz2     File vmcore-dmesg.txt.bz2    
Issue Links:
Related
is related to LU-7430 General protection fault: 0000 upon m... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Error occurred during soak testing of build '20160104' (see https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola&spaceKey=Releases#SoakTestingonLola-20160104). DNE is enabled. MDTs have been formatted using ldiskfs, OSTs using zfs. MDS nodes are configured in active-active HA configuration.

(mds_restart means a hard reset of the MDS node followed by a remount of its MDTs (primary resources).)
Event sequence:

  • 2016-01-06 06:36:33,402:fsmgmt.fsmgmt:INFO triggering fault mds_restart for lola-9
  • 2016-01-06 06:46:35,601:fsmgmt.fsmgmt:INFO oss_restart just completed for lola-9
  • lola-9 crashed before 06:46:40, as the last update of the collectl counters
    happened at 06:46:20 (sampling interval 20 s). No memory (slab) exhaustion occurred.
  • Error message reads as:
<4>general protection fault: 0000 [#1] SMP 
<4>last sysfs file: /sys/devices/system/cpu/online
<4>CPU 2 
<4>Modules linked in: osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgc(U) osd_ldiskfs(U) ldiskfs(U) jbd2 lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) 8021q garp stp llc nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm scsi_dh_rdac dm_round_robin dm_multipath microcode iTCO_wdt iTCO_vendor_support zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) sb_edac edac_core lpc_ich mfd_core i2c_i801 ioatdma sg igb dca i2c_algo_bit i2c_core ptp pps_core ext3 jbd mbcache sd_mod crc_t10dif ahci isci libsas wmi mpt2sas scsi_transport_sas raid_class mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
<4>
<4>Pid: 5372, comm: lod0002_rec0004 Tainted: P           ---------------    2.6.32-504.30.3.el6_lustre.g3f4572c.x86_64 #1 Intel Corporation S2600GZ ........../S2600GZ
<4>RIP: 0010:[<ffffffffa0b8ee8b>]  [<ffffffffa0b8ee8b>] insert_update_records_to_replay_list+0xf6b/0x1b70 [ptlrpc]
<4>RSP: 0018:ffff880821d05a50  EFLAGS: 00010296
<4>RAX: 0000000000005a5a RBX: ffff880804003d78 RCX: ffff880434faa2e0
<4>RDX: 5a5a5a5a5a5a5a5a RSI: 0000000000000000 RDI: 0000000000000004
<4>RBP: ffff880821d05ac0 R08: 0000000000000000 R09: 0000000000000000
<4>R10: 000000000000004d R11: 0000000000000000 R12: ffff8803ec7afe40
<4>R13: 5a5a5a5a5a5a5a42 R14: ffff880804003d88 R15: ffff8803ec7afe58
<4>FS:  0000000000000000(0000) GS:ffff880038240000(0000) knlGS:0000000000000000
<4>CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<4>CR2: 00007f1cacb4f000 CR3: 0000000001a85000 CR4: 00000000000407e0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process lod0002_rec0004 (pid: 5372, threadinfo ffff880821d04000, task ffff880821f2c040)
<4>Stack:
<4> ffff8807fa7c40c0 ffff880804cc5078 ffff880821d05ac0 ffff880804cc50a8
<4><d> ffff8803ef8a72d8 0000000421d05ad0 ffff880804cc5088 ffff880804cc50a8
<4><d> 0000000000007fff ffff880804cc5078 ffff8803ef8a7000 ffff88041b9b2360
<4>Call Trace:
<4> [<ffffffffa1303b79>] lod_process_recovery_updates+0x1e9/0x420 [lod]
<4> [<ffffffffa089048a>] llog_process_thread+0x94a/0x1040 [obdclass]
<4> [<ffffffffa0890c3d>] llog_process_or_fork+0xbd/0x5d0 [obdclass]
<4> [<ffffffffa1303990>] ? lod_process_recovery_updates+0x0/0x420 [lod]
<4> [<ffffffffa0893e38>] llog_cat_process_cb+0x458/0x600 [obdclass]
<4> [<ffffffffa089048a>] llog_process_thread+0x94a/0x1040 [obdclass]
<4> [<ffffffffa08e02e4>] ? dt_read+0x14/0x50 [obdclass]
<4> [<ffffffffa0890c3d>] llog_process_or_fork+0xbd/0x5d0 [obdclass]
<4> [<ffffffffa08939e0>] ? llog_cat_process_cb+0x0/0x600 [obdclass]
<4> [<ffffffffa089269d>] llog_cat_process_or_fork+0x1ad/0x300 [obdclass]
<4> [<ffffffffa13301b9>] ? lod_sub_prep_llog+0x4f9/0x7a0 [lod]
<4> [<ffffffffa1303990>] ? lod_process_recovery_updates+0x0/0x420 [lod]
<4> [<ffffffffa0892809>] llog_cat_process+0x19/0x20 [obdclass]
<4> [<ffffffffa13096f3>] lod_sub_recovery_thread+0x4e3/0xcf0 [lod]
<4> [<ffffffffa1309210>] ? lod_sub_recovery_thread+0x0/0xcf0 [lod]
<4> [<ffffffff8109e78e>] kthread+0x9e/0xc0
<4> [<ffffffff8100c28a>] child_rip+0xa/0x20
<4> [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
<4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
<4>Code: 4d 89 7c 24 20 49 89 44 24 08 49 89 44 24 10 8b 55 bc 41 89 14 24 e8 b5 e9 99 e0 49 8b 55 38 48 39 d3 4c 8d 6a e8 74 1f 8b 7d bc <3b> 7a e8 74 6f 8b 4d bc eb 05 3b 48 e8 74 65 49 8b 45 18 48 39 
<1>RIP  [<ffffffffa0b8ee8b>] insert_update_records_to_replay_list+0xf6b/0x1b70 [ptlrpc]
<4> RSP <ffff880821d05a50>

Attached are the messages, console, and vmcore-dmesg log files of lola-9.
The crash dump was saved to the crashdump directory of cluster lola and can be uploaded to a desired location on demand. I'll list the exact path of the crash dump in the next comment.



 Comments   
Comment by Frank Heckes (Inactive) [ 07/Jan/16 ]

Crash file is saved at

lola-1:/scratch/crashdumps/lu-7638/lola-9-127.0.0.1-2016-01-06-06:47:10
Comment by Gerrit Updater [ 08/Jan/16 ]

wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/17885
Subject: LU-7638 recovery: do not abort update recovery.
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3d02671b6857c2e872f97b970f95a5878ade4a62

Comment by Frank Heckes (Inactive) [ 11/Jan/16 ]

The same error also occurred for build '20160108' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160108).
Sequence of events:

  • 2016-01-09 08:36:30,443:fsmgmt.fsmgmt:INFO mds_restart just completed (for lola-8)
  • 2016-01-09 08:40 lola-8 crashed with 'general protection fault'
<4>general protection fault: 0000 [#1] SMP 
<4>last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:04:00.0/infiniband_mad/umad0/port
<4>CPU 14 
<4>Modules linked in: mgs(U) osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgc(U) osd_ldiskfs(U) ldiskfs(U) jbd2 lquota(U) lustre(U) lov(U) 
mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) 8021q garp stp llc nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm scsi_dh_rdac dm_round_robin dm_multipath microcode iTCO_wdt iTCO_vendor_support zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) sb_edac edac_core lpc_ich mfd_core i2c_i801 ioatdma sg igb dca i2c_algo_bit i2c_core ptp pps_core ext3 jbd mbcache sd_mod crc_t10dif ahci wmi isci libsas mpt2sas scsi_transport_sas raid_class mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
<4>
<4>Pid: 4393, comm: mdt03_000 Tainted: P           ---------------    2.6.32-504.30.3.el6_lustre.g990ef68.x86_64 #1 Intel Corporation S2600GZ ........../S2600GZ
<4>RIP: 0010:[<ffffffffa083bacc>]  [<ffffffffa083bacc>] llog_exist+0x3c/0x170 [obdclass]
<4>RSP: 0000:ffff880826eb1990  EFLAGS: 00010206
<4>RAX: 5a5a5a5a5a5a5a5a RBX: ffff88081cdf61c0 RCX: ffff8808336eb8c0
<4>RDX: ffff88040666d8c0 RSI: ffff88081cdf61c0 RDI: ffff88081cdf61c0
<4>RBP: ffff880826eb19a0 R08: ffff8808100262c0 R09: 0000000000010000
<4>R10: 0000000000000010 R11: 0000000000004000 R12: 0000000000000000
<4>R13: ffff88082d540c80 R14: ffff88082dbbcc00 R15: ffff8808100262c0
<4>FS:  0000000000000000(0000) GS:ffff88044e4c0000(0000) knlGS:0000000000000000
<4>CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<4>CR2: 000000000168a950 CR3: 0000000001a85000 CR4: 00000000000407e0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process mdt03_000 (pid: 4393, threadinfo ffff880826eb0000, task ffff880826ea8ab0)
<4>Stack:
<4> ffff88082dbbcc00 ffff880421df67c0 ffff880826eb1a00 ffffffffa0845e5c
<4><d> ffff88081cdf61c0 ffff88081cf84000 ffff880826eb19d0 0000000033d75218
<4><d> ffff8808336e41e0 ffff880421df67c0 ffff88082d540c80 ffff88081cf84000
<4>Call Trace:
<4> [<ffffffffa0845e5c>] llog_cat_declare_add_rec+0x35c/0x610 [obdclass]
<4> [<ffffffffa083c06f>] llog_declare_add+0x7f/0x1b0 [obdclass]
<4> [<ffffffffa0b380cc>] top_trans_start+0x17c/0x920 [ptlrpc]
<4> [<ffffffffa12a5e31>] lod_trans_start+0x61/0x70 [lod]
<4> [<ffffffffa1350e84>] mdd_trans_start+0x14/0x20 [mdd]
<4> [<ffffffffa133a67a>] mdd_create+0x9aa/0x1600 [mdd]
<4> [<ffffffffa11ecb92>] ? mdt_version_check+0x132/0x440 [mdt]
<4> [<ffffffffa11f1536>] mdt_reint_create+0xbb6/0xcc0 [mdt]
<4> [<ffffffffa0ab769b>] ? lustre_pack_reply_v2+0x1eb/0x280 [ptlrpc]
<4> [<ffffffff81294a3a>] ? strlcpy+0x4a/0x60
<4> [<ffffffffa11eba9d>] mdt_reint_rec+0x5d/0x200 [mdt]
<4> [<ffffffffa11d787b>] mdt_reint_internal+0x62b/0xb80 [mdt]
<4> [<ffffffffa11d826b>] mdt_reint+0x6b/0x120 [mdt]
<4> [<ffffffffa0b21bbc>] tgt_request_handle+0x8ec/0x1470 [ptlrpc]
<4> [<ffffffffa0ac9231>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
<4> [<ffffffff8152a39e>] ? thread_return+0x4e/0x7d0
<4> [<ffffffffa0ac83f0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
<4> [<ffffffff8109e78e>] kthread+0x9e/0xc0
<4> [<ffffffff8100c28a>] child_rip+0xa/0x20
<4> [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
<4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
<4>Code: d6 f5 ff 01 48 89 fb 74 09 f6 05 6f d6 f5 ff 40 75 5d 48 85 db 0f 84 b4 00 00 00 48 8b 83 d8 00 00 00 48 85 c0 0f 84 a4 00 00 00 <48> 8b 40 58 48 85 c0 0f 84 e7 00 00 00 48 89 df ff d0 f6 05 3f 
<1>RIP  [<ffffffffa083bacc>] llog_exist+0x3c/0x170 [obdclass]
<4> RSP <ffff880826eb1990>

Attached the messages, console, and vmcore-dmesg log files to the ticket.
Crash dump file has been stored to lola-1:/scratch/crashdumps/lu-7638/lola-8-127.0.0.1-2016-01-09-08:37:36/

Comment by Gerrit Updater [ 28/Jan/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17885/
Subject: LU-7638 recovery: do not abort update recovery.
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b32e55b600ca2c9bf8b62287d9f889791d157426

Comment by Joseph Gmitter (Inactive) [ 28/Jan/16 ]

Landed for 2.8

Generated at Sat Feb 10 02:10:38 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.