[LU-6837] MDS panic during 24-hour failover test. Created: 11/Jul/15  Updated: 19/Jul/15  Resolved: 19/Jul/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Blocker
Reporter: Di Wang Assignee: Di Wang
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Blocker
is blocking LU-6773 DNE2 Failover and recovery soak testing Closed
Related
is related to LU-6831 The ticket for tracking all DNE2 bugs Reopened
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   
<6>LDISKFS-fs (sde1): mounted filesystem with ordered data mode. quota=on. Opts:
<6>Lustre: lustre-MDT0002: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
<4>LNet: 8453:0:(debug.c:219:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
<4>general protection fault: 0000 [#1] SMP
<4>last sysfs file: /sys/devices/system/cpu/online
<4>CPU 5
<4>Modules linked in: osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgc(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic sha256_generic crc32c_intel libcfs(U) ldiskfs(U) jbd2 nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 iTCO_wdt iTCO_vendor_support microcode serio_raw mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core i2c_i801 lpc_ich mfd_core ioatdma i7core_edac edac_core ses enclosure sg igb dca i2c_algo_bit i2c_core ptp pps_core ext3 jbd mbcache sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix mpt2sas scsi_transport_sas raid_class dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
<4>
<4>Pid: 8345, comm: lod0002_rec0006 Not tainted 2.6.32-431.29.2.el6_lustre.g2382eb0.x86_64 #1 Supermicro X8DTH-i/6/iF/6F/X8DTH
<4>RIP: 0010:[<ffffffffa0905ee1>]  [<ffffffffa0905ee1>] insert_update_records_to_replay_list+0x1b1/0x1540 [ptlrpc]
<4>RSP: 0018:ffff880821697a60  EFLAGS: 00010296
<4>RAX: 5a5a5a5a5a5a5a5a RBX: 0000000000000006 RCX: 0000000000000000
<4>RDX: 0000000000005a5a RSI: 0000000000000000 RDI: 5a5a5a5a5a5a5a42
<4>RBP: ffff880821697ac0 R08: 0000000000000002 R09: ffff881008f5a000
<4>R10: 0000000000000001 R11: 0000000000000000 R12: ffff88100006a820
<4>R13: ffff88100006a830 R14: ffff88100006a800 R15: ffff8807fb377df8
<4>FS:  0000000000000000(0000) GS:ffff88085c420000(0000) knlGS:0000000000000000
<4>CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<4>CR2: 000000000044fc20 CR3: 0000000001a85000 CR4: 00000000000007e0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process lod0002_rec0006 (pid: 8345, threadinfo ffff880821696000, task ffff880831268040)
<4>Stack:
<4> ffff880821697ac0 ffffffffa0512c01 0000000000000010 ffff880821697ad0
<4><d> ffff880fff438c88 ffff880ff8600ad8 00000a8900001118 ffff880fff438c78
<4><d> ffff880ff8600800 ffff8807f9a95c60 ffff880ff6e5ba80 ffff880fff438c78
<4>Call Trace:
<4> [<ffffffffa0512c01>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
<4> [<ffffffffa0fc8739>] lod_process_recovery_updates+0x1e9/0x420 [lod]
<4> [<ffffffffa060a41a>] llog_process_thread+0x94a/0xfc0 [obdclass]
<4> [<ffffffffa060ab4d>] llog_process_or_fork+0xbd/0x5d0 [obdclass]
<4> [<ffffffffa0fc8550>] ? lod_process_recovery_updates+0x0/0x420 [lod]
<4> [<ffffffffa060d938>] llog_cat_process_cb+0x458/0x600 [obdclass]
<4> [<ffffffffa060a41a>] llog_process_thread+0x94a/0xfc0 [obdclass]
<4> [<ffffffffa060ab4d>] llog_process_or_fork+0xbd/0x5d0 [obdclass]
<4> [<ffffffffa060d4e0>] ? llog_cat_process_cb+0x0/0x600 [obdclass]
<4> [<ffffffffa060c39d>] llog_cat_process_or_fork+0x1ad/0x300 [obdclass]
<4> [<ffffffffa0ff2fa0>] ? lod_sub_prep_llog+0x4f0/0x7b0 [lod]
<4> [<ffffffffa0fc8550>] ? lod_process_recovery_updates+0x0/0x420 [lod]
<4> [<ffffffffa060c509>] llog_cat_process+0x19/0x20 [obdclass]
<4> [<ffffffffa0fc7cfa>] lod_sub_recovery_thread+0x69a/0xbc0 [lod]
<4> [<ffffffffa0fc7660>] ? lod_sub_recovery_thread+0x0/0xbc0 [lod]
<4> [<ffffffff8109abf6>] kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>Code: 8b 46 20 49 39 c4 48 8d 78 e8 75 1f e9 91 01 00 00 66 0f 1f 84 00 00 00 00 00 48 8b 47 18 49 39 c4 48 8d 78 e8 0f 84 77 01 00 00 <3b> 58 e8 75 ea 4c 89 e8 66 ff 00 66 66 90 48 85 ff 0f 84 38 02
<1>RIP  [<ffffffffa0905ee1>] insert_update_records_to_replay_list+0x1b1/0x1540 [ptlrpc]
<4> RSP <ffff880821697a60>
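
Note: RAX holds 0x5a5a5a5a5a5a5a5a, a fill pattern consistent with memory that has already been freed and poisoned, so the recovery thread appears to dereference a stale dtrq (distribute-transaction replay request) while inserting update records into the replay list. The patch referenced below re-looks-up the dtrq in the replay list instead of trusting a pointer obtained earlier. A minimal userspace sketch of that re-lookup-under-lock pattern follows; all names here (replay_list, replay_lock, dtrq_find_locked, dtrq_find_or_create) are illustrative assumptions, not the actual Lustre ptlrpc symbols.

/*
 * Sketch only: userspace approximation of the "re-lookup the dtrq in the
 * replay list" pattern.  None of these names are the real Lustre symbols.
 */
#include <pthread.h>
#include <stdlib.h>

struct dtrq {                          /* distribute-txn replay request */
        struct dtrq        *next;
        unsigned long long  batchid;   /* key identifying the update batch */
};

static struct dtrq *replay_list;       /* head of the replay list */
static pthread_mutex_t replay_lock = PTHREAD_MUTEX_INITIALIZER;

/* Caller must hold replay_lock. */
static struct dtrq *dtrq_find_locked(unsigned long long batchid)
{
        struct dtrq *d;

        for (d = replay_list; d != NULL; d = d->next)
                if (d->batchid == batchid)
                        return d;
        return NULL;
}

static struct dtrq *dtrq_find_or_create(unsigned long long batchid)
{
        struct dtrq *d, *fresh = NULL;

again:
        pthread_mutex_lock(&replay_lock);
        d = dtrq_find_locked(batchid);
        if (d == NULL) {
                if (fresh == NULL) {
                        /*
                         * Allocate outside the lock, then come back and
                         * re-lookup: another recovery thread may have
                         * inserted or freed an entry while the lock was
                         * dropped, so a pointer found before the allocation
                         * cannot be trusted.  Dereferencing such a stale,
                         * freed entry is what the 0x5a5a... poison in the
                         * oops above suggests.
                         */
                        pthread_mutex_unlock(&replay_lock);
                        fresh = calloc(1, sizeof(*fresh));
                        if (fresh == NULL)
                                return NULL;
                        fresh->batchid = batchid;
                        goto again;
                }
                fresh->next = replay_list;
                replay_list = fresh;
                d = fresh;
                fresh = NULL;
        }
        pthread_mutex_unlock(&replay_lock);

        free(fresh);                   /* lost the race: keep the existing entry */
        return d;
}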


 Comments   
Comment by Gerrit Updater [ 11/Jul/15 ]

wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/15573
Subject: LU-6837 update: re-lookup the dtrq in the replay list.
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6d770b5e231d35708392dcc0721e46482aad420d

Comment by James Nunez (Inactive) [ 16/Jul/15 ]

It looks like we hit this one during review-dne testing:

2015-07-15 16:25:43 - https://testing.hpdd.intel.com/test_sets/7c3ce854-2b7f-11e5-aa6d-5254006e85c2

If this is not the same issue, please let me know.

Comment by Gerrit Updater [ 19/Jul/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15573/
Subject: LU-6837 update: re-lookup the dtrq in the replay list.
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f836ef39fba1d6884aac9ff37479cbaac6a400b4

Comment by Peter Jones [ 19/Jul/15 ]

Landed for 2.8
