LU-7809

general protection fault: 0000 during failback of MDS disk resources

Details


    Description

      The error occurred during soak testing of build '20160222' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20150222). DNE is enabled. MDTs were formatted using ldiskfs, OSTs using zfs. MDSes are configured in an active-active HA failover configuration. In particular, the affected nodes (lola-[8,9]) form an HA failover pair.
      More set-up details can be found at https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-Configuration

      Sequence of events

      • 2016-02-23 23:52:17,963:fsmgmt.fsmgmt:INFO triggering fault mds_failover
      • 2016-02-23 23:52:17,964:fsmgmt.fsmgmt:INFO reseting MDS node lola-9
      • 2016-02-24 00:00:29 Both MDTs (mdt-2,3) fail over to lola-8
      • 2016-02-24 00:01:06,468:fsmgmt.fsmgmt:INFO ... soaked-MDT0003 failed back (action completed successful!)
      • 2016-02-24 00:01:06,468:fsmgmt.fsmgmt:INFO Unmounting soaked-MDT0002 on lola-8 .. (--> caused crash)

      The error reads as:

      <4>general protection fault: 0000 [#1] 
      <3>LustreError: 6683:0:(ldlm_lib.c:2562:target_stop_recovery_thread()) soaked-MDT0002: Aborting recovery
      <4>SMP 
      <4>last sysfs file: /sys/devices/system/cpu/online
      <4>CPU 12 
      <4>Modules linked in: mgs(U) osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgc(U) osd_ldiskfs(U) ldiskfs(U) jbd2 lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) 8021q garp stp llc nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm scsi_dh_rdac dm_round_robin dm_multipath iTCO_wdt iTCO_vendor_support microcode zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) sb_edac edac_core lpc_ich mfd_core i2c_i801 ioatdma sg igb dca i2c_algo_bit i2c_core ptp pps_core ext3 jbd mbcache sd_mod crc_t10dif ahci isci libsas wmi mpt2sas scsi_transport_sas raid_class mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
      <4>
      <4>Pid: 6617, comm: tgt_recover_2 Tainted: P           ---------------    2.6.32-504.30.3.el6_lustre.g93f956d.x86_64 #1 Intel Corporation S2600GZ ........../S2600GZ
      <4>RIP: 0010:[<ffffffffa0b2222c>]  [<ffffffffa0b2222c>] distribute_txn_get_next_transno+0x3c/0xd0 [ptlrpc]
      <4>RSP: 0018:ffff88028674fca0  EFLAGS: 00010207
      <4>RAX: 5a5a5a5a5a5a5a5a RBX: 0000000000000000 RCX: 0000000000000000
      <4>RDX: ffff8802866a41e8 RSI: ffffffffa0a65b80 RDI: ffff8802866a4208
      <4>RBP: ffff88028674fcc0 R08: 00000000fffffff2 R09: 00000000fffffff5
      <4>R10: 0000000000000009 R11: 0000000000000000 R12: ffff8802866a4180
      <4>R13: ffff8802866a4208 R14: ffff8802866a4180 R15: 0000000000000000
      <4>FS:  0000000000000000(0000) GS:ffff88044e480000(0000) knlGS:0000000000000000
      <4>CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      <4>CR2: 00007f7e64e29000 CR3: 0000000001a85000 CR4: 00000000000407e0
      <4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      <4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      <4>Process tgt_recover_2 (pid: 6617, threadinfo ffff88028674e000, task ffff88028674d520)
      <4>Stack:
      <4> ffffffffa0bb6640 ffff88080c3fd038 0000000000000000 ffff88080c3fd3cc
      <4><d> ffff88028674fd50 ffffffffa0a65c07 ffff88028674fde0 00000000a085c87e
      <4><d> 0000000000000000 ffff8807fb3adddd 00000054a0a60fc0 ffffffffa0b52dc9
      <4>Call Trace:
      <4> [<ffffffffa0a65c07>] check_for_next_transno+0x87/0x6d0 [ptlrpc]
      <4> [<ffffffffa0a65b80>] ? check_for_next_transno+0x0/0x6d0 [ptlrpc]
      <4> [<ffffffffa0a62c63>] target_recovery_overseer+0xb3/0x630 [ptlrpc]
      <4> [<ffffffffa0a60f30>] ? exp_req_replay_healthy_or_from_mdt+0x0/0x40 [ptlrpc]
      <4> [<ffffffffa077bcf1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      <4> [<ffffffffa0a62ac0>] ? abort_lock_replay_queue+0x30/0x120 [ptlrpc]
      <4> [<ffffffffa0a693db>] target_recovery_thread+0x8bb/0x1dd0 [ptlrpc]
      <4> [<ffffffff81064c12>] ? default_wake_function+0x12/0x20
      <4> [<ffffffffa0a68b20>] ? target_recovery_thread+0x0/0x1dd0 [ptlrpc]
      <4> [<ffffffff8109e78e>] kthread+0x9e/0xc0
      <4> [<ffffffff8100c28a>] child_rip+0xa/0x20
      <4> [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
      <4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
      <4>Code: 89 6d f8 0f 1f 44 00 00 31 db 4c 8d af 88 00 00 00 49 89 fc 4c 89 ef e8 13 b6 a0 e0 49 8b 44 24 68 49 8d 54 24 68 48 39 d0 74 04 <48> 8b 58 10 4c 89 e8 66 ff 00 66 66 90 f6 05 26 4e c7 ff 08 74 
      <1>RIP  [<ffffffffa0b2222c>] distribute_txn_get_next_transno+0x3c/0xd0 [ptlrpc]
      <4> RSP <ffff88028674fca0>
      
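      A note on reading the register dump: RAX holds 5a5a5a5a5a5a5a5a, and the faulting bytes marked in the Code line (`<48> 8b 58 10`) decode to a load from `0x10(%rax)`. A repeated 0x5a byte is a fill pattern commonly used to poison memory, and the value is not a canonical x86-64 address, which would explain a general protection fault rather than a page fault. Treating this as a sign of freed memory is an interpretation, not something the report states. The sketch below, with hypothetical helper names, flags such registers in a few lines copied from the oops:

```python
import re

# Sample lines copied verbatim from the oops above.
OOPS = """\
<4>RIP: 0010:[<ffffffffa0b2222c>]  [<ffffffffa0b2222c>] distribute_txn_get_next_transno+0x3c/0xd0 [ptlrpc]
<4>RAX: 5a5a5a5a5a5a5a5a RBX: 0000000000000000 RCX: 0000000000000000
<4>RDX: ffff8802866a41e8 RSI: ffffffffa0a65b80 RDI: ffff8802866a4208
"""

# Matches 'RAX: 5a5a...' style pairs: a 2-3 character name and 16 hex digits.
REG_RE = re.compile(r"\b([A-Z0-9]{2,3}): ([0-9a-f]{16})\b")


def parse_registers(text):
    """Return {register name: integer value} for each 'NAME: <16 hex digits>' pair."""
    return {name: int(value, 16) for name, value in REG_RE.findall(text)}


def is_repeated_byte(value, byte=0x5A):
    """True if all eight bytes of the 64-bit value equal `byte`."""
    return value == int.from_bytes(bytes([byte]) * 8, "big")


def is_canonical(addr):
    """Canonical check for 48-bit virtual addressing: bits 63..47 must all be equal."""
    top = addr >> 47
    return top == 0 or top == 0x1FFFF


regs = parse_registers(OOPS)
suspects = [name for name, value in regs.items() if is_repeated_byte(value)]
print("registers holding the 0x5a pattern:", suspects)
print("RAX canonical:", is_canonical(regs["RAX"]))
```

      The same check can be run over the other oopses in this ticket by pasting their register lines into `OOPS`.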

      Immediately before the crash the following errors are printed to lola-8's message file:

      lola-8.log:Feb 24 00:01:06 lola-8 kernel: LustreError: 6612:0:(osp_object.c:588:osp_attr_get()) soaked-MDT0003-osp-MDT0002:osp_attr_get update error [0x200000009:0x3:0x0]: rc = -5
      lola-8.log:Feb 24 00:01:06 lola-8 kernel: LustreError: 6612:0:(lod_sub_object.c:959:lod_sub_prep_llog()) soaked-MDT0002-mdtlov: can't get id from catalogs: rc = -5
      lola-8.log:Feb 24 00:01:06 lola-8 kernel: LustreError: 6612:0:(lod_dev.c:419:lod_sub_recovery_thread()) soaked-MDT0003-osp-MDT0002 getting update log failed: rc = -5
      ...
      ...
      lola-8.log:Feb 24 00:01:09 lola-8 kernel: LustreError: 6617:0:(update_records.c:72:update_records_dump()) master transno = 8594544408 batchid = 4299976565 flags = 0 ops = 73 params = 46
      lola-8.log:Feb 24 00:01:09 lola-8 kernel: LustreError: 6617:0:(update_records.c:72:update_records_dump()) master transno = 8594544409 batchid = 4299976566 flags = 0 ops = 73 params = 46
      lola-8.log:Feb 24 00:01:09 lola-8 kernel: LustreError: 6617:0:(update_records.c:72:update_records_dump()) master transno = 8594544411 batchid = 4299976567 flags = 0 ops = 73 params = 46
      lola-8.log:Feb 24 00:01:09 lola-8 kernel: LustreError: 6617:0:(update_records.c:72:update_records_dump()) master transno = 8594544417 batchid = 4299976568 flags = 0 ops = 73 params = 46
      lola-8.log:Feb 24 00:01:09 lola-8 kernel: general protection fault: 0000 [#1] 
      lola-8.log:Feb 24 00:01:09 lola-8 kernel: LustreError: 6683:0:(ldlm_lib.c:2562:target_stop_recovery_thread()) soaked-MDT0002: Aborting recovery
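      The update_records_dump() lines can be pulled apart mechanically when comparing runs. The following is a minimal sketch, assuming the line layout shown above; the regular expression and function name are illustrative and not part of Lustre. On the four sample lines it shows that the dumping thread's pid (6617) matches the pid of the crashing tgt_recover_2 thread in the oops, and that the batchid values are consecutive:

```python
import re

# Four lines copied verbatim from lola-8's messages file above.
LOG = """\
lola-8.log:Feb 24 00:01:09 lola-8 kernel: LustreError: 6617:0:(update_records.c:72:update_records_dump()) master transno = 8594544408 batchid = 4299976565 flags = 0 ops = 73 params = 46
lola-8.log:Feb 24 00:01:09 lola-8 kernel: LustreError: 6617:0:(update_records.c:72:update_records_dump()) master transno = 8594544409 batchid = 4299976566 flags = 0 ops = 73 params = 46
lola-8.log:Feb 24 00:01:09 lola-8 kernel: LustreError: 6617:0:(update_records.c:72:update_records_dump()) master transno = 8594544411 batchid = 4299976567 flags = 0 ops = 73 params = 46
lola-8.log:Feb 24 00:01:09 lola-8 kernel: LustreError: 6617:0:(update_records.c:72:update_records_dump()) master transno = 8594544417 batchid = 4299976568 flags = 0 ops = 73 params = 46
"""

LINE_RE = re.compile(
    r"LustreError: (?P<pid>\d+):\d+:\(update_records\.c:\d+:update_records_dump\(\)\) "
    r"master transno = (?P<transno>\d+) batchid = (?P<batchid>\d+) "
    r"flags = (?P<flags>\S+) ops = (?P<ops>\d+) params = (?P<params>\d+)"
)


def parse_update_records(text):
    """Return one dict per update_records_dump() line found in `text`."""
    records = []
    for m in LINE_RE.finditer(text):
        rec = {k: int(m.group(k)) for k in ("pid", "transno", "batchid", "ops", "params")}
        rec["flags"] = m.group("flags")  # kept as text; its numeric base is not stated in the log
        records.append(rec)
    return records


records = parse_update_records(LOG)
pids = {r["pid"] for r in records}
transnos = [r["transno"] for r in records]
batch_steps = [b["batchid"] - a["batchid"] for a, b in zip(records, records[1:])]

print("pids:", pids)
print("transno ascending:", transnos == sorted(transnos))
print("batchid steps:", batch_steps)
```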
      

      Attached files: message, console, vmcore-dmesg.txt of lola-8.
      Crash file is available, too.

          Activity

            pjones Peter Jones added a comment -

            Landed for 2.9


            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18651/
            Subject: LU-7809 lod: stop recovery before destory dtrq list
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: f2892fda72897a8a264414c06e54751d127a5709

            gerrit Gerrit Updater added the comment above.
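            The patch subject describes an ordering fix: stop recovery before destroying the dtrq list. The general pattern can be sketched as follows. This is not the Lustre code; the `Target` class and all of its members are hypothetical, and Python threads stand in for the kernel recovery thread. The point is only that teardown first stops and joins the thread that reads the list, and frees the list afterwards:

```python
import threading


class Target:
    """Hypothetical stand-in for a target owning a request list and a recovery thread."""

    def __init__(self, requests):
        self.requests = list(requests)
        self.lock = threading.Lock()
        self.stop = threading.Event()
        self.seen = []
        self.thread = threading.Thread(target=self._recover)

    def _recover(self):
        # Keep peeking at the head of the list until asked to stop.
        while not self.stop.is_set():
            with self.lock:
                if self.requests:
                    self.seen.append(self.requests[0])
            self.stop.wait(0.001)

    def start(self):
        self.thread.start()

    def teardown(self):
        # 1. Ask the recovery thread to stop and wait until it has exited.
        self.stop.set()
        self.thread.join()
        # 2. Only now destroy the list: no other thread can still be reading it.
        with self.lock:
            self.requests.clear()


target = Target([101, 102, 103])
target.start()
target.teardown()
print("thread alive:", target.thread.is_alive())
print("requests left:", target.requests)
```

            Reversing the two steps in `teardown()` would let the reader thread observe a list that is being torn down, which is the kind of race the patch subject suggests.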

            The error did not occur during soak testing of build https://build.hpdd.intel.com/job/lustre-master/3406 (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160713), in a test session that is still ongoing and has already run for 7 days.

            heckes Frank Heckes (Inactive) added the comment above.

            The crash happened again with b2_8 RC4 (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160302) while unmounting the MDTs on node lola-9:

            <4>general protection fault: 0000 [#1]
            <3>LustreError: 4715:0:(ldlm_lib.c:2562:target_stop_recovery_thread()) soaked-MDT0002: Aborting recovery
            <4>SMP
            <4>last sysfs file: /sys/devices/system/cpu/online
            <4>CPU 12
            <4>Modules linked in: osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgc(U) osd_ldiskfs(U) ldiskfs(U) jbd2 lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) 8021q garp stp llc nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm scsi_dh_rdac dm_round_robin dm_multipath microcode iTCO_wdt iTCO_vendor_support zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) sb_edac edac_core lpc_ich mfd_core i2c_i801 ioatdma sg igb dca i2c_algo_bit i2c_core ptp pps_core ext3 jbd mbcache sd_mod crc_t10dif isci libsas ahci mpt2sas scsi_transport_sas raid_class mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_core wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
            <4>
            <4>Pid: 4496, comm: tgt_recover_2 Tainted: P           ---------------    2.6.32-504.30.3.el6_lustre.x86_64 #1 Intel Corporation S2600GZ ........../S2600GZ
            <4>RIP: 0010:[<ffffffffa0aec290>]  [<ffffffffa0aec290>] distribute_txn_get_next_transno+0xb0/0xd0 [ptlrpc]
            <4>RSP: 0018:ffff8808224c9d30  EFLAGS: 00010202
            <4>RAX: 5a5a5a5a5a5a5a5a RBX: 0000000000000000 RCX: 0000000000000000
            <4>RDX: ffff8803cd1bd4e8 RSI: ffffffffa0b1a784 RDI: ffffffffa0b9f380
            <4>RBP: ffff8808224c9d50 R08: 00000000fffffffb R09: 00000000fffffffe
            <4>R10: 0000000000000000 R11: 0000000000000000 R12: ffff8803cd1bd480
            <4>R13: ffff8803cd1bd508 R14: ffff8803cd2413e0 R15: ffff8803cd241038
            <4>FS:  0000000000000000(0000) GS:ffff88044e480000(0000) knlGS:0000000000000000
            <4>CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
            <4>CR2: 00007f1efe1ac000 CR3: 0000000001a85000 CR4: 00000000000407e0
            <4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            <4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
            <4>Process tgt_recover_2 (pid: 4496, threadinfo ffff8808224c8000, task ffff8808224cf520)
            <4>Stack:
            <4> ffff8808224c9dd0 ffff8803cd3af0b0 ffffffffa0a2fb80 ffff8808224c9dd0
            <4><d> ffff8808224c9e30 ffffffffa0a2ce2a ffff8808224c9dd0 0000000000000286
            <4><d> 0000000000000064 0000000056d7154b ffff8808224c9de0 ffff8808224cf520
            <4>Call Trace:
            <4> [<ffffffffa0a2fb80>] ? check_for_next_transno+0x0/0x6d0 [ptlrpc]
            <4> [<ffffffffa0a2ce2a>] target_recovery_overseer+0x27a/0x630 [ptlrpc]
            <4> [<ffffffffa0a2af30>] ? exp_req_replay_healthy_or_from_mdt+0x0/0x40 [ptlrpc]
            <4> [<ffffffffa0aec827>] ? dtrq_destroy+0x497/0x630 [ptlrpc]
            <4> [<ffffffffa0a333db>] target_recovery_thread+0x8bb/0x1dd0 [ptlrpc]
            <4> [<ffffffffa0a32b20>] ? target_recovery_thread+0x0/0x1dd0 [ptlrpc]
            <4> [<ffffffff8109e78e>] kthread+0x9e/0xc0
            <4> [<ffffffff8100c28a>] child_rip+0xa/0x20
            <4> [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
            <4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
            <4>Code: 02 00 00 48 c7 c6 84 a7 b1 a0 48 c7 05 26 31 0b 00 00 00 00 00 c7 05 14 31 0b 00 00 00 08 00 48 c7 c7 80 f3 b9 a0 49 8b 44 24 10 <48> 8b 10 31 c0 48 83 c2 40 e8 12 aa c5 ff 48 89 d8 4c 8b 65 f0
            <1>RIP  [<ffffffffa0aec290>] distribute_txn_get_next_transno+0xb0/0xd0 [ptlrpc]
            <4> RSP <ffff8808224c9d30>
            

            The crash dump file can be provided on demand.

            heckes Frank Heckes (Inactive) added the comment above.

            We are seeing this again on 2.8.0-RC2.
            Errors before panic:

            Mar  1 07:54:19 lola-8 kernel: LustreError: 6942:0:(update_records.c:72:update_records_dump()) master transno = 146028910512 batchid = 141735366499 flags = 0 ops = 5 params = 4
            Mar  1 07:54:19 lola-8 kernel: LustreError: 6942:0:(update_records.c:72:update_records_dump()) master transno = 146028911079 batchid = 141735366507 flags = 0 ops = 5 params = 4
            Mar  1 07:54:19 lola-8 kernel: LustreError: 6942:0:(update_records.c:72:update_records_dump()) master transno = 146028911542 batchid = 141735366512 flags = 0 ops = 5 params = 4
            Mar  1 07:54:19 lola-8 kernel: LustreError: 6942:0:(update_records.c:72:update_records_dump()) master transno = 146028911597 batchid = 141735366514 flags = 0 ops = 5 params = 4
            Mar  1 07:54:19 lola-8 kernel: LustreError: 6942:0:(update_records.c:72:update_records_dump()) master transno = 146028911739 batchid = 141735366519 flags = 0 ops = 5 params = 4
            Mar  1 07:54:19 lola-8 kernel: LustreError: 6942:0:(update_records.c:72:update_records_dump()) master transno = 146028911807 batchid = 141735366523 flags = 0 ops = 5 params = 4
            Mar  1 07:54:19 lola-8 kernel: LustreError: 6942:0:(update_records.c:72:update_records_dump()) master transno = 146028911819 batchid = 141735366524 flags = 0 ops = 5 params = 4
            Mar  1 07:54:19 lola-8 kernel: LustreError: 6942:0:(update_records.c:72:update_records_dump()) master transno = 146028911868 batchid = 141735366527 flags = 0 ops = 5 params = 4
            Mar  1 07:54:19 lola-8 kernel: LustreError: 6942:0:(update_records.c:72:update_records_dump()) master transno = 146028911947 batchid = 141735366535 flags = 0 ops = 5 params = 4
            Mar  1 07:54:19 lola-8 kernel: LustreError: 6942:0:(update_records.c:72:update_records_dump()) master transno = 146028912075 batchid = 141735366538 flags = 0 ops = 5 params = 4
            Mar  1 07:54:19 lola-8 kernel: LustreError: 6942:0:(update_records.c:72:update_records_dump()) master transno = 146028912093 batchid = 141735366539 flags = 0 ops = 5 params = 4
            

            Panic

            Mar  1 07:54:19 lola-8 kernel: general protection fault: 0000 [#1]
            Mar  1 07:54:19 lola-8 kernel: LustreError: 7064:0:(ldlm_lib.c:2562:target_stop_recovery_thread()) soaked-MDT0003: Aborting recovery
            Mar  1 07:54:19 lola-8 kernel: SMP
            Mar  1 07:54:19 lola-8 kernel: last sysfs file: /sys/devices/system/cpu/online
            Mar  1 07:54:19 lola-8 kernel: CPU 18
            Mar  1 07:54:19 lola-8 kernel: Modules linked in: mgs(U) osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgc(U) osd_ldiskfs(U) ldiskfs(U) jbd2 lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) 8021q garp stp llc nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm scsi_dh_rdac dm_round_robin dm_multipath iTCO_wdt iTCO_vendor_support microcode zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) sb_edac edac_core lpc_ich mfd_core i2c_i801 ioatdma sg igb dca i2c_algo_bit i2c_core ptp pps_core ext3 jbd mbcache sd_mod crc_t10dif ahci isci libsas wmi mpt2sas scsi_transport_sas raid_class mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
            Mar  1 07:54:19 lola-8 kernel:
            Mar  1 07:54:19 lola-8 kernel: Pid: 6942, comm: tgt_recover_3 Tainted: P           ---------------    2.6.32-504.30.3.el6_lustre.x86_64 #1 Intel Corporation S2600GZ ........../S2600GZ
            Mar  1 07:54:19 lola-8 kernel: RIP: 0010:[<ffffffffa0b2121c>]  [<ffffffffa0b2121c>] distribute_txn_get_next_transno+0x3c/0xd0 [ptlrpc]
            Mar  1 07:54:19 lola-8 kernel: RSP: 0018:ffff8807e068dca0  EFLAGS: 00010203
            Mar  1 07:54:19 lola-8 kernel: RAX: 5a5a5a5a5a5a5a5a RBX: 0000000000000000 RCX: 0000000000000000
            Mar  1 07:54:19 lola-8 kernel: RDX: ffff880304ca7328 RSI: ffffffffa0a64b80 RDI: ffff880304ca7348
            Mar  1 07:54:19 lola-8 kernel: RBP: ffff8807e068dcc0 R08: 00000000fffffff0 R09: 00000000fffffff3
            Mar  1 07:54:19 lola-8 kernel: R10: 000000000000000b R11: 0000000000000000 R12: ffff880304ca72c0
            Mar  1 07:54:19 lola-8 kernel: R13: ffff880304ca7348 R14: ffff880304ca72c0 R15: 0000000000000000
            Mar  1 07:54:19 lola-8 kernel: FS:  0000000000000000(0000) GS:ffff880038340000(0000) knlGS:0000000000000000
            Mar  1 07:54:19 lola-8 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
            Mar  1 07:54:19 lola-8 kernel: CR2: 0000003bd24acd50 CR3: 0000000001a85000 CR4: 00000000000407e0
            Mar  1 07:54:19 lola-8 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            Mar  1 07:54:19 lola-8 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
            Mar  1 07:54:19 lola-8 kernel: Process tgt_recover_3 (pid: 6942, threadinfo ffff8807e068c000, task ffff8807fca66ab0)
            Mar  1 07:54:19 lola-8 kernel: Stack:
            Mar  1 07:54:19 lola-8 kernel: ffffffffa0bb53e0 ffff8803ea51e078 0000000000000000 ffff8803ea51e40c
            Mar  1 07:54:19 lola-8 kernel: <d> ffff8807e068dd50 ffffffffa0a64c07 ffff8807e068dde0 00000000a085c7be
            Mar  1 07:54:19 lola-8 kernel: <d> 0000000000000000 ffff8803e9d596da 00000054a0a5ffc0 ffffffffa0b51dad
            Mar  1 07:54:19 lola-8 kernel: Call Trace:
            Mar  1 07:54:19 lola-8 kernel: [<ffffffffa0a64c07>] check_for_next_transno+0x87/0x6d0 [ptlrpc]
            Mar  1 07:54:19 lola-8 kernel: [<ffffffffa0a64b80>] ? check_for_next_transno+0x0/0x6d0 [ptlrpc]
            Mar  1 07:54:19 lola-8 kernel: [<ffffffffa0a61c63>] target_recovery_overseer+0xb3/0x630 [ptlrpc]
            Mar  1 07:54:19 lola-8 kernel: [<ffffffffa0a5ff30>] ? exp_req_replay_healthy_or_from_mdt+0x0/0x40 [ptlrpc]
            Mar  1 07:54:19 lola-8 kernel: [<ffffffffa077bcf1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
            Mar  1 07:54:19 lola-8 kernel: [<ffffffffa0a61ac0>] ? abort_lock_replay_queue+0x30/0x120 [ptlrpc]
            Mar  1 07:54:19 lola-8 kernel: [<ffffffffa0a683db>] target_recovery_thread+0x8bb/0x1dd0 [ptlrpc]
            Mar  1 07:54:19 lola-8 kernel: [<ffffffff81064c12>] ? default_wake_function+0x12/0x20
            Mar  1 07:54:19 lola-8 kernel: [<ffffffffa0a67b20>] ? target_recovery_thread+0x0/0x1dd0 [ptlrpc]
            Mar  1 07:54:19 lola-8 kernel: [<ffffffff8109e78e>] kthread+0x9e/0xc0
            Mar  1 07:54:19 lola-8 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
            Mar  1 07:54:19 lola-8 kernel: [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
            Mar  1 07:54:19 lola-8 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
            Mar  1 07:54:19 lola-8 kernel: Code: 89 6d f8 0f 1f 44 00 00 31 db 4c 8d af 88 00 00 00 49 89 fc 4c 89 ef e8 23 c6 a0 e0 49 8b 44 24 68 49 8d 54 24 68 48 39 d0 74 04 <48> 8b 58 10 4c 89 e8 66 ff 00 66 66 90 f6 05 d6 5d c7 ff 08 74
            Mar  1 07:54:19 lola-8 kernel: RIP  [<ffffffffa0b2121c>] distribute_txn_get_next_transno+0x3c/0xd0 [ptlrpc]
            Mar  1 07:54:19 lola-8 kernel: RSP <ffff8807e068dca0>
            
            cliffw Cliff White (Inactive) added the comment above.

            wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/18658
            Subject: LU-7809 lod: stop recovery before destory dtrq list
            Project: fs/lustre-release
            Branch: b2_8
            Current Patch Set: 1
            Commit: f8f8162b3f4dda6fc08afda91514131dbe14cc59

            gerrit Gerrit Updater added the comment above.

            wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/18651
            Subject: LU-7809 lod: stop recovery before destory dtrq list
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: b4892ca5b5c3787313c9256fc23add5a88d61855

            gerrit Gerrit Updater added the comment above.

            Crash file has been uploaded to lhn.hpdd.intel.com:/scratch/crashdumps/lu-7809/lola-8/127.0.0.1-2016-02-24-00:01:25/.

            heckes Frank Heckes (Inactive) added the comment above.

            People

              di.wang Di Wang
              heckes Frank Heckes (Inactive)