Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.3.0, Lustre 2.4.0
    • Affects Version/s: Lustre 2.3.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 4450

    Description

      Sep  5 21:59:48 hyperion-dit30 kernel: BUG: soft lockup - CPU#9 stuck for 67s! [ll_ost_io00_016:10900]
      Sep  5 21:59:48 hyperion-dit30 kernel: Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) exportfs ost(U) mgc(U) ldiskfs(U) mbcache jbd2 lustre(U) lquota(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate ib_srp scsi_transport_srp ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm_intel kvm sg sd_mod crc_t10dif serio_raw iTCO_wdt iTCO_vendor_support i2c_i801 i2c_core ata_generic pata_acpi ata_piix ioatdma i7core_edac edac_core qla2xxx scsi_transport_fc scsi_tgt ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc igb dca mlx4_en mlx4_core [last unloaded: scsi_wait_scan]
      Sep  5 21:59:48 hyperion-dit30 kernel: CPU 9
      Sep  5 21:59:48 hyperion-dit30 kernel: Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) exportfs ost(U) mgc(U) ldiskfs(U) mbcache jbd2 lustre(U) lquota(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate ib_srp scsi_transport_srp ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm_intel kvm sg sd_mod crc_t10dif serio_raw iTCO_wdt iTCO_vendor_support i2c_i801 i2c_core ata_generic pata_acpi ata_piix ioatdma i7core_edac edac_core qla2xxx scsi_transport_fc scsi_tgt ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc igb dca mlx4_en mlx4_core [last unloaded: scsi_wait_scan]
      
      Sep  5 21:59:48 hyperion-dit30 kernel:
      Sep  5 21:59:48 hyperion-dit30 kernel: Pid: 10900, comm: ll_ost_io00_016 Tainted: P           ---------------    2.6.32-279.5.1.el6_lustre.x86_64 #1 Supermicro X8DTG-D/X8DTG-D
      Sep  5 21:59:48 hyperion-dit30 kernel: RIP: 0010:[<ffffffff8127e927>]  [<ffffffff8127e927>] __write_lock_failed+0x7/0x20
      Sep  5 21:59:48 hyperion-dit30 kernel: RSP: 0018:ffff8800282a3d68  EFLAGS: 00000287
      Sep  5 21:59:48 hyperion-dit30 kernel: RAX: ffffffffa0428ca0 RBX: ffff8800282a3d70 RCX: 0000000000000001
      Sep  5 21:59:48 hyperion-dit30 kernel: RDX: ffff8800282a3e28 RSI: 0000000000000001 RDI: ffff88033735b080
      Sep  5 21:59:48 hyperion-dit30 kernel: RBP: ffffffff8100bc13 R08: ffff8800282ae0e0 R09: 000018ff3747ec29
      Sep  5 21:59:48 hyperion-dit30 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800282a3ce0
      Sep  5 21:59:48 hyperion-dit30 kernel: R13: ffff8802dbbba810 R14: ffff88033735b080 R15: ffffffff8150604b
      Sep  5 21:59:48 hyperion-dit30 kernel: FS:  00002aaaab47e700(0000) GS:ffff8800282a0000(0000) knlGS:0000000000000000
      Sep  5 21:59:48 hyperion-dit30 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      Sep  5 21:59:48 hyperion-dit30 kernel: CR2: 00002aaaab3fa450 CR3: 0000000336d57000 CR4: 00000000000006e0
      Sep  5 21:59:48 hyperion-dit30 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      Sep  5 21:59:48 hyperion-dit30 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Sep  5 21:59:48 hyperion-dit30 kernel: Process ll_ost_io00_016 (pid: 10900, threadinfo ffff880278b9c000, task ffff8802a70b9540)
      Sep  5 21:59:48 hyperion-dit30 kernel: Stack:
      Sep  5 21:59:48 hyperion-dit30 kernel: ffffffff815005b7 ffff8800282a3d80 ffffffffa040b702 ffff8800282a3da0
      Sep  5 21:59:48 hyperion-dit30 kernel: <d> ffffffffa040b677 0000000000000000 ffff88033735b080 ffff8800282a3e00
      Sep  5 21:59:48 hyperion-dit30 kernel: <d> ffffffffa040d588 000000018110dd48 ffff8800282a3e28 0000000000000092
      Sep  5 21:59:48 hyperion-dit30 kernel: Call Trace:
      Sep  5 21:59:48 hyperion-dit30 kernel: <IRQ>
      
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffff815005b7>] ? _write_lock+0x17/0x20
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa040b702>] ? cfs_hash_rw_lock+0x12/0x30 [libcfs]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa040b677>] ? cfs_hash_for_each_enter+0x37/0xb0 [libcfs]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa040d588>] ? cfs_hash_for_each_tight+0x48/0x2a0 [libcfs]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa0793fd0>] ? job_cleanup_callback+0x0/0x60 [obdclass]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa040d853>] ? cfs_hash_for_each_safe+0x13/0x20 [libcfs]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa079400b>] ? job_cleanup_callback+0x3b/0x60 [obdclass]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffff81057fac>] ? scheduler_tick+0xcc/0x260
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffff8107e897>] ? run_timer_softirq+0x197/0x340
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffff810a21d0>] ? tick_sched_timer+0x0/0xc0
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffff8102b40d>] ? lapic_next_event+0x1d/0x30
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffff81073ec1>] ? __do_softirq+0xc1/0x1e0
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffff81096c50>] ? hrtimer_interrupt+0x140/0x250
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffff8100c24c>] ? call_softirq+0x1c/0x30
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffff8100de85>] ? do_softirq+0x65/0xa0
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffff81073ca5>] ? irq_exit+0x85/0x90
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffff81506050>] ? smp_apic_timer_interrupt+0x70/0x9b
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffff8100bc13>] ? apic_timer_interrupt+0x13/0x20
      Sep  5 21:59:48 hyperion-dit30 kernel: <EOI>
      
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffff81279955>] ? strncmp+0x45/0x60
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa0793fae>] ? job_stat_keycmp+0x5e/0x80 [obdclass]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa040ad95>] ? cfs_hash_bd_lookup_intent+0x65/0x130 [libcfs]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa040af17>] ? cfs_hash_dual_bd_lookup_locked+0x37/0x70 [libcfs]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa040c1a4>] ? cfs_hash_lookup+0x54/0xa0 [libcfs]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffff81115438>] ? find_or_create_page+0x68/0xb0
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa079458b>] ? lprocfs_job_stats_log+0x9b/0x870 [obdclass]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa0e0134a>] ? filter_preprw_write+0x1d5a/0x2320 [obdfilter]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa04b7edb>] ? lnet_ni_send+0x4b/0x110 [lnet]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa096aaab>] ? null_alloc_rs+0x1ab/0x3b0 [ptlrpc]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa0e02710>] ? filter_preprw+0x80/0xa0 [obdfilter]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa08a381c>] ? obd_preprw+0x12c/0x3d0 [ost]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa08aa98a>] ? ost_brw_write+0x87a/0x1600 [ost]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffff8127ce76>] ? vsnprintf+0x2b6/0x5f0
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa092909c>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa09291f8>] ? lustre_msg_check_version+0xe8/0x100 [ptlrpc]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa08b102c>] ? ost_handle+0x360c/0x4850 [ost]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa0405561>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa0401364>] ? libcfs_id2str+0x74/0xb0 [libcfs]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa093885d>] ? ptlrpc_server_handle_request+0x40d/0xea0 [ptlrpc]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa03f565e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa092fdf7>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffff810533f3>] ? __wake_up+0x53/0x70
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa0939e49>] ? ptlrpc_main+0xb59/0x1860 [ptlrpc]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa09392f0>] ? ptlrpc_main+0x0/0x1860 [ptlrpc]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffff8100c14a>] ? child_rip+0xa/0x20
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa09392f0>] ? ptlrpc_main+0x0/0x1860 [ptlrpc]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa09392f0>] ? ptlrpc_main+0x0/0x1860 [ptlrpc]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
      Sep  5 21:59:48 hyperion-dit30 kernel: Code: d8 1f 00 00 48 8b 5b 20 48 83 eb 07 48 39 d9 73 06 48 89 01 31 c0 c3 b8 f2 ff ff ff c3 90 90 90 90 90 90 90 f0 81 07 00 00 00 01 <f3> 90 81 3f 00 00 00 01 75 f6 f0 81 2f 00 00 00 01 0f 85 e2 ff
      Sep  5 21:59:48 hyperion-dit30 kernel: Call Trace:
      
      Sep  5 21:59:48 hyperion-dit30 kernel: <IRQ>  [<ffffffff815005b7>] ? _write_lock+0x17/0x20
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa040b702>] ? cfs_hash_rw_lock+0x12/0x30 [libcfs]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa040b677>] ? cfs_hash_for_each_enter+0x37/0xb0 [libcfs]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa040d588>] ? cfs_hash_for_each_tight+0x48/0x2a0 [libcfs]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa0793fd0>] ? job_cleanup_callback+0x0/0x60 [obdclass]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa040d853>] ? cfs_hash_for_each_safe+0x13/0x20 [libcfs]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa079400b>] ? job_cleanup_callback+0x3b/0x60 [obdclass]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffff81057fac>] ? scheduler_tick+0xcc/0x260
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffff8107e897>] ? run_timer_softirq+0x197/0x340
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffff810a21d0>] ? tick_sched_timer+0x0/0xc0
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffff8102b40d>] ? lapic_next_event+0x1d/0x30
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffff81073ec1>] ? __do_softirq+0xc1/0x1e0
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffff81096c50>] ? hrtimer_interrupt+0x140/0x250
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffff8100c24c>] ? call_softirq+0x1c/0x30
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffff8100de85>] ? do_softirq+0x65/0xa0
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffff81073ca5>] ? irq_exit+0x85/0x90
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffff81506050>] ? smp_apic_timer_interrupt+0x70/0x9b
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffff8100bc13>] ? apic_timer_interrupt+0x13/0x20
      Sep  5 21:59:48 hyperion-dit30 kernel: <EOI>  [<ffffffff81279955>] ? strncmp+0x45/0x60
      
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa0793fae>] ? job_stat_keycmp+0x5e/0x80 [obdclass]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa040ad95>] ? cfs_hash_bd_lookup_intent+0x65/0x130 [libcfs]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa040af17>] ? cfs_hash_dual_bd_lookup_locked+0x37/0x70 [libcfs]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa040c1a4>] ? cfs_hash_lookup+0x54/0xa0 [libcfs]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffff81115438>] ? find_or_create_page+0x68/0xb0
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa079458b>] ? lprocfs_job_stats_log+0x9b/0x870 [obdclass]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa0e0134a>] ? filter_preprw_write+0x1d5a/0x2320 [obdfilter]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa04b7edb>] ? lnet_ni_send+0x4b/0x110 [lnet]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa096aaab>] ? null_alloc_rs+0x1ab/0x3b0 [ptlrpc]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa0e02710>] ? filter_preprw+0x80/0xa0 [obdfilter]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa08a381c>] ? obd_preprw+0x12c/0x3d0 [ost]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa08aa98a>] ? ost_brw_write+0x87a/0x1600 [ost]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffff8127ce76>] ? vsnprintf+0x2b6/0x5f0
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa092909c>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa09291f8>] ? lustre_msg_check_version+0xe8/0x100 [ptlrpc]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa08b102c>] ? ost_handle+0x360c/0x4850 [ost]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa0405561>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa0401364>] ? libcfs_id2str+0x74/0xb0 [libcfs]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa093885d>] ? ptlrpc_server_handle_request+0x40d/0xea0 [ptlrpc]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa03f565e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa092fdf7>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffff810533f3>] ? __wake_up+0x53/0x70
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa0939e49>] ? ptlrpc_main+0xb59/0x1860 [ptlrpc]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa09392f0>] ? ptlrpc_main+0x0/0x1860 [ptlrpc]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffff8100c14a>] ? child_rip+0xa/0x20
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa09392f0>] ? ptlrpc_main+0x0/0x1860 [ptlrpc]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffffa09392f0>] ? ptlrpc_main+0x0/0x1860 [ptlrpc]
      Sep  5 21:59:48 hyperion-dit30 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
      

      Attachments

        Activity

          [LU-1872] OST soft lockup in recovery scale

          niu Niu Yawei (Inactive) added a comment -

          Patch landed.

          niu Niu Yawei (Inactive) added a comment -

          Since it's not practical to create a dedicated thread for the jobstats cleanup work, I'm going to move it into each jobstats logging call; the hash table will then be shrunk whenever there is activity. If there is no activity, the hash table stops growing too, so it's not a big deal, and the user can still clean it up manually if desired.
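
          As a rough illustration of that plan, here is a hedged C sketch of amortized cleanup piggybacked on the logging path. The struct, field, and function names (demo_*) are invented for illustration and are not the exact landed patch; cfs_hash_for_each_safe() and cfs_time_current_sec() are the real libcfs helpers the comment refers to.

            #include <linux/jiffies.h>
            /* libcfs hash and time helpers from the Lustre tree (assumed) */

            struct demo_job_stats {
                    cfs_hash_t *ojs_hash;              /* jobid -> counters */
                    int         ojs_cleanup_interval;  /* seconds, 0 = disabled */
                    time_t      ojs_last_cleanup;
            };

            static void demo_job_stats_log(struct demo_job_stats *stats,
                                           const char *jobid)
            {
                    /* ... cfs_hash_lookup()/insert and counter update ... */

                    /* Amortized cleanup: runs in process context, and only
                     * when there is logging activity, so an idle table
                     * neither grows nor needs a dedicated thread or timer. */
                    if (stats->ojs_cleanup_interval != 0 &&
                        cfs_time_current_sec() > stats->ojs_last_cleanup +
                                                 stats->ojs_cleanup_interval) {
                            stats->ojs_last_cleanup = cfs_time_current_sec();
                            /* cfs_hash_for_each_safe(stats->ojs_hash,
                             *                        demo_expire_cb, stats); */
                    }
            }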

          liang Liang Zhen (Inactive) added a comment -

          I just discussed this with Niu: the cfs_hash functions are NOT safe in interrupt context, since they are implemented with plain spin_lock or rwlock. It's also not a very good idea to call cfs_hash_for_each_safe() in a timer callback, because it can be time consuming if the hash table is big.
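
          For contrast, a minimal sketch of the usual kernel pattern that avoids both problems: the timer callback does nothing but schedule_work(), and the potentially long cfs_hash_for_each_safe() walk runs from a workqueue in process context. All demo_* names are invented; this is an illustrative alternative, not the fix that ultimately landed (see Niu's comment above).

            #include <linux/module.h>
            #include <linux/workqueue.h>
            #include <linux/timer.h>
            #include <linux/jiffies.h>

            static struct work_struct demo_cleanup_work;
            static struct timer_list demo_cleanup_timer;

            static void demo_cleanup_work_fn(struct work_struct *work)
            {
                    /* Process context: a safe place for the expensive walk,
                     * e.g. cfs_hash_for_each_safe(hash, demo_expire_cb, NULL).
                     * With the timer no longer touching the hash, plain
                     * rwlocks here are fine, and a slow walk only blocks
                     * this worker thread. */
            }

            static void demo_cleanup_timer_fn(unsigned long data)
            {
                    /* Softirq context: punt the real work; schedule_work()
                     * is safe to call from interrupt context. */
                    schedule_work(&demo_cleanup_work);
                    mod_timer(&demo_cleanup_timer, jiffies + 600 * HZ);
            }

            static int __init demo_init(void)
            {
                    INIT_WORK(&demo_cleanup_work, demo_cleanup_work_fn);
                    setup_timer(&demo_cleanup_timer, demo_cleanup_timer_fn, 0);
                    mod_timer(&demo_cleanup_timer, jiffies + 600 * HZ);
                    return 0;
            }
            module_init(demo_init);

            static void __exit demo_exit(void)
            {
                    del_timer_sync(&demo_cleanup_timer);
                    cancel_work_sync(&demo_cleanup_work);
            }
            module_exit(demo_exit);
            MODULE_LICENSE("GPL");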
          green Oleg Drokin added a comment -

          This looks like a potential deadlock in the jobid code.

          Note how the normal stack is in the middle of the jobid key-compare function, while the interrupt stack tries to modify that same hash and is stuck trying to take a lock.

          I imagine calling job_cleanup_callback from interrupt context is not such a great idea, or there may be a different primitive that is more locking-compatible with respect to interrupts. Perhaps Liang can shed some light on this.
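
          To make the failure mode concrete, here is a hypothetical minimal kernel-module sketch (2.6.32-era timer API, invented demo_* names, not the Lustre code) of the same locking pattern: process context holds the read side of an rwlock without disabling interrupts, the timer fires on the same CPU and spins on the write side, and the CPU soft-locks exactly as in the trace above.

            #include <linux/module.h>
            #include <linux/spinlock.h>
            #include <linux/timer.h>
            #include <linux/jiffies.h>

            static DEFINE_RWLOCK(demo_lock);
            static struct timer_list demo_timer;

            /* Softirq (timer) context, like job_cleanup_callback(). */
            static void demo_timer_fn(unsigned long data)
            {
                    write_lock(&demo_lock);   /* spins forever if this CPU is
                                               * already inside demo_lookup() */
                    /* ... walk and prune the hash ... */
                    write_unlock(&demo_lock);
                    mod_timer(&demo_timer, jiffies + HZ);
            }

            /* Process context, like the lookup under lprocfs_job_stats_log().
             * A plain read_lock() (not read_lock_bh/_irqsave) leaves timer
             * softirqs free to run on top of this critical section. */
            static void demo_lookup(void)
            {
                    read_lock(&demo_lock);
                    /* <- timer expires here: demo_timer_fn() spins in
                     *    write_lock() on this CPU, the reader never resumes,
                     *    and the watchdog prints "BUG: soft lockup". */
                    read_unlock(&demo_lock);
            }

            static int __init demo_init(void)
            {
                    setup_timer(&demo_timer, demo_timer_fn, 0);
                    mod_timer(&demo_timer, jiffies + HZ);
                    demo_lookup();
                    return 0;
            }
            module_init(demo_init);

            static void __exit demo_exit(void)
            {
                    del_timer_sync(&demo_timer);
            }
            module_exit(demo_exit);
            MODULE_LICENSE("GPL");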

          People

            Assignee: niu Niu Yawei (Inactive)
            Reporter: cliffw Cliff White (Inactive)
            Votes: 0
            Watchers: 3

            Dates

              Created:
              Updated:
              Resolved: