Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-14148

OSS crashes due to grants

    XMLWordPrintable

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 2.12.5
    • RHEL7, RHEL OFED, Lustre-2.12.5, ZFS-0.8
    • 2
    • 9223372036854775807

    Description

      One of our production filesystems that is running Lustre-2.12.5 crashed 3 times over the weekend. The first two events happened late at night, so I didn't capture a crash dump. For both of these outages, the server would crash again if I allowed recovery to take place, so I had to abort recovery. For the third crash, I allowed a crash dump to complete. For all 3 of the cases I have a stack trace (see below). The first problem appeared to just be heavy load (though I’m not an expert on the traces) – see Trace #1 below. However, the second time, the trace had mention of grant issues (see trace #2 below). There were also messages in the log like the following: 

      Nov 21 02:29:04 f2-oss1a6 kernel: LustreError: 20860:0:(tgt_grant.c:758:tgt_grant_check()) f2-OST001e: cli a3dbc252-8c91-c91f-230d-2a2f81a4a245 claims 1703936 GRANT, real grant 0

      James had mentioned last week that there was a grant patch that was needed for both the servers and clients in Lustre-2.12.5, but James mentioned that this problem we were seeing may or may not be related, and it would be a good idea to create a ticket. We are not sure if these different crashes are all related or not...

      Can we please get another set of eyes on the issue so we can work towards a resolution? These 3 outages were impactful to our users over the weekend. 

       

      Thanks,

      Dustin

      ############# TRACE #1 ###############
      
      [2974912.706991] Kernel panic - not syncing: Hard LOCKUP
      [2974912.706993] CPU: 22 PID: 0 Comm: swapper/22 Kdump: loaded Tainted: P           OEL ------------ T 3.10.0-1062.9.1.el7.x86_64 #1
      [2974912.706994] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.6.13 12/17/2018
      [2974912.706994] Call Trace:
      [2974912.706997]  <NMI>  [<ffffffffba17ac23>] dump_stack+0x19/0x1b
      [2974912.706999]  [<ffffffffba174967>] panic+0xe8/0x21f
      [2974912.707006]  [<ffffffffb9a9b6af>] nmi_panic+0x3f/0x40
      [2974912.707009]  [<ffffffffb9b4eb81>] watchdog_overflow_callback+0x121/0x140
      [2974912.707011]  [<ffffffffb9ba8277>] __perf_event_overflow+0x57/0x100
      [2974912.707014]  [<ffffffffb9bb1a14>] perf_event_overflow+0x14/0x20
      [2974912.707016]  [<ffffffffb9a0ac70>] handle_pmi_common+0x1a0/0x250
      [2974912.707028]  [<ffffffffb9a0af4f>] intel_pmu_handle_irq+0xcf/0x1d0
      [2974912.707030]  [<ffffffffba184031>] perf_event_nmi_handler+0x31/0x50
      [2974912.707032]  [<ffffffffba18593c>] nmi_handle.isra.0+0x8c/0x150
      [2974912.707034]  [<ffffffffba185c18>] do_nmi+0x218/0x460
      [2974912.707036]  [<ffffffffba184d9c>] end_repeat_nmi+0x1e/0x81
      [2974912.707043]  <EOE>  <IRQ>  [<ffffffffba17544a>] queued_spin_lock_slowpath+0xb/0xf
      [2974912.707044]  [<ffffffffba1833b7>] _raw_spin_lock_irqsave+0x37/0x40
      [2974912.707046]  [<ffffffffb9b5b769>] rcu_check_callbacks+0x549/0x70
      [2974912.707050]  [<ffffffffb9aaf536>] update_process_times+0x46/0x80
      [2974912.707051]  [<ffffffffb9b0fcb0>] tick_sched_handle+0x30/0x70
      [2974912.707052]  [<ffffffffb9b0ff79>] tick_sched_timer+0x39/0x80
      [2974912.707054]  [<ffffffffb9aca5ee>] __hrtimer_run_queues+0x10e/0x270
      [2974912.707056]  [<ffffffffb9acab4f>] hrtimer_interrupt+0xaf/0x1d0
      [2974912.707058]  [<ffffffffb9a5c60b>] local_apic_timer_interrupt+0x3b/0x60
      [2974912.707060]  [<ffffffffba1929d3>] smp_apic_timer_interrupt+0x43/0x60
      [2974912.707061]  [<ffffffffba18eefa>] apic_timer_interrupt+0x16a/0x170
      [2974912.707064]  [<ffffffffb9fc1b5e>] cpuidle_idle_call+0xde/0x230
      [2974912.707066]  [<ffffffffb9a37c6e>] arch_cpu_idle+0xe/0xc0
      [2974912.707067]  [<ffffffffb9b0158a>] cpu_startup_entry+0x14a/0x1e0
      [2974912.707069]  [<ffffffffb9a5a0b7>] start_secondary+0x1f7/0x270
      [2974912.707070]  [<ffffffffb9a000d5>] start_cpu+0x5/0x14
      
      ####################### TRACE #2 ########################
      [ 5378.059559] Kernel panic - not syncing: LBUG
      [ 5378.059561] Pid: 44556, comm: ll_ost01_079 3.10.0-1062.9.1.el7.x86_64 #1 SMP Mon Dec 2 08:31:54 EST 2019
      [ 5378.059563] Call Trace:
      [ 5378.059582]  [<ffffffffc0b9c7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      [ 5378.059585]  [<ffffffffc0b9c87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [ 5378.059659]  [<ffffffffc15b9d30>] tgt_grant_sanity_check+0x520/0x560 [ptlrpc]
      [ 5378.059668]  [<ffffffffc18a6579>] ofd_obd_disconnect+0x139/0x220 [ofd]
      [ 5378.059685]  [<ffffffffc14f4247>] target_handle_disconnect+0xd7/0x450 [ptlrpc]
      [ 5378.059711]  [<ffffffffc15948c8>] tgt_disconnect+0x58/0x170 [ptlrpc]
      [ 5378.059735]  [<ffffffffc159b9ea>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      [ 5378.059755]  [<ffffffffc154048b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [ 5378.059775]  [<ffffffffc1543df4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      [ 5378.059779]  [<ffffffff8e6c61f1>] kthread+0xd1/0xe0
      [ 5378.059784]  [<ffffffff8ed8dd1d>] ret_from_fork_nospec_begin+0x7/0x21
      [ 5378.164169] CPU: 10 PID: 23550 Comm: ll_ost00_022 Kdump: loaded Tainted: P           OE  ------------ T 3.10.0-1062.9.1.el7.x86_64 #1
      [ 5378.177532] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.6.13 12/17/2018
      [ 5378.185781] Call Trace:
      [ 5378.188925]  [<ffffffff8ed7ac23>] dump_stack+0x19/0x1b
      [ 5378.194743]  [<ffffffff8ed74967>] panic+0xe8/0x21f
      [ 5378.200189]  [<ffffffffc0b9c8cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
      [ 5378.207052]  [<ffffffffc15b9d30>] tgt_grant_sanity_check+0x520/0x560 [ptlrpc]
      [ 5378.214824]  [<ffffffffc18a6579>] ofd_obd_disconnect+0x139/0x220 [ofd]
      [ 5378.237175]  [<ffffffffc14f4247>] target_handle_disconnect+0xd7/0x450 [ptlrpc]
      [ 5378.245008]  [<ffffffffc15948c8>] tgt_disconnect+0x58/0x170 [ptlrpc]
      [ 5378.251976]  [<ffffffffc159b9ea>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      [ 5378.275204]  [<ffffffffc154048b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [ 5378.296541]  [<ffffffffc1543df4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      [ 5378.317273]  [<ffffffff8e6c61f1>] kthread+0xd1/0xe0
      [ 5378.329273]  [<ffffffff8ed8dd1d>] ret_from_fork_nospec_begin+0x7/0x21
      
      ####################### TRACE #3 ########################
      [59064.267110] Kernel panic - not syncing: Hard LOCKUP
      [59064.267111] CPU: 15 PID: 21200 Comm: ll_ost01_015 Kdump: loaded Tainted: P OEL ------------ T 3.10.0-1062.9.1.el7.x86_64 #1
      [59064.267112] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.6.13 12/17/2018
      [59064.267112] Call Trace:
      [59064.267115] <NMI> [<ffffffffb857ac23>] dump_stack+0x19/0x1b
      [59064.267117] [<ffffffffb8574967>] panic+0xe8/0x21f
      [59064.267122] [<ffffffffb7e9b6af>] nmi_panic+0x3f/0x40
      [59064.267124] [<ffffffffb7f4eb81>] watchdog_overflow_callback+0x121/0x140
      [59064.267127] [<ffffffffb7fa8277>] __perf_event_overflow+0x57/0x100
      [59064.267129] [<ffffffffb7fb1a14>] perf_event_overflow+0x14/0x20
      [59064.267131] [<ffffffffb7e0ac70>] handle_pmi_common+0x1a0/0x250
      [59064.267142] [<ffffffffb7e0af4f>] intel_pmu_handle_irq+0xcf/0x1d0
      [59064.267144] [<ffffffffb8584031>] perf_event_nmi_handler+0x31/0x50
      [59064.267146] [<ffffffffb858593c>] nmi_handle.isra.0+0x8c/0x150
      [59064.267147] [<ffffffffb8585c18>] do_nmi+0x218/0x460
      [59064.267149] [<ffffffffb8584d9c>] end_repeat_nmi+0x1e/0x81
      [59064.267155] <EOE> <IRQ> [<ffffffffb857544a>] queued_spin_lock_slowpath+0xb/0xf
      [59064.267156] [<ffffffffb85833b7>] _raw_spin_lock_irqsave+0x37/0x40
      [59064.267158] [<ffffffffb7f57f5b>] rcu_dump_cpu_stacks+0x4b/0xd0
      [59064.267160] [<ffffffffb7f5b662>] rcu_check_callbacks+0x442/0x730
      [59064.267164] [<ffffffffb7eaf536>] update_process_times+0x46/0x80
      [59064.267166] [<ffffffffb7f0fcb0>] tick_sched_handle+0x30/0x70
      [59064.267167] [<ffffffffb7f0ff79>] tick_sched_timer+0x39/0x80
      [59064.267169] [<ffffffffb7eca5ee>] __hrtimer_run_queues+0x10e/0x270
      [59064.267171] [<ffffffffb7ecab4f>] hrtimer_interrupt+0xaf/0x1d0
      [59064.267173] [<ffffffffb7e5c60b>] local_apic_timer_interrupt+0x3b/0x60
      [59064.267175] [<ffffffffb85929d3>] smp_apic_timer_interrupt+0x43/0x60
      [59064.267176] [<ffffffffb858eefa>] apic_timer_interrupt+0x16a/0x170
      [59064.267180] [<ffffffffb857544a>] queued_spin_lock_slowpath+0xb/0xf
      [59064.267181] [<ffffffffb8583330>] _raw_spin_lock+0x20/0x30
      [59064.267194] [<ffffffffc173902c>] lock_res_and_lock+0x2c/0x50 [ptlrpc]
      [59064.267208] [<ffffffffc1740c61>] ldlm_lock_enqueue+0x1b1/0xa20 [ptlrpc]
      [59064.267245] [<ffffffffc1769516>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
      [59064.267288] [<ffffffffc17f1bd2>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [59064.267311] [<ffffffffc17f89ea>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      [59064.267356] [<ffffffffc179d48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [59064.267396] [<ffffffffc17a0df4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      [59064.267417] [<ffffffffb7ec61f1>] kthread+0xd1/0xe0
      

      Attachments

        Activity

          People

            tappro Mikhail Pershin
            dustb100 Dustin Leverman
            Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: