Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Labels: None
    • Affects Version: Lustre 2.12.5
    • Environment: RHEL7, RHEL OFED, Lustre-2.12.5, ZFS-0.8
    • Severity: 2

    Description

      One of our production filesystems running Lustre-2.12.5 crashed 3 times over the weekend. The first two events happened late at night, so I didn't capture a crash dump; for both of those outages, the server would crash again if I allowed recovery to take place, so I had to abort recovery. For the third crash, I allowed a crash dump to complete. I have stack traces for all 3 cases (see below). The first problem appeared to be just heavy load (though I'm not an expert on the traces) – see Trace #1 below. The second time, however, the trace mentioned grant issues (see Trace #2 below). There were also messages in the log like the following:

      Nov 21 02:29:04 f2-oss1a6 kernel: LustreError: 20860:0:(tgt_grant.c:758:tgt_grant_check()) f2-OST001e: cli a3dbc252-8c91-c91f-230d-2a2f81a4a245 claims 1703936 GRANT, real grant 0
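
      To illustrate what that message means, here is a minimal sketch of the server-side check that emits it (an approximation of the tgt_grant_check() logic in lustre/target/tgt_grant.c, with simplified names; not the actual source). The server remembers how much grant it has handed each client; when a client claims to consume more than is on record, the server logs the mismatch and treats the claim as zero:

      /* Approximate sketch of the per-export grant check (simplified;
       * the real logic lives in lustre/target/tgt_grant.c). */
      #include <stdio.h>

      struct export_grant {
              const char *client_uuid;
              long long ted_grant;   /* bytes of grant the server gave this client */
      };

      /* The client claims to consume 'claimed' bytes of previously granted
       * space.  If the server's record is smaller, the claim is bogus: log
       * it and consume nothing, so the write is handled as ungranted. */
      static long long grant_check(struct export_grant *ted, long long claimed)
      {
              if (claimed > ted->ted_grant) {
                      fprintf(stderr, "cli %s claims %lld GRANT, real grant %lld\n",
                              ted->client_uuid, claimed,
                              ted->ted_grant > 0 ? ted->ted_grant : 0);
                      return 0;              /* no grant consumed */
              }
              ted->ted_grant -= claimed;     /* consume the claimed grant */
              return claimed;
      }

      int main(void)
      {
              struct export_grant ted = { "a3dbc252-...", 0 };
              grant_check(&ted, 1703936);    /* reproduces the message shape */
              return 0;
      }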

      James mentioned last week that there is a grant patch needed for both the servers and clients in Lustre-2.12.5, but he noted that the problem we are seeing may or may not be related and that it would be a good idea to create a ticket. We are not sure whether these different crashes are all related.

      Can we please get another set of eyes on the issue so we can work towards a resolution? These 3 outages were impactful to our users over the weekend. 

       

      Thanks,

      Dustin

      ############# TRACE #1 ###############
      
      [2974912.706991] Kernel panic - not syncing: Hard LOCKUP
      [2974912.706993] CPU: 22 PID: 0 Comm: swapper/22 Kdump: loaded Tainted: P           OEL ------------ T 3.10.0-1062.9.1.el7.x86_64 #1
      [2974912.706994] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.6.13 12/17/2018
      [2974912.706994] Call Trace:
      [2974912.706997]  <NMI>  [<ffffffffba17ac23>] dump_stack+0x19/0x1b
      [2974912.706999]  [<ffffffffba174967>] panic+0xe8/0x21f
      [2974912.707006]  [<ffffffffb9a9b6af>] nmi_panic+0x3f/0x40
      [2974912.707009]  [<ffffffffb9b4eb81>] watchdog_overflow_callback+0x121/0x140
      [2974912.707011]  [<ffffffffb9ba8277>] __perf_event_overflow+0x57/0x100
      [2974912.707014]  [<ffffffffb9bb1a14>] perf_event_overflow+0x14/0x20
      [2974912.707016]  [<ffffffffb9a0ac70>] handle_pmi_common+0x1a0/0x250
      [2974912.707028]  [<ffffffffb9a0af4f>] intel_pmu_handle_irq+0xcf/0x1d0
      [2974912.707030]  [<ffffffffba184031>] perf_event_nmi_handler+0x31/0x50
      [2974912.707032]  [<ffffffffba18593c>] nmi_handle.isra.0+0x8c/0x150
      [2974912.707034]  [<ffffffffba185c18>] do_nmi+0x218/0x460
      [2974912.707036]  [<ffffffffba184d9c>] end_repeat_nmi+0x1e/0x81
      [2974912.707043]  <EOE>  <IRQ>  [<ffffffffba17544a>] queued_spin_lock_slowpath+0xb/0xf
      [2974912.707044]  [<ffffffffba1833b7>] _raw_spin_lock_irqsave+0x37/0x40
      [2974912.707046]  [<ffffffffb9b5b769>] rcu_check_callbacks+0x549/0x730
      [2974912.707050]  [<ffffffffb9aaf536>] update_process_times+0x46/0x80
      [2974912.707051]  [<ffffffffb9b0fcb0>] tick_sched_handle+0x30/0x70
      [2974912.707052]  [<ffffffffb9b0ff79>] tick_sched_timer+0x39/0x80
      [2974912.707054]  [<ffffffffb9aca5ee>] __hrtimer_run_queues+0x10e/0x270
      [2974912.707056]  [<ffffffffb9acab4f>] hrtimer_interrupt+0xaf/0x1d0
      [2974912.707058]  [<ffffffffb9a5c60b>] local_apic_timer_interrupt+0x3b/0x60
      [2974912.707060]  [<ffffffffba1929d3>] smp_apic_timer_interrupt+0x43/0x60
      [2974912.707061]  [<ffffffffba18eefa>] apic_timer_interrupt+0x16a/0x170
      [2974912.707064]  [<ffffffffb9fc1b5e>] cpuidle_idle_call+0xde/0x230
      [2974912.707066]  [<ffffffffb9a37c6e>] arch_cpu_idle+0xe/0xc0
      [2974912.707067]  [<ffffffffb9b0158a>] cpu_startup_entry+0x14a/0x1e0
      [2974912.707069]  [<ffffffffb9a5a0b7>] start_secondary+0x1f7/0x270
      [2974912.707070]  [<ffffffffb9a000d5>] start_cpu+0x5/0x14
      
      ####################### TRACE #2 ########################
      [ 5378.059559] Kernel panic - not syncing: LBUG
      [ 5378.059561] Pid: 44556, comm: ll_ost01_079 3.10.0-1062.9.1.el7.x86_64 #1 SMP Mon Dec 2 08:31:54 EST 2019
      [ 5378.059563] Call Trace:
      [ 5378.059582]  [<ffffffffc0b9c7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      [ 5378.059585]  [<ffffffffc0b9c87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [ 5378.059659]  [<ffffffffc15b9d30>] tgt_grant_sanity_check+0x520/0x560 [ptlrpc]
      [ 5378.059668]  [<ffffffffc18a6579>] ofd_obd_disconnect+0x139/0x220 [ofd]
      [ 5378.059685]  [<ffffffffc14f4247>] target_handle_disconnect+0xd7/0x450 [ptlrpc]
      [ 5378.059711]  [<ffffffffc15948c8>] tgt_disconnect+0x58/0x170 [ptlrpc]
      [ 5378.059735]  [<ffffffffc159b9ea>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      [ 5378.059755]  [<ffffffffc154048b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [ 5378.059775]  [<ffffffffc1543df4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      [ 5378.059779]  [<ffffffff8e6c61f1>] kthread+0xd1/0xe0
      [ 5378.059784]  [<ffffffff8ed8dd1d>] ret_from_fork_nospec_begin+0x7/0x21
      [ 5378.164169] CPU: 10 PID: 23550 Comm: ll_ost00_022 Kdump: loaded Tainted: P           OE  ------------ T 3.10.0-1062.9.1.el7.x86_64 #1
      [ 5378.177532] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.6.13 12/17/2018
      [ 5378.185781] Call Trace:
      [ 5378.188925]  [<ffffffff8ed7ac23>] dump_stack+0x19/0x1b
      [ 5378.194743]  [<ffffffff8ed74967>] panic+0xe8/0x21f
      [ 5378.200189]  [<ffffffffc0b9c8cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
      [ 5378.207052]  [<ffffffffc15b9d30>] tgt_grant_sanity_check+0x520/0x560 [ptlrpc]
      [ 5378.214824]  [<ffffffffc18a6579>] ofd_obd_disconnect+0x139/0x220 [ofd]
      [ 5378.237175]  [<ffffffffc14f4247>] target_handle_disconnect+0xd7/0x450 [ptlrpc]
      [ 5378.245008]  [<ffffffffc15948c8>] tgt_disconnect+0x58/0x170 [ptlrpc]
      [ 5378.251976]  [<ffffffffc159b9ea>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      [ 5378.275204]  [<ffffffffc154048b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [ 5378.296541]  [<ffffffffc1543df4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      [ 5378.317273]  [<ffffffff8e6c61f1>] kthread+0xd1/0xe0
      [ 5378.329273]  [<ffffffff8ed8dd1d>] ret_from_fork_nospec_begin+0x7/0x21
      
      ####################### TRACE #3 ########################
      [59064.267110] Kernel panic - not syncing: Hard LOCKUP
      [59064.267111] CPU: 15 PID: 21200 Comm: ll_ost01_015 Kdump: loaded Tainted: P OEL ------------ T 3.10.0-1062.9.1.el7.x86_64 #1
      [59064.267112] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.6.13 12/17/2018
      [59064.267112] Call Trace:
      [59064.267115] <NMI> [<ffffffffb857ac23>] dump_stack+0x19/0x1b
      [59064.267117] [<ffffffffb8574967>] panic+0xe8/0x21f
      [59064.267122] [<ffffffffb7e9b6af>] nmi_panic+0x3f/0x40
      [59064.267124] [<ffffffffb7f4eb81>] watchdog_overflow_callback+0x121/0x140
      [59064.267127] [<ffffffffb7fa8277>] __perf_event_overflow+0x57/0x100
      [59064.267129] [<ffffffffb7fb1a14>] perf_event_overflow+0x14/0x20
      [59064.267131] [<ffffffffb7e0ac70>] handle_pmi_common+0x1a0/0x250
      [59064.267142] [<ffffffffb7e0af4f>] intel_pmu_handle_irq+0xcf/0x1d0
      [59064.267144] [<ffffffffb8584031>] perf_event_nmi_handler+0x31/0x50
      [59064.267146] [<ffffffffb858593c>] nmi_handle.isra.0+0x8c/0x150
      [59064.267147] [<ffffffffb8585c18>] do_nmi+0x218/0x460
      [59064.267149] [<ffffffffb8584d9c>] end_repeat_nmi+0x1e/0x81
      [59064.267155] <EOE> <IRQ> [<ffffffffb857544a>] queued_spin_lock_slowpath+0xb/0xf
      [59064.267156] [<ffffffffb85833b7>] _raw_spin_lock_irqsave+0x37/0x40
      [59064.267158] [<ffffffffb7f57f5b>] rcu_dump_cpu_stacks+0x4b/0xd0
      [59064.267160] [<ffffffffb7f5b662>] rcu_check_callbacks+0x442/0x730
      [59064.267164] [<ffffffffb7eaf536>] update_process_times+0x46/0x80
      [59064.267166] [<ffffffffb7f0fcb0>] tick_sched_handle+0x30/0x70
      [59064.267167] [<ffffffffb7f0ff79>] tick_sched_timer+0x39/0x80
      [59064.267169] [<ffffffffb7eca5ee>] __hrtimer_run_queues+0x10e/0x270
      [59064.267171] [<ffffffffb7ecab4f>] hrtimer_interrupt+0xaf/0x1d0
      [59064.267173] [<ffffffffb7e5c60b>] local_apic_timer_interrupt+0x3b/0x60
      [59064.267175] [<ffffffffb85929d3>] smp_apic_timer_interrupt+0x43/0x60
      [59064.267176] [<ffffffffb858eefa>] apic_timer_interrupt+0x16a/0x170
      [59064.267180] [<ffffffffb857544a>] queued_spin_lock_slowpath+0xb/0xf
      [59064.267181] [<ffffffffb8583330>] _raw_spin_lock+0x20/0x30
      [59064.267194] [<ffffffffc173902c>] lock_res_and_lock+0x2c/0x50 [ptlrpc]
      [59064.267208] [<ffffffffc1740c61>] ldlm_lock_enqueue+0x1b1/0xa20 [ptlrpc]
      [59064.267245] [<ffffffffc1769516>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
      [59064.267288] [<ffffffffc17f1bd2>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [59064.267311] [<ffffffffc17f89ea>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      [59064.267356] [<ffffffffc179d48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [59064.267396] [<ffffffffc17a0df4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      [59064.267417] [<ffffffffb7ec61f1>] kthread+0xd1/0xe0
      

        Activity

          [LU-14148] OSS crashes due to grants
          nilesj Jeff Niles added a comment - edited

          Olaf,

          Looks like one of them displayed what you suggest just prior to the LBUG:

          [ 5068.158355] LustreError: 201114:0:(tgt_grant.c:463:tgt_grant_space_left()) f2-OST0042: cli 7ed51cf7-9fc0-b864-d835-d4581a25c33f/ffff9f56a80bcc00 left 247957952659456 < tot_grant 4611686018666135551 unstable 0 pending 0 dirty 0
          [ 5068.180417] LustreError: 201114:0:(tgt_grant.c:463:tgt_grant_space_left()) Skipped 1993 previous similar messages
          [ 5377.919091] LustreError: 23550:0:(tgt_grant.c:151:tgt_check_export_grants()) f2-OST0042: cli 713de897-9a57-ad95-1aa1-508704ffa2eb/ffff9f6f799bb800 ted_grant(4611686018525560832) + ted_pending(0) > maxsize(516768656261120)
          [ 5377.940727] LustreError: 23550:0:(tgt_grant.c:223:tgt_grant_sanity_check()) LBUG
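
           Side note: those grant values sit just above 2^62 (4611686018427387904) – far larger than any plausible device size – which looks more like a wrapped or corrupted counter than real accounting. A quick self-contained check:

           #include <stdio.h>

           int main(void)
           {
                   const long long two62     = 1LL << 62;
                   const long long ted_grant = 4611686018525560832LL; /* from the LBUG line */
                   const long long tot_grant = 4611686018666135551LL; /* from tgt_grant_space_left */

                   printf("2^62      = %lld\n", two62);
                   printf("ted_grant = 2^62 + %lld\n", ted_grant - two62); /* + 98172928 */
                   printf("tot_grant = 2^62 + %lld\n", tot_grant - two62); /* + 238747647 */
                   return 0;
           }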

          Another one seems to have run out of grant space and then gone into a soft CPU lockup that progressed into a hard lockup, causing the crash.

          [ 1409.931676] LustreError: 31221:0:(tgt_grant.c:758:tgt_grant_check()) f2-OST001e: cli c470cc23-582d-7f43-87ca-5be8d783e23b claims 1703936 GRANT, real grant 0
          [ 1409.947466] LustreError: 31221:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 50 previous similar messages
          

          And the third one looks like it may have had network issues or something similar; the only grant errors appear hours prior to the crash.

           

          These are running 2.12.5 + patches; the patches should be LU-12136-38428 and LU-11276. James Simmons can provide better details on the exact patches in that build, but I think that's accurate.

          pjones Peter Jones added a comment -

          Mike is investigating


          ofaaland Olaf Faaland added a comment -

          Hi Dustin,

          We haven't seen the issue you've hit.

          According to the 2.12.5 tag code, you should have seen one of these LustreError messages preceding the LBUG, in either the console log or in the vmcore-dmesg.txt file that is dumped along with the core itself:

          tgt_grant_sanity_check and "%s: cli %s/%p dirty %ld pend %ld grant %ld\n"
          or
          tgt_grant_sanity_check and "%s: cli %s/%p ted_grant(%ld) + ted_pending(%ld)"
          or
          tgt_grant_sanity_check and "%s: cli %s/%p ted_dirty(%ld) > maxsize(%llu)\n"
          

          Do you have those?
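
          For reference, a rough sketch of the walk tgt_grant_sanity_check() does before tripping the LBUG (simplified from the 2.12 lustre/target/tgt_grant.c logic, with abbreviated names; the real code also totals the per-export counters and compares them against the filesystem-wide ones):

          /* Rough sketch: iterate the exports and assert per-client grant
           * accounting invariants; a violated bound prints one of the
           * messages above and then LBUGs.  This demo deliberately aborts
           * when fed the corrupted ted_grant value from this ticket. */
          #include <stdio.h>
          #include <stdlib.h>

          struct tg_export_data {
                  const char *uuid;
                  long long ted_grant;
                  long long ted_pending;
                  long long ted_dirty;
          };

          #define LBUG() do { fprintf(stderr, "LBUG\n"); abort(); } while (0)

          static void grant_sanity_check(struct tg_export_data *e, int n,
                                         long long maxsize)
          {
                  for (int i = 0; i < n; i++) {
                          struct tg_export_data *ted = &e[i];

                          if (ted->ted_grant < 0 || ted->ted_pending < 0 ||
                              ted->ted_dirty < 0) {
                                  fprintf(stderr, "cli %s dirty %lld pend %lld grant %lld\n",
                                          ted->uuid, ted->ted_dirty,
                                          ted->ted_pending, ted->ted_grant);
                                  LBUG();
                          }
                          if (ted->ted_grant + ted->ted_pending > maxsize) {
                                  fprintf(stderr, "cli %s ted_grant(%lld) + ted_pending(%lld) > maxsize(%lld)\n",
                                          ted->uuid, ted->ted_grant,
                                          ted->ted_pending, maxsize);
                                  LBUG();
                          }
                          if (ted->ted_dirty > maxsize) {
                                  fprintf(stderr, "cli %s ted_dirty(%lld) > maxsize(%lld)\n",
                                          ted->uuid, ted->ted_dirty, maxsize);
                                  LBUG();
                          }
                  }
          }

          int main(void)
          {
                  /* Values from the log above: the corrupted ted_grant trips check #2. */
                  struct tg_export_data bad = { "713de897-...", 4611686018525560832LL, 0, 0 };
                  grant_sanity_check(&bad, 1, 516768656261120LL);
                  return 0;
          }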

           

          ofaaland Olaf Faaland added a comment -

          Hi Dustin,

          Was the server running exactly Lustre 2.12.5, or 2.12.5 + patches?  If patches, what patches were there beyond the 2.12.5 tag?


          People

            Assignee: tappro Mikhail Pershin
            Reporter: dustb100 Dustin Leverman
            Votes: 0
            Watchers: 7
