Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-6049

General Protection Fault at echo_session_key_fini+0xa9

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.8.0
    • Lustre 2.1.6, Lustre 2.5.3
    • kernel 2.6.32-279.5.2
    • 3
    • 16855

    Description

      General Protection Fault at echo_session_key_fini+0xa9

      A crash on an OSS with the following trace has been encountered with lustre 2.1.6 and kernel 2.6.32-279.5.2

      2013-10-18 07:57:21 general protection fault: 0000 [#1] SMP
      2013-10-18 07:57:21 last sysfs file: /sys/devices/pci0000:80/0000:80:02.2/0000:84:00.0/host10/port-10:3/end_device-10:3/target10:0:3/10:0:3:4/state
      2013-10-18 07:57:21 CPU 0
      2013-10-18 07:57:21 Modules linked in: nfs fscache obdecho(U) obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U) lquota(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) nfsd lockd nfs_acl auth_rpcgss exportf
      s sunrpc rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_mirror dm_region_hash dm_log dm_round_robin scsi_dh_rdac dm_multipath dm_mod uinput usbhid
       hid sg mpt2sas scsi_transport_sas raid_class sb_edac edac_core i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ioatdma igb dca ext4 mbcache jbd2 ehci_hcd sd_mod crc_t10dif ahci megaraid_sas [last unloaded: scsi_wait_scan]
      2013-10-18 07:57:21
      2013-10-18 07:57:21 Pid: 14573, comm: obd_zombid Not tainted 2.6.32-279.5.2.bl6.Bull.35_restricted.x86_64 #1 Bull SAS bullx R/X9DRH-7TF/7F/iTF/iF
      2013-10-18 07:57:21 RIP: 0010:[<ffffffffa0a2dc49>] [<ffffffffa0a2dc49>] echo_session_key_fini+0xa9/0x100 [obdecho]
      2013-10-18 07:57:21 RSP: 0018:ffff88105451bce0 EFLAGS: 00010246
      2013-10-18 07:57:21 RAX: 5a5a5a5a5a5a5a5a RBX: 5a5a5a5a5a5a5a5a RCX: 0000000000000002
      2013-10-18 07:57:21 RDX: ffff8808587c7540 RSI: 5a5a5a5a5a5a5a5a RDI: 0000000000000000
      2013-10-18 07:57:21 RBP: ffff88105451bcf0 R08: ffffffff81ac5ee8 R09: 0000000000000140
      2013-10-18 07:57:21 R10: 0000000000000000 R11: 000000000000000c R12: ffff880f1ad9a2f8
      2013-10-18 07:57:21 R13: 0000000000000010 R14: ffff88105451be20 R15: ffff8802f65a46d0
      2013-10-18 07:57:21 FS: 00007f14eea7f700(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
      2013-10-18 07:57:21 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      2013-10-18 07:57:21 CR2: 00007f5cb4fc8000 CR3: 0000000001a06000 CR4: 00000000000406f0
      2013-10-18 07:57:21 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      2013-10-18 07:57:21 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      2013-10-18 07:57:21 Process obd_zombid (pid: 14573, threadinfo ffff881054518000, task ffff881068e0c080)
      2013-10-18 07:57:21 Stack:
      2013-10-18 07:57:21 0000000000000010 ffffffffa0a3cf40 ffff88105451bd20 ffffffffa0554d90
      2013-10-18 07:57:21 <d> ffff88105451bd20 ffffffffa0a3cf40 ffff880f1ad9a2f8 ffff88071cd302c0
      2013-10-18 07:57:21 <d> ffff88105451bd40 ffffffffa055514b 0000000000000000 ffff8802f65a46c8
      2013-10-18 07:57:21 Call Trace:
      2013-10-18 07:57:21 [<ffffffffa0554d90>] key_fini+0x60/0x1e0 [obdclass]
      2013-10-18 07:57:21 [<ffffffffa055514b>] lu_context_key_quiesce+0x5b/0x90 [obdclass]
      2013-10-18 07:57:21 [<ffffffffa05551d9>] lu_context_key_quiesce_many+0x59/0x80 [obdclass]
      2013-10-18 07:57:21 [<ffffffffa0a2d910>] echo_type_stop+0x20/0x30 [obdecho]
      2013-10-18 07:57:21 [<ffffffffa0554322>] lu_device_fini+0x52/0xd0 [obdclass]
      2013-10-18 07:57:21 [<ffffffffa0a2fed7>] echo_device_free+0x247/0x510 [obdecho]
      2013-10-18 07:57:22 [<ffffffffa053871d>] class_decref+0x46d/0x590 [obdclass]
      2013-10-18 07:57:22 [<ffffffffa0523bae>] obd_zombie_impexp_cull+0x31e/0x620 [obdclass]
      2013-10-18 07:57:22 [<ffffffffa0523f75>] obd_zombie_impexp_thread+0xc5/0x1c0 [obdclass]
      2013-10-18 07:57:22 [<ffffffff81048df0>] ? default_wake_function+0x0/0x20
      2013-10-18 07:57:22 [<ffffffffa0523eb0>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
      2013-10-18 07:57:22 [<ffffffff8100412a>] child_rip+0xa/0x20
      2013-10-18 07:57:22 [<ffffffffa0523eb0>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
      2013-10-18 07:57:22 [<ffffffffa0523eb0>] ? obd_zombie_impexp_thread+0x0/0x1c0 [obdclass]
      2013-10-18 07:57:22 [<ffffffff81004120>] ? child_rip+0x0/0x20
      2013-10-18 07:57:22 Code: 99 02 00 00 48 c7 05 d3 49 01 00 00 00 00 00 c7 05 c1 49 01 00 10 00 00 00 e8 14 74 a2 ff 48 b8 5a 5a 5a 5a 5a 5a 5a 5a 48 89 de <48> 89 03 48 8b 3d d5 69 01 00 e8 98 bb a1 ff 48 83 c4 08 5b c9
      2013-10-18 07:57:22 RIP [<ffffffffa0a2dc49>] echo_session_key_fini+0xa9/0x100 [obdecho]
      2013-10-18 07:57:22 RSP <ffff88105451bce0>
      2013-10-18 07:57:22 Initializing cgroup subsys cpuset
      2013-10-18 07:57:22 Initializing cgroup subsys cpu
      2013-10-18 07:57:22 Linux version 2.6.32-279.5.2.bl6.Bull.35_restricted.x86_64 (derr@atlas.frec.bull.fr) (gcc version 4.4.6 20120305 (Bull 4.4.6-4) (GCC) ) #1 SMP Wed Apr 24 14:29:54 CEST 2013
      

      The crash occured in echo_session_key_fini().
      key_fini() was calling key->lct_fini(), pointing on echo_session_key_fini(), using ctx->lc_value[index] as third parameter, but this array was containing POISON'ed values.

      static void echo_session_key_fini(...)
      {
              struct echo_session_info *session = data;
              OBD_SLAB_FREE_PTR(session, echo_session_kmem);    <================ GPF here, session is 0x5a5a5a5a5a5a5a5a
      }
      
      After OBD_SLAB_FREE_PTR() expansion:
      ------------------------------------
      
      static void echo_session_key_fini(...)
      {
              struct echo_session_info *session = data;
              LASSERT(session);
              lprocfs_counter_sub(obd_memory, OBD_MEMORY_STAT, (long)(sizeof *(session)));
              CDEBUG(D_MALLOC, name " '" #session "': %d at %p.\n", (int)(sizeof *(session)), session);
              memset(session, 0x5a, sizeof *(session));         <================ GPF here, session is 0x5a5a5a5a5a5a5a5a
              cfs_mem_cache_free(echo_session_kmem, session);
              (session) = (void *)0xdeadbeef;
      }
      

      Attachments

        Activity

          [LU-6049] General Protection Fault at echo_session_key_fini+0xa9
          pjones Peter Jones added a comment -

          Landed for 2.8

          pjones Peter Jones added a comment - Landed for 2.8

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13164/
          Subject: LU-6049 obdclass: Add synchro in lu_context_key_degister()
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 46f24c6d043a7d73fa68ef3541426603e0b86d52

          gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13164/ Subject: LU-6049 obdclass: Add synchro in lu_context_key_degister() Project: fs/lustre-release Branch: master Current Patch Set: Commit: 46f24c6d043a7d73fa68ef3541426603e0b86d52

          patrick-valentin (patrick.valentin@bull.net) uploaded a new patch: http://review.whamcloud.com/13166
          Subject: LU-6049 obdclass: Add synchro in lu_context_key_degister()
          Project: fs/lustre-release
          Branch: b2_1
          Current Patch Set: 1
          Commit: 3f72b5bb48200d85bf38c584a83e2b6cd8bd2583

          gerrit Gerrit Updater added a comment - patrick-valentin (patrick.valentin@bull.net) uploaded a new patch: http://review.whamcloud.com/13166 Subject: LU-6049 obdclass: Add synchro in lu_context_key_degister() Project: fs/lustre-release Branch: b2_1 Current Patch Set: 1 Commit: 3f72b5bb48200d85bf38c584a83e2b6cd8bd2583

          patrick-valentin (patrick.valentin@bull.net) uploaded a new patch: http://review.whamcloud.com/13165
          Subject: LU-6049 obdclass: Add synchro in lu_context_key_degister()
          Project: fs/lustre-release
          Branch: b2_5
          Current Patch Set: 1
          Commit: 61cb3d1b54e30d306fc7758c998b43c6d33b313f

          gerrit Gerrit Updater added a comment - patrick-valentin (patrick.valentin@bull.net) uploaded a new patch: http://review.whamcloud.com/13165 Subject: LU-6049 obdclass: Add synchro in lu_context_key_degister() Project: fs/lustre-release Branch: b2_5 Current Patch Set: 1 Commit: 61cb3d1b54e30d306fc7758c998b43c6d33b313f

          patrick-valentin (patrick.valentin@bull.net) uploaded a new patch: http://review.whamcloud.com/13164
          Subject: LU-6049 obdclass: Add synchro in lu_context_key_degister()
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: a692652985159ab214c9558f8623f0203b37d57a

          gerrit Gerrit Updater added a comment - patrick-valentin (patrick.valentin@bull.net) uploaded a new patch: http://review.whamcloud.com/13164 Subject: LU-6049 obdclass: Add synchro in lu_context_key_degister() Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: a692652985159ab214c9558f8623f0203b37d57a

          Andreas,
          There is only one patch to fix the three different issues (ASSERTION failed) described above. I wrote "patches" because I have a patch for the master branch, and I have also prepared the backports on lustre 2.1.6 and on lustre 2.5.3. With this patch, the reproducer has been running during several days without any crash on both lustre 2.1.6 and 2.5.3.
          I have to push them to gerrit, but unfortunately our development server is stopped for maintenance today. I will push them at the beginning of next week.

          Bruno,
          I looked in the LU-5264, but I can't say if it is related or not. In my case it's a race with lu_context_key_degister(), but in your case lu_context_key_degister() shouldn't have been already called, as there are still several unmount in progress.

          patrick.valentin Patrick Valentin (Inactive) added a comment - Andreas, There is only one patch to fix the three different issues (ASSERTION failed) described above. I wrote "patches" because I have a patch for the master branch, and I have also prepared the backports on lustre 2.1.6 and on lustre 2.5.3. With this patch, the reproducer has been running during several days without any crash on both lustre 2.1.6 and 2.5.3. I have to push them to gerrit, but unfortunately our development server is stopped for maintenance today. I will push them at the beginning of next week. Bruno, I looked in the LU-5264 , but I can't say if it is related or not. In my case it's a race with lu_context_key_degister(), but in your case lu_context_key_degister() shouldn't have been already called, as there are still several unmount in progress.

          I wonder if this could be related to LU-5264?

          bfaccini Bruno Faccini (Inactive) added a comment - I wonder if this could be related to LU-5264 ?

          Patrick, when you write "Patches to fix this issue are available", are you referring to just the last lu_context_exit() assertion, or do you have patches for the original problem? I don't see any patches referenced here.

          adilger Andreas Dilger added a comment - Patrick, when you write "Patches to fix this issue are available", are you referring to just the last lu_context_exit() assertion, or do you have patches for the original problem? I don't see any patches referenced here.

          Emoly,
          Would you be able to look in to this one?
          Thank you!

          jlevi Jodi Levi (Inactive) added a comment - Emoly, Would you be able to look in to this one? Thank you!

          I was not able to reproduce the same crash (probably a too small window), but with the following script, I was able to reproduce several crashes on a lustre client without any lustre file system mounted.
          All these crashes are corresponding to failed assertions, because when removing a key with lu_context_key_degister() at module unload, there is no synchronization with lu_context_init/refill/exit/fini.

          Reproducer:

          N=0
          while true; do
              N=$((N+1))
              D=$(((N%2000)*10))
              modprobe obdecho
              modprobe lustre > /dev/null &
              usleep $D 
              rmmod obdecho
              wait
              rmmod lustre
              [ -f /tmp/stop ] && break
              echo -e ".\c"
              [ $((N%100)) -eq 0 ] && echo "$N: (`date`)"
          done 2> /dev/null
          

          With this reproducer, we have one of the following crashes in a few tenth of minutes to one or two hours (lustre 2.1.6 and lustre 2.5.3).

          crash 1: key_fini(): ASSERTION( atomic_read(&key->lct_used) > 1 ) failed

                KERNEL: /usr/lib/debug/lib/modules/2.6.32-504.1.3.el6.Bull.66.x86_64/vmlinux
              DUMPFILE: /var/crash/127.0.0.1-2014-12-11-13:13:52/vmcore  [PARTIAL DUMP]
                  CPUS: 24
                  DATE: Thu Dec 11 13:12:04 2014
                UPTIME: 00:33:04
          LOAD AVERAGE: 1.07, 1.22, 1.01
                 TASKS: 641
              NODENAME: mo48
               RELEASE: 2.6.32-504.1.3.el6.Bull.66.x86_64
               VERSION: #1 SMP Mon Dec 1 15:32:13 CET 2014
               MACHINE: x86_64  (2700 Mhz)
                MEMORY: 64 GB
                 PANIC: "Kernel panic - not syncing: LBUG"
                   PID: 35296
               COMMAND: "rmmod"
                  TASK: ffff88085dc78040  [THREAD_INFO: ffff88085d736000]
                   CPU: 1
                 STATE: TASK_RUNNING (PANIC)
          
          crash> log | tail -60
          Lustre: Echo OBD driver; http://www.lustre.org/
          ...
          Lustre: Echo OBD driver; http://www.lustre.org/
          slab error in kmem_cache_destroy(): cache `echo_thread_kmem': Can't free all objects
          Pid: 35271, comm: rmmod Not tainted 2.6.32-504.1.3.el6.Bull.66.x86_64 #1
          Call Trace:
           [<ffffffff8117849b>] ? kmem_cache_destroy+0xbb/0xf0
           [<ffffffffa070779d>] ? lu_kmem_fini+0x2d/0x50 [obdclass]
           [<ffffffffa07ea181>] ? echo_client_exit+0x21/0x30 [obdecho]
           [<ffffffffa07ea19e>] ? obdecho_exit+0xe/0x30 [obdecho]
           [<ffffffff810bdf14>] ? sys_delete_module+0x194/0x260
           [<ffffffff81531b7e>] ? do_page_fault+0x3e/0xa0
           [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
          LustreError: 35296:0:(lu_object.c:1406:key_fini()) ASSERTION( atomic_read(&key->lct_used) > 1 ) failed:
          LustreError: 35296:0:(lu_object.c:1406:key_fini()) LBUG
          ...
          
          crash> bt 35296
          PID: 35296  TASK: ffff88085dc78040  CPU: 1   COMMAND: "rmmod"
           #0 [ffff88085d737be0] machine_kexec at ffffffff8103b7eb
           #1 [ffff88085d737c40] crash_kexec at ffffffff810ca852
           #2 [ffff88085d737d10] panic at ffffffff8152ae84
           #3 [ffff88085d737d90] lbug_with_loc at ffffffffa05b7eeb [libcfs]
           #4 [ffff88085d737db0] key_fini at ffffffffa0707bab [obdclass]
           #5 [ffff88085d737de0] lu_context_key_quiesce at ffffffffa0707e8d [obdclass]
           #6 [ffff88085d737e00] lu_context_key_degister at ffffffffa0707f5d [obdclass]
           #7 [ffff88085d737e20] lu_context_key_degister_many at ffffffffa07080e9 [obdclass]
           #8 [ffff88085d737e80] vvp_type_fini at ffffffffa0d08f3f [lustre]
           #9 [ffff88085d737e90] lu_device_type_fini at ffffffffa07065e3 [obdclass]
          #10 [ffff88085d737ea0] ccc_global_fini at ffffffffa0d0522c [lustre]
          #11 [ffff88085d737ec0] vvp_global_fini at ffffffffa0d09d55 [lustre]
          #12 [ffff88085d737ed0] cleanup_module at ffffffffa0d0f60a [lustre]
          #13 [ffff88085d737ee0] sys_delete_module at ffffffff810bdf14
          #14 [ffff88085d737f80] system_call_fastpath at ffffffff8100b072
              RIP: 00007f43dbd61077  RSP: 00007fffbc8ea458  RFLAGS: 00010202
              RAX: 00000000000000b0  RBX: ffffffff8100b072  RCX: 00007fffbc8ea380
              RDX: 00007fffbc8ea44f  RSI: 0000000000000880  RDI: 00007fffbc8ea460
              RBP: 00007fffbc8ea4b0   R8: 00007f43dc21d700   R9: 0000000000000100
              R10: 00007fffbc8ea1e0  R11: 0000000000000202  R12: 00007fffbc8ec6d7
              R13: 0000000000000880  R14: 00007fffbc8ea460  R15: 0000000000000001
              ORIG_RAX: 00000000000000b0  CS: 0033  SS: 002b
          

          crash 2: lu_context_key_degister(): ASSERTION( cfs_atomic_read(&key->lct_used) == 1 ) failed: key has instances: 2

                KERNEL: /usr/lib/debug/lib/modules/2.6.32-504.1.3.el6.Bull.66.x86_64/vmlinux
              DUMPFILE: /var/crash/127.0.0.1-2014-12-11-14:19:24/vmcore  [PARTIAL DUMP]
                  CPUS: 24
                  DATE: Thu Dec 11 14:18:03 2014
                UPTIME: 01:02:10
          LOAD AVERAGE: 1.30, 1.32, 1.23
                 TASKS: 648
              NODENAME: mo48
               RELEASE: 2.6.32-504.1.3.el6.Bull.66.x86_64
               VERSION: #1 SMP Mon Dec 1 15:32:13 CET 2014
               MACHINE: x86_64  (2699 Mhz)
                MEMORY: 64 GB
                 PANIC: "Kernel panic - not syncing: LBUG"
                   PID: 30694
               COMMAND: "rmmod"
                  TASK: ffff88107b2a8040  [THREAD_INFO: ffff881069404000]
                   CPU: 14
                 STATE: TASK_RUNNING (PANIC)
          
          
          crash> log | tail -60
          Lustre: Echo OBD driver; http://www.lustre.org/
          ...
          Lustre: Echo OBD driver; http://www.lustre.org/
          LustreError: 30694:0:(lu_object.c:1442:lu_context_key_degister()) ASSERTION( cfs_atomic_read(&key->lct_used) == 1 ) failed: key has instances: 2
          LustreError: 30694:0:(lu_object.c:1442:lu_context_key_degister()) LBUG
          ...
          
          crash> bt 30694
          PID: 30694  TASK: ffff88107b2a8040  CPU: 14  COMMAND: "rmmod"
           #0 [ffff881069405c30] machine_kexec at ffffffff8103b7eb
           #1 [ffff881069405c90] crash_kexec at ffffffff810ca852
           #2 [ffff881069405d60] panic at ffffffff8152ae84
           #3 [ffff881069405de0] lbug_with_loc at ffffffffa05b7eeb [libcfs]
           #4 [ffff881069405e00] lu_context_key_degister at ffffffffa070801d [obdclass]
           #5 [ffff881069405e20] lu_context_key_degister_many at ffffffffa07080e9 [obdclass]
           #6 [ffff881069405e80] echo_type_fini at ffffffffa07ea8c0 [obdecho]
           #7 [ffff881069405e90] lu_device_type_fini at ffffffffa07065e3 [obdclass]
           #8 [ffff881069405ea0] class_unregister_type at ffffffffa06cf921 [obdclass]
           #9 [ffff881069405ec0] echo_client_exit at ffffffffa07ea175 [obdecho]
          #10 [ffff881069405ed0] cleanup_module at ffffffffa07ea19e [obdecho]
          #11 [ffff881069405ee0] sys_delete_module at ffffffff810bdf14
          #12 [ffff881069405f80] system_call_fastpath at ffffffff8100b072
              RIP: 00007f4b052d0077  RSP: 00007fff962eea48  RFLAGS: 00010206
              RAX: 00000000000000b0  RBX: ffffffff8100b072  RCX: 00007fff962ee970
              RDX: 00007fff962eea3f  RSI: 0000000000000880  RDI: 00007fff962eea50
              RBP: 00007fff962eeaa0   R8: 00007f4b0578c700   R9: 0000000000000100
              R10: 00007fff962ee7d0  R11: 0000000000000206  R12: 00007fff962f06d6
              R13: 0000000000000880  R14: 00007fff962eea50  R15: 0000000000000001
              ORIG_RAX: 00000000000000b0  CS: 0033  SS: 002b
          

          Sometimes we also have the following crash:

          lu_context_exit(): ASSERTION( key != ((void *)0) ) failed:

          Patches to fix this issue are available for lustre 2.1.6, lustre 2.5.3 and master branches.

          patrick.valentin Patrick Valentin (Inactive) added a comment - I was not able to reproduce the same crash (probably a too small window), but with the following script, I was able to reproduce several crashes on a lustre client without any lustre file system mounted. All these crashes are corresponding to failed assertions, because when removing a key with lu_context_key_degister() at module unload, there is no synchronization with lu_context_init/refill/exit/fini. Reproducer: N=0 while true; do N=$((N+1)) D=$(((N%2000)*10)) modprobe obdecho modprobe lustre > /dev/null & usleep $D rmmod obdecho wait rmmod lustre [ -f /tmp/stop ] && break echo -e ".\c" [ $((N%100)) -eq 0 ] && echo "$N: (`date`)" done 2> /dev/null With this reproducer, we have one of the following crashes in a few tenth of minutes to one or two hours (lustre 2.1.6 and lustre 2.5.3). crash 1: key_fini(): ASSERTION( atomic_read(&key->lct_used) > 1 ) failed KERNEL: /usr/lib/debug/lib/modules/2.6.32-504.1.3.el6.Bull.66.x86_64/vmlinux DUMPFILE: /var/crash/127.0.0.1-2014-12-11-13:13:52/vmcore [PARTIAL DUMP] CPUS: 24 DATE: Thu Dec 11 13:12:04 2014 UPTIME: 00:33:04 LOAD AVERAGE: 1.07, 1.22, 1.01 TASKS: 641 NODENAME: mo48 RELEASE: 2.6.32-504.1.3.el6.Bull.66.x86_64 VERSION: #1 SMP Mon Dec 1 15:32:13 CET 2014 MACHINE: x86_64 (2700 Mhz) MEMORY: 64 GB PANIC: "Kernel panic - not syncing: LBUG" PID: 35296 COMMAND: "rmmod" TASK: ffff88085dc78040 [THREAD_INFO: ffff88085d736000] CPU: 1 STATE: TASK_RUNNING (PANIC) crash> log | tail -60 Lustre: Echo OBD driver; http://www.lustre.org/ ... Lustre: Echo OBD driver; http://www.lustre.org/ slab error in kmem_cache_destroy(): cache `echo_thread_kmem': Can't free all objects Pid: 35271, comm: rmmod Not tainted 2.6.32-504.1.3.el6.Bull.66.x86_64 #1 Call Trace: [<ffffffff8117849b>] ? kmem_cache_destroy+0xbb/0xf0 [<ffffffffa070779d>] ? lu_kmem_fini+0x2d/0x50 [obdclass] [<ffffffffa07ea181>] ? echo_client_exit+0x21/0x30 [obdecho] [<ffffffffa07ea19e>] ? obdecho_exit+0xe/0x30 [obdecho] [<ffffffff810bdf14>] ? sys_delete_module+0x194/0x260 [<ffffffff81531b7e>] ? do_page_fault+0x3e/0xa0 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b LustreError: 35296:0:(lu_object.c:1406:key_fini()) ASSERTION( atomic_read(&key->lct_used) > 1 ) failed: LustreError: 35296:0:(lu_object.c:1406:key_fini()) LBUG ... crash> bt 35296 PID: 35296 TASK: ffff88085dc78040 CPU: 1 COMMAND: "rmmod" #0 [ffff88085d737be0] machine_kexec at ffffffff8103b7eb #1 [ffff88085d737c40] crash_kexec at ffffffff810ca852 #2 [ffff88085d737d10] panic at ffffffff8152ae84 #3 [ffff88085d737d90] lbug_with_loc at ffffffffa05b7eeb [libcfs] #4 [ffff88085d737db0] key_fini at ffffffffa0707bab [obdclass] #5 [ffff88085d737de0] lu_context_key_quiesce at ffffffffa0707e8d [obdclass] #6 [ffff88085d737e00] lu_context_key_degister at ffffffffa0707f5d [obdclass] #7 [ffff88085d737e20] lu_context_key_degister_many at ffffffffa07080e9 [obdclass] #8 [ffff88085d737e80] vvp_type_fini at ffffffffa0d08f3f [lustre] #9 [ffff88085d737e90] lu_device_type_fini at ffffffffa07065e3 [obdclass] #10 [ffff88085d737ea0] ccc_global_fini at ffffffffa0d0522c [lustre] #11 [ffff88085d737ec0] vvp_global_fini at ffffffffa0d09d55 [lustre] #12 [ffff88085d737ed0] cleanup_module at ffffffffa0d0f60a [lustre] #13 [ffff88085d737ee0] sys_delete_module at ffffffff810bdf14 #14 [ffff88085d737f80] system_call_fastpath at ffffffff8100b072 RIP: 00007f43dbd61077 RSP: 00007fffbc8ea458 RFLAGS: 00010202 RAX: 00000000000000b0 RBX: ffffffff8100b072 RCX: 00007fffbc8ea380 RDX: 00007fffbc8ea44f RSI: 0000000000000880 RDI: 00007fffbc8ea460 RBP: 00007fffbc8ea4b0 R8: 00007f43dc21d700 R9: 0000000000000100 R10: 00007fffbc8ea1e0 R11: 0000000000000202 R12: 00007fffbc8ec6d7 R13: 0000000000000880 R14: 00007fffbc8ea460 R15: 0000000000000001 ORIG_RAX: 00000000000000b0 CS: 0033 SS: 002b crash 2: lu_context_key_degister(): ASSERTION( cfs_atomic_read(&key->lct_used) == 1 ) failed: key has instances: 2 KERNEL: /usr/lib/debug/lib/modules/2.6.32-504.1.3.el6.Bull.66.x86_64/vmlinux DUMPFILE: /var/crash/127.0.0.1-2014-12-11-14:19:24/vmcore [PARTIAL DUMP] CPUS: 24 DATE: Thu Dec 11 14:18:03 2014 UPTIME: 01:02:10 LOAD AVERAGE: 1.30, 1.32, 1.23 TASKS: 648 NODENAME: mo48 RELEASE: 2.6.32-504.1.3.el6.Bull.66.x86_64 VERSION: #1 SMP Mon Dec 1 15:32:13 CET 2014 MACHINE: x86_64 (2699 Mhz) MEMORY: 64 GB PANIC: "Kernel panic - not syncing: LBUG" PID: 30694 COMMAND: "rmmod" TASK: ffff88107b2a8040 [THREAD_INFO: ffff881069404000] CPU: 14 STATE: TASK_RUNNING (PANIC) crash> log | tail -60 Lustre: Echo OBD driver; http://www.lustre.org/ ... Lustre: Echo OBD driver; http://www.lustre.org/ LustreError: 30694:0:(lu_object.c:1442:lu_context_key_degister()) ASSERTION( cfs_atomic_read(&key->lct_used) == 1 ) failed: key has instances: 2 LustreError: 30694:0:(lu_object.c:1442:lu_context_key_degister()) LBUG ... crash> bt 30694 PID: 30694 TASK: ffff88107b2a8040 CPU: 14 COMMAND: "rmmod" #0 [ffff881069405c30] machine_kexec at ffffffff8103b7eb #1 [ffff881069405c90] crash_kexec at ffffffff810ca852 #2 [ffff881069405d60] panic at ffffffff8152ae84 #3 [ffff881069405de0] lbug_with_loc at ffffffffa05b7eeb [libcfs] #4 [ffff881069405e00] lu_context_key_degister at ffffffffa070801d [obdclass] #5 [ffff881069405e20] lu_context_key_degister_many at ffffffffa07080e9 [obdclass] #6 [ffff881069405e80] echo_type_fini at ffffffffa07ea8c0 [obdecho] #7 [ffff881069405e90] lu_device_type_fini at ffffffffa07065e3 [obdclass] #8 [ffff881069405ea0] class_unregister_type at ffffffffa06cf921 [obdclass] #9 [ffff881069405ec0] echo_client_exit at ffffffffa07ea175 [obdecho] #10 [ffff881069405ed0] cleanup_module at ffffffffa07ea19e [obdecho] #11 [ffff881069405ee0] sys_delete_module at ffffffff810bdf14 #12 [ffff881069405f80] system_call_fastpath at ffffffff8100b072 RIP: 00007f4b052d0077 RSP: 00007fff962eea48 RFLAGS: 00010206 RAX: 00000000000000b0 RBX: ffffffff8100b072 RCX: 00007fff962ee970 RDX: 00007fff962eea3f RSI: 0000000000000880 RDI: 00007fff962eea50 RBP: 00007fff962eeaa0 R8: 00007f4b0578c700 R9: 0000000000000100 R10: 00007fff962ee7d0 R11: 0000000000000206 R12: 00007fff962f06d6 R13: 0000000000000880 R14: 00007fff962eea50 R15: 0000000000000001 ORIG_RAX: 00000000000000b0 CS: 0033 SS: 002b Sometimes we also have the following crash: lu_context_exit(): ASSERTION( key != ((void *)0) ) failed: Patches to fix this issue are available for lustre 2.1.6, lustre 2.5.3 and master branches.

          People

            emoly.liu Emoly Liu
            patrick.valentin Patrick Valentin (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            11 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: