Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-919

Multiple wrong LBUGs checking cfs_atomic_t vars/fields with inacurate poison value of 0x5a5a5a

Details

    • 3
    • 4756

    Description

      At both TGCC and Tera100 we have recently experienced 3 different LBUGs of the same kind/family :

      _ "(genops.c:757:class_export_put()) ASSERTION(cfs_atomic_read(&exp->exp_refcount) < 0x5a5a5a) failed", on a TGCC MDS.

      _ "(genops.c:911:class_import_get()) ASSERTION(cfs_atomic_read(&import->imp_refcount) < 0x5a5a5a) failed", on a T100 Client.

      _ "(genops.c:925:class_import_put()) ASSERTION(cfs_atomic_read(&imp->imp_refcount) < 0x5a5a5a) failed", on an other T100 Client.

      in each case, I have been able to confirm that the value xxx_refcount value triggering the assert was good and not poisoned, but simply reflecting a huge number of references due to high/slow activity.

      Having a look to the concerned sources/code, it seems that all this 3 Assert()s/LBUGs and also 2 others one :

      _ lustre/include/lustre_log.h:449 LASSERT(cfs_atomic_read(&ctxt->loc_refcount) < 0x5a5a5a);
      _ lustre/obdclass/llog_obd.c:139 LASSERT(cfs_atomic_read(&ctxt->loc_refcount) < 0x5a5a5a);

      are wrong regarding the 4 cfs_atomic_t variables/fields they check, and must be at least coded as per the following patch/changes :
      ============================================================
      [root@curie1 lustre-2.0.0.1] # diff -urN lustre/include/lustre_log.h lustre/include/lustre_log.h.bfi
      — lustre/include/lustre_log.h 2010-08-04 13:13:04.000000000 +0200
      +++ lustre/include/lustre_log.h.bfi 2011-12-13 14:38:56.071839517 +0100
      @@ -446,7 +446,7 @@
      if (ctxt == NULL)
      return;
      LASSERT(cfs_atomic_read(&ctxt->loc_refcount) > 0);

      • LASSERT(cfs_atomic_read(&ctxt->loc_refcount) < 0x5a5a5a);
        + LASSERT(cfs_atomic_read(&ctxt->loc_refcount) < LI_POISON);
        CDEBUG(D_INFO, "PUTting ctxt %p : new refcount %d\n", ctxt,
        cfs_atomic_read(&ctxt->loc_refcount) - 1);
        __llog_ctxt_put(ctxt);
        [root@curie1 lustre-2.0.0.1] # diff -urN lustre/obdclass/genops.c lustre/obdclass/genops.c.bfi
          • lustre/obdclass/genops.c 2010-08-04 13:13:03.000000000 +0200
            +++ lustre/obdclass/genops.c.bfi 2011-12-13 14:39:46.961868491 +0100
            @@ -754,7 +754,7 @@
            CDEBUG(D_INFO, "PUTting export %p : new refcount %d\n", exp,
            cfs_atomic_read(&exp->exp_refcount) - 1);
            LASSERT(cfs_atomic_read(&exp->exp_refcount) > 0);
      • LASSERT(cfs_atomic_read(&exp->exp_refcount) < 0x5a5a5a);
        + LASSERT(cfs_atomic_read(&exp->exp_refcount) < LI_POISON);

      if (cfs_atomic_dec_and_test(&exp->exp_refcount)) {
      LASSERT(!cfs_list_empty(&exp->exp_obd_chain));
      @@ -908,7 +908,7 @@
      struct obd_import *class_import_get(struct obd_import *import)
      {
      LASSERT(cfs_atomic_read(&import->imp_refcount) >= 0);

      • LASSERT(cfs_atomic_read(&import->imp_refcount) < 0x5a5a5a);
        + LASSERT(cfs_atomic_read(&import->imp_refcount) < LI_POISON);
        cfs_atomic_inc(&import->imp_refcount);
        CDEBUG(D_INFO, "import %p refcount=%d obd=%s\n", import,
        cfs_atomic_read(&import->imp_refcount),
        @@ -922,7 +922,7 @@
        ENTRY;

      LASSERT(cfs_atomic_read(&imp->imp_refcount) > 0);

      • LASSERT(cfs_atomic_read(&imp->imp_refcount) < 0x5a5a5a);
        + LASSERT(cfs_atomic_read(&imp->imp_refcount) < LI_POISON);
        LASSERT(cfs_list_empty(&imp->imp_zombie_chain));

      CDEBUG(D_INFO, "import %p refcount=%d obd=%s\n", imp,
      [root@curie1 lustre-2.0.0.1] # diff -urN lustre/obdclass/llog_obd.c lustre/obdclass/llog_obd.c.bfi
      — lustre/obdclass/llog_obd.c 2010-08-04 13:13:03.000000000 +0200
      +++ lustre/obdclass/llog_obd.c.bfi 2011-12-13 14:40:13.921587979 +0100
      @@ -136,7 +136,7 @@
      /*

      • Banlance the ctxt get when calling llog_cleanup()
        */
      • LASSERT(cfs_atomic_read(&ctxt->loc_refcount) < 0x5a5a5a);
        + LASSERT(cfs_atomic_read(&ctxt->loc_refcount) < LI_POISON);
        LASSERT(cfs_atomic_read(&ctxt->loc_refcount) > 1);
        llog_ctxt_put(ctxt);

      [root@curie1 lustre-2.0.0.1] #
      ============================================================

      and may be need to be enhanced by checking that more fields in the same struct are not poisoned ...

      Attachments

        Issue Links

          Activity

            [LU-919] Multiple wrong LBUGs checking cfs_atomic_t vars/fields with inacurate poison value of 0x5a5a5a

            I can reproduce this by running rsync of a directory containing a few TBs of unpacked kernel trees. Some time it takes longer and some times it takes shorter to hit the bug but eventually it will, very annoying.

            wjt27 Wojciech Turek added a comment - I can reproduce this by running rsync of a directory containing a few TBs of unpacked kernel trees. Some time it takes longer and some times it takes shorter to hit the bug but eventually it will, very annoying.

            This still seem to be broken in 1.8.8 client

            LustreError: 31122:0:(genops.c:803:class_import_get()) ASSERTION(atomic_read(&import->imp_refcount) < 0x5a5a5a) failed
            LustreError: 31122:0:(genops.c:803:class_import_get()) LBUG
            Sep 3 13:44:01 Pid: 31122, comm: rsync
            sand-1-3 kernel:
            LustreError: 31Call Trace:
            122:0:(genops.c:BUG: unable to handle kernel 803:class_impNortULL pointer dereference_get()) ASSERTIO at (null)
            N(atomic_read(&iIP: [<(null)>] (null)
            mport->imp_refcoPGD 1a68aa067 PUD 449eea067 unt) < 0x5a5a5a)PMD 0
            failed
            Sep 3 1Oops: 0010 1 SMP
            3:44:01 sand-1-3last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:03:00.0/infiniband/mlx4_0/ports/1/link_layer
            kernel: LustreECPU 4
            rror: 31122:0:(gModules linked in:enops.c:803:c lasmgc(U)s_import_get()) lustre(U)LBUG
            Sep 3 13:4 lov(U)4:01 sand-1-3 kemdc(U)rnel: Pid: 31122 lquota(U), comm: rsync
            Se osc(U) ko2iblndp 3 13:44:01 sa(U) ptlrpcnd-1-3 kernel:
            (U) obdclassSep 3 13:44:01 (U) lnetsand-1-3 kernel:(U) lvfs Call Trace:
            (U) libcfs(U) nfsd exportfs acpi_cpufreq freq_table mperf rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ib_uverbs(U) ib_umad(U) mlx4_ib(U) ib_mad(U) ib_core(U) mlx4_core(U) ext4
            mbcache jbd2 dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm wmi microcode dcdbas sg sb_edac edac_core i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ioatdma ipv6 sd_mod c
            rc_t10dif ahci igb dca nfs lockd fscache nfs_acl auth_rpcgss sunrpc [last unloaded: scsi_wait_scan]

            Pid: 31122, comm: rsync Not tainted 2.6.32-220.23.1.el6.x86_64 #1 Dell Inc. PowerEdge C6220/0WTH3T
            RIP: 0010:[<0000000000000000>] [<(null)>] (null)
            RSP: 0018:ffff8804d52b5440 EFLAGS: 00010246
            RAX: ffff8804d52b54a4 RBX: ffff8804d52b5498 RCX: ffffffffa045c320
            RDX: ffff8804d52b54d8 RSI: ffff8804d52b5498 RDI: ffff8804d52b4000
            RBP: ffff8804d52b54d8 R08: 0000000000000000 R09: 0000000000000000
            R10: 0000000000000001 R11: 0000000000000000 R12: 000000000000cbe0
            R13: ffffffffa045c320 R14: 0000000000000000 R15: ffff880048683fc0
            FS: 00007f030976a700(0000) GS:ffff880048680000(0000) knlGS:0000000000000000
            CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            CR2: 0000000000000000 CR3: 00000008a1a22000 CR4: 00000000000406e0
            DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
            Process rsync (pid: 31122, threadinfo ffff8804d52b4000, task ffff8806bac29580)
            Stack:
            ffffffff8100e520 ffff8804d52b54a4 ffff8806bac29580 ffffffffa0545210
            <0> 00000000a05491e8 ffff8804d52b4000 ffff8804d52b5fd8 ffff8804d52b4000
            <0> 0000000000000004 ffff880048680000 ffff8804d52b54d8 ffff8804d52b54a8
            Call Trace:
            [<ffffffff8100e520>] ? dump_trace+0x190/0x3b0
            [<ffffffffa04508f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]

            wjt27 Wojciech Turek added a comment - This still seem to be broken in 1.8.8 client LustreError: 31122:0:(genops.c:803:class_import_get()) ASSERTION(atomic_read(&import->imp_refcount) < 0x5a5a5a) failed LustreError: 31122:0:(genops.c:803:class_import_get()) LBUG Sep 3 13:44:01 Pid: 31122, comm: rsync sand-1-3 kernel: LustreError: 31Call Trace: 122:0:(genops.c:BUG: unable to handle kernel 803:class_impNortULL pointer dereference_get()) ASSERTIO at (null) N(atomic_read(&iIP: [<(null)>] (null) mport->imp_refcoPGD 1a68aa067 PUD 449eea067 unt) < 0x5a5a5a)PMD 0 failed Sep 3 1Oops: 0010 1 SMP 3:44:01 sand-1-3last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:03:00.0/infiniband/mlx4_0/ports/1/link_layer kernel: LustreECPU 4 rror: 31122:0:(gModules linked in:enops.c:803:c lasmgc(U)s_import_get()) lustre(U)LBUG Sep 3 13:4 lov(U)4:01 sand-1-3 kemdc(U)rnel: Pid: 31122 lquota(U), comm: rsync Se osc(U) ko2iblndp 3 13:44:01 sa(U) ptlrpcnd-1-3 kernel: (U) obdclassSep 3 13:44:01 (U) lnetsand-1-3 kernel:(U) lvfs Call Trace: (U) libcfs(U) nfsd exportfs acpi_cpufreq freq_table mperf rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ib_uverbs(U) ib_umad(U) mlx4_ib(U) ib_mad(U) ib_core(U) mlx4_core(U) ext4 mbcache jbd2 dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm wmi microcode dcdbas sg sb_edac edac_core i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ioatdma ipv6 sd_mod c rc_t10dif ahci igb dca nfs lockd fscache nfs_acl auth_rpcgss sunrpc [last unloaded: scsi_wait_scan] Pid: 31122, comm: rsync Not tainted 2.6.32-220.23.1.el6.x86_64 #1 Dell Inc. PowerEdge C6220/0WTH3T RIP: 0010: [<0000000000000000>] [<(null)>] (null) RSP: 0018:ffff8804d52b5440 EFLAGS: 00010246 RAX: ffff8804d52b54a4 RBX: ffff8804d52b5498 RCX: ffffffffa045c320 RDX: ffff8804d52b54d8 RSI: ffff8804d52b5498 RDI: ffff8804d52b4000 RBP: ffff8804d52b54d8 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000000 R12: 000000000000cbe0 R13: ffffffffa045c320 R14: 0000000000000000 R15: ffff880048683fc0 FS: 00007f030976a700(0000) GS:ffff880048680000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 00000008a1a22000 CR4: 00000000000406e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process rsync (pid: 31122, threadinfo ffff8804d52b4000, task ffff8806bac29580) Stack: ffffffff8100e520 ffff8804d52b54a4 ffff8806bac29580 ffffffffa0545210 <0> 00000000a05491e8 ffff8804d52b4000 ffff8804d52b5fd8 ffff8804d52b4000 <0> 0000000000000004 ffff880048680000 ffff8804d52b54d8 ffff8804d52b54a8 Call Trace: [<ffffffff8100e520>] ? dump_trace+0x190/0x3b0 [<ffffffffa04508f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]

            Integrated in lustre-master » i686,client,el6,ofa #480
            LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)

            Result = ABORTED
            Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
            Files :

            • lustre/include/lustre_log.h
            • lustre/obdclass/genops.c
            • lustre/obdclass/llog_obd.c
            hudson Build Master (Inactive) added a comment - Integrated in lustre-master » i686,client,el6,ofa #480 LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942) Result = ABORTED Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942 Files : lustre/include/lustre_log.h lustre/obdclass/genops.c lustre/obdclass/llog_obd.c

            Integrated in lustre-master » x86_64,client,el6,ofa #480
            LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)

            Result = FAILURE
            Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
            Files :

            • lustre/include/lustre_log.h
            • lustre/obdclass/llog_obd.c
            • lustre/obdclass/genops.c
            hudson Build Master (Inactive) added a comment - Integrated in lustre-master » x86_64,client,el6,ofa #480 LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942) Result = FAILURE Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942 Files : lustre/include/lustre_log.h lustre/obdclass/llog_obd.c lustre/obdclass/genops.c

            Integrated in lustre-master » x86_64,server,el6,ofa #480
            LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)

            Result = FAILURE
            Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
            Files :

            • lustre/obdclass/llog_obd.c
            • lustre/obdclass/genops.c
            • lustre/include/lustre_log.h
            hudson Build Master (Inactive) added a comment - Integrated in lustre-master » x86_64,server,el6,ofa #480 LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942) Result = FAILURE Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942 Files : lustre/obdclass/llog_obd.c lustre/obdclass/genops.c lustre/include/lustre_log.h

            Integrated in lustre-master » i686,client,el5,ofa #445
            LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)

            Result = SUCCESS
            Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
            Files :

            • lustre/obdclass/llog_obd.c
            • lustre/include/lustre_log.h
            • lustre/obdclass/genops.c
            hudson Build Master (Inactive) added a comment - Integrated in lustre-master » i686,client,el5,ofa #445 LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942) Result = SUCCESS Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942 Files : lustre/obdclass/llog_obd.c lustre/include/lustre_log.h lustre/obdclass/genops.c

            Integrated in lustre-master » i686,client,el5,inkernel #445
            LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)

            Result = SUCCESS
            Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
            Files :

            • lustre/obdclass/genops.c
            • lustre/include/lustre_log.h
            • lustre/obdclass/llog_obd.c
            hudson Build Master (Inactive) added a comment - Integrated in lustre-master » i686,client,el5,inkernel #445 LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942) Result = SUCCESS Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942 Files : lustre/obdclass/genops.c lustre/include/lustre_log.h lustre/obdclass/llog_obd.c

            People

              niu Niu Yawei (Inactive)
              louveta Alexandre Louvet (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: