[LU-919] Multiple wrong LBUGs checking cfs_atomic_t vars/fields with inacurate poison value of 0x5a5a5a Created: 13/Dec/11  Updated: 22/Feb/13  Resolved: 25/Jan/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.0.0
Fix Version/s: Lustre 2.2.0, Lustre 2.1.1, Lustre 1.8.9

Type: Bug Priority: Major
Reporter: Alexandre Louvet Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-1765 client LBUGs with ASSERTION(atomic_re... Resolved
Severity: 3
Rank (Obsolete): 4756

 Description   

At both TGCC and Tera100 we have recently experienced 3 different LBUGs of the same kind/family :

_ "(genops.c:757:class_export_put()) ASSERTION(cfs_atomic_read(&exp->exp_refcount) < 0x5a5a5a) failed", on a TGCC MDS.

_ "(genops.c:911:class_import_get()) ASSERTION(cfs_atomic_read(&import->imp_refcount) < 0x5a5a5a) failed", on a T100 Client.

_ "(genops.c:925:class_import_put()) ASSERTION(cfs_atomic_read(&imp->imp_refcount) < 0x5a5a5a) failed", on an other T100 Client.

in each case, I have been able to confirm that the value xxx_refcount value triggering the assert was good and not poisoned, but simply reflecting a huge number of references due to high/slow activity.

Having a look to the concerned sources/code, it seems that all this 3 Assert()s/LBUGs and also 2 others one :

_ lustre/include/lustre_log.h:449 LASSERT(cfs_atomic_read(&ctxt->loc_refcount) < 0x5a5a5a);
_ lustre/obdclass/llog_obd.c:139 LASSERT(cfs_atomic_read(&ctxt->loc_refcount) < 0x5a5a5a);

are wrong regarding the 4 cfs_atomic_t variables/fields they check, and must be at least coded as per the following patch/changes :
============================================================
[root@curie1 lustre-2.0.0.1] # diff -urN lustre/include/lustre_log.h lustre/include/lustre_log.h.bfi
— lustre/include/lustre_log.h 2010-08-04 13:13:04.000000000 +0200
+++ lustre/include/lustre_log.h.bfi 2011-12-13 14:38:56.071839517 +0100
@@ -446,7 +446,7 @@
if (ctxt == NULL)
return;
LASSERT(cfs_atomic_read(&ctxt->loc_refcount) > 0);

  • LASSERT(cfs_atomic_read(&ctxt->loc_refcount) < 0x5a5a5a);
    + LASSERT(cfs_atomic_read(&ctxt->loc_refcount) < LI_POISON);
    CDEBUG(D_INFO, "PUTting ctxt %p : new refcount %d\n", ctxt,
    cfs_atomic_read(&ctxt->loc_refcount) - 1);
    __llog_ctxt_put(ctxt);
    [root@curie1 lustre-2.0.0.1] # diff -urN lustre/obdclass/genops.c lustre/obdclass/genops.c.bfi
      • lustre/obdclass/genops.c 2010-08-04 13:13:03.000000000 +0200
        +++ lustre/obdclass/genops.c.bfi 2011-12-13 14:39:46.961868491 +0100
        @@ -754,7 +754,7 @@
        CDEBUG(D_INFO, "PUTting export %p : new refcount %d\n", exp,
        cfs_atomic_read(&exp->exp_refcount) - 1);
        LASSERT(cfs_atomic_read(&exp->exp_refcount) > 0);
  • LASSERT(cfs_atomic_read(&exp->exp_refcount) < 0x5a5a5a);
    + LASSERT(cfs_atomic_read(&exp->exp_refcount) < LI_POISON);

if (cfs_atomic_dec_and_test(&exp->exp_refcount)) {
LASSERT(!cfs_list_empty(&exp->exp_obd_chain));
@@ -908,7 +908,7 @@
struct obd_import *class_import_get(struct obd_import *import)
{
LASSERT(cfs_atomic_read(&import->imp_refcount) >= 0);

  • LASSERT(cfs_atomic_read(&import->imp_refcount) < 0x5a5a5a);
    + LASSERT(cfs_atomic_read(&import->imp_refcount) < LI_POISON);
    cfs_atomic_inc(&import->imp_refcount);
    CDEBUG(D_INFO, "import %p refcount=%d obd=%s\n", import,
    cfs_atomic_read(&import->imp_refcount),
    @@ -922,7 +922,7 @@
    ENTRY;

LASSERT(cfs_atomic_read(&imp->imp_refcount) > 0);

  • LASSERT(cfs_atomic_read(&imp->imp_refcount) < 0x5a5a5a);
    + LASSERT(cfs_atomic_read(&imp->imp_refcount) < LI_POISON);
    LASSERT(cfs_list_empty(&imp->imp_zombie_chain));

CDEBUG(D_INFO, "import %p refcount=%d obd=%s\n", imp,
[root@curie1 lustre-2.0.0.1] # diff -urN lustre/obdclass/llog_obd.c lustre/obdclass/llog_obd.c.bfi
— lustre/obdclass/llog_obd.c 2010-08-04 13:13:03.000000000 +0200
+++ lustre/obdclass/llog_obd.c.bfi 2011-12-13 14:40:13.921587979 +0100
@@ -136,7 +136,7 @@
/*

  • Banlance the ctxt get when calling llog_cleanup()
    */
  • LASSERT(cfs_atomic_read(&ctxt->loc_refcount) < 0x5a5a5a);
    + LASSERT(cfs_atomic_read(&ctxt->loc_refcount) < LI_POISON);
    LASSERT(cfs_atomic_read(&ctxt->loc_refcount) > 1);
    llog_ctxt_put(ctxt);

[root@curie1 lustre-2.0.0.1] #
============================================================

and may be need to be enhanced by checking that more fields in the same struct are not poisoned ...



 Comments   
Comment by Peter Jones [ 13/Dec/11 ]

Niu

Could you please look into this one?

Thanks

Peter

Comment by Niu Yawei (Inactive) [ 13/Dec/11 ]

Yes, those hardcoded 0x5a5a5a should be replaced by LI_POISION.

Hi, Bruno: Could you put the patch on gerrit for review? Thanks.

Comment by Thomas Roth [ 03/Jan/12 ]

At GSI, we are seeing this on Lustre 1.8.4, on the MDS:
(client.c:178:ptlrpc_free_bulk()) ASSERTION(atomic_read(&(desc->bd_export)->exp_refcount) < 0x5a5a5a) failed
(ldlm_lock.c:165:ldlm_lock_put()) ASSERTION(atomic_read(&(lock->l_export)->exp_refcount) < 0x5a5a5a) failed
(service.c:869:ptlrpc_at_send_early_reply()) ASSERTION(atomic_read(&(reqcopy->rq_export)->exp_refcount) < 0x5a5a5a) failed
(ldlm_lib.c:849:target_handle_connect()) ASSERTION(atomic_read(&(export)->exp_refcount) < 0x5a5a5a) failed
(mds_reint.c:72:mds_commit_cb()) ASSERTION(atomic_read(&(exp)->exp_refcount) < 0x5a5a5a) failed

Comment by Niu Yawei (Inactive) [ 10/Jan/12 ]

patch for master: http://review.whamcloud.com/1953

Comment by Niu Yawei (Inactive) [ 10/Jan/12 ]

patch for b1_8: http://review.whamcloud.com/1954

Comment by Build Master (Inactive) [ 25/Jan/12 ]

Integrated in lustre-master » x86_64,client,el5,ofa #445
LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)

Result = SUCCESS
Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
Files :

  • lustre/obdclass/llog_obd.c
  • lustre/include/lustre_log.h
  • lustre/obdclass/genops.c
Comment by Peter Jones [ 25/Jan/12 ]

Landed for 2.2

Comment by Build Master (Inactive) [ 25/Jan/12 ]

Integrated in lustre-master » x86_64,client,el5,inkernel #445
LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)

Result = SUCCESS
Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
Files :

  • lustre/obdclass/llog_obd.c
  • lustre/obdclass/genops.c
  • lustre/include/lustre_log.h
Comment by Build Master (Inactive) [ 25/Jan/12 ]

Integrated in lustre-master » i686,client,el6,inkernel #445
LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)

Result = SUCCESS
Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
Files :

  • lustre/obdclass/genops.c
  • lustre/include/lustre_log.h
  • lustre/obdclass/llog_obd.c
Comment by Build Master (Inactive) [ 25/Jan/12 ]

Integrated in lustre-master » x86_64,server,el6,inkernel #445
LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)

Result = SUCCESS
Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
Files :

  • lustre/include/lustre_log.h
  • lustre/obdclass/genops.c
  • lustre/obdclass/llog_obd.c
Comment by Build Master (Inactive) [ 25/Jan/12 ]

Integrated in lustre-master » x86_64,client,sles11,inkernel #445
LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)

Result = SUCCESS
Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
Files :

  • lustre/obdclass/llog_obd.c
  • lustre/obdclass/genops.c
  • lustre/include/lustre_log.h
Comment by Build Master (Inactive) [ 25/Jan/12 ]

Integrated in lustre-master » x86_64,server,el5,inkernel #445
LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)

Result = SUCCESS
Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
Files :

  • lustre/include/lustre_log.h
  • lustre/obdclass/genops.c
  • lustre/obdclass/llog_obd.c
Comment by Build Master (Inactive) [ 25/Jan/12 ]

Integrated in lustre-master » x86_64,server,el5,ofa #445
LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)

Result = SUCCESS
Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
Files :

  • lustre/obdclass/genops.c
  • lustre/obdclass/llog_obd.c
  • lustre/include/lustre_log.h
Comment by Build Master (Inactive) [ 25/Jan/12 ]

Integrated in lustre-master » x86_64,client,el6,inkernel #445
LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)

Result = SUCCESS
Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
Files :

  • lustre/obdclass/genops.c
  • lustre/obdclass/llog_obd.c
  • lustre/include/lustre_log.h
Comment by Build Master (Inactive) [ 25/Jan/12 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #445
LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)

Result = SUCCESS
Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
Files :

  • lustre/obdclass/llog_obd.c
  • lustre/obdclass/genops.c
  • lustre/include/lustre_log.h
Comment by Build Master (Inactive) [ 25/Jan/12 ]

Integrated in lustre-master » i686,server,el5,ofa #445
LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)

Result = SUCCESS
Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
Files :

  • lustre/include/lustre_log.h
  • lustre/obdclass/llog_obd.c
  • lustre/obdclass/genops.c
Comment by Build Master (Inactive) [ 25/Jan/12 ]

Integrated in lustre-master » i686,server,el5,inkernel #445
LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)

Result = SUCCESS
Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
Files :

  • lustre/obdclass/llog_obd.c
  • lustre/obdclass/genops.c
  • lustre/include/lustre_log.h
Comment by Build Master (Inactive) [ 25/Jan/12 ]

Integrated in lustre-master » i686,server,el6,inkernel #445
LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)

Result = SUCCESS
Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
Files :

  • lustre/obdclass/llog_obd.c
  • lustre/include/lustre_log.h
  • lustre/obdclass/genops.c
Comment by Aurelien Degremont (Inactive) [ 25/Jan/12 ]

I think this makes sense for Lustre 2.1.1 ?

Comment by Build Master (Inactive) [ 25/Jan/12 ]

Integrated in lustre-master » i686,client,el5,inkernel #445
LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)

Result = SUCCESS
Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
Files :

  • lustre/obdclass/genops.c
  • lustre/include/lustre_log.h
  • lustre/obdclass/llog_obd.c
Comment by Build Master (Inactive) [ 25/Jan/12 ]

Integrated in lustre-master » i686,client,el5,ofa #445
LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)

Result = SUCCESS
Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
Files :

  • lustre/obdclass/llog_obd.c
  • lustre/include/lustre_log.h
  • lustre/obdclass/genops.c
Comment by Build Master (Inactive) [ 17/Feb/12 ]

Integrated in lustre-master » x86_64,server,el6,ofa #480
LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)

Result = FAILURE
Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
Files :

  • lustre/obdclass/llog_obd.c
  • lustre/obdclass/genops.c
  • lustre/include/lustre_log.h
Comment by Build Master (Inactive) [ 17/Feb/12 ]

Integrated in lustre-master » x86_64,client,el6,ofa #480
LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)

Result = FAILURE
Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
Files :

  • lustre/include/lustre_log.h
  • lustre/obdclass/llog_obd.c
  • lustre/obdclass/genops.c
Comment by Build Master (Inactive) [ 17/Feb/12 ]

Integrated in lustre-master » i686,client,el6,ofa #480
LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)

Result = ABORTED
Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
Files :

  • lustre/include/lustre_log.h
  • lustre/obdclass/genops.c
  • lustre/obdclass/llog_obd.c
Comment by Wojciech Turek (Inactive) [ 03/Sep/12 ]

This still seem to be broken in 1.8.8 client

LustreError: 31122:0:(genops.c:803:class_import_get()) ASSERTION(atomic_read(&import->imp_refcount) < 0x5a5a5a) failed
LustreError: 31122:0:(genops.c:803:class_import_get()) LBUG
Sep 3 13:44:01 Pid: 31122, comm: rsync
sand-1-3 kernel:
LustreError: 31Call Trace:
122:0:(genops.c:BUG: unable to handle kernel 803:class_impNortULL pointer dereference_get()) ASSERTIO at (null)
N(atomic_read(&iIP: [<(null)>] (null)
mport->imp_refcoPGD 1a68aa067 PUD 449eea067 unt) < 0x5a5a5a)PMD 0
failed
Sep 3 1Oops: 0010 1 SMP
3:44:01 sand-1-3last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:03:00.0/infiniband/mlx4_0/ports/1/link_layer
kernel: LustreECPU 4
rror: 31122:0:(gModules linked in:enops.c:803:c lasmgc(U)s_import_get()) lustre(U)LBUG
Sep 3 13:4 lov(U)4:01 sand-1-3 kemdc(U)rnel: Pid: 31122 lquota(U), comm: rsync
Se osc(U) ko2iblndp 3 13:44:01 sa(U) ptlrpcnd-1-3 kernel:
(U) obdclassSep 3 13:44:01 (U) lnetsand-1-3 kernel:(U) lvfs Call Trace:
(U) libcfs(U) nfsd exportfs acpi_cpufreq freq_table mperf rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ib_uverbs(U) ib_umad(U) mlx4_ib(U) ib_mad(U) ib_core(U) mlx4_core(U) ext4
mbcache jbd2 dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm wmi microcode dcdbas sg sb_edac edac_core i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ioatdma ipv6 sd_mod c
rc_t10dif ahci igb dca nfs lockd fscache nfs_acl auth_rpcgss sunrpc [last unloaded: scsi_wait_scan]

Pid: 31122, comm: rsync Not tainted 2.6.32-220.23.1.el6.x86_64 #1 Dell Inc. PowerEdge C6220/0WTH3T
RIP: 0010:[<0000000000000000>] [<(null)>] (null)
RSP: 0018:ffff8804d52b5440 EFLAGS: 00010246
RAX: ffff8804d52b54a4 RBX: ffff8804d52b5498 RCX: ffffffffa045c320
RDX: ffff8804d52b54d8 RSI: ffff8804d52b5498 RDI: ffff8804d52b4000
RBP: ffff8804d52b54d8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: 000000000000cbe0
R13: ffffffffa045c320 R14: 0000000000000000 R15: ffff880048683fc0
FS: 00007f030976a700(0000) GS:ffff880048680000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000008a1a22000 CR4: 00000000000406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process rsync (pid: 31122, threadinfo ffff8804d52b4000, task ffff8806bac29580)
Stack:
ffffffff8100e520 ffff8804d52b54a4 ffff8806bac29580 ffffffffa0545210
<0> 00000000a05491e8 ffff8804d52b4000 ffff8804d52b5fd8 ffff8804d52b4000
<0> 0000000000000004 ffff880048680000 ffff8804d52b54d8 ffff8804d52b54a8
Call Trace:
[<ffffffff8100e520>] ? dump_trace+0x190/0x3b0
[<ffffffffa04508f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]

Comment by Wojciech Turek (Inactive) [ 05/Sep/12 ]

I can reproduce this by running rsync of a directory containing a few TBs of unpacked kernel trees. Some time it takes longer and some times it takes shorter to hit the bug but eventually it will, very annoying.

Generated at Sat Feb 10 05:50:45 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.