|
At both TGCC and Tera100 we have recently experienced 3 different LBUGs of the same kind/family :
_ "(genops.c:757:class_export_put()) ASSERTION(cfs_atomic_read(&exp->exp_refcount) < 0x5a5a5a) failed", on a TGCC MDS.
_ "(genops.c:911:class_import_get()) ASSERTION(cfs_atomic_read(&import->imp_refcount) < 0x5a5a5a) failed", on a T100 Client.
_ "(genops.c:925:class_import_put()) ASSERTION(cfs_atomic_read(&imp->imp_refcount) < 0x5a5a5a) failed", on an other T100 Client.
in each case, I have been able to confirm that the value xxx_refcount value triggering the assert was good and not poisoned, but simply reflecting a huge number of references due to high/slow activity.
Having a look to the concerned sources/code, it seems that all this 3 Assert()s/LBUGs and also 2 others one :
_ lustre/include/lustre_log.h:449 LASSERT(cfs_atomic_read(&ctxt->loc_refcount) < 0x5a5a5a);
_ lustre/obdclass/llog_obd.c:139 LASSERT(cfs_atomic_read(&ctxt->loc_refcount) < 0x5a5a5a);
are wrong regarding the 4 cfs_atomic_t variables/fields they check, and must be at least coded as per the following patch/changes :
============================================================
[root@curie1 lustre-2.0.0.1] # diff -urN lustre/include/lustre_log.h lustre/include/lustre_log.h.bfi
— lustre/include/lustre_log.h 2010-08-04 13:13:04.000000000 +0200
+++ lustre/include/lustre_log.h.bfi 2011-12-13 14:38:56.071839517 +0100
@@ -446,7 +446,7 @@
if (ctxt == NULL)
return;
LASSERT(cfs_atomic_read(&ctxt->loc_refcount) > 0);
- LASSERT(cfs_atomic_read(&ctxt->loc_refcount) < 0x5a5a5a);
+ LASSERT(cfs_atomic_read(&ctxt->loc_refcount) < LI_POISON);
CDEBUG(D_INFO, "PUTting ctxt %p : new refcount %d\n", ctxt,
cfs_atomic_read(&ctxt->loc_refcount) - 1);
__llog_ctxt_put(ctxt);
[root@curie1 lustre-2.0.0.1] # diff -urN lustre/obdclass/genops.c lustre/obdclass/genops.c.bfi
-
- lustre/obdclass/genops.c 2010-08-04 13:13:03.000000000 +0200
+++ lustre/obdclass/genops.c.bfi 2011-12-13 14:39:46.961868491 +0100
@@ -754,7 +754,7 @@
CDEBUG(D_INFO, "PUTting export %p : new refcount %d\n", exp,
cfs_atomic_read(&exp->exp_refcount) - 1);
LASSERT(cfs_atomic_read(&exp->exp_refcount) > 0);
- LASSERT(cfs_atomic_read(&exp->exp_refcount) < 0x5a5a5a);
+ LASSERT(cfs_atomic_read(&exp->exp_refcount) < LI_POISON);
if (cfs_atomic_dec_and_test(&exp->exp_refcount)) {
LASSERT(!cfs_list_empty(&exp->exp_obd_chain));
@@ -908,7 +908,7 @@
struct obd_import *class_import_get(struct obd_import *import)
{
LASSERT(cfs_atomic_read(&import->imp_refcount) >= 0);
- LASSERT(cfs_atomic_read(&import->imp_refcount) < 0x5a5a5a);
+ LASSERT(cfs_atomic_read(&import->imp_refcount) < LI_POISON);
cfs_atomic_inc(&import->imp_refcount);
CDEBUG(D_INFO, "import %p refcount=%d obd=%s\n", import,
cfs_atomic_read(&import->imp_refcount),
@@ -922,7 +922,7 @@
ENTRY;
LASSERT(cfs_atomic_read(&imp->imp_refcount) > 0);
- LASSERT(cfs_atomic_read(&imp->imp_refcount) < 0x5a5a5a);
+ LASSERT(cfs_atomic_read(&imp->imp_refcount) < LI_POISON);
LASSERT(cfs_list_empty(&imp->imp_zombie_chain));
CDEBUG(D_INFO, "import %p refcount=%d obd=%s\n", imp,
[root@curie1 lustre-2.0.0.1] # diff -urN lustre/obdclass/llog_obd.c lustre/obdclass/llog_obd.c.bfi
— lustre/obdclass/llog_obd.c 2010-08-04 13:13:03.000000000 +0200
+++ lustre/obdclass/llog_obd.c.bfi 2011-12-13 14:40:13.921587979 +0100
@@ -136,7 +136,7 @@
/*
- Banlance the ctxt get when calling llog_cleanup()
*/
- LASSERT(cfs_atomic_read(&ctxt->loc_refcount) < 0x5a5a5a);
+ LASSERT(cfs_atomic_read(&ctxt->loc_refcount) < LI_POISON);
LASSERT(cfs_atomic_read(&ctxt->loc_refcount) > 1);
llog_ctxt_put(ctxt);
[root@curie1 lustre-2.0.0.1] #
============================================================
and may be need to be enhanced by checking that more fields in the same struct are not poisoned ...
|
|
Niu
Could you please look into this one?
Thanks
Peter
|
|
Yes, those hardcoded 0x5a5a5a should be replaced by LI_POISION.
Hi, Bruno: Could you put the patch on gerrit for review? Thanks.
|
|
At GSI, we are seeing this on Lustre 1.8.4, on the MDS:
(client.c:178:ptlrpc_free_bulk()) ASSERTION(atomic_read(&(desc->bd_export)->exp_refcount) < 0x5a5a5a) failed
(ldlm_lock.c:165:ldlm_lock_put()) ASSERTION(atomic_read(&(lock->l_export)->exp_refcount) < 0x5a5a5a) failed
(service.c:869:ptlrpc_at_send_early_reply()) ASSERTION(atomic_read(&(reqcopy->rq_export)->exp_refcount) < 0x5a5a5a) failed
(ldlm_lib.c:849:target_handle_connect()) ASSERTION(atomic_read(&(export)->exp_refcount) < 0x5a5a5a) failed
(mds_reint.c:72:mds_commit_cb()) ASSERTION(atomic_read(&(exp)->exp_refcount) < 0x5a5a5a) failed
|
|
patch for master: http://review.whamcloud.com/1953
|
|
patch for b1_8: http://review.whamcloud.com/1954
|
|
Integrated in lustre-master » x86_64,client,el5,ofa #445
LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)
Result = SUCCESS
Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
Files :
- lustre/obdclass/llog_obd.c
- lustre/include/lustre_log.h
- lustre/obdclass/genops.c
|
|
Landed for 2.2
|
|
Integrated in lustre-master » x86_64,client,el5,inkernel #445
LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)
Result = SUCCESS
Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
Files :
- lustre/obdclass/llog_obd.c
- lustre/obdclass/genops.c
- lustre/include/lustre_log.h
|
|
Integrated in lustre-master » i686,client,el6,inkernel #445
LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)
Result = SUCCESS
Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
Files :
- lustre/obdclass/genops.c
- lustre/include/lustre_log.h
- lustre/obdclass/llog_obd.c
|
|
Integrated in lustre-master » x86_64,server,el6,inkernel #445
LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)
Result = SUCCESS
Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
Files :
- lustre/include/lustre_log.h
- lustre/obdclass/genops.c
- lustre/obdclass/llog_obd.c
|
|
Integrated in lustre-master » x86_64,client,sles11,inkernel #445
LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)
Result = SUCCESS
Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
Files :
- lustre/obdclass/llog_obd.c
- lustre/obdclass/genops.c
- lustre/include/lustre_log.h
|
|
Integrated in lustre-master » x86_64,server,el5,inkernel #445
LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)
Result = SUCCESS
Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
Files :
- lustre/include/lustre_log.h
- lustre/obdclass/genops.c
- lustre/obdclass/llog_obd.c
|
|
Integrated in lustre-master » x86_64,server,el5,ofa #445
LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)
Result = SUCCESS
Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
Files :
- lustre/obdclass/genops.c
- lustre/obdclass/llog_obd.c
- lustre/include/lustre_log.h
|
|
Integrated in lustre-master » x86_64,client,el6,inkernel #445
LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)
Result = SUCCESS
Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
Files :
- lustre/obdclass/genops.c
- lustre/obdclass/llog_obd.c
- lustre/include/lustre_log.h
|
|
Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #445
LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)
Result = SUCCESS
Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
Files :
- lustre/obdclass/llog_obd.c
- lustre/obdclass/genops.c
- lustre/include/lustre_log.h
|
|
Integrated in lustre-master » i686,server,el5,ofa #445
LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)
Result = SUCCESS
Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
Files :
- lustre/include/lustre_log.h
- lustre/obdclass/llog_obd.c
- lustre/obdclass/genops.c
|
|
Integrated in lustre-master » i686,server,el5,inkernel #445
LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)
Result = SUCCESS
Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
Files :
- lustre/obdclass/llog_obd.c
- lustre/obdclass/genops.c
- lustre/include/lustre_log.h
|
|
Integrated in lustre-master » i686,server,el6,inkernel #445
LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)
Result = SUCCESS
Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
Files :
- lustre/obdclass/llog_obd.c
- lustre/include/lustre_log.h
- lustre/obdclass/genops.c
|
|
I think this makes sense for Lustre 2.1.1 ?
|
|
Integrated in lustre-master » i686,client,el5,inkernel #445
LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)
Result = SUCCESS
Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
Files :
- lustre/obdclass/genops.c
- lustre/include/lustre_log.h
- lustre/obdclass/llog_obd.c
|
|
Integrated in lustre-master » i686,client,el5,ofa #445
LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)
Result = SUCCESS
Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
Files :
- lustre/obdclass/llog_obd.c
- lustre/include/lustre_log.h
- lustre/obdclass/genops.c
|
|
Integrated in lustre-master » x86_64,server,el6,ofa #480
LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)
Result = FAILURE
Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
Files :
- lustre/obdclass/llog_obd.c
- lustre/obdclass/genops.c
- lustre/include/lustre_log.h
|
|
Integrated in lustre-master » x86_64,client,el6,ofa #480
LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)
Result = FAILURE
Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
Files :
- lustre/include/lustre_log.h
- lustre/obdclass/llog_obd.c
- lustre/obdclass/genops.c
|
|
Integrated in lustre-master » i686,client,el6,ofa #480
LU-919 obdclass: remove hard coded 0x5a5a5a (Revision afe043c883c7b833a702dfe00d3814f0e18d3942)
Result = ABORTED
Oleg Drokin : afe043c883c7b833a702dfe00d3814f0e18d3942
Files :
- lustre/include/lustre_log.h
- lustre/obdclass/genops.c
- lustre/obdclass/llog_obd.c
|
|
This still seem to be broken in 1.8.8 client
LustreError: 31122:0:(genops.c:803:class_import_get()) ASSERTION(atomic_read(&import->imp_refcount) < 0x5a5a5a) failed
LustreError: 31122:0:(genops.c:803:class_import_get()) LBUG
Sep 3 13:44:01 Pid: 31122, comm: rsync
sand-1-3 kernel:
LustreError: 31Call Trace:
122:0:(genops.c:BUG: unable to handle kernel 803:class_impNortULL pointer dereference_get()) ASSERTIO at (null)
N(atomic_read(&iIP: [<(null)>] (null)
mport->imp_refcoPGD 1a68aa067 PUD 449eea067 unt) < 0x5a5a5a)PMD 0
failed
Sep 3 1Oops: 0010 1 SMP
3:44:01 sand-1-3last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:03:00.0/infiniband/mlx4_0/ports/1/link_layer
kernel: LustreECPU 4
rror: 31122:0:(gModules linked in:enops.c:803:c lasmgc(U)s_import_get()) lustre(U)LBUG
Sep 3 13:4 lov(U)4:01 sand-1-3 kemdc(U)rnel: Pid: 31122 lquota(U), comm: rsync
Se osc(U) ko2iblndp 3 13:44:01 sa(U) ptlrpcnd-1-3 kernel:
(U) obdclassSep 3 13:44:01 (U) lnetsand-1-3 kernel:(U) lvfs Call Trace:
(U) libcfs(U) nfsd exportfs acpi_cpufreq freq_table mperf rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ib_uverbs(U) ib_umad(U) mlx4_ib(U) ib_mad(U) ib_core(U) mlx4_core(U) ext4
mbcache jbd2 dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm wmi microcode dcdbas sg sb_edac edac_core i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ioatdma ipv6 sd_mod c
rc_t10dif ahci igb dca nfs lockd fscache nfs_acl auth_rpcgss sunrpc [last unloaded: scsi_wait_scan]
Pid: 31122, comm: rsync Not tainted 2.6.32-220.23.1.el6.x86_64 #1 Dell Inc. PowerEdge C6220/0WTH3T
RIP: 0010:[<0000000000000000>] [<(null)>] (null)
RSP: 0018:ffff8804d52b5440 EFLAGS: 00010246
RAX: ffff8804d52b54a4 RBX: ffff8804d52b5498 RCX: ffffffffa045c320
RDX: ffff8804d52b54d8 RSI: ffff8804d52b5498 RDI: ffff8804d52b4000
RBP: ffff8804d52b54d8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: 000000000000cbe0
R13: ffffffffa045c320 R14: 0000000000000000 R15: ffff880048683fc0
FS: 00007f030976a700(0000) GS:ffff880048680000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000008a1a22000 CR4: 00000000000406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process rsync (pid: 31122, threadinfo ffff8804d52b4000, task ffff8806bac29580)
Stack:
ffffffff8100e520 ffff8804d52b54a4 ffff8806bac29580 ffffffffa0545210
<0> 00000000a05491e8 ffff8804d52b4000 ffff8804d52b5fd8 ffff8804d52b4000
<0> 0000000000000004 ffff880048680000 ffff8804d52b54d8 ffff8804d52b54a8
Call Trace:
[<ffffffff8100e520>] ? dump_trace+0x190/0x3b0
[<ffffffffa04508f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
|
|
I can reproduce this by running rsync of a directory containing a few TBs of unpacked kernel trees. Some time it takes longer and some times it takes shorter to hit the bug but eventually it will, very annoying.
|
Generated at Sat Feb 10 05:50:45 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.