[LU-1092] NULL pointer dereference in filter_export_stats_init() Created: 10/Feb/12  Updated: 12/Jun/12  Resolved: 29/Mar/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0
Fix Version/s: Lustre 2.3.0, Lustre 2.1.2

Type: Bug Priority: Critical
Reporter: Ned Bass Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: paj
Environment:

https://github.com/chaos/lustre/commits/2.1.0-llnl
RHEL 6.2


Severity: 3
Rank (Obsolete): 4682

 Description   

We had three occurrences of this crash on our classified 2.1 Lustre cluster, all on OSS nodes.

BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
IP: [<ffffffffa0a8e061>] filter_export_stats_init+0x1f1/0x500 [obdfilter]

machine_kexec
crash_kexec
oops_end
no_context
__bad_area_nosemaphore
bad_area_nosemaphore
__do_page_fault
do_page_fault
page_fault
[exception RIP: filter_export_stats_init+497]
filter_reconnect
target_handle_connect
ost_handle
ptlrpc_main
kernel_thread

The timeframe coincided with the ASSERT reported in LU-1085. As in the other bugs we hit during that window, this crash was preceded by hundreds of messages like this:

LustreError: 14210:0:(genops.c:1270:class_disconnect_stale_exports()) ls5-OST0349: disconnect stale client [UUID]@<unknown>

Oleg has suggested that the patch for LU-106 may help here, and we have pulled it into our branch but haven't pushed it out yet.



 Comments   
Comment by Peter Jones [ 10/Feb/12 ]

Lai

Does this look like it would be related to LU-106?

Peter

Comment by Bruno Faccini (Inactive) [ 02/Mar/12 ]

I don't think so, since we just suffered this same crash/Oops with a Lustre version that already includes the LU-106 patch:
======================================
crash> sys
SYSTEM MAP: /boot/System.map-2.6.32-131.17.1.bl6.Bull.27.0.x86_64
DEBUG KERNEL: /usr/lib/debug/lib/modules/2.6.32-131.17.1.bl6.Bull.27.0.x86_64/vmlinux (2.6.32-131.17.1.bl6.Bull.27.0.x86_64)
DUMPFILE: ./vmcore [PARTIAL DUMP]
CPUS: 32
DATE: Wed Feb 29 16:06:48 2012
UPTIME: 01:24:59
LOAD AVERAGE: 16.72, 15.79, 14.23
TASKS: 1855
NODENAME: curie228
RELEASE: 2.6.32-131.17.1.bl6.Bull.27.0.x86_64
VERSION: #1 SMP Mon Nov 7 15:21:24 CET 2011
MACHINE: x86_64 (2266 Mhz)
MEMORY: 64 GB
PANIC: "Oops: 0002 [#1] SMP " (check log for details)
crash>
crash> bt
PID: 26089 TASK: ffff880bd9164080 CPU: 7 COMMAND: "ll_ost_116"
#0 [ffff880bc823b540] machine_kexec at ffffffff81027a4b
#1 [ffff880bc823b5a0] crash_kexec at ffffffff810a2db2
#2 [ffff880bc823b670] oops_end at ffffffff81481730
#3 [ffff880bc823b6a0] no_context at ffffffff81031d1b
#4 [ffff880bc823b6f0] __bad_area_nosemaphore at ffffffff81031fa5
#5 [ffff880bc823b740] bad_area_nosemaphore at ffffffff81032073
#6 [ffff880bc823b750] __do_page_fault at ffffffff810326fd
#7 [ffff880bc823b870] do_page_fault at ffffffff8148373e
#8 [ffff880bc823b8a0] page_fault at ffffffff81480ac5
[exception RIP: filter_export_stats_init+665]
RIP: ffffffffa0900e49 RSP: ffff880bc823b950 RFLAGS: 00010286
RAX: ffff880fae2ddc60 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8807e5ab1c00
RBP: ffff880bc823b9c0 R8: 0000000000000000 R9: ffff880f895bc000
R10: 0000000000000001 R11: dead000000200200 R12: ffff88040b14e340
R13: ffff880faf6795f0 R14: ffff880c136ee2f8 R15: ffff880faf6795f0
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#9 [ffff880bc823b9c8] filter_reconnect at ffffffffa090d32e [obdfilter]
#10 [ffff880bc823ba38] target_handle_connect at ffffffffa0598122 [ptlrpc]
#11 [ffff880bc823bbf8] ost_handle at ffffffffa08dfdfb [ost]
#12 [ffff880bc823bd68] ptlrpc_main at ffffffffa05df459 [ptlrpc]
#13 [ffff880bc823bf48] kernel_thread at ffffffff810041aa
crash>
======================================

BTW, a first crash-dump analysis indicates a race on "((struct obd_export *)0xffff880b4b4df000)->exp_nid_stats", which was set to NULL while the panicking thread was still referencing it in the call to lprocfs_init_rw_stats(), inlined in filter_export_stats_init().

So some more protective code may need to be added in lprocfs.

More debugging info to come as soon as power is back on our system.

Comment by Bruno Faccini (Inactive) [ 15/Mar/12 ]

I made some progress on the crash dump (in fact we hit several occurrences of this crash on different OSSes). My assumption is that the race occurs when a reconnect request arrives while a huge number of client exports are being disconnected on multiple OSTs/targets at the same time. This is based on the huge number of "LustreError: <tgt_recov_PID>:0:(genops.c:1270:class_disconnect_stale_exports()) <OST_name>: disconnect stale client <Client_UUID>@<unknown>" messages being printed on the console at the time of the crash.

I think the race can occur because the "stale" export disconnect algorithm protects itself with obd_device->obd_dev_lock and then removes each obd_export from its "exp_obd_chain" list, whereas the [re-]connect algorithm finds the obd_export in question via cfs_hash_lookup() on obd_device->obd_uuid_hash.

This allows an obd_export reference to be taken in target_handle_connect() before it calls filter_export_stats_init()/lprocfs_init_rw_stats() while, at the same time in another thread, the disconnect algorithm finally, but too late, unlinks the obd_export from the hash (obd_disconnect()/filter_disconnect()/server_disconnect_export()/class_disconnect()/class_unlink_export()) and also sets obd_export->exp_nid_stats to NULL via class_export_put()/lprocfs_exp_cleanup().

Does this sound like a reasonable scenario for the race?
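
The interleaving described above can be replayed deterministically in a small user-space model. This is only a sketch under stated assumptions: toy_export, toy_lookup(), toy_put() and replay_unreferenced_connect() are hypothetical stand-ins, not the Lustre structures or functions, and the model assumes the connect path goes on using the looked-up export without holding a reference of its own.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for an export; NOT the real struct obd_export. */
struct toy_export {
    int   refcount;   /* references currently held */
    int   hashed;     /* still reachable through the UUID hash */
    void *nid_stats;  /* stand-in for exp_nid_stats */
};

/* Assumption of this model: the lookup hands back the export
 * without taking a reference for the caller. */
static struct toy_export *toy_lookup(struct toy_export *exp)
{
    return exp->hashed ? exp : NULL;
}

/* Drop one reference; the last put clears the stats pointer,
 * mirroring the cleanup step named in the scenario. */
static void toy_put(struct toy_export *exp)
{
    assert(exp->refcount > 0);
    if (--exp->refcount == 0)
        exp->nid_stats = NULL;
}

/* Replays the interleaving on a single thread:
 *   1. connect finds the export in the hash
 *   2. disconnect drops the last reference, then unhashes (too late)
 *   3. connect goes on to use the stats pointer
 * Returns 1 if step 3 would see a NULL stats pointer. */
static int replay_unreferenced_connect(void)
{
    int dummy_stats = 0;
    struct toy_export exp = { 1, 1, &dummy_stats };
    struct toy_export *found;

    found = toy_lookup(&exp);          /* step 1 */
    if (found == NULL)
        return 0;

    toy_put(&exp);                     /* step 2: last reference gone */
    exp.hashed = 0;                    /* unhash happens only now */

    return found->nid_stats == NULL;   /* step 3 */
}
```

In this model the export is still found because it is unhashed only after the last reference has been dropped, which is the ordering the scenario hinges on.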

Comment by Lai Siyao [ 16/Mar/12 ]

Bruno, your analysis looks reasonable: a disconnected export may still be found, which is wrong. IMHO, if the export is being disconnected during connect, it should not be reused; a new one should be created instead.
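
A minimal sketch of the "do not reuse an export that is being disconnected" idea, as a user-space model. The names toy_export, disconnecting and connect_pick_export() are hypothetical and only illustrate the decision; they are not Lustre fields or functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for an export; NOT the real struct obd_export. */
struct toy_export {
    int disconnecting;  /* hypothetical flag: teardown has started */
};

/* Pick the export a connect request should use.
 * An export found by lookup is reused only if it is not being torn
 * down; otherwise a fresh one is allocated and *created is set to 1.
 * The caller owns (and must free) a freshly created export. */
static struct toy_export *connect_pick_export(struct toy_export *found,
                                              int *created)
{
    struct toy_export *fresh;

    *created = 0;
    if (found != NULL && !found->disconnecting)
        return found;

    fresh = calloc(1, sizeof(*fresh));
    if (fresh != NULL)
        *created = 1;
    return fresh;
}
```

For the check to be meaningful in a real concurrent setting, the flag would have to be set and tested under the same lock; the sketch leaves locking out.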

Comment by Lai Siyao [ 19/Mar/12 ]

In my last comment I said connect/disconnect should be serialized; however, the real issue here is that 'connect' doesn't hold an export refcount during the process, and in the meantime another disconnect is called and puts the last refcount of the export. Though the approach I described should work, there is a simpler fix: take an export refcount in connect.

I'll commit the fix later.
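
A minimal user-space sketch of the "take an export refcount in connect" idea. It is not the actual patch: toy_export, toy_get(), toy_put() and connect_with_ref() are hypothetical names, and the racing disconnect is simulated inline on one thread.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for an export; NOT the real struct obd_export. */
struct toy_export {
    int   refcount;   /* references currently held */
    void *nid_stats;  /* stand-in for exp_nid_stats */
};

/* Take an extra reference on an export that is known to be alive. */
static void toy_get(struct toy_export *exp)
{
    assert(exp->refcount > 0);
    exp->refcount++;
}

/* Drop one reference; the last put clears the stats pointer. */
static void toy_put(struct toy_export *exp)
{
    assert(exp->refcount > 0);
    if (--exp->refcount == 0)
        exp->nid_stats = NULL;
}

/* Connect path that pins the export for its whole duration.
 * racing_disconnect != 0 simulates another thread dropping its
 * reference in the middle of the connect.
 * Returns 1 if the stats pointer was still valid when connect used it. */
static int connect_with_ref(struct toy_export *exp, int racing_disconnect)
{
    int stats_valid;

    toy_get(exp);                 /* connect's own reference */

    if (racing_disconnect)
        toy_put(exp);             /* other thread's put is no longer the last */

    stats_valid = (exp->nid_stats != NULL);

    toy_put(exp);                 /* release; cleanup runs here if last */
    return stats_valid;
}
```

With the extra reference, the cleanup that clears the stats pointer is deferred until connect has released the export. In a real concurrent setting the reference has to be taken while the export is still guaranteed to be alive, for example under the same protection as the lookup.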

Comment by Lai Siyao [ 19/Mar/12 ]

Review is on http://review.whamcloud.com/#change,2345.

Comment by Build Master (Inactive) [ 27/Mar/12 ]

Integrated in lustre-master » i686,client,el6,inkernel #527
LU-1092 ptlrpc: take export refcount during connect (Revision 893cf2014a38c5bd94890d3522fafe55f024a958)

Result = SUCCESS
Oleg Drokin : 893cf2014a38c5bd94890d3522fafe55f024a958
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 27/Mar/12 ]

Integrated in lustre-master » i686,server,el6,inkernel #527
LU-1092 ptlrpc: take export refcount during connect (Revision 893cf2014a38c5bd94890d3522fafe55f024a958)

Result = SUCCESS
Oleg Drokin : 893cf2014a38c5bd94890d3522fafe55f024a958
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 27/Mar/12 ]

Integrated in lustre-master » i686,client,el6,ofa #527
LU-1092 ptlrpc: take export refcount during connect (Revision 893cf2014a38c5bd94890d3522fafe55f024a958)

Result = SUCCESS
Oleg Drokin : 893cf2014a38c5bd94890d3522fafe55f024a958
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 27/Mar/12 ]

Integrated in lustre-master » i686,server,el6,ofa #527
LU-1092 ptlrpc: take export refcount during connect (Revision 893cf2014a38c5bd94890d3522fafe55f024a958)

Result = SUCCESS
Oleg Drokin : 893cf2014a38c5bd94890d3522fafe55f024a958
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 27/Mar/12 ]

Integrated in lustre-master » i686,client,el5,ofa #527
LU-1092 ptlrpc: take export refcount during connect (Revision 893cf2014a38c5bd94890d3522fafe55f024a958)

Result = SUCCESS
Oleg Drokin : 893cf2014a38c5bd94890d3522fafe55f024a958
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 27/Mar/12 ]

Integrated in lustre-master » x86_64,client,sles11,inkernel #527
LU-1092 ptlrpc: take export refcount during connect (Revision 893cf2014a38c5bd94890d3522fafe55f024a958)

Result = SUCCESS
Oleg Drokin : 893cf2014a38c5bd94890d3522fafe55f024a958
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 27/Mar/12 ]

Integrated in lustre-master » x86_64,client,el6,ofa #527
LU-1092 ptlrpc: take export refcount during connect (Revision 893cf2014a38c5bd94890d3522fafe55f024a958)

Result = SUCCESS
Oleg Drokin : 893cf2014a38c5bd94890d3522fafe55f024a958
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 28/Mar/12 ]

Integrated in lustre-master » i686,server,el5,ofa #527
LU-1092 ptlrpc: take export refcount during connect (Revision 893cf2014a38c5bd94890d3522fafe55f024a958)

Result = SUCCESS
Oleg Drokin : 893cf2014a38c5bd94890d3522fafe55f024a958
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 28/Mar/12 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #527
LU-1092 ptlrpc: take export refcount during connect (Revision 893cf2014a38c5bd94890d3522fafe55f024a958)

Result = SUCCESS
Oleg Drokin : 893cf2014a38c5bd94890d3522fafe55f024a958
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 28/Mar/12 ]

Integrated in lustre-master » x86_64,server,el5,ofa #527
LU-1092 ptlrpc: take export refcount during connect (Revision 893cf2014a38c5bd94890d3522fafe55f024a958)

Result = SUCCESS
Oleg Drokin : 893cf2014a38c5bd94890d3522fafe55f024a958
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 28/Mar/12 ]

Integrated in lustre-master » x86_64,server,el6,inkernel #527
LU-1092 ptlrpc: take export refcount during connect (Revision 893cf2014a38c5bd94890d3522fafe55f024a958)

Result = SUCCESS
Oleg Drokin : 893cf2014a38c5bd94890d3522fafe55f024a958
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 28/Mar/12 ]

Integrated in lustre-master » i686,client,el5,inkernel #527
LU-1092 ptlrpc: take export refcount during connect (Revision 893cf2014a38c5bd94890d3522fafe55f024a958)

Result = SUCCESS
Oleg Drokin : 893cf2014a38c5bd94890d3522fafe55f024a958
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 28/Mar/12 ]

Integrated in lustre-master » x86_64,client,el6,inkernel #527
LU-1092 ptlrpc: take export refcount during connect (Revision 893cf2014a38c5bd94890d3522fafe55f024a958)

Result = SUCCESS
Oleg Drokin : 893cf2014a38c5bd94890d3522fafe55f024a958
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 28/Mar/12 ]

Integrated in lustre-master » x86_64,server,el5,inkernel #527
LU-1092 ptlrpc: take export refcount during connect (Revision 893cf2014a38c5bd94890d3522fafe55f024a958)

Result = SUCCESS
Oleg Drokin : 893cf2014a38c5bd94890d3522fafe55f024a958
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 28/Mar/12 ]

Integrated in lustre-master » i686,server,el5,inkernel #527
LU-1092 ptlrpc: take export refcount during connect (Revision 893cf2014a38c5bd94890d3522fafe55f024a958)

Result = SUCCESS
Oleg Drokin : 893cf2014a38c5bd94890d3522fafe55f024a958
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 28/Mar/12 ]

Integrated in lustre-master » x86_64,client,el5,inkernel #527
LU-1092 ptlrpc: take export refcount during connect (Revision 893cf2014a38c5bd94890d3522fafe55f024a958)

Result = SUCCESS
Oleg Drokin : 893cf2014a38c5bd94890d3522fafe55f024a958
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 28/Mar/12 ]

Integrated in lustre-master » x86_64,server,el6,ofa #527
LU-1092 ptlrpc: take export refcount during connect (Revision 893cf2014a38c5bd94890d3522fafe55f024a958)

Result = SUCCESS
Oleg Drokin : 893cf2014a38c5bd94890d3522fafe55f024a958
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 28/Mar/12 ]

Integrated in lustre-master » x86_64,client,el5,ofa #527
LU-1092 ptlrpc: take export refcount during connect (Revision 893cf2014a38c5bd94890d3522fafe55f024a958)

Result = SUCCESS
Oleg Drokin : 893cf2014a38c5bd94890d3522fafe55f024a958
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Peter Jones [ 29/Mar/12 ]

Landed for 2.3

Comment by Christopher Morrone [ 19/Apr/12 ]

Needed on 2.1. The ball was dropped here, it appears.

Comment by Build Master (Inactive) [ 20/Apr/12 ]

Integrated in lustre-b2_1 » x86_64,client,sles11,inkernel #48
LU-1092 ptlrpc: take export refcount during connect (Revision 2c8b70becc8da305b687509febbdd6f8de95cf10)

Result = SUCCESS
Oleg Drokin : 2c8b70becc8da305b687509febbdd6f8de95cf10
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 20/Apr/12 ]

Integrated in lustre-b2_1 » x86_64,client,el6,inkernel #48
LU-1092 ptlrpc: take export refcount during connect (Revision 2c8b70becc8da305b687509febbdd6f8de95cf10)

Result = SUCCESS
Oleg Drokin : 2c8b70becc8da305b687509febbdd6f8de95cf10
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 20/Apr/12 ]

Integrated in lustre-b2_1 » x86_64,server,el5,inkernel #48
LU-1092 ptlrpc: take export refcount during connect (Revision 2c8b70becc8da305b687509febbdd6f8de95cf10)

Result = SUCCESS
Oleg Drokin : 2c8b70becc8da305b687509febbdd6f8de95cf10
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 20/Apr/12 ]

Integrated in lustre-b2_1 » i686,server,el5,inkernel #48
LU-1092 ptlrpc: take export refcount during connect (Revision 2c8b70becc8da305b687509febbdd6f8de95cf10)

Result = SUCCESS
Oleg Drokin : 2c8b70becc8da305b687509febbdd6f8de95cf10
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 20/Apr/12 ]

Integrated in lustre-b2_1 » i686,server,el6,inkernel #48
LU-1092 ptlrpc: take export refcount during connect (Revision 2c8b70becc8da305b687509febbdd6f8de95cf10)

Result = SUCCESS
Oleg Drokin : 2c8b70becc8da305b687509febbdd6f8de95cf10
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 20/Apr/12 ]

Integrated in lustre-b2_1 » x86_64,server,el5,ofa #48
LU-1092 ptlrpc: take export refcount during connect (Revision 2c8b70becc8da305b687509febbdd6f8de95cf10)

Result = SUCCESS
Oleg Drokin : 2c8b70becc8da305b687509febbdd6f8de95cf10
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 20/Apr/12 ]

Integrated in lustre-b2_1 » i686,server,el5,ofa #48
LU-1092 ptlrpc: take export refcount during connect (Revision 2c8b70becc8da305b687509febbdd6f8de95cf10)

Result = SUCCESS
Oleg Drokin : 2c8b70becc8da305b687509febbdd6f8de95cf10
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 20/Apr/12 ]

Integrated in lustre-b2_1 » i686,client,el6,inkernel #48
LU-1092 ptlrpc: take export refcount during connect (Revision 2c8b70becc8da305b687509febbdd6f8de95cf10)

Result = SUCCESS
Oleg Drokin : 2c8b70becc8da305b687509febbdd6f8de95cf10
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 20/Apr/12 ]

Integrated in lustre-b2_1 » x86_64,client,el5,inkernel #48
LU-1092 ptlrpc: take export refcount during connect (Revision 2c8b70becc8da305b687509febbdd6f8de95cf10)

Result = SUCCESS
Oleg Drokin : 2c8b70becc8da305b687509febbdd6f8de95cf10
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 20/Apr/12 ]

Integrated in lustre-b2_1 » i686,client,el5,ofa #48
LU-1092 ptlrpc: take export refcount during connect (Revision 2c8b70becc8da305b687509febbdd6f8de95cf10)

Result = SUCCESS
Oleg Drokin : 2c8b70becc8da305b687509febbdd6f8de95cf10
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 20/Apr/12 ]

Integrated in lustre-b2_1 » x86_64,client,el5,ofa #48
LU-1092 ptlrpc: take export refcount during connect (Revision 2c8b70becc8da305b687509febbdd6f8de95cf10)

Result = SUCCESS
Oleg Drokin : 2c8b70becc8da305b687509febbdd6f8de95cf10
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 20/Apr/12 ]

Integrated in lustre-b2_1 » i686,client,el5,inkernel #48
LU-1092 ptlrpc: take export refcount during connect (Revision 2c8b70becc8da305b687509febbdd6f8de95cf10)

Result = SUCCESS
Oleg Drokin : 2c8b70becc8da305b687509febbdd6f8de95cf10
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 20/Apr/12 ]

Integrated in lustre-b2_1 » x86_64,server,el6,inkernel #48
LU-1092 ptlrpc: take export refcount during connect (Revision 2c8b70becc8da305b687509febbdd6f8de95cf10)

Result = SUCCESS
Oleg Drokin : 2c8b70becc8da305b687509febbdd6f8de95cf10
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 02/May/12 ]

Integrated in lustre-dev » x86_64,client,el5,inkernel #340
LU-1092 ptlrpc: take export refcount during connect (Revision 893cf2014a38c5bd94890d3522fafe55f024a958)

Result = SUCCESS
Oleg Drokin : 893cf2014a38c5bd94890d3522fafe55f024a958
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 02/May/12 ]

Integrated in lustre-dev » i686,client,el6,inkernel #340
LU-1092 ptlrpc: take export refcount during connect (Revision 893cf2014a38c5bd94890d3522fafe55f024a958)

Result = SUCCESS
Oleg Drokin : 893cf2014a38c5bd94890d3522fafe55f024a958
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 02/May/12 ]

Integrated in lustre-dev » i686,server,el5,inkernel #340
LU-1092 ptlrpc: take export refcount during connect (Revision 893cf2014a38c5bd94890d3522fafe55f024a958)

Result = SUCCESS
Oleg Drokin : 893cf2014a38c5bd94890d3522fafe55f024a958
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 02/May/12 ]

Integrated in lustre-dev » x86_64,server,el6,inkernel #340
LU-1092 ptlrpc: take export refcount during connect (Revision 893cf2014a38c5bd94890d3522fafe55f024a958)

Result = SUCCESS
Oleg Drokin : 893cf2014a38c5bd94890d3522fafe55f024a958
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 02/May/12 ]

Integrated in lustre-dev » i686,client,el5,inkernel #340
LU-1092 ptlrpc: take export refcount during connect (Revision 893cf2014a38c5bd94890d3522fafe55f024a958)

Result = SUCCESS
Oleg Drokin : 893cf2014a38c5bd94890d3522fafe55f024a958
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 02/May/12 ]

Integrated in lustre-dev » x86_64,server,el5,inkernel #340
LU-1092 ptlrpc: take export refcount during connect (Revision 893cf2014a38c5bd94890d3522fafe55f024a958)

Result = SUCCESS
Oleg Drokin : 893cf2014a38c5bd94890d3522fafe55f024a958
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Build Master (Inactive) [ 02/May/12 ]

Integrated in lustre-dev » x86_64,client,el6,inkernel #340
LU-1092 ptlrpc: take export refcount during connect (Revision 893cf2014a38c5bd94890d3522fafe55f024a958)

Result = SUCCESS
Oleg Drokin : 893cf2014a38c5bd94890d3522fafe55f024a958
Files :

  • lustre/ldlm/ldlm_lib.c
Comment by Jay Lan (Inactive) [ 11/Jun/12 ]

Hi Siyao,

We here at NASA Ames had a LBUG on
LustreError: 6023:0:(genops.c:933:class_import_put()) ASSERTION(cfs_list_empty(&imp->imp_zombie_chain)) failed

That ASSERTION was mentioned in LU-1085, which was dup'ed to this LU.
Does our failure look like the same problem as this LU?

LustreError: 6023:0:(genops.c:933:class_import_put()) ASSERTION(cfs_list_empty(&imp->imp_zombie_chain)) failed
LustreError: 6023:0:(genops.c:933:class_import_put()) LBUG
Pid: 6023, comm: ll_ost_365

Call Trace:
[<ffffffffa0553855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa0553e95>] lbug_with_loc+0x75/0xe0 [libcfs]
[<ffffffffa055eda6>] libcfs_assertion_failed+0x66/0x70 [libcfs]
[<ffffffffa061fb67>] class_import_put+0x2a7/0x300 [obdclass]
[<ffffffffa06346dc>] ? class_handle_unhash+0x3c/0x50 [obdclass]
[<ffffffffa061fc1c>] ? class_destroy_import+0x5c/0xb0 [obdclass]
[<ffffffffa070e92e>] client_destroy_import+0x2e/0x40 [ptlrpc]
[<ffffffffa071058e>] target_handle_connect+0xece/0x2d40 [ptlrpc]
[<ffffffff81052ed5>] ? select_idle_sibling+0x95/0x150
[<ffffffffa0749b7c>] ? lustre_msg_get_version+0x7c/0xe0 [ptlrpc]
[<ffffffffa0749ca8>] ? lustre_msg_check_version+0xc8/0xe0 [ptlrpc]
[<ffffffffa081276b>] ost_handle+0x244b/0x4ba0 [ost]
[<ffffffffa0a02ac6>] ? vvp_session_key_init+0x76/0x1d0 [lustre]
[<ffffffffa0747954>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
[<ffffffffa075842e>] ptlrpc_main+0xb7e/0x18f0 [ptlrpc]
[<ffffffffa07578b0>] ? ptlrpc_main+0x0/0x18f0 [ptlrpc]
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffffa07578b0>] ? ptlrpc_main+0x0/0x18f0 [ptlrpc]
[<ffffffffa07578b0>] ? ptlrpc_main+0x0/0x18f0 [ptlrpc]
[<ffffffff8100c140>] ? child_rip+0x0/0x20

Comment by Lai Siyao [ 12/Jun/12 ]

No, IMO it's not the same issue. The backtrace shows a zombie import being destroyed again, but this LU is about an export race. At first sight, the export->exp_imp_reverse handling code in target_handle_connect() looks unsafe (it is not protected by a lock). In fact, the whole body of code around export/import refcounting looks a bit messy; if such a panic still happens with all the patches mentioned by Oleg, we may need to tidy up this code as a whole.

Generated at Sat Feb 10 05:58:10 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.