[LU-39] ASSERTION(atomic_read(&client_stat->nid_exp_ref_count) == 0) failed: count 1 Created: 11/Jan/11  Updated: 25/May/12  Resolved: 23/Mar/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.6
Fix Version/s: Lustre 2.1.0, Lustre 1.8.6

Type: Bug
Priority: Minor
Reporter: Christopher Morrone
Assignee: Zhenyu Xu
Resolution: Fixed
Votes: 0
Labels: None
Environment:

lustre 1.8.3.0-6chaos


Attachments: Text File lu-39v2-master.patch    
Severity: 3
Bugzilla ID: 23499
Rank (Obsolete): 5093

 Description   

A sysadmin was shutting down an MDS node cleanly in preparation for scheduled upgrades. During the umount of the MGS device, we hit the following assertion:

LustreError ... (lprocfs_status.c:1060:lprocfs_free_client_stats())
ASSERTION(atomic_read(&client_stat->nid_exp_ref_count) == 0) failed: count 1

And the stack trace was:

:obdclass:lprocfs_free_client_stats
:obdclass:lprocfs_free_per_client_stats
:mgs:lproc_mgs_cleanup
:mgs:mgs_cleanup
:obdclass:class_decref
:obdclass:class_export_destroy
:obdclass:obd_zombie_impexp_cull
:obdclass:obd_zombie_impexp_thread

We have seen this same assertion from OSTs as well. Some investigation was done in bug 23499, but there is not yet a solution.



 Comments   
Comment by Dan Ferber (Inactive) [ 11/Jan/11 ]

Bobijam is taking this bug.

Comment by Zhenyu Xu [ 13/Jan/11 ]

There is a race window between server obd cleanup and the handling
of a client connection.

In class_cleanup(), before the server obd's stopping flag is set, a new
client connection request can still be handled in target_handle_connect().
After the target obd (server obd) is found, but before its refcount is
increased, class_cleanup() can proceed and run through class_decref(),
which finds that the server obd's refcount has dropped to 0. At this
point target_handle_connect() can continue, increase the target obd
(server obd) refcount, and create the client export.

So the race leaves a server obd that is supposed to have 0 references
with an export on it, and thus the server obd's obd_nid_stats contains
a client nid stat with a non-zero refcount.
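
For illustration only, the general way to close a "found the object, but have not taken a reference yet" window is to take the reference only while the count is still non-zero. The sketch below is not the patch tracked for this ticket and is not Lustre code; struct demo_obd, demo_obd_get_unless_zero() and demo_obd_put() are hypothetical names, and C11 atomics stand in for the kernel's atomic_t.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a server obd; not the real struct obd_device. */
struct demo_obd {
    atomic_int refcount;
};

/*
 * Take a reference only if the object is still alive (refcount > 0).
 * A connect handler using this cannot resurrect an obd whose count
 * has already reached 0 in the cleanup path.
 */
static bool demo_obd_get_unless_zero(struct demo_obd *obd)
{
    int old = atomic_load(&obd->refcount);

    while (old > 0) {
        /* On failure the CAS reloads 'old' with the current value. */
        if (atomic_compare_exchange_weak(&obd->refcount, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference; returns true if the caller dropped the last one. */
static bool demo_obd_put(struct demo_obd *obd)
{
    int prev = atomic_fetch_sub(&obd->refcount, 1);

    assert(prev > 0);
    return prev == 1;
}
```

With this shape, a connect request that loses the race sees demo_obd_get_unless_zero() return false and can refuse the connection instead of creating an export on an obd that is already being torn down.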

Comment by Zhenyu Xu [ 13/Jan/11 ]

patch is tracked at http://review.whamcloud.com/211

Comment by Zhenyu Xu [ 21/Jan/11 ]

As Niu mentioned in the Gerrit review: "The nid_stats_hash will be destroyed before the obd refcount drops to zero (see class_cleanup()), so the target_handle_reconnect() will not able to add any elements into the nid_stats_hash. I'm afraid this crash is caused by other defects. Maybe in the lprocfs_exp_setup(), it handles the (old_stat != new_stat) case incorrectly?"

Yes, the patch I posted solves a different bug, one that would not progress as far as this assertion failure.

I checked the lprocfs_exp_setup() code and it handles the (old_stat != new_stat) case correctly, so there must be another race that leaves this extra nid_exp_ref_count at cleanup time. (In the obd cleanup phase, obd_nid_stats_hash is destroyed in class_cleanup(), and mgs_cleanup() is not called until the obd's refcount drops to 0, which means all client exports have been dereferenced; class_export_destroy() decreases the obd's refcount.)

Comment by Zhenyu Xu [ 24/Jan/11 ]

Christopher,

Do any of the clients use multiple NIDs to connect to the servers?

If a client is configured with multiple NIDs and the connected NID encounters a problem, it will reconnect to the server with a different NID, and lprocfs_exp_setup() misses releasing the old NID's stats refcount.
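
As a rough illustration of the leak described above: when an export switches from one NID's stats entry to another, the reference on the old entry has to be dropped, otherwise its count never returns to 0 and a cleanup-time assertion on that count fires. The names below (demo_nid_stat, demo_export, demo_exp_set_stat) are hypothetical and the counters are plain ints for a single-threaded sketch; this is not the actual lprocfs_exp_setup() code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-NID stats entry with a count of exports using it. */
struct demo_nid_stat {
    int exp_ref_count;
};

/* Hypothetical export that points at the stats entry for its current NID. */
struct demo_export {
    struct demo_nid_stat *stat;
};

/*
 * Point the export at new_stat, releasing the reference it held on the
 * previous entry. Skipping the release step is the kind of omission that
 * leaves a stale count of 1 behind after a reconnect from a different NID.
 */
static void demo_exp_set_stat(struct demo_export *exp,
                              struct demo_nid_stat *new_stat)
{
    struct demo_nid_stat *old_stat = exp->stat;

    if (old_stat == new_stat)
        return;

    if (new_stat != NULL)
        new_stat->exp_ref_count++;
    exp->stat = new_stat;

    if (old_stat != NULL) {
        assert(old_stat->exp_ref_count > 0);
        old_stat->exp_ref_count--;
    }
}
```

In this sketch, a reconnect from NID B after an initial connect from NID A leaves A's count at 0 and B's at 1, so freeing the per-client stats at cleanup finds no leftover references.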

Comment by D. Marc Stearman (Inactive) [ 24/Jan/11 ]

The OSS nodes and the MDS have multiple NIDs, and the clients only have a single NID that they use to talk to the servers. Could this be caused by the MDS losing connection with the OSS nodes, and reconnecting using a different NID? The MDS would be similar to a client in this case.

Comment by Zhenyu Xu [ 24/Jan/11 ]

yes, that could be the case.

Comment by Liang Zhen (Inactive) [ 25/Jan/11 ]

bobi, I think we might want this patch for 2.x as well. I just took a quick look and found the same issue in 2.x.

Thanks

Comment by Zhenyu Xu [ 25/Jan/11 ]

HEAD version patch.

Comment by Build Master (Inactive) [ 21/Mar/11 ]

Integrated in reviews-centos5 #529
LU-39 ASSERTION(atomic_read(&client_stat->nid_exp_ref_count) == 0)

Bobi Jam : 8efdff9aeb1933f9b1e7320ff48ad84983e4daa3
Files :

  • lustre/mgs/mgs_fs.c
  • lustre/mdt/mdt_fs.c
  • lustre/mdt/mdt_internal.h
  • lustre/include/lprocfs_status.h
  • lustre/mgs/mgs_handler.c
  • lustre/obdfilter/filter.c
  • lustre/obdclass/lprocfs_status.c
  • lustre/mdt/mdt_handler.c
  • lustre/mgs/mgs_internal.h
Comment by Build Master (Inactive) [ 22/Mar/11 ]

Integrated in lustre-master-centos5 #159
LU-39 ASSERTION(atomic_read(&client_stat->nid_exp_ref_count) == 0)

Oleg Drokin : 2a6045403fbd46bb6501df907f0321f5401924ba
Files :

  • lustre/mdt/mdt_fs.c
  • lustre/mgs/mgs_internal.h
  • lustre/mdt/mdt_internal.h
  • lustre/include/lprocfs_status.h
  • lustre/mgs/mgs_fs.c
  • lustre/mgs/mgs_handler.c
  • lustre/obdfilter/filter.c
  • lustre/obdclass/lprocfs_status.c
  • lustre/mdt/mdt_handler.c