[LU-39] ASSERTION(atomic_read(&client_stat->nid_exp_ref_count) == 0) failed: count 1 Created: 11/Jan/11 Updated: 25/May/12 Resolved: 23/Mar/11 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.6 |
| Fix Version/s: | Lustre 2.1.0, Lustre 1.8.6 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Christopher Morrone | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
lustre 1.8.3.0-6chaos |
||
| Attachments: |
|
| Severity: | 3 |
| Bugzilla ID: | 23499 |
| Rank (Obsolete): | 5093 |
| Description |
|
A sysadmin was shutting down an MDS node cleanly in preparation for scheduled upgrades. During the umount of the MGS device, we hit the following assertion:
LustreError ... (lprocfs_status.c:1060:lprocfs_free_client_stats())
And the stack trace was:
:obdclass:lprocfs_free_client_stats
We have seen this same assertion from OSTs as well. Some investigation was done in bug 23499, but there is not yet a solution. |
| Comments |
| Comment by Dan Ferber (Inactive) [ 11/Jan/11 ] |
|
Bobijam is taking this bug. |
| Comment by Zhenyu Xu [ 13/Jan/11 ] |
|
There is a race window between server obd cleanup and its handling In class_cleanup() before the server obd's stopping flag is set, a new So the race makes a supposed 0 referenced server obd has export on it, |
| Comment by Zhenyu Xu [ 13/Jan/11 ] |
|
patch is tracked at http://review.whamcloud.com/211 |
| Comment by Zhenyu Xu [ 21/Jan/11 ] |
|
As Niu mentioned in the Gerrit review: "The nid_stats_hash will be destroyed before the obd refcount drops to zero (see class_cleanup()), so the target_handle_reconnect() will not be able to add any elements into the nid_stats_hash. I'm afraid this crash is caused by other defects. Maybe in the lprocfs_exp_setup(), it handles the (old_stat != new_stat) case incorrectly?" Yes, the patch I posted fixes a different issue, one that does not lead to this assertion failure. I checked the lprocfs_exp_setup() code and it handles the (old_stat != new_stat) case correctly, so some other race must be leaving this extra nid_exp_ref_count at the cleanup phase. (In the obd cleanup phase, obd_nid_stats_hash is destroyed in class_cleanup(), and mgs_cleanup() is not called until the obd's refcount drops to 0, which means all client exports have been dereferenced; class_export_destroy() decreases the obd's refcount.) |
| Comment by Zhenyu Xu [ 24/Jan/11 ] |
|
Christopher, do any of the clients use multiple NIDs to connect to the servers? If a client is configured with multiple NIDs and the connected NID encounters a problem, the client reconnects to the server with a new NID, and lprocfs_exp_setup() misses releasing the old NID's stats refcount. |
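The suspected leak can be illustrated with a small standalone model. This is only a sketch and is not the Lustre code or the actual patch: the structs and every function name below (nid_stat_model, export_model, exp_setup_leaky, exp_setup_fixed, and so on) are hypothetical stand-ins. It assumes that each export holds one reference on a per-NID stats entry, and that a reconnect from a different NID repoints the export at a new entry.

```c
#include <stddef.h>

/* Hypothetical stand-in for a per-NID stats entry; not the Lustre struct. */
struct nid_stat_model {
    const char *nid;
    int exp_ref_count;              /* models nid_exp_ref_count */
};

/* Hypothetical stand-in for an export that references one stats entry. */
struct export_model {
    struct nid_stat_model *stat;
};

typedef void (*setup_fn)(struct export_model *, struct nid_stat_model *);

void stat_get(struct nid_stat_model *s) { s->exp_ref_count++; }
void stat_put(struct nid_stat_model *s) { s->exp_ref_count--; }

/* Leaky variant: repoints the export at the new NID's entry but never
 * drops the reference it still holds on the old entry. */
void exp_setup_leaky(struct export_model *exp, struct nid_stat_model *new_stat)
{
    if (exp->stat == new_stat)
        return;                     /* same NID: keep the existing reference */
    stat_get(new_stat);
    exp->stat = new_stat;           /* reference on the old entry is lost here */
}

/* One plausible shape of a fix (an assumption): release the old entry's
 * reference once the export has moved to the new one. */
void exp_setup_fixed(struct export_model *exp, struct nid_stat_model *new_stat)
{
    struct nid_stat_model *old_stat = exp->stat;

    if (old_stat == new_stat)
        return;
    stat_get(new_stat);
    exp->stat = new_stat;
    if (old_stat != NULL)
        stat_put(old_stat);
}

/* Export teardown drops the reference on whatever entry it points at. */
void exp_destroy(struct export_model *exp)
{
    if (exp->stat != NULL) {
        stat_put(exp->stat);
        exp->stat = NULL;
    }
}

/* Connect via one NID, reconnect via another, then destroy the export.
 * Returns the references still outstanding across both entries. */
int leftover_after_reconnect(setup_fn setup)
{
    struct nid_stat_model a = { "nid-a", 0 };
    struct nid_stat_model b = { "nid-b", 0 };
    struct export_model exp = { NULL };

    setup(&exp, &a);                /* initial connect */
    setup(&exp, &b);                /* reconnect from a different NID */
    exp_destroy(&exp);
    return a.exp_ref_count + b.exp_ref_count;
}
```

In this model, leftover_after_reconnect(exp_setup_leaky) leaves one reference outstanding on the old entry, which is the kind of state a "count == 0" check at cleanup time would trip over, while leftover_after_reconnect(exp_setup_fixed) leaves none.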
| Comment by D. Marc Stearman (Inactive) [ 24/Jan/11 ] |
|
The OSS nodes and the MDS have multiple NIDs, and the clients only have a single NID that they use to talk to the servers. Could this be caused by the MDS losing connection with the OSS nodes, and reconnecting using a different NID? The MDS would be similar to a client in this case. |
| Comment by Zhenyu Xu [ 24/Jan/11 ] |
|
Yes, that could be the case. |
| Comment by Liang Zhen (Inactive) [ 25/Jan/11 ] |
|
Bobi, I think we might want this patch for 2.x as well. I just took a quick look and found that it's also in 2.x. Thanks |
| Comment by Zhenyu Xu [ 25/Jan/11 ] |
|
HEAD version patch. |
| Comment by Build Master (Inactive) [ 21/Mar/11 ] |
|
Integrated in Bobi Jam : 8efdff9aeb1933f9b1e7320ff48ad84983e4daa3
|
| Comment by Build Master (Inactive) [ 22/Mar/11 ] |
|
Integrated in Oleg Drokin : 2a6045403fbd46bb6501df907f0321f5401924ba
|