[LU-7530] upcall_cache_flush()) ASSERTION( !atomic_read(&entry->ue_refcount) ) failed Created: 09/Dec/15  Updated: 14/Dec/15  Resolved: 14/Dec/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Major
Reporter: Oleg Drokin Assignee: Oleg Drokin
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-7199 Null pointer dereference in old_init_... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Just about a week ago I started to see these assertions on master:

<4>[90142.530259] Lustre: DEBUG MARKER: == recovery-small test 27: fail LOV while using OSC's == 14:16:08 (1449602168)
<4>[90144.006796] Lustre: Failing over lustre-MDT0000
<3>[90144.016902] LustreError: 528:0:(mdt_lib.c:493:old_init_ucred_common()) lustre-MDT0000: cli 7f023601-9100-9039-6ea4-5561717d207a/ffff880092a417f0 nodemap not set.
<6>[90144.018588] Lustre: lustre-MDT0000: Not available for connect from 0@lo (stopping)
<0>[90144.116494] LustreError: 1400:0:(upcall_cache.c:378:upcall_cache_flush()) ASSERTION( !atomic_read(&entry->ue_refcount) ) failed: 
<0>[90144.117605] LustreError: 1400:0:(upcall_cache.c:378:upcall_cache_flush()) LBUG
<4>[90144.118625] Pid: 1400, comm: umount
<4>[90144.123256] 
<4>[90144.123257] Call Trace:
<4>[90144.124529]  [<ffffffffa0ada885>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4>[90144.130691]  [<ffffffffa0adae87>] lbug_with_loc+0x47/0xb0 [libcfs]
<4>[90144.132956]  [<ffffffffa0e13a64>] upcall_cache_flush+0x164/0x1c0 [obdclass]
<4>[90144.134555]  [<ffffffffa0e13ae0>] upcall_cache_cleanup+0x20/0xc0 [obdclass]
<4>[90144.135671]  [<ffffffffa08a6627>] mdt_device_fini+0x8c7/0x1450 [mdt]
<4>[90144.136490]  [<ffffffffa0dcfa1c>] ? class_disconnect_exports+0x17c/0x2f0 [obdclass]
<4>[90144.138119]  [<ffffffffa0de8d22>] class_cleanup+0x572/0xd20 [obdclass]
<4>[90144.139591]  [<ffffffffa0dcac8c>] ? class_name2dev+0x7c/0xe0 [obdclass]
<4>[90144.140493]  [<ffffffffa0deb246>] class_process_config+0x1d76/0x26d0 [obdclass]
<4>[90144.142081]  [<ffffffffa0ae6701>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
<4>[90144.143093]  [<ffffffffa0dec05f>] class_manual_cleanup+0x4bf/0xe10 [obdclass]
<4>[90144.144273]  [<ffffffffa0dcac8c>] ? class_name2dev+0x7c/0xe0 [obdclass]
<4>[90144.145183]  [<ffffffffa0e2099c>] server_put_super+0x9bc/0xe80 [obdclass]
<4>[90144.146170]  [<ffffffff811b141a>] ? invalidate_inodes+0xfa/0x180
<4>[90144.147277]  [<ffffffff8119564b>] generic_shutdown_super+0x5b/0xe0
<4>[90144.148106]  [<ffffffff81195736>] kill_anon_super+0x16/0x60
<4>[90144.150251]  [<ffffffffa0def756>] lustre_kill_super+0x36/0x60 [obdclass]
<4>[90144.156662]  [<ffffffff81195ed7>] deactivate_super+0x57/0x80
<4>[90144.158768]  [<ffffffff811b5e2f>] mntput_no_expire+0xbf/0x110
<4>[90144.162013]  [<ffffffff811b699b>] sys_umount+0x7b/0x3a0
<4>[90144.164183]  [<ffffffff8100b112>] system_call_fastpath+0x16/0x1b
<4>[90144.166668] 

I think this was introduced by http://review.whamcloud.com/16802:

+               identity = mdt_identity_get(mdt->mdt_identity_cache,
...
+       uc->uc_identity = identity;
 
+       if (nodemap == NULL) {
+               CERROR("%s: cli %s/%p nodemap not set.\n",
+                     mdt2obd_dev(mdt)->obd_name,
+                     info->mti_exp->exp_client_uuid.uuid, info->mti_exp);
+               RETURN(-EACCES);

So we are leaking the identity refcount in this case, whereas we did not before.
Probably we just need to move the if (nodemap == NULL) check to the start of the function, before the mdt_identity_get() call?



 Comments   
Comment by Gerrit Updater [ 09/Dec/15 ]

Oleg Drokin (oleg.drokin@intel.com) uploaded a new patch: http://review.whamcloud.com/17519
Subject: LU-7530 mdt: Do not leak identity when no nodemap is present
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 54275f4fb643b8386114ce760ee58142600ae34f

Comment by Kit Westneat [ 09/Dec/15 ]

Interesting, I wonder how the nodemap became null. I guess it must be that the client was disconnected by mdt_obd_disconnect while old_init_ucred was running in a different thread?

Comment by Oleg Drokin [ 09/Dec/15 ]

Yes, that's plausible; the tests it usually fails on all involve doing work while disconnections/reconnections are in progress.

Comment by Gerrit Updater [ 13/Dec/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17519/
Subject: LU-7530 mdt: Do not leak identity when no nodemap is present
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 10e516109d7bb9863f1ca066a7e0842b9c7fbcb2

Comment by Peter Jones [ 14/Dec/15 ]

Landed for 2.8

Generated at Sat Feb 10 02:09:41 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.