[LU-2306] class_decref can release obd_device which is being handled now Created: 08/Nov/12  Updated: 09/Jan/20  Resolved: 09/Jan/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 1.8.x (1.8.0 - 1.8.5)
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Hiroya Nozaki Assignee: Lai Siyao
Resolution: Cannot Reproduce Votes: 0
Labels: patch
Environment:

FEFS, based on Lustre-1.8.5
MDSx1, OSSx1(OSTx3), Clientx1


Severity: 3
Rank (Obsolete): 5516

 Description   

I've often observed that a couple of quick up and down obd_refcount tends to cause two problems at class_decref() in Lustre-1.8.x. so I had took into this problem for a while, and now I believe that I've found what's behind and that the both problems can happen in Lustre-2.3.x too. So please let me notify the both problems.

One is that calling class_decref() from two different threads at the almost same timing tends to result in that one thread can free obd_device while another's processing the operation of class_decref()'s obd_refcount==1 with the same obd_device.

The other is that if thread A increases and decreases an obd_refcount in a short period of time after thread B decreased the refcount and the refcount reached 1, it allow thread A to process obd_refcount==1's operations again. This case results in LBUG in obd_precleanup() or class_unlink_export() in Lustre-1.8.x



 Comments   
Comment by Hiroya Nozaki [ 08/Nov/12 ]

The below call stack will be printed when obd_device has been released while obd_precleanup() is still going on.

PID: 19436  TASK: ffff810c3f513820  CPU: 9   COMMAND: "obd_zombid" 
 #0 [ffff81051a92ba90] crash_kexec at ffffffff800aeb6b
 #1 [ffff81051a92bb50] __die at ffffffff80066157
 #2 [ffff81051a92bb90] die at ffffffff8006cce5
 #3 [ffff81051a92bbc0] do_general_protection at ffffffff8006659f
 #4 [ffff81051a92bc00] error_exit at ffffffff8005ede9
    [exception RIP: class_disconnect+656]
    RIP: ffffffff886e8300  RSP: ffff81051a92bcb0  RFLAGS: 00010206
    RAX: 5a5a5a5a5a5a5a5a  RBX: ffff810373258400  RCX: 000000000000012f
    RDX: ffff8103732584a8  RSI: 5a5a5a5a5a5a5a5a  RDI: ffff810399370480
    RBP: ffff810c384f84a0   R8: 00000000fffffffd   R9: 0000000000000020
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff810c384f8038
    R13: ffff810373258400  R14: 0000000000000000  R15: 00002abf83959010
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #5 [ffff81051a92bcf8] lov_disconnect at ffffffff88a72f88
 #6 [ffff81051a92bd58] mds_lov_clean at ffffffff888fefa4
 #7 [ffff81051a92bdb8] mds_precleanup at ffffffff8890e1a9
 #8 [ffff81051a92be08] class_decref at ffffffff886fd02e
 #9 [ffff81051a92be58] class_import_destroy at ffffffff886e4722
#10 [ffff81051a92bea8] obd_zombie_impexp_cull at ffffffff886e4ae5
#11 [ffff81051a92bec8] obd_zombie_impexp_thread at ffffffff886eb0ac
#12 [ffff81051a92bf48] kernel_thread at ffffffff8005efb1

PID: 19740  TASK: ffff810c3bd95860  CPU: 9   COMMAND: "umount" 
 #0 [ffff810c3b3fdc58] schedule at ffffffff80063f96
 #1 [ffff810c3b3fdd30] schedule_timeout at ffffffff800648ab
 #2 [ffff810c3b3fdd80] server_put_super at ffffffff88712e59
 #3 [ffff810c3b3fddb0] vfs_quota_sync at ffffffff80051e00
 #4 [ffff810c3b3fde60] generic_shutdown_super at ffffffff800e4bc0
 #5 [ffff810c3b3fde80] kill_anon_super at ffffffff800e4c90
 #6 [ffff810c3b3fde90] deactivate_super at ffffffff800e4d41
 #7 [ffff810c3b3fdeb0] sys_umount at ffffffff800ee830
 #8 [ffff810c3b3fdf80] system_call at ffffffff8005e116
    RIP: 000000394bad3dd7  RSP: 00007fffb3c04cf8  RFLAGS: 00010246
    RAX: 00000000000000a6  RBX: ffffffff8005e116  RCX: 00000000004068df
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 000000001eca3030
    RBP: 000000001eca3010   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000246  R12: 00007fffb3c05208
    R13: 000000001eca3030  R14: 000000001eca3090  R15: 0000000000000000
    ORIG_RAX: 00000000000000a6  CS: 0033  SS: 002b

And the below two call stacks will be printed when class_decref() has proceeded refcount==1's operation twice.

PID: 24561  TASK: ffff810951294100  CPU: 0   COMMAND: "obd_zombid"
 #0 [ffffffff80449de0] crash_kexec at ffffffff800aeb6b
 #1 [ffffffff80449ea0] __die at ffffffff80066157
 #2 [ffffffff80449ee0] die at ffffffff8006cce5
 #3 [ffffffff80449f10] do_stack_segment at ffffffff8006d632
 #4 [ffffffff80449f50] stack_segment at ffffffff8005f221
    [exception RIP: __list_add+46]
    RIP: ffffffff80154df4  RSP: ffff81057e01fd50  RFLAGS: 00010046
    RAX: 0000000000000000  RBX: ffff810c3ba503b8  RCX: 0000000000000066
    RDX: 5a5a5a5a5a5a5a5a  RSI: 5a5a5a5a5a5a5a5a  RDI: ffff81057e01fd70
    RBP: 5a5a5a5a5a5a5a5a   R8: 00000000ffffffff   R9: 0000000000000020
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff81057e01fd70
    R13: ffff810c3ba50038  R14: 0000000000000000  R15: 00002ae881c8e010
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <STACKFAULT exception stack> ---
 #5 [ffff81057e01fd50] __list_add at ffffffff80154df4
 #6 [ffff81057e01fd68] __down_write_nested at ffffffff800655fe
 #7 [ffff81057e01fda8] mds_lov_clean at ffffffff888fec55
 #8 [ffff81057e01fe08] mds_precleanup at ffffffff8890e1f9
 #9 [ffff81057e01fe58] class_decref at ffffffff886fd08e
#10 [ffff81057e01fea8] obd_zombie_impexp_cull at ffffffff886e4af2
#11 [ffff81057e01fec8] obd_zombie_impexp_thread at ffffffff886eb0ec
#12 [ffff81057e01ff48] kernel_thread at ffffffff8005efb1

PID: 14687  TASK: ffff8105e6eb7860  CPU: 13  COMMAND: "ldlm_elt" 
 #0 [ffff810c3fc63de0] crash_kexec at ffffffff800aeb6b
 #1 [ffff810c3fc63ea0] __die at ffffffff80066157
 #2 [ffff810c3fc63ee0] die at ffffffff8006cce5
 #3 [ffff810c3fc63f10] do_stack_segment at ffffffff8006d632
 #4 [ffff810c3fc63f50] stack_segment at ffffffff8005f221
    [exception RIP: __list_add+46]
    RIP: ffffffff80154df4  RSP: ffff810303ea5c80  RFLAGS: 00010046
    RAX: 0000000000000000  RBX: ffff810c3b36c3b8  RCX: 0000000000000187
    RDX: 5a5a5a5a5a5a5a5a  RSI: 5a5a5a5a5a5a5a5a  RDI: ffff810303ea5ca0
    RBP: 5a5a5a5a5a5a5a5a   R8: 00000000ffffffff   R9: 0000000000000020
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff810303ea5ca0
    R13: ffff810c3b36c038  R14: 0000000000000001  R15: ffff810303ea5ee8
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <STACKFAULT exception stack> ---
 #5 [ffff810303ea5c80] __list_add at ffffffff80154df4
 #6 [ffff810303ea5c98] __down_write_nested at ffffffff800655fe
 #7 [ffff810303ea5cd8] mds_lov_clean at ffffffff888fec55
 #8 [ffff810303ea5d38] mds_precleanup at ffffffff8890e1f9
 #9 [ffff810303ea5d88] class_decref at ffffffff886fd08e
#10 [ffff810303ea5dd8] class_evict_all_exports_by_uuid at ffffffff886e92fa
#11 [ffff810303ea5e58] expired_lock_main at ffffffff887d7b39
#12 [ffff810303ea5f48] kernel_thread at ffffffff8005efb1
Comment by Lai Siyao [ 08/Nov/12 ]

Hmm, class refcounting is messy, IMO the correct way to fix is:

  • refcount decrease and check should be atomic, when refcount reaches zero, obd device will be unlinked, cleaned up, detached and released. And it should never be seen by others.
  • class_incref() should LASSERT obd refcount > 0.
Comment by Hiroya Nozaki [ 08/Nov/12 ]

hm, but class_unlink_export() has class_export_put() in it.
so I belive that the refcount never reachs 0 until class_unlink_export() has been finished.

by the way, could this be a kind of the way to fix it ?
I know I have to refrain from adding new members to obd_device, but I think this kind of fixes could be very simple and understandable.

class_decref()
void class_decref(struct obd_device *obd)
{
        int err;
        int refs;

        mutex_down(&obd->obd_dev_cleanup_sem); <----- adds a new mutex to obd_device in order to progress operations here atomically
        spin_lock(&obd->obd_dev_lock);
        atomic_dec(&obd->obd_refcount);
        refs = atomic_read(&obd->obd_refcount);
        spin_unlock(&obd->obd_dev_lock);

        CDEBUG(D_CONFIG, "Decref %s (%p) now %d\n", obd->obd_name, obd, refs);

        if ((refs == 1) && obd->obd_stopping &&
            !obd->obd_precleaned_up) {      <----- adds a new member of stage to obd_device and check here
                obd->obd_precleaned_up = 1; <----- set obd_precleaned_up 1 so as not to process here again

                ...

                mutex_up(&obd->obd_dev_cleanup_sem);
                return;
        }
        mutex_up(&obd->obd_dev_cleanup_sem);

        if (refs == 0) {

                ...

        }
}

Comment by Lai Siyao [ 09/Nov/12 ]

Yes, I mean the refcount class_unlink_export() put is excessive (it should not be taken), so anytime when obd device refcount reaches zero, it will do class_unlink_export() and all below, and IMHO at this moment obd->obd_stopping should be set. If there is case obd device refcount is zero but should not be released, IMHO an extra refcount should be taken somewhere to avoid this happening.

I'd like to see a tidy fix for all this.

Comment by Hiroya Nozaki [ 09/Nov/12 ]

OK, i got it.
then, I'll try to deal with this problem. Please wait a while. I'll upload a draft way.

Comment by Hiroya Nozaki [ 09/Nov/12 ]

I uploaded a proto type. So could you see whether or not my direction is corret ?

http://review.whamcloud.com/4502

Comment by Lai Siyao [ 09/Nov/12 ]

I think your patch is incorrect. The reason why current class_decref() checks refcount 1, and then calls class_unlink_export() is that export is culled asynchronously (see https://bugzilla.lustre.org/show_bug.cgi?id=15716), and the last refcount of this obd device should be put by the zombie cull thread (see class_export_destroy()).

But if others get and put this export in this period, it may cause the issue you mentioned. IMO in this period, this obd device should not be get by others, and it's destined to get cleanup and released soon.

Please check related code and logic, and find a clean way to fix.

Comment by Hiroya Nozaki [ 12/Nov/12 ]

Thank you for your advice and the useful link !!!

I uploaded the new patch. The main difference to the previous one is using obd_set_up insted of using obd_stopping in class_decref(). It's because obd_device is released by class_decref in class_detach() even when class_setup failed, I mean that obd_device is not set obd_stopping at that kind of cases.

Comment by Hiroya Nozaki [ 12/Nov/12 ]

hm, Maloo said that my patch was affected by LU-2200 judging from the log of test_32a.
I think that my patch is still rougth so I'd like to remake it after hearing your comment, if you have.

Comment by Hiroya Nozaki [ 15/Nov/12 ]

I've completed acceptance-small with a new patch, thought it's for 2.3.54 instead of 2.3.53 which is the version which I made the previous patches for. this is because I've often watched LU-2129 in my environment.

Even after finishing acc-sm, I'm still wandering if I'm allowed to put obd_self_export in obd_detach(), and then, call class_decref() after that. I mean that the class_decref() in obd_detach() may do the last down of obd_refcount in almost all of cases. So I think it couldn't be true that the last down of obd_refcount should be done by zombie thread, you said before. So that's still available to this case, probably I have to a bit change the order of class_decref() and class_cleanup_self_export().

Comment by Lai Siyao [ 15/Nov/12 ]

Previously I mean the obd refcount taken by self_export will be put by zombie thread, but generally the last refcount of obd should be put by class_detach() (if xxx_barrier() is called like your patch).

Comment by Hiroya Nozaki [ 15/Nov/12 ]

Thank you for your advice, I really appriciate it !!!

By the way, I'm planning to fix lprocfs_obd_setup(). I think it's safer to hand in obd->obd_minor to procfs entry when it's registered than to hand in an address of obd_device directly. When a procfs holds obd_minor, its proc hander can check the existence of the odd_device, like the below pseudo code.

int minor = data;
struct obd_device *obd;
cfs_read_lock(&obd_dev_lock);
obd = class_num2obd(minor);
if (obd->obd_stopping) { ... class_export_get(obd->obd_self_export) ... } /* or */ 
if (obd->obd_attached) { ... class_export_get(obd->obd_self_export) ... } /* or something along with these lines depending on context*/
cfs_read_unlock(&obd_dev_lock);

I think that getting self export's refcount like this way allows a thread which is using a procfs entry to hold obd_refcount, and it also allows the thread to release obd_device by the zombie thread because self export is never released. I mean we can avoid that _lprocfs_lock problem which we discussed about in LU-2258 with this way.

Could I make a new ticket for this issue ?

Comment by Hiroya Nozaki [ 15/Nov/12 ]

Ehh, but in this way, I may lose obd->obd_self_export after class_cleanup() because of my patch for this issue ... hm, it's a kind of difficult ...

Comment by Hiroya Nozaki [ 16/Nov/12 ]

patch for b1_8

http://review.whamcloud.com/#change,4595

Comment by Hiroya Nozaki [ 16/Nov/12 ]

Apparently Hudson is bound to fail to compile packages for b1_8. I always get error in the compilation for ubuntu1004.
What should I do in this case ?

dpkg-buildpackage: error: tail of debian/changelog gave error exit status 1
export KPKG_DEST_DIR="$(pwd)/.." && \
	version=$(sed -ne '1s/^lustre (\(.*\)).*$/\1/p' debian/changelog) && \
	rm -rf debian/tmp/modules-deb && \
	mkdir debian/tmp/modules-deb && \
	pushd debian/tmp/modules-deb && \
	dpkg -x ../../../../lustre-source_${version}_all.deb $(pwd) && \
	mkdir usr_src/ && \
	tar -C usr_src/ -xjf usr/src/lustre.tar.bz2 && \
	chmod 755 usr_src/modules/lustre/debian/rules && \
	mkdir -p usr_share_modass && \
	ln -s /usr/share/modass/include/ usr_share_modass/ && \
	ln -s /usr/share/modass/packages/ usr_share_modass/ && \
	echo "lustre" > usr_share_modass/compliant.list && \
	export MA_DIR=$(pwd)/usr_share_modass && \
	KVERS=${KVERS:-2.6.32-41-server}; \
	m-a build ${KVERS:+-l $KVERS} -i -u $(pwd) lustre && \
	popd && \
	VER=$(sed -ne '1s/^lustre (\(.*-[0-9][0-9]*\)).*$/\1/p' debian/changelog); \
	mkdir -p debs && \
	mv ../liblustre_${VER}_*.deb ../linux-patch-lustre_${VER}_all.deb ../lustre-dev_${VER}_*.deb ../lustre-source_${VER}_all.deb ../lustre-tests_${VER}_*.deb ../lustre-utils_${VER}_*.deb ../lustre_${VER}.dsc ../lustre_${VER}_*.changes ../lustre_${VER%-[0-9]*}.orig.tar.gz ../lustre_${VER}.diff.gz ../lustre-client-modules-${KVERS}_${VER}_*.deb debs/
sed: can't read debian/changelog: No such file or directory
lustre, what is lustre?
mv: cannot stat `../liblustre__*.deb': No such file or directory
mv: cannot stat `../linux-patch-lustre__all.deb': No such file or directory
mv: cannot stat `../lustre-dev__*.deb': No such file or directory
mv: cannot stat `../lustre-source__all.deb': No such file or directory
mv: cannot stat `../lustre-tests__*.deb': No such file or directory
mv: cannot stat `../lustre-utils__*.deb': No such file or directory
mv: cannot stat `../lustre_.dsc': No such file or directory
mv: cannot stat `../lustre__*.changes': No such file or directory
mv: cannot stat `../lustre_.orig.tar.gz': No such file or directory
mv: cannot stat `../lustre_.diff.gz': No such file or directory
mv: cannot stat `../lustre-client-modules-__*.deb': No such file or directory
make: *** [debs] Error 1
+ rc=2
+ '[' 2 '!=' 0 ']'
+ echo 'Build failed'
Build failed
+ exit 2
Build step 'Execute shell' marked build as failure
Archiving artifacts
Finished: FAILURE
Comment by Andreas Dilger [ 09/Jan/20 ]

Close old ticket.

Generated at Sat Feb 10 01:24:06 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.