[LU-2306] class_decref can release obd_device which is being handled now Created: 08/Nov/12 Updated: 09/Jan/20 Resolved: 09/Jan/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0, Lustre 1.8.x (1.8.0 - 1.8.5) |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Hiroya Nozaki | Assignee: | Lai Siyao |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | patch | ||
| Environment: |
FEFS, based on Lustre-1.8.5 |
||
| Severity: | 3 |
| Rank (Obsolete): | 5516 |
| Description |
|
I've often observed that a couple of quick up and down obd_refcount tends to cause two problems at class_decref() in Lustre-1.8.x. so I had took into this problem for a while, and now I believe that I've found what's behind and that the both problems can happen in Lustre-2.3.x too. So please let me notify the both problems. One is that calling class_decref() from two different threads at the almost same timing tends to result in that one thread can free obd_device while another's processing the operation of class_decref()'s obd_refcount==1 with the same obd_device. The other is that if thread A increases and decreases an obd_refcount in a short period of time after thread B decreased the refcount and the refcount reached 1, it allow thread A to process obd_refcount==1's operations again. This case results in LBUG in obd_precleanup() or class_unlink_export() in Lustre-1.8.x |
| Comments |
| Comment by Hiroya Nozaki [ 08/Nov/12 ] |
|
The below call stack will be printed when obd_device has been released while obd_precleanup() is still going on. PID: 19436 TASK: ffff810c3f513820 CPU: 9 COMMAND: "obd_zombid"
#0 [ffff81051a92ba90] crash_kexec at ffffffff800aeb6b
#1 [ffff81051a92bb50] __die at ffffffff80066157
#2 [ffff81051a92bb90] die at ffffffff8006cce5
#3 [ffff81051a92bbc0] do_general_protection at ffffffff8006659f
#4 [ffff81051a92bc00] error_exit at ffffffff8005ede9
[exception RIP: class_disconnect+656]
RIP: ffffffff886e8300 RSP: ffff81051a92bcb0 RFLAGS: 00010206
RAX: 5a5a5a5a5a5a5a5a RBX: ffff810373258400 RCX: 000000000000012f
RDX: ffff8103732584a8 RSI: 5a5a5a5a5a5a5a5a RDI: ffff810399370480
RBP: ffff810c384f84a0 R8: 00000000fffffffd R9: 0000000000000020
R10: 0000000000000000 R11: 0000000000000000 R12: ffff810c384f8038
R13: ffff810373258400 R14: 0000000000000000 R15: 00002abf83959010
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#5 [ffff81051a92bcf8] lov_disconnect at ffffffff88a72f88
#6 [ffff81051a92bd58] mds_lov_clean at ffffffff888fefa4
#7 [ffff81051a92bdb8] mds_precleanup at ffffffff8890e1a9
#8 [ffff81051a92be08] class_decref at ffffffff886fd02e
#9 [ffff81051a92be58] class_import_destroy at ffffffff886e4722
#10 [ffff81051a92bea8] obd_zombie_impexp_cull at ffffffff886e4ae5
#11 [ffff81051a92bec8] obd_zombie_impexp_thread at ffffffff886eb0ac
#12 [ffff81051a92bf48] kernel_thread at ffffffff8005efb1
PID: 19740 TASK: ffff810c3bd95860 CPU: 9 COMMAND: "umount"
#0 [ffff810c3b3fdc58] schedule at ffffffff80063f96
#1 [ffff810c3b3fdd30] schedule_timeout at ffffffff800648ab
#2 [ffff810c3b3fdd80] server_put_super at ffffffff88712e59
#3 [ffff810c3b3fddb0] vfs_quota_sync at ffffffff80051e00
#4 [ffff810c3b3fde60] generic_shutdown_super at ffffffff800e4bc0
#5 [ffff810c3b3fde80] kill_anon_super at ffffffff800e4c90
#6 [ffff810c3b3fde90] deactivate_super at ffffffff800e4d41
#7 [ffff810c3b3fdeb0] sys_umount at ffffffff800ee830
#8 [ffff810c3b3fdf80] system_call at ffffffff8005e116
RIP: 000000394bad3dd7 RSP: 00007fffb3c04cf8 RFLAGS: 00010246
RAX: 00000000000000a6 RBX: ffffffff8005e116 RCX: 00000000004068df
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000001eca3030
RBP: 000000001eca3010 R8: 0000000000000000 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fffb3c05208
R13: 000000001eca3030 R14: 000000001eca3090 R15: 0000000000000000
ORIG_RAX: 00000000000000a6 CS: 0033 SS: 002b
And the below two call stacks will be printed when class_decref() has proceeded refcount==1's operation twice. PID: 24561 TASK: ffff810951294100 CPU: 0 COMMAND: "obd_zombid"
#0 [ffffffff80449de0] crash_kexec at ffffffff800aeb6b
#1 [ffffffff80449ea0] __die at ffffffff80066157
#2 [ffffffff80449ee0] die at ffffffff8006cce5
#3 [ffffffff80449f10] do_stack_segment at ffffffff8006d632
#4 [ffffffff80449f50] stack_segment at ffffffff8005f221
[exception RIP: __list_add+46]
RIP: ffffffff80154df4 RSP: ffff81057e01fd50 RFLAGS: 00010046
RAX: 0000000000000000 RBX: ffff810c3ba503b8 RCX: 0000000000000066
RDX: 5a5a5a5a5a5a5a5a RSI: 5a5a5a5a5a5a5a5a RDI: ffff81057e01fd70
RBP: 5a5a5a5a5a5a5a5a R8: 00000000ffffffff R9: 0000000000000020
R10: 0000000000000000 R11: 0000000000000000 R12: ffff81057e01fd70
R13: ffff810c3ba50038 R14: 0000000000000000 R15: 00002ae881c8e010
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <STACKFAULT exception stack> ---
#5 [ffff81057e01fd50] __list_add at ffffffff80154df4
#6 [ffff81057e01fd68] __down_write_nested at ffffffff800655fe
#7 [ffff81057e01fda8] mds_lov_clean at ffffffff888fec55
#8 [ffff81057e01fe08] mds_precleanup at ffffffff8890e1f9
#9 [ffff81057e01fe58] class_decref at ffffffff886fd08e
#10 [ffff81057e01fea8] obd_zombie_impexp_cull at ffffffff886e4af2
#11 [ffff81057e01fec8] obd_zombie_impexp_thread at ffffffff886eb0ec
#12 [ffff81057e01ff48] kernel_thread at ffffffff8005efb1
PID: 14687 TASK: ffff8105e6eb7860 CPU: 13 COMMAND: "ldlm_elt"
#0 [ffff810c3fc63de0] crash_kexec at ffffffff800aeb6b
#1 [ffff810c3fc63ea0] __die at ffffffff80066157
#2 [ffff810c3fc63ee0] die at ffffffff8006cce5
#3 [ffff810c3fc63f10] do_stack_segment at ffffffff8006d632
#4 [ffff810c3fc63f50] stack_segment at ffffffff8005f221
[exception RIP: __list_add+46]
RIP: ffffffff80154df4 RSP: ffff810303ea5c80 RFLAGS: 00010046
RAX: 0000000000000000 RBX: ffff810c3b36c3b8 RCX: 0000000000000187
RDX: 5a5a5a5a5a5a5a5a RSI: 5a5a5a5a5a5a5a5a RDI: ffff810303ea5ca0
RBP: 5a5a5a5a5a5a5a5a R8: 00000000ffffffff R9: 0000000000000020
R10: 0000000000000000 R11: 0000000000000000 R12: ffff810303ea5ca0
R13: ffff810c3b36c038 R14: 0000000000000001 R15: ffff810303ea5ee8
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <STACKFAULT exception stack> ---
#5 [ffff810303ea5c80] __list_add at ffffffff80154df4
#6 [ffff810303ea5c98] __down_write_nested at ffffffff800655fe
#7 [ffff810303ea5cd8] mds_lov_clean at ffffffff888fec55
#8 [ffff810303ea5d38] mds_precleanup at ffffffff8890e1f9
#9 [ffff810303ea5d88] class_decref at ffffffff886fd08e
#10 [ffff810303ea5dd8] class_evict_all_exports_by_uuid at ffffffff886e92fa
#11 [ffff810303ea5e58] expired_lock_main at ffffffff887d7b39
#12 [ffff810303ea5f48] kernel_thread at ffffffff8005efb1
|
| Comment by Lai Siyao [ 08/Nov/12 ] |
|
Hmm, class refcounting is messy, IMO the correct way to fix is:
|
| Comment by Hiroya Nozaki [ 08/Nov/12 ] |
|
hm, but class_unlink_export() has class_export_put() in it. by the way, could this be a kind of the way to fix it ? class_decref() void class_decref(struct obd_device *obd)
{
int err;
int refs;
mutex_down(&obd->obd_dev_cleanup_sem); <----- adds a new mutex to obd_device in order to progress operations here atomically
spin_lock(&obd->obd_dev_lock);
atomic_dec(&obd->obd_refcount);
refs = atomic_read(&obd->obd_refcount);
spin_unlock(&obd->obd_dev_lock);
CDEBUG(D_CONFIG, "Decref %s (%p) now %d\n", obd->obd_name, obd, refs);
if ((refs == 1) && obd->obd_stopping &&
!obd->obd_precleaned_up) { <----- adds a new member of stage to obd_device and check here
obd->obd_precleaned_up = 1; <----- set obd_precleaned_up 1 so as not to process here again
...
mutex_up(&obd->obd_dev_cleanup_sem);
return;
}
mutex_up(&obd->obd_dev_cleanup_sem);
if (refs == 0) {
...
}
}
|
| Comment by Lai Siyao [ 09/Nov/12 ] |
|
Yes, I mean the refcount class_unlink_export() put is excessive (it should not be taken), so anytime when obd device refcount reaches zero, it will do class_unlink_export() and all below, and IMHO at this moment obd->obd_stopping should be set. If there is case obd device refcount is zero but should not be released, IMHO an extra refcount should be taken somewhere to avoid this happening. I'd like to see a tidy fix for all this. |
| Comment by Hiroya Nozaki [ 09/Nov/12 ] |
|
OK, i got it. |
| Comment by Hiroya Nozaki [ 09/Nov/12 ] |
|
I uploaded a proto type. So could you see whether or not my direction is corret ? |
| Comment by Lai Siyao [ 09/Nov/12 ] |
|
I think your patch is incorrect. The reason why current class_decref() checks refcount 1, and then calls class_unlink_export() is that export is culled asynchronously (see https://bugzilla.lustre.org/show_bug.cgi?id=15716), and the last refcount of this obd device should be put by the zombie cull thread (see class_export_destroy()). But if others get and put this export in this period, it may cause the issue you mentioned. IMO in this period, this obd device should not be get by others, and it's destined to get cleanup and released soon. Please check related code and logic, and find a clean way to fix. |
| Comment by Hiroya Nozaki [ 12/Nov/12 ] |
|
Thank you for your advice and the useful link !!! I uploaded the new patch. The main difference to the previous one is using obd_set_up insted of using obd_stopping in class_decref(). It's because obd_device is released by class_decref in class_detach() even when class_setup failed, I mean that obd_device is not set obd_stopping at that kind of cases. |
| Comment by Hiroya Nozaki [ 12/Nov/12 ] |
|
hm, Maloo said that my patch was affected by |
| Comment by Hiroya Nozaki [ 15/Nov/12 ] |
|
I've completed acceptance-small with a new patch, thought it's for 2.3.54 instead of 2.3.53 which is the version which I made the previous patches for. this is because I've often watched Even after finishing acc-sm, I'm still wandering if I'm allowed to put obd_self_export in obd_detach(), and then, call class_decref() after that. I mean that the class_decref() in obd_detach() may do the last down of obd_refcount in almost all of cases. So I think it couldn't be true that the last down of obd_refcount should be done by zombie thread, you said before. So that's still available to this case, probably I have to a bit change the order of class_decref() and class_cleanup_self_export(). |
| Comment by Lai Siyao [ 15/Nov/12 ] |
|
Previously I mean the obd refcount taken by self_export will be put by zombie thread, but generally the last refcount of obd should be put by class_detach() (if xxx_barrier() is called like your patch). |
| Comment by Hiroya Nozaki [ 15/Nov/12 ] |
|
Thank you for your advice, I really appriciate it !!! By the way, I'm planning to fix lprocfs_obd_setup(). I think it's safer to hand in obd->obd_minor to procfs entry when it's registered than to hand in an address of obd_device directly. When a procfs holds obd_minor, its proc hander can check the existence of the odd_device, like the below pseudo code. int minor = data;
struct obd_device *obd;
cfs_read_lock(&obd_dev_lock);
obd = class_num2obd(minor);
if (obd->obd_stopping) { ... class_export_get(obd->obd_self_export) ... } /* or */
if (obd->obd_attached) { ... class_export_get(obd->obd_self_export) ... } /* or something along with these lines depending on context*/
cfs_read_unlock(&obd_dev_lock);
I think that getting self export's refcount like this way allows a thread which is using a procfs entry to hold obd_refcount, and it also allows the thread to release obd_device by the zombie thread because self export is never released. I mean we can avoid that _lprocfs_lock problem which we discussed about in Could I make a new ticket for this issue ? |
| Comment by Hiroya Nozaki [ 15/Nov/12 ] |
|
Ehh, but in this way, I may lose obd->obd_self_export after class_cleanup() because of my patch for this issue ... hm, it's a kind of difficult ... |
| Comment by Hiroya Nozaki [ 16/Nov/12 ] |
|
patch for b1_8 |
| Comment by Hiroya Nozaki [ 16/Nov/12 ] |
|
Apparently Hudson is bound to fail to compile packages for b1_8. I always get error in the compilation for ubuntu1004. dpkg-buildpackage: error: tail of debian/changelog gave error exit status 1
export KPKG_DEST_DIR="$(pwd)/.." && \
version=$(sed -ne '1s/^lustre (\(.*\)).*$/\1/p' debian/changelog) && \
rm -rf debian/tmp/modules-deb && \
mkdir debian/tmp/modules-deb && \
pushd debian/tmp/modules-deb && \
dpkg -x ../../../../lustre-source_${version}_all.deb $(pwd) && \
mkdir usr_src/ && \
tar -C usr_src/ -xjf usr/src/lustre.tar.bz2 && \
chmod 755 usr_src/modules/lustre/debian/rules && \
mkdir -p usr_share_modass && \
ln -s /usr/share/modass/include/ usr_share_modass/ && \
ln -s /usr/share/modass/packages/ usr_share_modass/ && \
echo "lustre" > usr_share_modass/compliant.list && \
export MA_DIR=$(pwd)/usr_share_modass && \
KVERS=${KVERS:-2.6.32-41-server}; \
m-a build ${KVERS:+-l $KVERS} -i -u $(pwd) lustre && \
popd && \
VER=$(sed -ne '1s/^lustre (\(.*-[0-9][0-9]*\)).*$/\1/p' debian/changelog); \
mkdir -p debs && \
mv ../liblustre_${VER}_*.deb ../linux-patch-lustre_${VER}_all.deb ../lustre-dev_${VER}_*.deb ../lustre-source_${VER}_all.deb ../lustre-tests_${VER}_*.deb ../lustre-utils_${VER}_*.deb ../lustre_${VER}.dsc ../lustre_${VER}_*.changes ../lustre_${VER%-[0-9]*}.orig.tar.gz ../lustre_${VER}.diff.gz ../lustre-client-modules-${KVERS}_${VER}_*.deb debs/
sed: can't read debian/changelog: No such file or directory
lustre, what is lustre?
mv: cannot stat `../liblustre__*.deb': No such file or directory
mv: cannot stat `../linux-patch-lustre__all.deb': No such file or directory
mv: cannot stat `../lustre-dev__*.deb': No such file or directory
mv: cannot stat `../lustre-source__all.deb': No such file or directory
mv: cannot stat `../lustre-tests__*.deb': No such file or directory
mv: cannot stat `../lustre-utils__*.deb': No such file or directory
mv: cannot stat `../lustre_.dsc': No such file or directory
mv: cannot stat `../lustre__*.changes': No such file or directory
mv: cannot stat `../lustre_.orig.tar.gz': No such file or directory
mv: cannot stat `../lustre_.diff.gz': No such file or directory
mv: cannot stat `../lustre-client-modules-__*.deb': No such file or directory
make: *** [debs] Error 1
+ rc=2
+ '[' 2 '!=' 0 ']'
+ echo 'Build failed'
Build failed
+ exit 2
Build step 'Execute shell' marked build as failure
Archiving artifacts
Finished: FAILURE
|
| Comment by Andreas Dilger [ 09/Jan/20 ] |
|
Close old ticket. |