[LU-9791] When umount client, kobject_put crashed the kernel Created: 23/Jul/17  Updated: 11/Sep/17  Resolved: 10/Sep/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: Lustre 2.11.0

Type: Bug Priority: Critical
Reporter: Li Xi (Inactive) Assignee: John Hammond
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Duplicate
is duplicated by LU-9873 parallel-scale-nfsv4 no sub tests fai... Closed
Related
is related to LU-8066 Move lustre procfs handling to sysfs ... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

I was testing the latest master branch (b2c8846, LU-6210 utils: Use C99 struct initializer for long_opt_start). All the Lustre servers and the client run on the same host. When I umount the client, kobject_put() crashes the kernel.

 

[  118.118013] -----------[ cut here ]-----------
[  118.118025] WARNING: at lib/kobject.c:612 kobject_put+0x50/0x60()
[  118.118028] kobject: '(null)' (ffff88001240eec0): is not initialized, yet kobject_put() is being called.
[  118.118030] Modules linked in: osc(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) ofd(OE) ost(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache dm_mod ppdev sg pcspkr virtio_balloon i2c_piix4 i2c_core parport_pc parport nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 sr_mod cdrom ata_generic pata_acpi virtio_scsi virtio_net virtio_blk ata_piix serio_raw virtio_pci virtio_ring virtio libata floppy
[  118.118140] CPU: 1 PID: 9487 Comm: umount Tainted: G           OE  ------------   3.10.0-514.26.2.el7_lustre.2.10.50_69_g8793c5b.x86_64 #1
[  118.118144] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[  118.118151]  ffff88001203f9e0 00000000ac9255be ffff88001203f998 ffffffff81687383
[  118.118156]  ffff88001203f9d0 ffffffff81085cb0 ffff88001240eec0 ffff8800123a1000
[  118.118160]  ffff88001240e410 ffff880012219300 ffff88001212d800 ffff88001203fa38
[  118.118165] Call Trace:
[  118.118174]  [<ffffffff81687383>] dump_stack+0x19/0x1b
[  118.118180]  [<ffffffff81085cb0>] warn_slowpath_common+0x70/0xb0
[  118.118184]  [<ffffffff81085d4c>] warn_slowpath_fmt+0x5c/0x80
[  118.118188]  [<ffffffff8131aee0>] kobject_put+0x50/0x60
[  118.118239]  [<ffffffffa04d1596>] lprocfs_obd_cleanup+0x56/0x70 [obdclass]
[  118.118252]  [<ffffffffa0f9dcc7>] osc_precleanup+0xe7/0x2c0 [osc]
[  118.118295]  [<ffffffffa04e4f91>] class_cleanup+0x2a1/0xcf0 [obdclass]
[  118.118334]  [<ffffffffa04e79e2>] class_process_config+0x1992/0x23f0 [obdclass]
[  118.118352]  [<ffffffffa0dfe9c5>] ? lov_putref+0x2f5/0xa80 [lov]
[  118.118370]  [<ffffffffa03a2b97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[  118.118408]  [<ffffffffa04e8606>] class_manual_cleanup+0x1c6/0x710 [obdclass]
[  118.118421]  [<ffffffffa0dfe9d2>] lov_putref+0x302/0xa80 [lov]
[  118.118434]  [<ffffffffa0e05d92>] lov_disconnect+0x172/0x420 [lov]
[  118.118461]  [<ffffffffa0ecc853>] obd_disconnect+0xb3/0x330 [lustre]
[  118.118483]  [<ffffffffa0ecfc90>] ll_put_super+0x610/0xaa0 [lustre]
[  118.118490]  [<ffffffff81138fcd>] ? call_rcu_sched+0x1d/0x20
[  118.118531]  [<ffffffffa0efa30c>] ? ll_destroy_inode+0x1c/0x20 [lustre]
[  118.118538]  [<ffffffff8121a8f8>] ? destroy_inode+0x38/0x60
[  118.118542]  [<ffffffff8121aa26>] ? evict+0x106/0x170
[  118.118546]  [<ffffffff8121aace>] ? dispose_list+0x3e/0x50
[  118.118550]  [<ffffffff8121b724>] ? evict_inodes+0x114/0x140
[  118.118557]  [<ffffffff81200f72>] generic_shutdown_super+0x72/0xf0
[  118.118562]  [<ffffffff81201342>] kill_anon_super+0x12/0x20
[  118.118602]  [<ffffffffa04eaf15>] lustre_kill_super+0x45/0x50 [obdclass]
[  118.118607]  [<ffffffff812016f9>] deactivate_locked_super+0x49/0x60
[  118.118611]  [<ffffffff81201cf6>] deactivate_super+0x46/0x60
[  118.118616]  [<ffffffff8121f145>] mntput_no_expire+0xc5/0x120
[  118.118622]  [<ffffffff81220280>] SyS_umount+0xa0/0x3b0
[  118.118627]  [<ffffffff81697a49>] system_call_fastpath+0x16/0x1b
[  118.118630] --[ end trace 8308964b9c22e228 ]--
[  118.118641] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  118.128816] IP: [<ffffffff81333c5b>] __list_add+0x1b/0xc0
[  118.135215] PGD 0
[  118.137535] Oops: 0000 [#1] SMP
[  118.141290] Modules linked in: osc(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) ofd(OE) ost(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache dm_mod ppdev sg pcspkr virtio_balloon i2c_piix4 i2c_core parport_pc parport nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 sr_mod cdrom ata_generic pata_acpi virtio_scsi virtio_net virtio_blk ata_piix serio_raw virtio_pci virtio_ring virtio libata floppy
[  118.191875] CPU: 1 PID: 9487 Comm: umount Tainted: G        W  OE  ------------   3.10.0-514.26.2.el7_lustre.2.10.50_69_g8793c5b.x86_64 #1
[  118.201778] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[  118.206430] task: ffff88003a2bbec0 ti: ffff88001203c000 task.ti: ffff88001203c000
[  118.212428] RIP: 0010:[<ffffffff81333c5b>]  [<ffffffff81333c5b>] __list_add+0x1b/0xc0
[  118.218697] RSP: 0018:ffff88001203f9d8  EFLAGS: 00010046
[  118.222891] RAX: ffff88001203fa00 RBX: ffff88001203fa18 RCX: ffff88001203ffd8
[  118.228673] RDX: ffff88001240ef10 RSI: 0000000000000000 RDI: ffff88001203fa18
[  118.234425] RBP: ffff88001203f9f0 R08: 0000000000000000 R09: 0000000000000259
[  118.240045] R10: 0000000000000000 R11: ffff88001203f696 R12: ffff88001240ef10
[  118.245753] R13: 0000000000000000 R14: ffff88003a2bbec0 R15: ffff88001212d800
[  118.251453] FS:  00007f592e03f880(0000) GS:ffff88003fd00000(0000) knlGS:0000000000000000
[  118.257952] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  118.262534] CR2: 0000000000000000 CR3: 0000000039b90000 CR4: 00000000000006e0
[  118.268281] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  118.274125] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  118.279871] Stack:
[  118.281497]  ffff88001240ef00 ffff88001240ef08 7fffffffffffffff ffff88001203fa50
[  118.287490]  ffffffff8168cdfb 0000000000000001 ffff88003a2bbec0 ffffffff810c54e0
[  118.293586]  0000000000000000 0000000000000000 00000000ac9255be ffff88001240df40
[  118.299707] Call Trace:
[  118.301684]  [<ffffffff8168cdfb>] wait_for_completion+0xeb/0x170
[  118.306465]  [<ffffffff810c54e0>] ? wake_up_state+0x20/0x20
[  118.311048]  [<ffffffffa04d15a2>] lprocfs_obd_cleanup+0x62/0x70 [obdclass]
[  118.316568]  [<ffffffffa0f9dcc7>] osc_precleanup+0xe7/0x2c0 [osc]
[  118.321477]  [<ffffffffa04e4f91>] class_cleanup+0x2a1/0xcf0 [obdclass]
[  118.326709]  [<ffffffffa04e79e2>] class_process_config+0x1992/0x23f0 [obdclass]
[  118.332633]  [<ffffffffa0dfe9c5>] ? lov_putref+0x2f5/0xa80 [lov]
[  118.337354]  [<ffffffffa03a2b97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[  118.342771]  [<ffffffffa04e8606>] class_manual_cleanup+0x1c6/0x710 [obdclass]
[  118.348450]  [<ffffffffa0dfe9d2>] lov_putref+0x302/0xa80 [lov]
[  118.353088]  [<ffffffffa0e05d92>] lov_disconnect+0x172/0x420 [lov]
[  118.357984]  [<ffffffffa0ecc853>] obd_disconnect+0xb3/0x330 [lustre]
[  118.363140]  [<ffffffffa0ecfc90>] ll_put_super+0x610/0xaa0 [lustre]
[  118.368130]  [<ffffffff81138fcd>] ? call_rcu_sched+0x1d/0x20
[  118.372727]  [<ffffffffa0efa30c>] ? ll_destroy_inode+0x1c/0x20 [lustre]
[  118.377968]  [<ffffffff8121a8f8>] ? destroy_inode+0x38/0x60
[  118.382433]  [<ffffffff8121aa26>] ? evict+0x106/0x170
[  118.386409]  [<ffffffff8121aace>] ? dispose_list+0x3e/0x50
[  118.390857]  [<ffffffff8121b724>] ? evict_inodes+0x114/0x140
[  118.395288]  [<ffffffff81200f72>] generic_shutdown_super+0x72/0xf0
[  118.400239]  [<ffffffff81201342>] kill_anon_super+0x12/0x20
[  118.404689]  [<ffffffffa04eaf15>] lustre_kill_super+0x45/0x50 [obdclass]
[  118.409950]  [<ffffffff812016f9>] deactivate_locked_super+0x49/0x60
[  118.414997]  [<ffffffff81201cf6>] deactivate_super+0x46/0x60
[  118.419472]  [<ffffffff8121f145>] mntput_no_expire+0xc5/0x120
[  118.424018]  [<ffffffff81220280>] SyS_umount+0xa0/0x3b0
[  118.428227]  [<ffffffff81697a49>] system_call_fastpath+0x16/0x1b
[  118.433054] Code: ff e9 3b ff ff ff b8 f4 ff ff ff e9 31 ff ff ff 55 48 89 e5 41 55 49 89 f5 41 54 49 89 d4 53 4c 8b 42 08 48 89 fb 49 39 f0 75 2a <4d> 8b 45 00 4d 39 c4 75 68 4c 39 e3 74 3e 4c 39 eb 74 39 49 89
[  118.450312] RIP  [<ffffffff81333c5b>] __list_add+0x1b/0xc0
[  118.454795]  RSP <ffff88001203f9d8>
[  118.457571] CR2: 0000000000000000

If the client runs on a separate host, everything works fine.



 Comments   
Comment by Bruno Faccini (Inactive) [ 24/Jul/17 ]

Well, this was probably introduced by the recent landing on master of the patch "LU-8066 obdclass : Add infrastructure for procfs to sysfs migration", commit 4594c6656d3224eb4f8eff100a2320df53c05a8f.

Do you mean this problem is 100% reproducible on a single-node setup?
If yes, I can try to reproduce it. If not, a crash dump from an occurrence on your side would be helpful.

Comment by Li Xi (Inactive) [ 25/Jul/17 ]

Hi Bruno,

This is 100% reproducible, so would you please reproduce it? Uploading a crash dump takes too much time for me, and I think a reproduction environment will be helpful for testing the patch later.

Comment by Bruno Faccini (Inactive) [ 25/Jul/17 ]

Sure, I told you

Comment by Bruno Faccini (Inactive) [ 26/Jul/17 ]

Hmm, sorry, but mounting and unmounting the client on a single-node setup does not reproduce the problem for me, even after running some activity and file creations in between.
Can you describe in more detail the exact procedure and configuration you are using to reproduce it 100% of the time on your side?

For completeness, I am using the current master version, which has only the following patches on top of yours:

6c63418 LU-9500 lnd: Don't Page Align remote_addr with FastReg
c084c62 LU-9749 llite: Reduce overhead for ll_do_fast_read
834e942 LU-7129 tests: fsx with directio
25e1cea LU-9772 utils: Enable new ZFS MMP on mkfs
9761b5c LU-9769 lnet: Fix lost lock
c4ff984 LU-9019 target: migrate to 64 bit time
829a24f LU-8849 ofd: Client hanges on ladvise with large start values
b2c8846 LU-6210 utils: Use C99 struct initializer for long_opt_start
.......................
Comment by John Hammond [ 16/Aug/17 ]

For this to be triggered, the osp module must be loaded before the osc module. To see why, look at the part of osc_setup() that calls lprocfs_obd_setup().

diff --git a/lustre/tests/test-framework.sh b/lustre/tests/test-framework.sh
index e94f941..6e38ca1 100755
--- a/lustre/tests/test-framework.sh
+++ b/lustre/tests/test-framework.sh
@@ -621,7 +621,6 @@ load_modules_local() {
     load_module fid/fid
     load_module lmv/lmv
     load_module mdc/mdc
-    load_module osc/osc
     load_module lov/lov
     load_module mgc/mgc
     load_module obdecho/obdecho
@@ -656,6 +655,7 @@ load_modules_local() {
                load_module osp/osp
        fi
 
+       load_module osc/osc
        load_module llite/lustre
        [ -d /r ] && OGDB=${OGDB:-"/r/tmp"}
        OGDB=${OGDB:-$TMP}

Then do llmount.sh && umount /mnt/lustre.
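
The load-order dependence described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the Lustre code: every toy_* name is hypothetical, the struct fields are simplified stand-ins for the kernel's kobject state, and the branch on "osp loaded first" is my reading of the comment above rather than a copy of osc_setup(). It shows a setup path that skips the initializing helper on one branch while cleanup drops the reference unconditionally, which matches the "is not initialized, yet kobject_put() is being called" warning in the trace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for struct kobject: only the state needed for the model. */
struct toy_kobject {
	bool initialized;	/* models the kobject "initialized" state bit */
	int refcount;
};

/* Toy stand-in for an OBD device (hypothetical layout). */
struct toy_obd {
	struct toy_kobject kobj;
	bool proc_registered;	/* models the alternate procfs registration */
};

/* Models the helper that initializes the kobject (the role that
 * lprocfs_obd_setup() plays in the ticket). */
static void toy_lprocfs_setup(struct toy_obd *obd)
{
	assert(obd != NULL);
	obd->kobj.initialized = true;
	obd->kobj.refcount = 1;
}

/* Models kobject_put(): returns false where the kernel would warn
 * about a put on an uninitialized kobject. */
static bool toy_kobject_put(struct toy_kobject *kobj)
{
	if (!kobj->initialized)
		return false;
	kobj->refcount--;
	return true;
}

/* Assumed buggy shape: when osp was loaded first, setup takes an
 * alternate registration path and never initializes the kobject. */
static void toy_osc_setup_buggy(struct toy_obd *obd, bool osp_loaded_first)
{
	memset(obd, 0, sizeof(*obd));
	if (osp_loaded_first)
		obd->proc_registered = true;
	else
		toy_lprocfs_setup(obd);
}

/* Fixed shape, in the spirit of "always call lprocfs_obd_setup". */
static void toy_osc_setup_fixed(struct toy_obd *obd, bool osp_loaded_first)
{
	memset(obd, 0, sizeof(*obd));
	toy_lprocfs_setup(obd);
	if (osp_loaded_first)
		obd->proc_registered = true;
}

/* Cleanup drops the reference unconditionally, as the call trace
 * through lprocfs_obd_cleanup() suggests. */
static bool toy_cleanup(struct toy_obd *obd)
{
	return toy_kobject_put(&obd->kobj);
}
```

In this model, toy_cleanup() fails only after toy_osc_setup_buggy(&obd, true), which is why changing the module load order in the diff above is enough to expose the problem.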

Comment by John Hammond [ 16/Aug/17 ]

2.10 is not affected but 2.11 is.

This was introduced by https://review.whamcloud.com/26020 LU-8066 obdclass : Add infrastructure for procfs to sysfs migration.

Comment by Peter Jones [ 16/Aug/17 ]

James

Do you have input on this one?

Peter

Comment by James A Simmons [ 16/Aug/17 ]

I will take a look. I have an idea of what is going on.

Comment by Peter Jones [ 23/Aug/17 ]

John knows how to fix this

Comment by Gerrit Updater [ 23/Aug/17 ]

John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/28668
Subject: LU-9791 osc: always do OBD lprocfs setup
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: df2a40fb38232fd7f420747ca69ff9845bd05a48

Comment by Gerrit Updater [ 27/Aug/17 ]

James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/28747
Subject: LU-9791 obd: always call lprocfs_obd_setup
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e0009f79b0038136fa6c01a850145b4ab13912a6

Comment by Gerrit Updater [ 10/Sep/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28747/
Subject: LU-9791 obd: always call lprocfs_obd_setup
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9da376b7b3526c1c5bf4c5b26fc6ed692e13f9b7

Comment by Peter Jones [ 10/Sep/17 ]

Landed for 2.11

Comment by Peter Jones [ 10/Sep/17 ]

Does this affect b2_10?

Comment by James A Simmons [ 11/Sep/17 ]

No. The sysfs work has only landed for 2.11.

Comment by Peter Jones [ 11/Sep/17 ]

Thanks James
