[LU-3857] panic in lprocfs_remove_nolock+0x3b/0x100 Created: 29/Aug/13  Updated: 14/Feb/14  Resolved: 02/Jan/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.5.0, Lustre 2.6.0
Fix Version/s: Lustre 2.6.0, Lustre 2.5.1

Type: Bug Priority: Major
Reporter: Alexey Lyashkov Assignee: Di Wang
Resolution: Fixed Votes: 0
Labels: None
Environment:

b2_4 branch - 2.6.32-358.2.1 kernel


Issue Links:
Duplicate
is duplicated by LU-4385 replay-single test 61d causes oops in... Resolved
Severity: 3
Rank (Obsolete): 10008

 Description   

while testing problems with mount lustre in 2.4 i found a panic with shutdown

[root@rhel6-64 utils]# ../utils/mkfs.lustre --reformat --mgs --mdt --fsname=wwtest --mgsnode=rhel6-64.shadowland@tcp --index=0 --quiet --backfstype=ldiskfs --param sys.timeout=300 --param lov.stripesize=1048576 --param lov.stripecount=1 --device-size=0 --verbose /dev/sdb1 

   Permanent disk data:
Target:     wwtest:MDT0000
Index:      0
Lustre FS:  wwtest
Mount type: ldiskfs
Flags:      0x65
              (MDT MGS first_time update )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters: mgsnode=192.168.69.5@tcp sys.timeout=300 lov.stripesize=1048576 lov.stripecount=1

device size = 2048MB
formatting backing filesystem ldiskfs on /dev/sdb1
	target name  wwtest:MDT0000
	4k blocks     524540
	options        -J size=80 -I 512 -i 2048 -q -O dirdata,uninit_bg,^extents,dir_nlink,quota,huge_file,flex_bg -E lazy_journal_init -F
mkfs_cmd = mke2fs -j -b 4096 -L wwtest:MDT0000  -J size=80 -I 512 -i 2048 -q -O dirdata,uninit_bg,^extents,dir_nlink,quota,huge_file,flex_bg -E lazy_journal_init -F /dev/sdb1 524540
Writing CONFIGS/mountdata
[root@rhel6-64 utils]# ../utils/mkfs.lustre --reformat --mgs --mdt --fsname=wwtest --mgsnode=rhel6-64.shadowland@tcp --index=0 --quiet --backfstype=ldiskfs --param sys.timeout=300 --param lov.stripesize=1048576 --param lov.stripecount=1 --device-size=0 --verbose /dev/sdb1 ^C
[root@rhel6-64 utils]# mount -t lustre -o nosvc,abort_recov,svname=rhel6-64.shadowland /dev/sdb1 /mnt/mdt
[root@rhel6-64 utils]# umount -f /mnt/mdt
Timeout, server not responding.

console backtrace

general protection fault: 0000 [#1] SMP 
last sysfs file: /sys/devices/pci0000:00/0000:00:15.0/0000:03:00.0/host2/target2:0:1/2:0:1:0/block/sdb/queue/max_sectors_kb
CPU 7 
Modules linked in: ofd osp lod ost mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs exportfs mdd mgs lquota jbd mgc fid fld ptlrpc obdclass lvfs ksocklnd lnet sha512_generic sha256_generic crc32c_inte
l libcfs nfs lockd auth_rpcgss nfs_acl sunrpc cachefiles fscache(T) ib_ipoib ib_cm ipv6 ib_uverbs ib_umad mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core dm_mirror dm_region_hash dm_log dm_mo
d ppdev vmware_balloon parport_pc parport vmxnet3 i2c_piix4 i2c_core sg shpchp ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom vmw_pvscsi pata_acpi ata_generic ata_piix [last unloaded: lmv]

Pid: 10144, comm: umount Tainted: G           ---------------  T 2.6.32-358.2.1.el6 #0 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
RIP: 0010:[<ffffffffa056aacb>]  [<ffffffffa056aacb>] lprocfs_remove_nolock+0x3b/0x100 [obdclass]
RSP: 0018:ffff880101e21a58  EFLAGS: 00010202
RAX: ffff88011954c100 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffffa056acdd RDI: ffff880138742580
RBP: ffff880101e21a88 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: 0a6e776f64747568 R12: 6b6b6b6b6b6b6b6b
R13: ffff8801387426d8 R14: 6b6b6b6b6b6b6b6b R15: 0000000000000000
FS:  00007f77cb076740(0000) GS:ffff88002cc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f77ca7961b0 CR3: 0000000102102000 CR4: 00000000000407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process umount (pid: 10144, threadinfo ffff880101e20000, task ffff88011954c100)
Stack:
 ffffffffa056acdd ffff8801387426d8 ffff880138742580 ffff880101e21b28
<d> ffff8801387426d8 ffff880101e21b28 ffff880101e21aa8 ffffffffa056ace5
<d> ffff8801387424f0 ffff8801387424f0 ffff880101e21ad8 ffffffffa0a18cb0
Call Trace:
 [<ffffffffa056acdd>] ? lprocfs_remove+0x1d/0x40 [obdclass]
 [<ffffffffa056ace5>] lprocfs_remove+0x25/0x40 [obdclass]
 [<ffffffffa0a18cb0>] qsd_fini+0x80/0x460 [lquota]
 [<ffffffffa0bbd708>] osd_shutdown+0x38/0xe0 [osd_ldiskfs]
 [<ffffffffa0bc52b9>] osd_device_fini+0x129/0x190 [osd_ldiskfs]
 [<ffffffffa058b437>] class_cleanup+0x577/0xda0 [obdclass]
 [<ffffffffa0560f6c>] ? class_name2dev+0x7c/0xe0 [obdclass]
 [<ffffffffa058cd1c>] class_process_config+0x10bc/0x1c80 [obdclass]
 [<ffffffffa03f5e58>] ? libcfs_log_return+0x28/0x40 [libcfs]
 [<ffffffffa0586741>] ? lustre_cfg_new+0x391/0x7e0 [obdclass]
 [<ffffffffa058da59>] class_manual_cleanup+0x179/0x6e0 [obdclass]
 [<ffffffffa03f5e58>] ? libcfs_log_return+0x28/0x40 [libcfs]
 [<ffffffffa0bc6404>] osd_obd_disconnect+0x174/0x1e0 [osd_ldiskfs]
 [<ffffffffa058fa8e>] lustre_put_lsi+0x17e/0xe20 [obdclass]
 [<ffffffffa05982a8>] lustre_common_put_super+0x5d8/0xc20 [obdclass]
 [<ffffffffa05c12ba>] server_put_super+0x1ca/0xe60 [obdclass]
 [<ffffffff811b136a>] ? invalidate_inodes+0xfa/0x180
 [<ffffffff81196afb>] generic_shutdown_super+0x5b/0xe0
 [<ffffffff81196be6>] kill_anon_super+0x16/0x60
 [<ffffffff8119737f>] ? deactivate_super+0x4f/0x80
 [<ffffffffa058f8b6>] lustre_kill_super+0x36/0x60 [obdclass]
 [<ffffffff81197387>] deactivate_super+0x57/0x80
 [<ffffffff811b5acf>] mntput_no_expire+0xbf/0x110
 [<ffffffff811b654b>] sys_umount+0x7b/0x3a0
 [<ffffffff81531b48>] ? lockdep_sys_exit_thunk+0x35/0x67
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Code: 1f 44 00 00 48 8b 1f 48 85 db 74 51 48 c7 07 00 00 00 00 4c 8b 73 48 4d 85 f6 75 0f e9 9a 00 00 00 0f 1f 80 00 00 00 00 4c 89 e3 <4c> 8b 63 50 4d 85 e4 75 f4 4c 8b 6b 08 4c 8b 63 48 4c 89 ef e8 
RIP  [<ffffffffa056aacb>] lprocfs_remove_nolock+0x3b/0x100 [obdclass]


 Comments   
Comment by Alexey Lyashkov [ 02/Oct/13 ]

Hit again today.

ustre: 3905:0:(client.c:1789:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 0/real 0]  req@ffff88020b8d9148 x1447772636058308/t0(0) o101->MGC192.168.69.5@t
cp@0@lo:26/25 lens 328/344 e 0 to 1 dl 0 ref 2 fl Rpc:EX/0/ffffffff rc -108/-1
LustreError: 6244:0:(ldlm_resource.c:809:ldlm_resource_complain()) MGC192.168.69.5@tcp: namespace resource [0x65727473756c:0x0:0x0].0 (ffff88023401ab00) refcount nonzero (1) after lock cleanup; 
forcing cleanup.
LustreError: 6244:0:(ldlm_resource.c:1448:ldlm_resource_dump()) --- Resource: [0x65727473756c:0x0:0x0].0 (ffff88023401ab00) refcount = 2
Lustre: 3905:0:(client.c:1789:ptlrpc_expire_one_request()) Skipped 1 previous similar message
LustreError: 15c-8: MGC192.168.69.5@tcp: The configuration from log 'lustre-MDT0000' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuratio
n, or other errors. See the syslog for more information.
Lustre: MGS: Client f0a9a982-4623-afea-e5fb-240161f63073 (at 0@lo) reconnecting
Lustre: MGS: Client f0a9a982-4623-afea-e5fb-240161f63073 (at 0@lo) refused reconnection, still busy with 1 active RPCs
LustreError: 6244:0:(obd_mount_server.c:1257:server_start_targets()) failed to start server lustre-MDT0000: -5
LustreError: 6244:0:(obd_mount_server.c:1732:server_fill_super()) Unable to start targets: -5
LustreError: 6244:0:(obd_mount_server.c:848:lustre_disconnect_lwp()) lustre-MDT0000-lwp-MDT0000: Can't end config log lustre-client.
LustreError: 6244:0:(obd_mount_server.c:1426:server_put_super()) lustre-MDT0000: failed to disconnect lwp. (rc=-2)
LustreError: 6244:0:(obd_mount_server.c:1456:server_put_super()) no obd lustre-MDT0000
general protection fault: 0000 [#1] SMP 
last sysfs file: /sys/devices/system/cpu/possible
CPU 0 
Modules linked in: lustre ofd osp lod ost mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs exportfs mdd mgs lquota lfsck jbd obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet sha512_g
eneric sha256_generic crc32c_intel libcfs nfs lockd auth_rpcgss nfs_acl sunrpc cachefiles fscache(T) ib_ipoib ib_cm ipv6 ib_uverbs ib_umad mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core dm_mirro
r dm_region_hash dm_log dm_mod ppdev parport_pc parport vmware_balloon microcode vmxnet3 i2c_piix4 i2c_core sg shpchp ext4 jbd2 mbcache sd_mod crc_t10dif sr_mod cdrom vmw_pvscsi pata_acpi ata_ge
neric ata_piix [last unloaded: scsi_wait_scan]

Pid: 6244, comm: mount.lustre Tainted: G           ---------------  T 2.6.32-358.18.1-lustre #0 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
RIP: 0010:[<ffffffffa0562013>]  [<ffffffffa0562013>] lprocfs_remove_nolock+0x43/0x110 [obdclass]

Comment by Di Wang [ 15/Nov/13 ]

hmm, I also hit this problem in my local test on master branch. (conf-sanity 17 with 4 MDTs and 2 MDSes )

Lustre: DEBUG MARKER: == conf-sanity test 17: Verify failed mds_postsetup won't fail assertion (2936) (should return errs) == 19:57:05 (1384487825)
LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. quota=on. Opts: 
LDISKFS-fs (sdb2): mounted filesystem with ordered data mode. quota=on. Opts: 
LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 172.16.151.128@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
LustreError: Skipped 52 previous similar messages
LustreError: 11-0: lustre-MDT0001-osp-MDT0002: Communicating with 172.16.151.128@tcp, operation obd_ping failed with -107.
LustreError: Skipped 97 previous similar messages
LustreError: 166-1: MGC172.16.151.30@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. quota=on. Opts: 
LustreError: 15c-8: MGC172.16.151.30@tcp: The configuration from log 'lustre-MDT0000' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
LustreError: 10584:0:(obd_mount_server.c:1256:server_start_targets()) failed to start server lustre-MDT0000: -2
LustreError: 10584:0:(obd_mount_server.c:1731:server_fill_super()) Unable to start targets: -2
LustreError: 10584:0:(obd_mount_server.c:847:lustre_disconnect_lwp()) lustre-MDT0000-lwp-MDT0000: Can't end config log lustre-client.
LustreError: 10584:0:(obd_mount_server.c:1425:server_put_super()) lustre-MDT0000: failed to disconnect lwp. (rc=-2)
LustreError: 10584:0:(obd_mount_server.c:1455:server_put_super()) no obd lustre-MDT0000
general protection fault: 0000 [#1] SMP 
last sysfs file: /sys/devices/pci0000:00/0000:00:10.0/host0/target0:0:1/0:0:1:0/block/sdb/queue/scheduler
CPU 1 
Modules linked in: lustre(U) ofd(U) osp(U) lod(U) ost(U) mdt(U) mdd(U) mgs(U) osd_ldiskfs(U) ldiskfs(U) lquota(U) lfsck(U) jbd obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) ksocklnd(U) lnet(U) sha512_generic sha256_generic crc32c_intel libcfs(U) nfsd exportfs autofs4 nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate dm_mirror dm_region_hash dm_log ppdev snd_ens1371 snd_rawmidi snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc e1000 parport_pc parport vmware_balloon sg i2c_piix4 i2c_core shpchp ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix mptspi mptscsih mptbase scsi_transport_spi dm_mod [last unloaded: speedstep_lib]

Pid: 10584, comm: mount.lustre Tainted: P           ---------------    2.6.32 #3 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
RIP: 0010:[<ffffffffa081ca23>]  [<ffffffffa081ca23>] lprocfs_remove_nolock+0x43/0x110 [obdclass]
RSP: 0018:ffff88006e0578b8  EFLAGS: 00010202
RAX: ffffffffa089b020 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800658251a8
RBP: ffff88006e0578e8 R08: 616974696e69203a R09: 44535120676e6974
R10: 732044535120676e R11: 0a6e776f64747568 R12: 6b6b6b6b6b6b6b6b
R13: ffff8800658252b0 R14: 6b6b6b6b6b6b6b6b R15: ffff8800771f6de0
FS:  00007f9a2e488700(0000) GS:ffff88000a240000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fc3ae5b9000 CR3: 000000006f953000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process mount.lustre (pid: 10584, threadinfo ffff88006e056000, task ffff880036d9c380)
Stack:
 ffff880000000010 ffff88006e057918 ffff8800658251a8 ffff88006e057988
<d> ffff8800658252b0 ffff88006e057988 ffff88006e057908 ffffffffa081cc45
<d> ffff88006e057908 ffff880065825118 ffff88006e057938 ffffffffa0b23dd0
Call Trace:
 [<ffffffffa081cc45>] lprocfs_remove+0x25/0x40 [obdclass]
 [<ffffffffa0b23dd0>] qsd_fini+0x80/0x450 [lquota]
 [<ffffffffa0d2ec78>] osd_shutdown+0x38/0xe0 [osd_ldiskfs]
 [<ffffffffa0d36bf9>] osd_device_fini+0x129/0x190 [osd_ldiskfs]
 [<ffffffffa0834913>] class_cleanup+0x573/0xd30 [obdclass]
 [<ffffffffa081233c>] ? class_name2dev+0x7c/0xe0 [obdclass]
 [<ffffffffa083663a>] class_process_config+0x156a/0x1ad0 [obdclass]
 [<ffffffffa06be9b8>] ? libcfs_log_return+0x28/0x40 [libcfs]
 [<ffffffffa082f202>] ? lustre_cfg_new+0x312/0x6e0 [obdclass]
 [<ffffffffa0836d19>] class_manual_cleanup+0x179/0x6e0 [obdclass]
 [<ffffffffa06be9b8>] ? libcfs_log_return+0x28/0x40 [libcfs]
 [<ffffffffa0d378b4>] osd_obd_disconnect+0x174/0x1e0 [osd_ldiskfs]
 [<ffffffffa083926b>] lustre_put_lsi+0x1ab/0xeb0 [obdclass]
 [<ffffffffa08414d8>] lustre_common_put_super+0x5c8/0xbe0 [obdclass]
 [<ffffffffa087081d>] server_put_super+0x1bd/0xed0 [obdclass]
 [<ffffffffa0871bbb>] server_fill_super+0x68b/0x1630 [obdclass]
 [<ffffffffa0840bb8>] lustre_fill_super+0x1d8/0x530 [obdclass]
 [<ffffffffa08409e0>] ? lustre_fill_super+0x0/0x530 [obdclass]
 [<ffffffff8117df8f>] get_sb_nodev+0x5f/0xa0
 [<ffffffffa0838bc5>] lustre_get_sb+0x25/0x30 [obdclass]
 [<ffffffff8117dbeb>] vfs_kern_mount+0x7b/0x1b0
 [<ffffffff8117dd92>] do_kern_mount+0x52/0x130
 [<ffffffff8119c642>] do_mount+0x2d2/0x8c0
 [<ffffffff8119ccc0>] sys_mount+0x90/0xe0
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Code: c7 07 00 00 00 00 48 85 db 74 52 48 81 fb 00 f0 ff ff 77 49 4c 8b 73 48 4d 85 f6 75 0e e9 99 00 00 00 66 0f 1f 44 00 00 4c 89 e3 <4c> 8b 63 50 4d 85 e4 75 f4 4c 8b 6b 08 4c 8b 63 48 4c 89 ef e8 
RIP  [<ffffffffa081ca23>] lprocfs_remove_nolock+0x43/0x110 [obdclass]
 RSP <ffff88006e0578b8>
---[ end trace d6b25d1e4af65ad1 ]---
Comment by Di Wang [ 27/Nov/13 ]

It seems in osd_device_fini, osd_procfs_fini is being called first to destroy all of proc entry for OSD, but later, osd_shutdown/qsd_fini does it again, which caused the LBUG.

Comment by Di Wang [ 09/Dec/13 ]

http://review.whamcloud.com/8506

Comment by Peter Jones [ 02/Jan/14 ]

Landed for 2.6

Generated at Sat Feb 10 01:37:31 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.