[LU-11986] After partial lustre_rmmod, lnet panics on debugfs read Created: 21/Feb/19 Updated: 08/Sep/20 Resolved: 08/Sep/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Nathaniel Clark | Assignee: | James A Simmons |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
IML 5-devel on VirtualBox |
| Attachments: | console-dump.txt |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
While investigating:

# modprobe osd-zfs
# lctl list_nids
192.168.56.20@tcp
# lustre_rmmod lnet

Wait a minute for (I assume) iml-agent to try to get lnet status and the following panic results:

[ 68.383777] BUG: unable to handle kernel paging request at ffffffffc0959ca0
[ 68.388689] IP: [<ffffffffc089b654>] lnet_debugfs_read+0x24/0x40 [libcfs]
[ 68.391817] PGD 11c14067 PUD 11c16067 PMD 16472067 PTE 0
[ 68.402787] Oops: 0000 [#1] SMP
[ 68.404489] Modules linked in: libcfs(OE) nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sunrpc zfs(POE) zunicode(POE) zavl(POE) icp(POE) ppdev iosf_mbi crc32_pclmul zcommon(POE) znvpair(POE) ghash_clmulni_intel spl(OE) aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sg parport_pc parport video i2c_piix4 ip_tables ext4 mbcache jbd2 sr_mod cdrom ata_generic sd_mod crc_t10dif crct10dif_generic pata_acpi crct10dif_pclmul crct10dif_common crc32c_intel serio_raw ahci ata_piix libahci libata e1000 dm_mirror dm_region_hash dm_log dm_mod [last unloaded: lnet]
[ 68.438921] CPU: 1 PID: 4853 Comm: lctl Tainted: P OE ------------ 3.10.0-957.el7_lustre.x86_64 #1
[ 68.444438] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 68.448776] task: ffff893597914100 ti: ffff893596f44000 task.ti: ffff893596f44000
[ 68.452887] RIP: 0010:[<ffffffffc089b654>] [<ffffffffc089b654>] lnet_debugfs_read+0x24/0x40 [libcfs]
[ 68.458006] RSP: 0018:ffff893596f47ec8 EFLAGS: 00010246
[ 68.460928] RAX: ffffffffc089b630 RBX: ffff89359ead7900 RCX: ffff893596f47ec8
[ 68.465870] RDX: 0000000000cd50c0 RSI: 0000000000000000 RDI: ffffffffc0959c80
[ 68.475879] RBP: ffff893596f47ed0 R08: ffff893596f47f18 R09: 0000000000000000
[ 68.479744] R10: 00007fff4e9a0f60 R11: 0000000000000246 R12: 0000000000cd50c0
[ 68.483282] R13: ffff893596f47f18 R14: 0000000000001000 R15: 0000000000000000
[ 68.487155] FS: 00007fd435e89740(0000) GS:ffff89359fd00000(0000) knlGS:0000000000000000
[ 68.492060] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 68.495707] CR2: ffffffffc0959ca0 CR3: 0000000017940000 CR4: 00000000000606e0
[ 68.499746] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 68.503180] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 68.507283] Call Trace:
[ 68.508720] [<ffffffff94640f0f>] vfs_read+0x9f/0x170
[ 68.511351] [<ffffffff94641dcf>] SyS_read+0x7f/0xf0
[ 68.513924] [<ffffffff94b74d21>] ? system_call_after_swapgs+0xae/0x146
[ 68.519936] [<ffffffff94b74ddb>] system_call_fastpath+0x22/0x27
[ 68.524797] [<ffffffff94b74d21>] ? system_call_after_swapgs+0xae/0x146
[ 68.529162] Code: 5b 41 5c 41 5d 5d c3 0f 1f 44 00 00 55 49 89 c8 48 89 e5 48 83 ec 08 48 8b bf a8 00 00 00 48 89 55 f8 48 8d 4d f8 48 89 f2 31 f6 <48> 8b 47 20 e8 23 b5 ee d3 48 98 48 85 c0 48 0f 44 45 f8 c9 c3
[ 68.545905] RIP [<ffffffffc089b654>] lnet_debugfs_read+0x24/0x40 [libcfs]
[ 68.549423] RSP <ffff893596f47ec8>
[ 68.551714] CR2: ffffffffc0959ca0
[ 68.553976] ---[ end trace 31c5ad0e3a22fb28 ]---
[ 68.556862] Kernel panic - not syncing: Fatal exception
[ 68.559764] Kernel Offset: 0x13400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) |
| Comments |
| Comment by Nathaniel Clark [ 21/Feb/19 ] |
|
Running lustre_rmmod with no arguments does not exhibit this behavior. |
| Comment by Joe Grund [ 20/Mar/19 ] |
|
I can also cause a panic by doing the following:

lustre_rmmod
lctl network down
|
| Comment by Joe Grund [ 20/Mar/19 ] |
|
Attached a full console dump if it's useful: console-dump.txt
Full trace I'm seeing:

[ 257.222895] CPU: 0 PID: 7331 Comm: lctl Tainted: P OE ------------ 3.10.0-957.el7_lustre.x86_64 #1
[ 257.229312] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 257.233659] task: ffff9c9fbaf15140 ti: ffff9c9fbabcc000 task.ti: ffff9c9fbabcc000
[ 257.238388] RIP: 0010:[<ffffffffc0bb95c6>] [<ffffffffc0bb95c6>] cfs_percpt_number+0x6/0x10 [libcfs]
[ 257.243851] RSP: 0018:ffff9c9fbabcfdb0 EFLAGS: 00010296
[ 257.246400] RAX: 0000000000000000 RBX: ffff9c9fba2a5200 RCX: 0000000000000000
[ 257.250304] RDX: 0000000000000001 RSI: 00000000ffffffff RDI: 0000000000000000
[ 257.253677] RBP: ffff9c9fbabcfdd0 R08: 000000000001f120 R09: ffff9c9fbe001700
[ 257.257073] R10: ffffffffc0c376db R11: 0000000000000246 R12: 0000000000000000
[ 257.260339] R13: 0000000000000000 R14: 0000000000001000 R15: ffff9c9fba2a5200
[ 257.263204] FS: 00007fbdc89c6740(0000) GS:ffff9c9fbfc00000(0000) knlGS:0000000000000000
[ 257.266409] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 257.269105] CR2: fffffffffffffff0 CR3: 0000000022e36000 CR4: 00000000000606f0
[ 257.272529] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 257.275209] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 257.277936] Call Trace:
[ 257.279245] [<ffffffffc0c0a88b>] ? lnet_counters_get_common+0xeb/0x150 [lnet]
[ 257.283071] [<ffffffffc0c0a95c>] lnet_counters_get+0x6c/0x150 [lnet]
[ 257.286224] [<ffffffffc0c3771b>] __proc_lnet_stats+0xfb/0x810 [lnet]
[ 257.288975] [<ffffffffc0ba6602>] lprocfs_call_handler+0x22/0x50 [libcfs]
[ 257.292387] [<ffffffffc0c36bf5>] proc_lnet_stats+0x25/0x30 [lnet]
[ 257.295184] [<ffffffffc0ba665d>] lnet_debugfs_read+0x2d/0x40 [libcfs]
[ 257.298195] [<ffffffffba440f0f>] vfs_read+0x9f/0x170
[ 257.300211] [<ffffffffba441dcf>] SyS_read+0x7f/0xf0
[ 257.302775] [<ffffffffba974d21>] ? system_call_after_swapgs+0xae/0x146
[ 257.305742] [<ffffffffba974ddb>] system_call_fastpath+0x22/0x27
[ 257.308191] [<ffffffffba974d21>] ? system_call_after_swapgs+0xae/0x146
[ 257.310673] Code: 98 03 00 10 2e bf c0 c7 05 8c 98 03 00 00 00 02 00 e8 ff f8 fe ff 8b 1d 49 7a 01 00 e9 e8 f9 ff ff 0f 1f 40 00 0f 1f 44 00 00 55 <8b> 47 f0 48 89 e5 5d c3 66 90 0f 1f 44 00 00 55 48 89 e5 41 57
[ 257.322785] RIP [<ffffffffc0bb95c6>] cfs_percpt_number+0x6/0x10 [libcfs]
[ 257.325230] RSP <ffff9c9fbabcfdb0>
[ 257.326493] CR2: fffffffffffffff0
[ 257.328279] ---[ end trace 6e448abd7d16753c ]---
[ 257.330671] Kernel panic - not syncing: Fatal exception
[ 257.333186] Kernel Offset: 0x39200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) |
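For what it's worth, the faulting address in these traces (CR2 = fffffffffffffff0, i.e. -0x10) together with RDI = 0 is consistent with cfs_percpt_number() reading the element count that libcfs keeps just in front of a per-CPU array whose pointer has already been freed and zeroed by the unconfigure path. A simplified sketch of that layout follows; it assumes the usual cfs_var_array bookkeeping and is illustrative, not the tree's actual source:

/* Simplified sketch (assumption: the cfs_var_array header sits
 * immediately before the pointer handed back to callers). */
struct cfs_var_array {
	unsigned int		 va_count;	/* number of buffers */
	unsigned int		 va_size;	/* size of each buffer */
	struct cfs_cpt_table	*va_cptab;	/* CPU partition table */
	void			*va_ptrs[0];	/* pointer callers actually hold */
};

int cfs_percpt_number(void *vars)
{
	struct cfs_var_array *arr;

	arr = container_of(vars, struct cfs_var_array, va_ptrs[0]);
	/* With vars == NULL after the LNet counters were torn down,
	 * this load lands at roughly (char *)NULL - 0x10, which matches
	 * the CR2 value in the trace above. */
	return arr->va_count;
}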
| Comment by Peter Jones [ 20/Mar/19 ] |
|
Sonia, could you please investigate? Thanks, Peter |
| Comment by Sonia Sharma (Inactive) [ 21/Mar/19 ] |
|
I could not replicate this behavior on lustre-master:

[root@client udsp]# modprobe lustre
[root@client udsp]# lctl list_nids
10.0.2.15@tcp
[root@client udsp]# lustre_rmmod lnet
[root@client udsp]#

[root@oss ~]# modprobe lustre
[root@oss ~]# lctl list_nids
10.0.2.15@tcp
[root@oss ~]# lustre_rmmod
[root@oss ~]# lctl network down
opening /dev/lnet failed: No such file or directory
hint: the kernel modules may not be loaded
LNET unconfigure error 2: No such file or directory
[root@oss ~]#

Is it something specific that the IML code is trying to do which is causing this? |
| Comment by Joe Grund [ 21/Mar/19 ] |
|
You are correct; there was one more step the IML agent was doing. I am able to reproduce this without the IML agent running by doing:

lustre_rmmod
lctl network down
lctl get_param -n stats |
| Comment by Nathaniel Clark [ 05/Apr/19 ] |
|
Were you able to reproduce with Joe's steps above? We're hitting this in our automated testing for IML pretty regularly. |
| Comment by James A Simmons [ 05/Apr/19 ] |
|
I believe this is due to struct ctl_table disappearing before the debugfs files are deregistered. I think the solution is to check whether the table still exists in both lnet_debugfs_read() and lnet_debugfs_write(). |
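A minimal sketch of that idea (not the eventual patch; it assumes the ctl_table pointer is carried in file->private_data, as the faulting load in the first trace suggests, and the registration flag is invented for illustration):

/* Sketch only: have the debugfs read handler verify the sysctl-style
 * table behind the file is still registered before dereferencing it.
 * "lnet_tables_registered" is a hypothetical flag cleared when the
 * LNet/libcfs tables are torn down. */
static bool lnet_tables_registered;

static ssize_t lnet_debugfs_read(struct file *filp, char __user *buf,
				 size_t count, loff_t *ppos)
{
	struct ctl_table *table = filp->private_data;
	ssize_t rc;

	/* Bail out instead of chasing a pointer into freed or unmapped
	 * module memory once the owning module is gone. */
	if (!lnet_tables_registered || !table || !table->proc_handler)
		return -ENODEV;

	rc = table->proc_handler(table, 0, (void __user *)buf, &count, ppos);
	if (!rc)
		rc = count;

	return rc;
}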
| Comment by Gerrit Updater [ 09/Apr/19 ] |
|
Sonia Sharma (sharmaso@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34622 |
| Comment by Gerrit Updater [ 11/Apr/19 ] |
|
Sonia Sharma (sharmaso@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34634 |
| Comment by Joe Grund [ 14/Apr/19 ] |
|
I may be missing something, but I appear to be getting a panic using the review on the b2_12 branch. The repo I am using:
[lustre]
name=Lustre Server
baseurl=https://build.whamcloud.com/jobs-pub/lustre-reviews/configurations/axis-arch/x86_64/axis-build_type/server/axis-distro/el7.6/axis-ib_stack/inkernel/builds/63380/archive/artifacts/
enabled=1
gpgcheck=0
repo_gpgcheck=0
Installed lustre:

Name        : lustre
Arch        : x86_64
Version     : 2.12.0_79_gb0c272b
Release     : 1.el7
Size        : 766 k
Repo        : lustre
Summary     : Lustre File System
URL         : https://wiki.whamcloud.com/
License     : GPL
Description : Userspace tools and files for the Lustre file system.

panic:

[ 140.614378] BUG: unable to handle kernel paging request at fffffffffffffff0
[ 140.616028] IP: [<ffffffffc09825f6>] cfs_percpt_number+0x6/0x10 [libcfs]
[ 140.617528] PGD 1a614067 PUD 1a616067 PMD 0
[ 140.618546] Oops: 0000 [#1] SMP
[ 140.619337] Modules linked in: ksocklnd(OE) lnet(OE) libcfs(OE) ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sunrpc zfs(POE) zunicode(POE) zavl(POE) icp(POE) ppdev zcommon(POE) znvpair(POE) spl(OE) sb_edac iosf_mbi crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sg joydev virtio_net cirrus virtio_balloon ttm drm_kms_helper parport_pc parport syscopyarea sysfillrect sysimgblt fb_sys_fops drm i2c_piix4 drm_panel_orientation_quirks ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic virtio_blk ata_generic pata_acpi virtio_scsi crct10dif_pclmul crct10dif_common serio_raw ata_piix libata virtio_pci virtio_ring virtio
[ 140.641386] CPU: 0 PID: 4913 Comm: lctl Tainted: P OE ------------ 3.10.0-957.1.3.el7_lustre.x86_64 #1
[ 140.643553] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 140.644779] task: ffff9382edb1c100 ti: ffff9382fb748000 task.ti: ffff9382fb748000
[ 140.646357] RIP: 0010:[<ffffffffc09825f6>] [<ffffffffc09825f6>] cfs_percpt_number+0x6/0x10 [libcfs]
[ 140.648304] RSP: 0018:ffff9382fb74bdb0 EFLAGS: 00010296
[ 140.649441] RAX: 0000000000000000 RBX: ffff9382f8903580 RCX: 0000000000000000
[ 140.650952] RDX: 0000000000000001 RSI: 00000000ffffffff RDI: 0000000000000000
[ 140.652454] RBP: ffff9382fb74bdd0 R08: 000000000001f120 R09: ffff9382fd001700
[ 140.653968] R10: ffffffffc0a179cb R11: 0000000000000246 R12: 0000000000000000
[ 140.655469] R13: 0000000000000000 R14: 0000000000001000 R15: ffff9382f8903580
[ 140.656984] FS: 00007fcf084be740(0000) GS:ffff9382ffc00000(0000) knlGS:0000000000000000
[ 140.658685] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 140.659908] CR2: fffffffffffffff0 CR3: 000000006c568000 CR4: 00000000000607f0
[ 140.661412] Call Trace:
[ 140.661989] [<ffffffffc09ea96b>] ? lnet_counters_get_common+0xeb/0x150 [lnet]
[ 140.663520] [<ffffffffc09eaa3c>] lnet_counters_get+0x6c/0x150 [lnet]
[ 140.664907] [<ffffffffc0a17a0b>] __proc_lnet_stats+0xfb/0x810 [lnet]
[ 140.666278] [<ffffffffc096f602>] lprocfs_call_handler+0x22/0x50 [libcfs]
[ 140.667727] [<ffffffffc0a16ee5>] proc_lnet_stats+0x25/0x30 [lnet]
[ 140.669050] [<ffffffffc096f662>] lnet_debugfs_read+0x32/0x50 [libcfs]
[ 140.670466] [<ffffffffb904117f>] vfs_read+0x9f/0x170
[ 140.671554] [<ffffffffb904203f>] SyS_read+0x7f/0xf0
[ 140.672655] [<ffffffffb9574d21>] ? system_call_after_swapgs+0xae/0x146
[ 140.674061] [<ffffffffb9574ddb>] system_call_fastpath+0x22/0x27
[ 140.675342] [<ffffffffb9574d21>] ? system_call_after_swapgs+0xae/0x146
[ 140.676747] Code: 85 03 00 50 ab 9b c0 c7 05 9c 85 03 00 00 00 02 00 e8 ff f8 fe ff 8b 1d 59 67 01 00 e9 e8 f9 ff ff 0f 1f 40 00 66 66 66 66 90 55 <8b> 47 f0 48 89 e5 5d c3 66 90 66 66 66 66 90 55 48 89 e5 41 57
[ 140.685523] RIP [<ffffffffc09825f6>] cfs_percpt_number+0x6/0x10 [libcfs]
[ 140.688462] RSP <ffff9382fb74bdb0>
[ 140.690693] CR2: fffffffffffffff0
[ 140.692874] ---[ end trace 3fe411e27d033edb ]---
[ 140.695279] Kernel panic - not syncing: Fatal exception
[ 140.697872] Kernel Offset: 0x37e00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) |
| Comment by Sonia Sharma (Inactive) [ 14/Apr/19 ] |
|
I just tried on my local VM (building b2_12 locally) and this is what I get:
[root@oss lustre-release]# lctl list_nids
192.168.57.102@tcp
[root@oss lustre-release]# lnetctl lnet unconfigure
[root@oss lustre-release]# lustre_rmmod
[root@oss lustre-release]# lctl network down
opening /dev/lnet failed: No such file or directory
hint: the kernel modules may not be loaded
LNET unconfigure error 2: No such file or directory
[root@oss lustre-release]# lctl get_param -n stats
error: get_param: param_path 'stats': No such file or directory
[root@oss lustre-release]#
Let me try installing the RPMs from the link above and see why it still panics. |
| Comment by Joe Grund [ 15/Apr/19 ] |
|
I tried the above and that seems to work. My teardown steps are now a little different; I am able to get a panic by doing:
[root@vm8 ~]# lnetctl net show
net:
- net type: lo
local NI(s):
- nid: 0@lo
status: up
- net type: tcp
local NI(s):
- nid: 192.168.121.128@tcp
status: up
[root@vm8 ~]# lustre_rmmod ptlrpc
[root@vm8 ~]# lnetctl lnet unconfigure
[root@vm8 ~]# lctl get_param -n stats
This is basically what lnet.service does as well:

ExecStop=/usr/sbin/lustre_rmmod ptlrpc
ExecStop=/usr/sbin/lnetctl lnet unconfigure
ExecStop=/usr/sbin/lustre_rmmod libcfs ldiskfs |
| Comment by Gerrit Updater [ 15/Apr/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34634/ |
| Comment by James A Simmons [ 15/Apr/19 ] |
|
Oh my. LNetNIFini() calls lnet_router_debugfs_init() instead of lnet_router_debugfs_fini(). Joe, try changing that code and I bet the problem goes away. Otherwise, what happens is that the debugfs files still exist but the LNet counters are freed, which is causing your crash.
|
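In other words, a sketch of the suggested one-line change, with the rest of the shutdown path elided (not a verbatim diff of the tree):

/* Sketch of the suggestion above; all other shutdown steps elided. */
int LNetNIFini(void)
{
	/* ... existing teardown ... */

	/* The comment above says this path currently calls
	 * lnet_router_debugfs_init(), which (re)creates the router
	 * debugfs files on the way down.  Calling the fini helper
	 * removes them instead, so no debugfs entry outlives the
	 * LNet counters it reads. */
	lnet_router_debugfs_fini();

	/* ... remaining teardown ... */
	return 0;
}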
| Comment by Sonia Sharma (Inactive) [ 15/Apr/19 ] |
|
joe.grund I do see the crash when I follow the steps you shared:
[root@vm8 ~]# lnetctl net show
net:
- net type: lo
local NI(s):
- nid: 0@lo
status: up
- net type: tcp
local NI(s):
- nid: 192.168.57.102@tcp
status: up
[root@vm8 ~]# lustre_rmmod ptlrpc
[root@vm8 ~]# lnetctl lnet unconfigure
[root@vm8 ~]# lctl get_param -n stats
But if I unload the libcfs module too before running lctl get_param -n stats, then I don't see the crash. I understand it should not crash, but just to understand the use case: in what scenario would you unload the ptlrpc module and unconfigure LNet but not unload libcfs?
|
| Comment by Gerrit Updater [ 15/Apr/19 ] |
|
James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/34669 |
| Comment by Joe Grund [ 16/Apr/19 ] |
The IML state machine splits LNet teardown into two steps, "stopping" and "unloading". The stopping step looks like this currently: and the unloading step looks like this currently: Stopping and unloading are usually performed together, but there is no hard and fast guarantee that will always be the case (user can stop without unloading) |
| Comment by James A Simmons [ 16/Apr/19 ] |
|
Patch 34669 should resolve your problem. The reason is that libcfs registers the root debugfs file for libcfs/LNet. When lnetctl lnet configure is called, additional debugfs files are registered. The problem was that nothing was cleaned up until the libcfs module was unloaded. What we needed to do was remove the LNet-specific debugfs files when lnetctl lnet unconfigure is called; libcfs module unload does the rest of the cleanup, like it does now. |
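A hedged sketch of that split, using plain kernel debugfs calls rather than the libcfs wrappers (the directory name, variable, and function names below are illustrative and not taken from the patch itself):

#include <linux/debugfs.h>

/* Illustrative only: LNet-specific debugfs entries get their own
 * directory so they can be removed at "lnetctl lnet unconfigure",
 * while the root libcfs entries stay until the libcfs module unloads. */
static struct dentry *lnet_debugfs_root;

/* called from the "lnetctl lnet configure" path */
static void lnet_debugfs_setup(void)
{
	lnet_debugfs_root = debugfs_create_dir("lnet", NULL);
	/* ... debugfs_create_file() calls for the LNet-specific entries ... */
}

/* called from the "lnetctl lnet unconfigure" path, so no LNet debugfs
 * file can outlive the LNet state it would dereference */
static void lnet_debugfs_cleanup(void)
{
	debugfs_remove_recursive(lnet_debugfs_root);
	lnet_debugfs_root = NULL;
}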
| Comment by James A Simmons [ 16/Apr/19 ] |
|
I would recommend that both patches land! |
| Comment by Gerrit Updater [ 16/Apr/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34692 |
| Comment by Joe Grund [ 17/Apr/19 ] |
|
I've run patch 34669 (review 63547) through IML automated integration testing; it has passed all checks: https://build.whamcloud.com/job/integration-tests-shared-storage-configuration/1571/ |
| Comment by Gerrit Updater [ 21/Apr/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34622/ |
| Comment by Peter Jones [ 21/Apr/19 ] |
|
Landed for 2.13 |
| Comment by Gerrit Updater [ 22/Apr/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34669/ |
| Comment by Gerrit Updater [ 22/Apr/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34692/ |
| Comment by Vitaly Fertman [ 12/Aug/19 ] |
|
AFAICS, it happened again: |
| Comment by James A Simmons [ 12/Aug/19 ] |
|
So we reduce the chance of this bug |
| Comment by Gerrit Updater [ 27/Mar/20 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38090 |
| Comment by Gerrit Updater [ 06/Apr/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38090/ |
| Comment by Gerrit Updater [ 07/May/20 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38529 |
| Comment by Gerrit Updater [ 11/May/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38529/ |
| Comment by Gerrit Updater [ 25/May/20 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38716 |
| Comment by Gerrit Updater [ 11/Jul/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38716/ |
| Comment by Gerrit Updater [ 16/Jul/20 ] |
|
James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/39404 |
| Comment by Gerrit Updater [ 08/Sep/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39404/ |