[LU-11769] parallel-scale-nfsv4 crashes on client unmount Created: 12/Dec/18 Updated: 29/Jul/22 |
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0, Lustre 2.12.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
After all parallel-scale-nfsv4 tests are complete, we unmount Lustre on all the clients and get a kernel crash. There are at least three occurrences of this crash, starting on 27 October 2018 with Lustre tag 2.11.56.30. There are several other parallel-scale-nfsv4 test sessions where all tests pass but the test suite times out on umount; these are possibly the same issue, but I can't confirm that. Looking at one of the test sessions that crashed, https://testing.whamcloud.com/test_sets/9361c4ca-d9d0-11e8-975a-52540065bddc , we can see that Client 2 (vm2) crashed on umount with the following stack trace. From Client 2's console log:

[68913.804086] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test complete, duration 2885 sec ============================================= 06:57:33 (1540623453)
[68914.000010] Lustre: DEBUG MARKER: umount -f /mnt/lustre
[68914.175525] BUG: Dentry ffff9ea53a31c9c0{i=49be60a6,n=17} still in use (-1) [unmount of nfs4 0:44]
[68914.177383] ------------[ cut here ]------------
[68914.178139] kernel BUG at fs/dcache.c:970!
[68914.178804] invalid opcode: 0000 [#1] SMP
[68914.179524] Modules linked in: nfsv3 nfs_acl lnet_selftest(OE) mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd virtio_balloon ppdev pcspkr joydev i2c_piix4 parport_pc parport i2c_core ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk 8139too ata_piix libata crct10dif_pclmul crct10dif_common crc32c_intel floppy serio_raw virtio_pci virtio_ring virtio 8139cp mii
[68914.193620] CPU: 1 PID: 15892 Comm: umount.nfs4 Kdump: loaded Tainted: G OE ------------ 3.10.0-862.14.4.el7.x86_64 #1
[68914.195439] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[68914.196340] task: ffff9ea568c00000 ti: ffff9ea56dea0000 task.ti: ffff9ea56dea0000
[68914.197502] RIP: 0010:[<ffffffff9f83808c>] [<ffffffff9f83808c>] shrink_dcache_for_umount_subtree+0x1cc/0x1e0
[68914.199069] RSP: 0018:ffff9ea56dea3df8 EFLAGS: 00010246
[68914.199902] RAX: 0000000000000056 RBX: ffff9ea53a31c9c0 RCX: 0000000000000000
[68914.201028] RDX: 0000000000000000 RSI: ffff9ea57fd13978 RDI: ffff9ea57fd13978
[68914.202138] RBP: ffff9ea56dea3e18 R08: 0000000000000000 R09: ffff9ea57d148f00
[68914.203248] R10: 0000000000002da0 R11: ffff9ea56dea3af6 R12: ffff9ea51da8bf00
[68914.204409] R13: 00000000000003f4 R14: ffffffffa06c69d0 R15: ffff9ea568c007d0
[68914.205555] FS: 00007fc4b794d880(0000) GS:ffff9ea57fd00000(0000) knlGS:0000000000000000
[68914.206850] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[68914.207777] CR2: 0000563ba547f7f8 CR3: 0000000040542000 CR4: 00000000000606e0
[68914.208908] Call Trace:
[68914.209321] [<ffffffff9f8399d9>] shrink_dcache_for_umount+0x49/0x60
[68914.210343] [<ffffffff9f821c1f>] generic_shutdown_super+0x1f/0x100
[68914.211336] [<ffffffff9f822052>] kill_anon_super+0x12/0x20
[68914.212326] [<ffffffffc058c07b>] nfs_kill_super+0x1b/0x30 [nfs]
[68914.213302] [<ffffffff9f82240e>] deactivate_locked_super+0x4e/0x70
[68914.214317] [<ffffffff9f822b96>] deactivate_super+0x46/0x60
[68914.215201] [<ffffffff9f840a9f>] cleanup_mnt+0x3f/0x80
[68914.216055] [<ffffffff9f840b32>] __cleanup_mnt+0x12/0x20
[68914.216916] [<ffffffff9f6bab8b>] task_work_run+0xbb/0xe0
[68914.217784] [<ffffffff9f62bc55>] do_notify_resume+0xa5/0xc0
[68914.218678] [<ffffffff9fd25ae4>] int_signal+0x12/0x17
[68914.219511] Code: 00 00 48 8b 40 28 4c 8b 08 48 8b 43 30 48 85 c0 74 1b 48 8b 50 40 48 89 34 24 48 c7 c7 f8 8f 07 a0 48 89 de 31 c0 e8 4a 53 4d 00 <0f> 0b 31 d2 eb e5 0f 0b 66 90 66 2e 0f 1f 84 00 00 00 00 00 66
[68914.224552] RIP [<ffffffff9f83808c>] shrink_dcache_for_umount_subtree+0x1cc/0x1e0
[68914.225767] RSP <ffff9ea56dea3df8>
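The BUG() fires in the kernel's umount-time dentry sweep: every dentry belonging to the superblock must be unreferenced before it can be torn down. Below is a minimal sketch of the failing check, paraphrased from the 3.10-era fs/dcache.c; the field names and exact layout are assumptions and vary by kernel build, so this is illustrative rather than the literal RHEL source.

```c
/* Paraphrased sketch of the check in shrink_dcache_for_umount_subtree()
 * (fs/dcache.c, 3.10-era kernels); not the literal RHEL -862 source.
 * During unmount every dentry on the superblock must be unreferenced;
 * a nonzero count means a reference was leaked or over-dropped, and the
 * kernel halts rather than free a dentry that may still be in use. */
if (dentry->d_lockref.count != 0) {
	printk(KERN_ERR "BUG: Dentry %p{i=%lx,n=%s} still in use (%d)"
	       " [unmount of %s %s]\n",
	       dentry,
	       dentry->d_inode ? dentry->d_inode->i_ino : 0UL,
	       dentry->d_name.name,
	       dentry->d_lockref.count,
	       dentry->d_sb->s_type->name,
	       dentry->d_sb->s_id);
	BUG();	/* reported here as fs/dcache.c:970 */
}
```

Notably, the count printed in the crash message is (-1), which suggests an over-release (one dput() too many) rather than a leaked reference, since a leak would leave a positive count.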
So far, all test sessions that crash like this are for DNE with ldiskfs. Other test sessions that fail this way are at |
| Comments |
| Comment by Oleg Drokin [ 12/Dec/18 ] |
[68914.175525] BUG: Dentry ffff9ea53a31c9c0{i=49be60a6,n=17} still in use (-1) [unmount of nfs4 0:44]
This tells me the nfs4 code in RHEL has dentry accounting issues. This looks like it has nothing to do with Lustre at all. If we can reproduce this without any Lustre modules loaded, we can even submit a bug report to Red Hat if we are interested. |
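For illustration, here is a minimal sketch of the dentry accounting bug class being suspected. dget() and dput() are the real dcache helpers, but the call sites below are invented for the example and are not taken from the actual RHEL nfs4 code.

```c
/* Hypothetical illustration of a dentry reference-count imbalance of
 * the kind suspected above; use_dentry() is a placeholder, not a real
 * kernel function. */
struct dentry *d = dget(dentry);  /* take a reference: count n -> n+1 */

use_dentry(d);                    /* placeholder for real work */

dput(d);                          /* release it: count n+1 -> n */
dput(d);                          /* bug: double release, count n -> n-1;
                                   * the umount-time sweep then finds a
                                   * negative count and BUG()s */
```

A Lustre-free reproduction along these lines would be to export a local filesystem (e.g. ext4) over NFSv4, run the same workload against the NFS mount, and then umount -f; if the same BUG fires, the problem sits squarely in the RHEL NFS client.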