Details
- Type: Bug
- Resolution: Unresolved
- Priority: Minor
- Affects Version/s: Lustre 2.12.0, Lustre 2.12.1
Description
After all parallel-scale-nfsv4 tests complete, we unmount Lustre on all the clients and get a kernel crash. There are at least three occurrences of this crash, starting on 27 October 2018 with Lustre tag 2.11.56.30. There are several other parallel-scale-nfsv4 test sessions where all tests pass but the test suite times out on umount; these are possibly the same issue, but I can't confirm it.
Looking at one of the test sessions that crash, https://testing.whamcloud.com/test_sets/9361c4ca-d9d0-11e8-975a-52540065bddc , we can see that Client 2 (vm2) crashed on umount with the following stack trace. From Client 2's console log:
[68913.804086] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test complete, duration 2885 sec ============================================= 06:57:33 (1540623453)
[68914.000010] Lustre: DEBUG MARKER: umount -f /mnt/lustre
[68914.175525] BUG: Dentry ffff9ea53a31c9c0{i=49be60a6,n=17} still in use (-1) [unmount of nfs4 0:44]
[68914.177383] ------------[ cut here ]------------
[68914.178139] kernel BUG at fs/dcache.c:970!
[68914.178804] invalid opcode: 0000 [#1] SMP
[68914.179524] Modules linked in: nfsv3 nfs_acl lnet_selftest(OE) mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd virtio_balloon ppdev pcspkr joydev i2c_piix4 parport_pc parport i2c_core ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk 8139too ata_piix libata crct10dif_pclmul crct10dif_common crc32c_intel floppy serio_raw virtio_pci virtio_ring virtio 8139cp mii
[68914.193620] CPU: 1 PID: 15892 Comm: umount.nfs4 Kdump: loaded Tainted: G OE ------------ 3.10.0-862.14.4.el7.x86_64 #1
[68914.195439] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[68914.196340] task: ffff9ea568c00000 ti: ffff9ea56dea0000 task.ti: ffff9ea56dea0000
[68914.197502] RIP: 0010:[<ffffffff9f83808c>] [<ffffffff9f83808c>] shrink_dcache_for_umount_subtree+0x1cc/0x1e0
[68914.199069] RSP: 0018:ffff9ea56dea3df8 EFLAGS: 00010246
[68914.199902] RAX: 0000000000000056 RBX: ffff9ea53a31c9c0 RCX: 0000000000000000
[68914.201028] RDX: 0000000000000000 RSI: ffff9ea57fd13978 RDI: ffff9ea57fd13978
[68914.202138] RBP: ffff9ea56dea3e18 R08: 0000000000000000 R09: ffff9ea57d148f00
[68914.203248] R10: 0000000000002da0 R11: ffff9ea56dea3af6 R12: ffff9ea51da8bf00
[68914.204409] R13: 00000000000003f4 R14: ffffffffa06c69d0 R15: ffff9ea568c007d0
[68914.205555] FS: 00007fc4b794d880(0000) GS:ffff9ea57fd00000(0000) knlGS:0000000000000000
[68914.206850] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[68914.207777] CR2: 0000563ba547f7f8 CR3: 0000000040542000 CR4: 00000000000606e0
[68914.208908] Call Trace:
[68914.209321] [<ffffffff9f8399d9>] shrink_dcache_for_umount+0x49/0x60
[68914.210343] [<ffffffff9f821c1f>] generic_shutdown_super+0x1f/0x100
[68914.211336] [<ffffffff9f822052>] kill_anon_super+0x12/0x20
[68914.212326] [<ffffffffc058c07b>] nfs_kill_super+0x1b/0x30 [nfs]
[68914.213302] [<ffffffff9f82240e>] deactivate_locked_super+0x4e/0x70
[68914.214317] [<ffffffff9f822b96>] deactivate_super+0x46/0x60
[68914.215201] [<ffffffff9f840a9f>] cleanup_mnt+0x3f/0x80
[68914.216055] [<ffffffff9f840b32>] __cleanup_mnt+0x12/0x20
[68914.216916] [<ffffffff9f6bab8b>] task_work_run+0xbb/0xe0
[68914.217784] [<ffffffff9f62bc55>] do_notify_resume+0xa5/0xc0
[68914.218678] [<ffffffff9fd25ae4>] int_signal+0x12/0x17
[68914.219511] Code: 00 00 48 8b 40 28 4c 8b 08 48 8b 43 30 48 85 c0 74 1b 48 8b 50 40 48 89 34 24 48 c7 c7 f8 8f 07 a0 48 89 de 31 c0 e8 4a 53 4d 00 <0f> 0b 31 d2 eb e5 0f 0b 66 90 66 2e 0f 1f 84 00 00 00 00 00 66
[68914.224552] RIP [<ffffffff9f83808c>] shrink_dcache_for_umount_subtree+0x1cc/0x1e0
[68914.225767] RSP <ffff9ea56dea3df8>
So far, all test sessions that crash like this are for DNE with ldiskfs.
Other test sessions that fail this way are at:
https://testing.whamcloud.com/test_sets/f1035502-f781-11e8-815b-52540065bddc
https://testing.whamcloud.com/test_sets/93427904-fdec-11e8-b970-52540065bddc