Lustre / LU-11769

parallel-scale-nfsv4 crashes on client unmount


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Affects Version/s: Lustre 2.12.0, Lustre 2.12.1

    Description

      After all parallel-scale-nfsv4 tests complete, we unmount Lustre on all the clients and get a kernel crash. There are at least three occurrences of this crash, starting on 27 October 2018 with Lustre tag 2.11.56.30. There are several other parallel-scale-nfsv4 test sessions where all tests pass but the test suite times out on umount; these are possibly the same issue, but I can’t confirm.

      Looking at one of the test sessions that crashed, https://testing.whamcloud.com/test_sets/9361c4ca-d9d0-11e8-975a-52540065bddc , we can see that Client 2 (vm2) crashed on umount with the following stack trace, from Client 2’s console log:

      [68913.804086] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test complete, duration 2885 sec ============================================= 06:57:33 (1540623453)
      [68914.000010] Lustre: DEBUG MARKER: umount -f /mnt/lustre
      [68914.175525] BUG: Dentry ffff9ea53a31c9c0{i=49be60a6,n=17} still in use (-1) [unmount of nfs4 0:44]
      [68914.177383] ------------[ cut here ]------------
      [68914.178139] kernel BUG at fs/dcache.c:970!
      [68914.178804] invalid opcode: 0000 [#1] SMP 
      [68914.179524] Modules linked in: nfsv3 nfs_acl lnet_selftest(OE) mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd virtio_balloon ppdev pcspkr joydev i2c_piix4 parport_pc parport i2c_core ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk 8139too ata_piix libata crct10dif_pclmul crct10dif_common crc32c_intel floppy serio_raw virtio_pci virtio_ring virtio 8139cp mii
      [68914.193620] CPU: 1 PID: 15892 Comm: umount.nfs4 Kdump: loaded Tainted: G           OE  ------------   3.10.0-862.14.4.el7.x86_64 #1
      [68914.195439] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [68914.196340] task: ffff9ea568c00000 ti: ffff9ea56dea0000 task.ti: ffff9ea56dea0000
      [68914.197502] RIP: 0010:[<ffffffff9f83808c>]  [<ffffffff9f83808c>] shrink_dcache_for_umount_subtree+0x1cc/0x1e0
      [68914.199069] RSP: 0018:ffff9ea56dea3df8  EFLAGS: 00010246
      [68914.199902] RAX: 0000000000000056 RBX: ffff9ea53a31c9c0 RCX: 0000000000000000
      [68914.201028] RDX: 0000000000000000 RSI: ffff9ea57fd13978 RDI: ffff9ea57fd13978
      [68914.202138] RBP: ffff9ea56dea3e18 R08: 0000000000000000 R09: ffff9ea57d148f00
      [68914.203248] R10: 0000000000002da0 R11: ffff9ea56dea3af6 R12: ffff9ea51da8bf00
      [68914.204409] R13: 00000000000003f4 R14: ffffffffa06c69d0 R15: ffff9ea568c007d0
      [68914.205555] FS:  00007fc4b794d880(0000) GS:ffff9ea57fd00000(0000) knlGS:0000000000000000
      [68914.206850] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [68914.207777] CR2: 0000563ba547f7f8 CR3: 0000000040542000 CR4: 00000000000606e0
      [68914.208908] Call Trace:
      [68914.209321]  [<ffffffff9f8399d9>] shrink_dcache_for_umount+0x49/0x60
      [68914.210343]  [<ffffffff9f821c1f>] generic_shutdown_super+0x1f/0x100
      [68914.211336]  [<ffffffff9f822052>] kill_anon_super+0x12/0x20
      [68914.212326]  [<ffffffffc058c07b>] nfs_kill_super+0x1b/0x30 [nfs]
      [68914.213302]  [<ffffffff9f82240e>] deactivate_locked_super+0x4e/0x70
      [68914.214317]  [<ffffffff9f822b96>] deactivate_super+0x46/0x60
      [68914.215201]  [<ffffffff9f840a9f>] cleanup_mnt+0x3f/0x80
      [68914.216055]  [<ffffffff9f840b32>] __cleanup_mnt+0x12/0x20
      [68914.216916]  [<ffffffff9f6bab8b>] task_work_run+0xbb/0xe0
      [68914.217784]  [<ffffffff9f62bc55>] do_notify_resume+0xa5/0xc0
      [68914.218678]  [<ffffffff9fd25ae4>] int_signal+0x12/0x17
      [68914.219511] Code: 00 00 48 8b 40 28 4c 8b 08 48 8b 43 30 48 85 c0 74 1b 48 8b 50 40 48 89 34 24 48 c7 c7 f8 8f 07 a0 48 89 de 31 c0 e8 4a 53 4d 00 <0f> 0b 31 d2 eb e5 0f 0b 66 90 66 2e 0f 1f 84 00 00 00 00 00 66 
      [68914.224552] RIP  [<ffffffff9f83808c>] shrink_dcache_for_umount_subtree+0x1cc/0x1e0
      [68914.225767]  RSP <ffff9ea56dea3df8>
      

      So far, all test sessions that crash like this are for DNE with ldiskfs.
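
      For reference, the "BUG: Dentry ... still in use" message comes from the dentry reference-count assertion in shrink_dcache_for_umount_subtree() (fs/dcache.c in the RHEL 7 kernel): at umount, every dentry in the superblock's tree must have dropped to zero references, and a stray reference (or, as here, an over-put leaving the count at -1) trips the BUG. A minimal user-space sketch of that check follows; the struct and function names are illustrative, not the kernel's.

      ```c
      /* Illustrative sketch of the umount-time dentry refcount check.
       * In the real kernel this is inside shrink_dcache_for_umount_subtree()
       * and the failure path calls BUG(); here we just report it. */
      #include <stdio.h>

      struct dentry {
              int d_count;          /* simplified reference count */
              const char *d_name;
      };

      /* Returns 0 if the dentry is safe to free at umount,
       * 1 if a reference is still (or wrongly) held. */
      static int check_dentry_unused(const struct dentry *d)
      {
              if (d->d_count != 0) {
                      /* Mirrors the "still in use (N)" console message;
                       * a negative count suggests an over-put, as in this log. */
                      printf("BUG: Dentry %s still in use (%d)\n",
                             d->d_name, d->d_count);
                      return 1;
              }
              return 0;
      }

      int main(void)
      {
              struct dentry clean  = { .d_count = 0,  .d_name = "ok" };
              struct dentry leaked = { .d_count = -1, .d_name = "17" };

              printf("clean: %d\n", check_dentry_unused(&clean));
              printf("leaked: %d\n", check_dentry_unused(&leaked));
              return 0;
      }
      ```

      In the crash above the count is -1, which points at an extra dput() on a dentry in the NFSv4 superblock rather than a simple leak.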

      Other test sessions that fail this way are at
      https://testing.whamcloud.com/test_sets/f1035502-f781-11e8-815b-52540065bddc
      https://testing.whamcloud.com/test_sets/93427904-fdec-11e8-b970-52540065bddc


      People

        Assignee: WC Triage (wc-triage)
        Reporter: James Nunez (jamesanunez) (Inactive)