[LU-11766] parallel-scale-nfsv3 test racer_on_nfs crash MDS

Details

    • Bug
    • Resolution: Won't Fix
    • Minor
    • None
    • Lustre 2.12.0
    • ZFS
    • 3

    Description

      parallel-scale-nfsv3 test_racer_on_nfs crashes the MDS. In the MDS1, 3 (vm4) console log for https://testing.whamcloud.com/test_sets/2f3b07b8-fd9d-11e8-b837-52540065bddc, we see the following stack trace:

      [103333.477207] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 22:10:23 (1544566223)
      [103335.041254] LustreError: 7416:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x200022ac9:0x175ef:0x0] get parent: rc = -2
      [103344.816370] LustreError: 7419:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x200022ac9:0x1760c:0x0] get parent: rc = -2
      …
      [103545.019234] LustreError: 7421:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x200022ac9:0x17ec8:0x0] get parent: rc = -2
      [103545.020769] LustreError: 7421:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 25 previous similar messages
      [103605.968995] BUG: unable to handle kernel paging request at ffffffc0ad8da0ff
      [103605.970132] IP: [<ffffffc0ad8da0ff>] 0xffffffc0ad8da0ff
      [103605.970689] PGD 5a414067 PUD 0 
      [103605.971142] Oops: 0010 [#1] SMP 
      [103605.971605] Modules linked in: nfsd nfs_acl lustre(OE) obdecho(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) ptlrpc_gss(OE) ofd(OE) ost(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) zfs(POE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc dm_mod zunicode(POE) zavl(POE) icp(POE) ppdev iosf_mbi crc32_pclmul ghash_clmulni_intel zcommon(POE) znvpair(POE) spl(OE) aesni_intel lrw gf128mul glue_helper ablk_helper cryptd joydev pcspkr parport_pc parport virtio_balloon i2c_piix4 ip_tables ext4 mbcache jbd2 virtio_blk ata_generic pata_acpi crct10dif_pclmul crct10dif_common crc32c_intel serio_raw floppy ata_piix libata 8139too virtio_pci virtio_ring virtio 8139cp mii [last unloaded: lnet_selftest]
      [103605.986800] CPU: 1 PID: 26259 Comm: mdt00_002 Kdump: loaded Tainted: P        W  OE  ------------   3.10.0-957.el7_lustre.x86_64 #1
      [103605.987936] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [103605.988497] task: ffffa0575e598000 ti: ffffa0574d64c000 task.ti: ffffa0574d64c000
      [103605.989218] RIP: 0010:[<ffffffc0ad8da0ff>]  [<ffffffc0ad8da0ff>] 0xffffffc0ad8da0ff
      [103605.990063] RSP: 0018:ffffa0577fd03eb8  EFLAGS: 00010286
      [103605.990616] RAX: ffffffc0ad8da0ff RBX: ffffffffb12784c0 RCX: ffffa0577fd162a0
      [103605.991311] RDX: ffffa05719918027 RSI: ffffa0577ad01b28 RDI: ffffa05719918027
      [103605.992010] RBP: ffffa0577fd03f10 R08: ffffa0575d255210 R09: ffffa0577fd162a0
      [103605.992700] R10: ffffffffb12784c0 R11: ffffa0577fd03de8 R12: 000000000000000a
      [103605.993395] R13: 0000000000000000 R14: ffa057140e16a8ff R15: ffffa0577fd162c0
      [103605.994084] FS:  0000000000000000(0000) GS:ffffa0577fd00000(0000) knlGS:0000000000000000
      [103605.994857] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [103605.995424] CR2: ffffffc0ad8da0ff CR3: 000000007a856000 CR4: 00000000000606e0
      [103605.996116] Call Trace:
      [103605.996383]  <IRQ> 
      [103605.996618]  [<ffffffffb0754940>] ? rcu_process_callbacks+0x1e0/0x580
      [103605.997304]  [<ffffffffb06a0f05>] __do_softirq+0xf5/0x280
      [103605.997854]  [<ffffffffb0d7832c>] call_softirq+0x1c/0x30
      [103605.998397]  [<ffffffffb062e675>] do_softirq+0x65/0xa0
      [103605.998942]  [<ffffffffb06a1285>] irq_exit+0x105/0x110
      [103605.999453]  [<ffffffffb0d796c8>] smp_apic_timer_interrupt+0x48/0x60
      [103606.000087]  [<ffffffffb0d75df2>] apic_timer_interrupt+0x162/0x170
      [103606.000749]  <EOI> 
      [103606.001073]  [<ffffffffc15eda95>] ? dbuf_find+0x1d5/0x1e0 [zfs]
      [103606.001710]  [<ffffffffb09866cd>] ? memcpy+0xd/0x110
      [103606.002210]  [<ffffffffb0982b84>] ? vsnprintf+0x234/0x6a0
      [103606.002778]  [<ffffffffc0343675>] libcfs_debug_vmsg2+0x2f5/0xb30 [libcfs]
      [103606.003467]  [<ffffffffc15f0111>] ? dbuf_rele_and_unlock+0x371/0x4b0 [zfs]
      [103606.004154]  [<ffffffffb0d66e72>] ? down_read+0x12/0x40
      [103606.004668]  [<ffffffffb0d65e12>] ? mutex_lock+0x12/0x2f
      [103606.005363]  [<ffffffffc0c8b84b>] _ldlm_lock_debug+0x52b/0x750 [ptlrpc]
      [103606.006047]  [<ffffffffc0c8e6c0>] ldlm_lock_addref_internal_nolock+0x80/0x100 [ptlrpc]
      [103606.006853]  [<ffffffffc0caccbc>] ldlm_cli_enqueue_local+0x12c/0x870 [ptlrpc]
      [103606.007578]  [<ffffffffc0caba80>] ? ldlm_expired_completion_wait+0x220/0x220 [ptlrpc]
      [103606.008396]  [<ffffffffc0f6d7d0>] ? mdt_object_alloc+0x2c0/0x2c0 [mdt]
      [103606.009055]  [<ffffffffc0f7d4ab>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      [103606.009728]  [<ffffffffc0f6d7d0>] ? mdt_object_alloc+0x2c0/0x2c0 [mdt]
      [103606.010385]  [<ffffffffc0caba80>] ? ldlm_expired_completion_wait+0x220/0x220 [ptlrpc]
      [103606.011159]  [<ffffffffc0348a90>] ? cfs_hash_nl_unlock+0x10/0x10 [libcfs]
      [103606.011829]  [<ffffffffc0f7db30>] mdt_object_lock_internal+0x70/0x3e0 [mdt]
      [103606.012526]  [<ffffffffc0f7df57>] mdt_object_lock_try+0x27/0xb0 [mdt]
      [103606.013169]  [<ffffffffc0f7f697>] mdt_getattr_name_lock+0x1287/0x1c30 [mdt]
      [103606.013914]  [<ffffffffc0cdf2f7>] ? lustre_msg_buf+0x17/0x60 [ptlrpc]
      [103606.014594]  [<ffffffffc0d06d9f>] ? __req_capsule_get+0x15f/0x740 [ptlrpc]
      [103606.015295]  [<ffffffffc0cdf5ac>] ? lustre_msg_get_flags+0x2c/0xa0 [ptlrpc]
      [103606.015984]  [<ffffffffc0f86bb5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
      [103606.016623]  [<ffffffffc0f83a18>] mdt_intent_policy+0x2e8/0xd00 [mdt]
      [103606.017263]  [<ffffffffc0f86900>] ? mdt_intent_layout+0xcc0/0xcc0 [mdt]
      [103606.017925]  [<ffffffffc0c92ec6>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
      [103606.018582]  [<ffffffffc0348fa3>] ? cfs_hash_bd_add_locked+0x63/0x80 [libcfs]
      [103606.019292]  [<ffffffffc034c72e>] ? cfs_hash_add+0xbe/0x1a0 [libcfs]
      [103606.019934]  [<ffffffffc0cbb8a7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
      [103606.020643]  [<ffffffffc0ce37f0>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
      [103606.021425]  [<ffffffffc0d42302>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [103606.022109]  [<ffffffffc0d4935a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [103606.022784]  [<ffffffffc0343f07>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [103606.023486]  [<ffffffffc0ced92b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [103606.024325]  [<ffffffffb06cba9b>] ? __wake_up_common+0x5b/0x90
      [103606.024927]  [<ffffffffc0cf125c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
      [103606.025569]  [<ffffffffc0cf0760>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
      [103606.026298]  [<ffffffffb06c1c31>] kthread+0xd1/0xe0
      [103606.026782]  [<ffffffffb06c1b60>] ? insert_kthread_work+0x40/0x40
      [103606.027389]  [<ffffffffb0d74c37>] ret_from_fork_nospec_begin+0x21/0x21
      [103606.028026]  [<ffffffffb06c1b60>] ? insert_kthread_work+0x40/0x40
      [103606.028611] Code:  Bad RIP value.
      [103606.029005] RIP  [<ffffffc0ad8da0ff>] 0xffffffc0ad8da0ff
      [103606.029550]  RSP <ffffa0577fd03eb8>
      [103606.029903] CR2: ffffffc0ad8da0ff
      

      There are several tickets open for racer_on_nfs with similar, but not identical, stack traces.

      I can’t find another racer_on_nfs crash like this in the past month.
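Comparing this trace against the ones in the other racer_on_nfs tickets is easier if the frames marked with "?" are dropped first, since the kernel prints those for addresses it found on the stack but could not confirm as part of the active call chain. The following is only a minimal sketch of that filtering, assuming the RHEL 7 style frame format shown above; the helper names `reliable_frames` and `trace_signature`, and the choice of the `mdt` and `ptlrpc` modules as the default filter, are hypothetical and not part of any Lustre or test-framework tooling.

```python
import re

# Matches one frame of a RHEL 7 style call trace, e.g.
#   [103606.009055]  [<ffffffffc0f7d4ab>] mdt_object_local_lock+0x50b/0xb20 [mdt]
# A "?" before the symbol marks a frame the kernel found on the stack
# but could not confirm as part of the active call chain.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def reliable_frames(console_log):
    """Return (symbol, module) pairs for the frames not marked with '?'."""
    frames = []
    for line in console_log.splitlines():
        match = FRAME_RE.search(line)
        if match and not match.group("unreliable"):
            # Frames without a [module] suffix belong to the core kernel.
            frames.append((match.group("symbol"), match.group("module") or "kernel"))
    return frames


def trace_signature(console_log, modules=("mdt", "ptlrpc")):
    """Join the reliable symbols from the given modules into one comparable string."""
    return " <- ".join(
        symbol for symbol, module in reliable_frames(console_log) if module in modules
    )


# A few frames copied from the trace in this ticket.
sample = """\
[103606.004668]  [<ffffffffb0d65e12>] ? mutex_lock+0x12/0x2f
[103606.005363]  [<ffffffffc0c8b84b>] _ldlm_lock_debug+0x52b/0x750 [ptlrpc]
[103606.009055]  [<ffffffffc0f7d4ab>] mdt_object_local_lock+0x50b/0xb20 [mdt]
[103606.009728]  [<ffffffffc0f6d7d0>] ? mdt_object_alloc+0x2c0/0x2c0 [mdt]
[103606.026298]  [<ffffffffb06c1c31>] kthread+0xd1/0xe0
"""

print(trace_signature(sample))
```

Running the same function over the console logs attached to the other tickets would give one string per crash, which can then be compared directly. Symbol offsets are deliberately ignored, so traces from slightly different builds can still match.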

          Activity

            Andreas Dilger (adilger) added a comment: This is as much an NFS issue as it might be Lustre, so we do not plan to test or debug NFSv3 racer issues at this point.

            People

              Assignee: WC Triage (wc-triage)
              Reporter: James Nunez (jamesanunez) (Inactive)