Lustre / LU-12454

parallel-scale-nfsv3 test racer_on_nfs crashes with ‘BUG: unable to handle kernel paging request’


Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • Affects Version: Lustre 2.13.0
    • Severity: 3

    Description

      parallel-scale-nfsv3 test_racer_on_nfs crashes the MDS, with the following seen on the MDS console:

      [49701.504057] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 22:06:26 (1560895586)
      [49706.380665] LustreError: 16699:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) lustre: failure inode [0x200072823:0x316b:0x0] get parent: rc = -2
      [49708.950363] LustreError: 16701:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) lustre: failure inode [0x200072823:0x31a1:0x0] get parent: rc = -2
      [49708.952257] LustreError: 16701:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) Skipped 3 previous similar messages
      [49712.266388] LustreError: 16703:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) lustre: failure inode [0x200072823:0x3149:0x0] get parent: rc = -2
      [49712.267842] LustreError: 16703:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) Skipped 8 previous similar messages
      [49714.267135] LustreError: 16701:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) lustre: failure inode [0x200072823:0x3192:0x0] get parent: rc = -2
      [49714.268519] LustreError: 16701:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) Skipped 17 previous similar messages
      [49718.497228] LustreError: 16701:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) lustre: failure inode [0x200072823:0x319c:0x0] get parent: rc = -2
      [49718.498722] LustreError: 16701:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) Skipped 4 previous similar messages
      [49726.526828] LustreError: 16700:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) lustre: failure inode [0x200072823:0x331b:0x0] get parent: rc = -2
      [49726.528292] LustreError: 16700:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) Skipped 21 previous similar messages
      [49742.526786] LustreError: 16701:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) lustre: failure inode [0x200072823:0x33ae:0x0] get parent: rc = -2
      [49742.528242] LustreError: 16701:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) Skipped 67 previous similar messages
      [49774.714035] LustreError: 16704:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) lustre: failure inode [0x200072823:0x3621:0x0] get parent: rc = -2
      [49774.715588] LustreError: 16704:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) Skipped 78 previous similar messages
      [49820.920174] reconnect_path: npd != pd
      [49838.821293] LustreError: 16704:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) lustre: failure inode [0x200072823:0x3add:0x0] get parent: rc = -2
      [49838.822654] LustreError: 16704:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) Skipped 165 previous similar messages
      [49966.044645] Lustre: 16703:0:(client.c:2213:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1560895807/real 1560895807]  req@ffff89fa8facb600 x1636666139976640/t0(0) o36->lustre-MDT0000-mdc-ffff89fa95f50800@10.9.4.243@tcp:12/10 lens 488/512 e 0 to 1 dl 1560895851 ref 2 fl Rpc:XQr/0/ffffffff rc 0/-1
      [49966.047538] Lustre: lustre-MDT0000-mdc-ffff89fa95f50800: Connection to lustre-MDT0000 (at 10.9.4.243@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [49966.049754] Lustre: lustre-MDT0000: Client b75a1e0a-907a-4 (at 10.9.4.243@tcp) reconnecting
      [49966.050727] Lustre: lustre-MDT0000: Connection restored to 10.9.4.243@tcp (at 10.9.4.243@tcp)
      [49972.210770] LustreError: 16702:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) lustre: failure inode [0x200072823:0x4157:0x0] get parent: rc = -2
      [49972.212693] LustreError: 16702:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) Skipped 157 previous similar messages
      [49982.423146] BUG: unable to handle kernel paging request at ffffffac2c8840ff
      [49982.423990] IP: [<ffffffac2c8840ff>] 0xffffffac2c8840ff
      [49982.424536] PGD 3d414067 PUD 0 
      [49982.424914] Oops: 0010 [#1] SMP 
      [49982.425289] Modules linked in: nfsd nfs_acl lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) dm_flakey osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc dm_mod iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel ppdev lrw gf128mul glue_helper ablk_helper cryptd pcspkr joydev virtio_balloon parport_pc parport i2c_piix4 ip_tables ext4 mbcache jbd2 virtio_blk ata_generic pata_acpi crct10dif_pclmul crct10dif_common crc32c_intel floppy serio_raw ata_piix 8139too libata virtio_pci virtio_ring virtio 8139cp mii [last unloaded: dm_flakey]
      [49982.435135] CPU: 1 PID: 16705 Comm: nfsd Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.12.2.el7_lustre.x86_64 #1
      [49982.436262] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [49982.436815] task: ffff89fa595230c0 ti: ffff89faa49c0000 task.ti: ffff89faa49c0000
      [49982.437542] RIP: 0010:[<ffffffac2c8840ff>]  [<ffffffac2c8840ff>] 0xffffffac2c8840ff
      [49982.438287] RSP: 0018:ffff89fabfd03eb8  EFLAGS: 00010286
      [49982.438803] RAX: ffffffac2c8840ff RBX: fffffffface78500 RCX: 0000000005bd981f
      [49982.439479] RDX: ffff89fabb588d4f RSI: fffff74b41df7200 RDI: ffff89fabb588d4f
      [49982.440156] RBP: ffff89fabfd03f10 R08: 000000000001f0a0 R09: ffffffffac354e6c
      [49982.440842] R10: ffff89fabfd1f0a0 R11: fffff74b4166ae40 R12: 000000000000000a
      [49982.441520] R13: 0000000000000006 R14: ff89fa756b7f90ff R15: ffff89fabfd162c0
      [49982.442196] FS:  0000000000000000(0000) GS:ffff89fabfd00000(0000) knlGS:0000000000000000
      [49982.442964] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [49982.443523] CR2: ffffffac2c8840ff CR3: 0000000079b70000 CR4: 00000000000606e0
      [49982.444196] Call Trace:
      [49982.444459]  <IRQ> 
      [49982.444679]  [<ffffffffac354e20>] ? rcu_process_callbacks+0x1e0/0x580
      [49982.445340]  [<ffffffffac2a0ff5>] __do_softirq+0xf5/0x280
      [49982.445880]  [<ffffffffac97932c>] call_softirq+0x1c/0x30
      [49982.446399]  [<ffffffffac22e675>] do_softirq+0x65/0xa0
      [49982.446897]  [<ffffffffac2a1375>] irq_exit+0x105/0x110
      [49982.447393]  [<ffffffffac97a6e8>] smp_apic_timer_interrupt+0x48/0x60
      [49982.448009]  [<ffffffffac976df2>] apic_timer_interrupt+0x162/0x170
      [49982.448602]  <EOI> 
      [49982.448873]  [<ffffffffac583551>] ? vsnprintf+0x421/0x6a0
      [49982.449468]  [<ffffffffc05f6915>] libcfs_debug_vmsg2+0x2f5/0xb30 [libcfs]
      [49982.450251]  [<ffffffffc09d8719>] _ldlm_lock_debug+0x3f9/0x750 [ptlrpc]
      [49982.450923]  [<ffffffffc09dcfd0>] ldlm_lock_match_with_skip+0x5d0/0x8d0 [ptlrpc]
      [49982.451648]  [<ffffffffac3024b2>] ? ktime_get+0x52/0xe0
      [49982.452175]  [<ffffffffc09dc0f0>] ? ldlm_lock_decref_internal+0x2f0/0xa30 [ptlrpc]
      [49982.453016]  [<ffffffffc0732173>] ? class_handle2object+0xc3/0x1c0 [obdclass]
      [49982.453735]  [<ffffffffc117886f>] osc_enqueue_base+0x10f/0x6b0 [osc]
      [49982.454377]  [<ffffffffc075e156>] ? cl_lock_slice_add+0x46/0x130 [obdclass]
      [49982.455060]  [<ffffffffc11817a0>] ? osc_lock_lvb_update+0x330/0x330 [osc]
      [49982.455723]  [<ffffffffc1183a2b>] osc_lock_enqueue+0x38b/0x840 [osc]
      [49982.456336]  [<ffffffffc11817a0>] ? osc_lock_lvb_update+0x330/0x330 [osc]
      [49982.457010]  [<ffffffffc075e535>] cl_lock_enqueue+0x65/0x120 [obdclass]
      [49982.457677]  [<ffffffffc1127925>] lov_lock_enqueue+0x95/0x150 [lov]
      [49982.458290]  [<ffffffffc075e535>] cl_lock_enqueue+0x65/0x120 [obdclass]
      [49982.458944]  [<ffffffffc075ebf0>] cl_lock_request+0x190/0x1f0 [obdclass]
      [49982.459661]  [<ffffffffc1298b81>] cl_glimpse_lock+0x311/0x370 [lustre]
      [49982.460299]  [<ffffffffc1298f0d>] cl_glimpse_size0+0x20d/0x240 [lustre]
      [49982.460941]  [<ffffffffc125d392>] ll_getattr_dentry+0x3b2/0x670 [lustre]
      [49982.461590]  [<ffffffffc125d664>] ll_getattr+0x14/0x20 [lustre]
      [49982.462166]  [<ffffffffac447219>] vfs_getattr+0x49/0x80
      [49982.462704]  [<ffffffffc1321ee8>] fill_post_wcc+0x58/0xe0 [nfsd]
      [49982.463284]  [<ffffffffc1318aac>] nfsd_setattr+0x41c/0x4a0 [nfsd]
      [49982.463877]  [<ffffffffc131ce32>] ? lru_put_end+0x52/0x60 [nfsd]
      [49982.464456]  [<ffffffffc131d4dc>] ? nfsd_cache_lookup+0x3ac/0x640 [nfsd]
      [49982.465089]  [<ffffffffc131fcfe>] nfsd3_proc_setattr+0x6e/0xb0 [nfsd]
      [49982.465725]  [<ffffffffc1311810>] nfsd_dispatch+0xe0/0x290 [nfsd]
      [49982.466371]  [<ffffffffc02b1cf3>] svc_process_common+0x493/0x760 [sunrpc]
      [49982.467036]  [<ffffffffc02b20c3>] svc_process+0x103/0x190 [sunrpc]
      [49982.467639]  [<ffffffffc131116f>] nfsd+0xdf/0x150 [nfsd]
      [49982.468155]  [<ffffffffc1311090>] ? nfsd_destroy+0x80/0x80 [nfsd]
      [49982.468752]  [<ffffffffac2c1d21>] kthread+0xd1/0xe0
      [49982.469230]  [<ffffffffac2c1c50>] ? insert_kthread_work+0x40/0x40
      [49982.469830]  [<ffffffffac975c37>] ret_from_fork_nospec_begin+0x21/0x21
      [49982.470453]  [<ffffffffac2c1c50>] ? insert_kthread_work+0x40/0x40
      [49982.471035] Code:  Bad RIP value.
      [49982.471410] RIP  [<ffffffac2c8840ff>] 0xffffffac2c8840ff
      [49982.471949]  RSP <ffff89fabfd03eb8>
      [49982.472289] CR2: ffffffac2c8840ff
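      The repeated ll_dir_get_parent_fid() failures above all return rc = -2 (-ENOENT), and the console rate-limits them into "Skipped N previous similar messages" lines. For triage, the true failure count can be tallied with a short script; this is a hypothetical helper sketch (count_parent_failures is an assumed name, not part of the Lustre test suite):

```python
import re

# Matches the ll_dir_get_parent_fid() error lines shown in the MDS console above.
ERR_RE = re.compile(
    r"ll_dir_get_parent_fid\(\)\) lustre: failure inode "
    r"\[(0x[0-9a-f]+):(0x[0-9a-f]+):(0x[0-9a-f]+)\] get parent: rc = (-?\d+)"
)
# Matches the rate-limiter's summary lines.
SKIP_RE = re.compile(r"Skipped (\d+) previous similar messages")

def count_parent_failures(console_text):
    """Return (explicit_lines, total_including_skipped) for rc = -2 failures."""
    explicit = 0
    skipped = 0
    for line in console_text.splitlines():
        m = ERR_RE.search(line)
        if m and int(m.group(4)) == -2:
            explicit += 1
            continue
        m = SKIP_RE.search(line)
        if m:
            skipped += int(m.group(1))
    return explicit, explicit + skipped
```

      Run against the MDS console excerpt, this gives both the printed failure count and the larger total implied by the "Skipped" lines.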
      

      The last messages seen on the client1 (vm4) console are:

      [109601.055916] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 22:06:26 (1560895586)
      [109601.248721] Lustre: DEBUG MARKER: MDSCOUNT=1 OSTCOUNT=7 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
      [109607.600126] 10[29731]: segfault at 8 ip 00007fd80b408718 sp 00007ffc28190300 error 4 in ld-2.17.so[7fd80b3fd000+22000]
      [109887.498220] LNetError: 30272:0:(peer.c:3676:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.4.243@tcp added to recovery queue. Health = 900
      [109907.498362] LNetError: 30272:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.4.240@tcp added to recovery queue. Health = 900
      [109947.497986] LNetError: 30272:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.4.240@tcp added to recovery queue. Health = 900
      

      On the console for client2 (vm5), we see:

      [109601.051403] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 22:06:26 (1560895586)
      [109601.236777] Lustre: DEBUG MARKER: MDSCOUNT=1 OSTCOUNT=7 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
      [109729.013749] 4[14656]: segfault at 8 ip 00007ff224db1718 sp 00007fff38bc3e90 error 4 in ld-2.17.so[7ff224da6000+22000]
      [109907.509083] LNetError: 19177:0:(peer.c:3676:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.4.243@tcp added to recovery queue. Health = 900
      [109912.509852] LNetError: 19177:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.4.241@tcp added to recovery queue. Health = 900
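      Both client consoles report racer-induced segfaults in ld-2.17.so. The offset of the faulting instruction within the mapped library (useful for addr2line against the matching glibc debuginfo) can be recovered from the segfault line itself; the sketch below is a hypothetical triage helper (segv_offset is an assumed name):

```python
import re

# Matches the kernel's user-space segfault report, e.g.
# "10[29731]: segfault at 8 ip 00007fd80b408718 ... in ld-2.17.so[7fd80b3fd000+22000]"
SEGV_RE = re.compile(
    r"segfault at ([0-9a-f]+) ip ([0-9a-f]+) .* in (\S+)\[([0-9a-f]+)\+([0-9a-f]+)\]"
)

def segv_offset(line):
    """Return (module, ip offset within module, fault address), or None."""
    m = SEGV_RE.search(line)
    if not m:
        return None
    addr, ip, mod, base, _size = m.groups()
    return mod, int(ip, 16) - int(base, 16), int(addr, 16)
```

      For the client1 line, ip 0x7fd80b408718 minus the mapping base 0x7fd80b3fd000 gives offset 0xb718 into ld-2.17.so; the fault address 8 suggests a NULL-pointer-plus-small-offset dereference.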
      

      So far, we’ve only seen this crash once at
      https://testing.whamcloud.com/test_sets/3aebe3dc-9223-11e9-abe3-52540065bddc

    People

      Assignee: WC Triage (wc-triage)
      Reporter: James Nunez (jamesanunez) (Inactive)
      Votes: 0
      Watchers: 2