[LU-12454] parallel-scale-nfsv3 test racer_on_nfs crashes with ‘BUG: unable to handle kernel paging request’ Created: 19/Jun/19  Updated: 23/Jun/22  Resolved: 22/Jun/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-11063 RHEL7.[345] RCU breakage Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

parallel-scale-nfsv3 test_racer_on_nfs crashes the MDS. The following is seen on the MDS console:

[49701.504057] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 22:06:26 (1560895586)
[49706.380665] LustreError: 16699:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) lustre: failure inode [0x200072823:0x316b:0x0] get parent: rc = -2
[49708.950363] LustreError: 16701:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) lustre: failure inode [0x200072823:0x31a1:0x0] get parent: rc = -2
[49708.952257] LustreError: 16701:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) Skipped 3 previous similar messages
[49712.266388] LustreError: 16703:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) lustre: failure inode [0x200072823:0x3149:0x0] get parent: rc = -2
[49712.267842] LustreError: 16703:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) Skipped 8 previous similar messages
[49714.267135] LustreError: 16701:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) lustre: failure inode [0x200072823:0x3192:0x0] get parent: rc = -2
[49714.268519] LustreError: 16701:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) Skipped 17 previous similar messages
[49718.497228] LustreError: 16701:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) lustre: failure inode [0x200072823:0x319c:0x0] get parent: rc = -2
[49718.498722] LustreError: 16701:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) Skipped 4 previous similar messages
[49726.526828] LustreError: 16700:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) lustre: failure inode [0x200072823:0x331b:0x0] get parent: rc = -2
[49726.528292] LustreError: 16700:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) Skipped 21 previous similar messages
[49742.526786] LustreError: 16701:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) lustre: failure inode [0x200072823:0x33ae:0x0] get parent: rc = -2
[49742.528242] LustreError: 16701:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) Skipped 67 previous similar messages
[49774.714035] LustreError: 16704:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) lustre: failure inode [0x200072823:0x3621:0x0] get parent: rc = -2
[49774.715588] LustreError: 16704:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) Skipped 78 previous similar messages
[49820.920174] reconnect_path: npd != pd
[49838.821293] LustreError: 16704:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) lustre: failure inode [0x200072823:0x3add:0x0] get parent: rc = -2
[49838.822654] LustreError: 16704:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) Skipped 165 previous similar messages
[49966.044645] Lustre: 16703:0:(client.c:2213:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1560895807/real 1560895807]  req@ffff89fa8facb600 x1636666139976640/t0(0) o36->lustre-MDT0000-mdc-ffff89fa95f50800@10.9.4.243@tcp:12/10 lens 488/512 e 0 to 1 dl 1560895851 ref 2 fl Rpc:XQr/0/ffffffff rc 0/-1
[49966.047538] Lustre: lustre-MDT0000-mdc-ffff89fa95f50800: Connection to lustre-MDT0000 (at 10.9.4.243@tcp) was lost; in progress operations using this service will wait for recovery to complete
[49966.049754] Lustre: lustre-MDT0000: Client b75a1e0a-907a-4 (at 10.9.4.243@tcp) reconnecting
[49966.050727] Lustre: lustre-MDT0000: Connection restored to 10.9.4.243@tcp (at 10.9.4.243@tcp)
[49972.210770] LustreError: 16702:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) lustre: failure inode [0x200072823:0x4157:0x0] get parent: rc = -2
[49972.212693] LustreError: 16702:0:(llite_nfs.c:348:ll_dir_get_parent_fid()) Skipped 157 previous similar messages
[49982.423146] BUG: unable to handle kernel paging request at ffffffac2c8840ff
[49982.423990] IP: [<ffffffac2c8840ff>] 0xffffffac2c8840ff
[49982.424536] PGD 3d414067 PUD 0 
[49982.424914] Oops: 0010 [#1] SMP 
[49982.425289] Modules linked in: nfsd nfs_acl lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) dm_flakey osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc dm_mod iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel ppdev lrw gf128mul glue_helper ablk_helper cryptd pcspkr joydev virtio_balloon parport_pc parport i2c_piix4 ip_tables ext4 mbcache jbd2 virtio_blk ata_generic pata_acpi crct10dif_pclmul crct10dif_common crc32c_intel floppy serio_raw ata_piix 8139too libata virtio_pci virtio_ring virtio 8139cp mii [last unloaded: dm_flakey]
[49982.435135] CPU: 1 PID: 16705 Comm: nfsd Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.12.2.el7_lustre.x86_64 #1
[49982.436262] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[49982.436815] task: ffff89fa595230c0 ti: ffff89faa49c0000 task.ti: ffff89faa49c0000
[49982.437542] RIP: 0010:[<ffffffac2c8840ff>]  [<ffffffac2c8840ff>] 0xffffffac2c8840ff
[49982.438287] RSP: 0018:ffff89fabfd03eb8  EFLAGS: 00010286
[49982.438803] RAX: ffffffac2c8840ff RBX: fffffffface78500 RCX: 0000000005bd981f
[49982.439479] RDX: ffff89fabb588d4f RSI: fffff74b41df7200 RDI: ffff89fabb588d4f
[49982.440156] RBP: ffff89fabfd03f10 R08: 000000000001f0a0 R09: ffffffffac354e6c
[49982.440842] R10: ffff89fabfd1f0a0 R11: fffff74b4166ae40 R12: 000000000000000a
[49982.441520] R13: 0000000000000006 R14: ff89fa756b7f90ff R15: ffff89fabfd162c0
[49982.442196] FS:  0000000000000000(0000) GS:ffff89fabfd00000(0000) knlGS:0000000000000000
[49982.442964] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[49982.443523] CR2: ffffffac2c8840ff CR3: 0000000079b70000 CR4: 00000000000606e0
[49982.444196] Call Trace:
[49982.444459]  <IRQ> 
[49982.444679]  [<ffffffffac354e20>] ? rcu_process_callbacks+0x1e0/0x580
[49982.445340]  [<ffffffffac2a0ff5>] __do_softirq+0xf5/0x280
[49982.445880]  [<ffffffffac97932c>] call_softirq+0x1c/0x30
[49982.446399]  [<ffffffffac22e675>] do_softirq+0x65/0xa0
[49982.446897]  [<ffffffffac2a1375>] irq_exit+0x105/0x110
[49982.447393]  [<ffffffffac97a6e8>] smp_apic_timer_interrupt+0x48/0x60
[49982.448009]  [<ffffffffac976df2>] apic_timer_interrupt+0x162/0x170
[49982.448602]  <EOI> 
[49982.448873]  [<ffffffffac583551>] ? vsnprintf+0x421/0x6a0
[49982.449468]  [<ffffffffc05f6915>] libcfs_debug_vmsg2+0x2f5/0xb30 [libcfs]
[49982.450251]  [<ffffffffc09d8719>] _ldlm_lock_debug+0x3f9/0x750 [ptlrpc]
[49982.450923]  [<ffffffffc09dcfd0>] ldlm_lock_match_with_skip+0x5d0/0x8d0 [ptlrpc]
[49982.451648]  [<ffffffffac3024b2>] ? ktime_get+0x52/0xe0
[49982.452175]  [<ffffffffc09dc0f0>] ? ldlm_lock_decref_internal+0x2f0/0xa30 [ptlrpc]
[49982.453016]  [<ffffffffc0732173>] ? class_handle2object+0xc3/0x1c0 [obdclass]
[49982.453735]  [<ffffffffc117886f>] osc_enqueue_base+0x10f/0x6b0 [osc]
[49982.454377]  [<ffffffffc075e156>] ? cl_lock_slice_add+0x46/0x130 [obdclass]
[49982.455060]  [<ffffffffc11817a0>] ? osc_lock_lvb_update+0x330/0x330 [osc]
[49982.455723]  [<ffffffffc1183a2b>] osc_lock_enqueue+0x38b/0x840 [osc]
[49982.456336]  [<ffffffffc11817a0>] ? osc_lock_lvb_update+0x330/0x330 [osc]
[49982.457010]  [<ffffffffc075e535>] cl_lock_enqueue+0x65/0x120 [obdclass]
[49982.457677]  [<ffffffffc1127925>] lov_lock_enqueue+0x95/0x150 [lov]
[49982.458290]  [<ffffffffc075e535>] cl_lock_enqueue+0x65/0x120 [obdclass]
[49982.458944]  [<ffffffffc075ebf0>] cl_lock_request+0x190/0x1f0 [obdclass]
[49982.459661]  [<ffffffffc1298b81>] cl_glimpse_lock+0x311/0x370 [lustre]
[49982.460299]  [<ffffffffc1298f0d>] cl_glimpse_size0+0x20d/0x240 [lustre]
[49982.460941]  [<ffffffffc125d392>] ll_getattr_dentry+0x3b2/0x670 [lustre]
[49982.461590]  [<ffffffffc125d664>] ll_getattr+0x14/0x20 [lustre]
[49982.462166]  [<ffffffffac447219>] vfs_getattr+0x49/0x80
[49982.462704]  [<ffffffffc1321ee8>] fill_post_wcc+0x58/0xe0 [nfsd]
[49982.463284]  [<ffffffffc1318aac>] nfsd_setattr+0x41c/0x4a0 [nfsd]
[49982.463877]  [<ffffffffc131ce32>] ? lru_put_end+0x52/0x60 [nfsd]
[49982.464456]  [<ffffffffc131d4dc>] ? nfsd_cache_lookup+0x3ac/0x640 [nfsd]
[49982.465089]  [<ffffffffc131fcfe>] nfsd3_proc_setattr+0x6e/0xb0 [nfsd]
[49982.465725]  [<ffffffffc1311810>] nfsd_dispatch+0xe0/0x290 [nfsd]
[49982.466371]  [<ffffffffc02b1cf3>] svc_process_common+0x493/0x760 [sunrpc]
[49982.467036]  [<ffffffffc02b20c3>] svc_process+0x103/0x190 [sunrpc]
[49982.467639]  [<ffffffffc131116f>] nfsd+0xdf/0x150 [nfsd]
[49982.468155]  [<ffffffffc1311090>] ? nfsd_destroy+0x80/0x80 [nfsd]
[49982.468752]  [<ffffffffac2c1d21>] kthread+0xd1/0xe0
[49982.469230]  [<ffffffffac2c1c50>] ? insert_kthread_work+0x40/0x40
[49982.469830]  [<ffffffffac975c37>] ret_from_fork_nospec_begin+0x21/0x21
[49982.470453]  [<ffffffffac2c1c50>] ? insert_kthread_work+0x40/0x40
[49982.471035] Code:  Bad RIP value.
[49982.471410] RIP  [<ffffffac2c8840ff>] 0xffffffac2c8840ff
[49982.471949]  RSP <ffff89fabfd03eb8>
[49982.472289] CR2: ffffffac2c8840ff
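
Analysis note: the oops faults while executing at a garbage address (RIP == CR2 == ffffffac2c8840ff, hence "Code: Bad RIP value.") out of rcu_process_callbacks() in softirq context, i.e. RCU invoked a callback whose function pointer no longer points at valid code. That matches the RHEL7 RCU breakage tracked in LU-11063, of which this ticket was resolved as a duplicate. Below is a minimal sketch of how a stale RCU callback produces exactly this signature; sketch_obj, sketch_free_cb() and sketch_release() are hypothetical names for illustration, not Lustre code.

/*
 * Illustrative only: a stale call_rcu() callback yields "unable to
 * handle kernel paging request" with RIP == CR2 inside
 * rcu_process_callbacks().
 */
#include <linux/slab.h>
#include <linux/rcupdate.h>

struct sketch_obj {
	struct rcu_head rcu;
	/* ... payload ... */
};

static void sketch_free_cb(struct rcu_head *head)
{
	kfree(container_of(head, struct sketch_obj, rcu));
}

static void sketch_release(struct sketch_obj *obj)
{
	/* Defer the free until after the current RCU grace period. */
	call_rcu(&obj->rcu, sketch_free_cb);
	/*
	 * But if obj (and the embedded rcu_head) is freed or scribbled
	 * on before the grace period ends, or the module providing
	 * sketch_free_cb() is unloaded first, rcu_process_callbacks()
	 * later jumps through a corrupted function pointer.  The CPU
	 * faults fetching instructions at a bogus address, so the oops
	 * reports RIP == CR2 and "Code: Bad RIP value", as above.
	 */
}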

The last messages seen on the client1 (vm4) console are:

[109601.055916] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 22:06:26 (1560895586)
[109601.248721] Lustre: DEBUG MARKER: MDSCOUNT=1 OSTCOUNT=7 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
[109607.600126] 10[29731]: segfault at 8 ip 00007fd80b408718 sp 00007ffc28190300 error 4 in ld-2.17.so[7fd80b3fd000+22000]
[109887.498220] LNetError: 30272:0:(peer.c:3676:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.4.243@tcp added to recovery queue. Health = 900
[109907.498362] LNetError: 30272:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.4.240@tcp added to recovery queue. Health = 900
[109947.497986] LNetError: 30272:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.4.240@tcp added to recovery queue. Health = 900

On the console for client2 (vm5), we see:

[109601.051403] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 22:06:26 (1560895586)
[109601.236777] Lustre: DEBUG MARKER: MDSCOUNT=1 OSTCOUNT=7 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
[109729.013749] 4[14656]: segfault at 8 ip 00007ff224db1718 sp 00007fff38bc3e90 error 4 in ld-2.17.so[7ff224da6000+22000]
[109907.509083] LNetError: 19177:0:(peer.c:3676:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.4.243@tcp added to recovery queue. Health = 900
[109912.509852] LNetError: 19177:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.4.241@tcp added to recovery queue. Health = 900

So far, we’ve only seen this crash once at
https://testing.whamcloud.com/test_sets/3aebe3dc-9223-11e9-abe3-52540065bddc
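
For reference, the ll_dir_get_parent_fid() messages earlier in the MDS log are not themselves the crash: nfsd reconnects a disconnected filehandle by walking up through the filesystem's export_operations ->get_parent() hook, and racer concurrently renames and unlinks the directories it exercises, so the parent lookup can legitimately return -ENOENT (rc = -2); "reconnect_path: npd != pd" appears to be the exportfs reconnect code noticing the tree changed underneath it. A simplified skeleton of such a hook is sketched below; the Lustre implementation lives in lustre/llite/llite_nfs.c (ll_dir_get_parent_fid() in the log is part of it), and sketch_get_parent()/sketch_lookup_parent() here are hypothetical stand-ins, not the actual code.

#include <linux/exportfs.h>
#include <linux/dcache.h>
#include <linux/err.h>

/* Hypothetical stand-in for the MDS round trip performed by
 * ll_dir_get_parent_fid(); returns ERR_PTR(-ENOENT) when the
 * directory has already been unlinked. */
static struct inode *sketch_lookup_parent(struct inode *dir);

static struct dentry *sketch_get_parent(struct dentry *child)
{
	struct inode *parent = sketch_lookup_parent(child->d_inode);

	if (IS_ERR(parent))
		return ERR_CAST(parent);  /* e.g. -ENOENT under racer */

	/* Find or create a (possibly disconnected) dentry for it. */
	return d_obtain_alias(parent);
}

static const struct export_operations sketch_export_ops = {
	.get_parent = sketch_get_parent,
};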



 Comments   
Comment by Oleg Drokin [ 22/Jun/19 ]

really a duplicate of LU-11063
