[LU-11812] parallel-scale-nfsv4 test racer_on_nfs crashes with “BUG: unable to handle kernel NULL pointer dereference” Created: 19/Dec/18  Updated: 24/Jun/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.1, Lustre 2.12.2
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None
Environment:

2.11.0 servers with 2.12.0 RC2 clients
2.12.53.1 servers with 2.12.1 clients


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The client crashes during parallel-scale-nfsv4 test racer_on_nfs with 2.11.0 servers and 2.12.0 RC2 clients.

The logs at https://testing.whamcloud.com/test_sets/d535d716-fd79-11e8-a97c-52540065bddc show the following stack trace on the client 1 (vm10) console:

[47194.303686] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 18:06:33 (1544551593)
[47194.487648] Lustre: DEBUG MARKER: MDSCOUNT=1 OSTCOUNT=7 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
[47277.681873] 2[25283]: segfault at 8 ip 00007f3fedceb718 sp 00007ffdbca831f0 error 4 in ld-2.17.so[7f3fedce0000+22000]
[47414.009897] 15[3823]: segfault at 0 ip 00000000004043e0 sp 00007ffca5fa64e8 error 6 in 15[400000+6000]
[47419.350008] 14[17578]: segfault at 8 ip 00007ff5d6baa718 sp 00007ffecd59c0f0 error 4 in ld-2.17.so[7ff5d6b9f000+22000]
[47427.166966] 9[4070]: segfault at 8 ip 00007fe542938718 sp 00007ffeb0050000 error 4 in ld-2.17.so[7fe54292d000+22000]
[47450.889695] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
[47450.890808] IP: [<ffffffffc073d6a6>] nfs_advise_use_readdirplus+0x6/0x40 [nfs]
[47450.891637] PGD 800000005bd1c067 PUD 7c171067 PMD 0 
[47450.892238] Oops: 0000 [#1] SMP 
[47450.892634] Modules linked in: nfsv3 nfs_acl mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core iosf_mbi crc32_pclmul ghash_clmulni_intel sunrpc ppdev aesni_intel pcspkr joydev lrw gf128mul glue_helper ablk_helper cryptd virtio_balloon i2c_piix4 parport_pc parport ip_tables ext4 mbcache jbd2 virtio_blk ata_generic pata_acpi crct10dif_pclmul crct10dif_common crc32c_intel serio_raw floppy ata_piix libata 8139too virtio_pci virtio_ring virtio 8139cp mii [last unloaded: lnet_selftest]
[47450.901754] CPU: 0 PID: 20258 Comm: rm Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.el7.x86_64 #1
[47450.902787] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[47450.903355] task: ffff8d5a828f0000 ti: ffff8d5a8ca50000 task.ti: ffff8d5a8ca50000
[47450.904097] RIP: 0010:[<ffffffffc073d6a6>]  [<ffffffffc073d6a6>] nfs_advise_use_readdirplus+0x6/0x40 [nfs]
[47450.905052] RSP: 0018:ffff8d5a8ca53df8  EFLAGS: 00010246
[47450.905560] RAX: ffff8d5aeca26000 RBX: ffff8d5acf009640 RCX: ffffff8000000000
[47450.906225] RDX: ffffff8100000000 RSI: ffffff8100000000 RDI: 0000000000000000
[47450.906903] RBP: ffff8d5a8ca53e40 R08: 0000000000000001 R09: 0000000000000000
[47450.907579] R10: 00007ffdb2737ca0 R11: 0000000000000246 R12: ffff8d5aeca26000
[47450.908233] R13: ffff8d5a80c236c0 R14: ffff8d5afa56d6a0 R15: ffff8d5a8ca53ec0
[47450.908910] FS:  00007f35ee41e740(0000) GS:ffff8d5affc00000(0000) knlGS:0000000000000000
[47450.909721] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[47450.910284] CR2: 0000000000000028 CR3: 000000007b4f6000 CR4: 00000000000606f0
[47450.911063] Call Trace:
[47450.911352]  [<ffffffffc0743d19>] ? nfs_getattr+0xf9/0x250 [nfs]
[47450.911987]  [<ffffffffa2246aa9>] vfs_getattr+0x49/0x80
[47450.912539]  [<ffffffffa2246b25>] vfs_fstat+0x45/0x80
[47450.913081]  [<ffffffffa2247094>] SYSC_newfstat+0x24/0x60
[47450.913687]  [<ffffffffa2774d21>] ? system_call_after_swapgs+0xae/0x146
[47450.914387]  [<ffffffffa2774d15>] ? system_call_after_swapgs+0xa2/0x146
[47450.915054]  [<ffffffffa2774d21>] ? system_call_after_swapgs+0xae/0x146
[47450.915724]  [<ffffffffa2774d15>] ? system_call_after_swapgs+0xa2/0x146
[47450.916373]  [<ffffffffa2774d21>] ? system_call_after_swapgs+0xae/0x146
[47450.917022]  [<ffffffffa2774d15>] ? system_call_after_swapgs+0xa2/0x146
[47450.917664]  [<ffffffffa2774d21>] ? system_call_after_swapgs+0xae/0x146
[47450.918279]  [<ffffffffa2774d15>] ? system_call_after_swapgs+0xa2/0x146
[47450.918908]  [<ffffffffa2774d21>] ? system_call_after_swapgs+0xae/0x146
[47450.919551]  [<ffffffffa2774d15>] ? system_call_after_swapgs+0xa2/0x146
[47450.920171]  [<ffffffffa2774d21>] ? system_call_after_swapgs+0xae/0x146
[47450.920798]  [<ffffffffa224746e>] SyS_newfstat+0xe/0x10
[47450.921293]  [<ffffffffa2774ddb>] system_call_fastpath+0x22/0x27
[47450.921879]  [<ffffffffa2774d21>] ? system_call_after_swapgs+0xae/0x146
[47450.922527] Code: 89 8d 60 ff ff ff e8 81 df 01 e2 8b 8d 60 ff ff ff e9 02 fe ff ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 <48> 8b 47 28 48 89 e5 48 8b 80 50 03 00 00 f6 80 fc 02 00 00 01 
[47450.925633] RIP  [<ffffffffc073d6a6>] nfs_advise_use_readdirplus+0x6/0x40 [nfs]
[47450.926354]  RSP <ffff8d5a8ca53df8>
[47450.926704] CR2: 0000000000000028
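
As a reading aid for the oops above, the following is a minimal sketch that parses a few lines of that log and checks whether the faulting instruction is consistent with the reported fault address. The helper names `parse_oops` and `decode_mov_load` are hypothetical, and the decoder deliberately handles only the single instruction form found at the `<48>` marker in the `Code:` line; it is not a general disassembler.

```python
import re

# Lines copied from the client 1 (vm10) console log above.
OOPS = """\
[47450.889695] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
[47450.904097] RIP: 0010:[<ffffffffc073d6a6>]  [<ffffffffc073d6a6>] nfs_advise_use_readdirplus+0x6/0x40 [nfs]
[47450.906225] RDX: ffffff8100000000 RSI: ffffff8100000000 RDI: 0000000000000000
[47450.922527] Code: 89 8d 60 ff ff ff e8 81 df 01 e2 8b 8d 60 ff ff ff e9 02 fe ff ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 <48> 8b 47 28 48 89 e5 48 8b 80 50 03 00 00 f6 80 fc 02 00 00 01
"""

# x86-64 register names indexed by the 3-bit ModRM fields (no REX.R/REX.B set).
REGS64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def parse_oops(text):
    """Extract the fault address, faulting symbol, RDI and faulting bytes."""
    fault = int(re.search(r"dereference at ([0-9a-f]+)", text).group(1), 16)
    symbol, offset = re.search(
        r"RIP: \S+\s+\[<[0-9a-f]+>\]\s+(\w+)\+0x([0-9a-f]+)/", text
    ).groups()
    rdi = int(re.search(r"RDI: ([0-9a-f]+)", text).group(1), 16)
    tokens = re.search(r"Code: (.*)", text).group(1).split()
    # The byte wrapped in <..> is the first byte of the faulting instruction.
    start = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    code = bytes(int(t.strip("<>"), 16) for t in tokens[start:])
    return {"fault": fault, "symbol": symbol, "offset": int(offset, 16),
            "rdi": rdi, "code": code}


def decode_mov_load(code):
    """Decode only `REX.W 8B /r` with an 8-bit displacement (mod == 01, no SIB)."""
    if code[0] != 0x48 or code[1] != 0x8B:
        raise ValueError("not a plain 64-bit MOV r64, r/m64")
    modrm = code[2]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 1 or rm == 4:
        raise ValueError("addressing form not handled by this sketch")
    disp = code[3] - 256 if code[3] > 127 else code[3]  # sign-extend disp8
    return REGS64[reg], REGS64[rm], disp


info = parse_oops(OOPS)
dest, base, disp = decode_mov_load(info["code"])
print(f"{info['symbol']}+{info['offset']:#x}: mov {dest}, [{base}{disp:+#x}]")
print(f"fault address {info['fault']:#x}, RDI + disp = {info['rdi'] + disp:#x}")
```

Under these assumptions the faulting load reads from RDI plus 0x28 while RDI is zero, which matches the CR2 value of 0x28. On x86-64 the first function argument is passed in RDI, so this suggests that nfs_advise_use_readdirplus was called with a NULL pointer; the sketch does not show which caller supplied it.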

A similar issue appears at https://testing.whamcloud.com/test_sets/56a97530-beb1-11e8-b143-52540065bddc, with the following on the client 1 (vm5) console:

[ 4879.403401] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 21:00:54 (1537650054)
[ 4879.583288] Lustre: DEBUG MARKER: MDSCOUNT=1 OSTCOUNT=7 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
[ 4951.358555] 13[22550]: segfault at 8 ip 00007f1a4c2b6958 sp 00007fff9954d570 error 4 in ld-2.17.so[7f1a4c2ab000+22000]
[ 4964.483018] 1[23397]: segfault at 8 ip 00007f9b1355a958 sp 00007ffdc00ef9f0 error 4 in ld-2.17.so[7f9b1354f000+22000]
[ 5088.407853] 5[30393]: segfault at 8 ip 00007f310575e958 sp 00007ffdbc27b700 error 4 in ld-2.17.so[7f3105753000+22000]
[ 5103.960350] ------------[ cut here ]------------
[ 5103.960967] kernel BUG at fs/dcache.c:661!
[ 5103.961381] invalid opcode: 0000 [#1] SMP 
[ 5103.961939] Modules linked in: nfsv3 nfs_acl lustre(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc ppdev iosf_mbi i2c_piix4 i2c_core crc32_pclmul pcspkr joydev ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd parport_pc virtio_balloon parport ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk ata_piix libata crct10dif_pclmul crct10dif_common 8139too crc32c_intel serio_raw virtio_pci 8139cp virtio_ring virtio mii floppy
[ 5103.970783] CPU: 1 PID: 2704 Comm: cp Kdump: loaded Tainted: G           OE  ------------   3.10.0-862.9.1.el7.x86_64 #1
[ 5103.971789] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 5103.972336] task: ffff8c90f9fd8000 ti: ffff8c90b5e9c000 task.ti: ffff8c90b5e9c000
[ 5103.973032] RIP: 0010:[<ffffffffb5033bd2>]  [<ffffffffb5033bd2>] dget_parent+0x72/0x80
[ 5103.973808] RSP: 0018:ffff8c90b5e9fde0  EFLAGS: 00010246
[ 5103.974305] RAX: 0000000000000000 RBX: ffff8c90f859acc0 RCX: 0000000000000000
[ 5103.974968] RDX: 0000000000000000 RSI: 0000000100000000 RDI: ffff8c90f859ad18
[ 5103.975633] RBP: ffff8c90b5e9fdf8 R08: 0000000000000000 R09: 90e45f4528000000
[ 5103.976302] R10: 00007ffeb1207460 R11: 0000000000000246 R12: ffff8c90f873e240
[ 5103.976973] R13: ffff8c90f859ad18 R14: ffff8c90fbdfb9a0 R15: ffff8c90b5e9fec0
[ 5103.977640] FS:  00007fb39631c840(0000) GS:ffff8c90ffd00000(0000) knlGS:0000000000000000
[ 5103.978391] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5103.978926] CR2: 0000000000406538 CR3: 000000007a4b2000 CR4: 00000000000606e0
[ 5103.979600] Call Trace:
[ 5103.979885]  [<ffffffffc07bbcad>] nfs_getattr+0xed/0x250 [nfs]
[ 5103.980453]  [<ffffffffb5020e09>] vfs_getattr+0x49/0x80
[ 5103.980943]  [<ffffffffb5020e85>] vfs_fstat+0x45/0x80
[ 5103.981423]  [<ffffffffb50215a4>] SYSC_newfstat+0x24/0x60
[ 5103.981927]  [<ffffffffb502ccdd>] ? putname+0x3d/0x60
[ 5103.982528]  [<ffffffffb55206e1>] ? system_call_after_swapgs+0xae/0x146
[ 5103.983146]  [<ffffffffb55206d5>] ? system_call_after_swapgs+0xa2/0x146
[ 5103.983775]  [<ffffffffb55206e1>] ? system_call_after_swapgs+0xae/0x146
[ 5103.984418]  [<ffffffffb55206d5>] ? system_call_after_swapgs+0xa2/0x146
[ 5103.985048]  [<ffffffffb55206e1>] ? system_call_after_swapgs+0xae/0x146
[ 5103.985668]  [<ffffffffb55206d5>] ? system_call_after_swapgs+0xa2/0x146
[ 5103.986277]  [<ffffffffb55206e1>] ? system_call_after_swapgs+0xae/0x146
[ 5103.986895]  [<ffffffffb55206d5>] ? system_call_after_swapgs+0xa2/0x146
[ 5103.987518]  [<ffffffffb55206e1>] ? system_call_after_swapgs+0xae/0x146
[ 5103.988137]  [<ffffffffb55206d5>] ? system_call_after_swapgs+0xa2/0x146
[ 5103.988759]  [<ffffffffb55206e1>] ? system_call_after_swapgs+0xae/0x146
[ 5103.989382]  [<ffffffffb502179e>] SyS_newfstat+0xe/0x10
[ 5103.989880]  [<ffffffffb5520795>] system_call_fastpath+0x1c/0x21
[ 5103.990451]  [<ffffffffb55206e1>] ? system_call_after_swapgs+0xae/0x146
[ 5103.991062] Code: 4c 89 ef e8 71 2c 4e 00 49 3b 5c 24 18 75 1e 8b 53 5c 85 d2 74 15 83 c2 01 4c 89 ef 89 53 5c ff 14 25 d0 07 a3 b5 48 89 d8 eb bd <0f> 0b 4c 89 ef ff 14 25 d0 07 a3 b5 eb be 66 66 66 66 90 55 48 
[ 5103.994142] RIP  [<ffffffffb5033bd2>] dget_parent+0x72/0x80
[ 5103.994694]  RSP <ffff8c90b5e9fde0>
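
To compare the two call traces above, the following is a rough sketch that pulls the function names out of each trace and reports which frames they share. The helper name `frames` is hypothetical, and only a subset of the call-trace lines is copied in (the repeated `system_call_after_swapgs` entries are left out for brevity). Frames the kernel prints with a leading `?` are treated as unreliable and can be included or excluded.

```python
import re

# Subset of the vm10 call trace, copied from the first log above.
TRACE_VM10 = """\
[47450.911352]  [<ffffffffc0743d19>] ? nfs_getattr+0xf9/0x250 [nfs]
[47450.911987]  [<ffffffffa2246aa9>] vfs_getattr+0x49/0x80
[47450.912539]  [<ffffffffa2246b25>] vfs_fstat+0x45/0x80
[47450.913081]  [<ffffffffa2247094>] SYSC_newfstat+0x24/0x60
[47450.920798]  [<ffffffffa224746e>] SyS_newfstat+0xe/0x10
[47450.921293]  [<ffffffffa2774ddb>] system_call_fastpath+0x22/0x27
"""

# Subset of the vm5 call trace, copied from the second log above.
TRACE_VM5 = """\
[ 5103.979885]  [<ffffffffc07bbcad>] nfs_getattr+0xed/0x250 [nfs]
[ 5103.980453]  [<ffffffffb5020e09>] vfs_getattr+0x49/0x80
[ 5103.980943]  [<ffffffffb5020e85>] vfs_fstat+0x45/0x80
[ 5103.981423]  [<ffffffffb50215a4>] SYSC_newfstat+0x24/0x60
[ 5103.981927]  [<ffffffffb502ccdd>] ? putname+0x3d/0x60
[ 5103.989382]  [<ffffffffb502179e>] SyS_newfstat+0xe/0x10
[ 5103.989880]  [<ffffffffb5520795>] system_call_fastpath+0x1c/0x21
"""

# Matches "[<address>] [? ]symbol+0xoff/0xsize"; group 1 is the "?" marker.
FRAME = re.compile(r"\[<[0-9a-f]+>\]\s+(\?\s+)?(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def frames(log, include_unreliable=False):
    """Return function names in trace order, optionally keeping '?' frames."""
    names = []
    for line in log.splitlines():
        match = FRAME.search(line)
        if match and (include_unreliable or not match.group(1)):
            names.append(match.group(2))
    return names


reliable_common = [f for f in frames(TRACE_VM10) if f in frames(TRACE_VM5)]
all_common = [f for f in frames(TRACE_VM10, True) if f in frames(TRACE_VM5, True)]
print("shared reliable frames:", reliable_common)
print("shared frames including '?':", all_common)
```

Both traces enter through the fstat system call path, and nfs_getattr appears in both, although on vm10 it is only listed as a `?` frame. This is consistent with the two crashes being reached from the same NFS getattr code path, but the comparison alone does not establish a common root cause.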

A similar crash is recorded at
https://testing.whamcloud.com/test_sets/501f81da-dc27-11e8-b46b-52540065bddc


 Comments   
Comment by James Nunez (Inactive) [ 09/Apr/19 ]

We are also seeing this kernel crash in non-interop testing: https://testing.whamcloud.com/test_sets/52455738-5af2-11e9-a256-52540065bddc
