Details
Bug
Resolution: Low Priority
Minor
None
SLES15 SP2 clients
3
9223372036854775807
Description
parallel-scale-nfsv3 test_racer_on_nfs hangs with a ‘general protection fault’ on the client. We’ve seen this issue only once, at https://testing.whamcloud.com/test_sets/1ca79db5-dcb8-457d-8d82-540881b78cb7.
In the suite_log, the last output written for this test before the hang is:
== parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 23:48:28 (1633304908)
CMD: trevis-219vm16,trevis-219vm17 MDSCOUNT=1 OSTCOUNT=7 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
There is no output from test racer_on_nfs in the client consoles, and not much in the MDS console:
[28519.704689] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 23:48:28 (1633304908)
[28599.757782] LustreError: 1448:0:(llite_nfs.c:343:ll_dir_get_parent_fid()) lustre: failure inode [0x200026562:0x398c:0x0] get parent: rc = -2
[28653.246054] reconnect_path: npd != pd
[28724.302719] LustreError: 1451:0:(llite_nfs.c:343:ll_dir_get_parent_fid()) lustre: failure inode [0x200026562:0x4a5a:0x0] get parent: rc = -2
[28765.943370] reconnect_path: npd != pd
In the client2 (vm17) dmesg, we see:
[Sun Oct 3 23:48:29 2021] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 23:48:28 (1633304908)
[30115.270757] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 01:48:00 (1657417680)
[30115.474474] Lustre: DEBUG MARKER: MDSCOUNT=4 OSTCOUNT=8 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
[30224.905303] BUG: kernel NULL pointer dereference, address: 0000000000000028
[30224.909544] #PF: supervisor read access in kernel mode
[30224.910526] #PF: error_code(0x0000) - not-present page
[30224.912000] Oops: 0000 [#1] SMP PTI
[30224.912670] CPU: 0 PID: 11734 Comm: dd 5.3.18-59.37-default #1 SLE15-SP3
[30224.914634] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[30224.915728] RIP: 0010:__nfs3_proc_setacls+0x28/0x370 [nfsv3]
[30224.931896] Call Trace:
[30224.935347] nfs3_proc_setacls+0xa/0x20 [nfsv3]
[30224.936212] nfs3_proc_create+0x1bd/0x2b0 [nfsv3]
[30224.937111] nfs_create+0x82/0x180 [nfs]
[30224.938945] path_openat+0x1212/0x1520
[30224.940487] do_filp_open+0x9b/0x110
[30224.942700] do_sys_open+0x1bd/0x260
In the client2 journal, we see:
Oct 03 23:48:37 trevis-219vm17 systemd[1]: Removed slice User Slice of UID 532.
Oct 03 23:52:57 trevis-219vm17 kernel: traps: 9[22660] general protection fault ip:7fa32dbdb3cd sp:7ffe2a089838 error:0 in ld-2.26.so[7fa32dbd0000+25000]
Oct 03 23:52:57 trevis-219vm17 kernel: traps: 1[22693] general protection fault ip:7f842a0a93cd sp:7ffd3a65a368 error:0 in ld-2.26.so[7f842a09e000+25000]
Oct 03 23:52:57 trevis-219vm17 systemd[1]: Started Process Core Dump (PID 22942/UID 0).
Oct 03 23:52:57 trevis-219vm17 systemd[1]: Started Process Core Dump (PID 22939/UID 0).
Oct 03 23:52:58 trevis-219vm17 systemd-coredump[22964]: Process 22693 (1) of user 0 dumped core.
    Stack trace of thread 22693:
    #0 0x00007f842a0a93cd _dl_setup_hash (/lib64/ld-2.26.so)
    #1 0x00007f842a0a0ddb dl_main (/lib64/ld-2.26.so)
    #2 0x00007f842a0b7010 _dl_sysdep_start (/lib64/ld-2.26.so)
    #3 0x00007f842a09fdb8 _dl_start (/lib64/ld-2.26.so)
    #4 0x00007f842a09eea8 _start (/lib64/ld-2.26.so)
Oct 03 23:52:58 trevis-219vm17 systemd-coredump[22967]: Process 22660 (9) of user 0 dumped core.
    Stack trace of thread 22660:
    #0 0x00007fa32dbdb3cd _dl_setup_hash (/lib64/ld-2.26.so)
    #1 0x00007fa32dbd2ddb dl_main (/lib64/ld-2.26.so)
    #2 0x00007fa32dbe9010 _dl_sysdep_start (/lib64/ld-2.26.so)
    #3 0x00007fa32dbd1db8 _dl_start (/lib64/ld-2.26.so)
    #4 0x00007fa32dbd0ea8 _start (/lib64/ld-2.26.so)
Oct 03 23:53:34 trevis-219vm17 mrshd[11196]: pam_unix(mrsh:session): session closed for user root
Oct 03 23:53:34 trevis-219vm17 systemd-logind[1522]: Session c19961 logged out. Waiting for processes to exit.
Oct 03 23:53:34 trevis-219vm17 systemd-logind[1522]: Removed session c19961.
Oct 03 23:53:45 trevis-219vm17 systemd[1]: user-runtime-dir@0.service: Unit not needed anymore. Stopping.
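For triage, it may help to check whether the two general protection faults hit the same instruction in ld-2.26.so. The sketch below is only an illustration, not part of the test framework: the regex and the helper name fault_offsets are hypothetical, and it assumes the `[base+size]` pair in the kernel "traps:" line is the start address and length of the mapping that contains the faulting ip. It subtracts the base from the ip for each trap line copied from the journal above.

```python
import re

# Matches the kernel "traps:" lines quoted in the client2 journal.
# The pattern is an assumption based on those two lines only.
TRAP_RE = re.compile(
    r"traps: (?P<comm>[^\[\s]+)\[(?P<pid>\d+)\] general protection fault "
    r"ip:(?P<ip>[0-9a-f]+) sp:[0-9a-f]+ error:\d+ "
    r"in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def fault_offsets(log_text):
    """Return (pid, object, offset) for each trap line, offset = ip - base."""
    results = []
    for m in TRAP_RE.finditer(log_text):
        offset = int(m["ip"], 16) - int(m["base"], 16)
        results.append((int(m["pid"]), m["obj"], offset))
    return results


# The two trap lines from the journal excerpt in this ticket.
JOURNAL = """\
Oct 03 23:52:57 trevis-219vm17 kernel: traps: 9[22660] general protection fault ip:7fa32dbdb3cd sp:7ffe2a089838 error:0 in ld-2.26.so[7fa32dbd0000+25000]
Oct 03 23:52:57 trevis-219vm17 kernel: traps: 1[22693] general protection fault ip:7f842a0a93cd sp:7ffd3a65a368 error:0 in ld-2.26.so[7f842a09e000+25000]
"""

for pid, obj, offset in fault_offsets(JOURNAL):
    print(f"pid {pid}: {obj}+{offset:#x}")
```

Both lines resolve to ld-2.26.so+0xb3cd, which is consistent with the two core dumps reporting the same top frame, _dl_setup_hash.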