[LU-16163] parallel-scale-nfsv3 test racer_on_nfs hangs with ‘general protection fault’ in nfs3_proc_setacls()
Created: 05/Oct/21 Updated: 13/Nov/23 Resolved: 27/Apr/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0, Lustre 2.15.4 |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | Alex Deiter |
| Resolution: | Low Priority | Votes: | 0 |
| Labels: | always_except | ||
| Environment: |
SLES15 SP2 clients |
||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
parallel-scale-nfsv3 test_racer_on_nfs hangs with ‘general protection fault’ on the client. We’ve only seen this issue once at https://testing.whamcloud.com/test_sets/1ca79db5-dcb8-457d-8d82-540881b78cb7.

Looking at the suite_log, the last information written before the hang for this test is:

== parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 23:48:28 (1633304908)
CMD: trevis-219vm16,trevis-219vm17 MDSCOUNT=1 OSTCOUNT=7 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs

There is no output from test racer_on_nfs in the client consoles and not much in the MDS console:

[28519.704689] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 23:48:28 (1633304908)
[28599.757782] LustreError: 1448:0:(llite_nfs.c:343:ll_dir_get_parent_fid()) lustre: failure inode [0x200026562:0x398c:0x0] get parent: rc = -2
[28653.246054] reconnect_path: npd != pd
[28724.302719] LustreError: 1451:0:(llite_nfs.c:343:ll_dir_get_parent_fid()) lustre: failure inode [0x200026562:0x4a5a:0x0] get parent: rc = -2
[28765.943370] reconnect_path: npd != pd

In the client2 (vm17) dmesg, we see:

[Sun Oct 3 23:48:29 2021] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 23:48:28 (1633304908)
[30115.270757] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 01:48:00 (1657417680)
[30115.474474] Lustre: DEBUG MARKER: MDSCOUNT=4 OSTCOUNT=8 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
[30224.905303] BUG: kernel NULL pointer dereference, address: 0000000000000028
[30224.909544] #PF: supervisor read access in kernel mode
[30224.910526] #PF: error_code(0x0000) - not-present page
[30224.912000] Oops: 0000 [#1] SMP PTI
[30224.912670] CPU: 0 PID: 11734 Comm: dd 5.3.18-59.37-default #1 SLE15-SP3
[30224.914634] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[30224.915728] RIP: 0010:__nfs3_proc_setacls+0x28/0x370 [nfsv3]
[30224.931896] Call Trace:
[30224.935347] nfs3_proc_setacls+0xa/0x20 [nfsv3]
[30224.936212] nfs3_proc_create+0x1bd/0x2b0 [nfsv3]
[30224.937111] nfs_create+0x82/0x180 [nfs]
[30224.938945] path_openat+0x1212/0x1520
[30224.940487] do_filp_open+0x9b/0x110
[30224.942700] do_sys_open+0x1bd/0x260

In the client2 journal, we see:

Oct 03 23:48:37 trevis-219vm17 systemd[1]: Removed slice User Slice of UID 532.
Oct 03 23:52:57 trevis-219vm17 kernel: traps: 9[22660] general protection fault ip:7fa32dbdb3cd sp:7ffe2a089838 error:0 in ld-2.26.so[7fa32dbd0000+25000]
Oct 03 23:52:57 trevis-219vm17 kernel: traps: 1[22693] general protection fault ip:7f842a0a93cd sp:7ffd3a65a368 error:0 in ld-2.26.so[7f842a09e000+25000]
Oct 03 23:52:57 trevis-219vm17 systemd[1]: Started Process Core Dump (PID 22942/UID 0).
Oct 03 23:52:57 trevis-219vm17 systemd[1]: Started Process Core Dump (PID 22939/UID 0).
Oct 03 23:52:58 trevis-219vm17 systemd-coredump[22964]: Process 22693 (1) of user 0 dumped core.
Stack trace of thread 22693:
#0 0x00007f842a0a93cd _dl_setup_hash (/lib64/ld-2.26.so)
#1 0x00007f842a0a0ddb dl_main (/lib64/ld-2.26.so)
#2 0x00007f842a0b7010 _dl_sysdep_start (/lib64/ld-2.26.so)
#3 0x00007f842a09fdb8 _dl_start (/lib64/ld-2.26.so)
#4 0x00007f842a09eea8 _start (/lib64/ld-2.26.so)
Oct 03 23:52:58 trevis-219vm17 systemd-coredump[22967]: Process 22660 (9) of user 0 dumped core.
Stack trace of thread 22660:
#0 0x00007fa32dbdb3cd _dl_setup_hash (/lib64/ld-2.26.so)
#1 0x00007fa32dbd2ddb dl_main (/lib64/ld-2.26.so)
#2 0x00007fa32dbe9010 _dl_sysdep_start (/lib64/ld-2.26.so)
#3 0x00007fa32dbd1db8 _dl_start (/lib64/ld-2.26.so)
#4 0x00007fa32dbd0ea8 _start (/lib64/ld-2.26.so)
Oct 03 23:53:34 trevis-219vm17 mrshd[11196]: pam_unix(mrsh:session): session closed for user root
Oct 03 23:53:34 trevis-219vm17 systemd-logind[1522]: Session c19961 logged out. Waiting for processes to exit.
Oct 03 23:53:34 trevis-219vm17 systemd-logind[1522]: Removed session c19961.
Oct 03 23:53:45 trevis-219vm17 systemd[1]: user-runtime-dir@0.service: Unit not needed anymore. Stopping.
|
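The DEBUG MARKER lines in the description carry a Unix epoch in parentheses, which makes it possible to check which run each excerpt belongs to. The following is a minimal sketch, not part of the Lustre test framework: the helper name `marker_times` and the regular expression are assumptions made for illustration, and the sample lines are shortened copies of the markers quoted above. Decoded as UTC, the first marker falls on 2021-10-03 and the second on 2022-07-10, which suggests the client2 dmesg excerpt combines output from more than one run.

```python
import re
from datetime import datetime, timezone

# Matches the "HH:MM:SS (epoch)" suffix of a DEBUG MARKER line.
# The pattern is an assumption based on the lines quoted in this ticket.
MARKER_RE = re.compile(r"(\d{2}:\d{2}:\d{2}) \((\d{10})\)")


def marker_times(log_text):
    """Return (wall_clock, utc_datetime) pairs for each marker found."""
    return [
        (clock, datetime.fromtimestamp(int(epoch), tz=timezone.utc))
        for clock, epoch in MARKER_RE.findall(log_text)
    ]


# Shortened copies of the two markers from the client2 dmesg excerpt.
log = (
    "Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: "
    "racer on NFS client ==== 23:48:28 (1633304908)\n"
    "Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: "
    "racer on NFS client ==== 01:48:00 (1657417680)\n"
)

for clock, when in marker_times(log):
    print(clock, when.isoformat())
# 23:48:28 2021-10-03T23:48:28+00:00
# 01:48:00 2022-07-10T01:48:00+00:00
```

The wall-clock field agrees with the decoded epoch in both cases, so the markers appear to be logged in UTC.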
| Comments |
| Comment by Andreas Dilger [ 06/Oct/21 ] |
|
It looks like client1 crashed in NFS:

[28679.492102] BUG: kernel NULL pointer dereference, address: 0000000000000028
[28679.496074] CPU: 0 PID: 32107 Comm: file_concat.sh 5.3.18-24.78-default #1 SLE15-SP2
[28679.498997] RIP: 0010:__nfs3_proc_setacls+0x28/0x370 [nfsv3]
[28679.514160] Call Trace:
[28679.516928] nfs3_proc_setacls+0xa/0x20 [nfsv3]
[28679.517615] nfs3_proc_create+0x1be/0x2a0 [nfsv3]
[28679.518338] nfs_create+0x83/0x180 [nfs]
[28679.519577] path_openat+0x1212/0x1520
[28679.520161] do_filp_open+0x9b/0x110
[28679.521899] do_sys_open+0x1bd/0x260 |
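The client1 trace here and the client2 trace in the description come from different kernel builds (5.3.18-24.78 SLE15-SP2 and 5.3.18-59.37 SLE15-SP3), so a few instruction offsets differ. A rough way to confirm that both oopses follow the same call path is to compare the frames with the offsets stripped. This is only a sketch: `frames` and `FRAME_RE` are hypothetical helpers written for this comparison, and the two strings are copied from the traces quoted in this ticket.

```python
import re

# Matches frames such as "nfs3_proc_create+0x1bd/0x2b0 [nfsv3]".
# The module name in brackets is optional (core kernel symbols have none).
FRAME_RE = re.compile(
    r"\b([A-Za-z_]\w*)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?"
)


def frames(oops_text):
    """Return (symbol, module) pairs in order, ignoring offsets and sizes."""
    return [(sym, mod or "kernel") for sym, mod in FRAME_RE.findall(oops_text)]


client1 = """
[28679.498997] RIP: 0010:__nfs3_proc_setacls+0x28/0x370 [nfsv3]
[28679.516928] nfs3_proc_setacls+0xa/0x20 [nfsv3]
[28679.517615] nfs3_proc_create+0x1be/0x2a0 [nfsv3]
[28679.518338] nfs_create+0x83/0x180 [nfs]
[28679.519577] path_openat+0x1212/0x1520
[28679.520161] do_filp_open+0x9b/0x110
[28679.521899] do_sys_open+0x1bd/0x260
"""

client2 = """
[30224.915728] RIP: 0010:__nfs3_proc_setacls+0x28/0x370 [nfsv3]
[30224.935347] nfs3_proc_setacls+0xa/0x20 [nfsv3]
[30224.936212] nfs3_proc_create+0x1bd/0x2b0 [nfsv3]
[30224.937111] nfs_create+0x82/0x180 [nfs]
[30224.938945] path_openat+0x1212/0x1520
[30224.940487] do_filp_open+0x9b/0x110
[30224.942700] do_sys_open+0x1bd/0x260
"""

same_path = frames(client1) == frames(client2)
print("same call path:", same_path)
# same call path: True
```

Both traces fault at `__nfs3_proc_setacls+0x28` while creating a file through `nfs3_proc_create`, regardless of which racer process (`file_concat.sh` or `dd`) triggered the open.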
| Comment by Gerrit Updater [ 22/Mar/23 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50385 |
| Comment by Gerrit Updater [ 04/Apr/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50385/ |
| Comment by Gerrit Updater [ 07/Apr/23 ] |
|
"Alex Deiter <alex.deiter@gmail.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50579 |
| Comment by Gerrit Updater [ 18/Apr/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50579/ |
| Comment by Peter Jones [ 18/Apr/23 ] |
|
Landed for 2.16 |
| Comment by Andreas Dilger [ 27/Apr/23 ] |
|
Reopen temporarily to change resolution. |
| Comment by Andreas Dilger [ 27/Apr/23 ] |
|
Problem is not actually fixed, but we've stopped testing racer_on_nfs for NFSv3. |
| Comment by Gerrit Updater [ 12/Jun/23 ] |
|
"Alex Deiter <alex.deiter@gmail.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51282 |
| Comment by Gerrit Updater [ 02/Aug/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51282/ |