[LU-16163] parallel-scale-nfsv3 test racer_on_nfs hangs with ‘general protection fault’ in nfs3_proc_setacls()
Created: 05/Oct/21  Updated: 13/Nov/23  Resolved: 27/Apr/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0, Lustre 2.15.4

Type: Bug
Priority: Minor
Reporter: James Nunez (Inactive)
Assignee: Alex Deiter
Resolution: Low Priority
Votes: 0
Labels: always_except
Environment: SLES15 SP2 clients


Severity: 3

Description

parallel-scale-nfsv3 test_racer_on_nfs hangs with ‘general protection fault’ on the client. We’ve only seen this issue once at https://testing.whamcloud.com/test_sets/1ca79db5-dcb8-457d-8d82-540881b78cb7.

In the suite_log, the last output written for this test before the hang is:

== parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 23:48:28 (1633304908)
CMD: trevis-219vm16,trevis-219vm17 MDSCOUNT=1 OSTCOUNT=7 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs

There is no output from test racer_on_nfs in the client consoles, and very little in the MDS console:

[28519.704689] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 23:48:28 (1633304908)
[28599.757782] LustreError: 1448:0:(llite_nfs.c:343:ll_dir_get_parent_fid()) lustre: failure inode [0x200026562:0x398c:0x0] get parent: rc = -2
[28653.246054] reconnect_path: npd != pd
[28724.302719] LustreError: 1451:0:(llite_nfs.c:343:ll_dir_get_parent_fid()) lustre: failure inode [0x200026562:0x4a5a:0x0] get parent: rc = -2
[28765.943370] reconnect_path: npd != pd

In the client2 (vm17) dmesg, we see:

[Sun Oct  3 23:48:29 2021] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 23:48:28 (1633304908)
[30115.270757] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 01:48:00 (1657417680)
[30115.474474] Lustre: DEBUG MARKER: MDSCOUNT=4 OSTCOUNT=8 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
[30224.905303] BUG: kernel NULL pointer dereference, address: 0000000000000028
[30224.909544] #PF: supervisor read access in kernel mode
[30224.910526] #PF: error_code(0x0000) - not-present page
[30224.912000] Oops: 0000 [#1] SMP PTI
[30224.912670] CPU: 0 PID: 11734 Comm: dd  5.3.18-59.37-default #1 SLE15-SP3
[30224.914634] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[30224.915728] RIP: 0010:__nfs3_proc_setacls+0x28/0x370 [nfsv3]
[30224.931896] Call Trace:
[30224.935347]  nfs3_proc_setacls+0xa/0x20 [nfsv3]
[30224.936212]  nfs3_proc_create+0x1bd/0x2b0 [nfsv3]
[30224.937111]  nfs_create+0x82/0x180 [nfs]
[30224.938945]  path_openat+0x1212/0x1520
[30224.940487]  do_filp_open+0x9b/0x110
[30224.942700]  do_sys_open+0x1bd/0x260

In the client2 journal, we see:

Oct 03 23:48:37 trevis-219vm17 systemd[1]: Removed slice User Slice of UID 532.
Oct 03 23:52:57 trevis-219vm17 kernel: traps: 9[22660] general protection fault ip:7fa32dbdb3cd sp:7ffe2a089838 error:0 in ld-2.26.so[7fa32dbd0000+25000]
Oct 03 23:52:57 trevis-219vm17 kernel: traps: 1[22693] general protection fault ip:7f842a0a93cd sp:7ffd3a65a368 error:0 in ld-2.26.so[7f842a09e000+25000]
Oct 03 23:52:57 trevis-219vm17 systemd[1]: Started Process Core Dump (PID 22942/UID 0).
Oct 03 23:52:57 trevis-219vm17 systemd[1]: Started Process Core Dump (PID 22939/UID 0).
Oct 03 23:52:58 trevis-219vm17 systemd-coredump[22964]: Process 22693 (1) of user 0 dumped core.
                                                        
                                                        Stack trace of thread 22693:
                                                        #0  0x00007f842a0a93cd _dl_setup_hash (/lib64/ld-2.26.so)
                                                        #1  0x00007f842a0a0ddb dl_main (/lib64/ld-2.26.so)
                                                        #2  0x00007f842a0b7010 _dl_sysdep_start (/lib64/ld-2.26.so)
                                                        #3  0x00007f842a09fdb8 _dl_start (/lib64/ld-2.26.so)
                                                        #4  0x00007f842a09eea8 _start (/lib64/ld-2.26.so)
Oct 03 23:52:58 trevis-219vm17 systemd-coredump[22967]: Process 22660 (9) of user 0 dumped core.
                                                        
                                                        Stack trace of thread 22660:
                                                        #0  0x00007fa32dbdb3cd _dl_setup_hash (/lib64/ld-2.26.so)
                                                        #1  0x00007fa32dbd2ddb dl_main (/lib64/ld-2.26.so)
                                                        #2  0x00007fa32dbe9010 _dl_sysdep_start (/lib64/ld-2.26.so)
                                                        #3  0x00007fa32dbd1db8 _dl_start (/lib64/ld-2.26.so)
                                                        #4  0x00007fa32dbd0ea8 _start (/lib64/ld-2.26.so)
Oct 03 23:53:34 trevis-219vm17 mrshd[11196]: pam_unix(mrsh:session): session closed for user root
Oct 03 23:53:34 trevis-219vm17 systemd-logind[1522]: Session c19961 logged out. Waiting for processes to exit.
Oct 03 23:53:34 trevis-219vm17 systemd-logind[1522]: Removed session c19961.
Oct 03 23:53:45 trevis-219vm17 systemd[1]: user-runtime-dir@0.service: Unit not needed anymore. Stopping.


Comments
Comment by Andreas Dilger [ 06/Oct/21 ]

It looks like client1 crashed in NFS:

[28679.492102] BUG: kernel NULL pointer dereference, address: 0000000000000028
[28679.496074] CPU: 0 PID: 32107 Comm: file_concat.sh  5.3.18-24.78-default #1 SLE15-SP2
[28679.498997] RIP: 0010:__nfs3_proc_setacls+0x28/0x370 [nfsv3]
[28679.514160] Call Trace:
[28679.516928]  nfs3_proc_setacls+0xa/0x20 [nfsv3]
[28679.517615]  nfs3_proc_create+0x1be/0x2a0 [nfsv3]
[28679.518338]  nfs_create+0x83/0x180 [nfs]
[28679.519577]  path_openat+0x1212/0x1520
[28679.520161]  do_filp_open+0x9b/0x110
[28679.521899]  do_sys_open+0x1bd/0x260
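
The client1 trace above and the client2 trace in the description report the same RIP line. As a minimal sketch of how that comparison could be scripted against saved console logs, the following extracts the faulting symbol from an oops. The helper name oops_rip_symbol is hypothetical, and the two sample lines are copied from the traces in this ticket rather than read from real log files.

```shell
#!/bin/bash
# Hypothetical helper: print the faulting symbol (no offset, no module name)
# from the first "RIP:" line found in kernel oops text on stdin.
oops_rip_symbol() {
    sed -n 's/^.*RIP: [0-9a-fA-F]*:\([A-Za-z0-9_.]*\)+0x.*$/\1/p' | head -n 1
}

# Sample RIP lines copied from the client1 and client2 traces in this ticket.
client1='[28679.498997] RIP: 0010:__nfs3_proc_setacls+0x28/0x370 [nfsv3]'
client2='[30224.915728] RIP: 0010:__nfs3_proc_setacls+0x28/0x370 [nfsv3]'

sym1=$(printf '%s\n' "$client1" | oops_rip_symbol)
sym2=$(printf '%s\n' "$client2" | oops_rip_symbol)

# Report whether both clients faulted in the same function.
if [ "$sym1" = "$sym2" ]; then
    echo "same faulting symbol: $sym1"
else
    echo "different faulting symbols: $sym1 vs $sym2"
fi
```

With real logs, the sample strings would be replaced by the contents of each client's console log piped into the helper.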
Comment by Gerrit Updater [ 22/Mar/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50385
Subject: LU-16163 tests: skip racer_on_nfs for NFSv3
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: df15c8baa63d747eeb23da451f7cc50f5db98da7

Comment by Gerrit Updater [ 04/Apr/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50385/
Subject: LU-16163 tests: skip racer_on_nfs for NFSv3
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 513eb670b01f15104cbeb2909a141d2174dcc874

Comment by Gerrit Updater [ 07/Apr/23 ]

"Alex Deiter <alex.deiter@gmail.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50579
Subject: LU-16163 tests: skip racer_on_nfs for NFSv3
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2b92c4189827090a80daefe37f752fdbabd1b939

Comment by Gerrit Updater [ 18/Apr/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50579/
Subject: LU-16163 tests: skip racer_on_nfs for NFSv3
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 892d726f274c7cd4e505689ad69194ac68dc323b

Comment by Peter Jones [ 18/Apr/23 ]

Landed for 2.16

Comment by Andreas Dilger [ 27/Apr/23 ]

Reopen temporarily to change resolution.

Comment by Andreas Dilger [ 27/Apr/23 ]

Problem is not actually fixed, but we've stopped testing racer_on_nfs for NFSv3.
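
For readers unfamiliar with how a test is excluded, the following is a minimal sketch of a version-conditional skip. It is not the landed patch: the variable names NFSVERSION and ALWAYS_EXCEPT are assumed here to follow the test framework's conventions, and the is_excluded helper is hypothetical.

```shell
#!/bin/bash
# Illustrative sketch only; not copied from the patches referenced in this ticket.
# NFSVERSION and ALWAYS_EXCEPT are assumed names; both default if unset.
NFSVERSION=${NFSVERSION:-3}
ALWAYS_EXCEPT=${ALWAYS_EXCEPT:-""}

# Add racer_on_nfs to the exclusion list only when running against NFSv3.
if [ "$NFSVERSION" = "3" ]; then
    ALWAYS_EXCEPT="$ALWAYS_EXCEPT racer_on_nfs"
fi

# Hypothetical helper: return 0 if the named test is in the exclusion list.
is_excluded() {
    local name=$1 t
    for t in $ALWAYS_EXCEPT; do
        [ "$t" = "$name" ] && return 0
    done
    return 1
}

if is_excluded racer_on_nfs; then
    echo "SKIP: racer_on_nfs (LU-16163)"
else
    echo "RUN: racer_on_nfs"
fi
```

A skip of this kind only hides the crash from the test results; the NULL pointer dereference in __nfs3_proc_setacls remains reachable on NFSv3 clients.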

Comment by Gerrit Updater [ 12/Jun/23 ]

"Alex Deiter <alex.deiter@gmail.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51282
Subject: LU-16163 tests: skip racer_on_nfs for NFSv3
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 6ba201ca53c6aba58c397ba57ad147b4bbc3caec

Comment by Gerrit Updater [ 02/Aug/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51282/
Subject: LU-16163 tests: skip racer_on_nfs for NFSv3
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 3626be5686cc395ce622d281a993603dba16e3e2
