
parallel-scale-nfsv3 test racer_on_nfs hangs with ‘general protection fault’ in nfs3_proc_setacls()

    Description

      parallel-scale-nfsv3 test racer_on_nfs hangs with a ‘general protection fault’ on the client. We have seen this issue only once, at https://testing.whamcloud.com/test_sets/1ca79db5-dcb8-457d-8d82-540881b78cb7.

      In the suite_log, the last information written for this test before the hang is:

      == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 23:48:28 (1633304908)
      CMD: trevis-219vm16,trevis-219vm17 MDSCOUNT=1 OSTCOUNT=7 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
      

      There is no output from test racer_on_nfs in the client consoles, and only a few messages in the MDS console:

      [28519.704689] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 23:48:28 (1633304908)
      [28599.757782] LustreError: 1448:0:(llite_nfs.c:343:ll_dir_get_parent_fid()) lustre: failure inode [0x200026562:0x398c:0x0] get parent: rc = -2
      [28653.246054] reconnect_path: npd != pd
      [28724.302719] LustreError: 1451:0:(llite_nfs.c:343:ll_dir_get_parent_fid()) lustre: failure inode [0x200026562:0x4a5a:0x0] get parent: rc = -2
      [28765.943370] reconnect_path: npd != pd
      

      In the client2 (vm17) dmesg, we see a NULL pointer dereference in __nfs3_proc_setacls():

      [Sun Oct  3 23:48:29 2021] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 23:48:28 (1633304908)
      [30115.270757] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 01:48:00 (1657417680)
      [30115.474474] Lustre: DEBUG MARKER: MDSCOUNT=4 OSTCOUNT=8 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
      [30224.905303] BUG: kernel NULL pointer dereference, address: 0000000000000028
      [30224.909544] #PF: supervisor read access in kernel mode
      [30224.910526] #PF: error_code(0x0000) - not-present page
      [30224.912000] Oops: 0000 [#1] SMP PTI
      [30224.912670] CPU: 0 PID: 11734 Comm: dd  5.3.18-59.37-default #1 SLE15-SP3
      [30224.914634] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [30224.915728] RIP: 0010:__nfs3_proc_setacls+0x28/0x370 [nfsv3]
      [30224.931896] Call Trace:
      [30224.935347]  nfs3_proc_setacls+0xa/0x20 [nfsv3]
      [30224.936212]  nfs3_proc_create+0x1bd/0x2b0 [nfsv3]
      [30224.937111]  nfs_create+0x82/0x180 [nfs]
      [30224.938945]  path_openat+0x1212/0x1520
      [30224.940487]  do_filp_open+0x9b/0x110
      [30224.942700]  do_sys_open+0x1bd/0x260
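
      When comparing this oops against other test sessions, it can help to reduce a dmesg excerpt to the faulting symbol and the call chain. The following is a minimal sketch, not part of the Lustre test framework: the function name summarize_oops, the regular expressions, and the SAMPLE text (copied from the excerpt above) are illustrative assumptions, and the patterns only cover the line shapes shown in this ticket.

```python
import re

# Matches a line such as:
#   [30224.915728] RIP: 0010:__nfs3_proc_setacls+0x28/0x370 [nfsv3]
RIP_RE = re.compile(
    r"RIP:\s+\w+:(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<mod>\w+)\])?"
)

# Matches a call-trace frame such as:
#   [30224.935347]  nfs3_proc_setacls+0xa/0x20 [nfsv3]
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(?:\?\s+)?(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<mod>\w+)\])?\s*$"
)


def summarize_oops(dmesg_text):
    """Return ((symbol, module), [(symbol, module), ...]) for the first oops.

    The module is None for symbols built into the kernel image.
    """
    rip = None
    frames = []
    in_trace = False
    for raw in dmesg_text.splitlines():
        line = raw.strip()
        m = RIP_RE.search(line)
        if m and rip is None:
            rip = (m.group("sym"), m.group("mod"))
            continue
        if "Call Trace:" in line:
            in_trace = True
            continue
        if in_trace:
            m = FRAME_RE.match(line)
            if m:
                frames.append((m.group("sym"), m.group("mod")))
            else:
                # First non-frame line ends the trace.
                in_trace = False
    return rip, frames


# Sample input: lines copied from the client2 dmesg excerpt in this ticket.
SAMPLE = """\
[30224.915728] RIP: 0010:__nfs3_proc_setacls+0x28/0x370 [nfsv3]
[30224.931896] Call Trace:
[30224.935347]  nfs3_proc_setacls+0xa/0x20 [nfsv3]
[30224.936212]  nfs3_proc_create+0x1bd/0x2b0 [nfsv3]
[30224.937111]  nfs_create+0x82/0x180 [nfs]
[30224.938945]  path_openat+0x1212/0x1520
[30224.940487]  do_filp_open+0x9b/0x110
[30224.942700]  do_sys_open+0x1bd/0x260
"""

if __name__ == "__main__":
    rip, frames = summarize_oops(SAMPLE)
    print("faulting symbol:", rip)
    for sym, mod in frames:
        print("  ", sym, f"[{mod}]" if mod else "")
```

      Run against the excerpt, the sketch reports the fault in __nfs3_proc_setacls (module nfsv3), reached through nfs3_proc_create and nfs_create from an open() issued by dd.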
      

      In the client2 journal, we see general protection faults in ld-2.26.so, followed by core dumps:

      Oct 03 23:48:37 trevis-219vm17 systemd[1]: Removed slice User Slice of UID 532.
      Oct 03 23:52:57 trevis-219vm17 kernel: traps: 9[22660] general protection fault ip:7fa32dbdb3cd sp:7ffe2a089838 error:0 in ld-2.26.so[7fa32dbd0000+25000]
      Oct 03 23:52:57 trevis-219vm17 kernel: traps: 1[22693] general protection fault ip:7f842a0a93cd sp:7ffd3a65a368 error:0 in ld-2.26.so[7f842a09e000+25000]
      Oct 03 23:52:57 trevis-219vm17 systemd[1]: Started Process Core Dump (PID 22942/UID 0).
      Oct 03 23:52:57 trevis-219vm17 systemd[1]: Started Process Core Dump (PID 22939/UID 0).
      Oct 03 23:52:58 trevis-219vm17 systemd-coredump[22964]: Process 22693 (1) of user 0 dumped core.
                                                              
                                                              Stack trace of thread 22693:
                                                              #0  0x00007f842a0a93cd _dl_setup_hash (/lib64/ld-2.26.so)
                                                              #1  0x00007f842a0a0ddb dl_main (/lib64/ld-2.26.so)
                                                              #2  0x00007f842a0b7010 _dl_sysdep_start (/lib64/ld-2.26.so)
                                                              #3  0x00007f842a09fdb8 _dl_start (/lib64/ld-2.26.so)
                                                              #4  0x00007f842a09eea8 _start (/lib64/ld-2.26.so)
      Oct 03 23:52:58 trevis-219vm17 systemd-coredump[22967]: Process 22660 (9) of user 0 dumped core.
                                                              
                                                              Stack trace of thread 22660:
                                                              #0  0x00007fa32dbdb3cd _dl_setup_hash (/lib64/ld-2.26.so)
                                                              #1  0x00007fa32dbd2ddb dl_main (/lib64/ld-2.26.so)
                                                              #2  0x00007fa32dbe9010 _dl_sysdep_start (/lib64/ld-2.26.so)
                                                              #3  0x00007fa32dbd1db8 _dl_start (/lib64/ld-2.26.so)
                                                              #4  0x00007fa32dbd0ea8 _start (/lib64/ld-2.26.so)
      Oct 03 23:53:34 trevis-219vm17 mrshd[11196]: pam_unix(mrsh:session): session closed for user root
      Oct 03 23:53:34 trevis-219vm17 systemd-logind[1522]: Session c19961 logged out. Waiting for processes to exit.
      Oct 03 23:53:34 trevis-219vm17 systemd-logind[1522]: Removed session c19961.
      Oct 03 23:53:45 trevis-219vm17 systemd[1]: user-runtime-dir@0.service: Unit not needed anymore. Stopping.
      

            People

              Deiter Alex Deiter
              jamesanunez James Nunez (Inactive)