LU-14112

parallel-scale-nfsv3 test racer_on_nfs hangs in locking

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.6
    • Labels: None
    • Severity: 3

    Description

      parallel-scale-nfsv3 test_racer_on_nfs hangs in locking. Looking at the failure at https://testing.whamcloud.com/test_sets/be1b9c8d-3095-43fa-aaca-e43c7cbb84ce, both client console logs show several hung dd and cp processes and repeated "nfs: server not responding" messages.
      From client1 (vm5) console log,

      [57611.728321] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 21:34:59 (1604093699)
      [57611.932785] Lustre: DEBUG MARKER: MDSCOUNT=4 OSTCOUNT=8 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
      [57614.148227] 8[14798]: segfault at 8 ip 00007f76d2c73798 sp 00007fff8d8b79f0 error 4 in ld-2.17.so[7f76d2c68000+22000]
      [57639.708629] 19[3351]: segfault at 8 ip 00007f1847a8b798 sp 00007ffd4c308b80 error 4 in ld-2.17.so[7f1847a80000+22000]
      [57641.608140] 6[7340]: segfault at 0 ip           (null) sp 00007ffc2bb99258 error 14 in 6[400000+6000]
      [57734.427404] 1[458]: segfault at 8 ip 00007f8013852798 sp 00007fff1727cc30 error 4 in ld-2.17.so[7f8013847000+22000]
      [57740.031404] 16[11741]: segfault at 8 ip 00007f3ad2163798 sp 00007ffdce934730 error 4 in ld-2.17.so[7f3ad2158000+22000]
      [58019.646614] nfs: server trevis-57vm8 not responding, still trying
      [58019.647966] nfs: server trevis-57vm8 not responding, still trying
      [58019.649123] nfs: server trevis-57vm8 not responding, still trying
      [58020.038570] nfs: server trevis-57vm8 not responding, still trying
      [58020.467569] nfs: server trevis-57vm8 not responding, still trying
      [58020.911526] nfs: server trevis-57vm8 not responding, still trying
      [58020.912834] nfs: server trevis-57vm8 not responding, still trying
      [58020.916520] nfs: server trevis-57vm8 not responding, still trying
      [58078.437386] INFO: task dd:19266 blocked for more than 120 seconds.
      [58078.438624] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [58078.439924] dd              D ffff971669d362a0     0 19266      1 0x00000084
      [58078.441291] Call Trace:
      [58078.441816]  [<ffffffffa797f340>] ? bit_wait+0x50/0x50
      [58078.442797]  [<ffffffffa79811f9>] schedule+0x29/0x70
      [58078.443700]  [<ffffffffa797ed01>] schedule_timeout+0x221/0x2d0
      [58078.444914]  [<ffffffffc032f7e6>] ? rpc_run_task+0xf6/0x150 [sunrpc]
      [58078.445993]  [<ffffffffc033e5d0>] ? rpc_put_task+0x10/0x20 [sunrpc]
      [58078.447264]  [<ffffffffa726d35e>] ? kvm_clock_get_cycles+0x1e/0x20
      [58078.448360]  [<ffffffffa7306392>] ? ktime_get_ts64+0x52/0xf0
      [58078.449348]  [<ffffffffa797f340>] ? bit_wait+0x50/0x50
      [58078.450237]  [<ffffffffa79808ed>] io_schedule_timeout+0xad/0x130
      [58078.451240]  [<ffffffffa7980988>] io_schedule+0x18/0x20
      [58078.452200]  [<ffffffffa797f351>] bit_wait_io+0x11/0x50
      [58078.453120]  [<ffffffffa797ee77>] __wait_on_bit+0x67/0x90
      [58078.454100]  [<ffffffffa73bcb91>] wait_on_page_bit+0x81/0xa0
      [58078.455117]  [<ffffffffa72c74d0>] ? wake_bit_function+0x40/0x40
      [58078.456155]  [<ffffffffa73bccc1>] __filemap_fdatawait_range+0x111/0x190
      [58078.457363]  [<ffffffffa73bcd54>] filemap_fdatawait_range+0x14/0x30
      [58078.458460]  [<ffffffffa73bf236>] filemap_write_and_wait_range+0x56/0x90
      [58078.459672]  [<ffffffffc05dc746>] nfs_file_fsync+0x86/0x110 [nfs]
      [58078.460758]  [<ffffffffa747fc2e>] vfs_fsync+0x2e/0x40
      [58078.461674]  [<ffffffffc05dcb86>] nfs_file_flush+0x46/0x60 [nfs]
      [58078.462737]  [<ffffffffa7447f87>] filp_close+0x37/0x90
      [58078.463666]  [<ffffffffa746b76c>] __close_fd+0x8c/0xb0
      [58078.464578]  [<ffffffffa7449aa3>] SyS_close+0x23/0x50
      [58078.465473]  [<ffffffffa798dede>] system_call_fastpath+0x25/0x2a
      [58078.466528]  [<ffffffffa798de21>] ? system_call_after_swapgs+0xae/0x146
      [58092.571814] nfs: server trevis-57vm8 not responding, still trying
      [58097.755630] nfs: server trevis-57vm8 not responding, still trying
      [58198.462827] INFO: task dd:19266 blocked for more than 120 seconds.
      [58198.464059] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [58198.465353] dd              D ffff971669d362a0     0 19266      1 0x00000084
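
      For reference, here is a minimal sketch of how the blocked processes and their kernel stacks can be inspected on the NFS client while the hang is in progress (the PID below is the dd task from the trace above; the commands are illustrative):

      # List tasks stuck in uninterruptible (D) sleep, e.g. the hung dd and cp processes
      ps -eo pid,stat,comm | awk '$2 ~ /D/'
      # Dump the kernel stack of one blocked task (PID 19266 from the trace above)
      cat /proc/19266/stack
      # Ask the kernel to log stacks of all blocked tasks, then pull them out of dmesg
      echo w > /proc/sysrq-trigger
      dmesg | grep -A 30 'blocked for more than'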
      

      In the console log for MDS1,3 (vm8), we see

      [57623.525925] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 21:34:59 (1604093699)
      [57631.121275] LustreError: 14916:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20004a783:0x313e:0x0] get parent: rc = -2
      [57636.101177] LustreError: 14918:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20004a783:0x3267:0x0] get parent: rc = -2
      [57636.103551] LustreError: 14918:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 6 previous similar messages
      [57637.950585] LustreError: 14913:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20004a783:0x3288:0x0] get parent: rc = -2
      [57647.690933] LustreError: 14918:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20004a783:0x334a:0x0] get parent: rc = -2
      [57656.183969] LustreError: 14919:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20004a783:0x33ce:0x0] get parent: rc = -2
      [57656.186407] LustreError: 14919:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 25 previous similar messages
      [57671.270437] LustreError: 14916:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20004a783:0x3542:0x0] get parent: rc = -2
      [57671.273037] LustreError: 14916:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 12 previous similar messages
      [57692.317861] LustreError: 14913:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20004a783:0x3752:0x0] get parent: rc = -2
      [57692.320930] LustreError: 14913:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 46 previous similar messages
      [57725.517726] LustreError: 14918:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20004a783:0x3ac4:0x0] get parent: rc = -2
      [57725.521666] LustreError: 14918:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 128 previous similar messages
      [57754.052394] reconnect_path: npd != pd
      [57790.136336] LustreError: 14913:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20004a783:0x3fe7:0x0] get parent: rc = -2
      [57790.138707] LustreError: 14913:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 207 previous similar messages
      [57913.388418] LNet: Service thread pid 22912 was inactive for 62.03s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [57913.391801] Pid: 22912, comm: mdt00_009 3.10.0-1062.18.1.el7_lustre.x86_64 #1 SMP Mon Jun 8 13:47:48 UTC 2020
      [57913.393552] Call Trace:
      [57913.394117]  [<ffffffffc0c80150>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      [57913.395890]  [<ffffffffc0c82181>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
      [57913.397261]  [<ffffffffc10b369b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      [57913.398722]  [<ffffffffc10b3d20>] mdt_object_lock_internal+0x70/0x360 [mdt]
      [57913.400030]  [<ffffffffc10b40c7>] mdt_object_lock_try+0x27/0xb0 [mdt]
      [57913.401320]  [<ffffffffc10dab68>] mdt_object_open_lock+0x688/0xac0 [mdt]
      [57913.402637]  [<ffffffffc10df3b7>] mdt_reint_open+0x15b7/0x3240 [mdt]
      [57913.404007]  [<ffffffffc10d31b3>] mdt_reint_rec+0x83/0x210 [mdt]
      [57913.405289]  [<ffffffffc10af383>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      [57913.406619]  [<ffffffffc10bc622>] mdt_intent_open+0x82/0x3a0 [mdt]
      [57913.407807]  [<ffffffffc10ba745>] mdt_intent_policy+0x435/0xd80 [mdt]
      [57913.409088]  [<ffffffffc0c66e06>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
      [57913.410443]  [<ffffffffc0c8eba6>] ldlm_handle_enqueue0+0xa86/0x1620 [ptlrpc]
      [57913.411785]  [<ffffffffc0d181a2>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [57913.413151]  [<ffffffffc0d1efba>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      [57913.414553]  [<ffffffffc0cc392b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [57913.415991]  [<ffffffffc0cc7294>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      [57913.417401]  [<ffffffffaf4c6321>] kthread+0xd1/0xe0
      [57913.418379]  [<ffffffffafb8ed37>] ret_from_fork_nospec_end+0x0/0x39
      [57913.419653]  [<ffffffffffffffff>] 0xffffffffffffffff
      [57913.420913] LustreError: dumping log to /tmp/lustre-log.1604093989.22912
      [57913.858305] Pid: 19858, comm: mdt00_007 3.10.0-1062.18.1.el7_lustre.x86_64 #1 SMP Mon Jun 8 13:47:48 UTC 2020
      [57913.860079] Call Trace:
      

      This failure is for RHEL7.7, DNE ldiskfs testing.
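
      The stuck service thread also dumped a binary Lustre debug log (/tmp/lustre-log.1604093989.22912 above). A rough sketch of how that dump and the DLM lock state on the MDS could be examined (assuming lctl is available on the MDS node; the output file name is illustrative):

      # Convert the binary debug dump referenced in the console log into readable text
      lctl debug_file /tmp/lustre-log.1604093989.22912 /tmp/lustre-log.1604093989.txt
      # Check how many locks and resources each DLM namespace on the MDS is holding
      lctl get_param ldlm.namespaces.*.lock_count
      lctl get_param ldlm.namespaces.*.resource_count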

      This hang looks very similar to several other tickets, but in each case a key call is missing from the logs (see the grep sketch after this list):

      • It looks very similar to LU-11359, but I don’t see “mdt_dom_discard_data” in the MDS console logs/call traces.
      • It looks very similar to LU-11358, but I don’t see “mdt_brw_enqueue” in the MDS console logs/call traces.
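
      A quick way to check that distinction is to search the MDS console logs and stack dumps for the functions those tickets hinge on; neither shows up here (the log file name below is illustrative):

      # Zero hits for either function in the MDS console log for this failure
      grep -c 'mdt_dom_discard_data' trevis-57vm8-console.log
      grep -c 'mdt_brw_enqueue' trevis-57vm8-console.log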

          Activity


            adilger Andreas Dilger added a comment - This is as much an NFS issue as it might be Lustre, so we do not plan to test or debug NFSv3 racer issues at this point.

            People

              Assignee: wc-triage WC Triage
              Reporter: jamesanunez James Nunez (Inactive)
              Votes: 1
              Watchers: 3
