LU-14112: parallel-scale-nfsv3 test racer_on_nfs hangs in locking


Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.6
    • Severity: 3

    Description

      parallel-scale-nfsv3 test_racer_on_nfs hangs in locking. Looking at the failure at https://testing.whamcloud.com/test_sets/be1b9c8d-3095-43fa-aaca-e43c7cbb84ce, in both client console logs we see several hung dd and cp processes and repeated "nfs: server trevis-57vm8 not responding" messages.
      From client1 (vm5) console log,

      [57611.728321] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 21:34:59 (1604093699)
      [57611.932785] Lustre: DEBUG MARKER: MDSCOUNT=4 OSTCOUNT=8 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
      [57614.148227] 8[14798]: segfault at 8 ip 00007f76d2c73798 sp 00007fff8d8b79f0 error 4 in ld-2.17.so[7f76d2c68000+22000]
      [57639.708629] 19[3351]: segfault at 8 ip 00007f1847a8b798 sp 00007ffd4c308b80 error 4 in ld-2.17.so[7f1847a80000+22000]
      [57641.608140] 6[7340]: segfault at 0 ip           (null) sp 00007ffc2bb99258 error 14 in 6[400000+6000]
      [57734.427404] 1[458]: segfault at 8 ip 00007f8013852798 sp 00007fff1727cc30 error 4 in ld-2.17.so[7f8013847000+22000]
      [57740.031404] 16[11741]: segfault at 8 ip 00007f3ad2163798 sp 00007ffdce934730 error 4 in ld-2.17.so[7f3ad2158000+22000]
      [58019.646614] nfs: server trevis-57vm8 not responding, still trying
      [58019.647966] nfs: server trevis-57vm8 not responding, still trying
      [58019.649123] nfs: server trevis-57vm8 not responding, still trying
      [58020.038570] nfs: server trevis-57vm8 not responding, still trying
      [58020.467569] nfs: server trevis-57vm8 not responding, still trying
      [58020.911526] nfs: server trevis-57vm8 not responding, still trying
      [58020.912834] nfs: server trevis-57vm8 not responding, still trying
      [58020.916520] nfs: server trevis-57vm8 not responding, still trying
      [58078.437386] INFO: task dd:19266 blocked for more than 120 seconds.
      [58078.438624] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [58078.439924] dd              D ffff971669d362a0     0 19266      1 0x00000084
      [58078.441291] Call Trace:
      [58078.441816]  [<ffffffffa797f340>] ? bit_wait+0x50/0x50
      [58078.442797]  [<ffffffffa79811f9>] schedule+0x29/0x70
      [58078.443700]  [<ffffffffa797ed01>] schedule_timeout+0x221/0x2d0
      [58078.444914]  [<ffffffffc032f7e6>] ? rpc_run_task+0xf6/0x150 [sunrpc]
      [58078.445993]  [<ffffffffc033e5d0>] ? rpc_put_task+0x10/0x20 [sunrpc]
      [58078.447264]  [<ffffffffa726d35e>] ? kvm_clock_get_cycles+0x1e/0x20
      [58078.448360]  [<ffffffffa7306392>] ? ktime_get_ts64+0x52/0xf0
      [58078.449348]  [<ffffffffa797f340>] ? bit_wait+0x50/0x50
      [58078.450237]  [<ffffffffa79808ed>] io_schedule_timeout+0xad/0x130
      [58078.451240]  [<ffffffffa7980988>] io_schedule+0x18/0x20
      [58078.452200]  [<ffffffffa797f351>] bit_wait_io+0x11/0x50
      [58078.453120]  [<ffffffffa797ee77>] __wait_on_bit+0x67/0x90
      [58078.454100]  [<ffffffffa73bcb91>] wait_on_page_bit+0x81/0xa0
      [58078.455117]  [<ffffffffa72c74d0>] ? wake_bit_function+0x40/0x40
      [58078.456155]  [<ffffffffa73bccc1>] __filemap_fdatawait_range+0x111/0x190
      [58078.457363]  [<ffffffffa73bcd54>] filemap_fdatawait_range+0x14/0x30
      [58078.458460]  [<ffffffffa73bf236>] filemap_write_and_wait_range+0x56/0x90
      [58078.459672]  [<ffffffffc05dc746>] nfs_file_fsync+0x86/0x110 [nfs]
      [58078.460758]  [<ffffffffa747fc2e>] vfs_fsync+0x2e/0x40
      [58078.461674]  [<ffffffffc05dcb86>] nfs_file_flush+0x46/0x60 [nfs]
      [58078.462737]  [<ffffffffa7447f87>] filp_close+0x37/0x90
      [58078.463666]  [<ffffffffa746b76c>] __close_fd+0x8c/0xb0
      [58078.464578]  [<ffffffffa7449aa3>] SyS_close+0x23/0x50
      [58078.465473]  [<ffffffffa798dede>] system_call_fastpath+0x25/0x2a
      [58078.466528]  [<ffffffffa798de21>] ? system_call_after_swapgs+0xae/0x146
      [58092.571814] nfs: server trevis-57vm8 not responding, still trying
      [58097.755630] nfs: server trevis-57vm8 not responding, still trying
      [58198.462827] INFO: task dd:19266 blocked for more than 120 seconds.
      [58198.464059] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [58198.465353] dd              D ffff971669d362a0     0 19266      1 0x00000084
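
      For quick triage, both symptoms described above can be pulled straight out of the saved console log; a minimal sketch, assuming the client1 (vm5) console log was saved locally as client1-console.log (hypothetical filename):

      # count the NFS timeout messages
      grep -c 'not responding, still trying' client1-console.log
      # list the hung tasks (dd, cp, ...) together with their call traces
      grep -A 30 'blocked for more than 120 seconds' client1-console.log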
      

      On the console log for MDS1 and MDS3 (vm8), we see

      [57623.525925] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 21:34:59 (1604093699)
      [57631.121275] LustreError: 14916:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20004a783:0x313e:0x0] get parent: rc = -2
      [57636.101177] LustreError: 14918:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20004a783:0x3267:0x0] get parent: rc = -2
      [57636.103551] LustreError: 14918:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 6 previous similar messages
      [57637.950585] LustreError: 14913:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20004a783:0x3288:0x0] get parent: rc = -2
      [57647.690933] LustreError: 14918:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20004a783:0x334a:0x0] get parent: rc = -2
      [57656.183969] LustreError: 14919:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20004a783:0x33ce:0x0] get parent: rc = -2
      [57656.186407] LustreError: 14919:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 25 previous similar messages
      [57671.270437] LustreError: 14916:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20004a783:0x3542:0x0] get parent: rc = -2
      [57671.273037] LustreError: 14916:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 12 previous similar messages
      [57692.317861] LustreError: 14913:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20004a783:0x3752:0x0] get parent: rc = -2
      [57692.320930] LustreError: 14913:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 46 previous similar messages
      [57725.517726] LustreError: 14918:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20004a783:0x3ac4:0x0] get parent: rc = -2
      [57725.521666] LustreError: 14918:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 128 previous similar messages
      [57754.052394] reconnect_path: npd != pd
      [57790.136336] LustreError: 14913:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20004a783:0x3fe7:0x0] get parent: rc = -2
      [57790.138707] LustreError: 14913:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 207 previous similar messages
      [57913.388418] LNet: Service thread pid 22912 was inactive for 62.03s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [57913.391801] Pid: 22912, comm: mdt00_009 3.10.0-1062.18.1.el7_lustre.x86_64 #1 SMP Mon Jun 8 13:47:48 UTC 2020
      [57913.393552] Call Trace:
      [57913.394117]  [<ffffffffc0c80150>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      [57913.395890]  [<ffffffffc0c82181>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
      [57913.397261]  [<ffffffffc10b369b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      [57913.398722]  [<ffffffffc10b3d20>] mdt_object_lock_internal+0x70/0x360 [mdt]
      [57913.400030]  [<ffffffffc10b40c7>] mdt_object_lock_try+0x27/0xb0 [mdt]
      [57913.401320]  [<ffffffffc10dab68>] mdt_object_open_lock+0x688/0xac0 [mdt]
      [57913.402637]  [<ffffffffc10df3b7>] mdt_reint_open+0x15b7/0x3240 [mdt]
      [57913.404007]  [<ffffffffc10d31b3>] mdt_reint_rec+0x83/0x210 [mdt]
      [57913.405289]  [<ffffffffc10af383>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      [57913.406619]  [<ffffffffc10bc622>] mdt_intent_open+0x82/0x3a0 [mdt]
      [57913.407807]  [<ffffffffc10ba745>] mdt_intent_policy+0x435/0xd80 [mdt]
      [57913.409088]  [<ffffffffc0c66e06>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
      [57913.410443]  [<ffffffffc0c8eba6>] ldlm_handle_enqueue0+0xa86/0x1620 [ptlrpc]
      [57913.411785]  [<ffffffffc0d181a2>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [57913.413151]  [<ffffffffc0d1efba>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      [57913.414553]  [<ffffffffc0cc392b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [57913.415991]  [<ffffffffc0cc7294>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      [57913.417401]  [<ffffffffaf4c6321>] kthread+0xd1/0xe0
      [57913.418379]  [<ffffffffafb8ed37>] ret_from_fork_nospec_end+0x0/0x39
      [57913.419653]  [<ffffffffffffffff>] 0xffffffffffffffff
      [57913.420913] LustreError: dumping log to /tmp/lustre-log.1604093989.22912
      [57913.858305] Pid: 19858, comm: mdt00_007 3.10.0-1062.18.1.el7_lustre.x86_64 #1 SMP Mon Jun 8 13:47:48 UTC 2020
      [57913.860079] Call Trace:
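
      For reference, the FIDs in the ll_dir_get_parent_fid messages above can be resolved to paths from a Lustre client with the standard lfs fid2path utility (mount point taken from the racer command line; under racer the files are short-lived, so the lookup may itself return ENOENT, matching the rc = -2 above):

      lfs fid2path /mnt/lustre '[0x20004a783:0x313e:0x0]'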
      

      This failure is for RHEL7.7, DNE ldiskfs testing.

      This hang looks very similar to several other tickets, but in each case the call that distinguishes that ticket is missing from the logs here (a quick way to check is sketched after this list):

      • looks very similar to LU-11359, but I don’t see “mdt_dom_discard_data” in the MDS console logs/call traces.
      • looks very similar to LU-11358, but I don’t see “mdt_brw_enqueue” in the MDS console logs/call traces.
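
      A minimal way to confirm that neither of those distinguishing functions appears anywhere in the dumped stacks, assuming the MDS console logs were saved locally as mds*-console.log (hypothetical filenames):

      grep -n -e 'mdt_dom_discard_data' -e 'mdt_brw_enqueue' mds*-console.log
      # no output means neither the LU-11359 nor the LU-11358 signature is present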

People

    Assignee: WC Triage
    Reporter: James Nunez (Inactive)
    Votes: 1
    Watchers: 3
