Lustre / LU-12654

parallel-scale-nfsv4 hangs in test racer_on_nfs

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.13.0, Lustre 2.12.3, Lustre 2.14.0, Lustre 2.12.4, Lustre 2.12.5, Lustre 2.12.7, Lustre 2.12.8
    • Environment: RHEL 8.0 clients
    • Severity: 3

    Description

      parallel-scale-nfsv4 test_racer_on_nfs hangs for RHEL 8.0 clients.

      Looking at a recent b2_12 hang at https://testing.whamcloud.com/test_sets/088a338c-b5bc-11e9-b88c-52540065bddc, we see the following on the console log for client 1 (vm5)

      [128401.640876] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 05:38:13 (1564810693)
      [128401.999359] Lustre: DEBUG MARKER: MDSCOUNT=1 OSTCOUNT=7 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
      [128615.457711] 1[7004]: segfault at 1b00 ip 0000000000001b00 sp 00007ffe851dc380 error 14
      [128615.459458] Code: Bad RIP value.
      [128615.505546] systemd-coredump[7103]: Not enough arguments passed by the kernel (0, expected 7).
      

      On the console log for client 2 (vm6), we see

      [128401.602782] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 05:38:13 (1564810693)
      [128401.976730] Lustre: DEBUG MARKER: MDSCOUNT=1 OSTCOUNT=7 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
      [128455.523450] 3[1583]: segfault at 8 ip 00007f97d78d50d3 sp 00007fffe77a2c80 error 4 in ld-2.28.so[7f97d78c9000+28000]
      [128455.527725] Code: 85 35 15 00 00 49 83 bb f0 00 00 00 00 48 c7 85 18 ff ff ff 00 00 00 00 0f 85 ce 13 00 00 49 8b 43 68 49 83 bb f8 00 00 00 00 <48> 8b 40 08 48 89 85 40 ff ff ff 0f 84 5c 0c 00 00 45 85 ed 74 57
      [128455.806473] FS-Cache: Duplicate cookie detected
      [128455.808605] FS-Cache: O-cookie c=000000007cfc03b1 [p=0000000080c9e5cd fl=222 nc=0 na=1]
      [128455.811322] FS-Cache: O-cookie d=00000000bcc65360 n=00000000662f4ab2
      [128455.813515] FS-Cache: O-key=[10] '0400020008010a090002'
      [128455.815189] FS-Cache: N-cookie c=00000000643d95b4 [p=0000000080c9e5cd fl=2 nc=0 na=1]
      [128455.817621] FS-Cache: N-cookie d=00000000bcc65360 n=0000000005cc644a
      [128455.819573] FS-Cache: N-key=[10] '0400020008010a090002'
      [128455.859830] FS-Cache: Duplicate cookie detected
      [128455.861408] FS-Cache: O-cookie c=000000007cfc03b1 [p=0000000080c9e5cd fl=222 nc=0 na=1]
      [128455.863994] FS-Cache: O-cookie d=00000000bcc65360 n=00000000662f4ab2
      [128455.866117] FS-Cache: O-key=[10] '0400020008010a090002'
      [128455.867879] FS-Cache: N-cookie c=00000000643d95b4 [p=0000000080c9e5cd fl=2 nc=0 na=1]
      [128455.870282] FS-Cache: N-cookie d=00000000bcc65360 n=000000003e9860aa
      [128455.872260] FS-Cache: N-key=[10] '0400020008010a090002'
      [128601.980237] 18[24202]: segfault at 8 ip 00007feb1dc170d3 sp 00007ffe58b0ddb0 error 4 in ld-2.28.so[7feb1dc0b000+28000]
      [128601.984214] Code: 85 35 15 00 00 49 83 bb f0 00 00 00 00 48 c7 85 18 ff ff ff 00 00 00 00 0f 85 ce 13 00 00 49 8b 43 68 49 83 bb f8 00 00 00 00 <48> 8b 40 08 48 89 85 40 ff ff ff 0f 84 5c 0c 00 00 45 85 ed 74 57
      [128702.709758] NFS: server trevis-39vm8 error: fileid changed
      [128702.709758] fsid 0:52: expected fileid 0x6eda66c5, got 0x6eda6609
      [132036.334902] sysrq: SysRq : Changing Loglevel
      [132036.336286] sysrq: Loglevel set to 8
      [132036.339034] sysrq: SysRq : Show State
      [132036.340124]   task                        PC stack   pid father
      [132036.341703] systemd         S    0     1      0 0x00000000
      [132036.343164] Call Trace:
      [132036.344180]  ? __schedule+0x254/0x840
      [132036.345162]  schedule+0x28/0x70
      [132036.346067]  schedule_hrtimeout_range_clock+0x18d/0x1a0
      [132036.347389]  ? ep_scan_ready_list.constprop.22+0x20c/0x230
      [132036.348748]  ep_poll+0x3c8/0x410
      [132036.349774]  ? wake_up_q+0x70/0x70
      [132036.350722]  do_epoll_wait+0xb0/0xd0
      [132036.351677]  __x64_sys_epoll_wait+0x1a/0x20
      [132036.352794]  do_syscall_64+0x5b/0x1b0
      [132036.353838]  entry_SYSCALL_64_after_hwframe+0x65/0xca
      [132036.355224] RIP: 0033:0x7f5db4153d97
      [132036.356244] Code: Bad RIP value.
      

      On the MDS console, we see some errors and then hundreds of “RPC request reserved *” messages:

      [128316.858258] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 05:38:13 (1564810693)
      [128322.237425] LustreError: 32025:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20006e9a6:0x30b4:0x0] get parent: rc = -2
      [128322.242093] LustreError: 32025:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 33 previous similar messages
      [128354.249725] LustreError: 32024:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20006e9a6:0x328d:0x0] get parent: rc = -2
      [128354.252140] LustreError: 32024:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 95 previous similar messages
      [128424.633861] LustreError: 32027:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20006e9a6:0x36e4:0x0] get parent: rc = -2
      [128424.638621] LustreError: 32027:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 119 previous similar messages
      [128508.193828] reconnect_path: npd != pd
      [128552.804263] LustreError: 32027:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20006e9a6:0x4139:0x0] get parent: rc = -2
      [128552.807733] LustreError: 32027:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 277 previous similar messages
      [128618.030608] RPC request reserved 84 but used 272
      [128618.031231] RPC request reserved 156 but used 272
      …
      

      In some of these test hangs, a process on the client is hung. Looking at the console log for client 2 (vm9) for https://testing.whamcloud.com/test_sets/4a3da7ac-b376-11e9-b753-52540065bddc, we see

      [ 2741.696387] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 08:16:41 (1564561001)
      [ 2742.041451] Lustre: DEBUG MARKER: MDSCOUNT=1 OSTCOUNT=7 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
      [ 2978.402841] 18[2351]: segfault at 1b00 ip 0000000000001b00 sp 00007ffe24da9dd0 error 14
      [ 2978.404661] Code: Bad RIP value.
      [ 2978.523782] FS-Cache: Duplicate cookie detected
      [ 2978.524726] FS-Cache: O-cookie c=0000000058c6eee2 [p=00000000fee62d73 fl=222 nc=0 na=1]
      [ 2978.526175] FS-Cache: O-cookie d=000000004f183641 n=00000000d718a10f
      [ 2978.527299] FS-Cache: O-key=[10] '0400020008010a090002'
      [ 2978.528241] FS-Cache: N-cookie c=000000007561c599 [p=00000000fee62d73 fl=2 nc=0 na=1]
      [ 2978.529707] FS-Cache: N-cookie d=000000004f183641 n=000000005df420e8
      [ 2978.530869] FS-Cache: N-key=[10] '0400020008010a090002'
      [ 2978.536357] FS-Cache: Duplicate cookie detected
      [ 2978.537374] FS-Cache: O-cookie c=0000000058c6eee2 [p=00000000fee62d73 fl=222 nc=0 na=1]
      [ 2978.538877] FS-Cache: O-cookie d=000000004f183641 n=00000000d718a10f
      [ 2978.540002] FS-Cache: O-key=[10] '0400020008010a090002'
      [ 2978.540951] FS-Cache: N-cookie c=000000007561c599 [p=00000000fee62d73 fl=2 nc=0 na=1]
      [ 2978.542297] FS-Cache: N-cookie d=000000004f183641 n=000000001ed1f7ed
      [ 2978.543498] FS-Cache: N-key=[10] '0400020008010a090002'
      [ 3318.608229] INFO: task dir_create.sh:15909 blocked for more than 120 seconds.
      [ 3318.609671]       Tainted: G           OE    --------- -  - 4.18.0-80.4.2.el8_0.x86_64 #1
      [ 3318.611089] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 3318.612449] dir_create.sh   D    0 15909  15897 0x80000086
      [ 3318.613433] Call Trace:
      [ 3318.613989]  ? __schedule+0x254/0x840
      [ 3318.614678]  schedule+0x28/0x70
      [ 3318.615312]  io_schedule+0x12/0x40
      [ 3318.615980]  wait_on_page_bit_common+0x117/0x250
      [ 3318.616830]  ? page_cache_tree_insert+0xe0/0xe0
      [ 3318.617660]  __filemap_fdatawait_range+0xe4/0x140
      [ 3318.618515]  ? __filemap_fdatawrite_range+0xc8/0xf0
      [ 3318.619402]  filemap_write_and_wait_range+0x45/0x80
      [ 3318.620326]  nfs_file_fsync+0x44/0x1e0 [nfs]
      [ 3318.621118]  filp_close+0x31/0x60
      [ 3318.621772]  put_files_struct+0x6c/0xc0
      [ 3318.622503]  do_exit+0x2bd/0xad0
      [ 3318.623125]  ? try_to_wake_up+0x54/0x4b0
      [ 3318.623852]  do_group_exit+0x3a/0xa0
      [ 3318.624532]  get_signal+0x27c/0x590
      [ 3318.625207]  ? __send_signal+0x333/0x4a0
      [ 3318.625942]  do_signal+0x36/0x610
      [ 3318.626583]  ? do_send_sig_info+0x65/0x90
      [ 3318.627338]  exit_to_usermode_loop+0x89/0xf0
      [ 3318.628128]  do_syscall_64+0x182/0x1b0
      [ 3318.628840]  entry_SYSCALL_64_after_hwframe+0x65/0xca
      [ 3318.629767] RIP: 0033:0x7f7e25af4c3b
      [ 3318.630442] Code: Bad RIP value.
      [ 3318.631057] RSP: 002b:00007fff0b5d6338 EFLAGS: 00000202 ORIG_RAX: 000000000000003e
      [ 3318.632354] RAX: 0000000000000000 RBX: 000055893f033960 RCX: 00007f7e25af4c3b
      [ 3318.633609] RDX: 000055893e5d36a0 RSI: 000000000000000f RDI: 0000000000003e25
      [ 3318.634846] RBP: 000055893f028600 R08: 00007fff0b5d6270 R09: 0000000000000007
      [ 3318.636092] R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000000
      [ 3318.637334] R13: 00007fff0b5d6350 R14: 00007fff0b5d6354 R15: 0000000000000000
      [ 3441.482869] INFO: task dir_create.sh:15909 blocked for more than 120 seconds.
      [ 3441.484224]       Tainted: G           OE    --------- -  - 4.18.0-80.4.2.el8_0.x86_64 #1
      

      We’ve seen variations of this error for master clients:
      https://testing.whamcloud.com/test_sets/eab3b214-b8e7-11e9-98c8-52540065bddc

      and for b2_12 clients:
      https://testing.whamcloud.com/test_sets/f1613024-b64d-11e9-9f36-52540065bddc

Activity
            Sarah Liu added a comment - another occurrence on 2.15.5: https://testing.whamcloud.com/test_sets/a8e2119e-6325-48ec-9efd-452ebf18eaf7
            Sarah Liu added a comment -

            Also see similar error on b2_12 branch Ubuntu 18.04 client
            https://testing.whamcloud.com/test_sets/55e079ef-d783-42a4-901c-9d86191ca678

            client 1 dmesg shows

            [Sat Nov 20 10:59:38 2021] Lustre: DEBUG MARKER: dmesg
            [Sat Nov 20 10:59:38 2021] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 10:59:39 \(1637405979\)
            [Sat Nov 20 10:59:38 2021] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 10:59:39 (1637405979)
            [Sat Nov 20 10:59:39 2021] Lustre: DEBUG MARKER: MDSCOUNT=1 OSTCOUNT=7 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
            [Sat Nov 20 11:00:13 2021] 7[26754]: segfault at 8 ip 00007f137b9049a9 sp 00007ffc9653c560 error 4 in ld-2.27.so[7f137b8f8000+29000]
            [Sat Nov 20 11:00:28 2021] 19[18133]: segfault at 8 ip 00007f621c1209a9 sp 00007fffe5a23be0 error 4 in ld-2.27.so[7f621c114000+29000]
            

            MDS console log shows

            [ 6920.488820] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 10:59:39 \(1637405979\)
            [ 6920.686899] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 10:59:39 (1637405979)
            [ 6923.513485] LustreError: 24621:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x200000405:0x307e:0x0] get parent: rc = -2
            [ 6927.362117] LustreError: 24622:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x200000405:0x3092:0x0] get parent: rc = -2
            [ 6927.364552] LustreError: 24622:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 1 previous similar message
            [ 6930.876962] LustreError: 24624:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x200000405:0x30e9:0x0] get parent: rc = -2
            [ 6930.879476] LustreError: 24624:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 2 previous similar messages
            [ 6934.054373] LustreError: 24619:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x200000405:0x3146:0x0] get parent: rc = -2
            [ 6934.064686] LustreError: 24619:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 3 previous similar messages
            [ 6943.939346] LustreError: 24623:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x200000405:0x329c:0x0] get parent: rc = -2
            [ 6943.941779] LustreError: 24623:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 18 previous similar messages
            [ 6952.134561] LustreError: 24618:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x200000405:0x33a4:0x0] get parent: rc = -2
            [ 6952.137002] LustreError: 24618:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 3 previous similar messages
            [ 6978.345219] LustreError: 24620:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x200000405:0x3615:0x0] get parent: rc = -2
            [ 6978.354358] LustreError: 24620:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 9 previous similar messages
            [ 7014.365989] LustreError: 24620:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x200000405:0x39e1:0x0] get parent: rc = -2
            [ 7014.373482] LustreError: 24620:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 50 previous similar messages
            [ 7079.605839] LustreError: 24624:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x200000405:0x4293:0x0] get parent: rc = -2
            [ 7079.612565] LustreError: 24624:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 151 previous similar messages
            [ 7213.605688] LustreError: 24620:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x200000405:0x5162:0x0] get parent: rc = -2
            [ 7213.611971] LustreError: 24620:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 366 previous similar messages
            [ 7221.071656] RPC request reserved 84 but used 276
            [ 7221.076982] RPC request reserved 84 but used 276
            [ 7221.094387] RPC request reserved 84 but used 276
            [ 7221.110672] RPC request reserved 84 but used 276
            [ 7221.112185] RPC request reserved 84 but used 276
            [ 7224.462275] RPC request reserved 84 but used 276
            [ 7227.794169] RPC request reserved 84 but used 276
            [ 7228.050120] RPC request reserved 84 but used 276
            [ 7231.374471] RPC request reserved 84 but used 276
            
            Jian Yu added a comment -

            The failure also occurred on Lustre b2_12 branch with RHEL 8.1 client:
            https://testing.whamcloud.com/test_sets/0d46d9da-3ecd-11ea-b3fe-52540065bddc

            Minh Diep added a comment -

            I don't see fscache enabled:

            -bash-4.4# cat /proc/fs/fscache/stats
            FS-Cache statistics
            Cookies: idx=5 dat=0 spc=0
            Objects: alc=0 nal=0 avl=0 ded=0
            ChkAux : non=0 ok=0 upd=0 obs=0
            Pages  : mrk=0 unc=0
            Acquire: n=8 nul=0 noc=0 ok=5 nbf=0 oom=0
            Lookups: n=0 neg=0 pos=0 crt=0 tmo=0
            Invals : n=0 run=0
            Updates: n=0 nul=0 run=0
            Relinqs: n=3 nul=0 wcr=0 rtr=0
            AttrChg: n=0 ok=0 nbf=0 oom=0 run=0
            Allocs : n=0 ok=0 wt=0 nbf=0 int=0
            Allocs : ops=0 owt=0 abt=0
            Retrvls: n=0 ok=0 wt=0 nod=0 nbf=0 int=0 oom=0
            Retrvls: ops=0 owt=0 abt=0
            Stores : n=0 ok=0 agn=0 nbf=0 oom=0
            Stores : ops=0 run=0 pgs=0 rxd=0 olm=0
            VmScan : nos=0 gon=0 bsy=0 can=0 wt=0
            Ops    : pend=0 run=0 enq=0 can=0 rej=0
            Ops    : ini=0 dfr=0 rel=0 gc=0
            CacheOp: alo=0 luo=0 luc=0 gro=0
            CacheOp: inv=0 upo=0 dro=0 pto=0 atc=0 syn=0
            CacheOp: rap=0 ras=0 alp=0 als=0 wrp=0 ucp=0 dsp=0
            CacheEv: nsp=0 stl=0 rtr=0 cul=0
            
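            For reference (an illustrative check, not output from these test logs): NFS only engages FS-Cache when the client mount was made with the "fsc" option, so /proc/mounts is a quick way to corroborate the stats above.

            #!/bin/sh
            # Illustrative sketch: NFS uses FS-Cache only if the mount requested
            # it with -o fsc, so the option's presence in /proc/mounts tells us
            # whether FS-Cache can be in the IO path at all.
            if grep -qE '^\S+ \S+ nfs4? \S*\bfsc\b' /proc/mounts; then
                    echo "at least one NFS mount requested FS-Cache (fsc)"
            else
                    echo "no NFS mount requested FS-Cache"
            fi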

            James Nunez (Inactive) added a comment -

            Is this hang related? It's not racer_on_nfs, but it has some of the same error/failure characteristics: https://testing.whamcloud.com/test_sets/84138e04-0823-11ea-98f1-52540065bddc
            Peter Jones added a comment -

            Lai

            Could you please advise on this one?

            Thanks

            Peter


            Andreas Dilger added a comment -

            Two things here:

            • the "ll_dir_get_parent_fid()" message is spurious - the NFS client can send an old file handle with a parent FID that may refer to a directory that racer has deleted. We should quiet this error message for -ENOENT at least.
            • the FS-Cache code is bypassing IO to Lustre and directing it to a local cache device. This may improve NFS performance, but significantly reduces the effectiveness of our NFS-on-Lustre testing. Also, there may be bugs in Fscache triggered by racer, and we don't want to be debugging those. The use of FScache should be removed from our NFS testing configuration.
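            A sketch of what that last point could look like on the NFS client (illustrative only; the actual test-framework mount code may differ, and the host/path names are copied from the logs above). Dropping the "fsc" mount option is sufficient, because NFS never uses FS-Cache unless it is explicitly requested at mount time:

            # Illustrative sketch -- not the actual Lustre test-framework code.
            # With FS-Cache (current): reads may be served from a local cache
            mount -t nfs4 -o fsc trevis-39vm8:/mnt/lustre /mnt/lustre
            # Without FS-Cache (proposed): every read goes through NFS to Lustre
            mount -t nfs4 trevis-39vm8:/mnt/lustre /mnt/lustre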

            People

              Assignee: Lai Siyao
              Reporter: James Nunez (Inactive)
