Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-12654

parallel-scale-nfsv4 hangs in test racer_on_nfs

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: Lustre 2.13.0, Lustre 2.12.3, Lustre 2.12.4
    • Fix Version/s: None
    • Labels:
    • Environment:
      RHEL 8.0 clients
    • Severity:
      3
    • Rank (Obsolete):
      9223372036854775807

      Description

      parallel-scale-nfsv4 test_racer_on_nfs hangs for RHEL 8.0 clients.

      Looking at a recent b2_12 hang at https://testing.whamcloud.com/test_sets/088a338c-b5bc-11e9-b88c-52540065bddc
      , we see the following on the console log for client 1 (vm5)

      [128401.640876] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 05:38:13 (1564810693)
      [128401.999359] Lustre: DEBUG MARKER: MDSCOUNT=1 OSTCOUNT=7 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
      [128615.457711] 1[7004]: segfault at 1b00 ip 0000000000001b00 sp 00007ffe851dc380 error 14
      [128615.459458] Code: Bad RIP value.
      [128615.505546] systemd-coredump[7103]: Not enough arguments passed by the kernel (0, expected 7).
      

      On the console log for client 2 (vm6), we see

      [128401.602782] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 05:38:13 (1564810693)
      [128401.976730] Lustre: DEBUG MARKER: MDSCOUNT=1 OSTCOUNT=7 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
      [128455.523450] 3[1583]: segfault at 8 ip 00007f97d78d50d3 sp 00007fffe77a2c80 error 4 in ld-2.28.so[7f97d78c9000+28000]
      [128455.527725] Code: 85 35 15 00 00 49 83 bb f0 00 00 00 00 48 c7 85 18 ff ff ff 00 00 00 00 0f 85 ce 13 00 00 49 8b 43 68 49 83 bb f8 00 00 00 00 <48> 8b 40 08 48 89 85 40 ff ff ff 0f 84 5c 0c 00 00 45 85 ed 74 57
      [128455.806473] FS-Cache: Duplicate cookie detected
      [128455.808605] FS-Cache: O-cookie c=000000007cfc03b1 [p=0000000080c9e5cd fl=222 nc=0 na=1]
      [128455.811322] FS-Cache: O-cookie d=00000000bcc65360 n=00000000662f4ab2
      [128455.813515] FS-Cache: O-key=[10] '0400020008010a090002'
      [128455.815189] FS-Cache: N-cookie c=00000000643d95b4 [p=0000000080c9e5cd fl=2 nc=0 na=1]
      [128455.817621] FS-Cache: N-cookie d=00000000bcc65360 n=0000000005cc644a
      [128455.819573] FS-Cache: N-key=[10] '0400020008010a090002'
      [128455.859830] FS-Cache: Duplicate cookie detected
      [128455.861408] FS-Cache: O-cookie c=000000007cfc03b1 [p=0000000080c9e5cd fl=222 nc=0 na=1]
      [128455.863994] FS-Cache: O-cookie d=00000000bcc65360 n=00000000662f4ab2
      [128455.866117] FS-Cache: O-key=[10] '0400020008010a090002'
      [128455.867879] FS-Cache: N-cookie c=00000000643d95b4 [p=0000000080c9e5cd fl=2 nc=0 na=1]
      [128455.870282] FS-Cache: N-cookie d=00000000bcc65360 n=000000003e9860aa
      [128455.872260] FS-Cache: N-key=[10] '0400020008010a090002'
      [128601.980237] 18[24202]: segfault at 8 ip 00007feb1dc170d3 sp 00007ffe58b0ddb0 error 4 in ld-2.28.so[7feb1dc0b000+28000]
      [128601.984214] Code: 85 35 15 00 00 49 83 bb f0 00 00 00 00 48 c7 85 18 ff ff ff 00 00 00 00 0f 85 ce 13 00 00 49 8b 43 68 49 83 bb f8 00 00 00 00 <48> 8b 40 08 48 89 85 40 ff ff ff 0f 84 5c 0c 00 00 45 85 ed 74 57
      [128702.709758] NFS: server trevis-39vm8 error: fileid changed
      [128702.709758] fsid 0:52: expected fileid 0x6eda66c5, got 0x6eda6609
      [132036.334902] sysrq: SysRq : Changing Loglevel
      [132036.336286] sysrq: Loglevel set to 8
      [132036.339034] sysrq: SysRq : Show State
      [132036.340124]   task                        PC stack   pid father
      [132036.341703] systemd         S    0     1      0 0x00000000
      [132036.343164] Call Trace:
      [132036.344180]  ? __schedule+0x254/0x840
      [132036.345162]  schedule+0x28/0x70
      [132036.346067]  schedule_hrtimeout_range_clock+0x18d/0x1a0
      [132036.347389]  ? ep_scan_ready_list.constprop.22+0x20c/0x230
      [132036.348748]  ep_poll+0x3c8/0x410
      [132036.349774]  ? wake_up_q+0x70/0x70
      [132036.350722]  do_epoll_wait+0xb0/0xd0
      [132036.351677]  __x64_sys_epoll_wait+0x1a/0x20
      [132036.352794]  do_syscall_64+0x5b/0x1b0
      [132036.353838]  entry_SYSCALL_64_after_hwframe+0x65/0xca
      [132036.355224] RIP: 0033:0x7f5db4153d97
      [132036.356244] Code: Bad RIP value.
      

      On the MDS console, we see some errors and then hundreds of “RPC request reserved *” messages:

      [128316.858258] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 05:38:13 (1564810693)
      [128322.237425] LustreError: 32025:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20006e9a6:0x30b4:0x0] get parent: rc = -2
      [128322.242093] LustreError: 32025:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 33 previous similar messages
      [128354.249725] LustreError: 32024:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20006e9a6:0x328d:0x0] get parent: rc = -2
      [128354.252140] LustreError: 32024:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 95 previous similar messages
      [128424.633861] LustreError: 32027:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20006e9a6:0x36e4:0x0] get parent: rc = -2
      [128424.638621] LustreError: 32027:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 119 previous similar messages
      [128508.193828] reconnect_path: npd != pd
      [128552.804263] LustreError: 32027:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20006e9a6:0x4139:0x0] get parent: rc = -2
      [128552.807733] LustreError: 32027:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 277 previous similar messages
      [128618.030608] RPC request reserved 84 but used 272
      [128618.031231] RPC request reserved 156 but used 272
      …
      

      In some of these test hangs, we see a process on the client hung. Looking at the console log for client 2 (vm9) for https://testing.whamcloud.com/test_sets/4a3da7ac-b376-11e9-b753-52540065bddc , we see

      ======================================= 08:16:41 \(1564561001\)
      [ 2741.696387] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 08:16:41 (1564561001)
      [ 2742.041451] Lustre: DEBUG MARKER: MDSCOUNT=1 OSTCOUNT=7 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
      [ 2978.402841] 18[2351]: segfault at 1b00 ip 0000000000001b00 sp 00007ffe24da9dd0 error 14
      [ 2978.404661] Code: Bad RIP value.
      [ 2978.523782] FS-Cache: Duplicate cookie detected
      [ 2978.524726] FS-Cache: O-cookie c=0000000058c6eee2 [p=00000000fee62d73 fl=222 nc=0 na=1]
      [ 2978.526175] FS-Cache: O-cookie d=000000004f183641 n=00000000d718a10f
      [ 2978.527299] FS-Cache: O-key=[10] '0400020008010a090002'
      [ 2978.528241] FS-Cache: N-cookie c=000000007561c599 [p=00000000fee62d73 fl=2 nc=0 na=1]
      [ 2978.529707] FS-Cache: N-cookie d=000000004f183641 n=000000005df420e8
      [ 2978.530869] FS-Cache: N-key=[10] '0400020008010a090002'
      [ 2978.536357] FS-Cache: Duplicate cookie detected
      [ 2978.537374] FS-Cache: O-cookie c=0000000058c6eee2 [p=00000000fee62d73 fl=222 nc=0 na=1]
      [ 2978.538877] FS-Cache: O-cookie d=000000004f183641 n=00000000d718a10f
      [ 2978.540002] FS-Cache: O-key=[10] '0400020008010a090002'
      [ 2978.540951] FS-Cache: N-cookie c=000000007561c599 [p=00000000fee62d73 fl=2 nc=0 na=1]
      [ 2978.542297] FS-Cache: N-cookie d=000000004f183641 n=000000001ed1f7ed
      [ 2978.543498] FS-Cache: N-key=[10] '0400020008010a090002'
      [ 3318.608229] INFO: task dir_create.sh:15909 blocked for more than 120 seconds.
      [ 3318.609671]       Tainted: G           OE    --------- -  - 4.18.0-80.4.2.el8_0.x86_64 #1
      [ 3318.611089] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 3318.612449] dir_create.sh   D    0 15909  15897 0x80000086
      [ 3318.613433] Call Trace:
      [ 3318.613989]  ? __schedule+0x254/0x840
      [ 3318.614678]  schedule+0x28/0x70
      [ 3318.615312]  io_schedule+0x12/0x40
      [ 3318.615980]  wait_on_page_bit_common+0x117/0x250
      [ 3318.616830]  ? page_cache_tree_insert+0xe0/0xe0
      [ 3318.617660]  __filemap_fdatawait_range+0xe4/0x140
      [ 3318.618515]  ? __filemap_fdatawrite_range+0xc8/0xf0
      [ 3318.619402]  filemap_write_and_wait_range+0x45/0x80
      [ 3318.620326]  nfs_file_fsync+0x44/0x1e0 [nfs]
      [ 3318.621118]  filp_close+0x31/0x60
      [ 3318.621772]  put_files_struct+0x6c/0xc0
      [ 3318.622503]  do_exit+0x2bd/0xad0
      [ 3318.623125]  ? try_to_wake_up+0x54/0x4b0
      [ 3318.623852]  do_group_exit+0x3a/0xa0
      [ 3318.624532]  get_signal+0x27c/0x590
      [ 3318.625207]  ? __send_signal+0x333/0x4a0
      [ 3318.625942]  do_signal+0x36/0x610
      [ 3318.626583]  ? do_send_sig_info+0x65/0x90
      [ 3318.627338]  exit_to_usermode_loop+0x89/0xf0
      [ 3318.628128]  do_syscall_64+0x182/0x1b0
      [ 3318.628840]  entry_SYSCALL_64_after_hwframe+0x65/0xca
      [ 3318.629767] RIP: 0033:0x7f7e25af4c3b
      [ 3318.630442] Code: Bad RIP value.
      [ 3318.631057] RSP: 002b:00007fff0b5d6338 EFLAGS: 00000202 ORIG_RAX: 000000000000003e
      [ 3318.632354] RAX: 0000000000000000 RBX: 000055893f033960 RCX: 00007f7e25af4c3b
      [ 3318.633609] RDX: 000055893e5d36a0 RSI: 000000000000000f RDI: 0000000000003e25
      [ 3318.634846] RBP: 000055893f028600 R08: 00007fff0b5d6270 R09: 0000000000000007
      [ 3318.636092] R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000000
      [ 3318.637334] R13: 00007fff0b5d6350 R14: 00007fff0b5d6354 R15: 0000000000000000
      [ 3441.482869] INFO: task dir_create.sh:15909 blocked for more than 120 seconds.
      [ 3441.484224]       Tainted: G           OE    --------- -  - 4.18.0-80.4.2.el8_0.x86_64 #1
      

      We’ve seen variations of this error for master clients
      https://testing.whamcloud.com/test_sets/eab3b214-b8e7-11e9-98c8-52540065bddc

      and b2_12 clients
      https://testing.whamcloud.com/test_sets/f1613024-b64d-11e9-9f36-52540065bddc

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                laisiyao Lai Siyao
                Reporter:
                jamesanunez James Nunez
              • Votes:
                0 Vote for this issue
                Watchers:
                5 Start watching this issue

                Dates

                • Created:
                  Updated: