[LU-12654] parallel-scale-nfsv4 hangs in test racer_on_nfs Created: 09/Aug/19  Updated: 09/Jan/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.12.3, Lustre 2.14.0, Lustre 2.12.4, Lustre 2.12.5, Lustre 2.12.7, Lustre 2.12.8
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Lai Siyao
Resolution: Unresolved Votes: 0
Labels: rhel8
Environment:

RHEL 8.0 clients


Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

parallel-scale-nfsv4 test_racer_on_nfs hangs for RHEL 8.0 clients.
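
For reference, a rough sketch of running just this subtest by hand, assuming the standard lustre/tests layout visible in the DEBUG MARKER lines below (the exact invocation used by autotest may differ):

# Run only the racer_on_nfs subtest of the NFSv4 wrapper script:
cd /usr/lib64/lustre/tests
ONLY=racer_on_nfs bash parallel-scale-nfsv4.sh

# Or through the auster wrapper (option spelling may vary by branch):
./auster -v parallel-scale-nfsv4 --only racer_on_nfs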

Looking at a recent b2_12 hang at https://testing.whamcloud.com/test_sets/088a338c-b5bc-11e9-b88c-52540065bddc, we see the following on the console log for client 1 (vm5):

[128401.640876] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 05:38:13 (1564810693)
[128401.999359] Lustre: DEBUG MARKER: MDSCOUNT=1 OSTCOUNT=7 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
[128615.457711] 1[7004]: segfault at 1b00 ip 0000000000001b00 sp 00007ffe851dc380 error 14
[128615.459458] Code: Bad RIP value.
[128615.505546] systemd-coredump[7103]: Not enough arguments passed by the kernel (0, expected 7).

On the console log for client 2 (vm6), we see

[128401.602782] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 05:38:13 (1564810693)
[128401.976730] Lustre: DEBUG MARKER: MDSCOUNT=1 OSTCOUNT=7 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
[128455.523450] 3[1583]: segfault at 8 ip 00007f97d78d50d3 sp 00007fffe77a2c80 error 4 in ld-2.28.so[7f97d78c9000+28000]
[128455.527725] Code: 85 35 15 00 00 49 83 bb f0 00 00 00 00 48 c7 85 18 ff ff ff 00 00 00 00 0f 85 ce 13 00 00 49 8b 43 68 49 83 bb f8 00 00 00 00 <48> 8b 40 08 48 89 85 40 ff ff ff 0f 84 5c 0c 00 00 45 85 ed 74 57
[128455.806473] FS-Cache: Duplicate cookie detected
[128455.808605] FS-Cache: O-cookie c=000000007cfc03b1 [p=0000000080c9e5cd fl=222 nc=0 na=1]
[128455.811322] FS-Cache: O-cookie d=00000000bcc65360 n=00000000662f4ab2
[128455.813515] FS-Cache: O-key=[10] '0400020008010a090002'
[128455.815189] FS-Cache: N-cookie c=00000000643d95b4 [p=0000000080c9e5cd fl=2 nc=0 na=1]
[128455.817621] FS-Cache: N-cookie d=00000000bcc65360 n=0000000005cc644a
[128455.819573] FS-Cache: N-key=[10] '0400020008010a090002'
[128455.859830] FS-Cache: Duplicate cookie detected
[128455.861408] FS-Cache: O-cookie c=000000007cfc03b1 [p=0000000080c9e5cd fl=222 nc=0 na=1]
[128455.863994] FS-Cache: O-cookie d=00000000bcc65360 n=00000000662f4ab2
[128455.866117] FS-Cache: O-key=[10] '0400020008010a090002'
[128455.867879] FS-Cache: N-cookie c=00000000643d95b4 [p=0000000080c9e5cd fl=2 nc=0 na=1]
[128455.870282] FS-Cache: N-cookie d=00000000bcc65360 n=000000003e9860aa
[128455.872260] FS-Cache: N-key=[10] '0400020008010a090002'
[128601.980237] 18[24202]: segfault at 8 ip 00007feb1dc170d3 sp 00007ffe58b0ddb0 error 4 in ld-2.28.so[7feb1dc0b000+28000]
[128601.984214] Code: 85 35 15 00 00 49 83 bb f0 00 00 00 00 48 c7 85 18 ff ff ff 00 00 00 00 0f 85 ce 13 00 00 49 8b 43 68 49 83 bb f8 00 00 00 00 <48> 8b 40 08 48 89 85 40 ff ff ff 0f 84 5c 0c 00 00 45 85 ed 74 57
[128702.709758] NFS: server trevis-39vm8 error: fileid changed
[128702.709758] fsid 0:52: expected fileid 0x6eda66c5, got 0x6eda6609
[132036.334902] sysrq: SysRq : Changing Loglevel
[132036.336286] sysrq: Loglevel set to 8
[132036.339034] sysrq: SysRq : Show State
[132036.340124]   task                        PC stack   pid father
[132036.341703] systemd         S    0     1      0 0x00000000
[132036.343164] Call Trace:
[132036.344180]  ? __schedule+0x254/0x840
[132036.345162]  schedule+0x28/0x70
[132036.346067]  schedule_hrtimeout_range_clock+0x18d/0x1a0
[132036.347389]  ? ep_scan_ready_list.constprop.22+0x20c/0x230
[132036.348748]  ep_poll+0x3c8/0x410
[132036.349774]  ? wake_up_q+0x70/0x70
[132036.350722]  do_epoll_wait+0xb0/0xd0
[132036.351677]  __x64_sys_epoll_wait+0x1a/0x20
[132036.352794]  do_syscall_64+0x5b/0x1b0
[132036.353838]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[132036.355224] RIP: 0033:0x7f5db4153d97
[132036.356244] Code: Bad RIP value.

On the MDS console, we see some errors and then hundreds of “RPC request reserved *” messages:

[128316.858258] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 05:38:13 (1564810693)
[128322.237425] LustreError: 32025:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20006e9a6:0x30b4:0x0] get parent: rc = -2
[128322.242093] LustreError: 32025:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 33 previous similar messages
[128354.249725] LustreError: 32024:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20006e9a6:0x328d:0x0] get parent: rc = -2
[128354.252140] LustreError: 32024:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 95 previous similar messages
[128424.633861] LustreError: 32027:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20006e9a6:0x36e4:0x0] get parent: rc = -2
[128424.638621] LustreError: 32027:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 119 previous similar messages
[128508.193828] reconnect_path: npd != pd
[128552.804263] LustreError: 32027:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20006e9a6:0x4139:0x0] get parent: rc = -2
[128552.807733] LustreError: 32027:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 277 previous similar messages
[128618.030608] RPC request reserved 84 but used 272
[128618.031231] RPC request reserved 156 but used 272
…

In some of these test hangs, we see a hung process on the client. Looking at the console log for client 2 (vm9) for https://testing.whamcloud.com/test_sets/4a3da7ac-b376-11e9-b753-52540065bddc, we see:

[ 2741.696387] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 08:16:41 (1564561001)
[ 2742.041451] Lustre: DEBUG MARKER: MDSCOUNT=1 OSTCOUNT=7 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
[ 2978.402841] 18[2351]: segfault at 1b00 ip 0000000000001b00 sp 00007ffe24da9dd0 error 14
[ 2978.404661] Code: Bad RIP value.
[ 2978.523782] FS-Cache: Duplicate cookie detected
[ 2978.524726] FS-Cache: O-cookie c=0000000058c6eee2 [p=00000000fee62d73 fl=222 nc=0 na=1]
[ 2978.526175] FS-Cache: O-cookie d=000000004f183641 n=00000000d718a10f
[ 2978.527299] FS-Cache: O-key=[10] '0400020008010a090002'
[ 2978.528241] FS-Cache: N-cookie c=000000007561c599 [p=00000000fee62d73 fl=2 nc=0 na=1]
[ 2978.529707] FS-Cache: N-cookie d=000000004f183641 n=000000005df420e8
[ 2978.530869] FS-Cache: N-key=[10] '0400020008010a090002'
[ 2978.536357] FS-Cache: Duplicate cookie detected
[ 2978.537374] FS-Cache: O-cookie c=0000000058c6eee2 [p=00000000fee62d73 fl=222 nc=0 na=1]
[ 2978.538877] FS-Cache: O-cookie d=000000004f183641 n=00000000d718a10f
[ 2978.540002] FS-Cache: O-key=[10] '0400020008010a090002'
[ 2978.540951] FS-Cache: N-cookie c=000000007561c599 [p=00000000fee62d73 fl=2 nc=0 na=1]
[ 2978.542297] FS-Cache: N-cookie d=000000004f183641 n=000000001ed1f7ed
[ 2978.543498] FS-Cache: N-key=[10] '0400020008010a090002'
[ 3318.608229] INFO: task dir_create.sh:15909 blocked for more than 120 seconds.
[ 3318.609671]       Tainted: G           OE    --------- -  - 4.18.0-80.4.2.el8_0.x86_64 #1
[ 3318.611089] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3318.612449] dir_create.sh   D    0 15909  15897 0x80000086
[ 3318.613433] Call Trace:
[ 3318.613989]  ? __schedule+0x254/0x840
[ 3318.614678]  schedule+0x28/0x70
[ 3318.615312]  io_schedule+0x12/0x40
[ 3318.615980]  wait_on_page_bit_common+0x117/0x250
[ 3318.616830]  ? page_cache_tree_insert+0xe0/0xe0
[ 3318.617660]  __filemap_fdatawait_range+0xe4/0x140
[ 3318.618515]  ? __filemap_fdatawrite_range+0xc8/0xf0
[ 3318.619402]  filemap_write_and_wait_range+0x45/0x80
[ 3318.620326]  nfs_file_fsync+0x44/0x1e0 [nfs]
[ 3318.621118]  filp_close+0x31/0x60
[ 3318.621772]  put_files_struct+0x6c/0xc0
[ 3318.622503]  do_exit+0x2bd/0xad0
[ 3318.623125]  ? try_to_wake_up+0x54/0x4b0
[ 3318.623852]  do_group_exit+0x3a/0xa0
[ 3318.624532]  get_signal+0x27c/0x590
[ 3318.625207]  ? __send_signal+0x333/0x4a0
[ 3318.625942]  do_signal+0x36/0x610
[ 3318.626583]  ? do_send_sig_info+0x65/0x90
[ 3318.627338]  exit_to_usermode_loop+0x89/0xf0
[ 3318.628128]  do_syscall_64+0x182/0x1b0
[ 3318.628840]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[ 3318.629767] RIP: 0033:0x7f7e25af4c3b
[ 3318.630442] Code: Bad RIP value.
[ 3318.631057] RSP: 002b:00007fff0b5d6338 EFLAGS: 00000202 ORIG_RAX: 000000000000003e
[ 3318.632354] RAX: 0000000000000000 RBX: 000055893f033960 RCX: 00007f7e25af4c3b
[ 3318.633609] RDX: 000055893e5d36a0 RSI: 000000000000000f RDI: 0000000000003e25
[ 3318.634846] RBP: 000055893f028600 R08: 00007fff0b5d6270 R09: 0000000000000007
[ 3318.636092] R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000000
[ 3318.637334] R13: 00007fff0b5d6350 R14: 00007fff0b5d6354 R15: 0000000000000000
[ 3441.482869] INFO: task dir_create.sh:15909 blocked for more than 120 seconds.
[ 3441.484224]       Tainted: G           OE    --------- -  - 4.18.0-80.4.2.el8_0.x86_64 #1

We’ve seen variations of this error for master clients
https://testing.whamcloud.com/test_sets/eab3b214-b8e7-11e9-98c8-52540065bddc

and b2_12 clients
https://testing.whamcloud.com/test_sets/f1613024-b64d-11e9-9f36-52540065bddc



 Comments   
Comment by Andreas Dilger [ 09/Aug/19 ]

Two things here:

  • the "ll_dir_get_parent_fid()" message is spurious - the NFS client can send an old file handle with a parent FID that may refer to a directory that racer has deleted. We should quiet this error message for -ENOENT at least.
  • the FS-Cache code is bypassing I/O to Lustre and directing it to a local cache device. This may improve NFS performance, but it significantly reduces the effectiveness of our NFS-on-Lustre testing. Also, there may be bugs in FS-Cache triggered by racer, and we don't want to be debugging those. The use of FS-Cache should be removed from our NFS testing configuration; a sketch of one way to check for and disable it follows below.
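
A minimal sketch of checking for and disabling FS-Cache on an NFS client, assuming the cache is enabled via the "fsc" mount option and backed by cachefilesd (the actual mount options used in our test configuration may differ, and the server/export/mountpoint names below are placeholders):

# Is any NFS mount on the client using FS-Cache, and has it cached anything?
grep nfs4 /proc/mounts | grep fsc
cat /proc/fs/fscache/stats

# Remount the re-exported Lustre filesystem without the fsc option
# (server:/mnt/lustre and /mnt/lustre_nfs are hypothetical names):
umount /mnt/lustre_nfs
mount -t nfs4 -o vers=4 server:/mnt/lustre /mnt/lustre_nfs

# Stop the userspace cache daemon so no local cache device is attached:
systemctl stop cachefilesd
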
Comment by Peter Jones [ 10/Oct/19 ]

Lai

Could you please advise on this one?

Thanks

Peter

Comment by James Nunez (Inactive) [ 18/Nov/19 ]

Is this hang related? It's not racer_on_nfs, but it has some of the same error/failure characteristics: https://testing.whamcloud.com/test_sets/84138e04-0823-11ea-98f1-52540065bddc

Comment by Minh Diep [ 13/Dec/19 ]

I don't see fscache enabled

-bash-4.4# cat /proc/fs/fscache/stats
FS-Cache statistics
Cookies: idx=5 dat=0 spc=0
Objects: alc=0 nal=0 avl=0 ded=0
ChkAux : non=0 ok=0 upd=0 obs=0
Pages  : mrk=0 unc=0
Acquire: n=8 nul=0 noc=0 ok=5 nbf=0 oom=0
Lookups: n=0 neg=0 pos=0 crt=0 tmo=0
Invals : n=0 run=0
Updates: n=0 nul=0 run=0
Relinqs: n=3 nul=0 wcr=0 rtr=0
AttrChg: n=0 ok=0 nbf=0 oom=0 run=0
Allocs : n=0 ok=0 wt=0 nbf=0 int=0
Allocs : ops=0 owt=0 abt=0
Retrvls: n=0 ok=0 wt=0 nod=0 nbf=0 int=0 oom=0
Retrvls: ops=0 owt=0 abt=0
Stores : n=0 ok=0 agn=0 nbf=0 oom=0
Stores : ops=0 run=0 pgs=0 rxd=0 olm=0
VmScan : nos=0 gon=0 bsy=0 can=0 wt=0
Ops    : pend=0 run=0 enq=0 can=0 rej=0
Ops    : ini=0 dfr=0 rel=0 gc=0
CacheOp: alo=0 luo=0 luc=0 gro=0
CacheOp: inv=0 upo=0 dro=0 pto=0 atc=0 syn=0
CacheOp: rap=0 ras=0 alp=0 als=0 wrp=0 ucp=0 dsp=0
CacheEv: nsp=0 stl=0 rtr=0 cul=0
Comment by Jian Yu [ 29/Jan/20 ]

The failure also occurred on the Lustre b2_12 branch with a RHEL 8.1 client:
https://testing.whamcloud.com/test_sets/0d46d9da-3ecd-11ea-b3fe-52540065bddc

Comment by Sarah Liu [ 29/Nov/21 ]

Also seeing a similar error on the b2_12 branch with an Ubuntu 18.04 client:
https://testing.whamcloud.com/test_sets/55e079ef-d783-42a4-901c-9d86191ca678

Client 1 dmesg shows:

[Sat Nov 20 10:59:38 2021] Lustre: DEBUG MARKER: dmesg
[Sat Nov 20 10:59:38 2021] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 10:59:39 \(1637405979\)
[Sat Nov 20 10:59:38 2021] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 10:59:39 (1637405979)
[Sat Nov 20 10:59:39 2021] Lustre: DEBUG MARKER: MDSCOUNT=1 OSTCOUNT=7 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
[Sat Nov 20 11:00:13 2021] 7[26754]: segfault at 8 ip 00007f137b9049a9 sp 00007ffc9653c560 error 4 in ld-2.27.so[7f137b8f8000+29000]
[Sat Nov 20 11:00:28 2021] 19[18133]: segfault at 8 ip 00007f621c1209a9 sp 00007fffe5a23be0 error 4 in ld-2.27.so[7f621c114000+29000]

The MDS console log shows:

[ 6920.488820] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 10:59:39 \(1637405979\)
[ 6920.686899] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 10:59:39 (1637405979)
[ 6923.513485] LustreError: 24621:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x200000405:0x307e:0x0] get parent: rc = -2
[ 6927.362117] LustreError: 24622:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x200000405:0x3092:0x0] get parent: rc = -2
[ 6927.364552] LustreError: 24622:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 1 previous similar message
[ 6930.876962] LustreError: 24624:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x200000405:0x30e9:0x0] get parent: rc = -2
[ 6930.879476] LustreError: 24624:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 2 previous similar messages
[ 6934.054373] LustreError: 24619:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x200000405:0x3146:0x0] get parent: rc = -2
[ 6934.064686] LustreError: 24619:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 3 previous similar messages
[ 6943.939346] LustreError: 24623:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x200000405:0x329c:0x0] get parent: rc = -2
[ 6943.941779] LustreError: 24623:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 18 previous similar messages
[ 6952.134561] LustreError: 24618:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x200000405:0x33a4:0x0] get parent: rc = -2
[ 6952.137002] LustreError: 24618:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 3 previous similar messages
[ 6978.345219] LustreError: 24620:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x200000405:0x3615:0x0] get parent: rc = -2
[ 6978.354358] LustreError: 24620:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 9 previous similar messages
[ 7014.365989] LustreError: 24620:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x200000405:0x39e1:0x0] get parent: rc = -2
[ 7014.373482] LustreError: 24620:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 50 previous similar messages
[ 7079.605839] LustreError: 24624:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x200000405:0x4293:0x0] get parent: rc = -2
[ 7079.612565] LustreError: 24624:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 151 previous similar messages
[ 7213.605688] LustreError: 24620:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x200000405:0x5162:0x0] get parent: rc = -2
[ 7213.611971] LustreError: 24620:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 366 previous similar messages
[ 7221.071656] RPC request reserved 84 but used 276
[ 7221.076982] RPC request reserved 84 but used 276
[ 7221.094387] RPC request reserved 84 but used 276
[ 7221.110672] RPC request reserved 84 but used 276
[ 7221.112185] RPC request reserved 84 but used 276
[ 7224.462275] RPC request reserved 84 but used 276
[ 7227.794169] RPC request reserved 84 but used 276
[ 7228.050120] RPC request reserved 84 but used 276
[ 7231.374471] RPC request reserved 84 but used 276