[LU-12654] parallel-scale-nfsv4 hangs in test racer_on_nfs Created: 09/Aug/19 Updated: 09/Jan/24 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0, Lustre 2.12.3, Lustre 2.14.0, Lustre 2.12.4, Lustre 2.12.5, Lustre 2.12.7, Lustre 2.12.8 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | rhel8 | ||
| Environment: |
RHEL 8.0 clients |
||
| Issue Links: |
|
||||
| Severity: | 3 | ||||
| Rank (Obsolete): | 9223372036854775807 | ||||
| Description |
|
parallel-scale-nfsv4 test_racer_on_nfs hangs for RHEL 8.0 clients. Looking at a recent b2_12 hang at https://testing.whamcloud.com/test_sets/088a338c-b5bc-11e9-b88c-52540065bddc [128401.640876] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 05:38:13 (1564810693) [128401.999359] Lustre: DEBUG MARKER: MDSCOUNT=1 OSTCOUNT=7 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs [128615.457711] 1[7004]: segfault at 1b00 ip 0000000000001b00 sp 00007ffe851dc380 error 14 [128615.459458] Code: Bad RIP value. [128615.505546] systemd-coredump[7103]: Not enough arguments passed by the kernel (0, expected 7). On the console log for client 2 (vm6), we see [128401.602782] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 05:38:13 (1564810693) [128401.976730] Lustre: DEBUG MARKER: MDSCOUNT=1 OSTCOUNT=7 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs [128455.523450] 3[1583]: segfault at 8 ip 00007f97d78d50d3 sp 00007fffe77a2c80 error 4 in ld-2.28.so[7f97d78c9000+28000] [128455.527725] Code: 85 35 15 00 00 49 83 bb f0 00 00 00 00 48 c7 85 18 ff ff ff 00 00 00 00 0f 85 ce 13 00 00 49 8b 43 68 49 83 bb f8 00 00 00 00 <48> 8b 40 08 48 89 85 40 ff ff ff 0f 84 5c 0c 00 00 45 85 ed 74 57 [128455.806473] FS-Cache: Duplicate cookie detected [128455.808605] FS-Cache: O-cookie c=000000007cfc03b1 [p=0000000080c9e5cd fl=222 nc=0 na=1] [128455.811322] FS-Cache: O-cookie d=00000000bcc65360 n=00000000662f4ab2 [128455.813515] FS-Cache: O-key=[10] '0400020008010a090002' [128455.815189] FS-Cache: N-cookie c=00000000643d95b4 [p=0000000080c9e5cd fl=2 nc=0 na=1] [128455.817621] FS-Cache: N-cookie d=00000000bcc65360 n=0000000005cc644a [128455.819573] FS-Cache: N-key=[10] '0400020008010a090002' [128455.859830] FS-Cache: Duplicate cookie detected [128455.861408] FS-Cache: O-cookie c=000000007cfc03b1 [p=0000000080c9e5cd fl=222 nc=0 na=1] [128455.863994] FS-Cache: O-cookie d=00000000bcc65360 n=00000000662f4ab2 [128455.866117] FS-Cache: O-key=[10] '0400020008010a090002' [128455.867879] FS-Cache: N-cookie c=00000000643d95b4 [p=0000000080c9e5cd fl=2 nc=0 na=1] [128455.870282] FS-Cache: N-cookie d=00000000bcc65360 n=000000003e9860aa [128455.872260] FS-Cache: N-key=[10] '0400020008010a090002' [128601.980237] 18[24202]: segfault at 8 ip 00007feb1dc170d3 sp 00007ffe58b0ddb0 error 4 in ld-2.28.so[7feb1dc0b000+28000] [128601.984214] Code: 85 35 15 00 00 49 83 bb f0 00 00 00 00 48 c7 85 18 ff ff ff 00 00 00 00 0f 85 ce 13 00 00 49 8b 43 68 49 83 bb f8 00 00 00 00 <48> 8b 40 08 48 89 85 40 ff ff ff 0f 84 5c 0c 00 00 45 85 ed 74 57 [128702.709758] NFS: server trevis-39vm8 error: fileid changed [128702.709758] fsid 0:52: expected fileid 0x6eda66c5, got 0x6eda6609 [132036.334902] sysrq: SysRq : Changing Loglevel [132036.336286] sysrq: Loglevel set to 8 [132036.339034] sysrq: SysRq : Show State [132036.340124] task PC stack pid father [132036.341703] systemd S 0 1 0 0x00000000 [132036.343164] Call Trace: [132036.344180] ? __schedule+0x254/0x840 [132036.345162] schedule+0x28/0x70 [132036.346067] schedule_hrtimeout_range_clock+0x18d/0x1a0 [132036.347389] ? ep_scan_ready_list.constprop.22+0x20c/0x230 [132036.348748] ep_poll+0x3c8/0x410 [132036.349774] ? wake_up_q+0x70/0x70 [132036.350722] do_epoll_wait+0xb0/0xd0 [132036.351677] __x64_sys_epoll_wait+0x1a/0x20 [132036.352794] do_syscall_64+0x5b/0x1b0 [132036.353838] entry_SYSCALL_64_after_hwframe+0x65/0xca [132036.355224] RIP: 0033:0x7f5db4153d97 [132036.356244] Code: Bad RIP value. On the MDS console, we see some errors and then hundreds of “RPC request reserved *” messages: [128316.858258] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 05:38:13 (1564810693) [128322.237425] LustreError: 32025:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20006e9a6:0x30b4:0x0] get parent: rc = -2 [128322.242093] LustreError: 32025:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 33 previous similar messages [128354.249725] LustreError: 32024:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20006e9a6:0x328d:0x0] get parent: rc = -2 [128354.252140] LustreError: 32024:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 95 previous similar messages [128424.633861] LustreError: 32027:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20006e9a6:0x36e4:0x0] get parent: rc = -2 [128424.638621] LustreError: 32027:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 119 previous similar messages [128508.193828] reconnect_path: npd != pd [128552.804263] LustreError: 32027:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20006e9a6:0x4139:0x0] get parent: rc = -2 [128552.807733] LustreError: 32027:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 277 previous similar messages [128618.030608] RPC request reserved 84 but used 272 [128618.031231] RPC request reserved 156 but used 272 … In some of these test hangs, we see a process on the client hung. Looking at the console log for client 2 (vm9) for https://testing.whamcloud.com/test_sets/4a3da7ac-b376-11e9-b753-52540065bddc , we see ======================================= 08:16:41 \(1564561001\) [ 2741.696387] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 08:16:41 (1564561001) [ 2742.041451] Lustre: DEBUG MARKER: MDSCOUNT=1 OSTCOUNT=7 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs [ 2978.402841] 18[2351]: segfault at 1b00 ip 0000000000001b00 sp 00007ffe24da9dd0 error 14 [ 2978.404661] Code: Bad RIP value. [ 2978.523782] FS-Cache: Duplicate cookie detected [ 2978.524726] FS-Cache: O-cookie c=0000000058c6eee2 [p=00000000fee62d73 fl=222 nc=0 na=1] [ 2978.526175] FS-Cache: O-cookie d=000000004f183641 n=00000000d718a10f [ 2978.527299] FS-Cache: O-key=[10] '0400020008010a090002' [ 2978.528241] FS-Cache: N-cookie c=000000007561c599 [p=00000000fee62d73 fl=2 nc=0 na=1] [ 2978.529707] FS-Cache: N-cookie d=000000004f183641 n=000000005df420e8 [ 2978.530869] FS-Cache: N-key=[10] '0400020008010a090002' [ 2978.536357] FS-Cache: Duplicate cookie detected [ 2978.537374] FS-Cache: O-cookie c=0000000058c6eee2 [p=00000000fee62d73 fl=222 nc=0 na=1] [ 2978.538877] FS-Cache: O-cookie d=000000004f183641 n=00000000d718a10f [ 2978.540002] FS-Cache: O-key=[10] '0400020008010a090002' [ 2978.540951] FS-Cache: N-cookie c=000000007561c599 [p=00000000fee62d73 fl=2 nc=0 na=1] [ 2978.542297] FS-Cache: N-cookie d=000000004f183641 n=000000001ed1f7ed [ 2978.543498] FS-Cache: N-key=[10] '0400020008010a090002' [ 3318.608229] INFO: task dir_create.sh:15909 blocked for more than 120 seconds. [ 3318.609671] Tainted: G OE --------- - - 4.18.0-80.4.2.el8_0.x86_64 #1 [ 3318.611089] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 3318.612449] dir_create.sh D 0 15909 15897 0x80000086 [ 3318.613433] Call Trace: [ 3318.613989] ? __schedule+0x254/0x840 [ 3318.614678] schedule+0x28/0x70 [ 3318.615312] io_schedule+0x12/0x40 [ 3318.615980] wait_on_page_bit_common+0x117/0x250 [ 3318.616830] ? page_cache_tree_insert+0xe0/0xe0 [ 3318.617660] __filemap_fdatawait_range+0xe4/0x140 [ 3318.618515] ? __filemap_fdatawrite_range+0xc8/0xf0 [ 3318.619402] filemap_write_and_wait_range+0x45/0x80 [ 3318.620326] nfs_file_fsync+0x44/0x1e0 [nfs] [ 3318.621118] filp_close+0x31/0x60 [ 3318.621772] put_files_struct+0x6c/0xc0 [ 3318.622503] do_exit+0x2bd/0xad0 [ 3318.623125] ? try_to_wake_up+0x54/0x4b0 [ 3318.623852] do_group_exit+0x3a/0xa0 [ 3318.624532] get_signal+0x27c/0x590 [ 3318.625207] ? __send_signal+0x333/0x4a0 [ 3318.625942] do_signal+0x36/0x610 [ 3318.626583] ? do_send_sig_info+0x65/0x90 [ 3318.627338] exit_to_usermode_loop+0x89/0xf0 [ 3318.628128] do_syscall_64+0x182/0x1b0 [ 3318.628840] entry_SYSCALL_64_after_hwframe+0x65/0xca [ 3318.629767] RIP: 0033:0x7f7e25af4c3b [ 3318.630442] Code: Bad RIP value. [ 3318.631057] RSP: 002b:00007fff0b5d6338 EFLAGS: 00000202 ORIG_RAX: 000000000000003e [ 3318.632354] RAX: 0000000000000000 RBX: 000055893f033960 RCX: 00007f7e25af4c3b [ 3318.633609] RDX: 000055893e5d36a0 RSI: 000000000000000f RDI: 0000000000003e25 [ 3318.634846] RBP: 000055893f028600 R08: 00007fff0b5d6270 R09: 0000000000000007 [ 3318.636092] R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000000 [ 3318.637334] R13: 00007fff0b5d6350 R14: 00007fff0b5d6354 R15: 0000000000000000 [ 3441.482869] INFO: task dir_create.sh:15909 blocked for more than 120 seconds. [ 3441.484224] Tainted: G OE --------- - - 4.18.0-80.4.2.el8_0.x86_64 #1 We’ve seen variations of this error for master clients and b2_12 clients |
| Comments |
| Comment by Andreas Dilger [ 09/Aug/19 ] |
|
Two things here:
|
| Comment by Peter Jones [ 10/Oct/19 ] |
|
Lai Could you please advise on this one? Thanks Peter |
| Comment by James Nunez (Inactive) [ 18/Nov/19 ] |
|
Is this hang related? It's not racer_on_nfs, but it has some of the same error/failure characteristics; https://testing.whamcloud.com/test_sets/84138e04-0823-11ea-98f1-52540065bddc ? |
| Comment by Minh Diep [ 13/Dec/19 ] |
|
I don't see fscache enabled -bash-4.4# cat /proc/fs/fscache/stats FS-Cache statistics Cookies: idx=5 dat=0 spc=0 Objects: alc=0 nal=0 avl=0 ded=0 ChkAux : non=0 ok=0 upd=0 obs=0 Pages : mrk=0 unc=0 Acquire: n=8 nul=0 noc=0 ok=5 nbf=0 oom=0 Lookups: n=0 neg=0 pos=0 crt=0 tmo=0 Invals : n=0 run=0 Updates: n=0 nul=0 run=0 Relinqs: n=3 nul=0 wcr=0 rtr=0 AttrChg: n=0 ok=0 nbf=0 oom=0 run=0 Allocs : n=0 ok=0 wt=0 nbf=0 int=0 Allocs : ops=0 owt=0 abt=0 Retrvls: n=0 ok=0 wt=0 nod=0 nbf=0 int=0 oom=0 Retrvls: ops=0 owt=0 abt=0 Stores : n=0 ok=0 agn=0 nbf=0 oom=0 Stores : ops=0 run=0 pgs=0 rxd=0 olm=0 VmScan : nos=0 gon=0 bsy=0 can=0 wt=0 Ops : pend=0 run=0 enq=0 can=0 rej=0 Ops : ini=0 dfr=0 rel=0 gc=0 CacheOp: alo=0 luo=0 luc=0 gro=0 CacheOp: inv=0 upo=0 dro=0 pto=0 atc=0 syn=0 CacheOp: rap=0 ras=0 alp=0 als=0 wrp=0 ucp=0 dsp=0 CacheEv: nsp=0 stl=0 rtr=0 cul=0 |
| Comment by Jian Yu [ 29/Jan/20 ] |
|
The failure also occurred on Lustre b2_12 branch with RHEL 8.1 client: |
| Comment by Sarah Liu [ 29/Nov/21 ] |
|
Also see similar error on b2_12 branch Ubuntu 18.04 client client 1 dmesg shows [Sat Nov 20 10:59:38 2021] Lustre: DEBUG MARKER: dmesg [Sat Nov 20 10:59:38 2021] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 10:59:39 \(1637405979\) [Sat Nov 20 10:59:38 2021] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 10:59:39 (1637405979) [Sat Nov 20 10:59:39 2021] Lustre: DEBUG MARKER: MDSCOUNT=1 OSTCOUNT=7 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs [Sat Nov 20 11:00:13 2021] 7[26754]: segfault at 8 ip 00007f137b9049a9 sp 00007ffc9653c560 error 4 in ld-2.27.so[7f137b8f8000+29000] [Sat Nov 20 11:00:28 2021] 19[18133]: segfault at 8 ip 00007f621c1209a9 sp 00007fffe5a23be0 error 4 in ld-2.27.so[7f621c114000+29000] MDS console log shows [ 6920.488820] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 10:59:39 \(1637405979\) [ 6920.686899] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 10:59:39 (1637405979) [ 6923.513485] LustreError: 24621:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x200000405:0x307e:0x0] get parent: rc = -2 [ 6927.362117] LustreError: 24622:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x200000405:0x3092:0x0] get parent: rc = -2 [ 6927.364552] LustreError: 24622:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 1 previous similar message [ 6930.876962] LustreError: 24624:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x200000405:0x30e9:0x0] get parent: rc = -2 [ 6930.879476] LustreError: 24624:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 2 previous similar messages [ 6934.054373] LustreError: 24619:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x200000405:0x3146:0x0] get parent: rc = -2 [ 6934.064686] LustreError: 24619:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 3 previous similar messages [ 6943.939346] LustreError: 24623:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x200000405:0x329c:0x0] get parent: rc = -2 [ 6943.941779] LustreError: 24623:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 18 previous similar messages [ 6952.134561] LustreError: 24618:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x200000405:0x33a4:0x0] get parent: rc = -2 [ 6952.137002] LustreError: 24618:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 3 previous similar messages [ 6978.345219] LustreError: 24620:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x200000405:0x3615:0x0] get parent: rc = -2 [ 6978.354358] LustreError: 24620:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 9 previous similar messages [ 7014.365989] LustreError: 24620:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x200000405:0x39e1:0x0] get parent: rc = -2 [ 7014.373482] LustreError: 24620:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 50 previous similar messages [ 7079.605839] LustreError: 24624:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x200000405:0x4293:0x0] get parent: rc = -2 [ 7079.612565] LustreError: 24624:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 151 previous similar messages [ 7213.605688] LustreError: 24620:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x200000405:0x5162:0x0] get parent: rc = -2 [ 7213.611971] LustreError: 24620:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 366 previous similar messages [ 7221.071656] RPC request reserved 84 but used 276 [ 7221.076982] RPC request reserved 84 but used 276 [ 7221.094387] RPC request reserved 84 but used 276 [ 7221.110672] RPC request reserved 84 but used 276 [ 7221.112185] RPC request reserved 84 but used 276 [ 7224.462275] RPC request reserved 84 but used 276 [ 7227.794169] RPC request reserved 84 but used 276 [ 7228.050120] RPC request reserved 84 but used 276 [ 7231.374471] RPC request reserved 84 but used 276 |