Details
- Type: Bug
- Resolution: Unresolved
- Priority: Minor
- Fix Version/s: None
- Affects Version/s: Lustre 2.14.0
- Component/s: None
- Labels: DNE
- Severity: 3
- Rank: 9223372036854775807
Description
parallel-scale-nfsv4 test_compilebench fails with 'compilebench failed: 1'. Looking at a recent failure at https://testing.whamcloud.com/test_sets/8506fed8-8f1e-4d88-b02b-ace481b2b76a, we see the following error:
== parallel-scale-nfsv4 test compilebench: compilebench ============================================== 01:26:15 (1610846775)
OPTIONS: cbench_DIR=/usr/bin cbench_IDIRS=2 cbench_RUNS=2
trevis-71vm1.trevis.whamcloud.com trevis-71vm2
free space = 14256896 KB
./compilebench -D /mnt/lustre/d0.parallel-scale-nfs/d0.compilebench.2951786 -i 2 -r 2 --makej
using working directory /mnt/lustre/d0.parallel-scale-nfs/d0.compilebench.2951786, 2 intial dirs 2 runs
native unpatched native-0 222MB in 149.44 seconds (1.49 MB/s)
native patched native-0 109MB in 45.34 seconds (2.42 MB/s)
native patched compiled native-0 691MB in 53.28 seconds (12.98 MB/s)
create dir kernel-0 222MB in 143.40 seconds (1.55 MB/s)
create dir kernel-1 222MB in 155.16 seconds (1.43 MB/s)
compile dir kernel-1 680MB in 58.32 seconds (11.67 MB/s)
Traceback (most recent call last):
  File "./compilebench", line 594, in <module>
    if not compile_one_dir(dset, rnd):
  File "./compilebench", line 368, in compile_one_dir
    mbs = run_directory(ch[0], dir, "compile dir")
  File "./compilebench", line 245, in run_directory
    fp.close()
IOError: [Errno 5] Input/output error
parallel-scale-nfsv4 test_compilebench: @@@@@@ FAIL: compilebench failed: 1
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:6273:error()
= /usr/lib64/lustre/tests/functions.sh:351:run_compilebench()
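The traceback shows compilebench getting EIO from fp.close() on a file it has just written through the NFS mount. With buffered writes on an NFS client, a server-side write failure is usually only reported to the application when the dirty pages are flushed at close() or fsync(), so a bulk write failure on the Lustre side (see the OSS log below) can surface exactly like this. A minimal sketch of that behaviour, with a hypothetical /mnt/nfs path and payload:

#!/usr/bin/env python3
# Minimal sketch, assuming /mnt/nfs is an NFS client mount of the Lustre
# re-export used by this test; the path and payload are hypothetical.
# write() normally succeeds because the data only reaches the page cache;
# a server-side failure is reported when the pages are written back, i.e.
# at close() or fsync(), which matches the IOError raised from fp.close()
# in compilebench's run_directory().
import errno

def write_then_close(path, payload):
    fp = open(path, "wb")
    try:
        fp.write(payload)          # usually succeeds even if the server is failing
    finally:
        try:
            fp.close()             # writeback happens here; EIO can surface now
        except OSError as exc:     # IOError is an alias of OSError in Python 3
            if exc.errno == errno.EIO:
                print("deferred write error surfaced at close():", exc)
            raise

if __name__ == "__main__":
    write_then_close("/mnt/nfs/compilebench-scratch/testfile", b"x" * (1 << 20))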
The server console logs show multiple network errors. In the OSS (vm3) console log, we see:
[107931.885582] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test compilebench: compilebench ============================================== 01:26:15 (1610846775)
[107932.481635] Lustre: DEBUG MARKER: /usr/sbin/lctl mark .\/compilebench -D \/mnt\/lustre\/d0.parallel-scale-nfs\/d0.compilebench.2951786 -i 2 -r 2 --makej
[107933.042074] Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.parallel-scale-nfs/d0.compilebench.2951786 -i 2 -r 2 --makej
[108719.026251] LNetError: 18274:0:(socklnd.c:1696:ksocknal_destroy_conn()) Completing partial receive from 12345-10.9.4.19@tcp[2], ip 10.9.4.19:1021, with error, wanted: 286961, left: 286961, last alive is 0 secs ago
[108719.029742] LustreError: 18274:0:(events.c:458:server_bulk_callback()) event type 3, status -5, desc 00000000b69b67b2
[108719.031582] LustreError: 1997118:0:(ldlm_lib.c:3518:target_bulk_io()) @@@ network error on bulk WRITE req@00000000a03f2779 x1689004403769024/t0(0) o4->f5d9198b-e3d2-4ea0-ba76-29ca5121b852@10.9.4.19@tcp:719/0 lens 488/448 e 1 to 0 dl 1610847579 ref 1 fl Interpret:/0/0 rc 0/0 job:'nfsd.0'
[108719.035763] Lustre: lustre-OST0007: Bulk IO write error with f5d9198b-e3d2-4ea0-ba76-29ca5121b852 (at 10.9.4.19@tcp), client will retry: rc = -110
[108726.258716] Lustre: 18281:0:(client.c:2282:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1610847560/real 1610847560] req@000000005d27034e x1688982717184768/t0(0) o400->MGC10.9.4.19@tcp@10.9.4.19@tcp:26/25 lens 224/224 e 0 to 1 dl 1610847571 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:5.0'
[108726.262736] Lustre: lustre-MDT0000-lwp-OST0004: Connection to lustre-MDT0000 (at 10.9.4.19@tcp) was lost; in progress operations using this service will wait for recovery to complete
[108726.263957] Lustre: 18281:0:(client.c:2282:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
[108726.266537] Lustre: Skipped 1 previous similar message
[108726.268198] LustreError: 166-1: MGC10.9.4.19@tcp: Connection to MGS (at 10.9.4.19@tcp) was lost; in progress operations using this service will fail
[108736.434536] Lustre: 18282:0:(client.c:2282:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1610847570/real 1610847570] req@00000000700c3125 x1688982717189952/t0(0) o400->lustre-MDT0002-lwp-OST0003@10.9.4.19@tcp:12/10 lens 224/224 e 0 to 1 dl 1610847581 ref 1 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'kworker/u4:3.0'
[108736.439954] Lustre: 18282:0:(client.c:2282:ptlrpc_expire_one_request()) Skipped 31 previous similar messages
[108743.154343] Lustre: ll_ost_io00_024: service thread pid 1997067 was inactive for 41.012 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[108743.155010] Pid: 1997053, comm: ll_ost_io00_015 4.18.0-240.1.1.el8_lustre.x86_64 #1 SMP Wed Dec 9 21:09:51 UTC 2020
[108743.157608] Lustre: Skipped 1 previous similar message
[108743.159443] Call Trace TBD:
[108743.160869] Pid: 1997067, comm: ll_ost_io00_024 4.18.0-240.1.1.el8_lustre.x86_64 #1 SMP Wed Dec 9 21:09:51 UTC 2020
[108743.162581] Call Trace TBD:
[108743.163076] Pid: 1997070, comm: ll_ost_io00_027 4.18.0-240.1.1.el8_lustre.x86_64 #1 SMP Wed Dec 9 21:09:51 UTC 2020
[108743.163670] Lustre: ll_ost_io00_062: service thread pid 1997107 was inactive for 41.022 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[108743.164929] Call Trace TBD:
[108753.458192] LustreError: 1997201:0:(sec.c:2538:sptlrpc_svc_unwrap_bulk()) @@@ truncated bulk GET 0(37648) req@00000000c2a5484e x1689004403769088/t0(0) o4->f5d9198b-e3d2-4ea0-ba76-29ca5121b852@10.9.4.19@tcp:744/0 lens 488/448 e 2 to 0 dl 1610847604 ref 1 fl Interpret:/0/0 rc 0/0 job:'nfsd.0'
[108753.458207] Lustre: lustre-OST0001: Bulk IO write error with f5d9198b-e3d2-4ea0-ba76-29ca5121b852 (at 10.9.4.19@tcp), client will retry: rc = -110
[108753.462443] LustreError: 1997201:0:(sec.c:2538:sptlrpc_svc_unwrap_bulk()) Skipped 1 previous similar message
[108762.061641] Lustre: lustre-OST0007: haven't heard from client lustre-MDT0000-mdtlov_UUID (at 10.9.4.19@tcp) in 49 seconds. I think it's dead, and I am evicting it. exp 00000000c1106cb7, cur 1610847607 expire 1610847577 last 1610847558
[108762.065243] Lustre: Skipped 7 previous similar messages
[108762.865902] LustreError: 1997068:0:(ldlm_lib.c:3506:target_bulk_io()) @@@ Eviction on bulk WRITE req@000000004cf7b69a x1689004403784192/t0(0) o4->f5d9198b-e3d2-4ea0-ba76-29ca5121b852@10.9.4.19@tcp:2/0 lens 488/448 e 2 to 0 dl 1610847617 ref 1 fl Interpret:/0/0 rc 0/0 job:'nfsd.0'
[108762.865964] Lustre: lustre-OST0003: Bulk IO write error with f5d9198b-e3d2-4ea0-ba76-29ca5121b852 (at 10.9.4.19@tcp), client will retry: rc = -107
[108762.870372] LustreError: 1997068:0:(ldlm_lib.c:3506:target_bulk_io()) Skipped 2 previous similar messages
[108762.872561] Lustre: Skipped 8 previous similar messages
[108763.529767] LustreError: 18276:0:(events.c:458:server_bulk_callback()) event type 3, status -110, desc 000000004e9cea01
[108763.545504] Lustre: lustre-OST0000: Client f5d9198b-e3d2-4ea0-ba76-29ca5121b852 (at 10.9.4.19@tcp) reconnecting
[108763.546465] Lustre: lustre-MDT0000-lwp-OST0005: Connection restored to 10.9.4.19@tcp (at 10.9.4.19@tcp)
[108763.547306] Lustre: Skipped 21 previous similar messages
[108763.550553] Lustre: Skipped 8 previous similar messages
[108763.598295] Lustre: lustre-OST0000: deleting orphan objects from 0x0:997331 to 0x0:998377
[108763.600040] Lustre: lustre-OST0001: deleting orphan objects from 0x0:545675 to 0x0:547777
[108763.613586] Lustre: lustre-OST0002: deleting orphan objects from 0x0:554315 to 0x0:554881
[108763.628447] Lustre: lustre-OST0003: deleting orphan objects from 0x0:548811 to 0x0:550145
[108763.638884] Lustre: lustre-OST0004: deleting orphan objects from 0x0:545259 to 0x0:545857
[108763.650571] Lustre: lustre-OST0005: deleting orphan objects from 0x0:546795 to 0x0:548097
[108763.660883] Lustre: lustre-OST0006: deleting orphan objects from 0x0:544427 to 0x0:545761
[108763.671884] Lustre: lustre-OST0007: deleting orphan objects from 0x0:543915 to 0x0:545377
[108763.682881] Lustre: lustre-OST0000: deleting orphan objects from 0x300000400:7381 to 0x300000400:7489
[108763.684620] Lustre: lustre-OST0001: deleting orphan objects from 0x340000400:1328 to 0x340000400:1409
[108763.686377] Lustre: lustre-OST0002: deleting orphan objects from 0x380000400:1165 to 0x380000400:1249
[108763.688168] Lustre: lustre-OST0003: deleting orphan objects from 0x3c0000400:1197 to 0x3c0000400:1281
[108763.689911] Lustre: lustre-OST0005: deleting orphan objects from 0x440000402:1164 to 0x440000402:1249
[108763.691615] Lustre: lustre-OST0004: deleting orphan objects from 0x400000400:1163 to 0x400000400:1249
[108763.693287] Lustre: lustre-OST0007: deleting orphan objects from 0x4c0000402:1163 to 0x4c0000402:1249
[108763.695065] Lustre: lustre-OST0006: deleting orphan objects from 0x480000400:1164 to 0x480000400:1249
[108768.054448] Lustre: DEBUG MARKER: /usr/sbin/lctl mark parallel-scale-nfsv4 test_compilebench: @@@@@@ FAIL: compilebench failed: 1
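The OSS timeline above (bulk WRITE network errors, ll_ost_io watchdogs, then eviction of the NFS server's Lustre client) lines up with the client-side errors on vm4 shown next. Each console log uses its own uptime-based timestamps, so when comparing nodes it is easier to pull out just the error lines and match them against the wall-clock times embedded in the messages (the sent/dl fields and the DEBUG MARKER lines). A rough helper for that, not part of the test framework; the console log file names are placeholders for whatever autotest collected for this session:

#!/usr/bin/env python3
# Rough helper (assumption: pointed at the per-node console logs collected for
# this session; file names passed on the command line are placeholders).
# Prints the LNet/Lustre error lines with their uptime timestamps so the OSS
# and MDS sequences can be lined up against the wall-clock times quoted
# inside the messages themselves.
import re
import sys

ERROR_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+"
    r"(?P<msg>(LNetError|LustreError|Lustre):.*"
    r"(network error|Bulk IO write error|truncated bulk|evict|was lost|timed out).*)",
    re.IGNORECASE)

def scan(path):
    with open(path, errors="replace") as fh:
        for raw in fh:
            m = ERROR_RE.search(raw)
            if m:
                print(f"{path}  [{m.group('uptime')}]  {m.group('msg').rstrip()}")

if __name__ == "__main__":
    # e.g. ./scan_console.py oss-vm3-console.log mds-vm4-console.log
    for console_log in sys.argv[1:]:
        scan(console_log)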
In the MDS1,3 (vm4) console log, we see:
[87633.096129] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test compilebench: compilebench ============================================== 01:26:15 (1610846775)
[87633.685094] Lustre: DEBUG MARKER: /usr/sbin/lctl mark .\/compilebench -D \/mnt\/lustre\/d0.parallel-scale-nfs\/d0.compilebench.2951786 -i 2 -r 2 --makej
[87634.228695] Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.parallel-scale-nfs/d0.compilebench.2951786 -i 2 -r 2 --makej
[88420.235627] LustreError: 2825:0:(events.c:208:client_bulk_callback()) event type 1, status -5, desc 00000000460de44c
[88420.238194] LustreError: 2825:0:(events.c:208:client_bulk_callback()) event type 1, status -5, desc 00000000ad9a45f5
[88420.240371] LustreError: 2825:0:(events.c:208:client_bulk_callback()) event type 1, status -5, desc 0000000009a3ca4b
[88420.242267] LustreError: 2825:0:(events.c:208:client_bulk_callback()) event type 1, status -5, desc 0000000009a3ca4b
[88420.244175] LustreError: 2825:0:(events.c:208:client_bulk_callback()) event type 1, status -5, desc 00000000344043f6
[88420.246332] Lustre: 2831:0:(client.c:2282:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1610847547/real 1610847547] req@00000000758886a2 x1689004403772096/t0(0) o4->lustre-OST0001-osc-ffff8c4f08916800@10.9.4.18@tcp:6/4 lens 488/448 e 1 to 1 dl 1610847585 ref 2 fl Rpc:eXQr/0/ffffffff rc 0/-1 job:'nfsd.0'
[88420.252345] Lustre: lustre-OST0001-osc-ffff8c4f08916800: Connection to lustre-OST0001 (at 10.9.4.18@tcp) was lost; in progress operations using this service will wait for recovery to complete
[88420.256827] LustreError: 2825:0:(events.c:208:client_bulk_callback()) event type 1, status -5, desc 000000004fc94c50
[88420.259263] LustreError: 2825:0:(events.c:208:client_bulk_callback()) event type 1, status -5, desc 000000009d02cc75
[88420.261181] LustreError: 2825:0:(events.c:208:client_bulk_callback()) event type 1, status -5, desc 000000008da0c078
[88421.820146] Lustre: 2830:0:(client.c:2282:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1610847547/real 1610847547] req@000000007a04513c x1689004403772160/t0(0) o4->lustre-OST0006-osc-ffff8c4f08916800@10.9.4.18@tcp:6/4 lens 13208/448 e 0 to 1 dl 1610847560 ref 2 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'nfsd.0'
[88421.825480] Lustre: 2830:0:(client.c:2282:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
[88421.827178] Lustre: lustre-OST0006-osc-ffff8c4f08916800: Connection to lustre-OST0006 (at 10.9.4.18@tcp) was lost; in progress operations using this service will wait for recovery to complete
[88421.830356] Lustre: Skipped 2 previous similar messages
[88429.004033] Lustre: 2830:0:(client.c:2282:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1610847561/real 0] req@00000000cffc2e75 x1689004403785344/t0(0) o13->lustre-OST0002-osc-MDT0000@10.9.4.18@tcp:7/4 lens 224/368 e 0 to 1 dl 1610847572 ref 2 fl Rpc:Xr/0/ffffffff rc 0/-1 job:'osp-pre-2-0.0'
[88429.009465] Lustre: lustre-OST0002-osc-MDT0000: Connection to lustre-OST0002 (at 10.9.4.18@tcp) was lost; in progress operations using this service will wait for recovery to complete
[88431.052053] Lustre: 2830:0:(client.c:2282:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1610847564/real 0] req@000000009518c3c5 x1689004403790976/t0(0) o13->lustre-OST0001-osc-MDT0000@10.9.4.18@tcp:7/4 lens 224/368 e 0 to 1 dl 1610847575 ref 2 fl Rpc:Xr/0/ffffffff rc 0/-1 job:'osp-pre-1-0.0'
[88431.057566] Lustre: 2830:0:(client.c:2282:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
[88431.914013] Lustre: MGS: Client 2869abe0-8e18-4316-bbca-0fcc869e5152 (at 10.9.4.18@tcp) reconnecting
[88432.756040] Lustre: lustre-OST0002-osc-MDT0002: Connection to lustre-OST0002 (at 10.9.4.18@tcp) was lost; in progress operations using this service will wait for recovery to complete
[88432.759262] Lustre: Skipped 5 previous similar messages
[88433.037572] Lustre: lustre-MDT0002: Client lustre-MDT0002-lwp-OST0000_UUID (at 10.9.4.18@tcp) reconnecting
[88433.039917] Lustre: Skipped 1 previous similar message
[88435.147954] Lustre: 2831:0:(client.c:2282:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1610847566/real 0] req@000000009ae9c8ba x1689004403794112/t0(0) o13->lustre-OST0007-osc-MDT0002@10.9.4.18@tcp:7/4 lens 224/368 e 0 to 1 dl 1610847579 ref 2 fl Rpc:Xr/0/ffffffff rc 0/-1 job:'osp-pre-7-2.0'
[88435.153301] Lustre: 2831:0:(client.c:2282:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
[88436.419028] Lustre: lustre-MDT0000: Client lustre-MDT0000-lwp-OST0007_UUID (at 10.9.4.18@tcp) reconnecting
[88436.421320] Lustre: Skipped 12 previous similar messages
[88442.243709] Lustre: lustre-OST0007-osc-ffff8c4f08916800: Connection to lustre-OST0007 (at 10.9.4.18@tcp) was lost; in progress operations using this service will wait for recovery to complete
[88442.246871] Lustre: Skipped 9 previous similar messages
[88446.347612] Lustre: 2830:0:(client.c:2282:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1610847577/real 0] req@00000000cfe294af x1689004403805376/t0(0) o4->lustre-OST0007-osc-ffff8c4f08916800@10.9.4.18@tcp:6/4 lens 488/448 e 0 to 1 dl 1610847590 ref 3 fl Rpc:Xr/0/ffffffff rc 0/-1 job:'nfsd.0'
[88446.353001] Lustre: 2830:0:(client.c:2282:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
[88453.402600] Lustre: MGS: Client 2869abe0-8e18-4316-bbca-0fcc869e5152 (at 10.9.4.18@tcp) reconnecting
[88453.404504] Lustre: Skipped 18 previous similar messages
[88464.754061] LustreError: 11-0: lustre-OST0004-osc-ffff8c4f08916800: operation ldlm_enqueue to node 10.9.4.18@tcp failed: rc = -107
[88464.756507] LustreError: Skipped 1 previous similar message
[88464.757727] Lustre: lustre-OST0003-osc-ffff8c4f08916800: Connection to lustre-OST0003 (at 10.9.4.18@tcp) was lost; in progress operations using this service will wait for recovery to complete
[88464.760837] Lustre: Skipped 2 previous similar messages
[88464.764017] LustreError: 167-0: lustre-OST0002-osc-ffff8c4f08916800: This client was evicted by lustre-OST0002; in progress operations using this service will fail.
[88464.767473] Lustre: 2830:0:(llite_lib.c:3140:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.9.4.19@tcp:/lustre/fid: [0x2000923f5:0x14002:0x0]// may get corrupted (rc -5)
[88464.768214] Lustre: Evicted from lustre-OST0000_UUID (at 10.9.4.18@tcp) after server handle changed from 0x706c34710f6750f1 to 0x706c34711104c4da
[88464.770382] Lustre: 2830:0:(llite_lib.c:3140:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.9.4.19@tcp:/lustre/fid: [0x2000923f5:0x13fb6:0x0]/ may get corrupted (rc -5)
[88464.773555] Lustre: 2831:0:(llite_lib.c:3140:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.9.4.19@tcp:/lustre/fid: [0x2000923f5:0x13f7a:0x0]/ may get corrupted (rc -5)
[88464.796801] Lustre: lustre-OST0000-osc-MDT0000: Connection restored to 10.9.4.18@tcp (at 10.9.4.18@tcp)
[88464.798420] Lustre: Skipped 2 previous similar messages
[88464.843726] Lustre: 2831:0:(llite_lib.c:3140:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.9.4.19@tcp:/lustre/fid: [0x2000923f5:0x13f70:0x0]/ may get corrupted (rc -5)
[88464.853795] Lustre: 2830:0:(llite_lib.c:3140:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.9.4.19@tcp:/lustre/fid: [0x2000923f5:0x13ff4:0x0]/ may get corrupted (rc -5)
[88464.880568] Lustre: 2830:0:(llite_lib.c:3140:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.9.4.19@tcp:/lustre/fid: [0x2000923f5:0x13fb5:0x0]/ may get corrupted (rc -5)
[88464.883431] Lustre: 2831:0:(llite_lib.c:3140:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.9.4.19@tcp:/lustre/fid: [0x2000923f5:0x13fef:0x0]/ may get corrupted (rc -5)
[88464.886326] Lustre: 2831:0:(llite_lib.c:3140:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.9.4.19@tcp:/lustre/fid: [0x2000923f5:0x13ffb:0x0]/ may get corrupted (rc -5)
[88464.900822] Lustre: 2831:0:(llite_lib.c:3140:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.9.4.19@tcp:/lustre/fid: [0x2000923f5:0x13fe2:0x0]/ may get corrupted (rc -5)
[88464.904692] Lustre: 2830:0:(llite_lib.c:3140:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.9.4.19@tcp:/lustre/fid: [0x2000923f5:0x13f5a:0x0]/ may get corrupted (rc -5)
[88464.907792] Lustre: 2830:0:(llite_lib.c:3140:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.9.4.19@tcp:/lustre/fid: [0x2000923f5:0x13fe3:0x0]/ may get corrupted (rc -5)
[88468.976312] Lustre: DEBUG MARKER: /usr/sbin/lctl mark parallel-scale-nfsv4 test_compilebench: @@@@@@ FAIL: compilebench failed: 1
[88469.801555] Lustre: DEBUG MARKER: parallel-scale-nfsv4 test_compilebench: @@@@@@ FAIL: compilebench failed: 1
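The dirty page discard warnings list the FIDs of the files whose cached writes were dropped when the client was evicted. If it helps to know which compilebench files were affected, the FIDs can be resolved to paths on a node that still has the Lustre client mounted, e.g. with lfs fid2path. A small sketch, assuming the /mnt/lustre mount point used by the test and FIDs copied from the log above:

#!/usr/bin/env python3
# Sketch only: assumes it runs on a node with the Lustre client mounted at
# /mnt/lustre (as in this test session) and that the files still exist;
# compilebench may already have removed them. Resolves the FIDs from the
# "dirty page discard" warnings to path names via 'lfs fid2path'.
import subprocess

MOUNT = "/mnt/lustre"
FIDS = [
    "[0x2000923f5:0x14002:0x0]",   # copied from the MDS console log above
    "[0x2000923f5:0x13fb6:0x0]",
    "[0x2000923f5:0x13f7a:0x0]",
]

for fid in FIDS:
    try:
        path = subprocess.check_output(
            ["lfs", "fid2path", MOUNT, fid.strip("[]")], text=True).strip()
        print(f"{fid} -> {path}")
    except subprocess.CalledProcessError as exc:
        print(f"{fid} -> not resolvable (rc={exc.returncode})")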
The error looks like LU-10062, but we don't see the call trace from that ticket here.