Details


    Description

      Hiya,

      these appeared last night.
      john100,101 are clients. arkle3,6 are OSSs. transom1 is running the fabric manager.

      in light of the similar looking LU-9305 I thought I would create this ticket.
      we run default (4M) rpcs on clients and servers. our OSTs are each 4 raidz3's in 1 zpool, and have recordsize=2M. 2 OSTs per OSS. 16 OSTs total.

      I suppose it could be an OPA network glitch, but it affected 2 clients and 2 servers, so that seems unlikely.

      we have just moved from zfs 0.7.5 to zfs 0.7.6. we ran ior and mdtest after this change and they were fine. these errors occurred a couple of days after that.
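
      For cross-checking, the sizes above can be verified directly. This is a minimal sketch (the zpool/zfs queries are guarded since they need ZFS installed, and pool names will vary):

```shell
#!/bin/sh
# Sketch: cross-check the I/O geometry described above.
# With 4M RPCs and recordsize=2M, each RPC spans two ZFS records:
echo $(( 4194304 / 2097152 ))    # 2

# Live queries (need ZFS installed, hence the guards):
if command -v zpool >/dev/null 2>&1; then
    zpool status                          # expect 4 raidz3 vdevs per OST pool
fi
if command -v zfs >/dev/null 2>&1; then
    zfs get -H -o name,value recordsize   # expect 2M on the OST datasets
fi
[ -r /sys/module/zfs/parameters/zfs_max_recordsize ] && \
    cat /sys/module/zfs/parameters/zfs_max_recordsize
true
```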

      Feb 19 23:45:12 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum 54e81a18
      Feb 19 23:45:12 john100 kernel: LNetError: 899:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.33@o2ib44, match 1591386600741136 length 1048576 too big: 1048176 left, 1048176 allowed
      Feb 19 23:45:12 arkle3 kernel: LNet: Using FMR for registration
      Feb 19 23:45:12 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff881764648600
      Feb 19 23:45:12 arkle3 kernel: LustreError: 298612:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff881764648600
      Feb 19 23:45:12 arkle3 kernel: LustreError: 337237:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff881687561450 x1591386600741136/t0(0) o4->8c8018f7-2e02-6c2b-cbcf-29133ecabf02@192.168.44.200@o2ib44:173/0 lens 608/448 e 0 to 0 dl 1519044318 ref 1 fl Interpret:/0/0 rc 0/0
      Feb 19 23:45:12 arkle3 kernel: Lustre: dagg-OST0005: Bulk IO write error with 8c8018f7-2e02-6c2b-cbcf-29133ecabf02 (at 192.168.44.200@o2ib44), client will retry: rc = -110
      Feb 19 23:45:12 arkle3 kernel: LNet: Skipped 1 previous similar message
      Feb 19 23:45:12 john101 kernel: LNetError: 904:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.36@o2ib44, match 1591895282655776 length 1048576 too big: 1048176 left, 1048176 allowed
      Feb 19 23:45:12 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8810cb714600
      Feb 19 23:45:12 arkle6 kernel: LustreError: 298397:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8810cb714600
      Feb 19 23:45:12 arkle6 kernel: LustreError: 42356:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff88276c3d5850 x1591895282655776/t0(0) o4->400cfa1c-7c7d-1d14-09ed-f6043574fd7c@192.168.44.201@o2ib44:173/0 lens 608/448 e 0 to 0 dl 1519044318 ref 1 fl Interpret:/0/0 rc 0/0
      Feb 19 23:45:12 arkle6 kernel: Lustre: dagg-OST000b: Bulk IO write error with 400cfa1c-7c7d-1d14-09ed-f6043574fd7c (at 192.168.44.201@o2ib44), client will retry: rc = -110
      Feb 19 23:45:12 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum 54e81a18 (type 4), client csum now efde5b36
      Feb 19 23:45:12 john100 kernel: LustreError: 924:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11  req@ffff880e7a9b4e00 x1591386600740944/t197580821951(197580821951) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/416 e 0 to 0 dl 1519044319 ref 2 fl Interpret:RM/0/0 rc 0/0
      Feb 19 23:45:13 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum df7fda9d
      Feb 19 23:45:13 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum df7fda9d (type 4), client csum now efde5b36
      Feb 19 23:45:13 john100 kernel: LustreError: 911:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11  req@ffff8807492fe300 x1591386600747696/t197580821955(197580821955) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/416 e 0 to 0 dl 1519044320 ref 2 fl Interpret:RM/0/0 rc 0/0
      Feb 19 23:45:15 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum 87da008a
      Feb 19 23:45:15 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum 87da008a (type 4), client csum now efde5b36
      Feb 19 23:45:15 john100 kernel: LustreError: 910:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11  req@ffff880eb3d51b00 x1591386600751360/t197580821956(197580821956) o4->dagg-OST0005-osc-ffff88015a1ad800@192.168.44.33@o2ib44:6/4 lens 608/416 e 0 to 0 dl 1519044322 ref 2 fl Interpret:RM/0/0 rc 0/0
      Feb 19 23:45:18 arkle3 kernel: LustreError: 168-f: dagg-OST0005: BAD WRITE CHECKSUM: from 12345-192.168.44.200@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111]: client csum efde5b36, server csum 1cc7a793
      Feb 19 23:45:18 john100 kernel: LustreError: 132-0: dagg-OST0005-osc-ffff88015a1ad800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.44.33@o2ib44 inode [0x20000c02e:0x9ef:0x0] object 0x0:2683981 extent [331258364-335450111], original client csum efde5b36 (type 4), server csum 1cc7a793 (type 4), client csum now efde5b36
      

      full log attached.
      there were hung and unkillable user processes on the 2 clients afterwards. a reboot of the 2 clients has cleared up the looping messages of the type shown below.

      Feb 20 14:28:15 arkle6 kernel: LustreError: 298395:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882e25ef2c00
      Feb 20 14:28:19 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88176544be00
      Feb 20 14:28:19 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8814072e0200
      Feb 20 14:28:19 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88176544be00
      Feb 20 14:28:19 arkle3 kernel: LustreError: 298611:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8814072e0200
      Feb 20 14:28:47 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff882e25ef3800
      Feb 20 14:28:47 arkle6 kernel: LustreError: 298394:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff882e25ef3800
      Feb 20 14:28:51 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff881615f0f000
      Feb 20 14:28:51 arkle3 kernel: LustreError: 298610:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff880be3931c00
      Feb 20 14:28:51 arkle3 kernel: LustreError: 298609:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff881615f0f000
      

      cheers,
      robin

      Attachments

      Issue Links

      Activity

            [LU-10683] write checksum errors
            scadmin SC Admin added a comment -

            Hi Amir,

            I set up a script to tail syslog and run dk on anything that hits a CHECKSUM error.
            however it doesn't appear safe to turn on +net.
            I've seen 132 stack traces like these across all the servers and clients since I turned on +net this afternoon, so I've now turned it off.

            Apr 12 18:57:25 arkle5 kernel: ------------[ cut here ]------------
            Apr 12 18:57:25 arkle5 kernel: WARNING: CPU: 1 PID: 127223 at kernel/softirq.c:151 __local_bh_enable_ip+0x82/0xb0
            Apr 12 18:57:25 arkle5 kernel: Modules linked in: sctp_diag sctp dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag ip6table_filter ip6_tables iptable_filter osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) mptctl mptbase 8021q garp mrp stp llc hfi1 sunrpc xfs dm_round_robin dcdbas intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr iTCO_wdt iTCO_vendor_support zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) mgag200 ttm dm_multipath drm_kms_helper ses syscopyarea enclosure dm_mod sysfillrect sysimgblt fb_sys_fops drm mei_me lpc_ich shpchp i2c_i801 sg mei
            Apr 12 18:57:25 arkle5 kernel: ipmi_si ipmi_devintf nfit ipmi_msghandler libnvdimm acpi_power_meter tpm_crb rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm iw_cm binfmt_misc ip_tables ib_ipoib ib_cm sr_mod cdrom sd_mod crc_t10dif crct10dif_generic bonding bnx2x rdmavt ahci i2c_algo_bit libahci crct10dif_pclmul mpt3sas mdio crct10dif_common i2c_core crc32c_intel ptp raid_class ib_core libata megaraid_sas scsi_transport_sas pps_core libcrc32c [last unloaded: hfi1]
            Apr 12 18:57:25 arkle5 kernel: CPU: 1 PID: 127223 Comm: hfi1_cq0 Tainted: P           OE  ------------   3.10.0-693.17.1.el7.x86_64 #1
            Apr 12 18:57:25 arkle5 kernel: Hardware name: Dell Inc. PowerEdge R740/06G98X, BIOS 1.3.7 02/08/2018
            Apr 12 18:57:25 arkle5 kernel: Call Trace:
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff816a6071>] dump_stack+0x19/0x1b
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff810895e8>] __warn+0xd8/0x100
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff8108972d>] warn_slowpath_null+0x1d/0x20
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff81091be2>] __local_bh_enable_ip+0x82/0xb0
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff816ade8e>] _raw_spin_unlock_bh+0x1e/0x20
            Apr 12 18:57:25 arkle5 kernel: [<ffffffffc06183b5>] cfs_trace_unlock_tcd+0x55/0x90 [libcfs]
            Apr 12 18:57:25 arkle5 kernel: [<ffffffffc0623708>] libcfs_debug_vmsg2+0x6d8/0xb40 [libcfs]
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff810cfb6c>] ? dequeue_entity+0x11c/0x5d0
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff810c95d5>] ? sched_clock_cpu+0x85/0xc0
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff8102954d>] ? __switch_to+0xcd/0x500
            Apr 12 18:57:25 arkle5 kernel: [<ffffffffc0623bc7>] libcfs_debug_msg+0x57/0x80 [libcfs]
            Apr 12 18:57:25 arkle5 kernel: [<ffffffffc069682a>] kiblnd_cq_completion+0x11a/0x160 [ko2iblnd]
            Apr 12 18:57:25 arkle5 kernel: [<ffffffffc03ab4a2>] send_complete+0x32/0x50 [rdmavt]
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff810b2ac0>] kthread_worker_fn+0x80/0x180
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff810b2a40>] ? kthread_stop+0xe0/0xe0
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff810b270f>] kthread+0xcf/0xe0
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff816b8798>] ret_from_fork+0x58/0x90
            Apr 12 18:57:25 arkle5 kernel: [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40
            Apr 12 18:57:25 arkle5 kernel: ---[ end trace aaf779f5b67c32db ]---
            Apr 12 18:57:25 arkle7 kernel: ------------[ cut here ]------------
            

            I also have one client that looks permanently upset now (john50 is a client, arkle3 is an OSS)

            Apr 12 21:28:05 john50 kernel: LNetError: 909:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.33@o2ib44, match 1594448751341360 length 1048576 too big: 1045288 left, 1045288 allowed
            Apr 12 21:28:05 arkle3 kernel: LustreError: 296380:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff880c21c90a00
            Apr 12 21:28:05 arkle3 kernel: LustreError: 296380:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff880c21c90a00
            Apr 12 21:28:05 arkle3 kernel: LustreError: 233673:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff8817643ce450 x1594448751341360/t0(0) o4->98af46e3-9fa3-6f5b-2dcd-f89325115978@192.168.44.150@o2ib44:626/0 lens 608/448 e 0 to 0 dl 1523532491 ref 1 fl Interpret:/0/0 rc 0/0
            Apr 12 21:28:05 arkle3 kernel: Lustre: dagg-OST0004: Bulk IO write error with 98af46e3-9fa3-6f5b-2dcd-f89325115978 (at 192.168.44.150@o2ib44), client will retry: rc = -110
            Apr 12 21:28:12 john50 kernel: Lustre: 933:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1523532485/real 1523532485]  req@ffff8817b5bd6900 x1594448751341360/t0(0) o4->dagg-OST0004-osc-ffff88189aa1f000@192.168.44.33@o2ib44:6/4 lens 608/448 e 0 to 1 dl 1523532492 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
            Apr 12 21:28:12 john50 kernel: Lustre: 933:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
            Apr 12 21:28:12 john50 kernel: Lustre: dagg-OST0004-osc-ffff88189aa1f000: Connection to dagg-OST0004 (at 192.168.44.33@o2ib44) was lost; in progress operations using this service will wait for recovery to complete
            Apr 12 21:28:12 john50 kernel: Lustre: Skipped 3 previous similar messages
            Apr 12 21:28:12 arkle3 kernel: Lustre: dagg-OST0004: Client 98af46e3-9fa3-6f5b-2dcd-f89325115978 (at 192.168.44.150@o2ib44) reconnecting
            Apr 12 21:28:12 arkle3 kernel: Lustre: dagg-OST0004: Connection restored to 98af46e3-9fa3-6f5b-2dcd-f89325115978 (at 192.168.44.150@o2ib44)
            Apr 12 21:28:12 john50 kernel: Lustre: dagg-OST0004-osc-ffff88189aa1f000: Connection restored to 192.168.44.33@o2ib44 (at 192.168.44.33@o2ib44)
            Apr 12 21:28:12 john50 kernel: Lustre: Skipped 2 previous similar messages
            Apr 12 21:28:12 john50 kernel: LNetError: 910:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.33@o2ib44, match 1594448751343216 length 1048576 too big: 1045288 left, 1045288 allowed
            Apr 12 21:28:12 arkle3 kernel: LustreError: 296379:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88169ed25a00
            Apr 12 21:28:12 arkle3 kernel: LustreError: 296379:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88169ed25a00
            Apr 12 21:28:12 arkle3 kernel: LustreError: 50811:0:(ldlm_lib.c:3242:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff8816dd9f2c50 x1594448751341360/t0(0) o4->98af46e3-9fa3-6f5b-2dcd-f89325115978@192.168.44.150@o2ib44:633/0 lens 608/448 e 0 to 0 dl 1523532498 ref 1 fl Interpret:/2/0 rc 0/0
            Apr 12 21:28:12 arkle3 kernel: Lustre: dagg-OST0004: Bulk IO write error with 98af46e3-9fa3-6f5b-2dcd-f89325115978 (at 192.168.44.150@o2ib44), client will retry: rc = -110
            Apr 12 21:28:19 john50 kernel: Lustre: dagg-OST0004-osc-ffff88189aa1f000: Connection to dagg-OST0004 (at 192.168.44.33@o2ib44) was lost; in progress operations using this service will wait for recovery to complete
            

            I'll reboot the client as (IIRC) this has cleared up this kind of problem in the past

            Apr 13 01:22:00 arkle3 kernel: LustreError: 296378:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff880f6ea28800
            Apr 13 01:22:07 arkle3 kernel: LustreError: 296377:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88150b4d4800
            Apr 13 01:22:07 arkle3 kernel: LustreError: 296377:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88150b4d4800
            Apr 13 01:22:15 arkle3 kernel: LustreError: 296377:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff88150b4d2800
            Apr 13 01:22:15 arkle3 kernel: LustreError: 296377:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff88150b4d2800
            

            debug +neterror (a default) is still enabled and the dk will still catch that. hopefully that will be enough for you.
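
            The tail-syslog-and-dk watcher described at the top of this comment might look roughly like this (a sketch: the log path, output directory, and match strings are assumptions, and the lctl call needs a live Lustre node):

```shell
#!/bin/sh
# Sketch of a syslog watcher that dumps the Lustre debug log whenever a
# checksum error appears. LOG/OUT and the match strings are assumptions.
LOG=/var/log/messages
OUT=/var/tmp/dk-dumps

matches_checksum() {
    # true for lines like "LustreError: 168-f: ...: BAD WRITE CHECKSUM: ..."
    case "$1" in
        *"BAD WRITE CHECKSUM"*|*"BAD READ CHECKSUM"*) return 0 ;;
        *) return 1 ;;
    esac
}

# The loop blocks forever, so only start it when invoked with --run.
if [ "${1-}" = "--run" ]; then
    mkdir -p "$OUT"
    tail -Fn0 "$LOG" | while IFS= read -r line; do
        matches_checksum "$line" && lctl dk > "$OUT/dk.$(date +%s)"
    done
fi
```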

            cheers,
            robin


            Hi Robin,

            From both client and server would be great, separated into two files respectively.

            thanks
            amir

            ashehata Amir Shehata (Inactive) added a comment -
            scadmin SC Admin added a comment -

            Hi Amir,

            the incidents are rare. the last one was 3 days ago. probably over-quota related. we've tried to reproduce them but haven't managed to with simple dd's etc.

            I guess I could turn on 'net neterror' debug and write a script to tail syslog and automatically run dk when it sees the next burst of CHECKSUM errors.
            does that sound ok?
            do you want dk just from servers or clients too?

            cheers,
            robin


            Is it possible to turn on net logging and capture logs for a short period of time when this happens:

            lctl set_param debug=+"net neterror"
            lctl dk > log

             

            ashehata Amir Shehata (Inactive) added a comment -
            rjh Robin Humble (Inactive) added a comment - - edited

            Hi Andreas,

            on our big /dagg filesystem I left the zfs recordsize at 2MB. it's a significant performance loss to set it to 1M - it halves the size of all the I/Os to the disks.

            however these events also happen to the /home lustre filesystem where the zfs recordsize is 1M and always has been. those are in the logs above.

            oh, damn, looks like /sys/module/zfs/parameters/zfs_max_recordsize=2M for these /home filesystems though, even though nothing uses that. that must be a leftover from testing.

            I haven't tried to change the RPC size for anything, sorry. all the lustre filesystems in this cluster seem to be using max_pages_per_rpc=1024. is that the right number to look at? we haven't tweaked anything around that. I just assumed 4M was the default these days.
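
            For reference, max_pages_per_rpc does appear to be the right knob: it counts 4 KiB pages, so 1024 pages is the 4M default mentioned above. A sketch of the arithmetic (the lctl lines need a live client and are left as comments):

```shell
#!/bin/sh
# max_pages_per_rpc is counted in 4 KiB pages (on x86_64).
pages_to_bytes() { echo $(( $1 * 4096 )); }

pages_to_bytes 1024    # 4194304 = 4 MiB, the "4M rpcs" above
pages_to_bytes 256     # 1048576 = 1 MiB

# On a live client (needs Lustre, hence left as comments):
#   lctl get_param osc.*.max_pages_per_rpc
#   lctl set_param osc.*.max_pages_per_rpc=256    # drop to 1 MiB RPCs
```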

            we aren't using any cap_mask= options for hfi1. on clients it is

            options hfi1 sge_copy_mode=2 krcvqs=4 piothreshold=0 wss_threshold=70 max_mtu=10240 eager_buffer_size=4194304
            

            and on lustre servers it's

            options hfi1 sge_copy_mode=2 krcvqs=8 piothreshold=0 wss_threshold=70
            

            with max_mtu=10240 being a driver default, and the default eager_buffer_size is 2097152. but AFAIK lustre's verbs doesn't use that eager_buffer stuff - it's for PSM2 comms I think.

            I actually don't know what most of these options do. it's just what various Intel people and docs told us was good.
            only krcvqs seems to impact performance (<=half a socket seems best), so we've tuned that a bit.

            Oleg - yes, it's an hdf5 library. we're looking into the hdf5 code and will try to communicate with the user and see what options they used when calling it in parallel. I suspect it uses whatever MPIIO uses, but haven't looked at either of those in years.

            thanks for looking into this.

            cheers,
            robin

            green Oleg Drokin added a comment -

            Which hdf5 library (I assume from the H5 prefix) do you use for this io (if any), and does it use direct IO internally by any chance?


            Hi Robin, did you reduce the recordsize to 1MB on the filesystem? We haven't done any testing ourselves with larger recordsize. The clients would need to be remounted to also get a smaller RPC size (they default to RPC size == max blocksize for ZFS at mount).

            Also, are you using TID RDMA (cap_mask=....) for your OPA connection? We've seen problems with that under load, and if yes it should be disabled.
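
            Sketched out, the recordsize change described above would look something like the following (dataset, mount, and MGS names are illustrative, and the live steps are left as comments since they need ZFS/Lustre):

```shell
#!/bin/sh
# Sketch of reducing recordsize and remounting clients to renegotiate the
# RPC size. Dataset/mount/MGS names below are illustrative only.
echo $(( 1024 * 1024 ))    # 1048576: recordsize=1M in bytes

# On each OSS (only newly written blocks pick up the smaller recordsize):
#   zfs set recordsize=1M oss-pool/ost0
# On each client, remount so RPC size == max blocksize is re-evaluated:
#   umount /mnt/dagg
#   mount -t lustre mgs-nid@o2ib44:/dagg /mnt/dagg
```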

            adilger Andreas Dilger added a comment -
            pjones Peter Jones added a comment -

            Hongchao

            Can you please advise with this one?

            Thanks

            Peter


            a bunch more CHECKSUM and LNet errors today. this lot were again definitely associated with being over quota. I don't know if all the incidents are though...

            I guess I'd be very worried if these weren't purely over-quota events, which is why the read checksum messages were very alarming.

            any thoughts on whether these are just from quota events or not? is there an easy way we can tell?

            I'll attach messages for today's errors to this ticket in a minute.

            the user has many (~1100 so far) of the below in their job output. it looks like they have ~127 write processes across their 28 nodes and 896 cores. the code is looping on the nodes, trying to complete the writes.

            HDF5-DIAG: Error detected in HDF5 (1.10.1) MPI-process 168:
              #000: H5Dio.c line 269 in H5Dwrite(): can't prepare for writing data
                major: Dataset
                minor: Write failed
              #001: H5Dio.c line 345 in H5D__pre_write(): can't write data
                major: Dataset
                minor: Write failed
              #002: H5Dio.c line 791 in H5D__write(): can't write data
                major: Dataset
                minor: Write failed
              #003: H5Dcontig.c line 642 in H5D__contig_write(): contiguous write failed
                major: Dataset
                minor: Write failed
              #004: H5Dselect.c line 309 in H5D__select_write(): write error
                major: Dataspace
                minor: Write failed
              #005: H5Dselect.c line 220 in H5D__select_io(): write error
                major: Dataspace
                minor: Write failed
              #006: H5Dcontig.c line 1267 in H5D__contig_writevv(): can't perform vectorized sieve buffer write
                major: Dataset
                minor: Can't operate on object
              #007: H5VM.c line 1500 in H5VM_opvv(): can't perform operation
                major: Internal error (too specific to document in detail)
                minor: Can't operate on object
              #008: H5Dcontig.c line 1014 in H5D__contig_writevv_sieve_cb(): block write failed
                major: Dataset
                minor: Write failed
              #009: H5Fio.c line 195 in H5F_block_write(): write through page buffer failed
                major: Low-level I/O
                minor: Write failed
              #010: H5PB.c line 1041 in H5PB_write(): write through metadata accumulator failed
                major: Page Buffering
                minor: Write failed
              #011: H5Faccum.c line 834 in H5F__accum_write(): file write failed
                major: Low-level I/O
                minor: Write failed
              #012: H5FDint.c line 308 in H5FD_write(): driver write request failed
                major: Virtual File Layer
                minor: Write failed
              #013: H5FDsec2.c line 810 in H5FD_sec2_write(): file write failed: time = Fri Mar 16 03:17:20 2018
            , filename = './L35_N2650/snapshot_050.24.hdf5', file descriptor = 24, errno = 122, error message = 'Disk quota exceeded', buf = 0x2b07c54a3010, total write size = 31457280, bytes this sub-write = 31457280, bytes actually written = 184467440737052
                major: Low-level I/O
                minor: Write failed
            

            cheers,
            robin

            rjh Robin Humble (Inactive) added a comment -
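            As a side note on the trace above: errno 122 in the final H5FDsec2 line is EDQUOT on Linux, i.e. the kernel itself reported "Disk quota exceeded" for that write, which fits the over-quota theory. A quick check (Python used purely for illustration):

            ```python
            import errno
            import os

            # errno 122 on Linux is EDQUOT, matching the 'Disk quota exceeded'
            # text in the HDF5 error stack.
            print(errno.errorcode[122])   # EDQUOT
            print(os.strerror(122))       # Disk quota exceeded
            ```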

            Hi Nathaniel,

            all messages for the lustre servers and from the 7 clients affected so far for all of 2018 have been uploaded.
            grep for CHECKSUM

            if you'd like console logs from these machines too then please let us know.

            cheers,
            robin

            rjh Robin Humble (Inactive) added a comment -

            Below are instructions for uploading logs to our write-only FTP site:

            Sometimes the diagnostic data collected as part of Lustre troubleshooting is too large to attach to a JIRA ticket. For these cases, HPDD provides an anonymous write-only FTP upload service. To use this service, you'll need an FTP client (e.g. ncftp, ftp, etc.) and a JIRA issue. Use the 'uploads' directory and create a new subdirectory named after your JIRA issue.

            In the following example, there are three debug logs in a single directory and the JIRA issue LU-4242 has been created. After completing the upload, please update the relevant issue with a note mentioning the upload, so that our engineers know where to find your logs.

            $ ls -lh
            total 333M
            -rw-r--r-- 1 mjmac mjmac  98M Feb 23 17:36 mds-debug
            -rw-r--r-- 1 mjmac mjmac 118M Feb 23 17:37 oss-00-debug
            -rw-r--r-- 1 mjmac mjmac 118M Feb 23 17:37 oss-01-debug
            $ ncftp ftp.hpdd.intel.com
            NcFTP 3.2.2 (Sep 04, 2008) by Mike Gleason (http://www.NcFTP.com/contact/).
            Connecting to 99.96.190.235...
            (vsFTPd 2.2.2)
            Logging in...
            Login successful.
            Logged in to ftp.hpdd.intel.com.
            ncftp / > cd uploads
            Directory successfully changed.
            ncftp /uploads > mkdir LU-4242
            ncftp /uploads > cd LU-4242
            Directory successfully changed.
            ncftp /uploads/LU-4242 > put *
            mds-debug:                                              97.66 MB   11.22 MB/s
            oss-00-debug:                                          117.19 MB   11.16 MB/s
            oss-01-debug:                                          117.48 MB   11.18 MB/s
            ncftp /uploads/LU-4242 >
            

            Please note that this is a WRITE-ONLY FTP service, so you will not be able to see (with ls) the files or directories you've created, nor will you (or anyone other than HPDD staff) be able to see or read them.

            utopiabound Nathaniel Clark added a comment -

            People

              hongchao.zhang Hongchao Zhang
              scadmin SC Admin