[LU-7106] Lustre client fails with error vvp_io.c:1081:vvp_io_commit_write() even when there is space in OST and MDT Created: 04/Sep/15  Updated: 20/Oct/15  Resolved: 20/Oct/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Haisong Cai (Inactive) Assignee: Yang Sheng
Resolution: Incomplete Votes: 0
Labels: None
Environment:

client:
lustre-client-2.5.3-2.6.32_431.29.2.el6.x86_64.x86_64
lustre-client-source-2.4.3-2.6.32_431.20.3.el6.x86_64_gfbfbc94.x86_64
lustre-client-modules-2.5.3-2.6.32_431.29.2.el6.x86_64.x86_64

server:

lustre-osd-zfs-mount-2.7.56-3.10.73_1.el6.elrepo.x86_64_g1ef0185.x86_64
lustre-iokit-2.7.56-3.10.73_1.el6.elrepo.x86_64_g1ef0185.x86_64
lustre-modules-2.7.56-3.10.73_1.el6.elrepo.x86_64_g1ef0185.x86_64
lustre-osd-zfs-2.7.56-3.10.73_1.el6.elrepo.x86_64_g1ef0185.x86_64
lustre-tests-2.7.56-3.10.73_1.el6.elrepo.x86_64_g1ef0185.x86_64
lustre-source-2.7.56-3.10.73_1.el6.elrepo.x86_64_g1ef0185.x86_64
lustre-2.7.56-3.10.73_1.el6.elrepo.x86_64_g1ef0185.x86_64


Severity: 4
Rank (Obsolete): 9223372036854775807

 Description   

Clients are getting errors when writing to the Lustre server (build 2.7.56).

Commands like "cp" return a "No space left on device" error.
Here are the corresponding logs:

Sep 4 10:06:16 oasis-dm1 kernel: LustreError: 4891:0:(vvp_io.c:1081:vvp_io_commit_write()) Write page 9477610 of inode ffff8803d965b138 failed -28
Sep 4 10:06:16 oasis-dm1 kernel: LustreError: 4891:0:(vvp_io.c:1081:vvp_io_commit_write()) Skipped 1 previous similar message
Sep 4 10:19:39 oasis-dm1 kernel: LustreError: 5492:0:(vvp_io.c:1081:vvp_io_commit_write()) Write page 804864 of inode ffff88049e96abb8 failed -28
Sep 4 10:19:39 oasis-dm1 kernel: LustreError: 5492:0:(vvp_io.c:1081:vvp_io_commit_write()) Skipped 3 previous similar messages
Sep 4 10:41:32 oasis-dm1 kernel: LustreError: 7446:0:(vvp_io.c:1081:vvp_io_commit_write()) Write page 8473626 of inode ffff88016af646b8 failed -28
Sep 4 10:41:32 oasis-dm1 kernel: LustreError: 7446:0:(vvp_io.c:1081:vvp_io_commit_write()) Skipped 6 previous similar messages
Sep 4 12:00:54 oasis-dm1 kernel: LustreError: 17162:0:(vvp_io.c:1081:vvp_io_commit_write()) Write page 3805354 of inode ffff880940c2cb38 failed -28
Sep 4 12:00:54 oasis-dm1 kernel: LustreError: 17162:0:(vvp_io.c:1081:vvp_io_commit_write()) Skipped 1 previous similar message
Sep 4 12:04:37 oasis-dm1 kernel: LustreError: 17541:0:(vvp_io.c:1081:vvp_io_commit_write()) Write page 6265883 of inode ffff880254611138 failed -28
Sep 4 12:04:37 oasis-dm1 kernel: LustreError: 17541:0:(vvp_io.c:1081:vvp_io_commit_write()) Skipped 1 previous similar message
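
The trailing -28 in these messages is a negated Linux errno value; 28 is ENOSPC, which matches the "no space left on device" error that cp reports. A minimal sketch in Python for extracting the page index, inode pointer, and error name from such lines follows. The regular expression and the function name parse_commit_write_error are illustrative assumptions based only on the log lines shown in this ticket, not anything defined by Lustre.

```python
import errno
import re

# Pattern assumed from the log lines in this ticket; not a Lustre-defined format.
LINE_RE = re.compile(
    r"vvp_io_commit_write\(\)\) Write page (?P<page>\d+) "
    r"of inode (?P<inode>[0-9a-f]+) failed (?P<rc>-?\d+)"
)


def parse_commit_write_error(line):
    """Return (page, inode, errno_name), or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    rc = int(m.group("rc"))
    # The kernel logs the error as a negative errno, so flip the sign.
    name = errno.errorcode.get(-rc, "unknown")
    return int(m.group("page")), m.group("inode"), name


sample = (
    "Sep 4 10:06:16 oasis-dm1 kernel: LustreError: 4891:0:"
    "(vvp_io.c:1081:vvp_io_commit_write()) Write page 9477610 "
    "of inode ffff8803d965b138 failed -28"
)
skipped = (
    "Sep 4 10:06:16 oasis-dm1 kernel: LustreError: 4891:0:"
    "(vvp_io.c:1081:vvp_io_commit_write()) Skipped 1 previous similar message"
)
print(parse_commit_write_error(sample))
print(parse_commit_write_error(skipped))
```

The "Skipped N previous similar messages" lines carry no page or error code, so the helper returns None for them.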

The OSTs and MDT are not short of space or inodes (about 16 TB and 10+ million inodes available on average), as checked from the client with:

grep '[0-9]' /proc/fs/lustre/osc/*/kbytes{free,avail,total}
grep '[0-9]' /proc/fs/lustre/osc/*/files{free,total}
grep '[0-9]' /proc/fs/lustre/mdc/*/kbytes{free,avail,total}
grep '[0-9]' /proc/fs/lustre/mdc/*/files{free,total}

 Comments   
Comment by Haisong Cai (Inactive) [ 04/Sep/15 ]

An strace taken from an application writing to an OST:

read(3, "\305\251z:8\36]?\216\17\273\203v{\267Yo4\207\347g\227{\7#7\37~#\17\26v"..., 4194304) = 4194304
write(4, "\305\251z:8\36]?\216\17\273\203v{\267Yo4\207\347g\227{\7#7\37~#\17\26v"..., 4194304) = 2985984
write(4, "[\345&\247\201?\377>\371\205/\0\277~\220\10\206\300\252\221\1,OT \350w\26\355\254\213\301"..., 1208320) = -1 ENOSPC (No space left on device)
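
In the trace above, the first write() accepted 2985984 of the 4194304 bytes requested, and the retry of the remaining 1208320 bytes (4194304 - 2985984) failed with ENOSPC. A copy loop that retries after short writes produces exactly this pattern. The sketch below shows such a loop in Python; write_all is a hypothetical helper for illustration, not the code that cp or the reporter's application actually runs.

```python
import errno
import os
import tempfile


def write_all(fd, data):
    """Write all of data to fd, retrying after short writes.

    Returns the number of bytes written. If the file system reports
    ENOSPC partway through, the count written so far is returned
    instead of raising, so the caller can see how far the copy got.
    """
    view = memoryview(data)
    written = 0
    while written < len(view):
        try:
            n = os.write(fd, view[written:])
        except OSError as e:
            if e.errno == errno.ENOSPC:
                break
            raise
        written += n
    return written


# Usage on a local temporary file (no Lustre involved).
with tempfile.TemporaryFile() as f:
    payload = b"x" * 4096
    count = write_all(f.fileno(), payload)

# Byte counts taken from the strace in this ticket.
remaining = 4194304 - 2985984
print(count, remaining)
```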

Yet

lfs getstripe NA19240.chrom6.SOLID.bfast.YRI.high_coverage.20100311.bam
NA19240.chrom6.SOLID.bfast.YRI.high_coverage.20100311.bam
lmm_stripe_count: 1
lmm_stripe_size: 1048576
lmm_pattern: 1
lmm_layout_gen: 0
lmm_stripe_offset: 1
obdidx objid objid group
1 2862322 0x2bacf2 0

panda-OST0001_UUID 28497036288 10651272192 17823039488 37% /oasis/scratch/comet[OST:1]
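
The line above appears to be lfs df output for the OST that holds the file (obdidx 1). Assuming the three numeric columns are total, used, and available space in 1 KiB blocks, which is the usual default for lfs df, the available space can be converted to a more readable unit as in this minimal Python sketch. The function name parse_lfs_df_line and the column interpretation are assumptions, not taken from the ticket.

```python
def parse_lfs_df_line(line):
    """Split one lfs df style line into named fields.

    Assumes the column order: UUID, total, used, available, use%, mount point,
    with sizes in 1 KiB blocks.
    """
    fields = line.split()
    total_kb, used_kb, avail_kb = (int(x) for x in fields[1:4])
    return {
        "uuid": fields[0],
        "total_kb": total_kb,
        "used_kb": used_kb,
        "avail_kb": avail_kb,
        "mount": fields[5],
    }


line = (
    "panda-OST0001_UUID 28497036288 10651272192 17823039488 37% "
    "/oasis/scratch/comet[OST:1]"
)
info = parse_lfs_df_line(line)
avail_tib = info["avail_kb"] / 1024**3      # KiB -> TiB
used_frac = info["used_kb"] / info["total_kb"]
print(f"{info['uuid']}: {avail_tib:.1f} TiB available, {used_frac:.0%} used")
```

Under these assumptions the OST has between 16 and 17 TiB available, far more than the 1208320 bytes the failed write() needed.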

Comment by Peter Jones [ 04/Sep/15 ]

Yang Sheng

Could you please help with this issue?

Thanks

Peter

Comment by Yang Sheng [ 06/Sep/15 ]

Hi, Haisong,

This looks like a ZFS backend, so could you tell us the ZFS version? I am confused by your server kernel version: it has an 'el6' name but a '3.10.73' version number. I would very much appreciate it if you could provide Lustre debug logs captured while the issue is occurring, ideally from both server and client.

Thanks,
YangSheng

Comment by Yang Sheng [ 13/Oct/15 ]

Hi Haisong,

Could you please give us a status update for this ticket? Does it still need further work or should we close it?

Thanks,
YangSheng

Comment by John Fuchs-Chesney (Inactive) [ 20/Oct/15 ]

Haisong,

I am marking this one as resolved/incomplete. If you would prefer that we do some more work on this issue, just let us know, and provide the information that Yang Sheng has asked for above and we will try to make more progress.

Many thanks,
~ jfc.

Generated at Sat Feb 10 02:06:02 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.