[LU-6538] Client s getting "operation ost_write failed with -3." Created: 29/Apr/15  Updated: 22/Jun/16  Resolved: 22/Jun/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.3
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Haisong Cai (Inactive) Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: sdsc
Environment:

Linux meerkat-mds-10-1.local 2.6.32-431.29.2.el6_lustre.gb8d9077.x86_64 #1 SMP Mon Apr 27 12:55:47 PDT 2015 x86_64 x86_64 x86_64 GNU/Linux


Attachments: File aggregated_messages_4-27.gz     HTML File debugfs_out     HTML File lustre_messages_today     HTML File proc_out    
Severity: 4
Rank (Obsolete): 9223372036854775807

 Description   

Some clients can't write to some OSTs.

user@gordon-ln3 LTEST]$ cd /oasis/projects/nsf/use300/mahidhar/LTEST
[user@gordon-ln3 LTEST]$ lfs setstripe -c 1 -i 54 .
[user@gordon-ln3 LTEST]$ ls
IOR-2.10.3.tar
[user@gordon-ln3 LTEST]$ rm IOR-2.10.3.tar
[user@gordon-ln3 LTEST]$ cp ../IOR-2.10.3.tar .
cp: writing `./IOR-2.10.3.tar': No such process
[user@gordon-ln3 LTEST]$

Client dmesg got:

LustreError: 11-0: meerkat-OST0015-osc-ffff88102f998c00: Communicating with xx.xx.xx.xx244@tcp, operation ost_write failed with -3.
LustreError: 11-0: meerkat-OST0036-osc-ffff88102f998c00: Communicating with xx.xx.xx.xx@tcp, operation ost_write failed with -3.
LustreError: 11-0: meerkat-OST000e-osc-ffff88102f998c00: Communicating with xx.xx.xx.xx@tcp, operation ost_write failed with -3.
LustreError: 11-0: meerkat-OST000d-osc-ffff88102f998c00: Communicating with xx.xx.xx.xx@tcp, operation ost_write failed with -3.
LustreError: 11-0: meerkat-OST0024-osc-ffff88102f998c00: Communicating with xx.xx.xx.xx@tcp, operation ost_write failed with -3.
LustreError: 11-0: meerkat-OST002a-osc-ffff88102f998c00: Communicating with xx.xx.xx.xx@tcp, operation ost_write failed with -3.
LustreError: 11-0: meerkat-OST0009-osc-ffff88102f998c00: Communicating with xx.xx.xx.xx@tcp, operation ost_write failed with -3.
LustreError: 11-0: meerkat-OST0036-osc-ffff88102f998c00: Communicating with xx.xx.xx.xx@tcp, operation ost_write failed with -3.
LustreError: 11-0: meerkat-OST000e-osc-ffff88102f998c00: Communicating with xx.xx.xx.xx@tcp, operation ost_write failed with -3.
LustreError: 11-0: meerkat-OST000d-osc-ffff88102f998c00: Communicating with xx.xx.xx.xx@tcp, operation ost_write failed with -3.
LustreError: 11-0: meerkat-OST0024-osc-ffff88102f998c00: Communicating with xx.xx.xx.xx@tcp, operation ost_write failed with -3.
LustreError: 11-0: meerkat-OST002a-osc-ffff88102f998c00: Communicating with xx.xx.xx.xx@tcp, operation ost_write failed with -3.
LustreError: 11-0: meerkat-OST0009-osc-ffff88102f998c00: Communicating with xx.xx.xx.xx@tcp, operation ost_write failed with -3.
LustreError: 11-0: meerkat-OST0036-osc-ffff88102f998c00: Communicating with xx.xx.xx.xx@tcp, operation ost_write failed with -3.
LustreError: 11-0: meerkat-OST0036-osc-ffff88102f998c00: Communicating with xx.xx.xx.xx@tcp, operation ost_write failed with -3.
LustreError: 11-0: meerkat-OST0026-osc-ffff88102f998c00: Communicating with xx.xx.xx.xx@tcp, operation ost_write failed with -3.
LustreError: 11-0: meerkat-OST0026-osc-ffff88102f998c00: Communicating with xx.xx.xx.xx@tcp, operation ost_write failed with -3.
LustreError: 11-0: meerkat-OST0025-osc-ffff88102f998c00: Communicating with xx.xx.xx.xx@tcp, operation ost_write failed with -3.
LustreError: 11-0: meerkat-OST0036-osc-ffff88102f998c00: Communicating with xx.xx.xx.xx@tcp, operation ost_write failed with -3.
LustreError: 11-0: meerkat-OST0017-osc-ffff88102f998c00: Communicating with xx.xx.xx.xx@tcp, operation ost_write failed with -3.
LustreError: 11-0: meerkat-OST0036-osc-ffff88102f998c00: Communicating with xx.xx.xx.xx@tcp, operation ost_write failed with -3.
LustreError: 11-0: meerkat-OST0036-osc-ffff88102f998c00: Communicating with xx.xx.xx.xx@tcp, operation ost_write failed with -3.
LustreError: 11-0: meerkat-OST0031-osc-ffff88102f998c00: Communicating with xx.xx.xx.xx@tcp, operation ost_write failed with -3.
LustreError: 11-0: meerkat-OST0025-osc-ffff88102f998c00: Communicating with xx.xx.xx.xx@tcp, operation ost_write failed with -3.
LustreError: 11-0: meerkat-OST0036-osc-ffff88102f998c00: Communicating with xx.xx.xx.xx@tcp, operation ost_write failed with -3.
LustreError: 11-0: meerkat-OST0036-osc-ffff88102f998c00: Communicating with xx.xx.xx.xx@tcp, operation ost_write failed with -3.
LustreError: 11-0: meerkat-OST0036-osc-ffff88102f998c00: Communicating with xx.xx.xx.xx@tcp, operation ost_write failed with -3.
LustreError: 11-0: meerkat-OST0036-osc-ffff88102f998c00: Communicating with xx.xx.xx.xx@tcp, operation ost_write failed with -3.
LustreError: 11-0: meerkat-OST0031-osc-ffff88102f998c00: Communicating with xx.xx.xx.xx@tcp, operation ost_write failed with -3.
LustreError: 11-0: meerkat-OST000c-osc-ffff88102f998c00: Communicating with xx.xx.xx.xx@tcp, operation ost_write failed with -3.

Server log see attachment



 Comments   
Comment by Rick Wagner (Inactive) [ 29/Apr/15 ]

Minh, can you add an sdsc label to this ticket?

Comment by Zhenyu Xu [ 29/Apr/15 ]

From the log, it possibly relates to osd quota operation osd_acct_index_lookup() which calls vfs_get_dqblk()

int vfs_get_dqblk(struct super_block *sb, int type, qid_t id,
                  struct if_dqblk *di)
{
        struct dquot *dquot;

        dquot = dqget(sb, id, type);
        if (!dquot)
                return -ESRCH;
        do_get_dqblk(dquot, di);
        dqput(dquot);

        return 0;
}

and it cannot get reference to dquot.

Comment by Haisong Cai (Inactive) [ 29/Apr/15 ]

Hi Zhenyu,

Is this a bug or setting problem?

Haisong

Comment by Niu Yawei (Inactive) [ 29/Apr/15 ]

Hi, Haisong

This looks like a setting problem.

Do all the OST/MDT devices have quota feature enabled? You can run following command to verify it: "debugfs -c -R features $dev"
And please check the quota proc file "/proc/fs/lustre/osd-ldiskfs/$MDT_or_$OST/quota_slave/info" to see if the quota accounting is setup properly. Thank you.

Comment by Haisong Cai (Inactive) [ 29/Apr/15 ]

Hi Niu,

proc and debugfs checked out in OSS and MDS.
Besides a few fields differences between OST and MDT, all targets have identical settings (see attached 2 output files)

thanks,
Haisong

Comment by Niu Yawei (Inactive) [ 29/Apr/15 ]

All devices looks have quota feature enabled, then I suspect that quota wasn't enabled on mount for some OSTs. (OST0034 for instance)

When did you observe such write failures (-3)? Any error happened on these OSTs before the -3 failures? Do you happen to have earlier log messages so that we can check if quota wasn't enabled successfully on mount time?

Could you pick on problematic OST (OST0034 for instance), and perform umount -> mount to see if that will resolve the problem? And please also attach the log messages when umount/mount the OST. Thank you.

Comment by Haisong Cai (Inactive) [ 29/Apr/15 ]

Niu,

1) We are mounting OST simply like this:
mount -t lustre /dev/ost_sdc /meerkat/ost_sdc

Are there options we should specify when mounting?

2) (-3) error is happening one client side when someone write to problematic OSTs. It doesn't happen for user and either every time. The filesystem was upgraded from 1.8.7 to 2.4.3 a year ago and didn't had problem. Last Monday we upgraded from 2.4.3 to 2.5.3. Almost right after that we started getting write errors.

3) I will gather the logs when we mount OSTs last Monday and upload here

4) I will also try to umount and mount the problematic and attach log here.

Haisong

Comment by Haisong Cai (Inactive) [ 29/Apr/15 ]

Tried unmount and mount an bad OST:

[root@meerkat-oss-11-4 ~]# mount -t lustre /dev/ost_sdi /meerkat/ost_sdi
mount.lustre: set /sys/block/sdi/queue/max_sectors_kb to 280

[root@meerkat-oss-11-4 ~]#

Comment by Haisong Cai (Inactive) [ 29/Apr/15 ]

Last few lines from dmesg on 11-4:

VFS: Error -5 occurred while creating quota.
VFS: find_free_dqentry(): Data block full but it shouldn't.
VFS: Error -5 occurred while creating quota.
LustreError: 4787:0:(qsd_handler.c:1155:qsd_op_adjust()) meerkat-OST000e: fail to locate lqe for id:505418, type:0
LustreError: 4787:0:(qsd_handler.c:1155:qsd_op_adjust()) Skipped 9787 previous similar messages
VFS: find_free_dqentry(): Data block full but it shouldn't.
VFS: Error -5 occurred while creating quota.
VFS: find_free_dqentry(): Data block full but it shouldn't.
VFS: Error -5 occurred while creating quota.
VFS: find_free_dqentry(): Data block full but it shouldn't.
VFS: Error -5 occurred while creating quota.
Lustre: 12307:0:(qsd_reint.c:486:qsd_reint_main()) meerkat-OST0036: reint global for [0x200000006:0x20000:0x0] failed. -3
VFS: find_free_dqentry(): Data block full but it shouldn't.
VFS: Error -5 occurred while creating quota.
VFS: find_free_dqentry(): Data block full but it shouldn't.
VFS: Error -5 occurred while creating quota.
VFS: find_free_dqentry(): Data block full but it shouldn't.
VFS: Error -5 occurred while creating quota.
VFS: find_free_dqentry(): Data block full but it shouldn't.
VFS: Error -5 occurred while creating quota.
VFS: find_free_dqentry(): Data block full but it shouldn't.
VFS: Error -5 occurred while creating quota.
LustreError: 4787:0:(qsd_handler.c:1155:qsd_op_adjust()) meerkat-OST000e: fail to locate lqe for id:505418, type:0
VFS: find_free_dqentry(): Data block full but it shouldn't.
VFS: Error -5 occurred while creating quota.
VFS: find_free_dqentry(): Data block full but it shouldn't.
VFS: Error -5 occurred while creating quota.
VFS: find_free_dqentry(): Data block full but it shouldn't.
VFS: Error -5 occurred while creating quota.
VFS: find_free_dqentry(): Data block full but it shouldn't.
VFS: Error -5 occurred while creating quota.
VFS: find_free_dqentry(): Data block full but it shouldn't.
VFS: Error -5 occurred while creating quota.
VFS: find_free_dqentry(): Data block full but it shouldn't.
VFS: Error -5 occurred while creating quota.
VFS: find_free_dqentry(): Data block full but it shouldn't.
VFS: Error -5 occurred while creating quota.
VFS: find_free_dqentry(): Data block full but it shouldn't.
VFS: Error -5 occurred while creating quota.
VFS: find_free_dqentry(): Data block full but it shouldn't.
VFS: Error -5 occurred while creating quota.
VFS: find_free_dqentry(): Data block full but it shouldn't.
VFS: Error -5 occurred while creating quota.
VFS: find_free_dqentry(): Data block full but it shouldn't.
VFS: Error -5 occurred while creating quota.
VFS: find_free_dqentry(): Data block full but it shouldn't.
VFS: Error -5 occurred while creating quota.
VFS: find_free_dqentry(): Data block full but it shouldn't.
VFS: Error -5 occurred while creating quota.
VFS: find_free_dqentry(): Data block full but it shouldn't.
VFS: Error -5 occurred while creating quota.
VFS: find_free_dqentry(): Data block full but it shouldn't.
VFS: Error -5 occurred while creating quota.
VFS: find_free_dqentry(): Data block full but it shouldn't.
VFS: Error -5 occurred while creating quota.
VFS: find_free_dqentry(): Data block full but it shouldn't.
VFS: Error -5 occurred while creating quota.
VFS: find_free_dqentry(): Data block full but it shouldn't.
VFS: Error -5 occurred while creating quota.
VFS: find_free_dqentry(): Data block full but it shouldn't.
VFS: Error -5 occurred while creating quota.

Comment by Niu Yawei (Inactive) [ 29/Apr/15 ]

Looks the quota accounting files are corrupted somehow, could you make sure that latest e2fsprogs are installed on the problematic OSTs? (1.42.12-wc1)
Then try to fix the quota files by:

  • umount the OST;
  • tune2fs -O ^quota $ostdev;
  • tune2fs -O quota $ostdev;
  • mount OST;
Comment by Haisong Cai (Inactive) [ 29/Apr/15 ]

Aggregated logs from all MDS/OSS, before and after 2.4.3 -> 2.5.3 upgrade.
Sometime between 3pm and 4pm on that day was when we mounted OST/MDT.

Comment by Haisong Cai (Inactive) [ 29/Apr/15 ]

On all OSTs or just the ones reporting problem?

thanks,
Haisong

Comment by Haisong Cai (Inactive) [ 29/Apr/15 ]

By the way....

[root@meerkat-oss-11-4 log]# rpm -aq | grep e2fsprog
e2fsprogs-libs-1.42.12.wc1-7.el6.x86_64
e2fsprogs-1.42.12.wc1-7.el6.x86_64

Comment by Haisong Cai (Inactive) [ 29/Apr/15 ]

[root@meerkat-oss-11-4 log]# umount /meerkat/ost_sdi
[root@meerkat-oss-11-4 log]# tune2fs -O ^quota /dev/ost_sdi
tune2fs 1.42.12.wc1 (15-Sep-2014)
[root@meerkat-oss-11-4 log]# tune2fs -O quota /dev/ost_sdi
tune2fs 1.42.12.wc1 (15-Sep-2014)
[root@meerkat-oss-11-4 log]# mount -t lustre /dev/ost_sdi /meerkat/ost_sdi
mount.lustre: set /sys/block/sdi/queue/max_sectors_kb to 280

[root@meerkat-oss-11-4 log]#

Testing and will let you know if it fix the problem or not.

Comment by Niu Yawei (Inactive) [ 29/Apr/15 ]

I'm not sure if the latest e2fsprogs has been installed when you enable quota feature for the OST/MDT devices (when upgrading from 2.1 to 2.4). If it wasn't the latest version, I suggest you to run it for all MDT/OSTs. Of course, you'd choose the problematic one first to see if it can fix the problem.

Comment by Haisong Cai (Inactive) [ 29/Apr/15 ]

Niu,

A quick test on one the problematic OSTs show that fix the write problem.
We are going to initialize all OSTs and MDT now.

Haisong

Comment by Niu Yawei (Inactive) [ 23/Jul/15 ]

Haisong, anything else need we do on this ticket? Can we close this one?

Generated at Sat Feb 10 02:01:05 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.