LU-3421: (ost_handler.c:1762:ost_blocking_ast()) Error -2 syncing data on lock cancel causes first ENOSPC client issues, then MDS server locks up

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.4.1, Lustre 2.5.0
    • Affects Version/s: Lustre 2.4.0
    • Environment: RHEL 6.4 servers running Lustre 2.4-rc2, with Cray clients running 2.4.0-rc1 (SLES11 SP1) and 2.4.0-rc2 (SLES11 SP2)
    • Severity: 3
    • 8486

    Description

      Several times during our test shot we would encounter a situation where OSTs would report ENOSPC even though enough space and inodes were available. In time the MDS would become pinned and would have to be rebooted to get a working file system again. I have attached the console logs for the MDS and OSS.

    Attachments

    Issue Links

    Activity

            pjones Peter Jones added a comment -

            Thanks James!


            simmonsja James A Simmons added a comment -

            I haven't seen this bug in some time. You can close it.
            pjones Peter Jones added a comment -

            James

            This patch has landed for both 2.4.1 and 2.5. Has the issue reproduced in any test runs featuring this patch? If not, then perhaps we can close the ticket and reopen it if the issue ever does reappear...

            Peter

            simmonsja James A Simmons added a comment - edited

            Looks like the end of July for our next test shot. I will see if I can duplicate it at a smaller scale in the meantime.

            pjones Peter Jones added a comment -

            OK James. Do you have a timeframe for your next test shot yet?


            simmonsja James A Simmons added a comment -

            I tested your patch at small scale and it works fine. I'd like to keep this ticket open until our next test shot to ensure this addresses the problem. Andreas pointed out other issues as well.

            johann Johann Lombardi (Inactive) added a comment - edited

            While running some tests locally, I found that the space reserved for precreation always decreases, eventually reaches 0, and stays there. It seems that we exit from ofd_grant() at line 641:

             637         /* align grant on block size */
             638         grant &= ~((1ULL << ofd->ofd_blockbits) - 1);
             639 
             640         if (!grant)
             641                 RETURN(0);
            

            I think there are two issues:

            • ofd_grant_create() is not aggressive enough in reserving space for precreation and ends up requesting an amount of grant space smaller than a block
            • the rounding in ofd_grant() turns the <4KB allocation into 0.

            I will provide a patch.
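
            To make the rounding concrete, here is a minimal standalone sketch (not Lustre source; the value 12 for ofd_blockbits is an assumption matching a 4KB block size) showing how the alignment mask zeroes any sub-block grant request:

            /* Standalone illustration of the ofd_grant() alignment mask:
             * with 4KB blocks, any grant request below 4096 bytes is
             * rounded down to zero. */
            #include <stdio.h>
            #include <stdint.h>

            int main(void)
            {
                    unsigned int blockbits = 12;  /* assumed: 4KB blocks */
                    uint64_t requests[] = { 512, 4095, 4096, 6000 };
                    int i;

                    for (i = 0; i < 4; i++) {
                            uint64_t grant = requests[i];

                            /* same operation as ofd_grant() line 638 */
                            grant &= ~((1ULL << blockbits) - 1);
                            printf("request %llu -> grant %llu\n",
                                   (unsigned long long)requests[i],
                                   (unsigned long long)grant);
                    }
                    return 0;
            }

            Running this prints a grant of 0 for the 512 and 4095 byte requests, which is exactly the <4KB-to-0 behaviour described above.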


            johann Johann Lombardi (Inactive) added a comment - edited

            > This is subtracting the granted space from the available space returned to the MDS,

            This is actually only subtracting the space reserved for pre-creation, for which we use the self export.

            > but I think it should be adding the granted space back into os_bavail so that the MDS does not consider the grant space as "used". Otherwise, if the clients have reserved a lot of space on the OSTs they may not actually get to use it because the MDS will never allocate space there.

            I think there is nothing to add back, since only tot_dirty and tot_pending are taken into account here. Please note that tot_granted is not withdrawn anywhere.

            > A secondary issue is that there is no coordination between the space granted to a specific client and the objects that the MDS allocates to that client, which would become more important as the free space is running out.

            I would agree if we were taking tot_granted out in the statfs reply; however, I don't think this is the case.

            > There could be some kind of (IMHO complex) coordination here between the MDS and clients/OSTs, but I think it would be easier if we just got the grant shrinking code to work again, as there is no guarantee that (a) clients doing IO will have any grant at all, and (b) the clients have grant on the OSTs for which they have been asked to write on. Returning unused grant to the OSTs as the free space shrinks is the best way to ensure that there is some left for the clients actually doing IO.

            I am all for resurrecting grant shrinking, although I haven't had the opportunity to do it yet for lack of time. In fact, we might as well just disconnect from OSTs (and therefore release grants) when they are idle and the replay list is empty. We could then reconnect on demand. IMHO, such a scheme would have other benefits: fewer clients to wait for during recovery and fewer PING requests.
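
            For illustration only, a rough standalone sketch of what a grant-shrinking heuristic could look like; every name and policy here (grant_to_shrink, the 20% threshold, releasing half the unused grant) is invented for this example and does not correspond to actual Lustre symbols or behaviour:

            #include <stdio.h>
            #include <stdint.h>

            /* Hypothetical helper (names invented for illustration, not
             * Lustre symbols): how much unused grant a client could hand
             * back to the OST. The intent, per the discussion above, is
             * that idle clients release grant as OST free space shrinks,
             * so clients actively doing IO can still reserve space. */
            static uint64_t grant_to_shrink(uint64_t client_grant,
                                            uint64_t client_dirty,
                                            uint64_t ost_free,
                                            uint64_t ost_total)
            {
                    uint64_t unused = client_grant > client_dirty ?
                                      client_grant - client_dirty : 0;

                    /* assumed policy: only shrink once the OST drops
                     * below 20% free space */
                    if (ost_free * 5 > ost_total)
                            return 0;

                    /* release half of the unused grant at a time, keeping
                     * headroom for the client's own small writes */
                    return unused / 2;
            }

            int main(void)
            {
                    /* idle client holding 32MB of grant on a 250GB OST
                     * that is down to 10% free space */
                    uint64_t shrink = grant_to_shrink(32ULL << 20, 0,
                                                      25ULL << 30,
                                                      250ULL << 30);
                    printf("grant to release: %llu bytes\n",
                           (unsigned long long)shrink);
                    return 0;
            }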

            As for the original problem, it seems that precreate requests fail with ENOSPC on the OST:

            May 29 09:00:06 widow-oss11a4 kernel: [79146.403713] LustreError: 25344:0:(ofd_obd.c:1338:ofd_create()) routed1-OST016b: unable to precreate: rc = -28
            

            James, could you please dump the grant state on the OST by running "lctl get_param obdfilter.*.tot* obdfilter.*.grant*"?


            adilger Andreas Dilger added a comment -

            James, how many clients were mounting this filesystem? If each OST is 250GB, and each client gets a 32MB grant, that means 32 clients/GB of free space, so 8000 clients would essentially pin all of the available space on each OST. I see something a bit strange in the code that might be causing a problem here:

            static int ofd_statfs(...)
            {
                    osfs->os_bavail -= min_t(obd_size, osfs->os_bavail,
                                             ((ofd->ofd_tot_dirty + ofd->ofd_tot_pending +
                                               osfs->os_bsize - 1) >> ofd->ofd_blockbits));
                    :
                    :
                    /* The QoS code on the MDS does not care about space reserved for
                     * precreate, so take it out. */
                    if (exp_connect_flags(exp) & OBD_CONNECT_MDS) {
                            struct filter_export_data *fed;
            
                            fed = &obd->obd_self_export->exp_filter_data;
                            osfs->os_bavail -= min_t(obd_size, osfs->os_bavail,
                                                     fed->fed_grant >> ofd->ofd_blockbits);
                    }
            

            This is subtracting the granted space from the available space returned to the MDS, but I think it should be adding the granted space back into os_bavail so that the MDS does not consider the grant space as "used". Otherwise, if the clients have reserved a lot of space on the OSTs they may not actually get to use it because the MDS will never allocate space there.

            A secondary issue is that there is no coordination between the space granted to a specific client and the objects that the MDS allocates to that client, which would become more important as the free space is running out. There could be some kind of (IMHO complex) coordination here between the MDS and clients/OSTs, but I think it would be easier if we just got the grant shrinking code to work again, as there is no guarantee that (a) clients doing IO will have any grant at all, and (b) the clients have grant on the OSTs for which they have been asked to write on. Returning unused grant to the OSTs as the free space shrinks is the best way to ensure that there is some left for the clients actually doing IO.
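
            A quick back-of-the-envelope check of those numbers (the 250GB OST size and 32MB per-client grant are taken from the comment above, not measured):

            #include <stdio.h>
            #include <stdint.h>

            int main(void)
            {
                    uint64_t ost_size   = 250ULL << 30;  /* 250GB OST */
                    uint64_t grant_each = 32ULL << 20;   /* 32MB grant per client */

                    /* 1GB / 32MB = 32 clients per GB of free space */
                    printf("clients per GB: %llu\n",
                           (unsigned long long)((1ULL << 30) / grant_each));

                    /* 250GB / 32MB = 8000 clients to pin the whole OST */
                    printf("clients to pin OST: %llu\n",
                           (unsigned long long)(ost_size / grant_each));
                    return 0;
            }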


            simmonsja James A Simmons added a comment -

            The only OSS logs I have are like the one I posted here.

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: simmonsja James A Simmons
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved: