  Lustre / LU-4345

failed to update accounting ZAP for user

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.3
    • Labels: None
    • Environment: Lustre 2.4.0-19chaos
    • Severity: 3
    • 11907

    Description

      We are using lustre 2.4.0-19chaos on our servers running with the ZFS OSD. On some of the OSS nodes we are seeing messages like this:

      Nov  6 00:06:29 stout8 kernel: LustreError: 14909:0:(osd_object.c:973:osd_attr_set()) fsrzb-OST0007: failed to update accounting ZAP for user 132245 (-2)
      Nov  6 00:06:29 stout8 kernel: LustreError: 14909:0:(osd_object.c:973:osd_attr_set()) Skipped 5 previous similar messages
      Nov  6 00:06:38 stout16 kernel: LustreError: 15266:0:(osd_object.c:973:osd_attr_set()) fsrzb-OST000f: failed to update accounting ZAP for user 122392 (-2)
      Nov  6 00:06:38 stout16 kernel: LustreError: 15266:0:(osd_object.c:973:osd_attr_set()) Skipped 3 previous similar messages
      Nov  6 00:06:40 stout12 kernel: LustreError: 15801:0:(osd_object.c:973:osd_attr_set()) fsrzb-OST000b: failed to update accounting ZAP for user 122708 (-2)
      Nov  6 00:06:40 stout12 kernel: LustreError: 15801:0:(osd_object.c:973:osd_attr_set()) Skipped 4 previous similar messages
      
      Nov  7 00:31:36 porter31 kernel: LustreError: 7704:0:(osd_object.c:973:osd_attr_set()) lse-OST001f: failed to update accounting ZAP for user 54916 (-2)
      Nov  7 02:53:05 porter19 kernel: LustreError: 9380:0:(osd_object.c:973:osd_attr_set()) lse-OST0013: failed to update accounting ZAP for user 7230 (-2)
      
      Dec  3 12:01:21 stout7 kernel: Lustre: Skipped 3 previous similar messages
      Dec  3 13:52:30 stout4 kernel: LustreError: 15806:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST0003: failed to update accounting ZAP for user 1752876224 (-2)
      Dec  3 13:52:30 stout4 kernel: LustreError: 15806:0:(osd_object.c:967:osd_attr_set()) Skipped 3 previous similar messages
      Dec  3 13:52:30 stout1 kernel: LustreError: 15324:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST0000: failed to update accounting ZAP for user 1752876224 (-2)
      Dec  3 13:52:30 stout1 kernel: LustreError: 15784:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST0000: failed to update accounting ZAP for user 1752876224 (-2)
      Dec  3 13:52:30 stout14 kernel: LustreError: 16345:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST000d: failed to update accounting ZAP for user 1752876224 (-2)
      Dec  3 13:52:30 stout12 kernel: LustreError: 32355:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST000b: failed to update accounting ZAP for user 1752876224 (-2)
      Dec  3 13:52:30 stout2 kernel: LustreError: 15145:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST0001: failed to update accounting ZAP for user 1752876224 (-2)
      Dec  3 13:52:30 stout10 kernel: LustreError: 14570:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST0009: failed to update accounting ZAP for user 1752876224 (-2)
      

      First of all, these messages are terrible. If you look at osd_attr_set() there are four exactly identical messages that can be printed. Granted, we can look them up by line number, but it would be better to make them unique.

      Looking them up by line numbers 967 and 973, it would appear that we have hit at least the first two of the "failed to update accounting ZAP for user" messages.

      Note that the UID numbers do not look correct to me. Many of them (e.g. 1752876224) are clearly not in the valid UID range. But then I don't completely understand what is going on here yet.

      Attachments

        Issue Links

          Activity

            [LU-4345] failed to update accounting ZAP for user
            yujian Jian Yu added a comment -

            Here is the back-ported patch for Lustre b2_5 branch: http://review.whamcloud.com/11435


            behlendorf Brian Behlendorf added a comment -

            > spa_history_log() and spa_history_log_sync() don't call dmu_tx_create_assigned()

            They don't call it because there aren't any atomicity requirements for the history. But if you need to make sure all the tx's end up being atomic, you'll need dmu_tx_create_assigned(), which is what the quota code uses. Even better, just use the tx passed to the synctask. This is created in dsl_pool_sync() and will be tied to the txg being synced.
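            The option of using the tx passed to the synctask could be sketched roughly as follows. This is a hedged sketch, not code from the ticket's patch: everything except the DMU/DSL calls (the callback name, the acct_batch structure and its fields) is illustrative.

```c
/*
 * Sketch: do the ZAP update inside the sync callback, using the tx
 * the DSL passes in. That tx was created in dsl_pool_sync() with
 * dmu_tx_create_assigned() and is tied to the txg being synced, so
 * the accounting update commits atomically with the rest of the txg.
 */
static void
acct_zap_sync(void *arg, dmu_tx_t *tx)
{
	struct acct_batch *ab = arg;	/* hypothetical per-txg state */

	/* fold the batched delta into the per-user accounting ZAP */
	(void) zap_update(ab->ab_os, ab->ab_zapobj, ab->ab_key,
	    sizeof (uint64_t), 1, &ab->ab_space, tx);
}
```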

            bzzz Alex Zhuravlev added a comment -

            OK, I'll have a closer look at the debug options, but notice that spa_history_log() and spa_history_log_sync() don't call dmu_tx_create_assigned(), and this still works?

            I'll also try to use dmu_tx_create_assigned(), thanks.

            behlendorf Brian Behlendorf added a comment -

            Perhaps it's just me, but I find it hard to walk the code and verify that the tx is being constructed correctly. Are we perhaps taking extra holds now? One thing you could try is to build ZFS with the --enable-debug-dmu-tx option. This will enable a variety of checks to ensure that tx are constructed and managed properly. If they're not, you'll ASSERT. They're somewhat expensive, so they're disabled by default.

            As for doing things like the internal ZFS code: quota is managed more like I described above. If you look at dsl_pool_sync(), which is called in the syncing context like a synctask, you'll see that it creates a new tx for the correct txg with dmu_tx_create_assigned(). It then goes through the list of dirty dnodes per dataset and updates the ZAPs accordingly. I don't see why Lustre couldn't do something similar.
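            The dsl_pool_sync()-style approach described here might look roughly like this in syncing context. A sketch under the assumption that dp, txg, and the accounting ZAP object are already known; the variable names are illustrative.

```c
/*
 * In syncing context, pin a tx to the txg currently being synced
 * and update the accounting ZAPs under it, mirroring what
 * dsl_pool_sync() does for ZFS's own space accounting.
 */
dmu_tx_t *tx = dmu_tx_create_assigned(dp, txg);

/* walk the dirty per-user/per-group deltas and fold them in */
(void) zap_update(os, acct_zap_obj, uid_key,
    sizeof (uint64_t), 1, &space_delta, tx);

dmu_tx_commit(tx);	/* drop holds, notify waiters */
```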

            bzzz Alex Zhuravlev added a comment -

            hmm, osd_trans_stop() does this, otherwise we would block in the very first txg? The interesting thing is that the patch gets stuck quite rarely.. it actually passed many Maloo runs, and it was hard to hit the issue locally as well.

            behlendorf Brian Behlendorf added a comment -

            The critical bit is the call to dmu_tx_commit() after the dsl_sync_task_nowait() call. Where does this happen in the Lustre code for this tx?
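            Modeled on spa_history_log(), the shape being asked about here would be roughly the following. A sketch: the callback and argument names are illustrative, and the dsl_sync_task_nowait() argument list is abbreviated.

```c
tx = dmu_tx_create(os);
if ((err = dmu_tx_assign(tx, TXG_WAIT)) != 0) {
	dmu_tx_abort(tx);
	return (err);
}

/* schedule the batched ZAP update for this txg's syncing phase */
dsl_sync_task_nowait(dp, acct_zap_sync, arg, /* ... */ tx);

/*
 * The critical bit: commit the tx that carried the sync task.
 * This drops the dnode holds and notifies any waiters; without it
 * the holds accumulate and the txg can never quiesce.
 */
dmu_tx_commit(tx);
```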

            behlendorf Brian Behlendorf added a comment -

            Exactly right. You just need this tx to end up in the same txg. As long as that's true, you'll have the atomic guarantee: on failure either both will be on disk or neither will. But you don't need to leverage an existing tx to achieve that. If you know the txg, you can use dmu_tx_create_assigned() to ensure your tx is part of that txg.

            bzzz Alex Zhuravlev added a comment -

            I had a quick look at spa_history_log() and I'm not sure what the difference is. Yes, it does create its "own" tx, but how is that really different, given that all the logic is done in spa_history_log_sync(), which is scheduled with dsl_sync_task_nowait()?

            In my patch I do have a normal tx created the same normal way. I check whether this is the first time I've seen this specific txg, then allocate a structure (which is essentially per-txg) and schedule a function with dsl_sync_task_nowait() which updates a couple of ZAPs.. seems to be exactly the same?

            bzzz Alex Zhuravlev added a comment -

            hmm, what I'm trying to do is to hit the same transaction (otherwise accounting is not atomic), but at the point when all the normal changes within this txg are done - that is the purpose of dsl_sync_task_nowait(), right? And basically that whole state is per txg, so it's supposed to be totally non-racy.

            Yet another point: isn't this exactly (or very nearly) how space accounting works now?

            behlendorf Brian Behlendorf added a comment -

            Ahh OK. I remember now.

            It looks to me like you're failing to call dmu_tx_commit() after dsl_sync_task_nowait(). The commit is responsible for dropping all the dnode holds and notifying any waiters. Without it the holds are just going to accumulate, and I'd expect to see exactly what you're describing.

            My suggestion would be to take a good look at the spa_history_log() function in ZFS. It's a fairly nice example of how to go about this. In that case they create a tx per history update since there aren't that many of them. In the Lustre case I agree it's probably a good idea to batch them as you're doing. However, for the commit callback I'd still strongly suggest a dedicated tx for this purpose. I think it would make the code more readable and easier to verify that you've constructed the tx properly. There's a nice comment in include/sys/dmu.h describing what you can and cannot do when constructing a tx. If you break any of those rules, you're likely to see some strange problems.

            /*
             * You must create a transaction, then hold the objects which you will
             * (or might) modify as part of this transaction.  Then you must assign
             * the transaction to a transaction group.  Once the transaction has
             * been assigned, you can modify buffers which belong to held objects as
             * part of this transaction.  You can't modify buffers before the
             * transaction has been assigned; you can't modify buffers which don't
             * belong to objects which this transaction holds; you can't hold
             * objects once the transaction has been assigned.  You may hold an
             * object which you are going to free (with dmu_object_free()), but you
             * don't have to.
             *
             * You can abort the transaction before it has been assigned.
             *
             * Note that you may hold buffers (with dmu_buf_hold) at any time,
             * regardless of transaction state.
             */
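            Those rules, applied in order to the accounting ZAP case, might look like this. A hedged sketch: zapobj, name, and val are illustrative, not identifiers from the patch.

```c
dmu_tx_t *tx = dmu_tx_create(os);

/* 1. hold the objects you will (or might) modify - before assigning */
dmu_tx_hold_zap(tx, zapobj, TRUE, NULL);

/* 2. assign the tx to a txg; abort is still legal if this fails */
if (dmu_tx_assign(tx, TXG_WAIT) != 0) {
	dmu_tx_abort(tx);
	return;
}

/* 3. only now modify buffers belonging to the held objects */
(void) zap_update(os, zapobj, name, sizeof (uint64_t), 1, &val, tx);

/* 4. once assigned, the tx must be committed, not aborted */
dmu_tx_commit(tx);
```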
            

            The racy comment was just a subjective feeling I had about the code. There's so much non-private state you're depending on that it's hard to look at the code and convince yourself it's safe. A good example of this is the reuse of the ot_tx. It's hard to know exactly what state the tx is in when you enter the function, and depending on what state that is, there are certain things you must not do. If you were to create a new tx here, it would be much clearer.


            People

              Assignee:
              niu Niu Yawei (Inactive)
              Reporter:
              morrone Christopher Morrone (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              14 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: