Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.5.4
    • Lustre 2.5.3
    • None
    • 3
    • 15702

    Description

      This was introduced when porting commit 3902ff4c54925b2f1fcb732a32ed7ee5428e9f77.

      Some bits in osd_declare_write() were lost during porting.

        Activity

          [LU-5612] typo in osd_declare_write()

          niu Niu Yawei (Inactive) added a comment -

          Hi, Ryan

          I don't think it's caused by this typo. However, this typo can definitely cause insufficient credits in certain cases, so you'd better have this fix as well.

          The stack trace (the crash in the dqput path) is indeed different from those in LU-5040, and it probably reveals another problem: when changing owner/group, if the original owner/group has no limits and the current inode is the last file for the original user/group, the quota entry could be deleted, which requires additional journal credits. This sounds quite rare compared with LU-5040. I'll try to work out a fix in LU-5250. Thanks for bringing this to my attention.
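The credit shortfall described in this comment can be illustrated with a toy model. This is only a sketch under stated assumptions: the `Handle` class, the per-operation costs, and the `declare_chown`/`chown` helpers are hypothetical names invented for illustration, not the jbd2 or osd_ldiskfs API, and the credit numbers are made up. It shows why a declaration that ignores the possible deletion of the old owner's quota entry can run out of credits when that rare case occurs.

```python
class InsufficientCredits(Exception):
    """Raised when a handle dirties more blocks than it declared."""


class Handle:
    """Toy stand-in for a journal handle with a fixed credit budget."""

    def __init__(self, credits):
        self.credits = credits

    def dirty_metadata(self, what):
        # Each dirtied metadata block consumes one declared credit.
        if self.credits == 0:
            raise InsufficientCredits(f"no credits left while dirtying {what}")
        self.credits -= 1


# Hypothetical per-operation costs, chosen only for illustration.
INODE_UPDATE = 1
QUOTA_ENTRY_UPDATE = 1
QUOTA_ENTRY_DELETE = 3  # assumed number of quota tree blocks rewritten


def declare_chown(old_entry_may_be_deleted):
    """Return the number of credits to reserve for an ownership change."""
    # One inode update plus one quota entry update each for old and new owner.
    credits = INODE_UPDATE + 2 * QUOTA_ENTRY_UPDATE
    if old_entry_may_be_deleted:
        credits += QUOTA_ENTRY_DELETE
    return credits


def chown(handle, old_owner_has_limits, old_owner_file_count):
    """Perform the ownership change, consuming credits from the handle."""
    handle.dirty_metadata("inode")
    handle.dirty_metadata("new owner quota entry")
    handle.dirty_metadata("old owner quota entry")
    # Rare case: no limits set and this was the old owner's last file,
    # so the old quota entry is removed from the quota tree.
    if not old_owner_has_limits and old_owner_file_count == 1:
        for level in range(QUOTA_ENTRY_DELETE):
            handle.dirty_metadata(f"quota tree block {level}")
```

In this model, `chown(Handle(declare_chown(False)), False, 1)` raises `InsufficientCredits`, while declaring with `declare_chown(True)` leaves the handle with exactly zero credits after the same call.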
          haasken Ryan Haasken added a comment -

          Now that I look more closely at our own stack traces, it turns out we got a stack trace including dqget and do_insert_tree when we attempted to restart the file system after the crash. Niu, can you confirm that the fix which landed for LU-5040 fixes the bug shown in this stack trace:

          https://jira.hpdd.intel.com/browse/LU-5040?focusedCommentId=90730&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-90730

          That stack trace is not for the bug which was caused by this typo, is it?

          haasken Ryan Haasken added a comment -

          Thanks for pointing those other tickets out. Our stack trace is slightly different than those listed in LU-5040. Here it is, copied from LU-5250:

          [exception RIP: jbd2_journal_dirty_metadata+268]
          RIP: ffffffffa02cc86c RSP: ffff88087be375e0 RFLAGS: 00010246
          RAX: ffff8806485b3bc0 RBX: ffff8806f520d588 RCX: ffff88084223bcf8
          RDX: 0000000000000000 RSI: ffff88084223bcf8 RDI: 0000000000000000
          RBP: ffff88087be37600 R8: f010000000000000 R9: f79fde5390e73e02
          R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801eb760748
          R13: ffff88084223bcf8 R14: ffff88086b22d800 R15: 0000000000000c00
          ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
          #4 [ffff88087be37608] __ldiskfs_handle_dirty_metadata at ffffffffa02ee0bb [ldiskfs]
          #5 [ffff88087be37648] ldiskfs_quota_write at ffffffffa0324b95 [ldiskfs]
          #6 [ffff88087be376b8] write_blk at ffffffff811e44ae
          #7 [ffff88087be376c8] remove_tree at ffffffff811e4da1
          #8 [ffff88087be37738] remove_tree at ffffffff811e4bf8
          #9 [ffff88087be377a8] remove_tree at ffffffff811e4bf8
          #10 [ffff88087be37818] qtree_delete_dquot at ffffffff811e4fe3
          #11 [ffff88087be37838] qtree_release_dquot at ffffffff811e501f
          #12 [ffff88087be37848] v2_release_dquot at ffffffff811e3cc0
          #13 [ffff88087be37858] dquot_release at ffffffff811df8e5
          #14 [ffff88087be37898] ldiskfs_release_dquot at ffffffffa03235be [ldiskfs]
          #15 [ffff88087be378b8] dqput at ffffffff811e0489
          #16 [ffff88087be378e8] dquot_transfer at ffffffff811e3253
          #17 [ffff88087be379c8] vfs_dq_transfer at ffffffff811dfc0c
          #18 [ffff88087be379e8] osd_quota_transfer at ffffffffa0ba98a5 [osd_ldiskfs]
          #19 [ffff88087be37a58] osd_attr_set at ffffffffa0bbcb8a [osd_ldiskfs]
          #20 [ffff88087be37ab8] dt_attr_set.clone.2 at ffffffffa083a969 [ofd]
          #21 [ffff88087be37ac8] ofd_attr_set at ffffffffa083e472 [ofd]
          #22 [ffff88087be37b28] ofd_setattr at ffffffffa082fe68 [ofd]
          #23 [ffff88087be37bb8] ost_setattr at ffffffffa06461fb [ost]
          #24 [ffff88087be37c18] ost_handle at ffffffffa06491fd [ost]
          #25 [ffff88087be37d68] ptlrpc_server_handle_request at ffffffffa06df4d5 [ptlrpc]
          #26 [ffff88087be37e48] ptlrpc_main at ffffffffa06e083d [ptlrpc]
          #27 [ffff88087be37ee8] kthread at ffffffff81096136
          #28 [ffff88087be37f48] kernel_thread at ffffffff8100c0ca
          #0 [ffff88087be37400] die at ffffffff8100f18b
          

          This is very similar to the stack traces posted by Mahmoud on August 4th in LU-5040, but those stack traces are in dqget rather than dqput.

          ... 
          [<ffffffff811e029c>] dqget+0x2ac/0x390
          [<ffffffff811e1b86>] dquot_transfer+0x116/0x620
          [<ffffffff811e09ab>] ? dquot_initialize+0x1fb/0x240
          [<ffffffffa0be0558>] ? __ldiskfs_journal_stop+0x68/0xa0 [ldiskfs]
          [<ffffffff811de4bc>] vfs_dq_transfer+0x6c/0xd0
          ...
          

          Is this still the same bug? Why are we hitting the assertion in dqput rather than dqget?
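When comparing many traces like the two above, a small script can report which quota path each one goes through. The following is a minimal sketch, assuming only the two frame formats that appear in this ticket (crash-utility frames and console frames); the helper names `frame_functions` and `quota_path` are hypothetical, and frames marked with `?` in console output are included even though the kernel flags them as unreliable.

```python
import re

# Crash-utility frames, e.g.:
#   #15 [ffff88087be378b8] dqput at ffffffff811e0489
CRASH_FRAME = re.compile(r"#\d+\s+\[[0-9a-f]+\]\s+(\S+)\s+at\s+[0-9a-f]+")

# Console frames, e.g.:
#   [<ffffffff811e029c>] dqget+0x2ac/0x390
CONSOLE_FRAME = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def frame_functions(trace):
    """Return the function names found in a trace, in the order listed."""
    names = []
    for line in trace.splitlines():
        match = CRASH_FRAME.search(line) or CONSOLE_FRAME.search(line)
        if match:
            names.append(match.group(1))
    return names


def quota_path(trace):
    """Return 'dqput' or 'dqget' (whichever is listed first), else None."""
    for name in frame_functions(trace):
        if name in ("dqput", "dqget"):
            return name
    return None
```

Feeding each saved trace to `quota_path` separates the dqput-path crashes from the dqget-path ones without reading every frame by hand.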

          niu Niu Yawei (Inactive) added a comment -

          > Niu, could this typo/omission be the cause for LU-5250? Did you see some symptoms that caused you to open this bug?

          LU-5250 is probably a dup of LU-5040. This defect (LU-5612) was found while testing the patch for LU-5040.

          haasken Ryan Haasken added a comment -

          Niu, could this typo/omission be the cause for LU-5250? Did you see some symptoms that caused you to open this bug?

          pjones Peter Jones added a comment -

          Landed for 2.5.4. Not needed on master.

          niu Niu Yawei (Inactive) added a comment - patch for b2_5: http://review.whamcloud.com/11889

          People

            niu Niu Yawei (Inactive)
            niu Niu Yawei (Inactive)
            Votes: 0
            Watchers: 5