
Interop 2.6.0<->2.7 sanity-scrub test_7: D process in MDS

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.7.0
    • Affects Version/s: Lustre 2.7.0
    • None
    • Environment: server: lustre-master build #2770
      client: 2.6.0
    • 3
    • 16902

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/a1aa0448-7e7b-11e4-8b8b-5254006e85c2.

      The sub-test test_7 failed with the following error:

      test failed to respond and timed out
      
      01:03:59:LustreError: 17421:0:(qmt_handler.c:427:qmt_dqacq0()) $$$ Release too much! uuid:lustre-MDT0000-lwp-OST0004_UUID release:262188 granted:0, total:0 qmt:lustre-QMT0000 pool:0-dt id:500 enforced:1 hard:394749 soft:375952 granted:0 time:0 qunit:16384 edquot:0 may_rel:0 revoke:0
      01:03:59:LustreError: 17421:0:(qmt_handler.c:427:qmt_dqacq0()) Skipped 1 previous similar message
      01:03:59:Lustre: *** cfs_fail_loc=191, val=0***
      01:03:59:Lustre: lustre-MDT0000: trigger OI scrub by RPC for [0x200000bd0:0x100:0x0], rc = 0 [2]
      01:03:59:INFO: task jbd2/dm-0-8:17400 blocked for more than 120 seconds.
      01:03:59:      Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1
      01:03:59:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      01:03:59:jbd2/dm-0-8   D 0000000000000001     0 17400      2 0x00000080
      01:03:59: ffff88005b709d20 0000000000000046 ffff880065334090 ffff880065334080
      01:03:59: ffff8800606a7680 ffff8800023168e8 0000000000017997 ffff88006d06d540
      01:03:59: ffff88006d06daf8 ffff88005b709fd8 000000000000fbc8 ffff88006d06daf8
      01:03:59:Call Trace:
      01:03:59: [<ffffffff8109b2ce>] ? prepare_to_wait+0x4e/0x80
      01:03:59: [<ffffffffa03df80f>] jbd2_journal_commit_transaction+0x19f/0x1500 [jbd2]
      01:03:59: [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
      01:03:59: [<ffffffff81083e1c>] ? lock_timer_base+0x3c/0x70
      01:03:59: [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
      01:03:59: [<ffffffffa03e5a58>] kjournald2+0xb8/0x220 [jbd2]
      01:03:59: [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
      01:03:59: [<ffffffffa03e59a0>] ? kjournald2+0x0/0x220 [jbd2]
      01:03:59: [<ffffffff8109abf6>] kthread+0x96/0xa0
      01:03:59: [<ffffffff8100c20a>] child_rip+0xa/0x20
      01:03:59: [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      01:03:59: [<ffffffff8100c200>] ? child_rip+0x0/0x20
      01:03:59:INFO: task osp-syn-0-0:17440 blocked for more than 120 seconds.
      01:03:59:      Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1
      01:03:59:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      01:03:59:osp-syn-0-0   D 0000000000000000     0 17440      2 0x00000080
      01:03:59: ffff8800605c1700 0000000000000046 0000000000000000 ffffffff8128d756
      01:03:59: 0000000000000008 ffff88006a811a80 ffff8800605c2f68 ffff88005e305600
      01:03:59: ffff88006aeb9ab8 ffff8800605c1fd8 000000000000fbc8 ffff88006aeb9ab8
      01:03:59:Call Trace:
      01:03:59: [<ffffffff8128d756>] ? vsnprintf+0x336/0x5e0
      01:03:59: [<ffffffffa03de08a>] start_this_handle+0x25a/0x480 [jbd2]
      01:03:59: [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
      01:03:59: [<ffffffffa03de495>] jbd2_journal_start+0xb5/0x100 [jbd2]
      01:03:59: [<ffffffffa05b55d4>] ? llog_osd_declare_write_rec+0xd4/0x3f0 [obdclass]
      01:03:59: [<ffffffffa04367b6>] ldiskfs_journal_start_sb+0x56/0xe0 [ldiskfs]
      01:03:59: [<ffffffffa0d0c7cf>] osd_trans_start+0x1df/0x660 [osd_ldiskfs]
      01:03:59: [<ffffffffa05a4b7c>] llog_write+0x22c/0x420 [obdclass]
      01:03:59: [<ffffffffa05a4e3f>] llog_cancel_rec+0xaf/0x690 [obdclass]
      01:03:59: [<ffffffffa05aadf3>] llog_cat_cancel_records+0x133/0x2f0 [obdclass]
      01:03:59: [<ffffffffa10b13f2>] osp_sync_process_committed+0x2b2/0x820 [osp]
      01:03:59: [<ffffffffa10b19d3>] osp_sync_process_queues+0x73/0x700 [osp]
      01:03:59: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      01:03:59: [<ffffffffa05a5ec9>] llog_process_thread+0xaa9/0xe80 [obdclass]
      01:03:59: [<ffffffffa10b1960>] ? osp_sync_process_queues+0x0/0x700 [osp]
      01:03:59: [<ffffffffa05a7abf>] llog_process_or_fork+0x13f/0x540 [obdclass]
      01:03:59: [<ffffffffa05aab28>] llog_cat_process_cb+0x448/0x5e0 [obdclass]
      01:03:59: [<ffffffffa05a5ec9>] llog_process_thread+0xaa9/0xe80 [obdclass]
      01:03:59: [<ffffffffa05aa6e0>] ? llog_cat_process_cb+0x0/0x5e0 [obdclass]
      01:03:59: [<ffffffffa05a7abf>] llog_process_or_fork+0x13f/0x540 [obdclass]
      01:03:59: [<ffffffffa05a8dfd>] llog_cat_process_or_fork+0x1ad/0x300 [obdclass]
      01:03:59: [<ffffffff81061d12>] ? default_wake_function+0x12/0x20
      01:03:59: [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
      01:03:59: [<ffffffffa10b1960>] ? osp_sync_process_queues+0x0/0x700 [osp]
      01:03:59: [<ffffffffa05a8f69>] llog_cat_process+0x19/0x20 [obdclass]
      01:03:59: [<ffffffffa10b2673>] osp_sync_thread+0x243/0x7d0 [osp]
      01:03:59: [<ffffffff815296ee>] ? thread_return+0x4e/0x770
      01:03:59: [<ffffffffa10b2430>] ? osp_sync_thread+0x0/0x7d0 [osp]
      01:03:59: [<ffffffff8109abf6>] kthread+0x96/0xa0
      01:03:59: [<ffffffff8100c20a>] child_rip+0xa/0x20
      01:03:59: [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      01:03:59: [<ffffffff8100c200>] ? child_rip+0x0/0x20
      01:03:59:INFO: task osp-syn-1-0:17442 blocked for more than 120 seconds.
      01:03:59:      Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1
      01:03:59:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      01:03:59:osp-syn-1-0   D 0000000000000000     0 17442      2 0x00000080
      01:03:59: ffff880057c99700 0000000000000046 0000000000000000 ffffffff8128d756
      01:03:59: 0000000000000008 ffff88006a811b40 ffff88005d7d0f68 ffff88005e305600
      01:03:59: ffff88006cf95098 ffff880057c99fd8 000000000000fbc8 ffff88006cf95098
      01:03:59:Call Trace:
      01:03:59: [<ffffffff8128d756>] ? vsnprintf+0x336/0x5e0
      01:03:59: [<ffffffffa03de08a>] start_this_handle+0x25a/0x480 [jbd2]
      01:03:59: [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
      01:03:59: [<ffffffffa03de495>] jbd2_journal_start+0xb5/0x100 [jbd2]
      01:03:59: [<ffffffffa05b55d4>] ? llog_osd_declare_write_rec+0xd4/0x3f0 [obdclass]
      01:03:59: [<ffffffffa04367b6>] ldiskfs_journal_start_sb+0x56/0xe0 [ldiskfs]
      01:03:59: [<ffffffffa0d0c7cf>] osd_trans_start+0x1df/0x660 [osd_ldiskfs]
      01:03:59: [<ffffffffa05a4b7c>] llog_write+0x22c/0x420 [obdclass]
      01:03:59: [<ffffffffa05a4e3f>] llog_cancel_rec+0xaf/0x690 [obdclass]
      01:03:59: [<ffffffffa05aadf3>] llog_cat_cancel_records+0x133/0x2f0 [obdclass]
      01:03:59: [<ffffffffa10b13f2>] osp_sync_process_committed+0x2b2/0x820 [osp]
      01:03:59: [<ffffffffa10b19d3>] osp_sync_process_queues+0x73/0x700 [osp]
      01:03:59: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      01:03:59: [<ffffffffa05a5ec9>] llog_process_thread+0xaa9/0xe80 [obdclass]
      01:03:59: [<ffffffffa10b1960>] ? osp_sync_process_queues+0x0/0x700 [osp]
      01:03:59: [<ffffffffa05a7abf>] llog_process_or_fork+0x13f/0x540 [obdclass]
      01:03:59: [<ffffffffa05aab28>] llog_cat_process_cb+0x448/0x5e0 [obdclass]
      01:03:59: [<ffffffffa05a5ec9>] llog_process_thread+0xaa9/0xe80 [obdclass]
      01:03:59: [<ffffffffa05aa6e0>] ? llog_cat_process_cb+0x0/0x5e0 [obdclass]
      01:03:59: [<ffffffffa05a7abf>] llog_process_or_fork+0x13f/0x540 [obdclass]
      01:03:59: [<ffffffffa05a8dfd>] llog_cat_process_or_fork+0x1ad/0x300 [obdclass]
      01:03:59: [<ffffffff81061d12>] ? default_wake_function+0x12/0x20
      01:03:59: [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
      01:03:59: [<ffffffffa10b1960>] ? osp_sync_process_queues+0x0/0x700 [osp]
      01:03:59: [<ffffffffa05a8f69>] llog_cat_process+0x19/0x20 [obdclass]
      01:03:59: [<ffffffffa10b2673>] osp_sync_thread+0x243/0x7d0 [osp]
      01:03:59: [<ffffffff815296ee>] ? thread_return+0x4e/0x770
      01:03:59: [<ffffffffa10b2430>] ? osp_sync_thread+0x0/0x7d0 [osp]
      01:03:59: [<ffffffff8109abf6>] kthread+0x96/0xa0
      01:03:59: [<ffffffff8100c20a>] child_rip+0xa/0x20
      01:03:59: [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      01:03:59: [<ffffffff8100c200>] ? child_rip+0x0/0x20
      01:03:59:INFO: task osp-syn-2-0:17444 blocked for more than 120 seconds.
      01:03:59:      Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1
      01:03:59:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      01:03:59:osp-syn-2-0   D 0000000000000001     0 17444      2 0x00000080
      01:03:59: ffff880079239700 0000000000000046 0000000000000000 ffffffff8128d756
      01:03:59: 0000000000000008 ffff88006ca29d80 ffff88007923cf68 ffff88005e305600
      01:03:59: ffff880057dbd098 ffff880079239fd8 000000000000fbc8 ffff880057dbd098
      01:03:59:Call Trace:
      01:03:59: [<ffffffff8128d756>] ? vsnprintf+0x336/0x5e0
      01:03:59: [<ffffffffa03de08a>] start_this_handle+0x25a/0x480 [jbd2]
      01:03:59: [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
      01:03:59: [<ffffffffa03de495>] jbd2_journal_start+0xb5/0x100 [jbd2]
      01:03:59: [<ffffffffa05b55d4>] ? llog_osd_declare_write_rec+0xd4/0x3f0 [obdclass]
      01:03:59: [<ffffffffa04367b6>] ldiskfs_journal_start_sb+0x56/0xe0 [ldiskfs]
      01:03:59: [<ffffffffa0d0c7cf>] osd_trans_start+0x1df/0x660 [osd_ldiskfs]
      01:03:59: [<ffffffffa05a4b7c>] llog_write+0x22c/0x420 [obdclass]
      01:03:59: [<ffffffffa05a4e3f>] llog_cancel_rec+0xaf/0x690 [obdclass]
      01:03:59: [<ffffffffa05aadf3>] llog_cat_cancel_records+0x133/0x2f0 [obdclass]
      01:03:59: [<ffffffffa10b13f2>] osp_sync_process_committed+0x2b2/0x820 [osp]
      01:03:59: [<ffffffffa10b19d3>] osp_sync_process_queues+0x73/0x700 [osp]
      01:03:59: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      01:03:59: [<ffffffffa05a5ec9>] llog_process_thread+0xaa9/0xe80 [obdclass]
      01:03:59: [<ffffffff81060aa3>] ? perf_event_task_sched_out+0x33/0x70
      01:03:59: [<ffffffffa10b1960>] ? osp_sync_process_queues+0x0/0x700 [osp]
      01:03:59: [<ffffffffa05a7abf>] llog_process_or_fork+0x13f/0x540 [obdclass]
      01:03:59: [<ffffffffa05aab28>] llog_cat_process_cb+0x448/0x5e0 [obdclass]
      01:03:59: [<ffffffffa05a5ec9>] llog_process_thread+0xaa9/0xe80 [obdclass]
      01:03:59: [<ffffffffa05aa6e0>] ? llog_cat_process_cb+0x0/0x5e0 [obdclass]
      01:03:59: [<ffffffffa05a7abf>] llog_process_or_fork+0x13f/0x540 [obdclass]
      01:03:59: [<ffffffffa05a8dfd>] llog_cat_process_or_fork+0x1ad/0x300 [obdclass]
      01:03:59: [<ffffffff81061d12>] ? default_wake_function+0x12/0x20
      01:03:59: [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
      01:03:59: [<ffffffffa10b1960>] ? osp_sync_process_queues+0x0/0x700 [osp]
      01:03:59: [<ffffffffa05a8f69>] llog_cat_process+0x19/0x20 [obdclass]
      01:03:59: [<ffffffffa10b2673>] osp_sync_thread+0x243/0x7d0 [osp]
      01:03:59: [<ffffffff815296ee>] ? thread_return+0x4e/0x770
      01:03:59: [<ffffffffa10b2430>] ? osp_sync_thread+0x0/0x7d0 [osp]
      01:03:59: [<ffffffff8109abf6>] kthread+0x96/0xa0
      01:03:59: [<ffffffff8100c20a>] child_rip+0xa/0x20
      01:03:59: [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      01:03:59: [<ffffffff8100c200>] ? child_rip+0x0/0x20
      01:03:59:INFO: task osp-syn-3-0:17446 blocked for more than 120 seconds.
      01:03:59:      Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1
      01:03:59:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      01:03:59:osp-syn-3-0   D 0000000000000001     0 17446      2 0x00000080
      01:03:59: ffff8800581b5700 0000000000000046 0000000000000000 ffffffff8128d756
      01:03:59: 0000000000000008 ffff880065334680 ffff8800581b6f68 ffff88005e305600
      01:03:59: ffff880066fde5f8 ffff8800581b5fd8 000000000000fbc8 ffff880066fde5f8
      01:03:59:Call Trace:
      01:03:59: [<ffffffff8128d756>] ? vsnprintf+0x336/0x5e0
      01:03:59: [<ffffffffa03de08a>] start_this_handle+0x25a/0x480 [jbd2]
      01:03:59: [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
      01:03:59: [<ffffffffa03de495>] jbd2_journal_start+0xb5/0x100 [jbd2]
      01:03:59: [<ffffffffa05b55d4>] ? llog_osd_declare_write_rec+0xd4/0x3f0 [obdclass]
      01:03:59: [<ffffffffa04367b6>] ldiskfs_journal_start_sb+0x56/0xe0 [ldiskfs]
      01:03:59: [<ffffffffa0d0c7cf>] osd_trans_start+0x1df/0x660 [osd_ldiskfs]
      01:03:59: [<ffffffffa05a4b7c>] llog_write+0x22c/0x420 [obdclass]
      01:03:59: [<ffffffffa05a4e3f>] llog_cancel_rec+0xaf/0x690 [obdclass]
      01:03:59: [<ffffffffa05aadf3>] llog_cat_cancel_records+0x133/0x2f0 [obdclass]
      01:03:59: [<ffffffffa10b13f2>] osp_sync_process_committed+0x2b2/0x820 [osp]
      01:03:59: [<ffffffffa10b19d3>] osp_sync_process_queues+0x73/0x700 [osp]
      01:03:59: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      01:03:59: [<ffffffffa05a5ec9>] llog_process_thread+0xaa9/0xe80 [obdclass]
      01:03:59: [<ffffffffa10b1960>] ? osp_sync_process_queues+0x0/0x700 [osp]
      01:03:59: [<ffffffffa05a7abf>] llog_process_or_fork+0x13f/0x540 [obdclass]
      01:03:59: [<ffffffffa05aab28>] llog_cat_process_cb+0x448/0x5e0 [obdclass]
      01:03:59: [<ffffffffa05a5ec9>] llog_process_thread+0xaa9/0xe80 [obdclass]
      01:03:59: [<ffffffffa05aa6e0>] ? llog_cat_process_cb+0x0/0x5e0 [obdclass]
      01:03:59: [<ffffffffa05a7abf>] llog_process_or_fork+0x13f/0x540 [obdclass]
      01:03:59: [<ffffffffa05a8dfd>] llog_cat_process_or_fork+0x1ad/0x300 [obdclass]
      01:03:59: [<ffffffff81061d12>] ? default_wake_function+0x12/0x20
      01:03:59: [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
      01:03:59: [<ffffffffa10b1960>] ? osp_sync_process_queues+0x0/0x700 [osp]
      01:03:59: [<ffffffffa05a8f69>] llog_cat_process+0x19/0x20 [obdclass]
      01:03:59: [<ffffffffa10b2673>] osp_sync_thread+0x243/0x7d0 [osp]
      01:03:59: [<ffffffff815296ee>] ? thread_return+0x4e/0x770
      01:03:59: [<ffffffffa10b2430>] ? osp_sync_thread+0x0/0x7d0 [osp]
      01:03:59: [<ffffffff8109abf6>] kthread+0x96/0xa0
      01:03:59: [<ffffffff8100c20a>] child_rip+0xa/0x20
      01:03:59: [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      01:03:59: [<ffffffff8100c200>] ? child_rip+0x0/0x20
      01:03:59:INFO: task mdt00_004:18960 blocked for more than 120 seconds.
      01:03:59:      Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1
      01:03:59:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      01:03:59:mdt00_004     D 0000000000000001     0 18960      2 0x00000080
      01:03:59: ffff88007942f7b0 0000000000000046 0000000000000000 ffffffff8116f933
      01:03:59: ffff88007942f760 ffffffff8116f933 ffff88007942f7d0 ffffffffa0cb5d0f
      01:03:59: ffff88006c955ab8 ffff88007942ffd8 000000000000fbc8 ffff88006c955ab8
      

      Attachments

        Activity


          jlevi Jodi Levi (Inactive) added a comment:

          Patch landed to Master.

          gerrit Gerrit Updater added a comment:

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13311/
          Subject: LU-6075 osd: race for check/chance od_dirent_journal
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: fe486c9894aa1d2275b66b3b3b0baee69a016afa

          gerrit Gerrit Updater added a comment:

          Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/13311
          Subject: LU-6075 osd: race for check/chance od_dirent_journal
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: b559a828a4318882a5c619d84ec836d21ecbe987

          yong.fan nasf (Inactive) added a comment:

          Liang, do you have any idea about the PDO lock?

          yong.fan nasf (Inactive) added a comment:

          this test passed on the latest master version
          https://testing.hpdd.intel.com/test_sessions/98b0df84-8fcd-11e4-9293-5254006e85c2

          That's fine.

          The remaining question is why sanity-scrub test_7 hung. According to the test scripts, there is no unlink operation during the test, so some other non-test threads must have been running during sanity-scrub. What were they? It may be that another competing thread held the PDO lock and thereby blocked the unlink operation. Unfortunately, there were no logs for that.

          sarah Sarah Liu added a comment:

          this test passed on the latest master version
          https://testing.hpdd.intel.com/test_sessions/98b0df84-8fcd-11e4-9293-5254006e85c2

          yong.fan nasf (Inactive) added a comment:

          About the sanity-scrub test_7 failure: the OI scrub thread was blocked in jbd2_journal_start() because the thread jbd2/dm-0-8 was committing the journal in jbd2_journal_commit_transaction(); but jbd2/dm-0-8 was in turn waiting for the thread mdt_007 to stop its current transaction. Unfortunately, mdt_007 was hung inside mdd_unlink waiting for the OSD PDO lock:

          01:03:59: [<ffffffff811bf647>] ? __find_get_block+0x97/0xe0
          01:03:59: [<ffffffffa0441a8a>] htree_node_lock_try+0x5aa/0x720 [ldiskfs]
          01:03:59: [<ffffffffa04234ae>] ldiskfs_htree_node_lock+0x9e/0xc0 [ldiskfs]
          01:03:59: [<ffffffffa042524f>] dx_probe+0x3cf/0x6f0 [ldiskfs]
          01:03:59: [<ffffffffa042580e>] ldiskfs_dx_find_entry+0x29e/0x2e0 [ldiskfs]
          01:03:59: [<ffffffffa0cb5d0f>] ? qsd_op_begin+0x5f/0xb40 [lquota]
          01:03:59: [<ffffffffa0425b6b>] ldiskfs_find_entry+0x31b/0x4a0 [ldiskfs]
          01:03:59: [<ffffffffa0d1543a>] osd_index_ea_delete+0x2ba/0xc90 [osd_ldiskfs]
          01:03:59: [<ffffffffa0fd0117>] lod_index_delete+0xa7/0x140 [lod]
          01:03:59: [<ffffffffa103758c>] __mdd_index_delete_only+0xdc/0x280 [mdd]
          01:03:59: [<ffffffffa1038500>] __mdd_index_delete+0x50/0x2c0 [mdd]
          01:03:59: [<ffffffffa103df5c>] mdd_unlink+0x81c/0xea0 [mdd]
          01:03:59: [<ffffffffa0efe008>] mdo_unlink+0x18/0x50 [mdt]
          01:03:59: [<ffffffffa0f080b6>] mdt_reint_unlink+0x9c6/0x10c0 [mdt]
          01:03:59: [<ffffffffa060e810>] ? lu_ucred+0x20/0x30 [obdclass]
          01:03:59: [<ffffffffa0efe09d>] mdt_reint_rec+0x5d/0x200 [mdt]
          01:03:59: [<ffffffffa0ee218b>] mdt_reint_internal+0x4cb/0x7a0 [mdt]
          01:03:59: [<ffffffffa0ee29eb>] mdt_reint+0x6b/0x120 [mdt]
          01:03:59: [<ffffffffa0873ade>] tgt_request_handle+0x6fe/0xaf0 [ptlrpc]
          01:03:59: [<ffffffffa0823411>] ptlrpc_main+0xe41/0x1950 [ptlrpc]
          01:03:59: [<ffffffffa08225d0>] ? ptlrpc_main+0x0/0x1950 [ptlrpc]
          01:03:59: [<ffffffff8109abf6>] kthread+0x96/0xa0
          01:03:59: [<ffffffff8100c20a>] child_rip+0xa/0x20
          01:03:59: [<ffffffff8109ab60>] ? kthread+0x0/0xa0
          01:03:59: [<ffffffff8100c200>] ? child_rip+0x0/0x20
          

          There were no more logs to indicate why mdt_007 was blocked by the PDO lock.


          yong.fan nasf (Inactive) added a comment:

          In the test:
          https://testing.hpdd.intel.com/test_sets/a1aa0448-7e7b-11e4-8b8b-5254006e85c2

          sanity-scrub test_6 failed for LU-6012, which has been fixed by the patch http://review.whamcloud.com/#/c/13020/. The build #2770 does not contain that patch, so please retry with the latest master.

          People

            wc-triage WC Triage
            maloo Maloo
            Votes: 0
            Watchers: 8

            Dates

              Created:
              Updated:
              Resolved: