[LU-6880] recovery timeout during 24 hours failover test

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.8.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      Recovery cannot finish in time during the 24-hour failover test. The test exited after 23 failovers:

      Server failover period: 600 seconds
      Exited after:           13229 seconds
      Number of failovers before exit:
      mds1: 2 times
      mds2: 7 times
      mds3: 1 times
      mds4: 1 times
      mds5: 3 times
      mds6: 3 times
      mds7: 3 times
      mds8: 3 times
      ost1: 0 times
      ost2: 0 times
      ost3: 0 times
      ost4: 0 times
      Status: FAIL: rc=7
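
As a cross-check of the figures above, the following is a minimal sketch that parses the summary text and totals the per-target failover counts. The function name `parse_summary` and the regular expressions are illustrative assumptions; they are not part of the Lustre test suite.

```python
import re

# Summary text copied from the description above.
SUMMARY = """\
Server failover period: 600 seconds
Exited after:           13229 seconds
Number of failovers before exit:
mds1: 2 times
mds2: 7 times
mds3: 1 times
mds4: 1 times
mds5: 3 times
mds6: 3 times
mds7: 3 times
mds8: 3 times
ost1: 0 times
ost2: 0 times
ost3: 0 times
ost4: 0 times
Status: FAIL: rc=7
"""


def parse_summary(text):
    """Return (period_s, elapsed_s, {target: failover_count}).

    Hypothetical helper: the line formats are inferred only from the
    summary shown in this ticket.
    """
    period = int(re.search(r"Server failover period:\s*(\d+)", text).group(1))
    elapsed = int(re.search(r"Exited after:\s*(\d+)", text).group(1))
    counts = {
        m.group(1): int(m.group(2))
        for m in re.finditer(r"^\s*(\w+):\s*(\d+) times$", text, re.M)
    }
    return period, elapsed, counts


period, elapsed, counts = parse_summary(SUMMARY)
total = sum(counts.values())
mds_total = sum(n for name, n in counts.items() if name.startswith("mds"))

print("total failovers:", total)              # 23
print("MDS failovers:", mds_total)            # 23, no OST was failed over
print("full periods elapsed:", elapsed // period)  # 22
```

The total matches the 23 failovers quoted in the description, and all of them were MDS failovers.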
      

          Activity

            Landed for 2.8.

            The comment above was added by Joseph Gmitter (jgmitter).

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15682/
            Subject: LU-6880 update: after reply move dtrq to finish list
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 2a874ec011e680f49405a7e901d8d0d35dcb4f1a

            The comment above was added by Gerrit Updater (gerrit).

            wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/15682
            Subject: LU-6880 update: after reply move dtrq to finish list
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 400355f4bc9e353d20638f3264ef3c80b799a5cb

            The comment above was added by Gerrit Updater (gerrit).

            tdtd-1        S 000000000000000a     0 22764      2 0x00000080
             ffff8807ef81fd80 0000000000000046 0000000000000000 0000000000000000
             000000060011bf89 00000000fffffff4 ffff8807ef81fd40 ffff8808125f7b30
             ffff8808309a5af8 ffff8807ef81ffd8 000000000000fbc8 ffff8808309a5af8
            Call Trace:
             [<ffffffffa08fc32d>] distribute_txn_commit_thread+0xfed/0x1750 [ptlrpc]
             [<ffffffff81061d12>] ? default_wake_function+0x12/0x20
             [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
             [<ffffffffa08fb340>] ? distribute_txn_commit_thread+0x0/0x1750 [ptlrpc]
             [<ffffffff8109abf6>] kthread+0x96/0xa0
             [<ffffffff8100c20a>] child_rip+0xa/0x20
             [<ffffffff8109ab60>] ? kthread+0x0/0xa0
             [<ffffffff8100c200>] ? child_rip+0x0/0x20
            lod0001_rec00 S 0000000000000007     0 22766      2 0x00000080
             ffff8807ef825910 0000000000000046 0000000000000000 ffff8807ef8258d4
             000021d557def61b 0000000000000286 ffff8807ef8258b0 ffffffff81083e1c
             ffff8807ef823ab8 ffff8807ef825fd8 000000000000fbc8 ffff8807ef823ab8
            Call Trace:
             [<ffffffff81083e1c>] ? lock_timer_base+0x3c/0x70
             [<ffffffff8152a512>] schedule_timeout+0x192/0x2e0
             [<ffffffff81083f30>] ? process_timeout+0x0/0x10
             [<ffffffffa0874f99>] ptlrpc_set_wait+0x319/0xa30 [ptlrpc]
             [<ffffffffa086a510>] ? ptlrpc_interrupted_set+0x0/0x110 [ptlrpc]
             [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
             [<ffffffffa08811a5>] ? lustre_msg_set_jobid+0xf5/0x130 [ptlrpc]
             [<ffffffffa0875731>] ptlrpc_queue_wait+0x81/0x220 [ptlrpc]
             [<ffffffffa10de271>] osp_remote_sync+0x121/0x190 [osp]
             [<ffffffffa10c289d>] osp_attr_get+0x40d/0x6c0 [osp]
             [<ffffffffa10c42a4>] osp_object_init+0x1b4/0x320 [osp]
             [<ffffffffa0657db8>] lu_object_alloc+0xd8/0x320 [obdclass]
             [<ffffffffa0659161>] lu_object_find_try+0x151/0x260 [obdclass]
             [<ffffffffa0659321>] lu_object_find_at+0xb1/0xe0 [obdclass]
             [<ffffffff8116ef30>] ? cache_alloc_refill+0x1c0/0x240
             [<ffffffffa065a1bc>] dt_locate_at+0x1c/0xa0 [obdclass]
             [<ffffffffa061934e>] llog_osd_get_cat_list+0x8e/0xcd0 [obdclass]
             [<ffffffffa0ff4bc0>] lod_sub_prep_llog+0x110/0x7b0 [lod]
             [<ffffffff81058bd3>] ? __wake_up+0x53/0x70
             [<ffffffffa0fc97f6>] lod_sub_recovery_thread+0x196/0xbc0 [lod]
             [<ffffffff81061d12>] ? default_wake_function+0x12/0x20
             [<ffffffffa0fc9660>] ? lod_sub_recovery_thread+0x0/0xbc0 [lod]
             [<ffffffff8109abf6>] kthread+0x96/0xa0
             [<ffffffff8100c20a>] child_rip+0xa/0x20
             [<ffffffff8109ab60>] ? kthread+0x0/0xa0
             [<ffffffff8100c200>] ? child_rip+0x0/0x20
            

            It looks like the log retrieval process (lod_sub_recovery_thread, waiting in ptlrpc_queue_wait under llog_osd_get_cat_list) is blocked by import recovery.

            The comment above was added by Di Wang (di.wang).
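
The lod0001_rec00 trace in Di Wang's comment can be read mechanically. In a Linux kernel stack dump, frames prefixed with `?` are addresses found on the stack that the unwinder could not confirm as part of the call chain, so dropping them leaves the actual chain. The sketch below does that; the regular expression and the helper name `reliable_frames` are illustrative assumptions, not an existing tool.

```python
import re

# lod0001_rec00 call trace, copied from the comment above.
TRACE = """\
 [<ffffffff81083e1c>] ? lock_timer_base+0x3c/0x70
 [<ffffffff8152a512>] schedule_timeout+0x192/0x2e0
 [<ffffffff81083f30>] ? process_timeout+0x0/0x10
 [<ffffffffa0874f99>] ptlrpc_set_wait+0x319/0xa30 [ptlrpc]
 [<ffffffffa086a510>] ? ptlrpc_interrupted_set+0x0/0x110 [ptlrpc]
 [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
 [<ffffffffa08811a5>] ? lustre_msg_set_jobid+0xf5/0x130 [ptlrpc]
 [<ffffffffa0875731>] ptlrpc_queue_wait+0x81/0x220 [ptlrpc]
 [<ffffffffa10de271>] osp_remote_sync+0x121/0x190 [osp]
 [<ffffffffa10c289d>] osp_attr_get+0x40d/0x6c0 [osp]
 [<ffffffffa10c42a4>] osp_object_init+0x1b4/0x320 [osp]
 [<ffffffffa0657db8>] lu_object_alloc+0xd8/0x320 [obdclass]
 [<ffffffffa0659161>] lu_object_find_try+0x151/0x260 [obdclass]
 [<ffffffffa0659321>] lu_object_find_at+0xb1/0xe0 [obdclass]
 [<ffffffff8116ef30>] ? cache_alloc_refill+0x1c0/0x240
 [<ffffffffa065a1bc>] dt_locate_at+0x1c/0xa0 [obdclass]
 [<ffffffffa061934e>] llog_osd_get_cat_list+0x8e/0xcd0 [obdclass]
 [<ffffffffa0ff4bc0>] lod_sub_prep_llog+0x110/0x7b0 [lod]
 [<ffffffff81058bd3>] ? __wake_up+0x53/0x70
 [<ffffffffa0fc97f6>] lod_sub_recovery_thread+0x196/0xbc0 [lod]
 [<ffffffff81061d12>] ? default_wake_function+0x12/0x20
 [<ffffffffa0fc9660>] ? lod_sub_recovery_thread+0x0/0xbc0 [lod]
 [<ffffffff8109abf6>] kthread+0x96/0xa0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109ab60>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
"""

# address, optional "?" marker, symbol+offset/size, optional [module]
FRAME = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def reliable_frames(trace):
    """Return [(symbol, module)] for frames not marked with '?'."""
    frames = []
    for line in trace.splitlines():
        m = FRAME.search(line)
        if m and not m.group(1):
            frames.append((m.group(2), m.group(3) or "kernel"))
    return frames


frames = reliable_frames(TRACE)
# Innermost frame first, as in the original dump.
for symbol, module in frames:
    print(f"{module:>9}  {symbol}")
```

Read from the bottom up, the filtered chain shows lod_sub_recovery_thread calling into llog_osd_get_cat_list, which ends in osp_remote_sync and ptlrpc_queue_wait, where the thread sleeps in schedule_timeout waiting for a remote reply.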

            People

              Assignee: Di Wang (di.wang)
              Reporter: Di Wang (di.wang)
              Votes: 0
              Watchers: 4