Lustre / LU-6829

runtests test_1: MDS hung


Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • Affects Version: Lustre 2.8.0
    • Fix Version: Lustre 2.8.0
    • Labels: None
    • Environment: client and server: lustre-master build # 3094 RHEL7 DNE
    • Severity: 3

    Description

      This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/24729d7e-2640-11e5-8b33-5254006e85c2.

      The sub-test test_1 failed with the following error:

      test failed to respond and timed out
      

      MDS dmesg

      [ 1244.736431] Lustre: DEBUG MARKER: /usr/sbin/lctl mark copying 880 files from \/etc \/bin to \/mnt\/lustre\/d1.runtests\/etc \/bin at Wed Jul  8 20:30:42 UTC 2015
      [ 1244.857091] Lustre: DEBUG MARKER: copying 880 files from /etc /bin to /mnt/lustre/d1.runtests/etc /bin at Wed Jul 8 20:30:42 UTC 2015
      [ 1252.974179] Lustre: 2786:0:(client.c:2018:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1436387443/real 1436387443]  req@ffff880069257300 x1506161297721592/t0(0) o400->lustre-MDT0002-osp-MDT0000@10.1.5.252@tcp:24/4 lens 224/224 e 0 to 1 dl 1436387450 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [ 1252.979486] Lustre: lustre-MDT0002-osp-MDT0000: Connection to lustre-MDT0002 (at 10.1.5.252@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [ 1257.239137] Lustre: 2785:0:(client.c:2018:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1436387448/real 1436387448]  req@ffff880068c86700 x1506161297721644/t0(0) o400->lustre-MDT0002-osp-MDT0000@10.1.5.252@tcp:24/4 lens 224/224 e 0 to 1 dl 1436387455 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [ 1257.243885] Lustre: 2785:0:(client.c:2018:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
      [ 1258.983125] Lustre: 2784:0:(client.c:2018:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1436387450/real 1436387450]  req@ffff88006912db00 x1506161297721660/t0(0) o38->lustre-MDT0003-osp-MDT0000@10.1.5.252@tcp:24/4 lens 520/544 e 0 to 1 dl 1436387456 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [ 1258.988595] Lustre: 2784:0:(client.c:2018:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
      [ 1273.982152] Lustre: 2784:0:(client.c:2018:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1436387460/real 1436387460]  req@ffff880069d87300 x1506161297721744/t0(0) o38->lustre-MDT0001-osp-MDT0000@10.1.5.252@tcp:24/4 lens 520/544 e 0 to 1 dl 1436387471 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [ 1273.987550] Lustre: 2784:0:(client.c:2018:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
      [ 1285.839237] Lustre: 2815:0:(client.c:2018:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1436387476/real 1436387476]  req@ffff880069d87300 x1506161297721892/t0(0) o104->lustre-MDT0000@10.1.5.252@tcp:15/16 lens 296/224 e 0 to 1 dl 1436387483 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
      [ 1285.844932] Lustre: 2815:0:(client.c:2018:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
      [ 1292.205644] Lustre: lustre-MDT0000: haven't heard from client lustre-MDT0000-lwp-MDT0001_UUID (at 10.1.5.252@tcp) in 49 seconds. I think it's dead, and I am evicting it. exp ffff88007bc28400, cur 1436387490 expire 1436387460 last 1436387441
      [ 1293.982194] Lustre: 2784:0:(client.c:2018:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1436387475/real 1436387475]  req@ffff880069d85b00 x1506161297721880/t0(0) o38->lustre-MDT0001-osp-MDT0000@10.1.5.252@tcp:24/4 lens 520/544 e 0 to 1 dl 1436387491 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [ 1318.879183] LNet: Service thread pid 2815 was inactive for 40.04s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [ 1318.882589] Pid: 2815, comm: mdt00_001
      [ 1318.883358] Call Trace:
      [ 1318.884177]  [<ffffffff8160a409>] schedule+0x29/0x70
      [ 1318.884703]  [<ffffffff816082b5>] schedule_timeout+0x175/0x2d0
      [ 1318.885134]  [<ffffffff8107ee80>] ? process_timeout+0x0/0x10
      [ 1318.885707]  [<ffffffffa09a8320>] ? ptlrpc_interrupted_set+0x0/0x120 [ptlrpc]
      [ 1318.886241]  [<ffffffffa09b2978>] ptlrpc_set_wait+0x4b8/0x9e0 [ptlrpc]
      [ 1318.886787]  [<ffffffff810a9650>] ? default_wake_function+0x0/0x20
      [ 1318.887253]  [<ffffffffa09b2f1d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
      [ 1318.887764]  [<ffffffffa0f433c5>] osp_remote_sync+0xd5/0x1b0 [osp]
      [ 1318.888210]  [<ffffffffa0f2a25d>] osp_attr_get+0x41d/0x690 [osp]
      [ 1318.888728]  [<ffffffffa0f25ab4>] osp_object_init+0x114/0x280 [osp]
      [ 1318.889220]  [<ffffffffa076058f>] lu_object_alloc+0xdf/0x310 [obdclass]
      [ 1318.889785]  [<ffffffffa076099c>] lu_object_find_try+0x17c/0x2c0 [obdclass]
      [ 1318.890295]  [<ffffffffa0760b8c>] lu_object_find_at+0xac/0xe0 [obdclass]
      [ 1318.890940]  [<ffffffffa0e6ebb5>] ? lod_index_lookup+0x25/0x30 [lod]
      [ 1318.891407]  [<ffffffffa0ecc307>] ? __mdd_lookup.isra.17+0x317/0x440 [mdd]
      [ 1318.891993]  [<ffffffffa0760bd6>] lu_object_find+0x16/0x20 [obdclass]
      [ 1318.892575]  [<ffffffffa0d8a73b>] mdt_object_find+0x4b/0x170 [mdt]
      [ 1318.893146]  [<ffffffffa0db0726>] mdt_reint_open+0x2776/0x2d50 [mdt]
      [ 1318.893624]  [<ffffffffa07795f9>] ? upcall_cache_get_entry+0x3e9/0x8e0 [obdclass]
      [ 1318.894169]  [<ffffffff812ddd72>] ? strlcpy+0x42/0x60
      [ 1318.894571]  [<ffffffffa0da46e0>] mdt_reint_rec+0x80/0x210 [mdt]
      [ 1318.894985]  [<ffffffffa0d884e9>] mdt_reint_internal+0x5d9/0xb20 [mdt]
      [ 1318.895456]  [<ffffffffa0d88b92>] mdt_intent_reint+0x162/0x410 [mdt]
      [ 1318.896056]  [<ffffffffa0d921ea>] mdt_intent_policy+0x57a/0xb30 [mdt]
      [ 1318.896619]  [<ffffffffa0970223>] ldlm_lock_enqueue+0x353/0x930 [ptlrpc]
      [ 1318.897117]  [<ffffffffa0998da2>] ldlm_handle_enqueue0+0x4f2/0x16e0 [ptlrpc]
      [ 1318.897725]  [<ffffffffa09c0360>] ? lustre_swab_ldlm_request+0x0/0x30 [ptlrpc]
      [ 1318.898274]  [<ffffffffa0a1eed2>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [ 1318.898762]  [<ffffffffa0a2329b>] tgt_request_handle+0x88b/0x1100 [ptlrpc]
      [ 1318.899278]  [<ffffffffa09cafbb>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
      [ 1318.899926]  [<ffffffffa09c8078>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      [ 1318.900416]  [<ffffffff810a9662>] ? default_wake_function+0x12/0x20
      [ 1318.900956]  [<ffffffff810a0898>] ? __wake_up_common+0x58/0x90
      [ 1318.901397]  [<ffffffffa09ce900>] ptlrpc_main+0xc00/0x1f60 [ptlrpc]
      [ 1318.901953]  [<ffffffffa09cdd00>] ? ptlrpc_main+0x0/0x1f60 [ptlrpc]
      [ 1318.902407]  [<ffffffff8109739f>] kthread+0xcf/0xe0
      [ 1318.902851]  [<ffffffff810972d0>] ? kthread+0x0/0xe0
      [ 1318.903215]  [<ffffffff81614f7c>] ret_from_fork+0x7c/0xb0
      [ 1318.903640]  [<ffffffff810972d0>] ? kthread+0x0/0xe0
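
      A quick sanity check of the timeout and eviction arithmetic in the dmesg above (a small sketch; the epoch-second timestamps are copied verbatim from the log lines, and the reading of the sent/dl and cur/expire/last fields is the usual ptlrpc/ping-evictor bookkeeping, not something stated in the log itself):

      ```python
      # Timestamps copied from the MDS dmesg above (epoch seconds).
      sent = 1436387443      # first o400 ping: "[sent 1436387443/real 1436387443]"
      deadline = 1436387450  # its "dl 1436387450" field

      # Window the ping was given before being declared "timed out for slow reply".
      print(deadline - sent)  # 7

      last = 1436387441    # "last 1436387441": last message heard from the lwp export
      expire = 1436387460  # "expire 1436387460": deadline for that export
      cur = 1436387490     # "cur 1436387490": time of the eviction check

      # Matches "haven't heard from client ... in 49 seconds".
      print(cur - last)    # 49

      # Past the expire deadline, so the export is evicted.
      print(cur > expire)  # True
      ```

      The numbers are self-consistent: the MDT0002/MDT0003 connections go silent around 1436387441, the pings time out on 7-second deadlines, and 49 seconds later the ping evictor drops the lwp export, while the mdt00_001 service thread stays blocked in ptlrpc_set_wait waiting on the dead peer.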
      
      

            People

              di.wang Di Wang
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

Dates

    Created:
    Updated:
    Resolved: