Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-6927

recovery_small test_66 failure

    XMLWordPrintable

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Critical
    • None
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      This seems to be hanging for me relatively frequently:

      == recovery-small test 66: lock enqueue re-send vs client eviction == 21:24:18 (1438219458)
      fail_loc=0x8000030c
        File: `/mnt/lustre/f66.recovery-small'
        Size: 0         	Blocks: 0          IO Block: 4194304 regular empty file
      Device: 2c54f966h/743766374d	Inode: 144115809252278274  Links: 1
      Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
      Access: 2015-07-29 21:24:19.000000000 -0400
      Modify: 2015-07-29 21:24:19.000000000 -0400
      Change: 2015-07-29 21:24:19.000000000 -0400
      fail_loc=0
      fail_loc=0x80000136
      mdc.lustre-MDT0000-mdc-ffff8800815b17f0.import=connection=192.168.10.211@tcp
      ./../tests/test-framework.sh: line 2399: /mnt/lustre/recon: Input/output error
      

      In dmesg we get:

      [195769.388166] Lustre: DEBUG MARKER: == recovery-small test 66: lock enqueue re-send vs client eviction == 21:24:18 (1438219458)
      [195773.859644] Lustre: 17831:0:(genops.c:1491:obd_export_evict_by_uuid()) lustre-MDT0000: evicting f797195c-b579-5409-6da6-f79d52d8f961 at adminstrative request
      [195773.860464] Lustre: 17831:0:(genops.c:1491:obd_export_evict_by_uuid()) Skipped 1 previous similar message
      [195773.906764] LustreError: 17610:0:(fail.c:133:__cfs_fail_timeout_set()) cfs_fail_timeout id 136 sleeping for 40000ms
      [195809.905048] Lustre: lustre-OST0001: haven't heard from client 9007d4b0-ffa9-1973-fae0-b8ec6df9deb5 (at 0@lo) in 48 seconds. I think it's dead, and I am evicting it. exp ffff88003bea07f0, cur 1438219499 expire 1438219469 last 1438219451
      [195813.904060] LustreError: 17610:0:(fail.c:137:__cfs_fail_timeout_set()) cfs_fail_timeout id 136 awake
      [195813.904062] LNet: Service thread pid 17610 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [195813.904064] Pid: 17610, comm: mdt01_003
      [195813.904065] 
      [195813.904065] Call Trace:
      [195813.904075]  [<ffffffff81522574>] ? _spin_lock_irqsave+0x24/0x30
      [195813.904079]  [<ffffffff810866a2>] ? del_timer_sync+0x22/0x30
      [195813.904081]  [<ffffffff8151f279>] ? schedule_timeout+0x199/0x2e0
      [195813.904091]  [<ffffffffa0909c41>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      [195813.904102]  [<ffffffffa0906c4c>] ? __cfs_fail_timeout_set+0x13c/0x160 [libcfs]
      [195813.904116]  [<ffffffffa0c113a5>] ? mdt_getattr_name_lock+0x475/0x1910 [mdt]
      [195813.904125]  [<ffffffffa0c16cd3>] ? old_init_ucred+0x1e3/0x390 [mdt]
      [195813.904133]  [<ffffffffa0c12d62>] ? mdt_intent_getattr+0x292/0x470 [mdt]
      [195813.904141]  [<ffffffffa0c02674>] ? mdt_intent_policy+0x494/0xc40 [mdt]
      [195813.904169]  [<ffffffffa053ef6f>] ? ldlm_lock_enqueue+0x12f/0x860 [ptlrpc]
      [195813.904194]  [<ffffffffa056ab97>] ? ldlm_handle_enqueue0+0x7e7/0x1520 [ptlrpc]
      [195813.904227]  [<ffffffffa05f1111>] ? tgt_enqueue+0x61/0x230 [ptlrpc]
      [195813.904256]  [<ffffffffa05f1d22>] ? tgt_request_handle+0xa42/0x1230 [ptlrpc]
      [195813.904282]  [<ffffffffa059d5e4>] ? ptlrpc_main+0xd74/0x1850 [ptlrpc]
      [195813.904284]  [<ffffffff8152240e>] ? _spin_unlock_irq+0xe/0x20
      [195813.904287]  [<ffffffff81061829>] ? finish_task_switch+0x59/0xe0
      [195813.904289]  [<ffffffff8151e3ee>] ? thread_return+0x4e/0x7e0
      [195813.904314]  [<ffffffffa059c870>] ? ptlrpc_main+0x0/0x1850 [ptlrpc]
      [195813.904317]  [<ffffffff8109ce4e>] ? kthread+0x9e/0xc0
      [195813.904319]  [<ffffffff8100c24a>] ? child_rip+0xa/0x20
      [195813.904321]  [<ffffffff8109cdb0>] ? kthread+0x0/0xc0
      [195813.904322]  [<ffffffff8100c240>] ? child_rip+0x0/0x20
      [195813.904323] 
      [195813.904325] LustreError: dumping log to /tmp/lustre-log.1438219503.17610
      [195813.913595] LNet: Service thread pid 17610 completed after 40.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      

      Attachments

        Activity

          People

            ashehata Amir Shehata (Inactive)
            green Oleg Drokin
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: