LU-2165: sanity test 64b hang with zfs


Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.4.0
    • Severity: 3
    • Rank (Obsolete): 5195

    Description

      Single-node FSTYPE=zfs sanity.sh is hanging for me:

      [ 4020.591245] Lustre: DEBUG MARKER: == sanity test 64b: check out-of-space detection on client ============= 18:00:03 (1350079203)
      [ 4301.008200] LNet: Service thread pid 29286 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [ 4301.018042] Pid: 29286, comm: ll_ost_io01_005
      [ 4301.018946]
      [ 4301.018947] Call Trace:
      [ 4301.019836]  [<ffffffff814fb5d1>] schedule_timeout+0x191/0x2e0
      [ 4301.021092]  [<ffffffff8107bcd0>] ? process_timeout+0x0/0x10
      [ 4301.022300]  [<ffffffffa09f56e0>] __cv_timedwait_common+0xc0/0x280 [spl]
      [ 4301.023736]  [<ffffffff814fc20c>] ? __mutex_lock_slowpath+0x21c/0x2c0
      [ 4301.025058]  [<ffffffff8108fd60>] ? autoremove_wake_function+0x0/0x40
      [ 4301.026319]  [<ffffffffa09f58d3>] __cv_timedwait+0x13/0x20 [spl]
      [ 4301.027593]  [<ffffffffa0a956b7>] txg_delay+0xf7/0x140 [zfs]
      [ 4301.028851]  [<ffffffffa0a6d5fe>] dsl_dir_tempreserve_space+0x1de/0x1f0 [zfs]
      [ 4301.030237]  [<ffffffffa0a5732d>] ? dmu_tx_wait+0x11d/0x120 [zfs]
      [ 4301.031719]  [<ffffffffa0a5740e>] ? dmu_tx_assign+0x8e/0x4b0 [zfs]
      [ 4301.033237]  [<ffffffffa08b955c>] ? osd_trans_start+0x9c/0x430 [osd_zfs]
      [ 4301.034868]  [<ffffffffa0cebe1d>] ? ofd_trans_start+0x22d/0x450 [ofd]
      [ 4301.036448]  [<ffffffffa0cf0772>] ? ofd_commitrw_write+0x652/0x12a0 [ofd]
      [ 4301.037807]  [<ffffffffa0cf1775>] ? ofd_commitrw+0x3b5/0x9c0 [ofd]
      [ 4301.039079]  [<ffffffffa0c2dbd8>] ? obd_commitrw+0x128/0x3d0 [ost]
      [ 4301.040326]  [<ffffffffa0c34e34>] ? ost_brw_write+0xd04/0x15d0 [ost]
      [ 4301.041592]  [<ffffffffa05042f0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
      [ 4301.042939]  [<ffffffffa0c3a250>] ? ost_handle+0x3120/0x4550 [ost]
      [ 4301.044167]  [<ffffffffa0e9b464>] ? libcfs_id2str+0x74/0xb0 [libcfs]
      [ 4301.045471]  [<ffffffffa0551483>] ? ptlrpc_server_handle_request+0x463/0xe70 [ptlrpc]
      [ 4301.047047]  [<ffffffffa0e8f66e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      [ 4301.048460]  [<ffffffffa054a171>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
      [ 4301.049727]  [<ffffffffa0e9f6d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      [ 4301.051178]  [<ffffffff81051f73>] ? __wake_up+0x53/0x70
      [ 4301.052242]  [<ffffffffa055401a>] ? ptlrpc_main+0xb9a/0x1960 [ptlrpc]
      [ 4301.053519]  [<ffffffffa0553480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
      [ 4301.054678]  [<ffffffff8100c14a>] ? child_rip+0xa/0x20
      [ 4301.055860]  [<ffffffffa0553480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
      [ 4301.057251]  [<ffffffffa0553480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
      [ 4301.058523]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
      [ 4301.059582]
      [ 4301.059918] LustreError: dumping log to /tmp/lustre-log.1350079484.29286
      [ 4307.044564] LNet: Service thread pid 29646 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [ 4307.049281] Pid: 29646, comm: ll_ost_io00_007
      [ 4307.050021]
      [ 4307.050022] Call Trace:
      [ 4307.050711]  [<ffffffff8100e530>] ? dump_trace+0x190/0x3b0
      [ 4307.051628]  [<ffffffff81174d23>] ? create_object+0x33/0x2c0
      [ 4307.052579]  [<ffffffff814fdb54>] ? _spin_lock_irqsave+0x24/0x30
      [ 4307.053624]  [<ffffffff81061580>] ? pick_next_task_fair+0xd0/0x130
      [ 4307.054684]  [<ffffffff814fb5d1>] schedule_timeout+0x191/0x2e0
      [ 4307.055661]  [<ffffffff8107bcd0>] ? process_timeout+0x0/0x10
      [ 4307.056646]  [<ffffffffa09f56e0>] __cv_timedwait_common+0xc0/0x280 [spl]
      [ 4307.057801]  [<ffffffff814fc20c>] ? __mutex_lock_slowpath+0x21c/0x2c0
      [ 4307.058899]  [<ffffffffa09ed0ca>] ? kmem_alloc_debug+0x13a/0x4c0 [spl]
      [ 4307.060010]  [<ffffffffa09f58d3>] __cv_timedwait+0x13/0x20 [spl]
      [ 4307.061059]  [<ffffffffa0a956b7>] txg_delay+0xf7/0x140 [zfs]
      [ 4307.062048]  [<ffffffffa0a6d5fe>] dsl_dir_tempreserve_space+0x1de/0x1f0 [zfs]
      [ 4307.063270]  [<ffffffffa0a577fb>] dmu_tx_assign+0x47b/0x4b0 [zfs]
      [ 4307.064295]  [<ffffffffa08b955c>] osd_trans_start+0x9c/0x430 [osd_zfs]
      [ 4307.065440]  [<ffffffffa0cebe1d>] ofd_trans_start+0x22d/0x450 [ofd]
      [ 4307.066507]  [<ffffffffa0cf0772>] ofd_commitrw_write+0x652/0x12a0 [ofd]
      [ 4307.067656]  [<ffffffffa0cf1775>] ofd_commitrw+0x3b5/0x9c0 [ofd]
      [ 4307.068690]  [<ffffffffa0c2dbd8>] obd_commitrw+0x128/0x3d0 [ost]
      [ 4307.069707]  [<ffffffffa0c34e34>] ost_brw_write+0xd04/0x15d0 [ost]
      [ 4307.070777]  [<ffffffffa05042f0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
      [ 4307.071911]  [<ffffffffa0c3a250>] ost_handle+0x3120/0x4550 [ost]
      [ 4307.072975]  [<ffffffffa0e9b464>] ? libcfs_id2str+0x74/0xb0 [libcfs]
      [ 4307.074082]  [<ffffffffa0551483>] ptlrpc_server_handle_request+0x463/0xe70 [ptlrpc]
      [ 4307.075370]  [<ffffffffa0e8f66e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      [ 4307.076445]  [<ffffffffa054a171>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
      [ 4307.077561]  [<ffffffffa0e9f6d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      [ 4307.078655]  [<ffffffff81051f73>] ? __wake_up+0x53/0x70
      [ 4307.079601]  [<ffffffffa055401a>] ptlrpc_main+0xb9a/0x1960 [ptlrpc]
      [ 4307.080673]  [<ffffffffa0553480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
      [ 4307.081885]  [<ffffffff8100c14a>] child_rip+0xa/0x20
      [ 4307.082707]  [<ffffffffa0553480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
      [ 4307.083715]  [<ffffffffa0553480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
      [ 4307.084911]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
      [ 4307.085896]
      ...
      [ 4914.044577] Lustre: 29335:0:(service.c:1266:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-207), not sending early reply
      [ 4914.044580]   req@ffff8801a51c9bf0 x1415657954602061/t0(0) o4->5dc665da-eb3e-3cef-f2b8-d3318e082c88@0@lo:0/0 lens 488/448 e 5 to 0 dl 1350080102 ref 2 fl Interpret:/0/0 rc 0/0
      [ 5024.072116] Lustre: 4182:0:(client.c:1909:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1350079284/real 1350079284]  req@ffff88016fb9fbf0 x1415657954602058/t0(0) o4->lustre-OST0001-osc-ffff88023539fbf0@0@lo:6/4 lens 488/448 e 5 to 1 dl 1350080207 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
      [ 5024.080010] Lustre: lustre-OST0001-osc-ffff88023539fbf0: Connection to lustre-OST0001 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
      [ 5024.083451] Lustre: lustre-OST0001: Client 5dc665da-eb3e-3cef-f2b8-d3318e082c88 (at 0@lo) reconnecting
      [ 5024.087183] Lustre: lustre-OST0001: Client 5dc665da-eb3e-3cef-f2b8-d3318e082c88 (at 0@lo) refused reconnection, still busy with 1 active RPCs
      [ 5024.090527] LustreError: 11-0: an error occurred while communicating with 0@lo. The ost_connect operation failed with -16
      

      And it never completes.
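
      For reference, a minimal way to attempt the reproduction. This is a
      sketch assuming the standard in-tree lustre/tests scripts (llmount.sh,
      sanity.sh) and the usual ONLY= test-framework variable; adjust paths
      and variables for your build tree.

      # Format and mount a single-node, ZFS-backed test filesystem
      # (sketch; assumes the in-tree lustre/tests layout).
      cd lustre/tests
      FSTYPE=zfs sh llmount.sh

      # Run only test_64b ("check out-of-space detection on client"),
      # the test during which the OST I/O threads stall in txg_delay above.
      FSTYPE=zfs ONLY=64b sh sanity.sh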


People

    Assignee: utopiabound (Nathaniel Clark)
    Reporter: green (Oleg Drokin)
    Votes: 0
    Watchers: 5

Dates

    Created:
    Updated:
    Resolved: