[LU-2165] sanity test 64b hang with zfs Created: 12/Oct/12 Updated: 01/Jul/15 Resolved: 01/Jul/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Oleg Drokin | Assignee: | Nathaniel Clark |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | zfs | ||
| Severity: | 3 |
| Rank (Obsolete): | 5195 |
| Description |
|
single node FSTYPE=zfs sanity.sh is hanging for me:

[ 4020.591245] Lustre: DEBUG MARKER: == sanity test 64b: check out-of-space detection on client ============= 18:00:03 (1350079203)
[ 4301.008200] LNet: Service thread pid 29286 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 4301.018042] Pid: 29286, comm: ll_ost_io01_005
[ 4301.018946]
[ 4301.018947] Call Trace:
[ 4301.019836] [<ffffffff814fb5d1>] schedule_timeout+0x191/0x2e0
[ 4301.021092] [<ffffffff8107bcd0>] ? process_timeout+0x0/0x10
[ 4301.022300] [<ffffffffa09f56e0>] __cv_timedwait_common+0xc0/0x280 [spl]
[ 4301.023736] [<ffffffff814fc20c>] ? __mutex_lock_slowpath+0x21c/0x2c0
[ 4301.025058] [<ffffffff8108fd60>] ? autoremove_wake_function+0x0/0x40
[ 4301.026319] [<ffffffffa09f58d3>] __cv_timedwait+0x13/0x20 [spl]
[ 4301.027593] [<ffffffffa0a956b7>] txg_delay+0xf7/0x140 [zfs]
[ 4301.028851] [<ffffffffa0a6d5fe>] dsl_dir_tempreserve_space+0x1de/0x1f0 [zfs]
[ 4301.030237] [<ffffffffa0a5732d>] ? dmu_tx_wait+0x11d/0x120 [zfs]
[ 4301.031719] [<ffffffffa0a5740e>] ? dmu_tx_assign+0x8e/0x4b0 [zfs]
[ 4301.033237] [<ffffffffa08b955c>] ? osd_trans_start+0x9c/0x430 [osd_zfs]
[ 4301.034868] [<ffffffffa0cebe1d>] ? ofd_trans_start+0x22d/0x450 [ofd]
[ 4301.036448] [<ffffffffa0cf0772>] ? ofd_commitrw_write+0x652/0x12a0 [ofd]
[ 4301.037807] [<ffffffffa0cf1775>] ? ofd_commitrw+0x3b5/0x9c0 [ofd]
[ 4301.039079] [<ffffffffa0c2dbd8>] ? obd_commitrw+0x128/0x3d0 [ost]
[ 4301.040326] [<ffffffffa0c34e34>] ? ost_brw_write+0xd04/0x15d0 [ost]
[ 4301.041592] [<ffffffffa05042f0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
[ 4301.042939] [<ffffffffa0c3a250>] ? ost_handle+0x3120/0x4550 [ost]
[ 4301.044167] [<ffffffffa0e9b464>] ? libcfs_id2str+0x74/0xb0 [libcfs]
[ 4301.045471] [<ffffffffa0551483>] ? ptlrpc_server_handle_request+0x463/0xe70 [ptlrpc]
[ 4301.047047] [<ffffffffa0e8f66e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[ 4301.048460] [<ffffffffa054a171>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
[ 4301.049727] [<ffffffffa0e9f6d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[ 4301.051178] [<ffffffff81051f73>] ? __wake_up+0x53/0x70
[ 4301.052242] [<ffffffffa055401a>] ? ptlrpc_main+0xb9a/0x1960 [ptlrpc]
[ 4301.053519] [<ffffffffa0553480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[ 4301.054678] [<ffffffff8100c14a>] ? child_rip+0xa/0x20
[ 4301.055860] [<ffffffffa0553480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[ 4301.057251] [<ffffffffa0553480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[ 4301.058523] [<ffffffff8100c140>] ? child_rip+0x0/0x20
[ 4301.059582]
[ 4301.059918] LustreError: dumping log to /tmp/lustre-log.1350079484.29286
[ 4307.044564] LNet: Service thread pid 29646 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 4307.049281] Pid: 29646, comm: ll_ost_io00_007
[ 4307.050021]
[ 4307.050022] Call Trace:
[ 4307.050711] [<ffffffff8100e530>] ? dump_trace+0x190/0x3b0
[ 4307.051628] [<ffffffff81174d23>] ? create_object+0x33/0x2c0
[ 4307.052579] [<ffffffff814fdb54>] ? _spin_lock_irqsave+0x24/0x30
[ 4307.053624] [<ffffffff81061580>] ? pick_next_task_fair+0xd0/0x130
[ 4307.054684] [<ffffffff814fb5d1>] schedule_timeout+0x191/0x2e0
[ 4307.055661] [<ffffffff8107bcd0>] ? process_timeout+0x0/0x10
[ 4307.056646] [<ffffffffa09f56e0>] __cv_timedwait_common+0xc0/0x280 [spl]
[ 4307.057801] [<ffffffff814fc20c>] ? __mutex_lock_slowpath+0x21c/0x2c0
[ 4307.058899] [<ffffffffa09ed0ca>] ? kmem_alloc_debug+0x13a/0x4c0 [spl]
[ 4307.060010] [<ffffffffa09f58d3>] __cv_timedwait+0x13/0x20 [spl]
[ 4307.061059] [<ffffffffa0a956b7>] txg_delay+0xf7/0x140 [zfs]
[ 4307.062048] [<ffffffffa0a6d5fe>] dsl_dir_tempreserve_space+0x1de/0x1f0 [zfs]
[ 4307.063270] [<ffffffffa0a577fb>] dmu_tx_assign+0x47b/0x4b0 [zfs]
[ 4307.064295] [<ffffffffa08b955c>] osd_trans_start+0x9c/0x430 [osd_zfs]
[ 4307.065440] [<ffffffffa0cebe1d>] ofd_trans_start+0x22d/0x450 [ofd]
[ 4307.066507] [<ffffffffa0cf0772>] ofd_commitrw_write+0x652/0x12a0 [ofd]
[ 4307.067656] [<ffffffffa0cf1775>] ofd_commitrw+0x3b5/0x9c0 [ofd]
[ 4307.068690] [<ffffffffa0c2dbd8>] obd_commitrw+0x128/0x3d0 [ost]
[ 4307.069707] [<ffffffffa0c34e34>] ost_brw_write+0xd04/0x15d0 [ost]
[ 4307.070777] [<ffffffffa05042f0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
[ 4307.071911] [<ffffffffa0c3a250>] ost_handle+0x3120/0x4550 [ost]
[ 4307.072975] [<ffffffffa0e9b464>] ? libcfs_id2str+0x74/0xb0 [libcfs]
[ 4307.074082] [<ffffffffa0551483>] ptlrpc_server_handle_request+0x463/0xe70 [ptlrpc]
[ 4307.075370] [<ffffffffa0e8f66e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[ 4307.076445] [<ffffffffa054a171>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
[ 4307.077561] [<ffffffffa0e9f6d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[ 4307.078655] [<ffffffff81051f73>] ? __wake_up+0x53/0x70
[ 4307.079601] [<ffffffffa055401a>] ptlrpc_main+0xb9a/0x1960 [ptlrpc]
[ 4307.080673] [<ffffffffa0553480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[ 4307.081885] [<ffffffff8100c14a>] child_rip+0xa/0x20
[ 4307.082707] [<ffffffffa0553480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[ 4307.083715] [<ffffffffa0553480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[ 4307.084911] [<ffffffff8100c140>] ? child_rip+0x0/0x20
[ 4307.085896]
...
[ 4914.044577] Lustre: 29335:0:(service.c:1266:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-207), not sending early reply
[ 4914.044580] req@ffff8801a51c9bf0 x1415657954602061/t0(0) o4->5dc665da-eb3e-3cef-f2b8-d3318e082c88@0@lo:0/0 lens 488/448 e 5 to 0 dl 1350080102 ref 2 fl Interpret:/0/0 rc 0/0
[ 5024.072116] Lustre: 4182:0:(client.c:1909:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1350079284/real 1350079284] req@ffff88016fb9fbf0 x1415657954602058/t0(0) o4->lustre-OST0001-osc-ffff88023539fbf0@0@lo:6/4 lens 488/448 e 5 to 1 dl 1350080207 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
[ 5024.080010] Lustre: lustre-OST0001-osc-ffff88023539fbf0: Connection to lustre-OST0001 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
[ 5024.083451] Lustre: lustre-OST0001: Client 5dc665da-eb3e-3cef-f2b8-d3318e082c88 (at 0@lo) reconnecting
[ 5024.087183] Lustre: lustre-OST0001: Client 5dc665da-eb3e-3cef-f2b8-d3318e082c88 (at 0@lo) refused reconnection, still busy with 1 active RPCs
[ 5024.090527] LustreError: 11-0: an error occurred while communicating with 0@lo. The ost_connect operation failed with -16

And it never completes. |
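For reference, a single-node run of this test against a ZFS backend is typically invoked roughly as follows; the exact directory and helper script depend on the local installation, so this is an assumed sketch rather than the exact command used above:

    # assumed single-node setup; paths and variables depend on the local installation
    cd lustre/tests
    FSTYPE=zfs sh llmount.sh           # format and mount a single-node ZFS-backed test filesystem
    FSTYPE=zfs ONLY=64b sh sanity.sh   # run only sanity test 64b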
| Comments |
| Comment by Oleg Drokin [ 12/Oct/12 ] |
|
This one seems to be fully repeatable for me with: |
| Comment by Oleg Drokin [ 13/Oct/12 ] |
|
Hm, in addition, this only seems to be fully repeatable when I have kmemleak enabled and running (which puts some extra memory pressure on things):

[ 2148.341196] Lustre: DEBUG MARKER: == sanity test 64b: check out-of-space detection on client ============= 19:51:42 (1350172302)
[ 2430.988224] LNet: Service thread pid 21584 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 2430.991114] Pid: 21584, comm: ll_ost_io01_008
[ 2430.991874]
[ 2430.991874] Call Trace:
[ 2430.992557] [<ffffffff8100e530>] ? dump_trace+0x190/0x3b0
[ 2430.993560] [<ffffffff81174d23>] ? create_object+0x33/0x2c0
[ 2430.994549] [<ffffffff81174e9c>] ? create_object+0x1ac/0x2c0
[ 2430.996102] [<ffffffffa09f5984>] ? cv_wait_common+0xa4/0x1b0 [spl]
[ 2430.997777] [<ffffffffa09f58d3>] ? __cv_timedwait+0x13/0x20 [spl]
[ 2430.999375] [<ffffffffa0a956b7>] ? txg_delay+0xf7/0x140 [zfs]
[ 2431.000944] [<ffffffffa0a6d5fe>] ? dsl_dir_tempreserve_space+0x1de/0x1f0 [zfs]
[ 2431.002927] [<ffffffffa0a575f8>] ? dmu_tx_assign+0x278/0x4b0 [zfs]
[ 2431.004618] [<ffffffffa0b5655c>] ? osd_trans_start+0x9c/0x430 [osd_zfs]
[ 2431.006422] [<ffffffffa0d49e1d>] ? ofd_trans_start+0x22d/0x450 [ofd]
[ 2431.008114] [<ffffffffa0d4e772>] ? ofd_commitrw_write+0x652/0x12a0 [ofd]
[ 2431.009985] [<ffffffffa0d4f775>] ? ofd_commitrw+0x3b5/0x9c0 [ofd]
[ 2431.011623] [<ffffffffa0caabd8>] ? obd_commitrw+0x128/0x3d0 [ost]
[ 2431.013192] [<ffffffffa0cb1e34>] ? ost_brw_write+0xd04/0x15d0 [ost]
[ 2431.014756] [<ffffffffa05032f0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
[ 2431.016372] [<ffffffffa0cb7250>] ? ost_handle+0x3120/0x4550 [ost]
[ 2431.017889] [<ffffffffa01f4464>] ? libcfs_id2str+0x74/0xb0 [libcfs]
[ 2431.019396] [<ffffffffa0550483>] ? ptlrpc_server_handle_request+0x463/0xe70 [ptlrpc]
[ 2431.021231] [<ffffffffa01e866e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[ 2431.022757] [<ffffffffa0549171>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
[ 2431.024408] [<ffffffffa01f86d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[ 2431.025997] [<ffffffff81051f73>] ? __wake_up+0x53/0x70
[ 2431.027116] [<ffffffffa055301a>] ? ptlrpc_main+0xb9a/0x1960 [ptlrpc]
[ 2431.028214] [<ffffffffa0552480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[ 2431.029378] [<ffffffff8100c14a>] ? child_rip+0xa/0x20
[ 2431.030288] [<ffffffffa0552480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[ 2431.031413] [<ffffffffa0552480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[ 2431.032518] [<ffffffff8100c140>] ? child_rip+0x0/0x20
[ 2431.033434]
[ 2431.033718] LustreError: dumping log to /tmp/lustre-log.1350172585.21584
[ 2434.700279] LNet: Service thread pid 3314 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 2434.704227] Pid: 3314, comm: ll_ost_io00_001
[ 2434.705219]
[ 2434.705220] Call Trace:
[ 2434.706092] [<ffffffff81164533>] ? __kmalloc+0x1c3/0x2f0
[ 2434.707182] [<ffffffff814fc20c>] ? __mutex_lock_slowpath+0x21c/0x2c0
[ 2434.708537] [<ffffffffa09ea85b>] kmem_free_debug+0x4b/0x150 [spl]
[ 2434.709854] [<ffffffffa09f58d3>] ? __cv_timedwait+0x13/0x20 [spl]
[ 2434.711112] [<ffffffffa0a956b7>] ? txg_delay+0xf7/0x140 [zfs]
[ 2434.712350] [<ffffffffa0a952d5>] ? txg_wait_open+0x125/0x130 [zfs]
[ 2434.713676] [<ffffffffa0a6d5fe>] ? dsl_dir_tempreserve_space+0x1de/0x1f0 [zfs]
[ 2434.715165] [<ffffffffa0a577fb>] ? dmu_tx_assign+0x47b/0x4b0 [zfs]
[ 2434.716524] [<ffffffffa0b5655c>] ? osd_trans_start+0x9c/0x430 [osd_zfs]
[ 2434.717950] [<ffffffffa0d49e1d>] ? ofd_trans_start+0x22d/0x450 [ofd]
[ 2434.719345] [<ffffffffa0d4e772>] ? ofd_commitrw_write+0x652/0x12a0 [ofd]
[ 2434.720870] [<ffffffffa0d4f775>] ? ofd_commitrw+0x3b5/0x9c0 [ofd]
[ 2434.722277] [<ffffffffa0caabd8>] ? obd_commitrw+0x128/0x3d0 [ost]
[ 2434.723621] [<ffffffffa0cb1e34>] ? ost_brw_write+0xd04/0x15d0 [ost]
[ 2434.725081] [<ffffffffa05032f0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
[ 2434.726601] [<ffffffffa0cb7250>] ? ost_handle+0x3120/0x4550 [ost]
[ 2434.727907] [<ffffffffa01f4464>] ? libcfs_id2str+0x74/0xb0 [libcfs]
[ 2434.729364] [<ffffffffa0550483>] ? ptlrpc_server_handle_request+0x463/0xe70 [ptlrpc]
[ 2434.731075] [<ffffffffa01e866e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[ 2434.732348] [<ffffffffa0549171>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
[ 2434.733622] [<ffffffffa01f86d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[ 2434.734745] [<ffffffff81051f73>] ? __wake_up+0x53/0x70
[ 2434.735682] [<ffffffffa055301a>] ? ptlrpc_main+0xb9a/0x1960 [ptlrpc]
[ 2434.736831] [<ffffffffa0552480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[ 2434.737889] [<ffffffff8100c14a>] ? child_rip+0xa/0x20
[ 2434.738787] [<ffffffffa0552480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[ 2434.739860] [<ffffffffa0552480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[ 2434.741105] [<ffffffff8100c140>] ? child_rip+0x0/0x20
[ 2434.742165]
[ 2434.742496] LustreError: dumping log to /tmp/lustre-log.1350172588.3314
[ 2519.305054] kmemleak: 998 new suspected memory leaks (see /sys/kernel/debug/kmemleak) |
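For context, the "998 new suspected memory leaks" message comes from the kernel's kmemleak debugfs interface. Assuming a kernel built with CONFIG_DEBUG_KMEMLEAK, a scan can be triggered and inspected roughly like this:

    # requires CONFIG_DEBUG_KMEMLEAK and a mounted debugfs
    mount -t debugfs none /sys/kernel/debug   # if not already mounted
    echo scan  > /sys/kernel/debug/kmemleak   # trigger an immediate scan
    cat /sys/kernel/debug/kmemleak            # list suspected leaks and their backtraces
    echo clear > /sys/kernel/debug/kmemleak   # discard the current list of suspects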
| Comment by Nathaniel Clark [ 30/Nov/12 ] |
|
This may not be an actual issue, other than that ZFS really wants lots of memory and is kind of a pig about it. |
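If ZFS memory pressure is indeed the trigger, one common mitigation (not verified against this ticket, and the value below is purely illustrative) is to cap the ARC size via the zfs_arc_max module parameter:

    # illustrative value (~1 GiB); reload the zfs module or reboot for it to take effect
    echo "options zfs zfs_arc_max=1073741824" > /etc/modprobe.d/zfs.conf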
| Comment by Jodi Levi (Inactive) [ 28/Dec/12 ] |
|
Reduced from blocker per Oleg. |
| Comment by Nathaniel Clark [ 01/Jul/15 ] |
|
There are no instances of this bug in maloo. |