[LU-2165] sanity test 64b hang with zfs Created: 12/Oct/12  Updated: 01/Jul/15  Resolved: 01/Jul/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Oleg Drokin Assignee: Nathaniel Clark
Resolution: Cannot Reproduce Votes: 0
Labels: zfs

Severity: 3
Rank (Obsolete): 5195

 Description   

Single-node FSTYPE=zfs sanity.sh is hanging for me:

[ 4020.591245] Lustre: DEBUG MARKER: == sanity test 64b: check out-of-space detection on client ============= 18:00:03 (1350079203)
[ 4301.008200] LNet: Service thread pid 29286 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 4301.018042] Pid: 29286, comm: ll_ost_io01_005
[ 4301.018946]
[ 4301.018947] Call Trace:
[ 4301.019836]  [<ffffffff814fb5d1>] schedule_timeout+0x191/0x2e0
[ 4301.021092]  [<ffffffff8107bcd0>] ? process_timeout+0x0/0x10
[ 4301.022300]  [<ffffffffa09f56e0>] __cv_timedwait_common+0xc0/0x280 [spl]
[ 4301.023736]  [<ffffffff814fc20c>] ? __mutex_lock_slowpath+0x21c/0x2c0
[ 4301.025058]  [<ffffffff8108fd60>] ? autoremove_wake_function+0x0/0x40
[ 4301.026319]  [<ffffffffa09f58d3>] __cv_timedwait+0x13/0x20 [spl]
[ 4301.027593]  [<ffffffffa0a956b7>] txg_delay+0xf7/0x140 [zfs]
[ 4301.028851]  [<ffffffffa0a6d5fe>] dsl_dir_tempreserve_space+0x1de/0x1f0 [zfs]
[ 4301.030237]  [<ffffffffa0a5732d>] ? dmu_tx_wait+0x11d/0x120 [zfs]
[ 4301.031719]  [<ffffffffa0a5740e>] ? dmu_tx_assign+0x8e/0x4b0 [zfs]
[ 4301.033237]  [<ffffffffa08b955c>] ? osd_trans_start+0x9c/0x430 [osd_zfs]
[ 4301.034868]  [<ffffffffa0cebe1d>] ? ofd_trans_start+0x22d/0x450 [ofd]
[ 4301.036448]  [<ffffffffa0cf0772>] ? ofd_commitrw_write+0x652/0x12a0 [ofd]
[ 4301.037807]  [<ffffffffa0cf1775>] ? ofd_commitrw+0x3b5/0x9c0 [ofd]
[ 4301.039079]  [<ffffffffa0c2dbd8>] ? obd_commitrw+0x128/0x3d0 [ost]
[ 4301.040326]  [<ffffffffa0c34e34>] ? ost_brw_write+0xd04/0x15d0 [ost]
[ 4301.041592]  [<ffffffffa05042f0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
[ 4301.042939]  [<ffffffffa0c3a250>] ? ost_handle+0x3120/0x4550 [ost]
[ 4301.044167]  [<ffffffffa0e9b464>] ? libcfs_id2str+0x74/0xb0 [libcfs]
[ 4301.045471]  [<ffffffffa0551483>] ? ptlrpc_server_handle_request+0x463/0xe70 [ptlrpc]
[ 4301.047047]  [<ffffffffa0e8f66e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[ 4301.048460]  [<ffffffffa054a171>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
[ 4301.049727]  [<ffffffffa0e9f6d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[ 4301.051178]  [<ffffffff81051f73>] ? __wake_up+0x53/0x70
[ 4301.052242]  [<ffffffffa055401a>] ? ptlrpc_main+0xb9a/0x1960 [ptlrpc]
[ 4301.053519]  [<ffffffffa0553480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[ 4301.054678]  [<ffffffff8100c14a>] ? child_rip+0xa/0x20
[ 4301.055860]  [<ffffffffa0553480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[ 4301.057251]  [<ffffffffa0553480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[ 4301.058523]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
[ 4301.059582]
[ 4301.059918] LustreError: dumping log to /tmp/lustre-log.1350079484.29286
[ 4307.044564] LNet: Service thread pid 29646 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 4307.049281] Pid: 29646, comm: ll_ost_io00_007
[ 4307.050021]
[ 4307.050022] Call Trace:
[ 4307.050711]  [<ffffffff8100e530>] ? dump_trace+0x190/0x3b0
[ 4307.051628]  [<ffffffff81174d23>] ? create_object+0x33/0x2c0
[ 4307.052579]  [<ffffffff814fdb54>] ? _spin_lock_irqsave+0x24/0x30
[ 4307.053624]  [<ffffffff81061580>] ? pick_next_task_fair+0xd0/0x130
[ 4307.054684]  [<ffffffff814fb5d1>] schedule_timeout+0x191/0x2e0
[ 4307.055661]  [<ffffffff8107bcd0>] ? process_timeout+0x0/0x10
[ 4307.056646]  [<ffffffffa09f56e0>] __cv_timedwait_common+0xc0/0x280 [spl]
[ 4307.057801]  [<ffffffff814fc20c>] ? __mutex_lock_slowpath+0x21c/0x2c0
[ 4307.058899]  [<ffffffffa09ed0ca>] ? kmem_alloc_debug+0x13a/0x4c0 [spl]
[ 4307.060010]  [<ffffffffa09f58d3>] __cv_timedwait+0x13/0x20 [spl]
[ 4307.061059]  [<ffffffffa0a956b7>] txg_delay+0xf7/0x140 [zfs]
[ 4307.062048]  [<ffffffffa0a6d5fe>] dsl_dir_tempreserve_space+0x1de/0x1f0 [zfs]
[ 4307.063270]  [<ffffffffa0a577fb>] dmu_tx_assign+0x47b/0x4b0 [zfs]
[ 4307.064295]  [<ffffffffa08b955c>] osd_trans_start+0x9c/0x430 [osd_zfs]
[ 4307.065440]  [<ffffffffa0cebe1d>] ofd_trans_start+0x22d/0x450 [ofd]
[ 4307.066507]  [<ffffffffa0cf0772>] ofd_commitrw_write+0x652/0x12a0 [ofd]
[ 4307.067656]  [<ffffffffa0cf1775>] ofd_commitrw+0x3b5/0x9c0 [ofd]
[ 4307.068690]  [<ffffffffa0c2dbd8>] obd_commitrw+0x128/0x3d0 [ost]
[ 4307.069707]  [<ffffffffa0c34e34>] ost_brw_write+0xd04/0x15d0 [ost]
[ 4307.070777]  [<ffffffffa05042f0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
[ 4307.071911]  [<ffffffffa0c3a250>] ost_handle+0x3120/0x4550 [ost]
[ 4307.072975]  [<ffffffffa0e9b464>] ? libcfs_id2str+0x74/0xb0 [libcfs]
[ 4307.074082]  [<ffffffffa0551483>] ptlrpc_server_handle_request+0x463/0xe70 [ptlrpc]
[ 4307.075370]  [<ffffffffa0e8f66e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[ 4307.076445]  [<ffffffffa054a171>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
[ 4307.077561]  [<ffffffffa0e9f6d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[ 4307.078655]  [<ffffffff81051f73>] ? __wake_up+0x53/0x70
[ 4307.079601]  [<ffffffffa055401a>] ptlrpc_main+0xb9a/0x1960 [ptlrpc]
[ 4307.080673]  [<ffffffffa0553480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[ 4307.081885]  [<ffffffff8100c14a>] child_rip+0xa/0x20
[ 4307.082707]  [<ffffffffa0553480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[ 4307.083715]  [<ffffffffa0553480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[ 4307.084911]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
[ 4307.085896]
...
[ 4914.044577] Lustre: 29335:0:(service.c:1266:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-207), not sending early reply
[ 4914.044580]   req@ffff8801a51c9bf0 x1415657954602061/t0(0) o4->5dc665da-eb3e-3cef-f2b8-d3318e082c88@0@lo:0/0 lens 488/448 e 5 to 0 dl 1350080102 ref 2 fl Interpret:/0/0 rc 0/0
[ 5024.072116] Lustre: 4182:0:(client.c:1909:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1350079284/real 1350079284]  req@ffff88016fb9fbf0 x1415657954602058/t0(0) o4->lustre-OST0001-osc-ffff88023539fbf0@0@lo:6/4 lens 488/448 e 5 to 1 dl 1350080207 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
[ 5024.080010] Lustre: lustre-OST0001-osc-ffff88023539fbf0: Connection to lustre-OST0001 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
[ 5024.083451] Lustre: lustre-OST0001: Client 5dc665da-eb3e-3cef-f2b8-d3318e082c88 (at 0@lo) reconnecting
[ 5024.087183] Lustre: lustre-OST0001: Client 5dc665da-eb3e-3cef-f2b8-d3318e082c88 (at 0@lo) refused reconnection, still busy with 1 active RPCs
[ 5024.090527] LustreError: 11-0: an error occurred while communicating with 0@lo. The ost_connect operation failed with -16

And it never completes.
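For context, both stuck threads are inside dmu_tx_assign(), throttled in dsl_dir_tempreserve_space() -> txg_delay() while the OST dataset is nearly full, which is exactly the state test 64b creates. The sketch below is only an illustration of the generic DMU write-transaction pattern involved, not the actual osd_trans_start() code; the function name example_write_tx() is hypothetical, only the dmu_tx_* calls are real ZFS API:

#include <sys/dmu.h>
#include <sys/dmu_tx.h>

/*
 * Illustration only: generic DMU write-transaction assignment.
 * With TXG_WAIT, dmu_tx_assign() may sleep in
 * dsl_dir_tempreserve_space()/txg_delay() when the pool is close
 * to full, which matches the stacks dumped above.
 */
static int
example_write_tx(objset_t *os, uint64_t object, uint64_t off, int len)
{
	dmu_tx_t *tx = dmu_tx_create(os);
	int rc;

	/* Declare the range this transaction intends to dirty. */
	dmu_tx_hold_write(tx, object, off, len);

	/* Blocks until the txg can accommodate the reservation. */
	rc = dmu_tx_assign(tx, TXG_WAIT);
	if (rc != 0) {
		dmu_tx_abort(tx);
		return (rc);
	}

	/* ... perform the write against the assigned txg ... */

	dmu_tx_commit(tx);
	return (0);
}

The osd-zfs transaction start path sits on top of this pattern, so once the OST fills up every write RPC in the test can end up waiting here while the client keeps retrying.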



 Comments   
Comment by Oleg Drokin [ 12/Oct/12 ]

This one seems to be fully repeatable for me with:
SLOW=yes REFORMAT=yes FSTYPE=zfs sh sanity.sh

Comment by Oleg Drokin [ 13/Oct/12 ]

Hmm, in addition: this only seems to be fully repeatable when I have kmemleak enabled and running (which puts some extra memory pressure on things); with kmemleak disabled, test 64b seems to pass for me.
Here's another trace from the last failure I had:

[ 2148.341196] Lustre: DEBUG MARKER: == sanity test 64b: check out-of-space detection on client ============= 19:51:42 (1350172302)
[ 2430.988224] LNet: Service thread pid 21584 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 2430.991114] Pid: 21584, comm: ll_ost_io01_008
[ 2430.991874]
[ 2430.991874] Call Trace:
[ 2430.992557]  [<ffffffff8100e530>] ? dump_trace+0x190/0x3b0
[ 2430.993560]  [<ffffffff81174d23>] ? create_object+0x33/0x2c0
[ 2430.994549]  [<ffffffff81174e9c>] ? create_object+0x1ac/0x2c0
[ 2430.996102]  [<ffffffffa09f5984>] ? cv_wait_common+0xa4/0x1b0 [spl]
[ 2430.997777]  [<ffffffffa09f58d3>] ? __cv_timedwait+0x13/0x20 [spl]
[ 2430.999375]  [<ffffffffa0a956b7>] ? txg_delay+0xf7/0x140 [zfs]
[ 2431.000944]  [<ffffffffa0a6d5fe>] ? dsl_dir_tempreserve_space+0x1de/0x1f0 [zfs]
[ 2431.002927]  [<ffffffffa0a575f8>] ? dmu_tx_assign+0x278/0x4b0 [zfs]
[ 2431.004618]  [<ffffffffa0b5655c>] ? osd_trans_start+0x9c/0x430 [osd_zfs]
[ 2431.006422]  [<ffffffffa0d49e1d>] ? ofd_trans_start+0x22d/0x450 [ofd]
[ 2431.008114]  [<ffffffffa0d4e772>] ? ofd_commitrw_write+0x652/0x12a0 [ofd]
[ 2431.009985]  [<ffffffffa0d4f775>] ? ofd_commitrw+0x3b5/0x9c0 [ofd]
[ 2431.011623]  [<ffffffffa0caabd8>] ? obd_commitrw+0x128/0x3d0 [ost]
[ 2431.013192]  [<ffffffffa0cb1e34>] ? ost_brw_write+0xd04/0x15d0 [ost]
[ 2431.014756]  [<ffffffffa05032f0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
[ 2431.016372]  [<ffffffffa0cb7250>] ? ost_handle+0x3120/0x4550 [ost]
[ 2431.017889]  [<ffffffffa01f4464>] ? libcfs_id2str+0x74/0xb0 [libcfs]
[ 2431.019396]  [<ffffffffa0550483>] ? ptlrpc_server_handle_request+0x463/0xe70 [ptlrpc]
[ 2431.021231]  [<ffffffffa01e866e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[ 2431.022757]  [<ffffffffa0549171>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
[ 2431.024408]  [<ffffffffa01f86d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[ 2431.025997]  [<ffffffff81051f73>] ? __wake_up+0x53/0x70
[ 2431.027116]  [<ffffffffa055301a>] ? ptlrpc_main+0xb9a/0x1960 [ptlrpc]
[ 2431.028214]  [<ffffffffa0552480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[ 2431.029378]  [<ffffffff8100c14a>] ? child_rip+0xa/0x20
[ 2431.030288]  [<ffffffffa0552480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[ 2431.031413]  [<ffffffffa0552480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[ 2431.032518]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
[ 2431.033434]
[ 2431.033718] LustreError: dumping log to /tmp/lustre-log.1350172585.21584
[ 2434.700279] LNet: Service thread pid 3314 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 2434.704227] Pid: 3314, comm: ll_ost_io00_001
[ 2434.705219]
[ 2434.705220] Call Trace:
[ 2434.706092]  [<ffffffff81164533>] ? __kmalloc+0x1c3/0x2f0
[ 2434.707182]  [<ffffffff814fc20c>] ? __mutex_lock_slowpath+0x21c/0x2c0
[ 2434.708537]  [<ffffffffa09ea85b>] kmem_free_debug+0x4b/0x150 [spl]
[ 2434.709854]  [<ffffffffa09f58d3>] ? __cv_timedwait+0x13/0x20 [spl]
[ 2434.711112]  [<ffffffffa0a956b7>] ? txg_delay+0xf7/0x140 [zfs]
[ 2434.712350]  [<ffffffffa0a952d5>] ? txg_wait_open+0x125/0x130 [zfs]
[ 2434.713676]  [<ffffffffa0a6d5fe>] ? dsl_dir_tempreserve_space+0x1de/0x1f0 [zfs]
[ 2434.715165]  [<ffffffffa0a577fb>] ? dmu_tx_assign+0x47b/0x4b0 [zfs]
[ 2434.716524]  [<ffffffffa0b5655c>] ? osd_trans_start+0x9c/0x430 [osd_zfs]
[ 2434.717950]  [<ffffffffa0d49e1d>] ? ofd_trans_start+0x22d/0x450 [ofd]
[ 2434.719345]  [<ffffffffa0d4e772>] ? ofd_commitrw_write+0x652/0x12a0 [ofd]
[ 2434.720870]  [<ffffffffa0d4f775>] ? ofd_commitrw+0x3b5/0x9c0 [ofd]
[ 2434.722277]  [<ffffffffa0caabd8>] ? obd_commitrw+0x128/0x3d0 [ost]
[ 2434.723621]  [<ffffffffa0cb1e34>] ? ost_brw_write+0xd04/0x15d0 [ost]
[ 2434.725081]  [<ffffffffa05032f0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
[ 2434.726601]  [<ffffffffa0cb7250>] ? ost_handle+0x3120/0x4550 [ost]
[ 2434.727907]  [<ffffffffa01f4464>] ? libcfs_id2str+0x74/0xb0 [libcfs]
[ 2434.729364]  [<ffffffffa0550483>] ? ptlrpc_server_handle_request+0x463/0xe70 [ptlrpc]
[ 2434.731075]  [<ffffffffa01e866e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[ 2434.732348]  [<ffffffffa0549171>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
[ 2434.733622]  [<ffffffffa01f86d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[ 2434.734745]  [<ffffffff81051f73>] ? __wake_up+0x53/0x70
[ 2434.735682]  [<ffffffffa055301a>] ? ptlrpc_main+0xb9a/0x1960 [ptlrpc]
[ 2434.736831]  [<ffffffffa0552480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[ 2434.737889]  [<ffffffff8100c14a>] ? child_rip+0xa/0x20
[ 2434.738787]  [<ffffffffa0552480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[ 2434.739860]  [<ffffffffa0552480>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[ 2434.741105]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
[ 2434.742165]
[ 2434.742496] LustreError: dumping log to /tmp/lustre-log.1350172588.3314
[ 2519.305054] kmemleak: 998 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
Comment by Nathaniel Clark [ 30/Nov/12 ]

This may not be a real issue, other than that ZFS really wants lots of memory and is kind of a pig about it.

Comment by Jodi Levi (Inactive) [ 28/Dec/12 ]

Reduced from blocker per Oleg.

Comment by Nathaniel Clark [ 01/Jul/15 ]

There are no instances of this bug in maloo.
