[LU-7020] OST_DESTROY message times out on MDS repeatedly, indefinitely Created: 18/Aug/15 Updated: 19/Feb/16 Resolved: 20/Jan/16 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Christopher Morrone | Assignee: | Hongchao Zhang |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | llnl, p4hc |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We are seeing a single message on one of our MDS nodes time out over and over with no apparent end in sight. The message is destined for an OST, so each time it times out the connection with the OST is assumed dead, and the MDS reconnects. Here is a sample of the timeout:

00000100:00000100:15.0:1437780834.157231:0:12825:0:(client.c:1976:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1437780188/real 1437780188] req@ffff880c55ff7400 x1506778035514940/t0(0) o6->lsd-OST003a-osc-MDT0000@172.19.2.163@o2ib100:28/4 lens 664/432 e 27 to 1 dl 1437780833 ref 1 fl Rpc:Xru/2/ffffffff rc -11/-1
00000100:00000200:15.0:1437780834.157246:0:12825:0:(events.c:99:reply_in_callback()) @@@ type 6, status 0 req@ffff880c55ff7400 x1506778035514940/t0(0) o6->lsd-OST003a-osc-MDT0000@172.19.2.163@o2ib100:28/4 lens 664/432 e 27 to 1 dl 1437780833 ref 1 fl Rpc:Xru/2/ffffffff rc -11/-1
00000100:00000200:15.0:1437780834.157250:0:12825:0:(events.c:120:reply_in_callback()) @@@ unlink req@ffff880c55ff7400 x1506778035514940/t0(0) o6->lsd-OST003a-osc-MDT0000@172.19.2.163@o2ib100:28/4 lens 664/432 e 27 to 1 dl 1437780833 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
00000400:00000200:15.0:1437780834.157254:0:12825:0:(lib-md.c:73:lnet_md_unlink()) Unlinking md ffff88072424e1c0
00000100:02000400:15.0:1437780834.157258:0:12825:0:(import.c:179:ptlrpc_set_import_discon()) lsd-OST003a-osc-MDT0000: Connection to lsd-OST003a (at 172.19.2.163@o2ib100) was lost; in progress operations using this service will wait for recovery to complete

Thread 12825 (ptlrpcd_14) is repeatedly timing out an opcode 6 (OST_DESTROY) request with OST lsd-OST003a. Interestingly, it has the same transaction number each time. I cannot, however, see any resend of the request, so the OSS never really has an opportunity to try to reply to the message. It is especially strange that the send time is getting updated each time even though the message does not seem to be resent.

This problem has persisted through reboots of both the MDS and OSTs. The sysadmins tried failing over the OST, and the problem followed the OST to the failover OSS (that part probably isn't terribly surprising). This has persisted for weeks now. Currently, the servers are running Lustre version 2.5.4-4chaos. |
| Comments |
| Comment by Andreas Dilger [ 19/Aug/15 ] |
|
Hi Chris, when you say the "same transaction number", do you mean the same XID (i.e. x1506778035514940)? In that case it appears the RPC is being resent repeatedly by the PTLRPC layer, though it seems unlikely that this would persist across a restart of the MDS. |
| Comment by Christopher Morrone [ 19/Aug/15 ] |
|
Yes, I meant the XID. I had "rpctrace" and "lnet" enabled, and I did not see anywhere that the XID was ever actually transmitted. Actually, I have not retested since the MDS was rebooted, so I don't know if the XID changed after that. It was the same for many hours before the MDS was rebooted.

Using the debug_daemon on a production MDS for 15 minutes is not terribly practical. The logs are prohibitive in size. We'll need to be a bit more targeted, at least while it is in production use. If we can avoid a downtime for the filesystem to get to the root of the problem, that is certainly preferable. But while gathering an 11+ minute log is difficult, I did manually dump two logs of back-to-back timeouts. That allowed me to use the "sent" time from the second timeout to go back in the log from the previous timeout. That log did seem to fully cover the "sent" time, but I saw no messages concerning the transmission of that XID. Granted, I'm not 100% sure that I have captured everything, but that's about as much as we can reasonably do while the system is in production use. |
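For reference, a minimal sketch of the manual back-to-back dump approach described above, assuming the debug mask already includes "rpctrace" and "lnet"; the dump paths are arbitrary examples and the XID is simply the one quoted in this ticket:

    # dump the kernel debug buffer right after a timeout message appears
    lctl dk /tmp/lu7020-dump1.log
    # ...wait for the next timeout of the same request, then dump again
    lctl dk /tmp/lu7020-dump2.log
    # look for any trace of the request actually being (re)sent
    grep x1506778035514940 /tmp/lu7020-dump1.log /tmp/lu7020-dump2.log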
| Comment by Oleg Drokin [ 20/Aug/15 ] |
|
I wonder if you can gather OST logs at the same time to see if the XID really arrives there? |
| Comment by Hongchao Zhang [ 26/Aug/15 ] |
|
Are the logs available now? |
| Comment by D. Marc Stearman (Inactive) [ 31/Aug/15 ] |
|
We will have a system maintenance window mid September if we need to increase the debugging level. |
| Comment by Peter Jones [ 08/Sep/15 ] |
|
Hongchao, is there any specific debugging we want to ask LLNL for? Thanks, Peter |
| Comment by Hongchao Zhang [ 09/Sep/15 ] |
|
Hi, please enable the "ha", "rpctrace" and "net" debug flags, and collect the debug logs from both the MDS and the OSS. Thanks! |
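For reference, a minimal sketch of one way to enable those debug flags and collect the logs on the MDS and OSS; the buffer size and output path are arbitrary examples, not prescribed values:

    # enlarge the debug buffer and add the requested flags to the current mask
    lctl set_param debug_mb=512
    lctl set_param debug="+ha +rpctrace +net"
    # ...reproduce the timeout, then dump the buffer to a file on each server
    lctl dk /tmp/lu7020-$(hostname).log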
| Comment by D. Marc Stearman (Inactive) [ 16/Sep/15 ] |
|
Thank you. The system maintenance we had planned for Sept has been postponed to sometime in October. I will provide an update after the maintenance. |
| Comment by D. Marc Stearman (Inactive) [ 16/Oct/15 ] |
|
I have attached the debug logs as requested. |
| Comment by Hongchao Zhang [ 16/Oct/15 ] |
|
Thanks for the logs! I still need some more time to analyze them and will update the status of this ticket soon. |
| Comment by Hongchao Zhang [ 19/Oct/15 ] |
|
In the OSS logs, there is already one OST_DESTROY request found being processed, and this request then failed with -EBUSY. Could you please collect a stack dump of the processes at the OSS with "echo 't' > /proc/sysrq-trigger"? Thanks! |
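For reference, a minimal sketch of capturing that stack dump, assuming the sysrq output lands in the kernel ring buffer and the console log; the output path is just an example:

    # emit stack traces for all tasks into the kernel log
    echo t > /proc/sysrq-trigger
    # save the ring buffer before it wraps
    dmesg > /tmp/lu7020-sysrq-t-$(hostname).log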
| Comment by Cameron Harr [ 19/Oct/15 ] |
|
I've attached a process dump starting at 08:26:26 that should include the OST problem at 08:26:27. |
| Comment by Hongchao Zhang [ 23/Oct/15 ] |
|
As per the logs, the destroy process hung in "txg_wait_open", and it could be the same one as LU-6923:

2015-10-19 08:27:21 ll_ost01_001 D 0000000000000006 0 6028 2 0x00000000
2015-10-19 08:27:21  ffff881029d037e0 0000000000000046 ffff881029d03790 ffffffff81065c4e
2015-10-19 08:27:21  0000000000000010 0000000000000003 0000000329d03750 0000000000000001
2015-10-19 08:27:21  ffff881029d037c0 0000000000000086 ffff881030569068 ffff881029d03fd8
2015-10-19 08:27:21 Call Trace:
2015-10-19 08:27:21  [<ffffffff81065c4e>] ? try_to_wake_up+0x24e/0x3e0
2015-10-19 08:27:21  [<ffffffff810a06de>] ? prepare_to_wait_exclusive+0x4e/0x80
2015-10-19 08:27:21  [<ffffffffa03b6fbd>] cv_wait_common+0x11d/0x130 [spl]
2015-10-19 08:27:21  [<ffffffff810a0490>] ? autoremove_wake_function+0x0/0x40
2015-10-19 08:27:21  [<ffffffffa03b7025>] __cv_wait+0x15/0x20 [spl]
2015-10-19 08:27:21  [<ffffffffa046e77b>] txg_wait_open+0x8b/0xd0 [zfs]
2015-10-19 08:27:21  [<ffffffffa042cf27>] dmu_tx_wait+0x3f7/0x400 [zfs]
2015-10-19 08:27:21  [<ffffffffa04415da>] ? dsl_dir_tempreserve_space+0xca/0x190 [zfs]
2015-10-19 08:27:21  [<ffffffffa042d121>] dmu_tx_assign+0xa1/0x570 [zfs]
2015-10-19 08:27:21  [<ffffffffa0e6d64c>] osd_trans_start+0x9c/0x410 [osd_zfs]
2015-10-19 08:27:21  [<ffffffffa0f3094d>] ofd_trans_start+0x22d/0x3f0 [ofd]
2015-10-19 08:27:21  [<ffffffffa0f317c3>] ofd_object_destroy+0x203/0x680 [ofd]
2015-10-19 08:27:21  [<ffffffffa0f258ca>] ofd_destroy_by_fid+0x33a/0x710 [ofd]
2015-10-19 08:27:21  [<ffffffffa09b5cc0>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
2015-10-19 08:27:21  [<ffffffffa09b7560>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
2015-10-19 08:27:21  [<ffffffffa09e1535>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
2015-10-19 08:27:21  [<ffffffffa0f29337>] ofd_destroy+0x1a7/0x8b0 [ofd]
2015-10-19 08:27:21  [<ffffffffa09e5cb0>] ? lustre_swab_ost_body+0x0/0x10 [ptlrpc]
2015-10-19 08:27:21  [<ffffffffa0edbf89>] ost_handle+0x3ba9/0x44d0 [ost]
2015-10-19 08:27:21  [<ffffffffa09e9fcb>] ? ptlrpc_update_export_timer+0x4b/0x560 [ptlrpc]
2015-10-19 08:27:21  [<ffffffffa09f1095>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
2015-10-19 08:27:21  [<ffffffffa06cd52e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
2015-10-19 08:27:21  [<ffffffffa06de845>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
2015-10-19 08:27:21  [<ffffffffa09e9ac9>] ? ptlrpc_wait_event+0xb9/0x2e0 [ptlrpc]
2015-10-19 08:27:21  [<ffffffff81058929>] ? __wake_up_common+0x59/0x90
2015-10-19 08:27:21  [<ffffffffa09f381d>] ptlrpc_main+0xaed/0x1930 [ptlrpc]
2015-10-19 08:27:21  [<ffffffffa09f2d30>] ? ptlrpc_main+0x0/0x1930 [ptlrpc]
2015-10-19 08:27:21  [<ffffffff8109fffe>] kthread+0x9e/0xc0
2015-10-19 08:27:21  [<ffffffff8100c24a>] child_rip+0xa/0x20
2015-10-19 08:27:21  [<ffffffff8109ff60>] ? kthread+0x0/0xc0
2015-10-19 08:27:21  [<ffffffff8100c240>] ? child_rip+0x0/0x20 |
| Comment by Jinshan Xiong (Inactive) [ 26/Oct/15 ] |
|
Indeed, in both tickets, the processes are waiting for a txg to open. And in LU-6923, where I didn't run Lustre on top of ZFS, the txg_sync process was stuck on memory allocation. Did you find a similar process doing the same thing in this case? |
| Comment by Hongchao Zhang [ 26/Oct/15 ] |
|
The following process is similar to the stuck "txg_sync" in LU-6923:

2015-10-19 08:27:16 txg_sync D 0000000000000009 0 5908 2 0x00000000
2015-10-19 08:27:16  ffff88102d211b70 0000000000000046 0000000000000001 ffff880812c7e770
2015-10-19 08:27:16  0000000000000000 0000000000000000 ffff88102d211ae0 ffffffff81065df2
2015-10-19 08:27:16  ffff88102d211b30 ffffffff81058929 ffff88102d151068 ffff88102d211fd8
2015-10-19 08:27:16 Call Trace:
2015-10-19 08:27:16  [<ffffffff81065df2>] ? default_wake_function+0x12/0x20
2015-10-19 08:27:16  [<ffffffff81058929>] ? __wake_up_common+0x59/0x90
2015-10-19 08:27:16  [<ffffffff810ac3d1>] ? ktime_get_ts+0xb1/0xf0
2015-10-19 08:27:16  [<ffffffff8152e363>] io_schedule+0x73/0xc0
2015-10-19 08:27:16  [<ffffffffa03b6f4f>] cv_wait_common+0xaf/0x130 [spl]
2015-10-19 08:27:16  [<ffffffff810a0490>] ? autoremove_wake_function+0x0/0x40
2015-10-19 08:27:16  [<ffffffffa03b6fe8>] __cv_wait_io+0x18/0x20 [spl]
2015-10-19 08:27:16  [<ffffffffa04b9d6b>] zio_wait+0x10b/0x1e0 [zfs]
2015-10-19 08:27:16  [<ffffffffa0443a83>] dsl_pool_sync+0xb3/0x430 [zfs]
2015-10-19 08:27:16  [<ffffffffa0459943>] spa_sync+0x443/0xb90 [zfs]
2015-10-19 08:27:16  [<ffffffff81058929>] ? __wake_up_common+0x59/0x90
2015-10-19 08:27:16  [<ffffffff8105ce63>] ? __wake_up+0x53/0x70
2015-10-19 08:27:16  [<ffffffff810149f9>] ? read_tsc+0x9/0x20
2015-10-19 08:27:16  [<ffffffffa046f079>] txg_sync_thread+0x389/0x5f0 [zfs]
2015-10-19 08:27:16  [<ffffffffa046ecf0>] ? txg_sync_thread+0x0/0x5f0 [zfs]
2015-10-19 08:27:16  [<ffffffffa046ecf0>] ? txg_sync_thread+0x0/0x5f0 [zfs]
2015-10-19 08:27:16  [<ffffffffa03b27e8>] thread_generic_wrapper+0x68/0x80 [spl]
2015-10-19 08:27:16  [<ffffffffa03b2780>] ? thread_generic_wrapper+0x0/0x80 [spl]
2015-10-19 08:27:16  [<ffffffff8109fffe>] kthread+0x9e/0xc0
2015-10-19 08:27:16  [<ffffffff8100c24a>] child_rip+0xa/0x20
2015-10-19 08:27:16  [<ffffffff8109ff60>] ? kthread+0x0/0xc0
2015-10-19 08:27:16  [<ffffffff8100c240>] ? child_rip+0x0/0x20 |
| Comment by Alex Zhuravlev [ 03/Nov/15 ] |
|
probably the same root cause as in |
| Comment by Hongchao Zhang [ 03/Nov/15 ] |
|
Hi Cameron, which version of ZFS are you using at your site? Does the problem still exist now? If so, could you please collect one more stack dump? As per the comment from Alex, could you please also check whether the patch http://review.whamcloud.com/#/c/13630/ is contained in the Lustre version you are running? Thanks |
| Comment by Peter Jones [ 03/Nov/15 ] |
|
I believe that |
| Comment by Cameron Harr [ 03/Nov/15 ] |
|
Hongchao, the original problem does still exist and I've captured another stack trace. Unfortunately, the capture of the trace crashed the node after a while, causing me to lose the top output. The OST is failed over at this point and I can attempt to fail back, though past attempts at failing back the OST (without rebooting both OSSs) have been unsuccessful. Let me know if you would like me to try the trace again to get the top output. |
| Comment by Cameron Harr [ 03/Nov/15 ] |
|
New stack trace attached. This capture resulted in crashing the OSS so it's possible some information is missing. The OST reconnect messages show at 08:20:32. |
| Comment by Cameron Harr [ 03/Nov/15 ] |
|
Peter, |
| Comment by Hongchao Zhang [ 04/Nov/15 ] |
|
There are more Lustre processes in the "D" state in the new stack traces, and I also found two almost identical "D"-state multipathd processes:

2015-10-19 08:26:55 multipathd D 0000000000000001 0 2657 1 0x00000000
2015-10-19 08:26:55  ffff88102f979968 0000000000000082 0000000000000000 ffff88102f97992c
2015-10-19 08:26:55  ffff880700000000 ffff88083fe82000 0000315a05c8fc69 ffff8800446158c0
2015-10-19 08:26:55  0000000000000200 000000010336f75d ffff881030de85f8 ffff88102f979fd8
2015-10-19 08:26:55 Call Trace:
2015-10-19 08:26:55  [<ffffffff8152ea95>] schedule_timeout+0x215/0x2e0
2015-10-19 08:26:55  [<ffffffff8138ac3b>] ? scsi_request_fn+0xdb/0x750
2015-10-19 08:26:55  [<ffffffff81088ccd>] ? del_timer+0x7d/0xe0
2015-10-19 08:26:55  [<ffffffff8152e70b>] wait_for_common+0x13b/0x190
2015-10-19 08:26:55  [<ffffffff81065de0>] ? default_wake_function+0x0/0x20
2015-10-19 08:26:55  [<ffffffff8152e81d>] wait_for_completion+0x1d/0x20
2015-10-19 08:26:55  [<ffffffff8127ab2c>] blk_execute_rq+0x8c/0xf0
2015-10-19 08:26:55  [<ffffffff8127a870>] ? blk_rq_map_user+0x1a0/0x280
2015-10-19 08:26:55  [<ffffffff8127f0a5>] sg_io+0x215/0x3d0
2015-10-19 08:26:55  [<ffffffff8127fa30>] scsi_cmd_ioctl+0x400/0x470
2015-10-19 08:26:55  [<ffffffff8127faf1>] scsi_cmd_blk_ioctl+0x51/0x70
2015-10-19 08:26:55  [<ffffffffa03c861f>] sd_ioctl+0xaf/0x110 [sd_mod]
2015-10-19 08:26:55  [<ffffffff8127c757>] __blkdev_driver_ioctl+0x67/0x80
2015-10-19 08:26:55  [<ffffffff8127cc0d>] blkdev_ioctl+0x1ed/0x6e0
2015-10-19 08:26:55  [<ffffffff811cce4c>] block_ioctl+0x3c/0x40
2015-10-19 08:26:55  [<ffffffff811a5882>] vfs_ioctl+0x22/0xa0
2015-10-19 08:26:55  [<ffffffff8128f9f5>] ? _atomic_dec_and_lock+0x55/0x80
2015-10-19 08:26:55  [<ffffffff811a5ea4>] do_vfs_ioctl+0x84/0x5e0
2015-10-19 08:26:55  [<ffffffff81191cf4>] ? __fput+0x1b4/0x230
2015-10-19 08:26:55  [<ffffffff811a6481>] sys_ioctl+0x81/0xa0
2015-10-19 08:26:55  [<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b

2015-11-03 08:19:43 multipathd D 0000000000000001 0 2657 1 0x00000000
2015-11-03 08:19:43  ffff88102f979968 0000000000000082 0000000000000000 ffff88102f97992c
2015-11-03 08:19:43  ffff880700000000 ffff88083fe82000 0004cf882d534eda ffff8800446158c0
2015-11-03 08:19:43  0000000000000200 0000000150a69d91 ffff881030de85f8 ffff88102f979fd8
2015-11-03 08:19:43 Call Trace:
2015-11-03 08:19:43  [<ffffffff8152ea95>] schedule_timeout+0x215/0x2e0
2015-11-03 08:19:43  [<ffffffff8138ac3b>] ? scsi_request_fn+0xdb/0x750
2015-11-03 08:19:43  [<ffffffff81088ccd>] ? del_timer+0x7d/0xe0
2015-11-03 08:19:43  [<ffffffff8152e70b>] wait_for_common+0x13b/0x190
2015-11-03 08:19:43  [<ffffffff81065de0>] ? default_wake_function+0x0/0x20
2015-11-03 08:19:43  [<ffffffff8152e81d>] wait_for_completion+0x1d/0x20
2015-11-03 08:19:43  [<ffffffff8127ab2c>] blk_execute_rq+0x8c/0xf0
2015-11-03 08:19:43  [<ffffffff8127a870>] ? blk_rq_map_user+0x1a0/0x280
2015-11-03 08:19:43  [<ffffffff8127f0a5>] sg_io+0x215/0x3d0
2015-11-03 08:19:43  [<ffffffff8127fa30>] scsi_cmd_ioctl+0x400/0x470
2015-11-03 08:19:43  [<ffffffff8127faf1>] scsi_cmd_blk_ioctl+0x51/0x70
2015-11-03 08:19:43  [<ffffffffa03c861f>] sd_ioctl+0xaf/0x110 [sd_mod]
2015-11-03 08:19:43  [<ffffffff8127c757>] __blkdev_driver_ioctl+0x67/0x80
2015-11-03 08:19:43  [<ffffffff8127cc0d>] blkdev_ioctl+0x1ed/0x6e0
2015-11-03 08:19:43  [<ffffffff811cce4c>] block_ioctl+0x3c/0x40
2015-11-03 08:19:43  [<ffffffff811a5882>] vfs_ioctl+0x22/0xa0
2015-11-03 08:19:43  [<ffffffff8128f9f5>] ? _atomic_dec_and_lock+0x55/0x80
2015-11-03 08:19:43  [<ffffffff811a5ea4>] do_vfs_ioctl+0x84/0x5e0
2015-11-03 08:19:43  [<ffffffff81191cf4>] ? __fput+0x1b4/0x230
2015-11-03 08:19:43  [<ffffffff811a6481>] sys_ioctl+0x81/0xa0
2015-11-03 08:19:43  [<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b

The other "multipathd" processes are normal (PIDs are 2649, 2651, 2655, 2656, 2658, 3607, 3643, 3661, 3668, 3755, 3791). |
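For reference, a minimal sketch of spotting such uninterruptible ("D") tasks without a full sysrq dump, assuming a standard procps ps; the wchan column shows roughly where each task is blocked:

    # list tasks currently in uninterruptible sleep, with their wait channel
    ps -eo pid,state,wchan:32,comm | awk '$2 == "D"'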
| Comment by Peter Jones [ 04/Nov/15 ] |
|
Cameron, I was mostly hoping to stimulate thought about different options to break the impasse on this ticket. I will defer to Hongchao as to whether this option is worth pursuing, but it certainly sounds like an interesting possibility to me. Peter |
| Comment by Hongchao Zhang [ 10/Nov/15 ] |
|
Hi Cameron, could you please attach the console logs and the dmesg logs from the affected MDT? I want to check whether there are any related error messages logged there. Thanks |
| Comment by D. Marc Stearman (Inactive) [ 10/Nov/15 ] |
|
I have checked the console logs. There are no errors reported.

pilsneri:/g/g0/marc# grep -i multi /var/log/conman/console.pilsner-mds1

[root@pilsner-mds1:~]# multipath -ll
35000a7203009d7ac dm-6 STEC,S842E400M2
size=373G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 0:0:7:0  sdh  8:112  active ready running
  `- 1:0:7:0  sdab 65:176 active ready running
35000a7203009d8b5 dm-15 STEC,S842E400M2
size=373G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 0:0:17:0 sdq  65:0   active ready running
  `- 1:0:17:0 sdak 66:64  active ready running
... (many more devices)

All the devices are in an active ready running state, just like the other 9 file systems here at LLNL. |
| Comment by Christopher Morrone [ 11/Nov/15 ] |
|
It is important to remember that a backtrace is just a snapshot in time; it does not imply on its own that a thread is stuck. I did a bit more investigation and asked Brian for help interpreting what I was seeing. Here is what we found:

When I logged into the problem OST this time, I immediately noted that the txg_sync and txg_quiesce processes are almost constantly busy, and have accumulated orders of magnitude more CPU time than the processes on other OSS nodes. Also, the ll_ost_01_005 thread has a pretty constant, but low, CPU usage of about 7.6%.

So Brian showed me how to turn on txg history (echo 100 > /sys/module/zfs/parameters/zfs_txg_history). Here is an excerpt from /proc/spl/kstat/zfs/pilsner59/txgs:

33 0 0x01 100 11200 142887020337 544583879125793
txg          birth            state ndirty nread nwritten reads writes otime  qtime wtime stime
76803304746  544583868564365  C     0      0     0        0     0      70508  21160 68986 54491
76803304747  544583868634873  C     0      0     0        0     0      147566 50375 12733 94518
76803304748  544583868782439  C     0      0     0        0     0      71725  21106 67840 53209
76803304749  544583868854164  C     0      0     0        0     0      146596 49691 12989 85313
76803304750  544583869000760  C     0      0     0        0     0      71177  21217 58686 56199
76803304751  544583869071937  C     0      0     0        0     0      138247 51730 12942 92351
76803304752  544583869210184  C     0      0     0        0     0      73329  21253 65425 58371

This means that transaction groups are being synced many, many times per second, and very few of those txgs contain any actual data (zero bytes written, zero bytes read).

Next Brian pointed me to the /proc/spl/kstat/zfs/dmu_tx file, which apparently lists dmu tx error counts:

4 1 0x01 11 528 39909419835 544891782589194
name                      type data
dmu_tx_assigned           4    5185797
dmu_tx_delay              4    0
dmu_tx_error              4    0
dmu_tx_suspended          4    0
dmu_tx_group              4    15267
dmu_tx_memory_reserve     4    2254989223
dmu_tx_memory_reclaim     4    0
dmu_tx_dirty_throttle     4    0
dmu_tx_dirty_delay        4    11520
dmu_tx_dirty_over_max     4    32
dmu_tx_quota              4    0

Brian was kind enough to look up dmu_tx_memory_reserve for me in the code, and it basically means that something is requesting a dmu transaction size that exceeds the entire size of the ARC. The ARC right now is around 33GiB, for reference.

So let's go back to that ll_ost_01_005 thread. Here is the backtrace:

PID: 6166 TASK: ffff88080f932040 CPU: 5 COMMAND: "ll_ost01_005"
 #0 [ffff8807bb00d710] schedule at ffffffff8152db20
 #1 [ffff8807bb00d7e8] cv_wait_common at ffffffffa03b3fbd [spl]
 #2 [ffff8807bb00d868] __cv_wait at ffffffffa03b4025 [spl]
 #3 [ffff8807bb00d878] txg_wait_open at ffffffffa046b77b [zfs]
 #4 [ffff8807bb00d8b8] dmu_tx_wait at ffffffffa0429f27 [zfs]
 #5 [ffff8807bb00d948] dmu_tx_assign at ffffffffa042a121 [zfs]
 #6 [ffff8807bb00da08] osd_trans_start at ffffffffa0e6a64c [osd_zfs]
 #7 [ffff8807bb00da38] ofd_trans_start at ffffffffa0b3e94d [ofd]
 #8 [ffff8807bb00da68] ofd_object_destroy at ffffffffa0b3f7c3 [ofd]
 #9 [ffff8807bb00daa8] ofd_destroy_by_fid at ffffffffa0b338ca [ofd]
#10 [ffff8807bb00dba8] ofd_destroy at ffffffffa0b37337 [ofd]
#11 [ffff8807bb00dc18] ost_handle at ffffffffa0edff89 [ost]
#12 [ffff8807bb00dd68] ptlrpc_server_handle_request at ffffffffa09ee095 [ptlrpc]
#13 [ffff8807bb00de48] ptlrpc_main at ffffffffa09f081d [ptlrpc]
#14 [ffff8807bb00dee8] kthread at ffffffff8109fffe
#15 [ffff8807bb00df48] kernel_thread at ffffffff8100c24a

It keeps showing up in ofd_destroy. Could that have anything to do with our problem OST_DESTROY? It does seem quite likely.

So it would certainly appear that Lustre is asking for a transaction size that is absurd, ZFS replies with EAGAIN, and then I guess that Lustre forces a sync on the txg? Then Lustre immediately starts the whole thing over again, and repeats it forever. I think that is probably enough information to get you looking in the right place. |
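For reference, a minimal sketch of reproducing the observation above on another OSS, assuming the same kstat paths as quoted in the comment; the pool name pilsner59 is specific to this system:

    # enable txg history and watch whether empty txgs keep churning while the
    # dmu_tx_memory_reserve counter climbs
    echo 100 > /sys/module/zfs/parameters/zfs_txg_history
    watch -n 1 'tail -n 5 /proc/spl/kstat/zfs/pilsner59/txgs; grep dmu_tx_memory_reserve /proc/spl/kstat/zfs/dmu_tx'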
| Comment by Christopher Morrone [ 11/Nov/15 ] |
|
Brian gave me a direct pointer to where ZFS is catching the bad TX size request: It's spinning right here: https://github.com/zfsonlinux/zfs/blob/master/module/zfs/arc.c#L5111 The reserve size is calculated based on the declared size of the TX in question. |
| Comment by Alex Zhuravlev [ 11/Nov/15 ] |
|
I think this is what we were fixing in |
| Comment by Hongchao Zhang [ 11/Nov/15 ] |
|
Hi Alex, currently, if the object size is less than 16M, "dmu_object_free" will be used to destroy the object in one transaction. |
| Comment by Alex Zhuravlev [ 11/Nov/15 ] |
|
Object size is not equal to transaction size. The transaction size includes the blocks the transaction modifies, but the blocks used for user data aren't modified, so they are not part of the transaction. IOW, it's only metadata blocks (dnode, tree). |
| Comment by Cameron Harr [ 11/Nov/15 ] |
|
This is late notice, but we have planned OS upgrades on this system this morning (08:30-10:00 PST) and I will have the system offline for a small amount of time. It looks like we have some pointers to chase down, but let me know if you'd like me to do anything with that downtime. |
| Comment by Christopher Morrone [ 11/Nov/15 ] |
|
Yes, it certainly sounds like |
| Comment by Hongchao Zhang [ 13/Nov/15 ] |
|
the patch back-ported from http://review.whamcloud.com/#/c/13630/ in |
| Comment by Cameron Harr [ 14/Jan/16 ] |
|
I updated to Lustre 2.5.5-3chaos today on the system and there appears to be no change in the behavior. Following is output taken after the system has been updated:

Jan 14 09:34:34 pilsner-mds1 kernel: Lustre: lsd-OST003a-osc-MDT0000: Connection to lsd-OST003a (at 172.19.2.162@o2ib100) was lost; in progress operations using this service will wait for recovery to complete
Jan 14 09:34:34 pilsner59 kernel: Lustre: lsd-OST003a: Client lsd-MDT0000-mdtlov_UUID (at 172.19.2.102@o2ib100) reconnecting
Jan 14 09:34:34 pilsner59 kernel: Lustre: lsd-OST003a: deleting orphan objects from 0x0:26258401 to 0x0:26258561
Jan 14 09:34:34 pilsner-mds1 kernel: Lustre: lsd-OST003a-osc-MDT0000: Connection restored to lsd-OST003a (at 172.19.2.162@o2ib100)
Jan 14 09:34:34 pilsner-mds1 kernel: Lustre: Skipped 1 previous similar message
Jan 14 09:45:20 pilsner-mds1 kernel: Lustre: lsd-OST003a-osc-MDT0000: Connection to lsd-OST003a (at 172.19.2.162@o2ib100) was lost; in progress operations using this service will wait for recovery to complete
Jan 14 09:45:20 pilsner59 kernel: Lustre: lsd-OST003a: Client lsd-MDT0000-mdtlov_UUID (at 172.19.2.102@o2ib100) reconnecting
Jan 14 09:45:20 pilsner-mds1 kernel: Lustre: lsd-OST003a-osc-MDT0000: Connection restored to lsd-OST003a (at 172.19.2.162@o2ib100)
Jan 14 09:45:39 pilsner59 kernel: Lustre: lsd-OST003a: deleting orphan objects from 0x0:26260525 to 0x0:26260961
Jan 14 09:56:05 pilsner-mds1 kernel: Lustre: lsd-OST003a-osc-MDT0000: Connection to lsd-OST003a (at 172.19.2.162@o2ib100) was lost; in progress operations using this service will wait for recovery to complete
Jan 14 09:56:05 pilsner59 kernel: Lustre: lsd-OST003a: Client lsd-MDT0000-mdtlov_UUID (at 172.19.2.102@o2ib100) reconnecting
Jan 14 09:56:05 pilsner59 kernel: Lustre: lsd-OST003a: deleting orphan objects from 0x0:26262647 to 0x0:26262977
Jan 14 09:56:05 pilsner-mds1 kernel: Lustre: lsd-OST003a-osc-MDT0000: Connection restored to lsd-OST003a (at 172.19.2.162@o2ib100) |
| Comment by Hongchao Zhang [ 16/Jan/16 ] |
|
Hi Cameron, the logs above seem to be normal: the precreated objects at the OST will be deleted by the MDT when the MDT recovers the connection. Are there other logs indicating that the OST_DESTROY request is still being resent periodically? Thanks |
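For reference, a minimal sketch of checking for repeated sends of the same OST_DESTROY from the MDS side, assuming rpctrace is enabled; the dump path is an arbitrary example and the "o6->lsd-OST003a" pattern comes from the debug lines quoted in the description:

    # on the MDS, capture rpctrace output and look for opcode 6 (OST_DESTROY)
    # requests to this OST repeating with the same XID
    lctl set_param debug=+rpctrace
    lctl dk /tmp/mds-rpctrace.log
    grep 'o6->lsd-OST003a' /tmp/mds-rpctrace.log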
| Comment by Cameron Harr [ 19/Jan/16 ] |
|
Hongchao, |
| Comment by Peter Jones [ 19/Jan/16 ] |
|
That's good news Cameron. So, can we close this ticket as a duplicate of |
| Comment by Cameron Harr [ 20/Jan/16 ] |
|
Yes. It can be closed. |
| Comment by Peter Jones [ 20/Jan/16 ] |
|
Thanks Cameron |