|
syslog from ost:
-----------------------------
Lustre: DEBUG MARKER: == lustre-rsync-test test 5b: Kill / restart lustre_rsync == 19:47:48 (1305254868)
Lustre: lustre-OST0005: haven't heard from client 6a761607-aedc-55b0-0bdc-d007d810ea33 (at 192.168.4.5@o2ib) in 55 seconds. I think it's dead, and I am evicting it. exp ffff88061fdfb400, cur 1305255094 expire 1305255064 last 1305255039
Lustre: lustre-OST0004: haven't heard from client 6a761607-aedc-55b0-0bdc-d007d810ea33 (at 192.168.4.5@o2ib) in 55 seconds. I think it's dead, and I am evicting it. exp ffff88061f62e400, cur 1305255094 expire 1305255064 last 1305255039
LustreError: 4153:0:(ldlm_lib.c:2118:target_send_reply_msg()) @@@ processing error (107) req@ffff88061e86a400 x1368658213493272/t0(0) o-1><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1305255100 ref 1 fl Interpret:H/ffffffff/ffffffff rc -107/-1
Lustre: 4155:0:(ldlm_lib.c:871:target_handle_connect()) lustre-OST0005: connection from 6a761607-aedc-55b0-0bdc-d007d810ea33@192.168.4.5@o2ib t0 exp (null) cur 1305255094 last 0
Lustre: 4155:0:(filter.c:2806:filter_connect()) lustre-OST0005: Received MDS connection (0x19c578d6a8d29552); group 0
Lustre: 4155:0:(filter.c:2806:filter_connect()) Skipped 1 previous similar message
Lustre: 4155:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import lustre-OST0005->NET_0x50000c0a80405_UUID netid 50000: select flavor null
Lustre: 4155:0:(sec.c:1474:sptlrpc_import_sec_adapt()) Skipped 5 previous similar messages
Lustre: lustre-OST0005: haven't heard from client 6a761607-aedc-55b0-0bdc-d007d810ea33 (at 192.168.4.5@o2ib) in 55 seconds. I think it's dead, and I am evicting it. exp ffff88032fa68000, cur 1305255149 expire 1305255119 last 1305255094
LustreError: 4151:0:(ldlm_lib.c:2118:target_send_reply_msg()) @@@ processing error (107) req@ffff88061a52e800 x1368658213493279/t0(0) o-1><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1305255183 ref 1 fl Interpret:H/ffffffff/ffffffff rc -107/-1
Lustre: 4157:0:(ldlm_lib.c:871:target_handle_connect()) lustre-OST0003: connection from 6a761607-aedc-55b0-0bdc-d007d810ea33@192.168.4.5@o2ib t0 exp (null) cur 1305255177 last 0
Lustre: 4157:0:(filter.c:2806:filter_connect()) lustre-OST0003: Received MDS connection (0x19c578d6a8d29560); group 0
Lustre: 4157:0:(filter.c:2806:filter_connect()) Skipped 1 previous similar message
Lustre: 4163:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import lustre-OST0000->NET_0x50000c0a80405_UUID netid 50000: select flavor null
Lustre: 4163:0:(sec.c:1474:sptlrpc_import_sec_adapt()) Skipped 2 previous similar messages
LustreError: 4151:0:(ldlm_lib.c:2118:target_send_reply_msg()) Skipped 5 previous similar messages
|
|
syslog from mds:
--------------------------
Lustre: DEBUG MARKER: ----============= acceptance-small: lustre-rsync-test ============---- Fri May 13 12:04:49 PDT 2011
Lustre: DEBUG MARKER: Using TIMEOUT=20
Lustre: Disabling parameter lustre-MDT0000.mdd.quota_type in log lustre-MDT0000
Lustre: Skipped 1 previous similar message
LustreError: 10125:0:(mgs_llog.c:2652:mgs_write_log_param()) err -19 on param 'mdd.quota_type='
LustreError: 10125:0:(mgs_handler.c:969:mgs_iocontrol()) setparam err -19
Lustre: 10302:0:(debug.c:320:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
Lustre: DEBUG MARKER: only running test 5b
Lustre: DEBUG MARKER: excepting tests:
Lustre: DEBUG MARKER: == lustre-rsync-test test 5b: Kill / restart lustre_rsync == 12:04:56 (1305313496)
Lustre: mdd_obd-lustre-MDT0000-0: changelog on
Lustre: mdd_obd-lustre-MDT0000-0: changelog off
Lustre: MGS: haven't heard from client 2bae2e67-a05e-33c2-0d9a-6856e66881f6 (at 192.168.4.5@o2ib) in 48 seconds. I think it's dead, and I am evicting it. exp ffff88062b134000, cur 1305313632 expire 1305313602 last 1305313584
Lustre: lustre-MDT0000: haven't heard from client 9cb3a3d1-f9b3-8d76-cf39-fa485e920451 (at 192.168.4.5@o2ib) in 52 seconds. I think it's dead, and I am evicting it. exp ffff8805da0c3000, cur 1305313637 expire 1305313607 last 1305313585
LustreError: 9234:0:(mgs_handler.c:677:mgs_handle()) lustre_mgs: operation 400 on unconnected MGS
LustreError: 10576:0:(mdt_handler.c:2813:mdt_recovery()) operation 400 on unconnected MDS from 12345-192.168.4.5@o2ib
LustreError: 10576:0:(ldlm_lib.c:2118:target_send_reply_msg()) @@@ processing error (107) req@ffff88062b09c000 x1368720312157349/t0(0) o-1><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1305313695 ref 1 fl Interpret:H/ffffffff/ffffffff rc -107/-1
Lustre: 10576:0:(ldlm_lib.c:871:target_handle_connect()) lustre-MDT0000: connection from 9cb3a3d1-f9b3-8d76-cf39-fa485e920451@192.168.4.5@o2ib t4294983304 exp (null) cur 1305313689 last 0
Lustre: 10576:0:(ldlm_lib.c:871:target_handle_connect()) Skipped 2 previous similar messages
Lustre: 10576:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import lustre-MDT0000->NET_0x50000c0a80405_UUID netid 50000: select flavor null
Lustre: 10576:0:(sec.c:1474:sptlrpc_import_sec_adapt()) Skipped 2 previous similar messages
Lustre: MGS: haven't heard from client 2bae2e67-a05e-33c2-0d9a-6856e66881f6 (at 192.168.4.5@o2ib) in 48 seconds. I think it's dead, and I am evicting it. exp ffff8805d7cd9400, cur 1305313737 expire 1305313707 last 1305313689
Lustre: lustre-MDT0000: haven't heard from client 9cb3a3d1-f9b3-8d76-cf39-fa485e920451 (at 192.168.4.5@o2ib) in 53 seconds. I think it's dead, and I am evicting it. exp ffff8802cc611c00, cur 1305313742 expire 1305313712 last 1305313689
LustreError: 9251:0:(mdt_handler.c:2813:mdt_recovery()) operation 400 on unconnected MDS from 12345-192.168.4.5@o2ib
LustreError: 9234:0:(mgs_handler.c:677:mgs_handle()) lustre_mgs: operation 400 on unconnected MGS
LustreError: 9234:0:(ldlm_lib.c:2118:target_send_reply_msg()) @@@ processing error (107) req@ffff8805d9cfb050 x1368720312157364/t0(0) o-1><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1305313796 ref 1 fl Interpret:H/ffffffff/ffffffff rc -107/-1
LustreError: 9234:0:(ldlm_lib.c:2118:target_send_reply_msg()) Skipped 1 previous similar message
Lustre: 9234:0:(ldlm_lib.c:871:target_handle_connect()) MGS: connection from 2bae2e67-a05e-33c2-0d9a-6856e66881f6@192.168.4.5@o2ib t0 exp (null) cur 1305313790 last 0
Lustre: 9234:0:(ldlm_lib.c:871:target_handle_connect()) Skipped 1 previous similar message
Lustre: 9234:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import MGS->NET_0x50000c0a80405_UUID netid 50000: select flavor null
Lustre: 9234:0:(sec.c:1474:sptlrpc_import_sec_adapt()) Skipped 2 previous similar messages
|
|
Oleg
Could you look into this one please?
Thanks
Peter
|
|
Something pretty weird is going on as evidenced by the stack traces below.
The rsync thread calls mdc_ioc_changelog_send, which creates a new "sending to userspace" thread; that thread calls llog_process_flags, which schedules another thread that calls llog_process_flags again, scheduling yet another thread that then tries to write to a userspace pipe with nobody reading on the other end, and hangs.
I did some more digging, and the changelog communication path is a mess.
Userspace is responsible for creating the pipe used to communicate with the kernel.
Our implementation (in liblustreapi.c::llapi_changelog_start) then just issues the ioctl that kicks off the sequence described above.
Userspace then closes its write descriptor, so that it is notified if the kernel closes its side.
One problem is that the read descriptor is not closed prior to the ioctl, so even when the userspace side dies (due to a signal), both the write and read ends of the pipe remain open in the kernel threads, and EPIPE never arises.
Even if we fix this in userspace, a malicious user could construct a similar case and cause needless kernel resource usage, so we also need to address this on the kernel side. (Is this a real concern, or is this only accessible to root?)
In fact, I just did the closing in the kernel and it seems to help; the comments in the code all incorrectly presume that the files must remain open, when in fact we already hold the pipe reference in the kernel.
I am running the full test suite now to make sure that nothing breaks.
[ 284.928469] lustre_rsync D ffff88003f824288 4856 3398 1 0x00000000
[ 284.928469] ffff880039731c40 0000000000000082 0000000000000000 ffff880037460400
[ 284.928469] ffff880039731bb0 0000000000000282 ffff880039731bd0 00000000ffff17b4
[ 284.928469] ffff880030666738 ffff880039731fd8 0000000000010488 ffff880030666738
[ 284.928469] Call Trace:
[ 284.928469] [<ffffffff814d4e15>] schedule_timeout+0x225/0x2f0
[ 284.928469] [<ffffffff81158d36>] ? kfree_debugcheck+0x16/0x40
[ 284.928469] [<ffffffff814d4a7b>] wait_for_common+0x12b/0x180
[ 284.928469] [<ffffffff8105c700>] ? default_wake_function+0x0/0x20
[ 284.928469] [<ffffffff814d4b8d>] wait_for_completion+0x1d/0x20
[ 284.928469] [<ffffffffa0305975>] llog_process_flags+0x115/0x580 [obdclass]
[ 284.928469] [<ffffffffa045a1c7>] ? llog_client_read_header+0x187/0x640 [ptlrpc]
[ 284.928469] [<ffffffffa01f89b3>] ? cfs_alloc+0x63/0x90 [libcfs]
[ 284.928469] [<ffffffffa0308a78>] llog_cat_process_flags+0x188/0x2d0 [obdclass]
[ 284.928469] [<ffffffffa0307c0f>] ? llog_init_handle+0x17f/0x9d0 [obdclass]
[ 284.928469] [<ffffffff8115ba03>] ? __kmalloc+0x1c3/0x2e0
[ 284.928469] [<ffffffffa01f89b3>] ? cfs_alloc+0x63/0x90 [libcfs]
[ 284.928469] [<ffffffffa05a1050>] ? changelog_show_cb+0x0/0x310 [mdc]
[ 284.928469] [<ffffffffa05aa63e>] mdc_changelog_send_thread+0x41e/0xad0 [mdc]
[ 284.928469] [<ffffffff81069044>] ? __mmdrop+0x44/0x60
[ 284.928469] [<ffffffff8105a17e>] ? finish_task_switch+0xae/0xd0
[ 284.928469] [<ffffffff8101420a>] child_rip+0xa/0x20
[ 284.928469] [<ffffffffa05aa220>] ? mdc_changelog_send_thread+0x0/0xad0 [mdc]
[ 284.928469] [<ffffffff81014200>] ? child_rip+0x0/0x20
[ 284.928469] lustre_rsync D ffff880025cafcf0 4488 3399 3398 0x00000000
[ 284.928469] ffff880025cafc10 0000000000000082 ffff880001f16a80 ffff88003d0ce200
[ 284.928469] ffff880025cafb80 0000000000000282 ffff880025cafba0 0000000000000282
[ 284.928469] ffff8800374609b8 ffff880025caffd8 0000000000010488 ffff8800374609b8
[ 284.928469] Call Trace:
[ 284.928469] [<ffffffff814d4e15>] schedule_timeout+0x225/0x2f0
[ 284.928469] [<ffffffffa045a1c7>] ? llog_client_read_header+0x187/0x640 [ptlrpc]
[ 284.928469] [<ffffffffa01f89b3>] ? cfs_alloc+0x63/0x90 [libcfs]
[ 284.928469] [<ffffffff814d4a7b>] wait_for_common+0x12b/0x180
[ 284.928469] [<ffffffff8105c700>] ? default_wake_function+0x0/0x20
[ 284.928469] [<ffffffff814d4b8d>] wait_for_completion+0x1d/0x20
[ 284.928469] [<ffffffffa0305975>] llog_process_flags+0x115/0x580 [obdclass]
[ 284.928469] [<ffffffff810976bf>] ? up+0x2f/0x50
[ 284.928469] [<ffffffffa0309939>] llog_cat_process_cb+0x329/0x400 [obdclass]
[ 284.928469] [<ffffffffa0307303>] llog_process_thread+0x943/0xde0 [obdclass]
[ 284.928469] [<ffffffff8101420a>] child_rip+0xa/0x20
[ 284.928469] [<ffffffffa05a1050>] ? changelog_show_cb+0x0/0x310 [mdc]
[ 284.928469] [<ffffffffa03069c0>] ? llog_process_thread+0x0/0xde0 [obdclass]
[ 284.928469] [<ffffffff81014200>] ? child_rip+0x0/0x20
[ 284.928469] lustre_rsync S 000000000000000e 4320 3400 3399 0x00000000
[ 284.928469] ffff880025cc9b20 0000000000000082 0000000000000246 ffff88003c80f120
[ 284.928469] ffff880025cc9af0 ffffffff814d59cc ffff88003c80f138 0000000000000282
[ 284.928469] ffff88003d0ce7b8 ffff880025cc9fd8 0000000000010488 ffff88003d0ce7b8
[ 284.928469] Call Trace:
[ 284.928469] [<ffffffff814d59cc>] ? __mutex_lock_slowpath+0x21c/0x2c0
[ 284.928469] [<ffffffff8117c67b>] pipe_wait+0x5b/0x80
[ 284.928469] [<ffffffff81092780>] ? autoremove_wake_function+0x0/0x40
[ 284.928469] [<ffffffff814d5aa1>] ? mutex_lock+0x31/0x50
[ 284.928469] [<ffffffff8117d392>] pipe_write+0x322/0x660
[ 284.928469] [<ffffffff811728ea>] do_sync_write+0xfa/0x140
[ 284.928469] [<ffffffff81092780>] ? autoremove_wake_function+0x0/0x40
[ 284.928469] [<ffffffffa046fee9>] ? sptlrpc_req_put_ctx+0xa9/0x1b0 [ptlrpc]
[ 284.928469] [<ffffffffa01f8a46>] cfs_user_write+0x66/0xa0 [libcfs]
[ 284.928469] [<ffffffffa020a94b>] libcfs_kkuc_msg_put+0x5b/0x180 [libcfs]
[ 284.928469] [<ffffffffa045b087>] ? llog_client_next_block+0x1c7/0x550 [ptlrpc]
[ 284.928469] [<ffffffffa05a1112>] changelog_show_cb+0xc2/0x310 [mdc]
[ 284.928469] [<ffffffffa0307303>] llog_process_thread+0x943/0xde0 [obdclass]
[ 284.928469] [<ffffffff8101420a>] child_rip+0xa/0x20
[ 284.928469] [<ffffffffa03069c0>] ? llog_process_thread+0x0/0xde0 [obdclass]
[ 284.928469] [<ffffffff81014200>] ? child_rip+0x0/0x20
|
|
Sarah, please try the patch at http://review.whamcloud.com/555 to see if it works for the entire rsync and changelog-related test suites.
Seems to be working fine for me.
|
|
Integrated in lustre-master » x86_64,client,sles11,inkernel #128
LU-316 Fix changelog deadlock when changelog client is killed
Oleg Drokin : 9e6a0fcbf02942caff9d769d644913964d34c028
Files :
|
|
|
|
Verified on the latest master build, RHEL6/x86_64/#134.
|
|
Fixed
|