|
Matteo
It is certainly true that 2.2 servers with 2.4 clients is not a combination that we officially test or support. However, rather than jump to conclusions, we should have an engineer review the evidence and see if this is indeed the reason for the problems.
Hongchao
Could you please review the information provided and make an assessment?
Thanks
Peter
|
|
Hi Matteo
What is the format of the second log file "lustre_and_login_nodes_logs.tar.bz2ab"? Unlike the first one, it isn't a bzip2 file.
The logs in the first file "lustre_and_login_nodes_logs.tar.bz2aa" only contain logs from the clients, and the server logs are needed to
determine the cause of the eviction.
BTW, does the "stuck" state of the login nodes affect the whole system or only the applications? How long is the maximum duration of the stuck?
Thanks
|
|
Hi Peter and Hongchao,
thanks for your further investigations.
I'm sorry, I forgot to mention that the archive I attached is split into two pieces (with GNU coreutils split) because of the 10 MB limit; if necessary I can upload a new one.
The maximum duration of the "stuck" is variable, from seconds to several minutes, and it affects only the login node itself, requiring a manual system reboot.
Thanks.
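For reference, the two pieces can simply be concatenated back together before extracting, e.g.:
cat lustre_and_login_nodes_logs.tar.bz2aa lustre_and_login_nodes_logs.tar.bz2ab > lustre_and_login_nodes_logs.tar.bz2
tar xjf lustre_and_login_nodes_logs.tar.bz2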
|
|
Hi Matteo,
the eviction is related to the "stuck" you mentioned above, since there was not even a "ping" request for more than 227 seconds (the ping request is sent
by a dedicated thread started alongside Lustre). The time interval before eviction can be extended as follows:
echo "newtimeout" > /proc/sys/lustre/timeout
the interval will be "newtimeout * 9 / 4"
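For example, with the default timeout of 100 this gives 100 * 9 / 4 = 225 seconds, and
echo 200 > /proc/sys/lustre/timeout
would extend the interval to 200 * 9 / 4 = 450 seconds.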
if the stuck can't be avoided, how about increasing the timeout value?
Thanks
|
|
Hi Hongchao,
we have considered changing the timeout values.
We will do tests and let you know as soon as possible if it solves the problem.
Thanks.
P.S. I don't know how to change the ticket status
|
|
Hi Matteo
It is ok - you do not need to change the ticket status. Please just let us know how you get on with your tests.
Thanks
Peter
|
|
Log of eviction with timeout value of 200 seconds
|
|
Hi Hongchao,
we reconfigured the timeout from the default 100 seconds to 200 seconds with the following command, but the nodes were still evicted after around 457 seconds.
lctl conf_param nero.sys.timeout=200
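(For reference, the value actually in effect on a client can be checked with the same proc file mentioned earlier, e.g. cat /proc/sys/lustre/timeout.)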
We then tried a timeout value of 300 seconds and again saw an eviction after around 667 seconds.
Where can I find some best practices for setting the static timeout in proportion to the number of nodes?
Do you have any other useful advice to try to solve this problem?
Thanks,
Matteo
P.S. We attached the log file with timeout value of 200s [logs_timeout_200s.tar.bz2]
|
|
Hello Matteo,
Sorry for the late update.
Could you boot at least one of the impacted login nodes with the "nohz=off" boot parameter and see how it runs with it?
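In case it helps, on these RHEL 6 nodes this would typically mean appending the option to the kernel line of the default entry in /boot/grub/grub.conf and rebooting; for example (everything except nohz=off is a placeholder for the existing entry):
kernel /vmlinuz-2.6.32-358.6.2.el6.x86_64 ro root=<existing root> <existing options> nohz=off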
|
|
Comment from the customer:
<<Dear Gabriele
I have set nohz=off on two out of the four problematic servers on 5.12.
Yesterday evening, one of them was evicted again (nodename brutus2).
So, this did not help.
BTW: the other node has nohz=off as well as notsc and clocksource=hpet set.>>
|
|
Thanks for attaching the update, Gabriele. I have also attached both files/logs (brutus2_eviction.txt, messages_brutus2.txt) that were provided with this update.
Matteo, does this mean that nohz=off helped to avoid the issue for some time?
Also, I am afraid I don't fully understand what you mean by "stuck"; is it a scheduling issue caused by the heavy load on the login nodes?
Are you running with any special ptlrpcd module multi-thread/NUMA-policy parameters?
I see in the logs that the evictions almost always start from the OSS side with the "LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after …." message, so can you provide the Lustre debug log (with full debug enabled and the biggest possible debug buffer) at the time of the eviction, from both the client and the OSS sides?
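For reference, one rough way to capture this on the client and OSS nodes (note that these settings are not persistent across remounts):
lctl set_param debug=-1        # enable all debug flags
lctl set_param debug_mb=512    # enlarge the debug buffer (in MB)
lctl clear                     # empty the current buffer
# ... wait for the next eviction, then on each node:
lctl dk /tmp/lustre-debug.$(hostname).log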
|
|
Adding the latest info/exchanges from/with the customer:
_ Since it was never provided, I requested a client/OSS Lustre debug log taken during an eviction. Since the current debug mask setting was only "ioctl neterror warning error emerg ha config console" on all clients/servers, I requested that "+dlmtrace +rpctrace" be added. This is now set up on one login/client node on-site.
_ I requested the current ptlrpcd NUMA settings. On all client/server nodes nothing specific is actually configured: max_ptlrpcds=0 and ptlrpcd_bind_policy=3.
_ I also requested the current vm.zone_reclaim_mode setting, and since it was 1/on, I requested that it be changed to 0/off at least on one login/client node. This has been done by the customer.
I also attach the latest Lustre+NUMA configuration information provided by the customer (requested_outputs.tar.bz2).
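For reference, the changes above amount to roughly the following on the chosen login node (a sketch; the ptlrpcd values are module parameters and the path shown is only an assumption, listed here just for checking):
lctl set_param debug=+dlmtrace
lctl set_param debug=+rpctrace
sysctl -w vm.zone_reclaim_mode=0
cat /sys/module/ptlrpc/parameters/max_ptlrpcds /sys/module/ptlrpc/parameters/ptlrpcd_bind_policy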
|
|
The customer provided more information (client Lustre debug log, client/OSS syslog, ...) taken during a new occurrence. It is in the "ETHZ_client_eviction_brutus3_n-oss07_20131220.tar.bz2" attachment I just uploaded.
Here are my first analysis comments on this new information.
The following similar stack dump, seen for multiple ldlm_bl_xx threads before the eviction:
================================================================
Dec 20 10:00:54 brutus3 kernel: INFO: task ldlm_bl_43:12517 blocked for more than 120 seconds.
Dec 20 10:00:54 brutus3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 20 10:00:54 brutus3 kernel: ldlm_bl_43 D 000000000000000c 0 12517 2 0x00000080
Dec 20 10:00:54 brutus3 kernel: ffff880ae35a3d50 0000000000000046 0000000000000000 ffffffffa0582967
Dec 20 10:00:54 brutus3 kernel: 0000000100000000 0000000000000000 ffff880ae35a3cf0 ffffffffa0581f22
Dec 20 10:00:54 brutus3 kernel: ffff880ad9e7c5f8 ffff880ae35a3fd8 000000000000fb88 ffff880ad9e7c5f8
Dec 20 10:00:54 brutus3 kernel: Call Trace:
Dec 20 10:00:54 brutus3 kernel: [<ffffffffa0582967>] ? cfs_hash_bd_lookup_intent+0x37/0x130 [libcfs]
Dec 20 10:00:54 brutus3 kernel: [<ffffffffa0581f22>] ? cfs_hash_bd_add_locked+0x62/0x90 [libcfs]
Dec 20 10:00:54 brutus3 kernel: [<ffffffff8150f1ee>] __mutex_lock_slowpath+0x13e/0x180
Dec 20 10:00:54 brutus3 kernel: [<ffffffff8150f08b>] mutex_lock+0x2b/0x50
Dec 20 10:00:54 brutus3 kernel: [<ffffffffa06d5ebf>] cl_lock_mutex_get+0x6f/0xd0 [obdclass]
Dec 20 10:00:54 brutus3 kernel: [<ffffffffa0a4495a>] osc_ldlm_blocking_ast+0x7a/0x350 [osc]
Dec 20 10:00:54 brutus3 kernel: [<ffffffffa057d2c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Dec 20 10:00:54 brutus3 kernel: [<ffffffffa07f1f00>] ldlm_handle_bl_callback+0x130/0x400 [ptlrpc]
Dec 20 10:00:54 brutus3 kernel: [<ffffffffa07f2451>] ldlm_bl_thread_main+0x281/0x3d0 [ptlrpc]
Dec 20 10:00:54 brutus3 kernel: [<ffffffff81063310>] ? default_wake_function+0x0/0x20
Dec 20 10:00:54 brutus3 kernel: [<ffffffffa07f21d0>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
Dec 20 10:00:54 brutus3 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Dec 20 10:00:54 brutus3 kernel: [<ffffffffa07f21d0>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
Dec 20 10:00:54 brutus3 kernel: [<ffffffffa07f21d0>] ? ldlm_bl_thread_main+0x0/0x3d0 [ptlrpc]
Dec 20 10:00:54 brutus3 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
================================================================
looks like an indication of a problem/deadlock on the client side.
And the associated Lustre debug-trace sequence is also always the same, along the lines of:
==================================================================
00000100:00100000:33.0:1387529928.782898:0:3546:0:(events.c:352:request_in_callback()) peer: 12345-10.201.62.37@o2ib
00000100:00100000:25.0:1387529928.782936:0:3640:0:(service.c:1867:ptlrpc_server_handle_req_in()) got req x1444519122204514
00000100:00080000:25.0:1387529928.782961:0:3640:0:(service.c:1079:ptlrpc_update_export_timer()) updating export LOV_OSC_UUID at 1387529928 exp ffff8820303d4000
00000100:00100000:25.0:1387529928.782985:0:3640:0:(nrs_fifo.c:182:nrs_fifo_req_get()) NRS start fifo request from 12345-10.201.62.37@o2ib, seq: 15852
00000100:00100000:25.0:1387529928.783003:0:3640:0:(service.c:2011:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb03_002:LOV_OSC_UUID+4:4310:x1444519122204514:12345-10.201.62.37@o2ib:106
00010000:00010000:25.0:1387529928.783029:0:3640:0:(ldlm_lockd.c:1882:ldlm_handle_gl_callback()) ### client glimpse AST callback handler ns: nero-OST000e-osc-ffff88203a24a800 lock: ffff88091b327a00/0x30c32bca9aefeafc lrc: 8/0,0 mode: PW/PW res: 119352871/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x29400000000 nid: local remote: 0x87d5d77ddb4e8fb0 expref: -99 pid: 45673 timeout: 0 lvb_type: 1
00000020:00010000:25.0:1387529928.783063:0:3640:0:(cl_object.c:305:cl_object_glimpse()) header@ffff88035bd6a6a0[0x0, 5, [0x20001c158:0xdd90:0x0] hash]
00000020:00010000:25.0:1387529928.783065:0:3640:0:(cl_object.c:305:cl_object_glimpse()) size: 348 mtime: 1387455174 atime: 1387456028 ctime: 1387455174 blocks: 8
00000020:00010000:25.0:1387529928.783068:0:3640:0:(cl_object.c:305:cl_object_glimpse()) header@ffff880334d7e7f0[0x0, 3, [0x1000e0000:0x71d2e27:0x0] hash]
00000020:00010000:25.0:1387529928.783069:0:3640:0:(cl_object.c:305:cl_object_glimpse()) size: 348 mtime: 1387455174 atime: 1387456028 ctime: 1387455174 blocks: 8
00000100:00100000:25.0:1387529928.783125:0:3640:0:(service.c:2055:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb03_002:LOV_OSC_UUID+4:4310:x1444519122204514:12345-10.201.62.37@o2ib:106 Request procesed in 135us (231us total) trans 0 rc 0/0
00000100:00100000:25.0:1387529928.783131:0:3640:0:(nrs_fifo.c:244:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.201.62.37@o2ib, seq: 15852
00010000:00010000:12.0:1387529928.783139:0:12517:0:(ldlm_lockd.c:1696:ldlm_handle_bl_callback()) ### client blocking AST callback handler ns: nero-OST000e-osc-ffff88203a24a800 lock: ffff88091b327a00/0x30c32bca9aefeafc lrc: 8/0,0 mode: PW/PW res: 119352871/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x29400000000 nid: local remote: 0x87d5d77ddb4e8fb0 expref: -99 pid: 45673 timeout: 0 lvb_type: 1
00010000:00010000:12.0:1387529928.783150:0:12517:0:(ldlm_lockd.c:1709:ldlm_handle_bl_callback()) Lock ffff88091b327a00 already unused, calling callback (ffffffffa0a448e0)
==================================================================
At first look, the problem seems related to LU-874/LU-2683, but all of the relevant patches have already landed in b2_4, so that is unlikely to be the case after all.
It would be good to get a crash dump at the time of the eviction, or at least full thread stack traces, to understand who owns the cl_lock mutex that prevents the ldlm_bl_xx threads from completing/answering the blocking AST.
I will try to come back with a procedure to get more information during future occurrences, or maybe with a specific debug patch.
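In the meantime, full task stack traces can usually be obtained on the affected login node with sysrq (dumped to dmesg/syslog):
echo 1 > /proc/sys/kernel/sysrq    # make sure sysrq is enabled
echo t > /proc/sysrq-trigger       # dump the stack traces of all tasks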
|
|
Hmm, in fact I wonder if this problem could be a new facet of the same problem as LU-874/LU-2683, but under the untested/unsupported 2.2/2.4 hybrid configuration?
|
|
Would it be possible to install and run an instrumented version of the Lustre client modules, at least on one of the impacted login nodes, that would allow us to get a crash dump (instead of a Lustre debug-log dump!) upon eviction?
|
|
I have just created a set of client RPMs for the same kernel (2.6.32-358.6.2.el6.x86_64) and Lustre (2.4.0-RC2) versions used on-site, allowing an LBUG() (in ptlrpc_invalidate_import_thread()) to be triggered instead of a debug-log dump (if obd_dump_on_eviction is set to exactly -1) upon eviction. This should allow a crash dump to be taken upon eviction (if panic_on_lbug is also != 0).
I am currently testing its functionality in-house and will provide an update soon.
On the other hand, I am also trying to set up a similar Lustre server/client version (2.2.0-RC2-PRISTINE/2.4.0-RC2-CHANGED) platform to try to reproduce the issue with the reproducer program from LU-874.
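For reference, once the instrumented modules are loaded, arming this path should be a matter of something like the following (assuming the usual proc locations of these tunables on 2.4):
echo -1 > /proc/sys/lustre/dump_on_eviction    # obd_dump_on_eviction = -1 triggers the instrumented LBUG()
echo 1 > /proc/sys/lnet/panic_on_lbug          # make the LBUG() panic the node so kdump can take a vmcore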
|
|
Testing of my RPMs, including the LBUG()-upon-eviction patch/instrumentation, has been successful. The instrumented RPMs have been pushed to the customer upload area for installation, in order to allow a crash dump to be taken during an eviction.
The customer will also try to run the LU-874 reproducer on their platform.
|
|
We have the RPMs in place on half of the login nodes, have checked our kdump configuration with sysrq-trigger, and are waiting for the first crash dump during an eviction.
We ran the LU-874 and also the LU-4112 reproducers aggressively, but no eviction nor anything else happened.
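For reference, the sysrq-based kdump check is typically along the lines of the following (it deliberately crashes the node, so only where that is acceptable):
echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger    # force a panic; kdump should then write a vmcore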
Cheers
Allen
|
|
Allen, thanks for helping so much.
|
|
Crash dump text file uploaded to Jira and vmcore dump provided to Gabriele Paciucci.
|
|
Another dump file for reference.
Full coredump uploaded to Gabriele.
|
|
Sorry for the delay in completing the analysis of the crash dumps from the instrumented RPMs.
Now, after spending some time on it, I think the problem you are facing is the same as the one already tracked as LU-4300.
To confirm this, could you disable ELC (with "echo 0 > /proc/fs/lustre/ldlm/namespaces/*/early_lock_cancel") on one (or all?) of the login nodes that trigger the evictions?
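(An equivalent, also non-persistent, way to do this for all namespaces at once would be lctl set_param ldlm.namespaces.*.early_lock_cancel=0; note that either form is lost at remount/reboot.)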
|
|
Thanks for the analysis of the crash dump, Bruno.
I have disabled ELC on all the involved login nodes now. Let's see what happens.
|
|
Hello Eric, can you give some feedback on whether running with ELC disabled has helped or not?
|
|
I cannot tell yet whether it is successful, as we had already gone at least a fortnight without an eviction before setting ELC=0.
I have tried to stress the login nodes more than usual, but I could not force an eviction. That is not unusual, though,
as we have never managed to trigger the eviction with any particular usage pattern…
I will of course update you as soon as we have a new eviction, although I hope it never occurs again (best case).
|
|
Eric, can you give us an update? Thanks in advance.
|
|
We had no eviction on the 4 nodes until 02.04.2014. At that time, we rebooted the node with the longest uptime of all (53 days) and left the default early_lock_cancel (1) configured on it. This node ran for 6 days, until 08.04.2014, and was then evicted, first by a single OSS and rapidly by others. I saw users doing many rsync operations on Lustre at that time (apart from the other usual load).
Lustre logs showed the following:
00010000:00010000:45.0:1396944724.065662:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff881853511200/0x5d1861be72b7c5eb lrc: 0/0,0 mode: -/PR res: 138708343/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0>18446744073709551615) flags: 0x869400000000 nid: local remote: 0x35306a32ed5df055 expref: -99 pid: 43945 timeout: 0 lvb_type: 1
00010000:00010000:45.0:1396944724.065672:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff881d59ae9a00/0x5d1861be72b7b7f2 lrc: 0/0,0 mode: -/PR res: 138708334/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0>18446744073709551615) flags: 0x869400000000 nid: local remote: 0x35306a32ed5def52 expref: -99 pid: 43945 timeout: 0 lvb_type: 1
00010000:00010000:45.0:1396944724.065678:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff8818726e8600/0x5d1861be72b7a728 lrc: 0/0,0 mode: -/PR res: 138708324/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0>18446744073709551615) flags: 0x869400000000 nid: local remote: 0x35306a32ed5dee41 expref: -99 pid: 43945 timeout: 0 lvb_type: 1
00010000:00010000:45.0:1396944724.065684:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff881a88e77400/0x5d1861be72cf6d93 lrc: 0/0,0 mode: -/PR res: 138711793/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0>18446744073709551615) flags: 0x869400000000 nid: local remote: 0x35306a32ed5fbaff expref: -99 pid: 44920 timeout: 0 lvb_type: 1
00010000:00010000:45.0:1396944724.065703:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff881e9e464c00/0x5d1861be72cfd56d lrc: 0/0,0 mode: -/PR res: 138711859/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0>18446744073709551615) flags: 0x869400000000 nid: local remote: 0x35306a32ed5fc1b2 expref: -99 pid: 44920 timeout: 0 lvb_type: 1
00010000:00010000:45.0:1396944724.065709:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff881c15923e00/0x5d1861be72cf1691 lrc: 0/0,0 mode: -/PR res: 138711753/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0>18446744073709551615) flags: 0x869400000000 nid: local remote: 0x35306a32ed5fb5cd expref: -99 pid: 44920 timeout: 0 lvb_type: 1
00010000:00010000:45.0:1396944724.065716:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff881cab0b1400/0x5d1861be72b78afe lrc: 0/0,0 mode: -/PR res: 138708312/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0>18446744073709551615) flags: 0x869400000000 nid: local remote: 0x35306a32ed5debd2 expref: -99 pid: 43945 timeout: 0 lvb_type: 1
00010000:00010000:45.0:1396944724.065722:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff880b360b0c00/0x5d1861be72cfc8b6 lrc: 0/0,0 mode: -/PR res: 138711851/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0>18446744073709551615) flags: 0x869400000000 nid: local remote: 0x35306a32ed5fc0cb expref: -99 pid: 44920 timeout: 0 lvb_type: 1
00010000:00010000:45.0:1396944724.065728:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff8815907c4200/0x5d1861be72cfac3f lrc: 0/0,0 mode: -/PR res: 138711835/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0>18446744073709551615) flags: 0x869400000000 nid: local remote: 0x35306a32ed5fbee8 expref: -99 pid: 44920 timeout: 0 lvb_type: 1
00010000:00010000:45.0:1396944724.065734:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff8815399ed400/0x5d1861be72cf7232 lrc: 0/0,0 mode: -/PR res: 138711798/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0>18446744073709551615) flags: 0x869400000000 nid: local remote: 0x35306a32ed5fbb53 expref: -99 pid: 44920 timeout: 0 lvb_type: 1
00010000:00010000:45.0:1396944724.065749:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff881915410400/0x5d1861be72cf90cb lrc: 0/0,0 mode: -/PR res: 138711819/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0>18446744073709551615) flags: 0x869400000000 nid: local remote: 0x35306a32ed5fbd52 expref: -99 pid: 44920 timeout: 0 lvb_type: 1
00010000:00010000:45.0:1396944724.065759:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff88146b2d0c00/0x5d1861be72cf1ef6 lrc: 0/0,0 mode: -/PR res: 138711760/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0>18446744073709551615) flags: 0x869400000000 nid: local remote: 0x35306a32ed5fb63d expref: -99 pid: 44920 timeout: 0 lvb_type: 1
00010000:00010000:45.0:1396944724.065766:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff8813f3483600/0x5d1861be72b78333 lrc: 0/0,0 mode: -/PR res: 138708303/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0>18446744073709551615) flags: 0x869400000000 nid: local remote: 0x35306a32ed5deac8 expref: -99 pid: 43945 timeout: 0 lvb_type: 1
00010000:00010000:45.0:1396944724.065772:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff880725999600/0x5d1861be72cfb752 lrc: 0/0,0 mode: -/PR res: 138711843/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0>18446744073709551615) flags: 0x869400000000 nid: local remote: 0x35306a32ed5fbfb3 expref: -99 pid: 44920 timeout: 0 lvb_type: 1
00010000:00010000:45.0:1396944724.065779:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff8819f2a30200/0x5d1861be72cf3084 lrc: 0/0,0 mode: -/PR res: 138711769/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0>18446744073709551615) flags: 0x869400000000 nid: local remote: 0x35306a32ed5fb708 expref: -99 pid: 44920 timeout: 0 lvb_type: 1
00010000:00010000:45.0:1396944724.065785:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff8814e99b5600/0x5d1861be72cf69bf lrc: 0/0,0 mode: -/PR res: 138711791/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0>18446744073709551615) flags: 0x869400000000 nid: local remote: 0x35306a32ed5fba9d expref: -99 pid: 44920 timeout: 0 lvb_type: 1
00010000:00010000:45.0:1396944724.065791:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff8804d6b30200/0x5d1861be72d291cf lrc: 0/0,0 mode: -/PW res: 138711739/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0>18446744073709551615) flags: 0x869480000000 nid: local remote: 0x35306a32ed604737 expref: -99 pid: 5386 timeout: 0 lvb_type: 1
00010000:00010000:45.0:1396944724.065801:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff881902ad2200/0x5d1861be72cf58f5 lrc: 0/0,0 mode: -/PR res: 138711785/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0>18446744073709551615) flags: 0x869400000000 nid: local remote: 0x35306a32ed5fb99a expref: -99 pid: 44920 timeout: 0 lvb_type: 1
00010000:00010000:45.0:1396944724.065808:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff88201e820600/0x5d1861be72cf4449 lrc: 0/0,0 mode: -/PR res: 138711776/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0>18446744073709551615) flags: 0x869400000000 nid: local remote: 0x35306a32ed5fb85f expref: -99 pid: 44920 timeout: 0 lvb_type: 1
00010000:00010000:45.0:1396944724.065818:0:12697:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: nero-OST0019-osc-ffff880c39f1ec00 lock: ffff881108e46000/0x5d1861be72cf6e34 lrc: 0/0,0 mode: -/PW res: 138711795/0 rrc: 1 type: EXT [0->18446744073709551615] (req 0>4095) flags: 0x869400000000 nid: local remote: 0x35306a32ed5fbb06 expref: -99 pid: 44920 timeout: 0 lvb_type: 1
We have now set ELC to 0 on all the nodes again.
So there might be a reasonable chance that disabling ELC will work out (hopefully!).
|
|
Hello Eric,
Can you give us some feedback on how running with ELC disabled has worked out? I hope we can definitively confirm that this ticket is a duplicate of LU-4300 and close it accordingly ...
|
|
Hello Bruno
All nodes have been up for at least 45 days now without any new evictions. I guess we can say that disabling ELC works and we may close this ticket.
Thanks for the collaboration!
|
|
Thanks Matteo!
|