[LU-4593] Lustre OSTs on one of the login nodes are evicted but never reconnect. Created: 06/Feb/14 Updated: 23/Nov/17 Resolved: 14/May/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Oz Rentas | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl | ||
| Attachments: |
|
| Severity: | 2 |
| Rank (Obsolete): | 12546 |
| Description |
|
AWE has reported the following problem with Lustre. Extract of syslog:
Jan 14 08:54:13 sprig2 kernel: [680476.616831] LustreError: 809:0:(cl_lock.c:1420:cl_unuse_try()) result = -108, this is unlikely!
This looks similar to the issue reported in https://jira.hpdd.intel.com/browse/LU-3889. |
| Comments |
| Comment by Peter Jones [ 06/Feb/14 ] |
|
Bobijam, could you please help with this one? Thanks, Peter |
| Comment by Oz Rentas [ 12/Feb/14 ] |
|
Sprig2 hung again on 6th Feb. Log files from the client as well as the OSS and MDS nodes are attached. The only user on the machine at that time said he was updating the working copy of a subversion repository and had a couple of editor sessions open. Thanks. |
| Comment by Oz Rentas [ 18/Feb/14 ] |
|
Another crash event on 2/17. Attaching a fresh set of logs. Any feedback is appreciated. |
| Comment by Oleg Drokin [ 19/Feb/14 ] |
|
Hm, what do you mean by another "crash" event? Before, you only had client hangs with the diagnostics, right? What was the crash? |
| Comment by Zhenyu Xu [ 19/Feb/14 ] |
|
I see several communication errors:
Feb 6 16:18:56 sprig2 kernel: [621533.228954] LustreError: 11-0: scratch-OST0003-osc-ffff881f4fe08400: Communicating with 11.3.0.14@o2ib, operation obd_ping failed with -107.
oss1-messages:Feb 17 13:04:45 sprig-oss1 kernel: : LustreError: 138-a: scratch-OST0003: A client on nid 11.3.0.47@o2ib was evicted due to a lock blocking callback time out: rc -107
Is the Lustre network working while the client is hung? |
| Comment by Zhenyu Xu [ 19/Feb/14 ] |
|
You can use "lctl ping <NID>" to test lustre network, NID can be MDS nid or OSS nid, like 11.3.0.47@o2ib |
| Comment by Oz Rentas [ 03/Mar/14 ] |
|
Hi, this came in from the customer on 2/26. The SR owner asked me to update the notes with the new logs but I missed that original request. Here it is: Please find attached “lctl ping” test results as well as the dmesg, Lustre log and messages file produced by a Lustre OST eviction event that occurred this morning on sprig5. This time we were the ones who triggered the bug rather than our users. We were recursively deleting a number of directories which contained a large number of small files on sprig5, which we use as a “scheduler node”. We were therefore the only users logged in, and the deletion was performed as the root user. I have since been able to recreate the issue on a compute node using a simple bash script that creates a large number of small files and then recursively deletes them using the rm command (see the sketch below). Unfortunately, the bug isn’t 100% reproducible, but I hope this information may help you debug the issue further.
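For illustration, a minimal sketch of the kind of reproducer described above (the file count, file size, and target directory are assumptions; /scratch is assumed to be the Lustre mount point of the "scratch" filesystem seen in the logs):
# Sketch: create many small files on the Lustre mount, then delete them recursively.
mkdir -p /scratch/evict-test
for i in $(seq 1 100000); do
    dd if=/dev/zero of=/scratch/evict-test/file.$i bs=4k count=1 2>/dev/null
done
rm -rf /scratch/evict-test |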
| Comment by Zhenyu Xu [ 04/Mar/14 ] |
|
I still see that sprig5 lost its connection to the OSS (11.3.0.14@o2ib) for an unknown reason, and this makes a lot of ongoing IO error out, e.g.:
Feb 26 10:51:40 sprig5 kernel: [7247612.124147] LustreError: 129994:0:(cl_lock.c:1420:cl_unuse_try()) result = -108, this is unlikely! |
| Comment by Rajeshwaran Ganesan [ 04/Mar/14 ] |
|
Received the latest logs. The eviction has happened again on a login node. We’ve been able to get the output from both dump files, dmesg, and the messages file. We tried lctl ping and were able to ping each of the MDS and OSS nodes. |
| Comment by Zhenyu Xu [ 04/Mar/14 ] |
|
from lustre-lost.1393924031.9592 and
Is the OSS which holds OST0001/4/7 very busy during the eviction? |
| Comment by Rajeshwaran Ganesan [ 06/Mar/14 ] |
|
We have checked the system load, but it looks normal; here is the update received: OST 1 & 4 are hosted on oss1, OST 7 is hosted on oss2. We have sysstat installed on the OSSs. The OSS machines don’t seem heavily loaded. Most samples report 99.90% idle using the sar command. The eviction happened at 23:08 last night, but the OSSs reported no real change to the load at that time. Looking at the load on the machines yesterday, we see peaks of SYSTEM use and IOWAIT times at 10:30 and 14:30 on both OSSs. These peaks increase from the normal 0.07% system use to 0.30%, and from 0.01% iowait to 0.27%, at 14:30. I wouldn’t imagine this is considered heavily loaded? |
| Comment by Zhenyu Xu [ 06/Mar/14 ] |
|
Would you mind trying to re-hit the issue one more time with the following debug setting on the client, OSS, and MDS:
# lctl set_param debug="+rpctrace +dlmtrace"
and also try to set up a bigger debug buffer:
# lctl set_param debug_mb=<# of megabytes>
After the eviction happens, run "lctl dk > log_file" on the client, OSS, and MDS nodes. I am trying to understand where the eviction comes from; the existing log is brief and does not cover the troublesome lock's lifespan. Thanks in advance.
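For illustration, a minimal sketch of this collection procedure (the 512 MB buffer size and the output path are assumptions, not values requested above):
# Run on the client, OSS and MDS nodes before reproducing the problem.
lctl set_param debug="+rpctrace +dlmtrace"    # add RPC and DLM lock tracing
lctl set_param debug_mb=512                   # enlarge the debug buffer (example size)
# ...reproduce the eviction, then dump the debug buffer on each node:
lctl dk > /tmp/lustre-debug.$(hostname).log |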
| Comment by Rajeshwaran Ganesan [ 11/Mar/14 ] |
|
I have added a new set of logs to the FTP site, ftp.whamcloud.com/uploads/. The filenames are |
| Comment by Rajeshwaran Ganesan [ 11/Mar/14 ] |
|
Sprig3 has just been evicted (by user activity) – that was 07 March 2014 16:41 |
| Comment by Zhenyu Xu [ 13/Mar/14 ] |
|
The client logs uploaded (2014-03-10-SR29797_AWE_luster_client.tar) only cover Fri Mar 7 10:08:16 to Fri Mar 7 10:08:46, while the eviction happened after Mar 7 15:41:18. Also, I could not find the error message in the lustre-log.xxxx.xxx files. I think you've set /proc/sys/lustre/dump_on_eviction to 1, so that a log is dumped automatically when an eviction happens; the problem is that when evictions happen again and again, the same log file for the same thread gets overwritten. Please set /proc/sys/lustre/dump_on_eviction to 0. I'm still checking the server logs. PS. Can I know what the "cat /proc/sys/lustre/dump_on_eviction" and "lctl get_param debug" commands output? Thanks.
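For illustration, a minimal sketch of the checks and the change requested above, to be run on the affected client (a sketch, not an exact procedure):
cat /proc/sys/lustre/dump_on_eviction         # report the current setting (suspected to be 1)
lctl get_param debug                          # report the active debug flags
echo 0 > /proc/sys/lustre/dump_on_eviction    # disable automatic dumps so logs are not overwritten |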
| Comment by Zhenyu Xu [ 13/Mar/14 ] |
|
Also, can you help find out whether the network works when high load happens? What is the network topology and setup here? I strongly suspect it could be a network problem, since the logs keep showing connections being lost between nodes, reconnections afterwards, and the cycle repeating. |
| Comment by Zhenyu Xu [ 17/Mar/14 ] |
|
Let me clarify the situation of this issue. |
| Comment by Rajeshwaran Ganesan [ 18/Mar/14 ] |
|
1. sprig3 was evicted and none of the OSTs reconnected. The node was not hung, as we could log into it to grab the Lustre logs etc. In the end we rebooted to get Lustre back into service on the node. |
| Comment by Zhenyu Xu [ 19/Mar/14 ] |
|
An update on what I've observed so far. In the evicted client log (lustre-log.1394206866.9593.gz), from the beginning (Mar 7 08:47:46) to the end of the log (Mar 7 17:07:46), there are several enqueue (o101) RPCs that stay in flight to OSS1 and never get a response from OSS1.
Mar 7 15:42:46 sprig3 kernel: [1562020.535941] LustreError: 27830:0:(import.c:324:ptlrpc_invalidate_import()) scratch-OST0005_UUID: rc = -110 waiting for callback (2 != 0)
Mar 7 15:42:46 sprig3 kernel: [1562020.535961] LustreError: 27830:0:(import.c:350:ptlrpc_invalidate_import()) @@@ still on sending list req@ffff883ac3c15800 x1460291788607144/t0(0) o101->scratch-OST0005-osc-ffff883f77c95800@11.3.0.14@o2ib:28/4 lens 328/368 e 0 to 0 dl 1394205665 ref 1 fl Interpret:RE/0/0 rc -5/0
...
sprig3 is using 11.3.0.47@o2ib (dmesg.sprig3):
[ 222.193940] LNet: Added LNI 11.3.0.47@o2ib [8/512/0/180]
And oss1 shows the corresponding communication with sprig3 (grep -n "11.3.0.47" messages.oss1):
45:Mar 7 15:40:54 sprig-oss1 kernel: : LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 1406s: evicting client at 11.3.0.47@o2ib ns: filter-scratch-OST0000_UUID lock: ffff88014fbf1480/0x7385d82236277793 lrc: 3/0,0 mode: PW/PW res: [0xc943f9:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->8191) flags: 0x10020 nid: 11.3.0.47@o2ib remote: 0x9f2b96c814304d30 expref: 75858 pid: 32611 timeout: 13979851913 lvb_type: 0
49:Mar 7 15:40:54 sprig-oss1 kernel: : LustreError: 32600:0:(client.c:1049:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff88073901c800 x1451777805607860/t0(0) o104->scratch-OST0003@11.3.0.47@o2ib:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:N/0/ffffffff rc 0/-1
51:Mar 7 15:40:54 sprig-oss1 kernel: : LustreError: 32600:0:(ldlm_lockd.c:709:ldlm_handle_ast_error()) ### client (nid 11.3.0.47@o2ib) returned 0 from blocking AST ns: filter-scratch-OST0003_UUID lock: ffff8801a74f3480/0x7385d8223628a2da lrc: 4/0,0 mode: PR/PR res: [0xc94b3e:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x10020 nid: 11.3.0.47@o2ib remote: 0x9f2b96c814336365 expref: 70975 pid: 32641 timeout: 13979952124 lvb_type: 1
...
87:Mar 7 15:41:06 sprig-oss1 kernel: : LustreError: 478:0:(ldlm_lockd.c:2348:ldlm_cancel_handler()) ldlm_cancel from 11.3.0.47@o2ib arrived at 1394206866 with bad export cookie 8324297127499047490
88:Mar 7 15:41:06 sprig-oss1 kernel: : LustreError: 478:0:(ldlm_lock.c:2433:ldlm_lock_dump_handle()) ### ### ns: filter-scratch-OST0005_UUID lock: ffff88059e1dab40/0x7385d82234639d0a lrc: 3/0,0 mode: PR/PR res: [0x933e33:0x0:0x0].0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x0 nid: 11.3.0.47@o2ib remote: 0x9f2b96c80fb8a3f9 expref: 41230 pid: 13975 timeout: 0 lvb_type: 1
91:Mar 7 15:44:53 sprig-oss1 kernel: : Lustre: scratch-OST0005: haven't heard from client ec8e7f8a-ffc8-0d43-aa6a-b5bdfd07799b (at 11.3.0.47@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff880848a7d800, cur 1394207093 expire 1394206943 last 1394206866
93:Mar 7 15:44:55 sprig-oss1 kernel: : Lustre: scratch-OST0002: haven't heard from client ec8e7f8a-ffc8-0d43-aa6a-b5bdfd07799b (at 11.3.0.47@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8804d2b2fc00, cur 1394207095 expire 1394206945 last 1394206868
95:Mar 7 15:45:08 sprig-oss1 kernel: : Lustre: scratch-OST0003: haven't heard from client ec8e7f8a-ffc8-0d43-aa6a-b5bdfd07799b (at 11.3.0.47@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff880582cb6800, cur 1394207108 expire 1394206958 last 1394206881
101:Mar 7 15:45:20 sprig-oss1 kernel: : Lustre: scratch-OST0004: haven't heard from client ec8e7f8a-ffc8-0d43-aa6a-b5bdfd07799b (at 11.3.0.47@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff880848a7d000, cur 1394207120 expire 1394206970 last 1394206893 |
| Comment by Zhenyu Xu [ 20/Mar/14 ] |
|
From the client's forever-waiting request status:
Mar 7 15:42:46 sprig3 kernel: [1562020.535961] LustreError: 27830:0:(import.c:350:ptlrpc_invalidate_import()) @@@ still on sending list req@ffff883ac3c15800 x1460291788607144/t0(0) o101->scratch-OST0005-osc-ffff883f77c95800@11.3.0.14@o2ib:28/4 lens 328/368 e 0 to 0 dl 1394205665 ref 1 fl Interpret:RE/0/0 rc -5/0
its rq_phase is "Interpret", and it has been replied to and contains an error ("RE"), which drew my eyes to ptlrpc_check_set():
ptlrpc_check_set()
...
interpret:
        LASSERT(req->rq_phase == RQ_PHASE_INTERPRET);
        /* This moves to "unregistering" phase we need to wait for
         * reply unlink. */
        if (!unregistered && !ptlrpc_unregister_reply(req, 1)) {
                /* start async bulk unlink too */
                ptlrpc_unregister_bulk(req, 1);
                continue;
        }
        if (!ptlrpc_unregister_bulk(req, 1))
                continue;
        /* When calling interpret receiving already should be
         * finished. */
        LASSERT(!req->rq_receiving_reply);
        ptlrpc_req_interpret(env, req, req->rq_status);
        ptlrpc_rqphase_move(req, RQ_PHASE_COMPLETE);
I suspect there is a bug here. |
| Comment by Zhenyu Xu [ 20/Mar/14 ] |
|
Hi Rajeshwaran, this is a debug patch: http://review.whamcloud.com/#/c/9732/. Only the client needs it. Please remember to set the debug level to include rpctrace and dlmtrace; the currently uploaded logs do not contain these debug messages. Please apply this debug patch, turn on these debug levels, and try to re-hit the issue. Thank you. |
| Comment by Rajeshwaran Ganesan [ 21/Mar/14 ] |
|
Hello, is it possible to get the patched Lustre source RPM? We had some issues while compiling the source RPM with the patches. It would be great if you could give us the patched Lustre client. Thanks, |
| Comment by Zhenyu Xu [ 21/Mar/14 ] |
|
http://build.whamcloud.com/job/lustre-reviews/22597/ could possibly contain what you need; you can find your system in the build matrix, and in its "Build Artifacts" you can find the corresponding source RPM. |
| Comment by Rajeshwaran Ganesan [ 26/Mar/14 ] |
|
We have applied the patch on the client and are monitoring it for any evictions. |
| Comment by Bruno Faccini (Inactive) [ 26/Mar/14 ] |
|
Here are some news/updates for this ticket after the conf-call: |
| Comment by Rajeshwaran Ganesan [ 02/Apr/14 ] |
|
The ibdiagnet logs are on the FTP server under
The clients are running SLES 11 SP3:
Server Info: |
| Comment by Bruno Faccini (Inactive) [ 02/Apr/14 ] |
|
Hello Rajeshwaran, |
| Comment by Bruno Faccini (Inactive) [ 03/Apr/14 ] |
|
After a joint conf-call with DDN/SGI/AWE, it appears that they no longer trigger evictions when running Lustre 2.4.3 (including debug patch #9732) on the clients. So we need to determine whether this new behavior is caused by the debug patch or by changes between 2.4.1 and 2.4.3. |
| Comment by Bruno Faccini (Inactive) [ 09/Apr/14 ] |
|
As requested during the last conf-call, I am adding here the commit-message content from patch #9732, which describes its non-debug (fix) change:
LU-4593 ptlrpc: req could loop forever in sending list
In ptlrpc_check_set(), when a request runs into RQ_PHASE_UNREGISTERING phase, if the reply is successfully unlinked, the unregistered flag should be set accordingly, otherwise the request could possibly run into interpret section and repeat the reply unregister. |
| Comment by Bruno Faccini (Inactive) [ 16/Apr/14 ] |
|
Rajeshwaran: I forgot to ask whether, when running with the debug version of #9732, you found any of the added "Handled 1000 reqs in the set 0xxxxx during yyyy seconds." debug messages, along with a long list of ptlrpc_request(s) being dumped for the set, in the debug log (with dlmtrace+rpctrace set) of the client where you reproduced the problem?
Bobi: now that you have pushed a "fix only" version/patch-set of #9732, do you think the same fix is required for b2_5/master too?
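For illustration, a quick way to check a client for that message might be the following sketch (the log path is an assumption):
lctl dk > /tmp/sprig-client-debug.log                            # dump the client's debug buffer
grep "Handled 1000 reqs in the set" /tmp/sprig-client-debug.log  # look for the message added by patch #9732 |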
| Comment by Zhenyu Xu [ 16/Apr/14 ] |
|
Yes, I'll push patches for master and b2_5. |
| Comment by Zhenyu Xu [ 17/Apr/14 ] |
|
The patch for master is tracked at http://review.whamcloud.com/#/c/9975/, and for b2_5 at http://review.whamcloud.com/#/c/9992. |
| Comment by Rajeshwaran Ganesan [ 14/May/14 ] |
|
Hello, please close the ticket. Thanks, |
| Comment by John Fuchs-Chesney (Inactive) [ 14/May/14 ] |
|
Thank you for the update and confirmation Rajesh – very helpful! |