[LU-4786] Apparent denial of service from client to mdt Created: 20/Mar/14 Updated: 24/Nov/14 Resolved: 18/Aug/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.5.3 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Manish Patel (Inactive) | Assignee: | Oleg Drokin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 13165 |
| Description |
|
We generally use "1.8.9wc1-sanger1" on the clients; scratch109, 110 and 113 all run "2.2.0--PRISTINE-2.6.32-gmpc-2.6.32-220.el6-lustre.2.2" on the servers. We have recovered access to a filesystem which was effectively unavailable for around 24 hours. We have been supported by DDN (in particular Sven Tautmann) during the investigation of the issue, and the state of our other filesystems was discussed. We have a number of Lustre systems and we saw a similar issue on scratch113, scratch110 and finally scratch109. The symptom was that new access to the filesystem was lost. We determined that the issue was with the MDT, so we stopped the MDT, removed the Lustre modules and restarted the MDT. This allowed us to regain access, although we did have to abort recovery on the MDS on some of the systems after they had been stuck in recovery for more than 30 minutes. Normally we would look for a user job causing the issue, but nothing was obviously apparent. On scratch109, which is one of our general-purpose filesystems and holds our reference genomes (so an outage affects our production pipelines), restarting the MDT service did not solve the issue: the system failed almost immediately after we aborted the recovery. With assistance from DDN we took the following approach (this took place during the day of the 18th of March):
1) We cleared up some old configuration from the scratch109 MDS. The following was seen in the log:
Mar 18 08:46:04 lus09-mds1 kernel: LustreError: 3192:0:(mgs_handler.c:785:mgs_handle_fslog_hack()) Invalid logname received: params
root@lus09-mds1:/# mount
2) We saw issues with the uid/gid lookup, for example:
Mar 18 12:10:02 lus09-mds2 kernel: LustreError: 13888:0:(upcall_cache.c:236:upcall_cache_get_entry()) Skipped 208 previous similar messages
We moved the system to a new LDAP caching layer, which did appear to clear those messages up.
3) We then used iptables to limit access to the MDS to a small collection of clients and showed that the MDT and those clients were stable. We then spent a number of hours introducing clients to the MDT until all bar hgs4 (our kernel 3.8, Lustre 2.5 clients) had access to the filesystem (a minimal sketch of the iptables approach is shown below).
We would like to understand why the MDS failed, why when it was restarted it did not recover in any useful amount of time, and whether this is related to either the workload or the 2.5.0 clients. We have attached the output of "ddn_lustre_showall.sh" from each of the server systems, hgs4 and a random collection of clients (to provide a baseline).
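A minimal sketch of the iptables ring-fence from step 3, assuming the default LNET/socklnd TCP port (988) and a placeholder trusted subnet; the actual rules used at the site are not recorded in this ticket:
# Allow a trusted client subnet to reach the MDS over LNET (socklnd acceptor port 988).
iptables -A INPUT -p tcp --dport 988 -s 192.0.2.0/24 -j ACCEPT
# Drop everyone else until clients are reintroduced one by one.
iptables -A INPUT -p tcp --dport 988 -j DROP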
The following are a number of observations. On the client we saw:
Mar 18 09:14:49 hgs4a kernel: [71024.072174] LustreError: 6751:0:(import.c:323:ptlrpc_invalidate_import()) lus13-OST0013_UUID: rc = -110 waiting for callback (1 != 0)
While on the server the following was interesting:
Mar 17 11:52:15 lus13-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 1211s: evicting client at 172.17.125.17@tcp ns: mdt-ffff88082ad49000 lock: ffff88047a8be6c0/0xce376efee12486e2 lrc: 3/0,0 mode: PR/PR res: 8589950579/1911 bits 0x3 rrc: 5 type: IBT flags: 0x4000020 remote: 0xefde62456d81d72a expref: 851 pid: 14082 timeout: 6100853168
And similarly on lus09:
Mar 18 03:12:28 lus09-mds2 kernel: LustreError: 5245:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 10808s: evicting client at 172.17.143.41@tcp ns: mdt-ffff881819bc3000 lock: ffff880b1f8936c0/0x5e448803b245d974 lrc: 3/0,0 mode: PR/PR res: 411041793/558694943 bits 0x3 rrc: 1001 type: IBT flags: 0x4000030 remote: 0xdd627947772d3070 expref: 28 pid: 5399 timeout: 4328657010
The lengths of the timers on the locks do appear to be excessive, and having a timer expire after several hours seems less than optimal, and this might be related. It appeared that the problem was migrating around the filesystems and that, for example, an eviction from one system would cause an eviction from another system. In hindsight it does appear that our Lustre 2.5 clients get more evictions than the 1.8.9 ones. |
| Comments |
| Comment by James Beal [ 20/Mar/14 ] |
|
Given that we suspected the clients on hgs4 of causing issues, we decided to restrict the access of those machines to just scratch111 (2.4.2 with LU-4222) and scratch113 (Lustre 2.2), which are the specific filesystems allocated to the group that uses hgs4. Not long after access was granted, the MDT of scratch113 started to have issues and was completely unserviceable this morning. We have put iptables rules in place to restrict access from hgs4, and access to the filesystem has been restored (after waiting 25 minutes and then aborting recovery). We have a new filesystem, scratch114 (Lustre 2.2), which we will try to use to isolate the issue. |
| Comment by James Beal [ 20/Mar/14 ] |
|
Various log files and a set of lnet self test results. |
| Comment by Oleg Drokin [ 20/Mar/14 ] |
|
Can you describe your filesystems in a bit more detail please? For example, do you have hundreds of OSTs? |
| Comment by James Beal [ 20/Mar/14 ] |
|
scratch113 has 40 OSTs on 8 OSSs. The attachments have more logs than is probably useful. scratch113 is currently very full, although we saw the same issue on scratch109 and 110, which are at more reasonable levels. It is interesting to note that with this morning's issue with hgs4 the timeline looked something like: 08:23 firewall hgs4 from lus13, and the system was mostly usable by 09:30. Looking at the logs for scratch111, which is the other filesystem hgs4 had mounted, there were issues reported until around 09:34:34, when hgs4a was evicted from scratch111.
lfs df /lustre/scratch113
filesystem summary: 917422707200 835044719396 36491539284 96% /lustre/scratch113 |
| Comment by Oleg Drokin [ 20/Mar/14 ] |
|
This looks a bit like a known issue. Am I correct in reading that you only have this issue with 2.2.0 servers, and 2.4.2 servers appear to be fine? Do you have some job that makes this happen reliably that you can run? I'd be interested in getting some log traces from the MDT and a client to trace the lifetime of one such stuck lock on both ends, if feasible. |
| Comment by James Beal [ 20/Mar/14 ] |
|
We have had a look at scratch111, which is a 2.4.2 server, and we can see messages showing that it is unhappy about the client; however, the users have not reported issues with it, and looking at the throughput graphs it is not clear whether it was affected or not. The logs for lus11 are attached. We think we have a job that reliably causes the issue and a new filesystem which we can use to test the theory (that filesystem will only be available for a short while; if we need a system to test for longer we have a test system that we can bring into play). If you tell us exactly how to produce the traces we will happily do that and attach them to the call. |
| Comment by Oleg Drokin [ 21/Mar/14 ] |
|
So ideally you will need to reproduce with as few clients as possible (to contain the log size). Increase the debug level on the clients to include as much as possible: +dlmtrace +vfstrace +inode +ha +reada at the very minimum; +trace would be great. On the MDS just add +dlmtrace to the default debug level (echo +dlmtrace > /proc/sys/lnet/debug), and probably also increase the debug buffer size and run the debug daemon just in case. Then run your reproducer and, once it hits, stop the debug daemons with lctl debug_daemon stop on all nodes where it runs, collect the files and make them available for download (you can also convert them to text form with lctl df /path/file_name.bin > /somewhere/file_name.log).
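A consolidated sketch of the capture procedure above; the file paths and buffer sizes are examples, not values taken from this ticket:
# On each client: widen the debug mask and start a debug daemon.
echo "+dlmtrace +vfstrace +inode +ha +reada +trace" > /proc/sys/lnet/debug
lctl debug_daemon start /var/tmp/$(hostname)-lustre-dbg.bin 2048
# On the MDS: add dlmtrace only, grow the buffer, and run a daemon as well.
echo +dlmtrace > /proc/sys/lnet/debug
echo 1024 > /proc/sys/lnet/debug_mb
lctl debug_daemon start /var/tmp/mds-lustre-dbg.bin 2048
# Run the reproducer; once it hits, on every node:
lctl debug_daemon stop
lctl df /var/tmp/$(hostname)-lustre-dbg.bin > /var/tmp/$(hostname)-lustre-dbg.log
The .log files produced by the last step are the ones to upload. |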
| Comment by James Beal [ 22/Mar/14 ] |
|
I don't believe we have managed to run the reproduction yet, but those instructions look good thank you. |
| Comment by Helen Brimmer [ 24/Mar/14 ] |
|
So, we have been running the pipeline job again today with the increased debug logging against scratch114, which has only just been put into production and has the same configuration as the problematic scratch113. It takes a while to get going, so a lot of the log files have wrapped. After five hours, out of 11 clients running Lustre 2.5.0, we've had 3 succumb to LBUG kernel panics (assertion failure in CLS_QUEUING), 3 suffer seemingly permanent evictions from some OSTs and not others (a different set for each affected client), and we have not seen any problems on the MDT (which is what we are waiting to see). The dispatching process for the pipeline is still running, though the I/O activity has dropped to pretty much nil. We are now looking at devising ways to add some extra load to the filesystem to see if we can prompt the lockups to occur. |
| Comment by Helen Brimmer [ 27/Mar/14 ] |
|
We are planning to try and reproduce the problem with a different set of 2.5.0 clients which have a different kernel version (just to rule it out as a 'difference' with the hgs4 systems). As the filesystem is now stuck, we will need to reboot the MDS. Do you need us to preserve any state from the run with 'hgs4'? |
| Comment by Oleg Drokin [ 27/Mar/14 ] |
|
No. I am interested in client traces (+ lock timeout info from server logs) only, as that hopefully will answer my question about why the locks are not released there. So it's fine to reboot the MDS. |
| Comment by Oleg Drokin [ 28/Mar/14 ] |
|
It looks like you have the wrong debug level or subsystem mask on your clients. Sadly that means there is no useful info in those logs that I can use. Please double-check that when you do cat /proc/sys/lnet/debug on the clients, you see that all the desired debug levels are listed in return. Also please make sure that when doing cat /proc/sys/lnet/subsystem_debug you see all subsystems listed. It's unfortunate that the logs accumulate as fast as they do. Here is how to make sure you have debug logs that contain useful info. Take a server eviction message such as:
lus14-mds2-mar24-mar25-kern.log:Mar 25 12:42:53 lus14-mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:357:waiting_locks_callback()) ### lock callback timer expired after 781s: evicting client at 172.17.125.18@tcp ns: mdt-ffff88100430e000 lock: ffff880fb8e15b40/0xa0166cc2f1abd865 lrc: 3/0,0 mode: PR/PR res: 341311489/3520050123 bits 0x3 rrc: 39 type: IBT flags: 0x4000020 remote: 0x6b99a1f28ce9b702 expref: 14 pid: 45823 timeout: 5591090236
and note the number after "remote:". On the client side that handle appears after the slash in the lock: addr/handle field, for example:
00010000:00010000:3.0:1395970265.268930:0:13661:0:(ldlm_lock.c:797:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(CR) ns: lustre-MDT0000-mdc-ffff88005315d7f0 lock: ffff8800456b7db8/0xb71459050dddbe18 lrc: 3/1,0 mode: --/CR res: [0x200000007:0x1:0x0].0 bits 0x0 rrc: 1 type: IBT flags: 0x10000000000000 nid: local remote: 0x0 expref: -99 pid: 13661 timeout: 0 lvb_type: 0
Here "remote:" is 0, meaning this lock has not yet been replied to by the server. If you see at least:
00010000:00010000:3.0:1395970265.287973:0:13661:0:(ldlm_request.c:700:ldlm_cli_enqueue_fini()) ### client-side enqueue END ns: lustre-MDT0000-mdc-ffff88005315d7f0 lock: ffff8800456b7db8/0xb71459050dddbe18 lrc: 4/1,0 mode: PR/PR res: [0x200000007:0x1:0x0].0 bits 0x13 rrc: 1 type: IBT flags: 0x0 nid: local remote: 0xb71459050dddbe26 expref: -99 pid: 13661 timeout: 0 lvb_type: 0
then we have the lifetime of the lock from the moment it was granted by the server, and it is still a very good log. If you have no matches for the handle in the client log, that log does not contain anything useful related to the lock in question and is therefore likely useless.
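A short sketch of that check, using the handle from the example eviction message above; the client log path is an example:
# Handle taken from the "remote:" field of the server eviction message.
HANDLE=0x6b99a1f28ce9b702
grep "$HANDLE" /var/tmp/ht-client-lustre-dbg.log
# No matches: the client log wrapped past this lock's lifetime and is likely useless.
# A match on ldlm_cli_enqueue_fini: the log covers the lock from the time it was granted.
If none of the client logs contain the handle, a fresh capture will be needed. |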
| Comment by Helen Brimmer [ 28/Mar/14 ] |
|
Here is confirmation of how the hgs4 clients were set up for debugging:
[root@hgs4g ~]$ cat /proc/sys/lnet/debug
[root@hgs4g ~]$ cat /proc/sys/lnet/subsystem_debug
[root@hgs4g ~]$ cat /proc/sys/lnet/debug_path
[root@hgs4g ~]$ cat /proc/sys/lnet/debug_mb
[root@hgs4g ~]$ cat /proc/sys/lnet/debug_log_upcall
Please let me know if there is anything amiss here. As the log files were rotating so quickly, we did create several sets of debug output, so I will see if some of the other files contain anything useful. We'll run the pipeline tests on the other set of Lustre 2.5.0 clients today (they have a different kernel, so we are seeing if that is a potential issue with hgs4). I'll adjust the debug capture so that it doesn't collect osc and class messages, as you suggest. |
| Comment by Oleg Drokin [ 28/Mar/14 ] |
|
This output seems correct, so that's kind of strange how you ended up with the results you got then. |
| Comment by Helen Brimmer [ 28/Mar/14 ] |
|
We have had some success recreating the issue with a different set of clients today. The filesystem went into the error state we have been seeing within about five minutes of the pipeline being started. The sequence of events appears to be that the worker nodes start being evicted by various OSTs (a different set in each case). Some time later (~20 minutes today), the MDS starts to evict certain worker nodes but not others. Several hours later, we see stuck threads and Lustre error dumps on the MDS.
I believe that this time we have managed to capture some lock lifetimes. The debug_daemon files started off accruing a lot of data, but then appeared to calm down. We did take the precaution of restarting it as the file size approached the maximum, though one client (ht-1-1-03) continued to generate a lot of data, so some might have wrapped. I have uploaded compressed trace text files from all of the clients involved in running the pipeline (ht-1-1-*) and the MDS (lus14-mds2). They are available from the same URL as before. There are two for each system, with, I think, the most interesting stuff happening in the file stamped '115504' (which is generally a lot smaller than the earlier one). I'll be attaching kernel logs from all of the systems for reference.
The pipeline was started at 11:22 from ht-1-1-03, which acted as the master controlling node. Processing jobs were then issued to the remaining six systems. We saw OST evictions on all of the workers except ht-1-1-03 and ht-1-1-07. MDS evictions occurred on ht-1-1-04, ht-1-1-05, ht-1-1-06, ht-1-1-08 and ht-1-1-09, with ht-1-1-09 being particularly troublesome. I believe that we have captured lock lifetimes on ht-1-1-09, ht-1-1-04 and possibly ht-1-1-06 as well.
All of the clients are running Ubuntu 12.04.4, with kernel 3.2.0-58-generic and Lustre 2.5.0. Debug settings are as below:
[root@ht-1-1-03 ~]$ cat /proc/sys/lnet/debug
lus14-mds2 was rebooted prior to starting this run in order to reset its state from the previous run. It is also running Ubuntu 12.04.4 LTS but has a custom-built kernel, 2.6.32-gmpc-2.6.32-220.el6-lustre.2.2, in order to support Lustre version 2.2.0. It had the following debug settings:
[root@lus14-mds2 ~]$ cat /proc/sys/lnet/debug
Please let me know if there are any problems with this set of files. We are still trace logging into new files, so we can upload those too if you need to see what came next. |
| Comment by Oleg Drokin [ 30/Mar/14 ] |
|
OK, this confirms that the client got the cancel request, prepared the reply in time, and handed it off to a ptlrpcd thread for asynchronous sending. The server apparently did not get the reply. So the mystery has now shifted to finding out whether the request was sent to the server at all and somehow lost there, or whether it was never sent from the client. Please try your workload with this patch and with all the debug enabled (including rpctrace this time); if the issue reproduces, please gather the logs as before, and if it's gone, then that might have been the culprit. |
| Comment by Helen Brimmer [ 31/Mar/14 ] |
We are using 2.5.0 clients; Can you confirm that this patch is applicable to that version? |
| Comment by Oleg Drokin [ 31/Mar/14 ] |
|
http://review.whamcloud.com/8922 is the patch that applies cleanly to 2.5 clients. I'll check your new logs, thanks. |
| Comment by Oleg Drokin [ 31/Mar/14 ] |
|
So in the new logs we can clearly see an RPC being formed for the lock cancel, but never sent. I am really interested to see if the patch referenced above helps. |
| Comment by Helen Brimmer [ 31/Mar/14 ] |
|
We are in the process of compiling it up now. We should be able to test it out tomorrow. |
| Comment by Oleg Drokin [ 31/Mar/14 ] |
|
Thank you for your effort in gathering those logs, btw. Really helpful. |
| Comment by Helen Brimmer [ 01/Apr/14 ] |
|
We have tested Lustre 2.5.0 with the patch applied.
[root@ht-1-1-03 lustre-dbug]# lsb_release -a
[root@lus14-mds2 ~]# lsb_release -a |
| Comment by Oleg Drokin [ 02/Apr/14 ] |
|
Thanks for the logs again. |
| Comment by James Beal [ 02/Apr/14 ] |
|
I am having issues applying the patch on the 2.5.0 branch.
root@isg-dev-1-01:/local/lustre-2.5.0-lu4509-lu4300/lustre-release# git log
commit ca629fd7c6f48801b993be88d10ec7569f209336
Author: Liang Zhen <liang.zhen@intel.com>
Date: Mon Jan 20 20:52:51 2014 +0800
LU-4509 ptlrpc: re-enqueue ptlrpcd worker
.....
commit be39b2aa4bf1469db0c936d1415301dca3a1effc
Author: Oleg Drokin <oleg.drokin@intel.com>
Date: Fri Oct 11 18:24:56 2013 -0400
Tag 2.5.0-RC1
root@isg-dev-1-01:/local/lustre-2.5.0-lu4509-lu4300/lustre-release# git cherry-pick bfae5a4e4a37e4057200bcc70dbff14e18a00797
Automatic cherry-pick failed. After resolving the conflicts,
mark the corrected paths with 'git add <paths>' or 'git rm <paths>'
and commit the result with:
git commit -c bfae5a4e4a37e4057200bcc70dbff14e18a00797
root@isg-dev-1-01:/local/lustre-2.5.0-lu4509-lu4300/lustre-release# git status
# Not currently on any branch.
# Changes to be committed:
# (use "git reset HEAD <file>..." to unstage)
#
# modified: lustre/include/lustre_dlm.h
# modified: lustre/mdc/mdc_request.c
#
# Unmerged paths:
# (use "git reset HEAD <file>..." to unstage)
# (use "git add/rm <file>..." as appropriate to mark resolution)
#
# both modified: lustre/ldlm/ldlm_request.c
# both modified: lustre/osc/osc_lock.c
# both modified: lustre/osc/osc_request.c
#
# Untracked files:
# (use "git add <file>..." to include in what will be committed)
#
# lustre/scripts/make_META.pl
I am not confident in resolving the conflict in the three files and we are awaiting further instruction. |
| Comment by Oleg Drokin [ 02/Apr/14 ] |
|
It looks like you are trying to cherry-pick the wrong patch. The one you are trying to cherry-pick is from master; the one you need is the one I referenced, which was made specifically for b2_5. It seems to cherry-pick fine for me:
Oleg-Drokins-MacBook-Pro:lustre green$ git checkout 2.5.0
HEAD is now at be39b2a... Tag 2.5.0-RC1
Oleg-Drokins-MacBook-Pro:lustre green$ git cherry-pick d1dcff3084e929f5768dc733cdc104cddf168c06
[detached HEAD e73ec59] LU-4509 ptlrpc: re-enqueue ptlrpcd worker
 Author: Liang Zhen <liang.zhen@intel.com>
 1 file changed, 50 insertions(+), 34 deletions(-)
Oleg-Drokins-MacBook-Pro:lustre green$ git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/51/9451/3 && git cherry-pick FETCH_HEAD
From http://review.whamcloud.com/fs/lustre-release
 * branch            refs/changes/51/9451/3 -> FETCH_HEAD
[detached HEAD d237657] LU-4300 ldlm: ELC picks locks in a safer policy
 Author: Jinshan Xiong <jinshan.xiong@intel.com>
 1 file changed, 6 insertions(+), 7 deletions(-) |
| Comment by James A Simmons [ 02/Apr/14 ] |
|
The |
| Comment by James Beal [ 03/Apr/14 ] |
|
Oleg, I have followed your instructions and we have a new client package. We are having to put the filesystem we were using into production; however, we have another system (although smaller) that we have configured in the same way. We plan to verify that we can still see the issue with that system before trying the new client. |
| Comment by Helen Brimmer [ 10/Apr/14 ] |
|
Apologies for the lack of updates. We are still in the process of putting the new test environment together. This is an older set which is smaller than the last one: 4 OSS servers providing 30 OSTs. The Lustre version on the servers is still 2.2.0, and we will be re-running the test with the patched client. |
| Comment by Helen Brimmer [ 29/Apr/14 ] |
|
Again, apologies for the long delay between updates. We have been having trouble reproducing the issue exactly on our alternative test system, and other issues locally have intervened. However, there are definitely still issues with the patched client. As there is still an issue, I have moved on to testing the other patch. I believe that the osc_lock_upcall issue is fixed in 2.5.1, so we will be looking to test this version next. |
| Comment by Helen Brimmer [ 30/May/14 ] |
|
It's been a busy month, but we have now been able to test Lustre 2.5.1 with the patches applied.
[root@ht-1-1-09 lustre-dbug]$ cat /proc/fs/lustre/version
lustre: 2.5.1
kernel: patchless_client
build: 2.5.1-gd6b5d00-CHANGED-3.2.0-58-generic
[root@ht-1-1-09 lustre-dbug]$ cat /proc/sys/lnet/debug
trace inode info ioctl neterror warning dlmtrace error emerg ha rpctrace vfstrace reada config console
[root@ht-1-1-09 lustre-dbug]$ cat /proc/sys/lnet/subsystem_debug
undefined mdc mds ost log llite rpc mgmt lnet lnd pinger filter echo ldlm lov lquota osd lfsck lmv sec gss mgc mgs fid fld
[root@ht-1-1-09 lustre-dbug]$ cat /proc/sys/lnet/debug_mb
4096
To try and ease the log gathering, we set the debug_daemon to restart at regular intervals throughout the run, so there are quite a number of files to look through this time (hopefully we have managed to capture something useful). The timeframe is from around 16:30 - 16:50 local time, by which time the machines had been running jobs for around 25 minutes. The first stuck ptlrpcd threads occur on ht-1-1-07 at around 16:32. Both machines were evicted by a different selection of OSTs, but so far we have not seen any evictions or problems with the MDT in this smaller test environment. I'll attach kernel logs from all of the machines separately. Let me know if there is anything else we can add. |
| Comment by James A Simmons [ 30/May/14 ] |
|
Can you try the tip of b2_5? I have been having success with normal file operations except for single shared files. That is being addressed in a separate ticket. |
| Comment by Helen Brimmer [ 02/Jun/14 ] |
|
Hi James, Can you clarify what you want us to test now? I'm afraid that I'm not familiar with Lustre development terminology. |
| Comment by Peter Jones [ 02/Jun/14 ] |
|
Thanks for your input, James. Helen, one of our engineers will give a recommendation after concluding the analysis of the latest information. |
| Comment by Oleg Drokin [ 05/Jun/14 ] |
|
Helen, I got your logs, but it seems only client logs from ht-1-1-09 and ht-1-1-07 are available. |
| Comment by Helen Brimmer [ 05/Jun/14 ] |
|
Many apologies, I overlooked the MDS (lus03-mds2) trace logs. I have now uploaded these (they are one large file as the rate of production is not as high as for the clients). We did not see any evictions from the MDS during the test run though, only the OSSes, which we currently are not collecting debug traces from. If you need us to do this, please let me know and I will get another run going as soon as I can. |
| Comment by Oleg Drokin [ 11/Jun/14 ] |
|
If you can upload the syslog portion of the OSS logs that shows the evicted lock handles, that might be enough. |
| Comment by Helen Brimmer [ 11/Jun/14 ] |
|
kern_logs_20140529.tar.gz attached to this ticket has kernel log files from all of the servers covering the period of the latest test. |
| Comment by Oleg Drokin [ 12/Jun/14 ] |
|
Thanks! I totally missed that you uploaded the kernel logs into the ticket and not to your ftp site, and there's no visual feedback in Jira that shows there was a file attached to that comment. Sorry. Anyway, I looked at the logs. Also, if you are trying without full debug first: if you see "INFO: task ptlrpcd_26:2213 blocked for more than 120 seconds" sort of messages in your client logs, that alone might be enough in some cases, so it might be a good idea to check for those before diving into the full debug runs (though I hope the problem will be over with this time around).
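A quick way to do that pre-check on each client, assuming a standard syslog layout (the path is an example):
# Look for hung ptlrpcd threads reported by the kernel hung-task detector.
grep -h "blocked for more than 120 seconds" /var/log/kern.log* | grep ptlrpcd
Any hits from that check are worth reporting even without a full debug capture. |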
| Comment by Helen Brimmer [ 12/Jun/14 ] |
|
Thanks for looking through those logs and confirming the issue. We will wait for the new patch to try. We ran a full debug run yesterday anyway, including '+dlmtrace' logs from the OSSs. In case those traces are useful, I've posted them to the FTP site and put the kernel logs there as well this time. On yesterday's run, all of the client systems (ht-1-1-*) suffered ptlrpcd lockups and effectively lost access to the filesystem, but only one actually got evicted (ht-1-1-07) by a few of the OSTs. Do you know if |
| Comment by Oleg Drokin [ 12/Jun/14 ] |
|
|
| Comment by Oleg Drokin [ 17/Jun/14 ] |
|
So, I have a question. Do you have a test system where you can try things out? If you are interested in testing the actual 2.6 prerelease code, you can grab the RPMs at http://build.whamcloud.com/job/lustre-master/2085/ - pick the client version for your distro. |
| Comment by Helen Brimmer [ 17/Jun/14 ] |
|
We can try out 2.6 candidates on the set of machines that we have already been using to test 2.5.x. We are an Ubuntu shop though, so I am afraid that RPMs are not that helpful for us (the test clients are running Ubuntu 12.04.4 with kernel 3.2.0-58-generic). We have been building the 2.5.x +patch variants from the git sources. If you can confirm the git release tag to checkout from 2.6, we can try building it from there. |
| Comment by Oleg Drokin [ 18/Jun/14 ] |
|
It's git hash 5c4573e327c5d027a6b877f75910767244161a1f.
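A minimal sketch of building a patchless client from that hash on the Ubuntu test nodes; the repository URL and configure options are assumptions rather than instructions from this ticket:
# Check out the referenced prerelease hash and build a client-only tree.
git clone git://git.whamcloud.com/fs/lustre-release.git
cd lustre-release
git checkout 5c4573e327c5d027a6b877f75910767244161a1f
sh autogen.sh
./configure --disable-server   # patchless client build against the running kernel
make -j"$(nproc)"
Helen confirms below that a client built from this hash (2.5.59 build #2085) ran the pipeline without lockups. |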
| Comment by Helen Brimmer [ 19/Jun/14 ] |
|
We have successfully got 2.5.59 build #2085 to build on our systems and have been running the test pipeline on six worker nodes for around 22 hours now. So far, we have not seen any stuck ptlrpcd threads, and the jobs continue to flow through our batch scheduler. With the previous versions, we would have seen the nodes lock up within an hour of receiving pipeline jobs. To confirm that we've got the version you wanted us to test:
[root@ht-1-1-07 lustre-dbug]$ cat /proc/fs/lustre/version
lustre: 2.5.59
kernel: patchless_client
build: 2.5.59-g5c4573e-CHANGED-3.2.0-58-generic
We have not seen any problems on the servers either, which are still running v2.2.0. This looks very promising, but we do need to have a stable release for our production environment. |
| Comment by Jinshan Xiong (Inactive) [ 23/Jun/14 ] |
|
I'm working on this issue. |
| Comment by Jinshan Xiong (Inactive) [ 24/Jun/14 ] |
|
Can you please try patch: http://review.whamcloud.com/10795 |
| Comment by Helen Brimmer [ 26/Jun/14 ] |
|
We have applied that patch to Lustre 2.5.1 and restarted the pipeline. The test nodes have now been running the jobs for over 18 hours and we have not seen any stuck threads yet.
[root@ht-1-1-09 root]$ cat /proc/fs/lustre/version
lustre: 2.5.1
kernel: patchless_client
build: 2.5.1-g0b9a5ad-CHANGED-3.2.0-58-generic
|
| Comment by Helen Brimmer [ 09/Jul/14 ] |
|
We have continued to run the problematic pipeline and other workloads on the 2.5.1 clients patched with the change above. |
| Comment by Oleg Drokin [ 11/Aug/14 ] |
|
Hm, so the client was evicted by the server because it did not reply to a request to release a contended lock. After the eviction we see the import invalidation thread stuck on a semaphore and then a bunch of ptlrpcd threads stuck on a semaphore, potentially indicating there is some client-side deadlock involved. Whether it was triggered by the eviction or was a separate event is not yet clear. This is most definitely a different issue than what you had before, so that's a good sign, I guess. |
| Comment by Helen Brimmer [ 11/Aug/14 ] |
|
Jira doesn't seem to want to accept attachments at the moment, so I've uploaded the whole kern.log file for last week and the equivalent syslog files to our ftp site (filenames ht-1-1-16-*.gz). This particular client reported some 'revalidate FID' errors a few days before, but other than that was very quiet. |
| Comment by Jinshan Xiong (Inactive) [ 12/Aug/14 ] |
|
There is not enough information to identify the problem. Can you please reproduce the problem and, when you see it again, collect the debug information? Thanks. From the limited log, I can see that the file has 40 stripes: do you have any specific reason to have widely striped files? Also, what is the access mode to the files: does the workload do appending writes to the files? Jinshan |
| Comment by Helen Brimmer [ 12/Aug/14 ] |
|
This latest test was run against one of our production filesystems, so it may take a little time to get permission to run it again. Since it is relatively quiet at the moment due to summer holidays, hopefully we will get a chance this week. In the meantime, I have run the commands you suggested anyway on the client, since it has not done anything since the run last week (and the Lustre modules were not unloaded after the filesystem was unmounted). It looks like 'lctl dk' has captured information from the duration of the pipeline run, but I'm not sure that dmesg still contains anything useful. I've uploaded the following files to our FTP site anyway, in case they are useful:
ht-1-1-16-20140812-lustre-dmp.txt.gz
(pydmesg is an in-house wrapper for dmesg that converts the 'time since uptime' timestamps into realtime values.)
The pipeline is produced and run by one of our scientific research teams; I will have to check with them about the exact details of its file accesses. Is there a way to identify the file that the log refers to? The pipeline generates over 3000 individual batch jobs, each with their own set of reference files and output files... |
| Comment by Jinshan Xiong (Inactive) [ 12/Aug/14 ] |
|
If the system is still alive, you can try this command to get the name of the affected file:
lfs fid2path <lustre_dir> "[0x20000b8ab:0x777f:0x0]"
<lustre_dir> is the directory where you mount the Lustre file system; it is /lustre/scratch113 in your case. I will take a look at the files you attached. |
| Comment by Jinshan Xiong (Inactive) [ 12/Aug/14 ] |
|
I checked the attachments but didn't find anything useful there. My gut feeling is that this may be related to the access pattern of those widely striped files. |
| Comment by Helen Brimmer [ 13/Aug/14 ] |
|
Okay, if you need us to increase the debug level on the clients, please let us know which flags to apply. The problematic file appears to be /lustre/scratch113/teams/hgi/dddtest/variant_calling/20140805-lustest/pooled/.jobs/call_variants.w.1501.o which has zero length. It is one of many .o files in that directory, the largest of which is only 16K. The '.jobs' directory is where the STDOUT and STDERR from each pipeline job is stored. These files will have data appended to them as the jobs produce their output (or errors). I will check with the pipeline owners if the striping is intentional - we don't recommend striping files this small... |
| Comment by Peter Jones [ 14/Aug/14 ] |
|
Helen, we have landed the existing patch you are carrying for 2.5.3, as that has clearly improved the situation markedly. Could you please open a new ticket to track this residual issue? This will make things easier for us (and others) to track what changes are in a given release. Thanks, Peter |
| Comment by Jinshan Xiong (Inactive) [ 16/Aug/14 ] |
|
Please apply the following debug flags on the client side:
lctl set_param debug="vfstrace dlmlock cache reada"
Yes, there is no reason to use wide striping if the files are so small. I think I understand why it happened: if this is STDOUT and STDERR output, most people will do
command >> output 2>&1
The output file is a Lustre file, and it will be opened and written with O_APPEND mode, which is really bad for a widely striped file. Just my guess; please verify it.
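If the wide striping turns out to be unintentional, a hedged example of restricting new files in that directory to a single stripe (the directory path is taken from the earlier comment):
# New files created under .jobs will inherit a single-stripe layout.
lfs setstripe -c 1 /lustre/scratch113/teams/hgi/dddtest/variant_calling/20140805-lustest/pooled/.jobs
# Confirm the directory's default layout.
lfs getstripe -d /lustre/scratch113/teams/hgi/dddtest/variant_calling/20140805-lustest/pooled/.jobs
Note that restriping the directory only affects files created afterwards; existing files keep their current layout. |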
| Comment by Peter Jones [ 18/Aug/14 ] |
|
OK, I will close this ticket to reflect that the original patch has landed for 2.5.3. If/when any additional information is supplied about this residual issue, we can track that under a new ticket and link it to this one. |