[LU-4591] Related cl_lock failures on master/2.5 Created: 05/Feb/14 Updated: 12/Nov/14 Resolved: 04/Apr/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0, Lustre 2.6.0, Lustre 2.4.2 |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.5.2 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Patrick Farrell (Inactive) | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | MB, mn4 | ||
| Environment: |
Master clients on SLES11SP3, server version irrelevant (tested against 2.1,2.4,2.4.1,2.5). |
||
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 12543 |
| Description |
|
We've seen a number of different cl_lock bugs in master/2.4.1/2.5 that we believe are related. We have seen these in master as recently as two weeks ago (current master cannot run mmstress at all due to a problem in the paging code). These bugs are not present in Intel-released 2.4.0, but we've seen them in Cray 2.4.1 and 2.5 (which do not track precisely with the Intel versions of 2.4.1 and 2.5).
We've seen all of these bugs during our general purpose testing, but we believe they're related because all of them are reproduced easily by running multiple copies (on multiple nodes) of mmstress from the Linux Test Project (mtest05 - I will attach the source), and none of them seem to be present in 2.4.0. (At least, none of them are reproduced in that context.)
Not all of the stack traces below are from runs on master (sorry - it's not what I've got handy), but all of the listed bugs have been reproduced on master:
General protection fault in osc_lock_detach (this one seems to be the most common):
(osc_lock.c:1134:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 6
lov_lock_link_find()) ASSERTION( cl_lock_is_mutexed(sub->lss_cl.cls_lock) ) failed:
General protection fault in cl_lock_put (possibly the same issue as
lovsub_lock.c:103:lovsub_lock_state()) ASSERTION( cl_lock_is_mutexed(slice->cls_lock) ) failed:
General protection fault in cl_lock_delete:
One more, which is slightly different, but still caused by the same tests, not found in 2.4.0, etc.: Lustre is getting stuck looping in cl_locks_prune. We have many cases of applications failing to exit with processes stuck somewhere under cl_locks_prune - here are two examples:
> 05:18:21 [<ffffffff81005eb9>] try_stack_unwind+0x169/0x1b0
Sorry for the massive dump of information in one bug, but we strongly suspect these bugs have a single cause or several tightly related causes.
With assistance from Xyratex, we've singled these patches out as possible patches of interest that have come in between 2.4.0 and master:
most suspicious:
On my to-do list is testing master with some of these patches removed to see what, if any, effect this has on the bugs listed above. |
| Comments |
| Comment by Patrick Farrell (Inactive) [ 05/Feb/14 ] |
|
mmstress |
| Comment by Jinshan Xiong (Inactive) [ 05/Feb/14 ] |
|
Thanks for the work. I'd suggest you try removing those patches: 13079de and see what happens. Yes, I agree with you that this really looks like a single cause. Could you please get some logs so that our guys can take a look? |
| Comment by Jinshan Xiong (Inactive) [ 05/Feb/14 ] |
|
Can you please tell me how you ran the test program? |
| Comment by Patrick Farrell (Inactive) [ 05/Feb/14 ] |
|
Jinshan,
Thanks for the quick response.
About logs: Unfortunately, these problems don't happen with dlmtrace (or any of the other large debug flags, such as trace or rpctrace) enabled. I created a special debug patch with all calls to cl_lock_trace under a special debug flag and was able to hit it with only that enabled. I should be able to get those logs for you tomorrow morning. (Sorry I don't have them on hand, I had to clean out my old dumps/logs.) Just a heads up, Vitaly Fertman of Xyratex has been looking into this with us.
About mmstress: It's executed with no arguments, but we started multiple copies with our workload manager, which would run it on 100 cores - enough that we see the problems pretty quickly (with debug at default). That core count is allocated by putting NUM_CPUs jobs on each node, so if nodes had 8 cores, with 100 jobs we'd get 12 nodes with 8 jobs each and one with 4 jobs. |
| Comment by Patrick Farrell (Inactive) [ 05/Feb/14 ] |
|
Also, I'll (hopefully, system problems may interfere) be testing removing those patches tomorrow as well. |
| Comment by Alexander Boyko [ 06/Feb/14 ] |
|
I have added info from crash for one case. |
| Comment by Patrick Farrell (Inactive) [ 06/Feb/14 ] |
|
[Edit] Sorry, I forgot a bit of background info. There are actually two
Here's the list of patches I explored today:
I think the problem is a bad interaction between:
I started by removing all four of those patches I noted above from Cray 2.5, and confirmed there's no problem.
I tested the two
I tested
I tested
I tested
I tested
I tested
I'm going to test the two |
| Comment by Patrick Farrell (Inactive) [ 06/Feb/14 ] |
|
With further testing of the two |
| Comment by Jinshan Xiong (Inactive) [ 07/Feb/14 ] |
|
Hi Patrick, Thank you for the information, I will take a look. Just to confirm, you still can't reproduce this problem with dlmtrace enabled, is that right? |
| Comment by Patrick Farrell (Inactive) [ 07/Feb/14 ] |
|
Jinshan - Correct. The logs Alex Boyko provided are from a special debug patch with the calls to cl_lock_trace moved to their own debug flags (actually, two different ones). Both of those flags were on, but nothing else. So all calls to cl_lock_trace should be logged there. Also, in further stress testing (notice it is mmstress again) with the |
| Comment by Patrick Farrell (Inactive) [ 07/Feb/14 ] |
|
Jinshan - Something that came up in a Cray discussion of the history of
When we opened
Then Oleg reported Intel was still seeing the assertion from
Then we found racer.sh could reproduce the problem, and you and several of the Xyratex guys worked out a patch for it, which was labeled with
So I don't think we have any hard evidence that the second |
| Comment by Patrick Farrell (Inactive) [ 07/Feb/14 ] |
|
One further thought... |
| Comment by Jinshan Xiong (Inactive) [ 10/Feb/14 ] |
|
working on this ... |
| Comment by Patrick Farrell (Inactive) [ 10/Feb/14 ] |
|
Jinshan - I know from experience these bugs can be hard to replicate without a larger system. If you've got something you'd like tested (including a debug patch), I can make time to test it on one of our in house systems here where we can replicate the problems. |
| Comment by Jinshan Xiong (Inactive) [ 10/Feb/14 ] |
|
Patrick - will you please attach osc.ko here? |
| Comment by Patrick Farrell (Inactive) [ 10/Feb/14 ] |
|
Jinshan - Can you be more specific? osc.ko that goes with the logs that Alex Boyko provided? If that one, he'll have to provide that, unfortunately. |
| Comment by Jinshan Xiong (Inactive) [ 10/Feb/14 ] |
|
I just want to know the source of osc_lock_detach+0x46 so that I will know which freed data it was trying to access. |
| Comment by Jinshan Xiong (Inactive) [ 10/Feb/14 ] |
|
Most likely it implies the dlmlock was already freed; I just want to make sure. |
| Comment by Patrick Farrell (Inactive) [ 10/Feb/14 ] |
|
Ah, OK. I'll attach another osc.ko in a moment where we get the crash at the same line... In case it's enough, here's a disassembly of osc_lock_detach and a line number from the ko I'm going to attach:
crash> disassemble osc_lock_detach |
| Comment by Patrick Farrell (Inactive) [ 10/Feb/14 ] |
|
KO that goes with the disassembly in Paf's comment https://jira.hpdd.intel.com/browse/LU-4591?focusedCommentId=76645&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-76645 |
| Comment by Jinshan Xiong (Inactive) [ 10/Feb/14 ] |
|
Hi Patrick - do you have a crash dump in hand? if yes, can you please show me the state of corresponding cl_lock and osc_lock? |
| Comment by Patrick Farrell (Inactive) [ 10/Feb/14 ] |
|
Jinshan - Not right this second, but I'll try to get one uploaded for you so you can take a look. Sorry about not having one in hand. |
| Comment by Patrick Farrell (Inactive) [ 10/Feb/14 ] |
|
Dump is here, with KOs and console and messages log: uploads/ The node which went down with osc_lock_detach is named c2-0c0s7n3. This system was running Cray 2.5. |
| Comment by Jinshan Xiong (Inactive) [ 11/Feb/14 ] |
|
Hi Patrick - what's the tip of your branch? |
| Comment by Patrick Farrell (Inactive) [ 11/Feb/14 ] |
|
Jinshan - Sadly, we don't use git, so there's no answer to that question. Our 2.5 is 2.5 as released by Intel plus a number of patches we've pulled in, but I can replicate the same problems in the same way on master or Intel's released 2.5. If it would help, I could do it with one of those code bases - That's just the dump I had handy. |
| Comment by Jinshan Xiong (Inactive) [ 11/Feb/14 ] |
|
No worries, Patrick. I will take a look at the dump; sorry, I was interrupted by something else yesterday. |
| Comment by Jinshan Xiong (Inactive) [ 12/Feb/14 ] |
|
I've taken a look at the dump. I suspect this issue is related to
Patrick - can you please revert that patch and see what happens? |
| Comment by Patrick Farrell (Inactive) [ 12/Feb/14 ] |
|
Sure. I just started testing master with
Would you like the dump from that cl_lock_delete GPF? (No debugging enabled.)
Further update: I've hit these three other bugs from the list above:
GPF at osc_lock_detach+0x46
2014-02-12T12:19:01.577674-06:00 c0-0c2s2n3 LustreError: 3031:0:(osc_lock.c:1204:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 6
I also hit |
| Comment by Li Xi (Inactive) [ 19/Feb/14 ] |
|
Hi Patrick, Would you please share how you reproduce these bugs? I've tried running multiple processes of LTP mmstress on Lustre to reproduce them, but failed. I ran with "./mmstress -t 1" commands. Is there anything I am missing? Thanks! |
| Comment by Patrick Farrell (Inactive) [ 19/Feb/14 ] |
|
Li - Sure. I have never successfully reproduced these on a small system. My usual system has 70 nodes on it, though I expect something smaller could do it as well. But when I tried with two and three nodes, I wasn't able to reproduce the problem either. I run - with no command line options - 4 copies of mmstress per node, on ~70 nodes. All copies of mmstress are executed in the same directory on the Lustre file system. Within a half hour, on master or 2.5 or 2.4.1, I've hit about 15-20 of these problems. |
| Comment by Zhenyu Xu [ 28/Feb/14 ] |
|
Hi Patrick, Would you mind giving http://review.whamcloud.com/9433 a try? It's a rewrite of " |
| Comment by Patrick Farrell (Inactive) [ 28/Feb/14 ] |
|
Zhenyu - I tried this patch on master just now (master from today + your patch) with the mmstress reproducer. I hit essentially all of the bugs from above, and I suspect if I kept running, I would see the others. Here's the list of those I hit for sure:
GPF in osc_lock_detach
No exit, stuck in cl_locks_prune
LustreError: 12688:0:(osc_lock.c:1208:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 6
GPF in cl_lock_delete |
| Comment by Patrick Farrell (Inactive) [ 28/Feb/14 ] |
|
cl_lock debugging patch |
| Comment by Patrick Farrell (Inactive) [ 28/Feb/14 ] |
|
Zhenyu - I just attached a debug patch which breaks out the cl_lock_trace calls under their own debug flags. In the past, I've been able to hit some of these bugs with only the cllock and clfree debug flags (which that patch adds) enabled. (I can't hit them with any of the heavier debug flags, like dlmtrace or rpctrace, enabled.) Would you be interested in a dump and logs of one of these crashes with your patch and that debug patch?
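For context, here's a minimal sketch of the shape of that debug patch, assuming the 2.5-era libcfs/cl_lock code. The mask names follow the flags mentioned above, but the bit values and the call sites shown are placeholders for illustration, not necessarily what the attached patch actually uses:
/* Two new libcfs debug bits, so cl_lock tracing can be enabled without
 * turning on the much heavier dlmtrace: */
#define D_CLLOCK 0x04000000	/* all cl_lock_trace() call sites */
#define D_CLFREE 0x08000000	/* only the "free lock" trace point */

/* cl_lock_trace() call sites that previously passed D_DLMTRACE switch to
 * the new masks, e.g.: */
	cl_lock_trace(D_CLLOCK, env, "hold release lock", lock);
	/* ...and, in cl_lock_free(): */
	cl_lock_trace(D_CLFREE, env, "free lock", lock);
|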
| Comment by Zhenyu Xu [ 01/Mar/14 ] |
|
Yes please. Do you want me to combine these two patches into a single patch so you can get a built image easily? |
| Comment by Zhenyu Xu [ 01/Mar/14 ] |
|
FYI, I've pushed http://review.whamcloud.com/9441, the debug patch you provided, based on my patch. |
| Comment by Zhenyu Xu [ 03/Mar/14 ] |
|
Patrick, FYI, the errors you reported on Mar 1st are exactly the issues reported in |
| Comment by Patrick Farrell (Inactive) [ 03/Mar/14 ] |
|
Thanks for pointing those out - Good that someone else has seen them. I should be able to get you a dump with debugging (I hope) later today. |
| Comment by Patrick Farrell (Inactive) [ 03/Mar/14 ] |
|
Unfortunately, I've been unable to hit the bugs with debugging enabled. I'm trying with only the clfree debugging option on.
I did see something that may not be related... I have a number of threads not exiting, stuck here:
I've seen this before, but only, I think, on master doing these tests. It seems to happen when the tests are run for a long time. (Normally they aren't run very long, because nodes are dropping.) |
| Comment by Patrick Farrell (Inactive) [ 03/Mar/14 ] |
|
With debugging reduced to just the clfree flag (cl_lock_tracing only in cl_free), I started hitting the various bugs. I grabbed three dumps. Dumps are uploading and will be here in about 5 minutes:
I'll go back to testing with the cllock and clfree flags on to see if I can hit the bug. |
| Comment by Patrick Farrell (Inactive) [ 03/Mar/14 ] |
|
I suspect if you need better logs, we'll have to adjust the debug further. Before we discovered removing one of the
Here's that data. The first number is the # of calls to that particular cl_lock_trace call in the sample I gathered:
Looking at this list, are there any of the top 10 or so we could do without, or could reduce significantly? I'm concerned that reducing the amount of data printed by cl_lock_trace won't really change how heavy it is - I would think most of the cost is in printing the message (though I could be wrong).
I was also considering trying a modified version of cl_lock_trace which prints less information; the new version I'm going to try is cl_lock_trace_reduced0(), sketched below.
So, are there any of the most common cl_lock_trace calls you don't think we need? Once we've figured out how best to reduce the debug levels, I can test accordingly.
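Here's a rough sketch of what cl_lock_trace_reduced0() could look like. The signature mirrors the existing cl_lock_trace0(), and the struct cl_lock field names are from the current code, but exactly which fields to keep and the message format are just my working assumption for now:
static void cl_lock_trace_reduced0(int level, const struct lu_env *env,
                                   const char *prefix,
                                   const struct cl_lock *lock,
                                   const char *func, const int line)
{
        /* Print only what's needed to follow lock state transitions, to keep
         * the logging light enough that the races still reproduce. */
        CDEBUG(level, "%s: %p ref %d state %d holds %d users %d at %s():%d\n",
               prefix, lock, cfs_atomic_read(&lock->cll_ref),
               lock->cll_state, lock->cll_holds, lock->cll_users,
               func, line);
}
|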
| Comment by Jinshan Xiong (Inactive) [ 06/Mar/14 ] |
|
Hi Patrick, Will you please try this patch: http://review.whamcloud.com/9524 and see if it will help? Jinshan |
| Comment by Patrick Farrell (Inactive) [ 06/Mar/14 ] |
|
Sure, I'll test as soon as I can. Due to some poor planning on my part, that may not be until next week. I'll get results sooner if I can. |
| Comment by James A Simmons [ 12/Mar/14 ] |
|
Please cherry pick this to b2_5 |
| Comment by Peter Jones [ 12/Mar/14 ] |
|
James, this will certainly be a candidate to back-port once we have confirmation that the fix works.
Peter |
| Comment by Patrick Farrell (Inactive) [ 12/Mar/14 ] |
|
Jinshan - I'm sorry for the delay here (vacation, then system problems), but this patch doesn't fix the problem. I ran with this patch + master from last week. I wasn't able to hit the bugs with my cl_lock debugging enabled, unfortunately. I can provide node dumps from one of these nodes if desired. The bug set observed has changed somewhat...
Old bugs we're still seeing:
GPF in osc_lock_detach:
2014-03-12T11:37:21.627399-05:00 c0-0c2s1n2 LustreError: 20950:0:(osc_lock.c:1208:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 6
GPF in cl_lock_delete:
No exit, stuck in cl_locks_prune.
----------------------
Now for new things:
I'm seeing this error and related messages fairly often in the logs:
This one is new, and observed several times:
This is also (sort of) new - I also saw it with one of Bobijam's patches. It's possible this isn't related to the patches, but I haven't seen it except in testing with fairly recent master and one of these patches. (I haven't carefully tested recent master by itself.)
I also saw several dropped connections to some of our OSTs:
That seems likely to be a problem with our system rather than the Lustre client, but I haven't seen it before on this system, so I thought I'd mention it. |
| Comment by Patrick Farrell (Inactive) [ 12/Mar/14 ] |
|
A quick note - the version of master I used does have the patch for |
| Comment by Patrick Farrell (Inactive) [ 12/Mar/14 ] |
|
One further note: I haven't examined all of the patches offered by Jinshan and Bobijam well enough to be sure, but are they all conflicting? These bugs haven't been fixed by a number of different patches, and I'm starting to wonder if there isn't more than one fix needed - I know that's much less likely in general, but I thought I'd suggest it as something to consider. There's also a pair of patches that were suggested at one point by Vitaly F. @ Xyratex. They were not successful, but I'll attach them for reference. One is a patch to avoid recursive disclosures, the other is a tweak to usage of hold/get. Again, these did not resolve the issue, I'm just attaching them for reference. |
| Comment by Patrick Farrell (Inactive) [ 12/Mar/14 ] |
|
Attempted patch from Vitaly |
| Comment by Jinshan Xiong (Inactive) [ 12/Mar/14 ] |
I saw this, is the client being evicted at that time? |
| Comment by Patrick Farrell (Inactive) [ 12/Mar/14 ] |
|
Other attempted patch from Xyratex |
| Comment by Jinshan Xiong (Inactive) [ 12/Mar/14 ] |
|
Please share the core dump with us. Thanks. |
| Comment by Patrick Farrell (Inactive) [ 12/Mar/14 ] |
|
Ah, yes Jinshan - It was. Sorry, the messages were a bit garbled and I missed that. So that and the lost connection to the OST were a client eviction by the OST: |
| Comment by Patrick Farrell (Inactive) [ 12/Mar/14 ] |
|
Jinshan - Is there a particular failure you'd like a dump for? |
| Comment by Jinshan Xiong (Inactive) [ 12/Mar/14 ] |
|
Just provide me the latest failure with my patch applied, please. I will take a look. BTW, have you ever seen the issue on your production system? |
| Comment by Patrick Farrell (Inactive) [ 12/Mar/14 ] |
|
When you say the latest failure, do you just mean this one? All of the failures I listed were encountered during testing with your patch applied.
And yes, we've seen several of these on our production systems. I can't/shouldn't share the details, but until we found the workaround of removing the
Note that |
| Comment by Jinshan Xiong (Inactive) [ 12/Mar/14 ] |
Yes, let's start with this one for now |
| Comment by Patrick Farrell (Inactive) [ 12/Mar/14 ] |
|
Jinshan,
Unfortunately, I had to give up the test system before I could get those dumps... So, instead, I've got these six dumps for you; five of them are previously seen bugs and the sixth is a new one:
The first four are kernel panics; this is a dump of a node with a thread stuck in cl_locks_prune (which was NMI'ed while running, rather than having a kernel panic):
And this is the new bug:
Dumps will be in
The console log is also in there. There are many other nodes (10+) which went down that I didn't provide dumps for, because they were duplicates of the ones I picked. (You'll see their stack traces in the console log.) Upload of the dumps should be done in ~10-15 minutes.
|
| Comment by Jinshan Xiong (Inactive) [ 13/Mar/14 ] |
|
Hi Patrick,
I will take a look at the dump; please don't forget to copy the lustre modules
Anyway, you have provided an important clue about the error in the completion AST. If this error happened for every failing case, we can definitely get something from there.
Jinshan |
| Comment by Patrick Farrell (Inactive) [ 13/Mar/14 ] |
|
Jinshan, Bobijam - Just a general question: do you think breaking out the cl_lock_trace calls into their own debug flag (rather than being part of dlmtrace) is a good thing in general? My patch to do it is just a quick hack, but I'm wondering if a cleaned-up version of it - without special treatment for cl_free - is something we might want to land to master. It's been useful for me having it separated, but only because enabling full dlmtrace always prevents me from seeing these cl_lock bugs (and some of the earlier ones as well). If you think so, I'll submit a patch for it. |
| Comment by Jinshan Xiong (Inactive) [ 13/Mar/14 ] |
|
Hi Patrick, cl_lock is dying, so please don't waste any time on it. A simplified version of cl_lock will be introduced in the CLIO simplification project.
Jinshan |
| Comment by Patrick Farrell (Inactive) [ 14/Mar/14 ] |
|
OK. I like that answer. |
| Comment by Shuichi Ihara (Inactive) [ 29/Mar/14 ] |
|
We need this fix for b2_5, so I backported it: http://review.whamcloud.com/9851 |
| Comment by Bruno Faccini (Inactive) [ 30/Mar/14 ] |
|
After some investigation, correlating/cross-checking with the problems currently encountered at the CEA/Tera-100 site since they upgraded to Lustre 2.4.2, it appears that they also encounter almost all of the [L]BUGs described in this ticket; here is the list:
I have encouraged them to give the debug-trace setting (rpctrace+dlmtrace) a try and see if it helps avoid/reduce the frequency of the crashes, and they enabled this last Friday night. We will see on Monday whether their very bad stats (about 8 of the different crashes listed per day) have improved.
I have added Lustre 2.4.2 to the list of affected versions for this ticket.
What is unclear to me (and to the CEA people) with this ticket is: |
| Comment by Patrick Farrell (Inactive) [ 31/Mar/14 ] |
|
Bruno - Here's a breakdown from the Cray perspective, where we/I've been looking at these for a while.
9524 does not fix any single specific assertion/GPF. In my own testing, with mmstress on a 70-ish node system, it did not significantly reduce the incidence of the bugs listed in this ticket. (I wouldn't have noticed anything less than probably a 50% reduction, however, so it may improve things a bit.) According to review comments on 9524, it improves the success rate with racer. I haven't checked that, as racer isn't part of our usual test suite. (We don't pass it often enough.)
Cray has found a set of patches that seems to avoid the problems, though I don't believe it really fixes them. As I described in this comment:
We identified these patches as relevant: 13079de
I give the (lengthy) details in my original comment, but in essence, we removed:
from our 2.4 and 2.5, and have not seen any of the assertions/GPFs you listed in general testing since. In specific, focused testing with debug disabled, I've been able to hit one or two of them.
So I don't believe pulling the
Still, the
So, we decided it was safe to pull the |
| Comment by Bruno Faccini (Inactive) [ 02/Apr/14 ] |
|
Patrick, thanks for all these clarifications that are very helpful!! |
| Comment by Jinshan Xiong (Inactive) [ 02/Apr/14 ] |
|
I'm working on this issue. |
| Comment by Bruno Faccini (Inactive) [ 02/Apr/14 ] |
|
Just a small comment to indicate that CEA has not had any crash since they enabled only dlmtrace last Friday!! |
| Comment by Jinshan Xiong (Inactive) [ 02/Apr/14 ] |
|
Yes, it's recommended to revert the 2nd patch of |
| Comment by Jinshan Xiong (Inactive) [ 03/Apr/14 ] |
|
I think all occurrences of the problems point to the same root cause - the sub lock has already been freed. I think this is a race of |
| Comment by Jinshan Xiong (Inactive) [ 03/Apr/14 ] |
|
Patch is at http://review.whamcloud.com/9876. Patrick, can you please give it a try since you can consistently reproduce it? Jinshan |
| Comment by Patrick Farrell (Inactive) [ 03/Apr/14 ] |
|
Jinshan - With this patch applied on top of master, I'm getting a GPF in cl_lock_delete0, on line 841, which is this: /*
* From now on, no new references to this lock can be acquired
* by cl_lock_lookup().
*/
cfs_list_for_each_entry_reverse(slice, &lock->cll_layers,
cls_linkage) {
if (slice->cls_ops->clo_delete != NULL) <---- This line here.
slice->cls_ops->clo_delete(env, slice);
}
This happens swiftly when running mmstress, even with full debug enabled. I'm going to take a quick look to see if I can understand why, but I'll probably upload a dump (with full dk logs enabled) shortly... |
| Comment by Jinshan Xiong (Inactive) [ 03/Apr/14 ] |
|
Can you please give me stack trace? |
| Comment by Patrick Farrell (Inactive) [ 03/Apr/14 ] |
|
Oh, duh. Sorry Jinshan - I forgot:
[<ffffffff81006591>] try_stack_unwind+0x161/0x1a0
Dump is here: ftp.whamcloud.com |
| Comment by Jinshan Xiong (Inactive) [ 03/Apr/14 ] |
|
Please use patch version 2 of http://review.whamcloud.com/9881 |
| Comment by Ann Koehler (Inactive) [ 03/Apr/14 ] |
|
Jinshan, you might want to take a look at |
| Comment by Patrick Farrell (Inactive) [ 03/Apr/14 ] |
|
Jinshan - Wow. Finally some good news on these bugs - early testing results on master are perfect. I would expect to have seen 10-20 instances of these various bugs by now in my testing, and I have not seen any yet. I'm adding this patch and |
| Comment by Jinshan Xiong (Inactive) [ 03/Apr/14 ] |
|
That's really good, Patrick. Thank you for your effort on this bug. Hi Ann, I will take a look at it soon. Jinshan |
| Comment by Patrick Farrell (Inactive) [ 04/Apr/14 ] |
|
Jinshan - Testing last night completed without any problems. Thank you very much for this - it looks like we've probably finally fixed a bug we've been working on for a long time. (Much of the work on this bug on our side happened before we opened the bug with you.) I've given a positive review to the mod as well. I think this ticket could probably be closed as a duplicate of
Again, thank you very much. I've been working on this in various forms since about October of last year. |
| Comment by Jinshan Xiong (Inactive) [ 04/Apr/14 ] |
|
duplicate of |
| Comment by Patrick Valentin (Inactive) [ 04/Apr/14 ] |
|
Patrick, in your comment on 03/Apr/14 8:53 PM, you wrote:
We built today a Lustre 2.4.2 with
Thanks in advance |
| Comment by Patrick Farrell (Inactive) [ 04/Apr/14 ] |
|
Patrick - I ran Cray's 2.5 (which is very similar to Intel's 2.5.1) with
It worries me (obviously) that CEA continued to see problems in 2.4.2. Can you share which specific assertions/GPFs they continued to hit? And if possible, what codes were causing the issue?
Thanks
|
| Comment by Jinshan Xiong (Inactive) [ 04/Apr/14 ] |
|
Hi Patrick Valentin,
Please apply patch http://review.whamcloud.com/9881 to your branch and don't revert anything.
I was wondering why you were talking to yourself, and it took me a while to figure out that you guys have the same first name :-D
Jinshan |
| Comment by Patrick Valentin (Inactive) [ 09/Apr/14 ] |
|
Hi Jinshan and Patrick, Patrick |