[LU-1153] Client Unresponsive Created: 29/Feb/12 Updated: 15/Jun/12 Resolved: 15/Jun/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.2.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Roger Spellman (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | paj | ||
| Environment: |
Lustre servers are running 2.6.32-220.el6, with Lustre 2.1.1.rc4. |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 6441 |
| Description |
|
It is possible that I am seeing multiple bugs, so you may want to split this one bug into several bugs. Let me emphasize that this problem occurs only on one system, and that is the system built with http://review.whamcloud.com/#change,2170, which is code specifically for the 2.6.38.2 kernel. This bug is preventing us from shipping the code to this customer.

PROBLEM 1

Then, I noticed that I may have hit a minor bug. After the iozone tests, I removed all of the files. Then, I did an lfs df -h, and I saw that some OSTs still had 20G used. After I unmounted all of the clients and then remounted them, the problem went away (that is, all the OSTs had the same amount of used space). Here is some session output:

[root@compute-01-32 lustre]# cd /mnt/lustre
[root@compute-01-32 lustre]# # Note that there are no files; I removed them a while ago.
filesystem summary: 1.5T 43.6G 1.4T 3% /mnt/lustre
[root@compute-01-32 lustre]# # Note that two OSTs have used 20.4G, even though no files !
[root@compute-01-32 lustre]# df

PROBLEM 2

I then unmounted Lustre on all clients and rebooted all the clients. I decided to try and reproduce this bug, so I started up IOZone again. Keep in mind that this is after a client reboot. IOZone ran for a few seconds, then hung. I could ping the node with 2.6.38.8 kernel, but I could not ssh to it. The video console was locked up. Pressing Caps Lock and NumLock on the keyboard did not light up any LEDs on the keyboard. So, I power cycled. This is all that I saw in /var/log/messages:

Feb 28 09:33:53 compute-01-32 avahi-daemon[1508]: Joining mDNS multicast group on interface eth2.IPv4 with address 10.7.1.32.

I did not have a serial cable plugged in to get the crash dump. I have set this up now, so if the problem occurs again, we should get more data. I will try to reproduce this bug.

PROBLEM 3

I set up the serial cable, and I verified that SysRq-T worked and sent its output over the serial cable to another node that was capturing the data.

After many hours of testing, the 2.6.38.2 client became unresponsive again. I see some Out of memory messages. I will keep an eye on the slab usage next time. Below is the output over ttyS0. After the problem occurred, I plugged in a monitor and keyboard, and SysRq-T did not work. Perhaps I need to have the keyboard plugged in prior to the problem occurring.

LustreError: 19819:0:(ldlm_request.c:1171:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway |
| Comments |
| Comment by Peter Jones [ 29/Feb/12 ] |
|
Lai, could you please look into this one? Thanks, Peter |
| Comment by Roger Spellman (Inactive) [ 02/Mar/12 ] |
|
I am able to reproduce the bug at will. The test fails every time. I believe that the cause is out-of-memory. I see the following in the logs quite frequently:
Mar 1 11:01:01 compute-01-32 kernel: cannot allocate a tage (7)
I have also seen:
Mar 1 10:52:35 compute-01-32 kernel: ----------- |
| Comment by Roger Spellman (Inactive) [ 02/Mar/12 ] |
|
Lai, Can you please give an update to this bug? Thanks. |
| Comment by Peter Jones [ 03/Mar/12 ] |
|
Roger, I did chat with Lai about this ticket. There is a lot of information to sift through, but he expects to post something in the near future. He is based in China, so this may well occur before you are in the office on Monday. Regards, Peter |
| Comment by Lai Siyao [ 05/Mar/12 ] |
|
Roger, the warning message for simple_setattr() is not an issue. The kernel exports this function but is overly cautious about it: if a somewhat complicated filesystem (one that has its own ->truncate implementation) uses this function, the kernel complains. Lustre only uses it to update times, so it's absolutely okay. As for messages like "kernel: cannot allocate a tage", this is normal too. By default, Lustre debug is enabled, and debug output is written to allocated kernel pages. When system memory is tight and it can't allocate pages, it prints such warnings and skips writing the debug logs. I'm trying to reproduce this failure, and after that I'll update you. If you can run a script to output /proc/slabinfo periodically (e.g. `watch sort -k2rn /proc/slabinfo`), and OOM happens again, could you collect the output from the serial console and post it here? |
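For anyone who wants to capture the same data from a script rather than from `watch`, here is a minimal Python sketch of the idea. It assumes the slabinfo version 2.x text layout (a version banner, a `# name <active_objs> ...` header, then one cache per line) and ranks caches by active objects, which is what `sort -k2rn` does. The function names are made up for illustration and are not part of Lustre or any tool mentioned in this ticket.

```python
def parse_slabinfo(text):
    """Parse /proc/slabinfo text into (name, active_objs, num_objs, objsize) tuples."""
    rows = []
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank lines, the "slabinfo - version" banner, and the "# name ..." header.
        if not stripped or stripped.startswith("slabinfo") or stripped.startswith("#"):
            continue
        fields = stripped.split()
        if len(fields) < 4:
            continue
        name = fields[0]
        active_objs, num_objs, objsize = (int(f) for f in fields[1:4])
        rows.append((name, active_objs, num_objs, objsize))
    return rows


def top_slabs(text, count=10):
    """Return the `count` caches with the most active objects, largest first."""
    rows = parse_slabinfo(text)
    return sorted(rows, key=lambda row: row[1], reverse=True)[:count]
```

To use it on a live system, read `/proc/slabinfo` (this usually requires root), pass the text to `top_slabs()`, and print the result at whatever interval suits the test, so that the last snapshot before an OOM ends up on the serial console or in a log file.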
| Comment by Lai Siyao [ 05/Mar/12 ] |
|
Roger, I can't reproduce it here. Could you upload the test script or program? |
| Comment by Roger Spellman (Inactive) [ 05/Mar/12 ] |
|
While running my test, I also ran the script get-slab-loop. This script gets slab info, vmstat, and other stuff, and puts it into a new file every 30 seconds. The script and the files are in the tarball. I believe that my test started running before mem.4 was created. I will upload the test shortly. |
| Comment by Roger Spellman (Inactive) [ 05/Mar/12 ] |
|
This tarball contains the test that causes this bug. The test uses IOZone. The binary is in this directory. It should be moved to a place that all clients can access. Then, the client_list should be modified with the location of that binary. The format of the client_list is:
hostname mountPoint binaryLocation fileToCreate
The client list currently has 22 entries. You can modify this for the number of clients that you have.

Before running any tests, I run:
./setup_directories /mnt/lustre
That creates the following directories on my system with 8 OSTs:
/mnt/lustre/
The last 8 are striped to each OST. The client_list writes files to these directories. Feel free to change the client list to match the number of OSTs that you have.

The script that is run is called 'run-test'. This script includes the line:
for threads in 22
'22' is the number of entries in the client_list. If you change the number of entries in client_list, then you should change the number 22 to match that.

Please let me know if you need any help setting up the tests. It usually hangs within an hour or two. |
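Since the thread count in 'run-test' has to match the number of client_list entries, a small check before starting a run can catch a mismatch. The following is only a sketch in Python, assuming the four whitespace-separated fields described above and nothing else about the file. The names `ClientEntry`, `parse_client_list`, and `check_thread_count` are hypothetical and are not part of the tarball.

```python
from collections import namedtuple

# One client_list line: hostname mountPoint binaryLocation fileToCreate
ClientEntry = namedtuple(
    "ClientEntry", "hostname mount_point binary_location file_to_create"
)


def parse_client_list(text):
    """Parse client_list text into ClientEntry tuples, skipping blank lines."""
    entries = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 4:
            raise ValueError(
                f"line {lineno}: expected 4 fields, got {len(fields)}"
            )
        entries.append(ClientEntry(*fields))
    return entries


def check_thread_count(entries, threads):
    """Return True when the thread count matches the number of entries."""
    return len(entries) == threads
```

With the original 22-entry client_list, `check_thread_count(parse_client_list(text), 22)` should return True. After trimming the list for a smaller cluster, the same call shows whether the number in 'run-test' still needs changing.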
| Comment by Lai Siyao [ 06/Mar/12 ] |
|
Yes, I can reproduce it your way, thanks! I'll update you when I have some clue. |
| Comment by Lai Siyao [ 06/Mar/12 ] |
|
Hi Roger, I'm wondering whether this issue was introduced by the FC15 support code. Do you have any clients with supported kernels (e.g. RHEL5/6) to verify that this doesn't happen on them? |
| Comment by Roger Spellman (Inactive) [ 06/Mar/12 ] |
|
Lai, |
| Comment by Lai Siyao [ 11/Mar/12 ] |
|
I tested the same code on RHEL5/6 and FC15; OOM only happens on FC15. The log and memory stats don't show anything special, just too many cached pages, and the slab usage is a bit high. I talked to Jinshan, and he said that the iozone test consuming a lot of memory is a known issue (though he has never hit OOM before). It happens because CLIO depends on the kernel to release cached pages, but the kernel tends to cache more pages. I suspect that the FC15 kernel is more aggressive about this, and therefore it OOMs. Jinshan is working on the iozone memory use problem; I'll update here if he makes any progress. |
| Comment by Lai Siyao [ 12/Mar/12 ] |
|
I tried tuning the kernel vm dirty_ratio and dirty_background_ratio to 5, but it still OOMs. There may not be a simple workaround for this excessive memory usage issue. |
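For reference, the tuning described here amounts to writing the two values under /proc/sys/vm. Below is a minimal Python sketch of that step, assuming root privileges on the client. The `vm_dir` parameter and the function names are illustrative additions so the code can be tried against a scratch directory first; they do not come from the ticket.

```python
import os


def set_vm_dirty_ratios(dirty_ratio, background_ratio, vm_dir="/proc/sys/vm"):
    """Write vm.dirty_ratio and vm.dirty_background_ratio (percent values)."""
    settings = {
        "dirty_ratio": dirty_ratio,
        "dirty_background_ratio": background_ratio,
    }
    # Validate everything before touching any file.
    for name, value in settings.items():
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be between 0 and 100, got {value}")
    for name, value in settings.items():
        with open(os.path.join(vm_dir, name), "w") as handle:
            handle.write(f"{value}\n")
    return settings


def read_vm_dirty_ratios(vm_dir="/proc/sys/vm"):
    """Read both settings back as integers."""
    result = {}
    for name in ("dirty_ratio", "dirty_background_ratio"):
        with open(os.path.join(vm_dir, name)) as handle:
            result[name] = int(handle.read().strip())
    return result
```

Calling `set_vm_dirty_ratios(5, 5)` as root reproduces the setting tried above, and `read_vm_dirty_ratios()` confirms what the kernel accepted. As noted, this did not prevent the OOM, so it is useful mainly for repeating the experiment.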
| Comment by Roger Spellman (Inactive) [ 12/Mar/12 ] |
|
> I talked to Jinshan, he said that iozone test consuming a lot
This bug DOES NOT require IOZone. I am able to reproduce it with dd, with the following script:
echo |
| Comment by Peter Jones [ 13/Mar/12 ] |
|
Roger, to clarify: Jinshan is referring to the conditions that can be simulated by running a tool like IOZONE. He is not suggesting that IOZONE itself is the key factor here. Peter |
| Comment by Roger Spellman (Inactive) [ 13/Mar/12 ] |
|
Peter, Jinshan, I believe that the problem has to do with writing then over-writing the same file. IOZone does this by default, and the loop that I submitted does that too. When I just write a file once, I am not seeing this problem (in the time frame that I'm testing). You can do that in IOZone with the -+n option. Roger |
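The write-then-overwrite pattern can be illustrated with a short Python sketch. This is not the dd script or the IOZone test from this ticket; it is a hypothetical stand-in that writes a file once and then rewrites the same byte range in place. Whether the original tests truncate the file before rewriting is not stated, so opening without truncation is an assumption here, as are the function name and its parameters.

```python
import os


def write_then_overwrite(path, size_bytes, passes=2, block_size=1 << 20):
    """Write `size_bytes` to `path`, then rewrite the same range on later passes."""
    block = b"\0" * block_size
    for _ in range(passes):
        # No O_TRUNC: passes after the first overwrite existing data in place.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
        try:
            remaining = size_bytes
            while remaining > 0:
                chunk = block[: min(block_size, remaining)]
                # os.write may write fewer bytes than requested.
                remaining -= os.write(fd, chunk)
            os.fsync(fd)
        finally:
            os.close(fd)
    return os.path.getsize(path)
```

Running it with `passes=1` gives the write-once case that did not show the problem in the time frame tested, while larger `passes` values give the rewrite case, which makes the two behaviours easy to compare on the same client.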
| Comment by Jinshan Xiong (Inactive) [ 14/Mar/12 ] |
|
I'll look into this issue. The simple_setattr() warning means we shouldn't use simple_setattr() if we have a truncate method implemented. We used to call inode_setattr() on old kernels; we really need to fix that. |
| Comment by Lai Siyao [ 14/Mar/12 ] |
|
Jinshan, please check http://review.whamcloud.com/#change,1863 and http://review.whamcloud.com/#change,2145. |
| Comment by Jinshan Xiong (Inactive) [ 31/Mar/12 ] |
|
I'm trying to compile lustre-master with fc15 but it doesn't work. Can you please tell me which distribution you're using for your client? |
| Comment by Lai Siyao [ 31/Mar/12 ] |
|
Hmm, if you use the kernel-devel package to build the kernel module, lustre configure will fail on LB_LINUX_COMPILE_IFELSE in build/autoconf/lustre-build-linux.m4. I tried to tweak it a bit, but didn't succeed. In my setup, I build the lustre patchless client against the kernel source, and you need to update 'EXTRAVERSION' in the kernel source Makefile to match your kernel version (in my setup it is '.6-26.rc1.fc15.x86_64'). |
| Comment by Roger Spellman (Inactive) [ 02/Apr/12 ] |
|
I had no trouble building against 2.6.38.2. I don't see LB_LINUX_COMPILE_IFELSE in .config in this release.
[root@RS_vm-2_6_38_2 linux-2.6.38.2]# pwd
-Roger |
| Comment by Roger Spellman (Inactive) [ 02/Apr/12 ] |
|
I updated to the 5th patch on: This bug is still present there. |
| Comment by Peter Jones [ 09/Apr/12 ] |
|
Roger Peter said today that he thought that this issue was independent of using the 2.6.38 client. Are you able to reproduce this same behaviour when running vanilla 2.1.x and RHEL6 clients, say? Peter |
| Comment by Roger Spellman (Inactive) [ 10/Apr/12 ] |
|
I have a client with kernel: 2.6.32-220.el6.x86_64 Is that what you want me to try? What git tag should I use to get the 2.1.x code? |
| Comment by Peter Jones [ 15/Jun/12 ] |
|
As per Terascala, OK to close. |