[LU-1328] Failing customer's file creation test Created: 16/Apr/12 Updated: 27/Sep/12 Resolved: 01/Jul/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Roger Spellman (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Lustre servers are running 2.6.32-220.el6, with Lustre 2.1.1.rc4. |
||
| Attachments: |
|
| Severity: | 2 |
| Rank (Obsolete): | 6412 |
| Description |
|
Customer is running a script that calls a Java program (see attachments). Two clients panicked.

2012-04-13 19:21:49 +0000 [26759.401564] -----------

On one OSS, about an hour before the client crash, I saw this:

Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: Lustre: Service thread pid 16292 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:

I don't know if it is related or not.

Attached files:
Reproduce.java: Java program used when reproducing the problem |
| Comments |
| Comment by Peter Jones [ 16/Apr/12 ] |
|
Lai, could you please look into this one? Thanks, Peter |
| Comment by Lai Siyao [ 16/Apr/12 ] |
|
Roger, the first backtrace is a warning; the panic information is lost (refer to usrs393.netconsole). Is it possible to set up kdump and get crash data, or to connect a serial console and get the panic info? I'll try to reproduce in my setup too. The second backtrace shows a slow thread on an OST, which should not be related to the panic. |
| Comment by Roger Spellman (Inactive) [ 17/Apr/12 ] |
|
/var/log/messages from system with the problem. |
| Comment by Roger Spellman (Inactive) [ 17/Apr/12 ] |
|
From the customer: Attached is the appropriate portion of /var/log/messages from that date - it had more info than the netconsole because apparently the problem was in the softirq and the network driver couldn't send any more. Unfortunately the serial connections very often don't have anything on them for whatever reason, and we haven't been able to get kdump to work in our environment. I currently have another client that is locked up on probably the same issue. |
| Comment by Lai Siyao [ 18/Apr/12 ] |
|
Yes, the log shows there is some network I/O problem, but it's still a warning. Is it possible to get the crash backtrace? |
| Comment by Peter Piela (Inactive) [ 26/Apr/12 ] |
|
The customer is unable to provide a crash backtrace. Are you able to reproduce the problem in-house with the provided test application? |
| Comment by Lai Siyao [ 26/Apr/12 ] |
|
Not yet, I'll keep it running. |
| Comment by Lai Siyao [ 02/May/12 ] |
|
This has been running for several days, but the problem has not reproduced in my test environment. I used `while true; do sh reproduce.sh || break; done`, and I saw a few mkdir failures (they look to have failed with -EEXIST, which may be okay); everything else looks fine. |
| Comment by Roger Spellman (Inactive) [ 07/May/12 ] |
|
Customer reports: I was able to verify I still see this problem this morning. Additionally, I think I have distilled it down to an even smaller test case with a single client. I have a directory (generated from the test case I sent you) that has 34853 files in it. If I execute the following:

$ lfs find <dir> -type f | sh -c 'while read line; do stat ${line} < /dev/null > /dev/null & done'

This will attempt to stat the files in parallel, and I see a bunch of errors that it can't find the files. However, if I stat the files sequentially, everything is ok:

$ lfs find <dir> -type f | sh -c 'while read line; do stat ${line} < /dev/null > /dev/null; done'

Lai, please see if you can reproduce using this method.

Regarding the EEXIST, the customer responded:

[pid 22808] open("/mnt/lustre/filetest/usrs398/20120507-092516/d5/7/75.bin", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 79
[pid 22808] fstat(79, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
[pid 22808] fcntl(79, F_GETFD) = 0
[pid 22808] fcntl(79, F_SETFD, FD_CLOEXEC) = 0
[pid 22808] write(79, "K\304\215f\361R\351\226\340Hh\32i\303\233\243t\20\301\34\313A\243\254\224\n\262\211^\202\215\20"..., 1024) = 1024
[pid 22808] open("/mnt/lustre/filetest/usrs398/20120507-092516/d5/7/76.bin", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 80
[pid 22808] fstat(80, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0

You can see that 2 opens succeeded with writes and the third failed. They were all writing to the same directory, so the directory existed; for some reason the third open returned ENOENT instead of succeeding. open(2) says:

ENOENT O_CREAT is not set and the named file does not exist. Or, a directory component in pathname does not exist or is a dangling symbolic link.

Since O_CREAT is given, and all directory components exist (since previous writes succeeded), there must be a bug in Lustre somewhere. What about the simple script where I tried to stat the files in parallel? Did you pass that on to Whamcloud, as (on the surface) it appears to be a similar problem? |
| Comment by Lai Siyao [ 09/May/12 ] |
|
I can't reproduce with the 'stat' script yet; maybe my test machine is not powerful enough, and there isn't a machine installed with fc15 in testlab. With a small tweak, your script can dump a debug log upon error:

lfs find <dir> -type f | sh -c 'while read line; do stat ${line} < /dev/null > /dev/null || lctl dk /tmp/`basename $line`.log & done'

Could you test this script in your environment? |
| Comment by Lai Siyao [ 14/May/12 ] |
|
I've set up 3 nodes in testlab: 1 server running el6 and 2 clients with kernel 2.6.38, with both clients running stat and reproduce.sh in a loop. I have run the test for half a day but haven't reproduced the problem yet. I'll keep them running, and will add another client tomorrow if it still hasn't reproduced. |
| Comment by Lai Siyao [ 14/May/12 ] |
|
I couldn't reproduce it overnight; there are now 300k files total in the test system. Could you tell me how many files were in your system when the error appeared? Could you upload the messages files of both the client and the MDS from the time of failure? |
| Comment by Peter Piela (Inactive) [ 15/May/12 ] |
|
Results from the requested script change:

lfs find <dir> -type f | sh -c 'while read line; do stat ${line} < /dev/null > /dev/null || lctl dk /tmp/`basename $line`.log & done' |
| Comment by Lai Siyao [ 16/May/12 ] |
|
This looks to be an MDS issue with the OPEN lock. I have a patch at http://review.whamcloud.com/#change,2800; could you compile it against master, update the MDS, and test again? |
| Comment by Roger Spellman (Inactive) [ 17/May/12 ] |
|
Customer asks: |
| Comment by Lai Siyao [ 17/May/12 ] |
|
I'm testing with 2.6.38.6-26. |
| Comment by Roger Spellman (Inactive) [ 17/May/12 ] |
|
Please verify that the patch only touches one file. |
| Comment by Roger Spellman (Inactive) [ 17/May/12 ] |
|
Please verify that this only affects the Lustre server, not the client. |
| Comment by Lai Siyao [ 17/May/12 ] |
|
Yes, the patch affects Lustre MDS server only, and only touches one file. |
| Comment by Roger Spellman (Inactive) [ 21/May/12 ] |
|
We applied the patch. The customer reports the following:

$ lfs find . -type f | sh -c 'while read file; do stat $file & done >/dev/null'

Here are some additional details: This is a single client with 24 cores running the test. "ls -l" on each file succeeds. If I perform the stat sequentially it succeeds; it only fails when I launch the stat processes in the background, thus making them parallel. Each time I run the test, I get different files that can't be found. |
| Comment by Lai Siyao [ 21/May/12 ] |
|
Is there any error message on the MDS server? |
| Comment by Lai Siyao [ 22/May/12 ] |
|
I have no way to debug without debug logs and messages; it would really be useful to dump a debug log upon error on both the client and the MDS, and it's best to enable the 'info' and 'vfstrace' debug traces. You can check this with `lctl get_param debug`; if the output doesn't contain 'info' or 'vfstrace', you can enable them with `lctl set_param debug=+"info vfstrace"`. |
| Comment by Roger Spellman (Inactive) [ 22/May/12 ] |
|
This is the messages file from the MDS. |
| Comment by Roger Spellman (Inactive) [ 22/May/12 ] |
|
> I have no way to debug without debug logs and messages; it would really be useful to dump a debug log upon error on both the client and the MDS, and it's best to enable the 'info' and 'vfstrace' debug traces. You can check this with `lctl get_param debug`; if the output doesn't contain 'info' or 'vfstrace', you can enable them with `lctl set_param debug=+"info vfstrace"`.

I will ask the customer to enable these. Should these be enabled on both the client and the servers? |
| Comment by Roger Spellman (Inactive) [ 22/May/12 ] |
|
What command is run to view the debug data? |
| Comment by Lai Siyao [ 22/May/12 ] |
|
Yes, it's better to enable the debug trace on both the client and the server. Upon error, you can run `lctl dk /tmp/create.log` and then check the debug trace in that log file. The log contains only recent debug traces, which is why this command should be run immediately after the error occurs; otherwise the older debug trace is dropped.
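For example, a minimal capture sequence, assuming the /tmp/create.log path and the debug flags mentioned above (the exact workflow used at the customer site is not recorded here):

# check which debug flags are currently enabled
lctl get_param debug
# add the 'info' and 'vfstrace' flags if they are missing
lctl set_param debug=+"info vfstrace"
# run the reproducer until an error is seen, then immediately dump the recent trace
lctl dk /tmp/create.log |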
| Comment by Roger Spellman (Inactive) [ 22/May/12 ] |
|
This file contains client.out and mds.out with the debug info you requested. |
| Comment by Lai Siyao [ 22/May/12 ] |
|
When was the test conducted? I saw two suspicious error messages in messages.mds, but they are from May 21 23:00:01. Could you upload the script output along with this test? You can also use `lfs fid2path [0x2003acec9:0x1:0x0]` and `lfs fid2path [0x2003acec9:0x2:0x0]` to check which two files failed to open.
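Note that `lfs fid2path` usually expects the filesystem name or mount point as its first argument; a hedged example, assuming the client mount point /mnt/lustre seen in the strace output above:

# resolve the two FIDs from messages.mds to pathnames (the mount point is an assumption)
lfs fid2path /mnt/lustre [0x2003acec9:0x1:0x0]
lfs fid2path /mnt/lustre [0x2003acec9:0x2:0x0] |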
| Comment by Roger Spellman (Inactive) [ 23/May/12 ] |
|
I'm moving back to this thread to continue the file issue since it had gotten merged in with the OST failure... I ran the test yesterday, so the lines from May 21 don't pertain to the test. The two fids Lai asks about ([0x2003acec9:0x1:0x0] & [0x2003acec9:0x2:0x0]) don't exist in the file system, but I assume they were located under the directory indicated by [0x200000400:0xa:0x0], which is owned by one of our users who is more tolerant of random failures, so he hasn't been telling me about any issues he has had, since he knows it is "use at your own risk" at this time.

I have uploaded a new tarball where I ran the test again. It contains the following files:
client.out - debug_kernel output from the client

The tarball is enoent-20120523.tar.gz. |
| Comment by Lai Siyao [ 24/May/12 ] |
|
The logs show that the -ENOENT files never go through .lookup or .revalidate on either the client or the server side, so the -ENOENT error should happen during directory name lookup/revalidate. I'm a bit suspicious of CWD handling in Lustre: the test script calls 'stat', which ultimately calls the vfs_lstat() syscall, and that starts the pathname lookup from the CWD (Current Working Directory) if the filename doesn't start with '/'. One way to prove this is to run the test with absolute pathnames (see the sketch at the end of this comment); then every `stat` will look up the pathname starting from ROOT. BTW, please enable more debug trace in your test: `lctl set_param debug=+"vfstrace info dlmtrace dentry inode"`.
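A minimal sketch of that absolute-path variant, assuming the same reproducer from the earlier comments (the exact command from the original comment is not preserved in this export):

# Point lfs find at an absolute path so every stat receives an absolute pathname,
# making the kernel resolve names from '/' rather than the current working directory.
cd /path/to/testdir        # hypothetical location of the 34853-file directory
lfs find "$PWD" -type f | sh -c 'while read line; do stat ${line} < /dev/null > /dev/null & done' |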
| Comment by Roger Spellman (Inactive) [ 25/May/12 ] |
|
Customer reports: It didn't make a difference - I didn't think it would, since this seems to be an issue with parallel access: if I run the stats sequentially it works. I have uploaded enoent-20120524.tar.gz; "client.out" and "mds.out" contain the debug output. |
| Comment by Roger Spellman (Inactive) [ 25/May/12 ] |
|
Customer reports: If I limit the number of CPUs it can run on, it reduces the number of failures, to the point where running it on 1 CPU makes it effectively sequential:

$ lfs find . -type f | taskset -c 0 sh -c 'while read file; do stat $file & done > /dev/null'

This command eventually can't fork any more processes (I reach that limit), but there are no ENOENT errors. I see the same scenario with "taskset -c 0,1", and I see my first failure with "taskset -c 0-2", but it takes some time. However, using "taskset -c 0,2", after some time I do see the ENOENT error. CPU 0 and CPU 1 are in different sockets, and CPU 0 and CPU 2 are in the same socket. I do not see the error with 0 & 12, which are a hyperthreaded pair. I can run more tests next week if you want me to. |
| Comment by Lai Siyao [ 27/May/12 ] |
|
Hmm, it looks to be a bug in ll_splice_alias(). I'll update the FC15 patchless client patch later to reflect this change. |
| Comment by Lai Siyao [ 27/May/12 ] |
|
The patch is updated: http://review.whamcloud.com/#change,2170. Please apply it against the latest master branch; this change is for the client only. |
| Comment by Lai Siyao [ 30/May/12 ] |
|
Due to a recent commit, please use this patch instead: http://review.whamcloud.com/#change,1865 |
| Comment by Roger Spellman (Inactive) [ 30/May/12 ] |
|
Is this change just for the clients, or the servers, or both? |
| Comment by Lai Siyao [ 30/May/12 ] |
|
It's for the client only. Change 2170 has been abandoned now; 1865 is the only remaining patch for the FC15 patchless client support code. |
| Comment by Roger Spellman (Inactive) [ 30/May/12 ] |
|
The modules will not load correctly. I get the following errors:

Lustre: Lustre: Build Version: 2.2.54-g2fcae7b-CHANGED-2.6.32-220.el6.x86_64

Grepping through the code, I find:

lustre-release-with-patch-9/lustre/include/lu_target.h:void lut_boot_epoch_update(struct lu_target *lut);

So, that function is called from ldlm_lib.c. BUT ... in the ptlrpc/Makefile, target.o is not included, i.e.

#ptlrpc_objs += target.o |
| Comment by Lai Siyao [ 30/May/12 ] |
|
You need to update to the latest master branch, then apply the latest patch set of change 1865 (it's patch set 21 now).
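A rough sketch of one way to do this, assuming Gerrit's standard refs/changes/<NN>/<change>/<patchset> fetch convention; the repository URL path and the exact commands are assumptions, not a record of what was run:

cd lustre-release
git checkout master && git pull                  # update to the latest master branch
# fetch patch set 21 of change 1865 from Gerrit (remote URL path is an assumption)
git fetch http://review.whamcloud.com/lustre-release refs/changes/65/1865/21
git cherry-pick FETCH_HEAD                       # apply the change on top of master |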
| Comment by Roger Spellman (Inactive) [ 31/May/12 ] |
|
The problem still persists, i.e.

May 31 12:05:21 compute-01-32 kernel: Lustre: Lustre: Build Version: 2.2.54-g9567e22-CHANGED-2.6.32-220.el6.x86_64

I think I know the cause of the problem. In the file lustre/ptlrpc/Makefile.in is the line

@SERVER_TRUE@ptlrpc_objs += target.o

After running:

./configure --with-kernel=/usr/src/kernels/2.6.32-220.el6.x86_64 \

the file lustre/ptlrpc/Makefile contains

#ptlrpc_objs += target.o

The file lustre/ldlm/ldlm_lib.c calls lut_boot_epoch_update(), which is defined in target.c. But since ldlm_lib.c is compiled and target.c is not, there is an unknown symbol. |
| Comment by Roger Spellman (Inactive) [ 31/May/12 ] |
|
The commands I ran to get the build are:

git clone git://git.whamcloud.com/fs/lustre-release.git   # svn checkout url |
| Comment by Lai Siyao [ 01/Jun/12 ] |
|
I did the same as you, but everything looks fine. Also, the place where lut_boot_epoch_update() is called in ldlm_lib.c is inside #ifdef HAVE_SERVER_SUPPORT, so that code won't be compiled and won't cause the unknown symbol problem. |
| Comment by Roger Spellman (Inactive) [ 01/Jun/12 ] |
|
My error. I copied the wrong modules to the target system. The modules load fine now. |
| Comment by Peter Jones [ 04/Jun/12 ] |
|
As I understand it, this code is being tracked for landing under |
| Comment by Roger Spellman (Inactive) [ 06/Jun/12 ] |
|
Customer reports: See attachment for the netconsole output. The following is the output from one:

2012-06-06 11:35:51 +0000 [4026925.936063] Lustre: MGC192.168.185.35@tcp: Reactivating import |
| Comment by Lai Siyao [ 07/Jun/12 ] |
|
Jinshan, could you help check why this ASSERT is not true on the 2.6.38 kernel? |
| Comment by Lai Siyao [ 07/Jun/12 ] |
|
As Jinshan pointed out, it looks to be the same issue of |
| Comment by Roger Spellman (Inactive) [ 15/Jun/12 ] |
|
Customer reports: We have been running with the latest patch for several days with no problems other than the old slowpath warning:

[609631.154485] -----------

Question: Is the best thing to do for this to lower the number of OST threads, as documented in the Lustre manual? |
| Comment by Lai Siyao [ 18/Jun/12 ] |
|
This is just a kernel warning, which is a bit excessive, and it has nothing to do with real I/O performance. Lustre uses a kernel-exported function, simple_setattr(), which was originally intended for simple filesystems (that don't implement truncate), so the kernel gives a warning, but there is no real problem. |