[LU-727] application hang waiting on page lock Created: 30/Sep/11 Updated: 04/Oct/11 Resolved: 04/Oct/11 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Gregoire Pichon | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
lustre 2.0.0.1, kernel 2.6.32-71, lustre client |
||
| Severity: | 3 |
| Rank (Obsolete): | 6548 |
| Description |
|
The application hangs with the following stack: PID: 17906 TASK: ffff88063e3f94e0 CPU: 0 COMMAND: "Migration_Clien" The file being read is on a Lustre filesystem. The page structure, whose address was retrieved from the process stack, shows that the PG_locked and PG_lru flags are set: { ... { inuse = 0xffff, objects = 0xffff } ... } Unfortunately, the page lock is never released. Looking at the dump, I am not able to find the current owner of the PG_locked page lock. The application is the MIGAL benchmark. The same problem has also been reproduced at CEA using DDT, a debugger environment. The stack is similar: |
| Comments |
| Comment by Gregoire Pichon [ 30/Sep/11 ] |
|
I have uploaded the dump files on ftp.whamcloud.com:/uploads/ |
| Comment by Jinshan Xiong (Inactive) [ 02/Oct/11 ] |
|
I took a look at this bug. It seems the page is being read while it is being truncated, but I'm not sure why the truncate process can't finish. I tried to figure it out by analyzing the crash file, but unfortunately that did not work: [root@client-7 tmp]# crash -s vmlinux vmcore and the checksum seems to be wrong: Can you please upload them once again? Thanks. |
| Comment by Gregoire Pichon [ 03/Oct/11 ] |
|
I have uploaded the files again (in binary mode this time!). |
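A transfer in ASCII mode can rewrite line-ending bytes and corrupt a binary dump, which would be consistent with the checksum mismatch mentioned earlier in the thread. A minimal sketch of checking a file before upload and after download follows. The helper name `file_sha256` is hypothetical and SHA-256 is an assumption, since the thread does not say which checksum tool was used.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in binary mode."""
    digest = hashlib.sha256()
    # Open in binary mode so no newline translation alters the bytes.
    with open(path, "rb") as f:
        # Read in fixed-size chunks so large dump files fit in memory.
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

Comparing the digest computed on the sending side with the one computed on the receiving side shows whether the file arrived intact.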
| Comment by Gregoire Pichon [ 03/Oct/11 ] |
|
What is the code path of the truncate that possibly does not complete? |
| Comment by Gregoire Pichon [ 03/Oct/11 ] |
|
I have uploaded the Lustre trace logs of the system where the application hangs. |
| Comment by Gregoire Pichon [ 03/Oct/11 ] |
|
Finally, I have also uploaded the dump file (named vmcore5) in sync with the traces.txt file.
|
| Comment by Jinshan Xiong (Inactive) [ 03/Oct/11 ] |
|
Can you please check whether your code includes the patch from LU-148? commit 59c1a8e7cd69c31bce09695681e2c9f889fed567 Unlock vmpage in case ll_cl_init fails. It looks like this program uses fadvise(2) to drop the cache and WILLNEED to read ahead. Unfortunately, this isn't well supported by Lustre. |
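For illustration, the access pattern described in this comment might look roughly like the sketch below. It is not taken from the MIGAL benchmark or from the Lustre sources. The function name `drop_then_prefetch_read` and the chunk size are hypothetical, and the use of `POSIX_FADV_DONTNEED` for the cache drop is an assumption, since the comment only names WILLNEED.

```python
import os

CHUNK = 1 << 20  # 1 MiB read size, an arbitrary choice for this sketch


def drop_then_prefetch_read(path):
    """Drop cached pages for a file, hint readahead, then read it back."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        # Assumed cache-drop step: ask the kernel to evict cached pages.
        # On Linux, a length of 0 means "through the end of the file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        # Hint that the whole file will be needed soon (readahead).
        os.posix_fadvise(fd, 0, size, os.POSIX_FADV_WILLNEED)
        chunks = []
        while True:
            buf = os.read(fd, CHUNK)
            if not buf:  # empty read means end of file
                break
            chunks.append(buf)
        return b"".join(chunks)
    finally:
        os.close(fd)
```

On a client without the LU-148 fix, a sequence like this is the kind of workload the comment suggests could leave a page locked; on a local filesystem it simply returns the file contents.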
| Comment by Jinshan Xiong (Inactive) [ 03/Oct/11 ] |
|
BTW, I really appreciate you guys collecting the logs and crash dumps in such a professional way; it helps debugging a lot. |
| Comment by Gregoire Pichon [ 04/Oct/11 ] |
|
Thanks a lot, the problem has been fixed with the patch for |
| Comment by Jinshan Xiong (Inactive) [ 04/Oct/11 ] |
|
lu-148 |