[LU-6898] ldlm_resource_dump()) Granted locks (in reverse order) Created: 24/Jul/15 Updated: 29/Apr/16 Resolved: 10/Mar/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Mahmoud Hanafi | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS 6, Lustre 2.5.3 servers, MOFED 2.4 |
||
| Attachments: |
|
| Severity: | 2 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
On the client we see errors:

[1437673998.848774] LustreError: 11-0: nbp8-MDT0000-mdc-ffff8806cb247400: Communicating with 10.151.27.60@o2ib, operation obd_ping failed with -107.
[1437673998.860774] Lustre: nbp8-MDT0000-mdc-ffff8806cb247400: Connection to nbp8-MDT0000 (at 10.151.27.60@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[1437673998.880773] LustreError: 167-0: nbp8-MDT0000-mdc-ffff8806cb247400: This client was evicted by nbp8-MDT0000; in progress operations using this service will fail.
[1437673998.916773] LustreError: 81375:0:(ldlm_resource.c:809:ldlm_resource_complain()) nbp8-MDT0000-mdc-ffff8806cb247400: namespace resource [0x360375393:0xe66d:0x0].0 (ffff8fc07bee8a80) refcount nonzero (1) after lock cleanup; forcing cleanup.
[1437673998.940772] LustreError: 81375:0:(ldlm_resource.c:809:ldlm_resource_complain()) Skipped 2587 previous similar messages
[1437673998.952772] LustreError: 81375:0:(ldlm_resource.c:1448:ldlm_resource_dump()) --- Resource: [0x360375393:0xe66d:0x0].0 (ffff8fc07bee8a80) refcount = 2
[1437673998.952772] LustreError: 81375:0:(ldlm_resource.c:1451:ldlm_resource_dump()) Granted locks (in reverse order):
[1437673998.952772] LustreError: 81375:0:(ldlm_resource.c:1454:ldlm_resource_dump()) ### ### ns: nbp8-MDT0000-mdc-ffff8806cb247400 lock: ffff8fc0fa26dbc0/0x7f099458bb92bf52 lrc: 2/0,0 mode: PR/PR res: [0x360375393:0xe66d:0x0].0 bits 0x1b rrc: 2 type: IBT flags: 0x12e400000000 nid: local remote: 0x551d423294fa4bce expref: -99 pid: 46426 timeout: 0 lvb_type: 3
[1437673998.952772] LustreError: 81375:0:(ldlm_resource.c:1454:ldlm_resource_dump()) Skipped 3648 previous similar messages
[1437673998.952772] LustreError: 81375:0:(ldlm_resource.c:1448:ldlm_resource_dump()) --- Resource: [0x3603755cc:0x6454:0x0].0 (ffff8b075a9a8bc0) refcount = 2

On the server:

Jul 23 10:53:08 nbp8-mds1 kernel: LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 226s: evicting client at 10.151.63.50@o2ib ns: mdt-nbp8-MDT0000_UUID lock: ffff882a2c6794c0/0x551d4232f4dfbb5e lrc: 3/0,0 mode: PR/PR res: [0x4976d01:0xe2d4a4f3:0x0].0 bits 0x13 rrc: 848 type: IBT flags: 0x60200000000020 nid: 10.151.63.50@o2ib remote: 0x7f099458bb9c07c4 expref: 9 pid: 9672 timeout: 8029699438 lvb_type: 0
Jul 23 10:53:09 nbp8-mds1 kernel: LNet: 5828:0:(lib-move.c:865:lnet_post_send_locked()) Aborting message for 12345-10.151.12.174@o2ib: LNetM[DE]Unlink() already called on the MD/ME.
Jul 23 10:53:09 nbp8-mds1 kernel: LNet: 5828:0:(lib-move.c:865:lnet_post_send_locked()) Skipped 41 previous similar messages
Jul 23 10:53:39 nbp8-mds1 kernel: format at ldlm_pool.c:628:ldlm_pool_recalc doesn't end in newline
On the client all ldlm_bl threads are stuck:

0xffff8a0739b06080 21185 2 1 711 R 0xffff8a0739b066f0 ldlm_bl_110
[<ffffffff814760e8>] _raw_spin_unlock_irqrestore+0x8/0x10
[<ffffffffa0d7a807>] osc_page_delete+0xe7/0x360 [osc]
[<ffffffffa0ad14d5>] cl_page_delete0+0xc5/0x4e0 [obdclass]
[<ffffffffa0ad192a>] cl_page_delete+0x3a/0x120 [obdclass]
[<ffffffffa0ee16a6>] ll_invalidatepage+0x96/0x160 [lustre]
[<ffffffffa0ef314d>] vvp_page_discard+0x8d/0x120 [lustre]
[<ffffffffa0acda58>] cl_page_invoid+0x78/0x170 [obdclass]
[<ffffffffa0ad490c>] discard_cb+0xbc/0x1e0 [obdclass]
[<ffffffffa0ad2467>] cl_page_gang_lookup+0x1f7/0x3f0 [obdclass]
[<ffffffffa0ad471a>] cl_lock_discard_pages+0xfa/0x1d0 [obdclass]
[<ffffffffa0d7c0d2>] osc_lock_flush+0xf2/0x260 [osc]
[<ffffffffa0d7c339>] osc_lock_cancel+0xf9/0x1e0 [osc]
[<ffffffffa0ad2bd5>] cl_lock_cancel0+0x65/0x150 [obdclass]
[<ffffffffa0ad394b>] cl_lock_cancel+0x14b/0x150 [obdclass]
[<ffffffffa0d7cc1d>] osc_lock_blocking+0x5d/0xf0 [osc]
[<ffffffffa0d7dff9>] osc_dlm_blocking_ast0+0xf9/0x210 [osc]
[<ffffffffa0d7e15c>] osc_ldlm_blocking_ast+0x4c/0x100 [osc]
[<ffffffffa0be4eef>] ldlm_cancel_callback+0x5f/0x180 [ptlrpc]
[<ffffffffa0bf380f>] ldlm_cli_cancel_local+0x7f/0x480 [ptlrpc]
[<ffffffffa0bf6b82>] ldlm_cli_cancel_list_local+0xf2/0x290 [ptlrpc]
[<ffffffffa0bfba07>] ldlm_bl_thread_main+0xf7/0x450 [ptlrpc]
[<ffffffff81083ae6>] kthread+0x96/0xa0
[<ffffffff8147f164>] kernel_thread_helper+0x4/0x10

These events will cause the MDS IO to stop for a few minutes. |
| Comments |
| Comment by Peter Jones [ 24/Jul/15 ] |
|
Bobijam, could you please look into this one? Thanks, Peter |
| Comment by Zhenyu Xu [ 27/Jul/15 ] |
|
The log shows that the client hadn't finished a lock cancellation, while the MDT thought the client was dead and evicted it. |
| Comment by Mahmoud Hanafi [ 27/Jul/15 ] |
|
Why was this causing a pause on the MDS? |
| Comment by Zhenyu Xu [ 28/Jul/15 ] |
|
Do you have logs from the MDS? |
| Comment by Mahmoud Hanafi [ 28/Jul/15 ] |
|
There isn't much more in the MDS logs other than the "lock callback timer expired" message. I tried to get a Lustre debug dump but wasn't able to capture it quickly enough. |
| Comment by Zhenyu Xu [ 29/Jul/15 ] |
|
I think it could be that the client is so busy cancelling locks that it still misses the lock callback timeout. Can you try http://review.whamcloud.com/#/c/14342/ and http://review.whamcloud.com/#/c/12603 on the client node? |
| Comment by Jay Lan (Inactive) [ 29/Jul/15 ] |
|
I need a b2_5 port of #12603. I have a conflict in lustre/ldlm/ldlm_request.c. |
| Comment by Zhenyu Xu [ 30/Jul/15 ] |
|
Here is the b2_5 port of #12603: http://review.whamcloud.com/15800 |
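In case it helps, pulling a Gerrit change into a local tree usually looks something like the sketch below; the repository path and patch-set number are assumptions, so adjust to whatever the change page shows under "Download":

# fetch patch set 1 of change 15800 and apply it on the current branch
git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/00/15800/1
git cherry-pick FETCH_HEAD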
| Comment by Jay Lan (Inactive) [ 30/Jul/15 ] |
|
Thank you, Zhenyu. |
| Comment by Mahmoud Hanafi [ 03/Aug/15 ] |
|
We tried the patch and it didn't help. Running a 312-CPU IOR job and cancelling the IOR run would cause all the ldlm threads to lock up in _raw_spin_unlock_irqrestore. Uploading debug logs: debug.out.withpatch.ofed3.5.2.1438633632.bz2 |
| Comment by Jinshan Xiong (Inactive) [ 04/Aug/15 ] |
|
Those threads were not stuck but simply busy. Also, the eviction happened on the MDC. How many threads did you notice in this state? |
| Comment by Mahmoud Hanafi [ 04/Aug/15 ] |
|
I had a 312-CPU job, so there were at least that many threads in that state. The threads never finish, which causes the client to get evicted. |
| Comment by Mahmoud Hanafi [ 11/Aug/15 ] |
|
Please update this case. |
| Comment by Jinshan Xiong (Inactive) [ 11/Aug/15 ] |
|
I will work on this issue. |
| Comment by Jinshan Xiong (Inactive) [ 12/Aug/15 ] |
|
There are 128 ldlm blocking threads and all of them are busy discarding pages. There was probably a long queue, so it would take a long wait before this blocking AST from the MDT gets handled. Given that this node has 1024 cores, I will increase the number of ldlm threads and see how it goes (as sketched below). At present there is a kernel module parameter, ldlm_num_threads, but it has a hard limit of 128; nobody predicted that such a fat node would exist. If increasing the number of ldlm threads doesn't help, I will give the blocking callback of inodebits locks higher priority. |
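For reference, a minimal sketch of how the thread count might be raised at module load time, assuming the parameter is exposed by the ptlrpc module on this build; the value 512 is only an illustration and is still capped by the compiled-in hard limit:

# /etc/modprobe.d/lustre.conf -- request more ldlm blocking/callback threads
options ptlrpc ldlm_num_threads=512

# after reloading the modules, check how many threads actually started
ps -e | grep -c ldlm_bl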
| Comment by Mahmoud Hanafi [ 12/Aug/15 ] |
|
We don't see this issue in 2.4.3 and can reproduce it very easily in 2.5.3, so the behavior has changed. Are there any patches that may have caused this change? |
| Comment by Gerrit Updater [ 12/Aug/15 ] |
|
Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/15960 |
| Comment by Jinshan Xiong (Inactive) [ 12/Aug/15 ] |
|
I will check. Meanwhile, if you have the hardware resources, you can figure it out with 'git bisect'. |
| Comment by Tommi Tervo [ 12/Aug/15 ] |
|
We see this same issue on compute nodes with a lot of memory (1.5TB, 32 CPUs) (2.6.32-504.12.2.el6.x86_64, MOFED 2.3 and Lustre 2.5.3.90). |
| Comment by Mahmoud Hanafi [ 14/Aug/15 ] |
|
Patch http://review.whamcloud.com/15960 helped some. The client will eventually work through all the waiting ldlm_bl threads, but it takes time, whereas in 2.4.3 this was never the case. So it is still not fit for production on a large-CPU node.

bt during the wait:

[0]kdb> btp 31094
Stack traceback for pid 31094
0xffff8ac79e562540 31094 2 1 753 R 0xffff8ac79e562bb0 ldlm_bl_169
[<ffffffff814760e8>] _raw_spin_unlock_irqrestore+0x8/0x10
[<ffffffffa0d08817>] osc_page_delete+0xe7/0x360 [osc]
[<ffffffffa0c2f615>] cl_page_delete0+0xc5/0x4e0 [obdclass]
[<ffffffffa0c2fa6a>] cl_page_delete+0x3a/0x120 [obdclass]
[<ffffffffa1056776>] ll_invalidatepage+0x96/0x160 [lustre]
[<ffffffffa106821d>] vvp_page_discard+0x8d/0x120 [lustre]
[<ffffffffa0c2ba68>] cl_page_invoid+0x78/0x170 [obdclass]
[<ffffffffa0c32d0c>] discard_cb+0xbc/0x1e0 [obdclass]
[<ffffffffa0c305a7>] cl_page_gang_lookup+0x1f7/0x3f0 [obdclass]
[<ffffffffa0c32b1a>] cl_lock_discard_pages+0xfa/0x1d0 [obdclass]
[<ffffffffa0d0a0f2>] osc_lock_flush+0xf2/0x260 [osc]
[<ffffffffa0d0a359>] osc_lock_cancel+0xf9/0x1e0 [osc]
[<ffffffffa0c30fd5>] cl_lock_cancel0+0x65/0x150 [obdclass]
[<ffffffffa0c31d4b>] cl_lock_cancel+0x14b/0x150 [obdclass]
[<ffffffffa0d0ac3d>] osc_lock_blocking+0x5d/0xf0 [osc]
[<ffffffffa0d0c019>] osc_dlm_blocking_ast0+0xf9/0x210 [osc]
[<ffffffffa0d0c17c>] osc_ldlm_blocking_ast+0x4c/0x100 [osc]
[<ffffffffa0ec7670>] ldlm_handle_bl_callback+0xc0/0x420 [ptlrpc]
[<ffffffffa0ec7bd1>] ldlm_bl_thread_main+0x201/0x450 [ptlrpc]
[<ffffffff81083ae6>] kthread+0x96/0xa0
[0]more> g
Only 'q' or 'Q' are processed at more prompt, input ignored
[<ffffffff8147f164>] kernel_thread_helper+0x4/0x10
r15 = 0xffff8fc79e187db0  r14 = 0xffff88a092987558  r13 = 0xffff8fc79e1875e0  r12 = 0xffffffff8147f31e
bp = 0xffff88a092987558  bx = 0x0000000100033e68  r11 = 0xffffffffa1060bb0  r10 = 0x00000000000001e1
r9 = 0xffff88a736d76200  r8 = 0xffffffffa0d36e40  ax = 0xffffffffa0d36e58  cx = 0x0000000000000000
dx = 0xffffffffa0d36e58  si = 0x0000000000000282  di = 0x0000000000000282  orig_ax = 0xffffffffffffff01
ip = 0xffffffff814760e8  cs = 0x0000000000000010

Number of lock threads:

[mhanafi@endeavour2:~]$ ps -ef |grep ldlm_bl | wc -l
1295
[mhanafi@endeavour2:~]$ ps -ef |grep ldlm_cb | wc -l
438 |
| Comment by Jinshan Xiong (Inactive) [ 14/Aug/15 ] |
|
Hi Mahmoud, do you have a rough idea what type of locks (read or write) are being cancelled? For read locks, clients have to do more work to check whether the pages being destroyed are covered by another lock. |
| Comment by Jinshan Xiong (Inactive) [ 17/Aug/15 ] |
|
Hmm, this is a write lock. What's the spinlock at 'osc_page_delete+0xe7/0x360'? It looks like this lock is highly contended. |
| Comment by Jay Lan (Inactive) [ 19/Aug/15 ] |
|
Which commit or tag of 2.5.3 (i.e. b2_5) would be a good starting point to establish 'git bisect good' for this problem, as suggested in |
| Comment by Jinshan Xiong (Inactive) [ 19/Aug/15 ] |
|
What was the exact version of 2.4 that was running okay on the node? |
| Comment by Jay Lan (Inactive) [ 19/Aug/15 ] |
|
Our version of 2.4 is based on 2.4.3 with some extra patches. You can see our git repo at
The system in question is running a build at tag 2.4.3-12nasC, which corresponds to this: |
| Comment by Jinshan Xiong (Inactive) [ 20/Aug/15 ] |
|
I didn't find that this area of code changed much from tag v2_4_3 to v2_5_3 (I checked the extra patches applied to your branch as well). From the stack trace it looks like all cancelling threads are contending on the LRU list lock in osc_lru_del(), which you can verify by running the reproducer under 'perf top', for example. But since this area of code hasn't changed much, there is probably a hidden issue, so I would like to hold off on creating a patch for now. There are roughly 800 commits between v2_4_3 and v2_5_3, so we can probably identify the problematic patch(es) in fewer than 10 bisect iterations (see the sketch below). |
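A minimal sketch of the suggested bisect run, using the v2_4_3 and v2_5_3 tags mentioned above; the reproducer step is whatever IOR job triggers the hang:

# ~800 commits between the tags -> roughly 10 steps (log2 800 ~ 9.6)
git bisect start
git bisect bad v2_5_3
git bisect good v2_4_3

# at each step: build, install on the test node, run the reproducer,
# then report the result to git
git bisect good      # or: git bisect bad
# repeat until git prints the first bad commit, then
git bisect reset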
| Comment by Jay Lan (Inactive) [ 20/Aug/15 ] |
|
The b2_5 branch branched out at roughly
We tested a branch of
Our dedicated time on our big system ran out. We will continue tomorrow (6pm - 8pm) to narrow it down. If you can spot the culprit by examining the code, that would be even better! |
| Comment by Jay Lan (Inactive) [ 21/Aug/15 ] |
|
We have it! It is commit c8fd9c3 "
We tested the commit a35113b - "
It is a new kernel feature from Linux kernel 3.8 that you brought in. Somewhere in your code needs to adapt to this new feature. |
| Comment by Peter Jones [ 21/Aug/15 ] |
|
Nice detective work Jay! |
| Comment by Jinshan Xiong (Inactive) [ 22/Aug/15 ] |
|
This is a huge patch - I'm looking at it. |
| Comment by Jinshan Xiong (Inactive) [ 22/Aug/15 ] |
|
Is that because there are signals pending for ldlm lock canceling threads? |
| Comment by Gerrit Updater [ 24/Aug/15 ] |
|
Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/16063 |
| Comment by Jinshan Xiong (Inactive) [ 24/Aug/15 ] |
|
Patch 16063 is meant to verify the idea by blocking signals in the ldlm blocking threads. If it fixes the problem, I'm going to review the changes in commit c8fd9c3 and decide which threads need signals blocked. |
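While testing the patch, one quick live check (not from the ticket, just a sketch using standard procfs fields) is whether the running ldlm_bl threads actually have signals blocked:

# SigBlk in /proc/<pid>/status is the thread's blocked-signal mask;
# an all-f value means every blockable signal is masked
for pid in $(pgrep ldlm_bl); do
    printf '%s: ' "$pid"
    grep SigBlk /proc/$pid/status
done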
| Comment by Jay Lan (Inactive) [ 25/Aug/15 ] |
|
We do not have the system to test on until tomorrow. BTW, I looked at the task_struct of one of the ldlm threads and saw its sigpending struct as shared_pending = { , } The hex value of 18446618212836368752 is 0xffff8d87bb5ce170 |
| Comment by Jay Lan (Inactive) [ 26/Aug/15 ] |
|
Any suggestions for how to test whether our problem still shows up with patch 16063 tomorrow? |
| Comment by Jinshan Xiong (Inactive) [ 26/Aug/15 ] |
|
We should check the sigset task_struct::blocked of the thread in question; most likely this is the problem. I will take a further look if patch 16063 does not work. |
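For the postmortem side, a sketch of how that field could be inspected with the crash(8) utility against the vmcore; the PID and task address here are just the ones from the earlier ldlm_bl_110 trace, for illustration:

# show the blocked sigset of a specific ldlm_bl thread by PID
crash> task -R blocked 21185

# or, given the task_struct address from ps/bt output
crash> struct task_struct.blocked ffff8a0739b06080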
| Comment by Jay Lan (Inactive) [ 26/Aug/15 ] |
|
Jinshan, patch 16063 did not help. We have another block of time reserved for Friday 4-6pm. Please advise whether we should keep that reservation; a block of time on Friday afternoon would prevent jobs that need more than 2 days from starting. |
| Comment by Jinshan Xiong (Inactive) [ 26/Aug/15 ] |
|
Do you still own the test node? It would be helpful to get a coredump and upload it to our ftp site so that I can take a further look. |
| Comment by Jinshan Xiong (Inactive) [ 26/Aug/15 ] |
|
Yes, please keep the reservation. I found a problem with the patch. I am updating it now. |
| Comment by Jay Lan (Inactive) [ 26/Aug/15 ] |
|
Taking a vmcore of a system of this size takes time. I am waiting for your next patch; if that still fails, we will then take a kdump. Also, regarding "please keep the reservation": do you mean the block of time now or the block of time on Friday? |
| Comment by Jinshan Xiong (Inactive) [ 26/Aug/15 ] |
|
That was an answer to the question 'Please advise if we should keep that reservation.' |
| Comment by Jay Lan (Inactive) [ 26/Aug/15 ] |
|
Please take a look at the btall.gz we uploaded on July 27; it is a backtrace of all kernel threads from a machine with 1024 cores and 4TB. We still have the vmcore of that incident.

1) all kernel threads of "ldlm_bl_???" with stack traces containing:
#16 [ffff8fa79de1b9f8] osc_page_delete at ffffffffa0d7a807 [osc]

2) all kernel threads of "ldlm_cb*_**" with stack trace like this:
},

3) all kernel threads of "ptlrpc_hr*_**" with stack trace of:
#0 [ffff88679ecb9ce0] schedule at ffffffff81473d9b
#1 [ffff88679ecb9d88] cfs_cpt_bind at ffffffffa098a0bd [libcfs]
#2 [ffff88679ecb9e28] ptlrpc_hr_main at ffffffffa0c28ad7 [ptlrpc]
#3 [ffff88679ecb9ee8] kthread at ffffffff81083ae6
#4 [ffff88679ecb9f48] kernel_thread_helper at ffffffff8147f164
DOES have a blocked sigset:
blocked = {
sig = {18446744073709551615}
},

4) all kernel threads of "kiblnd_sd__" with stack trace of:
},

5) all kernel threads of "ptlrpcd_????" with stack trace of:
PID: 27092 TASK: ffff8f079ef16340 CPU: 1019 COMMAND: "ptlrpcd_1023"
#0 [ffff8f079ef19c30] schedule at ffffffff81473d9b
#1 [ffff8f079ef19d78] schedule_timeout at ffffffff81474550
#2 [ffff8f079ef19e08] ptlrpcd at ffffffffa0c3d7c5 [ptlrpc]
#3 [ffff8f079ef19ee8] kthread at ffffffff81083ae6
#4 [ffff8f079ef19f48] kernel_thread_helper at ffffffff8147f164
DOES have a blocked sigset:
blocked = {
sig = {18446744073709551615}
},

It seems that all Lustre kernel threads except category #1 ("ldlm_bl_???") have a blocked sigset; the contents of the sigset are the same in all of them. |
| Comment by Jinshan Xiong (Inactive) [ 27/Aug/15 ] |
|
Hi Jay, I realized that the new patch set of 16063 won't fix the problem either. I have no idea how commit c8fd9c3 could cause this problem. I will do further investigation and will update this ticket if I find something new. |
| Comment by Jay Lan (Inactive) [ 27/Aug/15 ] |
|
Hi Jinshan, if you reached that conclusion because of the comment I made a few hours ago, please be advised that the blocked-sigset data came from the vmcore dated July 27. I have not had a chance to test your new patch 16063 (patch set #2) yet. |
| Comment by Jinshan Xiong (Inactive) [ 27/Aug/15 ] |
|
In that case, please try the 2nd patch anyway as I'm investigating the problem. Please take a coredump if the problem still exists so that we can do postmortem analysis. |
| Comment by Jay Lan (Inactive) [ 29/Aug/15 ] |
|
We reproduced the problem with patch at
We took a vmcore. Either Mahmoud or Herbert will
The git repo is at |
| Comment by Jinshan Xiong (Inactive) [ 30/Aug/15 ] |
|
Hmm, I will take a look at the coredump after it's uploaded. Will you have another chance to run the test again to make sure the problem does not exist at commit a35113b? I can't figure out why commit c8fd9c3 could cause this problem; otherwise I will escalate this ticket. |
| Comment by Gerrit Updater [ 01/Sep/15 ] |
|
Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/16165 |
| Comment by Jinshan Xiong (Inactive) [ 01/Sep/15 ] |
|
I pushed patch 16165 to revert the changes to ldlm and ptlrpc in c8fd9c3. Please check whether you can still see the problem after applying this patch. |
| Comment by James A Simmons [ 01/Sep/15 ] |
|
Patch 16165 can't be the final fix since daemonize() is actually gone in newer kernels like RHEL7 and SLES12. Jay Lan, how are you producing this problem? I'd like to see if I can duplicate it. |
| Comment by Jinshan Xiong (Inactive) [ 01/Sep/15 ] |
|
Obviously it isn't - it's a debug patch to identify the problem as the patch title says. |
| Comment by Jay Lan (Inactive) [ 01/Sep/15 ] |
|
Tommi Tervo also reported the problem on 12/Aug/15 6:21 AM; his system has 1.5TB and 32 cores. We cannot reproduce the problem on small systems, but we can easily reproduce it on our 2TB, 512-core SGI UV2000 system. In Mahmoud's reproducer, the Lustre fs uses stripe-count=10. His script submits a PBS job that runs mpiexec with 312 copies of IOR. Use 'top' to monitor the run; when IOR takes over the 'top' page, Ctrl-C to kill the job. Then, if you see all the ldlm_bl_xxx threads showing up in 'top' taking ~100% CPU time, the problem is reproduced. Sometimes it takes a couple of runs to reproduce the problem. |
| Comment by Mahmoud Hanafi [ 01/Sep/15 ] |
|
The reproducer runs IOR with a single-striped file per process, not a stripe count of 10. |
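Pulling the two comments above together, a rough sketch of the reproducer; the directory, rank count layout, transfer and block sizes are placeholders, and IOR's -F flag gives one file per process:

# single-striped files, as described above
lfs setstripe -c 1 /nobackup/testdir

# 312-rank IOR, file per process; Ctrl-C it once it dominates 'top'
mpiexec -np 312 ior -F -o /nobackup/testdir/iorfile -t 1m -b 4g

# after the interrupt, watch for ldlm_bl_* threads pegged at ~100% CPU
top -b -n 1 | grep ldlm_bl | head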
| Comment by Jay Lan (Inactive) [ 02/Sep/15 ] |
|
Jinshan, 8 files of your patch caused conflicts. Could you create a branch at commit c8fd9c3 and create your patch from there? Commit c8fd9c3 is where the problem starts. |
| Comment by Jinshan Xiong (Inactive) [ 02/Sep/15 ] |
|
The patch is based on b2_5; please apply it on top of b2_5. Is it possible to enable lock_stat and collect perf(1) data so that we know whether there is contention in the code? |
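A sketch of what that collection might look like on the client; lock_stat requires a kernel built with CONFIG_LOCK_STAT, and the output paths and durations are placeholders:

# enable lock statistics, run the reproducer, then snapshot the results
echo 0 > /proc/lock_stat              # clear old counters
echo 1 > /proc/sys/kernel/lock_stat
# ... run the IOR reproducer and Ctrl-C it ...
echo 0 > /proc/sys/kernel/lock_stat
cat /proc/lock_stat > /tmp/lock_stat.out

# sample where the ldlm_bl threads are spinning
perf top -g
# or record system-wide for offline analysis
perf record -a -g -- sleep 30 && perf report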
| Comment by Jay Lan (Inactive) [ 03/Sep/15 ] |
|
Have you found anything from the crash dump, Jinshan and Oleg? |
| Comment by Oleg Drokin [ 06/Sep/15 ] |
|
We found that the patch from Jinshan managed to set the blocking callback just as it was supposed to, so there should now be no difference from before the patch you mention. We are still thinking about the reasons, but meanwhile Jinshan is very interested in having b2_5 plus the patch he just reverted run on your system to see how it helps. |
| Comment by Jay Lan (Inactive) [ 08/Sep/15 ] |
|
After applying the debug patch to nas-2.5.3, we cannot reproduce the problem. What feature requires commit c8fd9c3? Do we need this patch on sles11sp3? |
| Comment by Jinshan Xiong (Inactive) [ 08/Sep/15 ] |
|
As far as I know, there is no new feature that requires commit c8fd9c3; it just makes the Lustre code more in line with the kernel code. I will investigate this issue further. Please apply the debug patch in your release while we're investigating. |
| Comment by Jay Lan (Inactive) [ 08/Sep/15 ] |
|
The ldlm_bl_xxx callback threads were the ones that stalled the system. They all ran at ~100% CPU with commit c8fd9c3. |
| Comment by Lu [ 06/Jan/16 ] |
|
Hi Jinshan and Mahmoud,
By the way, our Lustre version is 2.5.3 without any modification/patch. |
| Comment by Lu [ 06/Jan/16 ] |
|
The back trace of ldlm_bl:
The client syslog:
The server log: |
| Comment by Jinshan Xiong (Inactive) [ 06/Jan/16 ] |
|
Hi Lu, have you noticed whether there was heavy contention on some locks when this issue occurred? It looks like the OSC in question was in recovery state, and therefore it picked some read locks to cancel before recovery. |
| Comment by Jeremy Filizetti [ 10/Feb/16 ] |
|
Has there been any progress on fixing this issue? |
| Comment by Mahmoud Hanafi [ 10/Mar/16 ] |
|
We haven't seen this issue again. Closing; we will reopen if needed. |