[LU-6898] ldlm_resource_dump()) Granted locks (in reverse order) Created: 24/Jul/15  Updated: 29/Apr/16  Resolved: 10/Mar/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Mahmoud Hanafi Assignee: Jinshan Xiong (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

CENTOS6 Lustre2.5.3 Server MOFED2.4
SLES11 Lustre2.5.3 Client MOFED3.0


Attachments: File btall.gz     Text File debug.out.mofed.withpatch.1438631502.bz2     Text File debug.out.withpatch.mofed.secondrun.1438631826.bz2     Text File debug.out.withpatch.ofed3.5.2.1438633632.bz2     File dmesg.out.gz    
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

On the client we see errors

[1437673998.848774] LustreError: 11-0: nbp8-MDT0000-mdc-ffff8806cb247400: Communicating with 10.151.27.60@o2ib, operation obd_ping failed with -107.
[1437673998.860774] Lustre: nbp8-MDT0000-mdc-ffff8806cb247400: Connection to nbp8-MDT0000 (at 10.151.27.60@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[1437673998.880773] LustreError: 167-0: nbp8-MDT0000-mdc-ffff8806cb247400: This client was evicted by nbp8-MDT0000; in progress operations using this service will fail.
[1437673998.916773] LustreError: 81375:0:(ldlm_resource.c:809:ldlm_resource_complain()) nbp8-MDT0000-mdc-ffff8806cb247400: namespace resource [0x360375393:0xe66d:0x0].0 (ffff8fc07bee8a80) refcount nonzero (1) after lock cleanup; forcing cleanup.
[1437673998.940772] LustreError: 81375:0:(ldlm_resource.c:809:ldlm_resource_complain()) Skipped 2587 previous similar messages
[1437673998.952772] LustreError: 81375:0:(ldlm_resource.c:1448:ldlm_resource_dump()) --- Resource: [0x360375393:0xe66d:0x0].0 (ffff8fc07bee8a80) refcount = 2
[1437673998.952772] LustreError: 81375:0:(ldlm_resource.c:1451:ldlm_resource_dump()) Granted locks (in reverse order):
[1437673998.952772] LustreError: 81375:0:(ldlm_resource.c:1454:ldlm_resource_dump()) ### ### ns: nbp8-MDT0000-mdc-ffff8806cb247400 lock: ffff8fc0fa26dbc0/0x7f099458bb92bf52 lrc: 2/0,0 mode: PR/PR res: [0x360375393:0xe66d:0x0].0 bits 0x1b rrc: 2 type: IBT flags: 0x12e400000000 nid: local remote: 0x551d423294fa4bce expref: -99 pid: 46426 timeout: 0 lvb_type: 3
[1437673998.952772] LustreError: 81375:0:(ldlm_resource.c:1454:ldlm_resource_dump()) Skipped 3648 previous similar messages
[1437673998.952772] LustreError: 81375:0:(ldlm_resource.c:1448:ldlm_resource_dump()) --- Resource: [0x3603755cc:0x6454:0x0].0 (ffff8b075a9a8bc0) refcount = 2

Server

Jul 23 10:53:08 nbp8-mds1 kernel: LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 226s: evicting client at 10.151.63.50@o2ib  ns: mdt-nbp8-MDT0000_UUID lock: ffff882a2c6794c0/0x551d4232f4dfbb5e lrc: 3/0,0 mode: PR/PR res: [0x4976d01:0xe2d4a4f3:0x0].0 bits 0x13 rrc: 848 type: IBT flags: 0x60200000000020 nid: 10.151.63.50@o2ib remote: 0x7f099458bb9c07c4 expref: 9 pid: 9672 timeout: 8029699438 lvb_type: 0
Jul 23 10:53:09 nbp8-mds1 kernel: LNet: 5828:0:(lib-move.c:865:lnet_post_send_locked()) Aborting message for 12345-10.151.12.174@o2ib: LNetM[DE]Unlink() already called on the MD/ME.
Jul 23 10:53:09 nbp8-mds1 kernel: LNet: 5828:0:(lib-move.c:865:lnet_post_send_locked()) Skipped 41 previous similar messages
Jul 23 10:53:39 nbp8-mds1 kernel: format at ldlm_pool.c:628:ldlm_pool_recalc doesn't end in newline

On the client all ldlm_bl threads are stuck

0xffff8a0739b06080    21185        2  1  711   R  0xffff8a0739b066f0  ldlm_bl_110
 [<ffffffff814760e8>] _raw_spin_unlock_irqrestore+0x8/0x10
 [<ffffffffa0d7a807>] osc_page_delete+0xe7/0x360 [osc]
 [<ffffffffa0ad14d5>] cl_page_delete0+0xc5/0x4e0 [obdclass]
 [<ffffffffa0ad192a>] cl_page_delete+0x3a/0x120 [obdclass]
 [<ffffffffa0ee16a6>] ll_invalidatepage+0x96/0x160 [lustre]
 [<ffffffffa0ef314d>] vvp_page_discard+0x8d/0x120 [lustre]
 [<ffffffffa0acda58>] cl_page_invoid+0x78/0x170 [obdclass]
 [<ffffffffa0ad490c>] discard_cb+0xbc/0x1e0 [obdclass]
 [<ffffffffa0ad2467>] cl_page_gang_lookup+0x1f7/0x3f0 [obdclass]
 [<ffffffffa0ad471a>] cl_lock_discard_pages+0xfa/0x1d0 [obdclass]
 [<ffffffffa0d7c0d2>] osc_lock_flush+0xf2/0x260 [osc]
 [<ffffffffa0d7c339>] osc_lock_cancel+0xf9/0x1e0 [osc]
 [<ffffffffa0ad2bd5>] cl_lock_cancel0+0x65/0x150 [obdclass]
 [<ffffffffa0ad394b>] cl_lock_cancel+0x14b/0x150 [obdclass]
 [<ffffffffa0d7cc1d>] osc_lock_blocking+0x5d/0xf0 [osc]
 [<ffffffffa0d7dff9>] osc_dlm_blocking_ast0+0xf9/0x210 [osc]
 [<ffffffffa0d7e15c>] osc_ldlm_blocking_ast+0x4c/0x100 [osc]
 [<ffffffffa0be4eef>] ldlm_cancel_callback+0x5f/0x180 [ptlrpc]
 [<ffffffffa0bf380f>] ldlm_cli_cancel_local+0x7f/0x480 [ptlrpc]
 [<ffffffffa0bf6b82>] ldlm_cli_cancel_list_local+0xf2/0x290 [ptlrpc]
 [<ffffffffa0bfba07>] ldlm_bl_thread_main+0xf7/0x450 [ptlrpc]
 [<ffffffff81083ae6>] kthread+0x96/0xa0
 [<ffffffff8147f164>] kernel_thread_helper+0x4/0x10

These events cause MDS I/O to stop for a few minutes.



 Comments   
Comment by Peter Jones [ 24/Jul/15 ]

Bobijam

Could you please look into this one?

Thanks

Peter

Comment by Zhenyu Xu [ 27/Jul/15 ]

The log shows that the client hadn't finished a lock cancellation while the MDT thought the client was dead and evicted it.

Comment by Mahmoud Hanafi [ 27/Jul/15 ]

Why was this causing a pause on the MDS?
FYI, the client is a 1024-core shared-memory host. Hundreds of ldlm_bl threads are all stuck at osc_page_delete.

Comment by Zhenyu Xu [ 28/Jul/15 ]

Do you have logs from the MDS?

Comment by Mahmoud Hanafi [ 28/Jul/15 ]

There isn't much more in the MDS logs other than the "lock callback timer expired" messages. I tried to get a Lustre debug dump but wasn't able to capture it quickly enough.

Comment by Zhenyu Xu [ 29/Jul/15 ]

I think the client could be so busy cancelling locks that it still misses the lock callback timeout. Can you try http://review.whamcloud.com/#/c/14342/ and http://review.whamcloud.com/#/c/12603 on the client node?

Comment by Jay Lan (Inactive) [ 29/Jul/15 ]

I need a b2_5 port of #12603. I have a conflict in lustre/ldlm/ldlm_request.c.

Comment by Zhenyu Xu [ 30/Jul/15 ]

Here is the b2_5 port of #12603: http://review.whamcloud.com/15800

Comment by Jay Lan (Inactive) [ 30/Jul/15 ]

Thank you, Zhenyu.

Comment by Mahmoud Hanafi [ 03/Aug/15 ]

We tried the patch and it didn't help. Running a 312-CPU IOR job and cancelling the IOR run would cause all the ldlm_bl threads to lock up in _raw_spin_unlock_irqrestore. Uploading debug logs:

debug.out.withpatch.ofed3.5.2.1438633632.bz2
debug.out.withpatch.mofed.1438631502.bz2
debug.out.withpatch.mofed.secondrun.1438631826.bz2

Comment by Jinshan Xiong (Inactive) [ 04/Aug/15 ]

Those threads were not stuck but simply busy. Also, the eviction happened on the MDC. How many threads did you see in this state?

Comment by Mahmoud Hanafi [ 04/Aug/15 ]

I had a 312-CPU job, so there were at least that many threads in that state. The threads never finish, and that causes the client to get evicted.

Comment by Mahmoud Hanafi [ 11/Aug/15 ]

Please update this case.

Comment by Jinshan Xiong (Inactive) [ 11/Aug/15 ]

I will work on this issue.

Comment by Jinshan Xiong (Inactive) [ 12/Aug/15 ]

There are 128 ldlm blocking threads and all of them are busy discarding pages. There is probably a long queue, so it takes a long time before the blocking AST from the MDT gets handled.

Given that this node has 1024 cores, I will increase the number of ldlm threads and see how it goes. At present there is a kernel module parameter, ldlm_num_threads, but it has a hard limit of 128. Nobody could have predicted that such a fat node would exist.

If increasing the number of ldlm threads doesn't help, I will give the blocking callback of inodebits locks a higher priority.
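
For reference, a minimal sketch of how the thread count could be raised once the cap is lifted. This assumes ldlm_num_threads is exposed as a parameter of the ptlrpc module (where the ldlm code lives in the b2_5 tree) and that the value is still clamped to LDLM_NTHRS_MAX:

# set at module load time on the client
echo "options ptlrpc ldlm_num_threads=512" > /etc/modprobe.d/lustre-ldlm.conf
# after the Lustre modules are reloaded, confirm the value that was accepted
cat /sys/module/ptlrpc/parameters/ldlm_num_threads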

Comment by Mahmoud Hanafi [ 12/Aug/15 ]

We don't see this issue in 2.4.3 and can reproduce it very easily in 2.5.3, so the behavior has changed. Are there any patches that may have caused this change?

Comment by Gerrit Updater [ 12/Aug/15 ]

Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/15960
Subject: LU-6898 ldlm: increase LDLM_NTHRS_MAX of ldlm threads
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: 6ef32d9684b78d04bf67b936c28992a154f14345

Comment by Jinshan Xiong (Inactive) [ 12/Aug/15 ]

I will check. Meanwhile, if you have the hardware resources, you can narrow it down with 'git bisect'.

Comment by Tommi Tervo [ 12/Aug/15 ]

We see this same issue on compute nodes with a lot of memory (1.5 TB, 32 CPUs) (2.6.32-504.12.2.el6.x86_64, MOFED 2.3 and Lustre 2.5.3.90).

Comment by Mahmoud Hanafi [ 14/Aug/15 ]

Patch http://review.whamcloud.com/15960 helped some. The client eventually works through all the waiting ldlm_bl threads, but it takes time, whereas in 2.4.3 this was never the case. So it is still not fit for production on a large-CPU node.

Backtrace during the wait:

[0]kdb> btp 31094
Stack traceback for pid 31094
0xffff8ac79e562540    31094        2  1  753   R  0xffff8ac79e562bb0  ldlm_bl_169
 [<ffffffff814760e8>] _raw_spin_unlock_irqrestore+0x8/0x10
 [<ffffffffa0d08817>] osc_page_delete+0xe7/0x360 [osc]
 [<ffffffffa0c2f615>] cl_page_delete0+0xc5/0x4e0 [obdclass]
 [<ffffffffa0c2fa6a>] cl_page_delete+0x3a/0x120 [obdclass]
 [<ffffffffa1056776>] ll_invalidatepage+0x96/0x160 [lustre]
 [<ffffffffa106821d>] vvp_page_discard+0x8d/0x120 [lustre]
 [<ffffffffa0c2ba68>] cl_page_invoid+0x78/0x170 [obdclass]
 [<ffffffffa0c32d0c>] discard_cb+0xbc/0x1e0 [obdclass]
 [<ffffffffa0c305a7>] cl_page_gang_lookup+0x1f7/0x3f0 [obdclass]
 [<ffffffffa0c32b1a>] cl_lock_discard_pages+0xfa/0x1d0 [obdclass]
 [<ffffffffa0d0a0f2>] osc_lock_flush+0xf2/0x260 [osc]
 [<ffffffffa0d0a359>] osc_lock_cancel+0xf9/0x1e0 [osc]
 [<ffffffffa0c30fd5>] cl_lock_cancel0+0x65/0x150 [obdclass]
 [<ffffffffa0c31d4b>] cl_lock_cancel+0x14b/0x150 [obdclass]
 [<ffffffffa0d0ac3d>] osc_lock_blocking+0x5d/0xf0 [osc]
 [<ffffffffa0d0c019>] osc_dlm_blocking_ast0+0xf9/0x210 [osc]
 [<ffffffffa0d0c17c>] osc_ldlm_blocking_ast+0x4c/0x100 [osc]
 [<ffffffffa0ec7670>] ldlm_handle_bl_callback+0xc0/0x420 [ptlrpc]
 [<ffffffffa0ec7bd1>] ldlm_bl_thread_main+0x201/0x450 [ptlrpc]
 [<ffffffff81083ae6>] kthread+0x96/0xa0
 [<ffffffff8147f164>] kernel_thread_helper+0x4/0x10
     r15 = 0xffff8fc79e187db0      r14 = 0xffff88a092987558 
     r13 = 0xffff8fc79e1875e0      r12 = 0xffffffff8147f31e 
      bp = 0xffff88a092987558       bx = 0x0000000100033e68 
     r11 = 0xffffffffa1060bb0      r10 = 0x00000000000001e1 
      r9 = 0xffff88a736d76200       r8 = 0xffffffffa0d36e40 
      ax = 0xffffffffa0d36e58       cx = 0x0000000000000000 
      dx = 0xffffffffa0d36e58       si = 0x0000000000000282 
      di = 0x0000000000000282  orig_ax = 0xffffffffffffff01 
      ip = 0xffffffff814760e8       cs = 0x0000000000000010 

Number of lock threads

[mhanafi@endeavour2:~]$ ps -ef |grep ldlm_bl | wc -l
1295
[mhanafi@endeavour2:~]$ ps -ef |grep ldlm_cb | wc -l
438

Comment by Jinshan Xiong (Inactive) [ 14/Aug/15 ]

Hi Mahmoud,

Do you have a rough idea what type of locks (read or write) are being canceled? For read locks, clients have to do more work to check whether the pages being destroyed are covered by another lock.

Comment by Jinshan Xiong (Inactive) [ 17/Aug/15 ]

Hmm, this is a write lock.

What's the spinlock at 'osc_page_delete+0xe7/0x360'? It looks like this lock is highly contended.

Comment by Jay Lan (Inactive) [ 19/Aug/15 ]

Which commit or tag of 2.5.3 (i.e. b2_5) would be a good starting point to establish 'git bisect good' for this problem, as suggested in
https://jira.hpdd.intel.com/browse/LU-6898?focusedCommentId=123923&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-123923

Comment by Jinshan Xiong (Inactive) [ 19/Aug/15 ]

What was the exact version number of 2.4 that was running okay on the node?

Comment by Jay Lan (Inactive) [ 19/Aug/15 ]

Our version of 2.4 is based on 2.4.3 with some extra patches. You can see our git repo at
https://github.com/jlan/lustre-nas/tree/nas-2.4.3

The system in question is running a build at tag 2.4.3-12nasC, which corresponds to this:
8d46eeb - LU-4604 obdclass: handle ldt_device_nr/ldt_linkage properly
which was committed to nas-2.4.3 on Dec 23, 2014.

Comment by Jinshan Xiong (Inactive) [ 20/Aug/15 ]

I didn't find that area of code changed much from tag v2_4_3 to v2_5_3 (I checked the extra patches applied to your branch as well).

From the stack trace, it looks like all cancelling threads are contending on the LRU list lock in osc_lru_del(), which you can verify by running the reproducer under 'perf top', for example. But this area of code hasn't changed that much, so there is probably a hidden issue; therefore I would like to hold off on creating a patch for now.

There are roughly 800 commits between v2_4_3 and v2_5_3, so a bisection should identify the problematic patch(es) in about 10 iterations (log2(800) is roughly 10).
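
A sketch of that bisection against the fs/lustre-release tree, using the tags referenced above (each iteration needs a client build, an install on the test node, and one run of the IOR reproducer):

git bisect start
git bisect bad v2_5_3        # the ldlm_bl spin reproduces here
git bisect good v2_4_3       # known-good base
# build and install the client at the commit git checks out, run the reproducer, then:
git bisect good              # or 'git bisect bad', depending on the result
# repeat until git reports the first bad commit, then clean up:
git bisect reset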

Comment by Jay Lan (Inactive) [ 20/Aug/15 ]

The b2_5 branch branched off at roughly
"LU-1617 build: ignore automatically generated files,"
which is close to v2_4_0 (LU-2784).

We tested a build of the LU-2784 commit from the b2_5 tree and (YEAH!!!) it did not show the problem. The build from v2_4_52 did show the problem!

Our dedicated time on our big system ran out. We will continue tomorrow (6pm - 8pm) to narrow it down. If you can spot the culprit by examining the code, that would be even better!

Comment by Jay Lan (Inactive) [ 21/Aug/15 ]

We have it! Commit c8fd9c3 "LU-2850 kernel: 3.8 upstream kills daemonize()" is what causes the ldlm lock threads to take up 100% CPU time and not go away on large-CPU-count systems!

We tested commit a35113b - "LU-2901 mdt: duplicate link names in directory", the one immediately before c8fd9c3, and it did not show the problem.

It is an adaptation to a kernel change from Linux 3.8 that you brought in. Somewhere your code needs to adapt to this change properly.

Comment by Peter Jones [ 21/Aug/15 ]

Nice detective work Jay!

Comment by Jinshan Xiong (Inactive) [ 22/Aug/15 ]

This is a huge patch - I'm looking at it.

Comment by Jinshan Xiong (Inactive) [ 22/Aug/15 ]

Is that because there are signals pending for ldlm lock canceling threads?

Comment by Gerrit Updater [ 24/Aug/15 ]

Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/16063
Subject: LU-6898 ldlm: block signal for dlm block thread
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: 3e676c6dd7d279fb846a20fd6750c45f14d617d6

Comment by Jinshan Xiong (Inactive) [ 24/Aug/15 ]

Patch 16063 is to verify the idea by blocking signals for the ldlm blocking threads. If it fixes the problem, I'm going to check the changes in commit c8fd9c3 and decide which threads need signals to be blocked.
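
On a live client, the signal state of the ldlm_bl threads can also be inspected without a dump; a quick sketch (standard /proc fields, nothing Lustre-specific):

# SigPnd/ShdPnd are the private/shared pending signal sets, SigBlk is the blocked mask
for pid in $(pgrep ldlm_bl); do
    echo "== $pid =="
    grep -E 'SigPnd|ShdPnd|SigBlk' /proc/$pid/status
done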

Comment by Jay Lan (Inactive) [ 25/Aug/15 ]

We do not have the system to test until tomorrow.

BTW, I looked at the task_struct of one of the ldlm thread and saw its sigpending struct as
below:

shared_pending = {
  list = { next = 0xffff8d87bb5ce160, prev = 0xffff8d87bb5ce160 },
  signal = {
    sig = {18446618212836368752}
  }
},

The hex value of 18446618212836368752 is 0xffff8d87bb5ce170
Does this make sense?

Comment by Jay Lan (Inactive) [ 26/Aug/15 ]

Any suggestions for how to test tomorrow whether our problem still shows up with patch 16063?

Comment by Jinshan Xiong (Inactive) [ 26/Aug/15 ]

We should check the sigset task_struct::blocked of the thread in question.

Most likely this is the problem. I will take a further look if patch 16063 does not work.
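
One way to check that against the July 27 vmcore is with crash(8); a minimal sketch, assuming a debug vmlinux matching the client kernel is available (the PID is the ldlm_bl_110 thread from the backtrace in the description):

crash vmlinux vmcore
crash> ps | grep ldlm_bl         # list the ldlm_bl threads and pick a spinning one
crash> bt 21185                  # confirm it is in the osc_page_delete path
crash> task -R blocked 21185     # print task_struct::blocked for that thread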

Comment by Jay Lan (Inactive) [ 26/Aug/15 ]

Jinshan, patch 16063 did not help.

We have another block of time reserved for Friday 4-6pm. Please advise if we should keep that reservation. A block of time on Friday afternoon would prevent jobs that need more than 2 days from starting.

Comment by Jinshan Xiong (Inactive) [ 26/Aug/15 ]

Do you still own the test node? It would be helpful to get a coredump and upload to our ftp site so that I can take a further look.

Comment by Jinshan Xiong (Inactive) [ 26/Aug/15 ]

Yes, please keep the reservation.

I found a problem with the patch. I am updating it now.

Comment by Jay Lan (Inactive) [ 26/Aug/15 ]

Taking a vmcore of a system this size takes time. I am waiting for your next patch. If that still fails, we will then take a kdump.

Also, "please keep the reservation" do you meant the block time now of the block time on Friday?

Comment by Jinshan Xiong (Inactive) [ 26/Aug/15 ]

That was an answer to the question 'Please advise if we should keep that reservation.'

Comment by Jay Lan (Inactive) [ 26/Aug/15 ]

Please take a look at the btall.gz we uploaded on July 27; it is a backtrace of all kernel threads from a machine with 1024 cores and 4 TB of memory.

We still have the vmcore of that incident.
Checking the blocked signals in the vmcore, I found:
1) All kernel threads of "ldlm_bl_***" with a stack trace like this do NOT have a blocked sigset:

#16 [ffff8fa79de1b9f8] osc_page_delete at ffffffffa0d7a807 [osc]
#17 [ffff8fa79de1ba28] cl_page_delete0 at ffffffffa0ad14d5 [obdclass]
#18 [ffff8fa79de1ba68] cl_page_delete at ffffffffa0ad192a [obdclass]
#19 [ffff8fa79de1ba88] ll_invalidatepage at ffffffffa0ee16a6 [lustre]
#20 [ffff8fa79de1bab8] vvp_page_discard at ffffffffa0ef314d [lustre]
#21 [ffff8fa79de1bad8] cl_page_invoid at ffffffffa0acda58 [obdclass]
#22 [ffff8fa79de1bb18] check_and_discard_cb at ffffffffa0ad5dac [obdclass]
#23 [ffff8fa79de1bb58] cl_page_gang_lookup at ffffffffa0ad2467 [obdclass]
#24 [ffff8fa79de1bbf8] cl_lock_discard_pages at ffffffffa0ad471a [obdclass]
#25 [ffff8fa79de1bc38] osc_lock_flush at ffffffffa0d7c0d2 [osc]
#26 [ffff8fa79de1bc88] osc_lock_cancel at ffffffffa0d7c339 [osc]
#27 [ffff8fa79de1bcc8] cl_lock_cancel0 at ffffffffa0ad2bd5 [obdclass]
#28 [ffff8fa79de1bce8] cl_lock_cancel at ffffffffa0ad394b [obdclass]
#29 [ffff8fa79de1bd08] osc_lock_blocking at ffffffffa0d7cc1d [osc]
#30 [ffff8fa79de1bd38] osc_dlm_blocking_ast0 at ffffffffa0d7dff9 [osc]
#31 [ffff8fa79de1bd78] osc_ldlm_blocking_ast at ffffffffa0d7e15c [osc]
#32 [ffff8fa79de1bdb8] ldlm_cancel_callback at ffffffffa0be4eef [ptlrpc]
#33 [ffff8fa79de1bdc8] ldlm_cli_cancel_local at ffffffffa0bf380f [ptlrpc]
#34 [ffff8fa79de1bde8] ldlm_cli_cancel_list_local at ffffffffa0bf6b82 [ptlrpc]
#35 [ffff8fa79de1be48] ldlm_bl_thread_main at ffffffffa0bfba07 [ptlrpc]
#36 [ffff8fa79de1bee8] kthread at ffffffff81083ae6
#37 [ffff8fa79de1bf48] kernel_thread_helper at ffffffff8147f164

2) All kernel threads of "ldlm_cb*_**" with a stack trace like this:
PID: 4677 TASK: ffff8834baff84c0 CPU: 18 COMMAND: "ldlm_cb00_008"
#0 [ffff88366f143cb0] schedule at ffffffff81473d9b
#1 [ffff88366f143df8] ptlrpc_wait_event at ffffffffa0c25445 [ptlrpc]
#2 [ffff88366f143e78] ptlrpc_main at ffffffffa0c2e91b [ptlrpc]
#3 [ffff88366f143ee8] kthread at ffffffff81083ae6
#4 [ffff88366f143f48] kernel_thread_helper at ffffffff8147f164
DOES have a blocked sigset:
blocked = {
  sig = {18446744073709551615}
},

3) all kernel threads of "ptlrpc_hr*_**" with stack trace of:
#0 [ffff88679ecb9ce0] schedule at ffffffff81473d9b
#1 [ffff88679ecb9d88] cfs_cpt_bind at ffffffffa098a0bd [libcfs]
#2 [ffff88679ecb9e28] ptlrpc_hr_main at ffffffffa0c28ad7 [ptlrpc]
#3 [ffff88679ecb9ee8] kthread at ffffffff81083ae6
#4 [ffff88679ecb9f48] kernel_thread_helper at ffffffff8147f164
DOES have a blocked sigset:
blocked = {
  sig = {18446744073709551615}
},

3) all kernel threads of "kiblnd_sd__" with stack trace of
PID: 25939 TASK: ffff88e768552040 CPU: 24 COMMAND: "kiblnd_sd_00_00"
#0 [ffff88e768555cb0] schedule at ffffffff81473d9b
#1 [ffff88e768555df8] kiblnd_scheduler at ffffffffa0ba52a0 [ko2iblnd]
#2 [ffff88e768555ee8] kthread at ffffffff81083ae6
#3 [ffff88e768555f48] kernel_thread_helper at ffffffff8147f164
DOES have a blocked sigset:
blocked = {
  sig = {18446744073709551615}
},

4) all kernel threads of "ptlrpcd_????" with stack trace of
PID: 27092 TASK: ffff8f079ef16340 CPU: 1019 COMMAND: "ptlrpcd_1023"
#0 [ffff8f079ef19c30] schedule at ffffffff81473d9b
#1 [ffff8f079ef19d78] schedule_timeout at ffffffff81474550
#2 [ffff8f079ef19e08] ptlrpcd at ffffffffa0c3d7c5 [ptlrpc]
#3 [ffff8f079ef19ee8] kthread at ffffffff81083ae6
#4 [ffff8f079ef19f48] kernel_thread_helper at ffffffff8147f164
DOES have a blocked sigset:
blocked = {
  sig = {18446744073709551615}
},

It seems that all Lustre kernel threads except those in category 1 ("ldlm_bl_???") have a blocked sigset. The contents of the sigsets are all the same.

Comment by Jinshan Xiong (Inactive) [ 27/Aug/15 ]

Hi Jay,

I realized that the new patch set of 16063 won't fix the problem either. I have no idea how commit c8fd9c3 could cause this problem. I will do further investigation and will update this ticket if I find something new.

Comment by Jay Lan (Inactive) [ 27/Aug/15 ]

Hi Jinshan,

If you reached that conclusion because of the comment I made a few hours ago, please be advised that the blocked sigset data were from the vmcore dated July 27. I have not had a chance to test your new patch 16063 (patch set #2) yet.

Comment by Jinshan Xiong (Inactive) [ 27/Aug/15 ]

In that case, please try the 2nd patch anyway as I'm investigating the problem.

Please take a coredump if the problem still exists so that we can do postmortem analysis.

Comment by Jay Lan (Inactive) [ 29/Aug/15 ]

We reproduced the problem with patch at
http://review.whamcloud.com/#/c/16063/

We took a vmcore. Either Mahmoud or Herbert will encrypt it and FTP it to your server. Please note that the vmcore is restricted to US citizens.

The git repo is at
https://github.com/jlan/lustre-nas/tree/nas-2.5.3-LU6898

Comment by Jinshan Xiong (Inactive) [ 30/Aug/15 ]

Hmm, I will take a look at the coredump after it's uploaded.

Will you have another chance to run the test again to make sure that the problem does not exist at commit a35113b? I can't figure out why commit c8fd9c3 could cause this problem; otherwise I will escalate this ticket.

Comment by Gerrit Updater [ 01/Sep/15 ]

Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/16165
Subject: LU-6898 ldlm: debug patch
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: 7bcd94d94c915f6f3670f6355494daaa66662149

Comment by Jinshan Xiong (Inactive) [ 01/Sep/15 ]

I pushed patch 16165 to revert the changes to ldlm and ptlrpc in c8fd9c3. Please check if you can still see the problem by applying this patch.

Comment by James A Simmons [ 01/Sep/15 ]

Patch 16165 can't be the final fix since daemonize() is actually gone in newer kernels like RHEL7 and SLES12. Jay Lan, how are you reproducing this problem? I'd like to see if I can duplicate it.

Comment by Jinshan Xiong (Inactive) [ 01/Sep/15 ]

Obviously it isn't - it's a debug patch to identify the problem as the patch title says.

Comment by Jay Lan (Inactive) [ 01/Sep/15 ]

Tommi Tervo also reported the problem on 12/Aug/15 6:21 AM. His system has 1.5 TB of memory and 32 cores.

We cannot reproduce the problem on small systems, but can easily reproduce it on our 2 TB, 512-core SGI UV2000 system.

In Mahmoud's reproducer, the Lustre fs uses stripe-count=10. His script submits a PBS job that runs mpiexec with 312 copies of IOR.

Use 'top' to monitor the run. When you see IOR taking over the 'top' display, Ctrl-C to kill the job. Then, if you see all the ldlm_bl_xxx threads showing up in 'top' taking ~100% CPU time, the problem is reproduced. Sometimes it takes a couple of runs to reproduce the problem.

Comment by Mahmoud Hanafi [ 01/Sep/15 ]

The reproducer runs IOR with a single-stripe file per process, not a stripe count of 10.
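
For reference, a rough sketch of the reproducer described in the two comments above; the directory, transfer/block sizes, and job geometry are illustrative, not the exact NAS script:

lfs setstripe -c 1 /nbp8/testdir          # single-stripe files, per the note above
mpiexec -np 312 ior -w -F -t 1m -b 4g -o /nbp8/testdir/iorfile &
# watch the client with top; once IOR dominates the display, interrupt the job:
kill -INT %1
# the problem is reproduced if ldlm_bl_* threads stay near 100% CPU in top afterwards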

Comment by Jay Lan (Inactive) [ 02/Sep/15 ]

Jinshan,

Eight files in your patch caused conflicts.

Could you create a branch at commit c8fd9c3 and create your patch from there? Commit c8fd9c3 is where the problem starts.

Comment by Jinshan Xiong (Inactive) [ 02/Sep/15 ]

The patch is based on b2_5; please apply it to the top of b2_5.

Is it possible to enable lock stat and collect perf(1) data so that we know whether there is contention in the code?
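
For reference, a sketch of that kind of data collection, assuming the client kernel was built with CONFIG_LOCK_STAT and perf is installed (neither is a given on the SLES11 client):

# kernel lock statistics
echo 0 > /proc/lock_stat                   # clear any old counters
echo 1 > /proc/sys/kernel/lock_stat        # enable collection
# ... run the IOR reproducer and cancel it ...
echo 0 > /proc/sys/kernel/lock_stat        # stop collection
cat /proc/lock_stat > /tmp/lockstat.out    # contended lock classes with wait times

# CPU profile while the ldlm_bl threads are spinning
perf record -a -g -- sleep 30
perf report --sort symbol > /tmp/perf-report.txt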

Comment by Jay Lan (Inactive) [ 03/Sep/15 ]

Have you found anything from the crash dump, Jinshan and Oleg?

Comment by Oleg Drokin [ 06/Sep/15 ]

We found that the patch from Jinshan managed to set the blocking callback just as it was supposed to, so there should now be no difference from the behavior prior to the patch you mentioned.
I also see that all the ldlm_bl callbacks have very uniform backtraces, all of them culminating at a released spinlock, which is a strange coincidence considering there are 128 of them, but I have no explanation for it yet.

We are still thinking about the reasons, but meanwhile Jinshan is very interested in having b2_5 plus the patch he just reverted run on your system to see how it helps.

Comment by Jay Lan (Inactive) [ 08/Sep/15 ]

After applying the debug patch to nas-2.5.3, we cannot reproduce the problem.

What feature requires commit c8fd9c3? Do we need that patch on SLES11 SP3?

Comment by Jinshan Xiong (Inactive) [ 08/Sep/15 ]

As far as I know, there is no new feature that requires commit c8fd9c3; it just brings the Lustre code more in line with the kernel code.

I will investigate this issue further. Please apply the debug patch in your release while we're investigating.

Comment by Jay Lan (Inactive) [ 08/Sep/15 ]

The ldlm_bl_xxx threads were the ones that stalled the system. They all ran at ~100% CPU with commit c8fd9c3.

Comment by Lu [ 06/Jan/16 ]

Hi Jinshan and Mahmoud,
Recently our MDS servers have frequently been getting stuck by some Lustre clients. The symptoms are the same as Mahmoud Hanafi reported in this ticket, and perhaps https://jira.hpdd.intel.com/browse/LU-4572
It happens in this sequence:
1. One client reports a -107 connection error with an MDS server.
2. It starts to release locks; a lot of ldlm_resource_dump output appears in syslog.
3. During this period the ldlm_bl threads take 100% CPU on the client, and the MDS reports "### lock expired XXX seconds" for this client.
4. Other clients find the MDS stuck.
Since we have 7 Lustre instances, sometimes one client has the -107 error with more than one MDS server, and then this client gets more than one Lustre instance stuck.
Our current workaround is to manually 1) restart the client or 2) remount the MDT device. Do you have any updates or workarounds for this problem?

By the way, our Lustre version is 2.5.3 without any modification/patch.

Comment by Lu [ 06/Jan/16 ]

The back trace of ldlm_bl:
<6>ldlm_bl_19 R running task 0 8801 2 0x00000000
<4> ffff8802d24b3b30 ffffffff8129120a 000000000003ffff ffff880201c33a00
<4> ffff8802d24b3b10 ffffffffa0527a12 ffff8802d24b3b30 ffffffffa0868809
<4> 000000000000000e 000000000000000e ffff8802d24b3be0 ffffffffa052a687
<4>Call Trace:
<4> [<ffffffff8129120a>] ? radix_tree_gang_lookup+0x7a/0xf0
<4> [<ffffffffa0527a12>] ? cl_page_vmpage+0x82/0xb0 [obdclass]
<4> [<ffffffffa0868809>] ? weigh_cb+0x19/0x50 [osc]
<4> [<ffffffffa052a687>] cl_page_gang_lookup+0x277/0x3f0 [obdclass]
<4> [<ffffffffa08687f0>] ? weigh_cb+0x0/0x50 [osc]
<4> [<ffffffffa08689b7>] osc_ldlm_weigh_ast+0x177/0x400 [osc]
<4> [<ffffffffa084d05d>] osc_cancel_for_recovery+0x5d/0xa0 [osc]
<4> [<ffffffffa064011b>] ldlm_cancel_lrur_policy+0xeb/0x100 [ptlrpc]
<4> [<ffffffffa0643f2b>] ldlm_prepare_lru_list+0x1cb/0x460 [ptlrpc]
<4> [<ffffffffa0640030>] ? ldlm_cancel_lrur_policy+0x0/0x100 [ptlrpc]
<4> [<ffffffffa0644344>] ldlm_cancel_lru_local+0x24/0x40 [ptlrpc]
<4> [<ffffffffa0644d02>] ldlm_cli_cancel+0x122/0x360 [ptlrpc]
<4> [<ffffffffa0869877>] osc_ldlm_blocking_ast+0x1b7/0x350 [osc]
<4> [<ffffffff8152a61d>] ? wait_for_completion+0x1d/0x20
<4> [<ffffffffa06483c0>] ldlm_handle_bl_callback+0x130/0x400 [ptlrpc]
<4> [<ffffffffa06488f1>] ldlm_bl_thread_main+0x261/0x3c0 [ptlrpc]
<4> [<ffffffff81064b90>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa0648690>] ? ldlm_bl_thread_main+0x0/0x3c0 [ptlrpc]
<4> [<ffffffff8109e66e>] kthread+0x9e/0xc0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
<6>ldlm_bl_19 R running task 0 8802 2 0x00000000
<4> ffff88027f49bbb0 0000000000000046 000000000003ffff ffff880391a08000
<4> ffff88027f49bb10 ffffffffa0527a12 ffff88027f49bb30 ffffffffa0868809
<4> 000000000000000e 000000000000000e ffff88082aa4f098 ffff88027f49bfd8
<4>Call Trace:
<4> [<ffffffffa0527a12>] ? cl_page_vmpage+0x82/0xb0 [obdclass]
<4> [<ffffffffa0868809>] ? weigh_cb+0x19/0x50 [osc]
<4> [<ffffffffa08687f0>] ? weigh_cb+0x0/0x50 [osc]
<4> [<ffffffff8106c85a>] __cond_resched+0x2a/0x40
<4> [<ffffffff8152a3d0>] _cond_resched+0x30/0x40
<4> [<ffffffffa0868b25>] osc_ldlm_weigh_ast+0x2e5/0x400 [osc]
<4> [<ffffffffa084d05d>] osc_cancel_for_recovery+0x5d/0xa0 [osc]
<4> [<ffffffffa064011b>] ldlm_cancel_lrur_policy+0xeb/0x100 [ptlrpc]
<4> [<ffffffffa0643f2b>] ldlm_prepare_lru_list+0x1cb/0x460 [ptlrpc]
<4> [<ffffffffa0640030>] ? ldlm_cancel_lrur_policy+0x0/0x100 [ptlrpc]
<4> [<ffffffffa0644344>] ldlm_cancel_lru_local+0x24/0x40 [ptlrpc]
<4> [<ffffffffa0644d02>] ldlm_cli_cancel+0x122/0x360 [ptlrpc]
<4> [<ffffffffa0869877>] osc_ldlm_blocking_ast+0x1b7/0x350 [osc]
<4> [<ffffffffa06483c0>] ldlm_handle_bl_callback+0x130/0x400 [ptlrpc]
<4> [<ffffffffa06488f1>] ldlm_bl_thread_main+0x261/0x3c0 [ptlrpc]
<4> [<ffffffff81064b90>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa0648690>] ? ldlm_bl_thread_main+0x0/0x3c0 [ptlrpc]
<4> [<ffffffff8109e66e>] kthread+0x9e/0xc0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20

The client syslog:
Jan 5 18:12:49 bwsjnws020 kernel: LustreError: 11-0: bes3fs-OST0051-osc-ffff881070edf000: Communicating with 192.168.50.165@tcp, operation obd_ping failed with -107.
Jan 5 18:12:49 bwsjnws020 kernel: Lustre: pubfs-MDT0000-mdc-ffff88105ab8e400: Connection to pubfs-MDT0000 (at 192.168.50.67@tcp) was lost; in progress operations using this service will wait for recovery to complete
Jan 5 18:12:49 bwsjnws020 kernel: Lustre: Skipped 3 previous similar messages
Jan 5 18:12:49 bwsjnws020 kernel: LustreError: Skipped 6 previous similar messages
Jan 5 18:12:49 bwsjnws020 kernel: LustreError: 167-0: pubfs-MDT0000-mdc-ffff88105ab8e400: This client was evicted by pubfs-MDT0000; in progress operations using this service will fail.
Jan 5 18:12:49 bwsjnws020 kernel: LustreError: Skipped 3 previous similar messages
Jan 5 18:12:49 bwsjnws020 kernel: LustreError: 1058:0:(ldlm_resource.c:809:ldlm_resource_complain()) pubfs-MDT0000-mdc-ffff88105ab8e400: namespace resource [0x200007898:0x1:0x0].0 (ffff8806de1651c0) refcount nonzero (2) after lock cleanup; forcing cleanup.
Jan 5 18:12:49 bwsjnws020 kernel: LustreError: 1058:0:(ldlm_resource.c:809:ldlm_resource_complain()) Skipped 7 previous similar messages
Jan 5 18:12:49 bwsjnws020 kernel: LustreError: 1058:0:(ldlm_resource.c:1448:ldlm_resource_dump()) --- Resource: [0x200007898:0x1:0x0].0 (ffff8806de1651c0) refcount = 3
Jan 5 18:12:49 bwsjnws020 kernel: LustreError: 1058:0:(ldlm_resource.c:1448:ldlm_resource_dump()) --- Resource: [0x2000074d1:0x94e:0x0].0 (ffff8801d79be340) refcount = 3
Jan 5 18:12:49 bwsjnws020 kernel: Lustre: pubfs-MDT0000-mdc-ffff88105ab8e400: Connection restored to pubfs-MDT0000 (at 192.168.50.67@tcp)
Jan 5 18:12:49 bwsjnws020 kernel: Lustre: Skipped 3 previous similar messages
Jan 5 18:12:49 bwsjnws020 kernel: LustreError: 1059:0:(ldlm_resource.c:1448:ldlm_resource_dump()) --- Resource: [0x1b995:0x0:0x0].0 (ffff880721859700) refcount = 2

The server log:
Jan 5 18:23:27 smds kernel: Lustre: 10627:0:(service.c:1347:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
Jan 5 18:23:27 smds kernel: req@ffff8812ab48ec00 x1518974973017848/t0(0) o101->c82eb4b4-c8b9-07c2-e3fc-87539e7b2adf@192.168.54.29@tcp:0/0 lens 576/3384 e 6 to 0 dl 1451989412 ref 2 fl Interpret:/0/0 rc 0/0
Jan 5 18:23:27 smds kernel: Lustre: 10627:0:(service.c:1347:ptlrpc_at_send_early_reply()) Skipped 12 previous similar messages
Jan 5 18:23:28 smds kernel: Lustre: 10819:0:(service.c:1347:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
Jan 5 18:23:28 smds kernel: req@ffff880650196400 x1513068049228408/t0(0) o101->5c631268-d0fc-8c0d-05f8-c3d7a7faae99@192.168.57.199@tcp:0/0 lens 576/3384 e 6 to 0 dl 1451989413 ref 2 fl Interpret:/0/0 rc 0/0
Jan 5 18:23:28 smds kernel: Lustre: 10819:0:(service.c:1347:ptlrpc_at_send_early_reply()) Skipped 2 previous similar messages
Jan 5 18:23:34 smds kernel: Lustre: 10534:0:(service.c:1347:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
Jan 5 18:23:34 smds kernel: req@ffff88169949fc00 x1509295010536664/t0(0) o101->9dda6a2d-4137-995a-cad9-a28186cc8b73@192.168.57.44@tcp:0/0 lens 576/3384 e 6 to 0 dl 1451989419 ref 2 fl Interpret:/0/0 rc 0/0
Jan 5 18:23:34 smds kernel: Lustre: 10534:0:(service.c:1347:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message
Jan 5 18:23:47 smds kernel: Lustre: 10819:0:(service.c:1347:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
Jan 5 18:23:47 smds kernel: req@ffff880213478800 x1520526088261116/t0(0) o101->fd06b39c-b81c-d983-43f8-89c8672bb709@202.122.33.204@tcp:0/0 lens 544/3384 e 6 to 0 dl 1451989432 ref 2 fl Interpret:/0/0 rc 0/0
Jan 5 18:24:03 smds kernel: Lustre: 10819:0:(service.c:1347:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
Jan 5 18:24:03 smds kernel: req@ffff880a487c0000 x1520538965607805/t0(0) o101->1a2f04d1-18eb-410e-f207-5222c830edf2@202.122.33.209@tcp:0/0 lens 544/3384 e 4 to 0 dl 1451989448 ref 2 fl Interpret:/0/0 rc 0/0
Jan 5 18:24:51 smds kernel: Lustre: 10819:0:(service.c:1347:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
Jan 5 18:24:51 smds kernel: req@ffff8803751db800 x1520526088262835/t0(0) o101->fd06b39c-b81c-d983-43f8-89c8672bb709@202.122.33.204@tcp:0/0 lens 544/3384 e 2 to 0 dl 1451989496 ref 2 fl Interpret:/0/0 rc 0/0
Jan 5 18:25:03 smds kernel: Lustre: 10554:0:(service.c:2039:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (600:90s); client may timeout. req@ffff8817e6235000 x1509380311299204/t0(0) o101->d0626c3e-a7a7-7421-ed73-e28aa9242043@192.168.61.234@tcp:0/0 lens 576/536 e 6 to 0 dl 1451989413 ref 1 fl Complete:/0/0 rc 0/0
Jan 5 18:25:03 smds kernel: Lustre: 10554:0:(service.c:2039:ptlrpc_server_handle_request()) Skipped 12 previous similar messages
Jan 05 18:33:01 smds ihep_heartbeat_log: the heplog system is collecting logs. @Jan 05 18:33:01
Jan 5 18:34:59 smds kernel: Lustre: 10547:0:(service.c:1347:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/4), not sending early reply
Jan 5 18:34:59 smds kernel: req@ffff8815354d3000 x1519342037432032/t0(0) o101->740c17eb-17dd-83c0-ad32-6236642ecf32@202.122.33.224@tcp:0/0 lens 576/3384 e 0 to 0 dl 1451990104 ref 2 fl Interpret:/0/0 rc 0/0
Jan 5 18:35:04 smds kernel: LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 601s: evicting client at 192.168.85.14@tcp ns: mdt-scrafs-MDT0000_UUID lock: ffff880b43a1e1c0/0x24184c57e1317020 lrc: 3/0,0 mode: PR/PR res: [0xa000001:0xb432f5:0x0].0 bits 0x13 rrc: 560 type: IBT flags: 0x60200000000020 nid: 192.168.85.14@tcp remote: 0x42dffa022b197861 expref: 22 pid: 10741 timeout: 15444565589 lvb_type: 0
Jan 5 18:35:04 smds kernel: LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) Skipped 4 previous similar messages
Jan 5 18:35:04 smds kernel: LustreError: 10673:0:(ldlm_lockd.c:1335:ldlm_handle_enqueue0()) ### lock on destroyed export ffff880bbf050800 ns: mdt-scrafs-MDT0000_UUID lock: ffff8807bae9db80/0x24184c57e1317090 lrc: 3/0,0 mode: PR/PR res: [0xa000001:0xb432f5:0x0].0 bits 0x13 rrc: 628 type: IBT flags: 0x50200400000020 nid: 192.168.85.14@tcp remote: 0x42dffa022b197868 expref: 7 pid: 10673 timeout: 0 lvb_type: 0

Comment by Jinshan Xiong (Inactive) [ 06/Jan/16 ]

Hi Lu,

Have you noticed whether there was heavy contention on some locks when this issue occurred? It looks like the OSC in question was in recovery state, and therefore it picked some read locks to cancel before recovery.

Comment by Jeremy Filizetti [ 10/Feb/16 ]

Has there been any progress on fixing this issue?

Comment by Mahmoud Hanafi [ 10/Mar/16 ]

We haven't seen this issue recently. Closing; we will reopen if needed.
