[LU-3889]  LBUG: (osc_lock.c:497:osc_lock_upcall()) ASSERTION( lock->cll_state >= CLS_QUEUING ) Created: 05/Sep/13  Updated: 16/Jun/14  Resolved: 04/Jan/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0, Lustre 2.6.0, Lustre 2.4.2
Fix Version/s: Lustre 2.6.0, Lustre 2.5.1

Type: Bug Priority: Blocker
Reporter: Patrick Farrell (Inactive) Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 1
Labels: HB, mn4
Environment:

CentOS 6.4 running fairly recent master.


Attachments: File osc_lock_LBUG-lu-3889.log.tar.gz    
Issue Links:
Duplicate
is duplicated by LU-4394 LBUG: (osc_lock.c:497:osc_lock_upcall... Resolved
Related
is related to LU-3027 Failure on test suite parallel-scale ... Resolved
is related to LU-3433 Encountered a assertion for the ols_s... Resolved
Severity: 3
Rank (Obsolete): 10138

 Description   

This assertion is hit on a CentOS client system running master. It's also been noticed on Cray SLES clients running 2.4.

This is fairly easy to reproduce on CentOS. I'll be attaching a log of this with debug=-1 set. (I was also running a special debug patch for this bug called rdebug, so you may see some extra output from that.)

Two things are needed: A reproducer script, and memory pressure on the system.

The reproducer is the following shell script. It was originally a test for a different bug, so I'm not sure if every step is needed. Run it in a directory with at least a few thousand files in it. [It may work with a smaller number of files; I'm just describing how I've reproduced it.]

for idx in $(seq 0 10000); do
    time ls -laR > /dev/null
    touch somefile
    rm -f somefile
    echo $idx: $(date +%T) $(grep MemFree /proc/meminfo)
done


I used the small C program below to create the memory pressure. Start the reproducer script above, and then run this as well.

Hold down Enter and watch the test script output as free memory drops. Once you're down to a small amount free, the total amount of free memory will stop dropping. Keep holding down Enter to maintain memory pressure, and the bug will happen after a few moments.

#include <stdio.h>
#include <stdlib.h>   /* needed for malloc() */
#include <unistd.h>

int main(void)
{
    int i;
    char *junk;

start:
    i = 0;

    while (i < 50) {
        printf("Malloc!\n");
        junk = malloc(1024 * 1024 * 1024);
        if (junk != NULL)
            junk[0] = i;   /* touch one byte so a page is actually committed */
        i++;
    }

    printf("Mallocced 50 GB. Press enter to malloc another 50.\n");
    printf("Note: This seems to use roughly 10 MB of real memory each time.\n");
    getchar();
    goto start;
}

Rahul Deshmukh of Xyratex is looking at this with us, and these are his initial thoughts:
As per my understanding of the code, the osc_lock_enqueue() function enqueues the
lock and does not wait for the network communication. After the reply from the server,
we execute the callback function, i.e. osc_lock_upcall(), for the lock enqueued
through osc_lock_enqueue().

In this case, after the successful enqueue and before we get the reply from the server
(i.e. before osc_lock_upcall() is called), I see in the log that we unused the lock,
and hence the LBUG.

I will investigate more and update accordingly.
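
For readers unfamiliar with this flow, here is a minimal sketch of the asynchronous enqueue/upcall pattern being described. The function names mirror the real ones in lustre/osc/osc_lock.c, but the reduced types, arguments, and bodies below are assumptions for illustration only:

/* Hypothetical, simplified model of the asynchronous enqueue path. */
typedef int (*osc_upcall_t)(void *cookie, int errcode);

struct enqueue_request {
    void        *cookie;   /* the client lock being enqueued */
    osc_upcall_t upcall;   /* osc_lock_upcall() in the real code */
};

/* I/O path: record the upcall, send the enqueue RPC, and return without
 * waiting for the server's reply. */
int osc_lock_enqueue_sketch(struct enqueue_request *req,
                            void *cookie, osc_upcall_t upcall)
{
    req->cookie = cookie;
    req->upcall = upcall;
    /* ...hand the RPC to a ptlrpcd thread; the caller then blocks in
     * cl_lock_state_wait() until the upcall updates the lock state... */
    return 0;
}

/* ptlrpcd thread: when the reply arrives, the saved upcall is invoked.
 * This is where the ASSERTION( lock->cll_state >= CLS_QUEUING ) lives. */
int osc_enqueue_fini_sketch(struct enqueue_request *req, int rc)
{
    return req->upcall(req->cookie, rc);
}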



 Comments   
Comment by Patrick Farrell (Inactive) [ 06/Sep/13 ]

The system where this was replicated is running the patch from LU-3027, i.e. http://review.whamcloud.com/#/c/7481/

I may try the other, not accepted, patch for LU-3027 to see if it makes a difference:
http://review.whamcloud.com/#/c/7482/

Comment by Oleg Drokin [ 06/Sep/13 ]

This patch might be worth a try: http://review.whamcloud.com/7569 (I hit this and other issues too and plan to try it). Meanwhile, we reverted patch 7481 from the b2_4 tree pending investigation.

Comment by Patrick Farrell (Inactive) [ 06/Sep/13 ]

Oleg,

With patch 7569 in place, I can no longer reproduce this bug as described above. Thank you!

Comment by Patrick Farrell (Inactive) [ 06/Sep/13 ]

Oleg: One more question. Did 7481 cause any regressions/problems you are aware of, or was it reverted only because it didn't resolve the issue?

Comment by Cory Spitz [ 09/Sep/13 ]

Patrick, yes, #7481 was removed for a reason. The change was removed from b2_4 with the commit description, "This unexpectedly led to LU-7834, which appears to be an old problem." LU-7834 isn't a valid ticket though.

Comment by Andreas Dilger [ 23/Sep/13 ]

Closing this as a duplicate of LU-3027, since that has a patch that has fixed the problem reported here.

Comment by Sarah Liu [ 01/Oct/13 ]

Reopening this issue since we hit the same error on an FC18 client with build lustre-master #1687, which has the fix from LU-3027.

https://maloo.whamcloud.com/test_sets/3fe5c386-26f8-11e3-94b1-52540035b04c

mds console

14:29:29:LustreError: 4824:0:(cl_lock.c:1414:cl_unuse_try()) result = -108, this is unlikely!
14:29:29:nfsd: non-standard errno: -108
14:29:29:Lustre: lustre-OST0001-osc-ffff88007bd9dc00: Connection restored to lustre-OST0001 (at 10.10.4.253@tcp)
14:29:29:nfsd: non-standard errno: -108
14:29:29:LustreError: 2767:0:(osc_lock.c:511:osc_lock_upcall()) ASSERTION( lock->cll_state >= CLS_QUEUING ) failed: 
14:29:29:LustreError: 2767:0:(osc_lock.c:511:osc_lock_upcall()) LBUG
14:29:29:Pid: 2767, comm: ptlrpcd_0
14:29:29:
14:29:29:Call Trace:
14:29:29: [<ffffffffa0478895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
14:29:29: [<ffffffffa0478e97>] lbug_with_loc+0x47/0xb0 [libcfs]
14:29:29: [<ffffffffa0a05e4a>] osc_lock_upcall+0x44a/0x5f0 [osc]
14:29:29: [<ffffffffa0a05a00>] ? osc_lock_upcall+0x0/0x5f0 [osc]
14:29:29: [<ffffffffa09e67a6>] osc_enqueue_fini+0x106/0x240 [osc]
14:29:29: [<ffffffffa09eb1e2>] osc_enqueue_interpret+0xe2/0x1e0 [osc]
14:29:29: [<ffffffffa076bda4>] ptlrpc_check_set+0x2c4/0x1b40 [ptlrpc]
14:29:29: [<ffffffffa079801b>] ptlrpcd_check+0x53b/0x560 [ptlrpc]
14:29:29: [<ffffffffa079853b>] ptlrpcd+0x20b/0x370 [ptlrpc]
14:29:29: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
14:29:29: [<ffffffffa0798330>] ? ptlrpcd+0x0/0x370 [ptlrpc]
14:29:29: [<ffffffff81096a36>] kthread+0x96/0xa0
14:29:29: [<ffffffff8100c0ca>] child_rip+0xa/0x20
14:29:30: [<ffffffff810969a0>] ? kthread+0x0/0xa0
14:29:30: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
14:29:30:
14:29:30:Kernel panic - not syncing: LBUG
14:29:30:Pid: 2767, comm: ptlrpcd_0 Not tainted 2.6.32-358.18.1.el6_lustre.x86_64 #1
14:29:30:Call Trace:
14:29:30: [<ffffffff8150de58>] ? panic+0xa7/0x16f
14:29:31: [<ffffffffa0478eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
14:29:31: [<ffffffffa0a05e4a>] ? osc_lock_upcall+0x44a/0x5f0 [osc]
14:29:31: [<ffffffffa0a05a00>] ? osc_lock_upcall+0x0/0x5f0 [osc]
14:29:31: [<ffffffffa09e67a6>] ? osc_enqueue_fini+0x106/0x240 [osc]
14:29:31: [<ffffffffa09eb1e2>] ? osc_enqueue_interpret+0xe2/0x1e0 [osc]
14:29:31: [<ffffffffa076bda4>] ? ptlrpc_check_set+0x2c4/0x1b40 [ptlrpc]
14:29:31: [<ffffffffa079801b>] ? ptlrpcd_check+0x53b/0x560 [ptlrpc]
14:29:31: [<ffffffffa079853b>] ? ptlrpcd+0x20b/0x370 [ptlrpc]
14:29:31: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
14:29:31: [<ffffffffa0798330>] ? ptlrpcd+0x0/0x370 [ptlrpc]
14:29:31: [<ffffffff81096a36>] ? kthread+0x96/0xa0
14:29:31: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
14:29:32: [<ffffffff810969a0>] ? kthread+0x0/0xa0
14:29:32: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
14:29:32:Initializing cgroup subsys cpuset
14:29:32:Initializing cgroup subsys cpu
Comment by Oleg Drokin [ 07/Oct/13 ]

After some research, I believe the cause of this other occurrence is different from the first. Unfortunately, we still do not have any crashdumps from our test cluster, which is unacceptable in my opinion.
As a result, we are unable to inspect the Lustre debug log, and we also cannot see what the actual value of cll_state is.

Comment by Patrick Farrell (Inactive) [ 09/Oct/13 ]

Oleg - Do you think the new patch from Jinshan at LU-3027 (http://review.whamcloud.com/#/c/7841/) covers this second case?

Comment by Sarah Liu [ 28/Oct/13 ]

Hit this bug in an interop test between a 2.4.1 server and a 2.5 client:

https://maloo.whamcloud.com/test_sets/166bcee0-3eba-11e3-a21b-52540035b04c

server: 2.4.1 RHEL6 ldiskfs
client: lustre-b2_5 build #2 RHEL6 ldiskfs

client 1 console shows:

00:35:41:Lustre: DEBUG MARKER: == racer test 1: racer on clients: client-32vm5,client-32vm6.lab.whamcloud.com DURATION=900 == 00:35:31 (1382772931)
00:35:41:Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=1 				   /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre2/racer 
00:35:41:Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=1 				   /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/racer 
00:37:46:LustreError: 4025:0:(osc_lock.c:511:osc_lock_upcall()) ASSERTION( lock->cll_state >= CLS_QUEUING ) failed: 
00:37:46:LustreError: 4025:0:(osc_lock.c:511:osc_lock_upcall()) LBUG
00:37:47:Pid: 4025, comm: ptlrpcd_1
00:37:47:
00:37:47:Call Trace:
00:37:47: [<ffffffffa1214895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
00:37:48: [<ffffffffa1214e97>] lbug_with_loc+0x47/0xb0 [libcfs]
00:37:48: [<ffffffffa07b0e5a>] osc_lock_upcall+0x44a/0x5f0 [osc]
00:37:48: [<ffffffffa05d3000>] ? lustre_swab_ldlm_reply+0x0/0x40 [ptlrpc]
00:37:49: [<ffffffffa07b0a10>] ? osc_lock_upcall+0x0/0x5f0 [osc]
00:37:49: [<ffffffffa07917a6>] osc_enqueue_fini+0x106/0x240 [osc]
00:37:50: [<ffffffffa0796202>] osc_enqueue_interpret+0xe2/0x1e0 [osc]
00:37:51: [<ffffffffa05c3e04>] ptlrpc_check_set+0x2c4/0x1b40 [ptlrpc]
00:37:51: [<ffffffffa05ef20b>] ptlrpcd_check+0x53b/0x560 [ptlrpc]
00:37:51: [<ffffffffa05ef72b>] ptlrpcd+0x20b/0x370 [ptlrpc]
00:37:51: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
00:37:52: [<ffffffffa05ef520>] ? ptlrpcd+0x0/0x370 [ptlrpc]
00:37:52: [<ffffffff81096a36>] kthread+0x96/0xa0
00:37:52: [<ffffffff8100c0ca>] child_rip+0xa/0x20
00:37:53: [<ffffffff810969a0>] ? kthread+0x0/0xa0
00:37:54: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
00:37:54:
00:37:54:Kernel panic - not syncing: LBUG
00:37:54:Pid: 4025, comm: ptlrpcd_1 Not tainted 2.6.32-358.18.1.el6.x86_64 #1
00:37:55:Call Trace:
00:37:55: [<ffffffff8150da18>] ? panic+0xa7/0x16f
00:37:55: [<ffffffffa1214eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
00:37:56: [<ffffffffa07b0e5a>] ? osc_lock_upcall+0x44a/0x5f0 [osc]
00:37:56: [<ffffffffa05d3000>] ? lustre_swab_ldlm_reply+0x0/0x40 [ptlrpc]
00:37:56: [<ffffffffa07b0a10>] ? osc_lock_upcall+0x0/0x5f0 [osc]
00:37:56: [<ffffffffa07917a6>] ? osc_enqueue_fini+0x106/0x240 [osc]
00:37:57: [<ffffffffa0796202>] ? osc_enqueue_interpret+0xe2/0x1e0 [osc]
00:37:57: [<ffffffffa05c3e04>] ? ptlrpc_check_set+0x2c4/0x1b40 [ptlrpc]
00:37:58: [<ffffffffa05ef20b>] ? ptlrpcd_check+0x53b/0x560 [ptlrpc]
00:37:58: [<ffffffffa05ef72b>] ? ptlrpcd+0x20b/0x370 [ptlrpc]
00:37:59: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
00:37:59: [<ffffffffa05ef520>] ? ptlrpcd+0x0/0x370 [ptlrpc]
00:37:59: [<ffffffff81096a36>] ? kthread+0x96/0xa0
00:38:00: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
00:38:00: [<ffffffff810969a0>] ? kthread+0x0/0xa0
00:38:01: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
00:38:01:Initializing cgroup subsys cpuset
Comment by Ann Koehler (Inactive) [ 07/Nov/13 ]

Cray is still seeing this LBUG in both 2.4.1 and 2.5 with the patches from LU-3027 applied:

http://review.whamcloud.com/#/c/7569/
http://review.whamcloud.com/#/c/7841/

The rate of occurrence has increased since these patches landed, but there have been many other changes to the systems and testing, so I can't say the increase is related to the patches.

Dumps are available if you want them.

Comment by Sarah Liu [ 09/Nov/13 ]

Hit this error in racer test
server and client: lustre-master build #1751 RHEL6 ldiskfs

https://maloo.whamcloud.com/test_sets/fca9c502-47e7-11e3-a445-52540035b04c

client console

/racer/racer.sh /mnt/lustre/racer 
15:10:34:LustreError: 4980:0:(vvp_io.c:1079:vvp_io_commit_write()) Write page 16966 of inode ffff88006806a6b8 failed -28
15:10:34:LustreError: 4982:0:(vvp_io.c:1079:vvp_io_commit_write()) Write page 16966 of inode ffff88006806a6b8 failed -28
15:10:34:LustreError: 16217:0:(vvp_io.c:1079:vvp_io_commit_write()) Write page 15 of inode ffff88007d0366b8 failed -28
15:10:35:LustreError: 4631:0:(vvp_io.c:1079:vvp_io_commit_write()) Write page 0 of inode ffff88007a4fd638 failed -28
15:10:35:LustreError: 4631:0:(vvp_io.c:1079:vvp_io_commit_write()) Skipped 1 previous similar message
15:10:35:LustreError: 16904:0:(vvp_io.c:1079:vvp_io_commit_write()) Write page 1 of inode ffff88007c6fc678 failed -28
15:10:35:LustreError: 18922:0:(vvp_io.c:1079:vvp_io_commit_write()) Write page 0 of inode ffff88007d73cb38 failed -28
15:10:36:LustreError: 18922:0:(vvp_io.c:1079:vvp_io_commit_write()) Skipped 5 previous similar messages
15:10:36:LustreError: 4613:0:(vvp_io.c:1079:vvp_io_commit_write()) Write page 0 of inode ffff880069823b38 failed -28
15:10:37:LustreError: 4613:0:(vvp_io.c:1079:vvp_io_commit_write()) Skipped 3 previous similar messages
15:10:37:LustreError: 17029:0:(vvp_io.c:1079:vvp_io_commit_write()) Write page 10571 of inode ffff880069ec6b38 failed -28
15:10:37:LustreError: 17029:0:(vvp_io.c:1079:vvp_io_commit_write()) Skipped 5 previous similar messages
15:10:38:LustreError: 4613:0:(vvp_io.c:1079:vvp_io_commit_write()) Write page 0 of inode ffff88002f769178 failed -28
15:10:38:LustreError: 4613:0:(vvp_io.c:1079:vvp_io_commit_write()) Skipped 14 previous similar messages
15:10:38:LustreError: 30383:0:(osc_lock.c:511:osc_lock_upcall()) ASSERTION( lock->cll_state >= CLS_QUEUING ) failed: 
15:10:40:LustreError: 30383:0:(osc_lock.c:511:osc_lock_upcall()) LBUG
15:10:40:Pid: 30383, comm: ptlrpcd_1
15:10:40:
15:10:40:Call Trace:
15:10:40: [<ffffffffa0834895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
15:10:41: [<ffffffffa0834e97>] lbug_with_loc+0x47/0xb0 [libcfs]
15:10:41: [<ffffffffa04a032a>] osc_lock_upcall+0x44a/0x5f0 [osc]
15:10:41: [<ffffffffa0e242b0>] ? lustre_swab_ldlm_reply+0x0/0x40 [ptlrpc]
15:10:42: [<ffffffffa049fee0>] ? osc_lock_upcall+0x0/0x5f0 [osc]
15:10:42: [<ffffffffa04807a6>] osc_enqueue_fini+0x106/0x240 [osc]
15:10:43: [<ffffffffa0485282>] osc_enqueue_interpret+0xe2/0x1e0 [osc]
15:10:43: [<ffffffffa0e15034>] ptlrpc_check_set+0x2c4/0x1b40 [ptlrpc]
15:10:43: [<ffffffffa0e4053b>] ptlrpcd_check+0x53b/0x560 [ptlrpc]
15:10:44: [<ffffffffa0e40a5b>] ptlrpcd+0x20b/0x370 [ptlrpc]
15:10:44: [<ffffffff81063990>] ? default_wake_function+0x0/0x20
15:10:44: [<ffffffffa0e40850>] ? ptlrpcd+0x0/0x370 [ptlrpc]
15:10:44: [<ffffffff81096a36>] kthread+0x96/0xa0
15:10:44: [<ffffffff8100c0ca>] child_rip+0xa/0x20
15:10:44: [<ffffffff810969a0>] ? kthread+0x0/0xa0
15:10:45: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
15:10:45:
15:10:46:Kernel panic - not syncing: LBUG
15:10:46:Pid: 30383, comm: ptlrpcd_1 Not tainted 2.6.32-358.23.2.el6.x86_64 #1
15:10:46:Call Trace:
15:10:46: [<ffffffff8150daac>] ? panic+0xa7/0x16f
15:10:46: [<ffffffffa0834eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
15:10:47: [<ffffffffa04a032a>] ? osc_lock_upcall+0x44a/0x5f0 [osc]
15:10:48: [<ffffffffa0e242b0>] ? lustre_swab_ldlm_reply+0x0/0x40 [ptlrpc]
15:10:48: [<ffffffffa049fee0>] ? osc_lock_upcall+0x0/0x5f0 [osc]
15:10:48: [<ffffffffa04807a6>] ? osc_enqueue_fini+0x106/0x240 [osc]
15:10:49: [<ffffffffa0485282>] ? osc_enqueue_interpret+0xe2/0x1e0 [osc]
15:10:49: [<ffffffffa0e15034>] ? ptlrpc_check_set+0x2c4/0x1b40 [ptlrpc]
15:10:49: [<ffffffffa0e4053b>] ? ptlrpcd_check+0x53b/0x560 [ptlrpc]
15:10:49: [<ffffffffa0e40a5b>] ? ptlrpcd+0x20b/0x370 [ptlrpc]
15:10:50: [<ffffffff81063990>] ? default_wake_function+0x0/0x20
15:10:50: [<ffffffffa0e40850>] ? ptlrpcd+0x0/0x370 [ptlrpc]
15:10:50: [<ffffffff81096a36>] ? kthread+0x96/0xa0
15:10:50: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
15:10:50: [<ffffffff810969a0>] ? kthread+0x0/0xa0
15:10:51: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
15:10:52:Initializing cgroup subsys cpuset
Comment by Cliff White (Inactive) [ 09/Nov/13 ]

Hit this error attempting to test ZFS on Hyperion, running IOR single-shared-file

2013-11-08 17:12:17 LustreError: 78127:0:(osc_lock.c:511:osc_lock_upcall()) ASSERTION( lock->cll_state >= CLS_QUEUING ) failed:
2013-11-08 17:12:17 LustreError: 78134:0:(osc_lock.c:511:osc_lock_upcall()) ASSERTION( lock->cll_state >= CLS_QUEUING ) failed:
2013-11-08 17:12:17 LustreError: 78134:0:(osc_lock.c:511:osc_lock_upcall()) LBUG
2013-11-08 17:12:17 Pid: 78134, comm: ptlrpcd_7
2013-11-08 17:12:17
2013-11-08 17:12:17 Call Trace:
2013-11-08 17:12:17  [<ffffffffa056b895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2013-11-08 17:12:17  [<ffffffffa056be97>] lbug_with_loc+0x47/0xb0 [libcfs]
2013-11-08 17:12:17  [<ffffffffa0dc532a>] osc_lock_upcall+0x44a/0x5f0 [osc]
2013-11-08 17:12:17  [<ffffffffa0ae62b0>] ? lustre_swab_ldlm_reply+0x0/0x40 [ptlrpc]
2013-11-08 17:12:17  [<ffffffffa0dc4ee0>] ? osc_lock_upcall+0x0/0x5f0 [osc]
2013-11-08 17:12:17  [<ffffffffa0da57a6>] osc_enqueue_fini+0x106/0x240 [osc]
2013-11-08 17:12:17  [<ffffffffa0daa282>] osc_enqueue_interpret+0xe2/0x1e0 [osc]
2013-11-08 17:12:17  [<ffffffffa0ad7034>] ptlrpc_check_set+0x2c4/0x1b40 [ptlrpc]
2013-11-08 17:12:17  [<ffffffffa0b0253b>] ptlrpcd_check+0x53b/0x560 [ptlrpc]
2013-11-08 17:12:17  [<ffffffffa0b02a5b>] ptlrpcd+0x20b/0x370 [ptlrpc]
2013-11-08 17:12:17  [<ffffffff81063990>] ? default_wake_function+0x0/0x20
2013-11-08 17:12:17  [<ffffffffa0b02850>] ? ptlrpcd+0x0/0x370 [ptlrpc]
2013-11-08 17:12:17  [<ffffffff81096a36>] kthread+0x96/0xa0
2013-11-08 17:12:17  [<ffffffff8100c0ca>] child_rip+0xa/0x20
2013-11-08 17:12:17  [<ffffffff810969a0>] ? kthread+0x0/0xa0
2013-11-08 17:12:17  [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
2013-11-08 17:12:17
2013-11-08 17:12:17 LustreError: dumping log to /tmp/lustre-log.1383959537.78134
Comment by Cliff White (Inactive) [ 11/Nov/13 ]

I am repeating the IOR test, but calling 'sync' on the clients every 180 seconds. So far I have not hit the LBUG. It was quite repeatable previously.

Comment by Nathaniel Clark [ 11/Nov/13 ]

Fix issues with parallel-scale
http://review.whamcloud.com/8234

Comment by Jinshan Xiong (Inactive) [ 12/Nov/13 ]

There is almost no log for this issue. Is it possible to reproduce it with the latest build and collect logs? Finding a reproducible path would also be useful for verifying the patch in the future.

Comment by Patrick Farrell (Inactive) [ 12/Nov/13 ]

Jinshan - Our earlier reproducer (found in the ticket description) was fixed by the earlier patches.

Since getting those patches, Cray has only been hitting this unpredictably, during large test runs, and we haven't hit it with more than default debugging enabled. We've tried one of these runs with full debugging and didn't hit it.

Comment by Patrick Farrell (Inactive) [ 13/Nov/13 ]

A bit of good news. I hit this with +dlmtrace and +rpctrace enabled on master from today (with an unrelated patch for LU-4185).

I've uploaded the dump to ftp.whamcloud.com in LU-3889. The dump is called lu-3889-131112-client.tar.gz.

Edit:
I hit this bug by running racer.sh.

Edit again:
I was eventually able to get the logs; it just took a bit over an hour to extract the 1 GB log buffer from the dump. That's probably much larger than is needed, but I'd rather have it too large than too small.
The logs are now up as well:
lu-3889-131112-client_logs.tar.gz

Comment by Jinshan Xiong (Inactive) [ 13/Nov/13 ]

Hi Patrick,

Thank you, I'll take a look at the log. Can you please tell me what parameters you used to run racer?

Jinshan

Comment by Patrick Farrell (Inactive) [ 13/Nov/13 ]

Jinshan - I ran it without any parameters. I simply edited it to change the time limit so it would run indefinitely and executed it (the racer.sh found in lustre/tests/racer/racer.sh)

I hit the bug after a bit less than two hours with just +dlmtrace and +rpctrace. Earlier, I ran for about 12 hours with debug=-1 and didn't hit the bug.

Comment by Jinshan Xiong (Inactive) [ 13/Nov/13 ]

I can reproduce this issue and I'm working on it now.

Comment by Alexander Boyko [ 26/Nov/13 ]

Hi, the root cause is bad lock cleanup in the situation where a process gets a fatal signal. Here is the fault analysis.
1. lov_lock_enqueue() enqueues the sublocks.
2. osc_enqueue() sends the RPC to the server.
3. cl_enqueue_locked()->cl_lock_state_wait() finishes with -ERESTARTSYS.
4. cl_enqueue_locked()->cl_unuse_try() unuses the locks; the ENQUEUED lock becomes NEW.

00000020:00000001:29.0:1385413027.287708:0:26152:0:(cl_lock.c:1366:cl_unuse_try()) Process entered
00000020:00010000:29.0:1385413027.287709:0:26152:0:(cl_lock.c:150:cl_lock_trace0()) unuse lock: ffff880e6998b078@(3 ffff880e2ad1d0c0 1 2 0 2 1 0)(ffff880e6795ff08/1/0) at cl_unuse_try():1367
00000020:00010000:29.0:1385413027.287721:0:26152:0:(cl_lock.c:150:cl_lock_trace0()) hold release lock: ffff880e6998b078@(4 ffff880e2ad1d0c0 1 0 0 3 0 0)(ffff880e6795ff08/1/0) at cl_lock_hold_release():907

5. The reply arrives from the server.
6. osc_lock_upcall() hits ASSERTION( lock->cll_state >= CLS_QUEUING ).

The main problem is that the lock state changes from ENQUEUED to NEW before the reply has arrived, and this change happens in the error handling of cl_enqueue_locked(). The cl_unuse_try() call in cl_enqueue_locked() was introduced by patch http://review.whamcloud.com/#/c/2654.
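
To make the sequence above concrete, here is a minimal user-space sketch of the race. The state names follow cl_object.h, but the reduced struct, the kernel-only ERESTARTSYS value, and the function bodies are assumptions for illustration only:

#include <assert.h>

#define ERESTARTSYS 512   /* kernel-internal errno returned on a fatal signal */

/* Only the lock state is modeled; the order matches the real enum. */
enum cl_lock_state { CLS_NEW, CLS_QUEUING, CLS_ENQUEUED, CLS_HELD };

struct cl_lock {
    enum cl_lock_state cll_state;
};

/* Thread A: the enqueuing process is hit by a fatal signal. */
int cl_enqueue_locked_sketch(struct cl_lock *lock)
{
    lock->cll_state = CLS_ENQUEUED;         /* the RPC is already in flight */
    /* cl_lock_state_wait() returns -ERESTARTSYS on the fatal signal and the
     * error path calls cl_unuse_try(): ENQUEUED -> NEW. */
    lock->cll_state = CLS_NEW;
    return -ERESTARTSYS;
}

/* Thread B (ptlrpcd): the server reply arrives after the unuse. */
void osc_lock_upcall_sketch(struct cl_lock *lock)
{
    /* The upcall assumes the enqueue is still in progress. */
    assert(lock->cll_state >= CLS_QUEUING); /* fires: the state is CLS_NEW */
}

int main(void)
{
    struct cl_lock lock = { .cll_state = CLS_QUEUING };

    cl_enqueue_locked_sketch(&lock);  /* the enqueuing thread gives up       */
    osc_lock_upcall_sketch(&lock);    /* the reply handler hits the "LASSERT" */
    return 0;
}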

Jinshan, what do you think? What is the best way to fix this?

Comment by Jinshan Xiong (Inactive) [ 26/Nov/13 ]

Hi Boyko,

I think you're right. The root cause of this issue is that a CLS_ENQUEUED lock was unused, which confused the code at osc_lock_upcall(). Thanks for looking at it.

I pushed a patch at http://review.whamcloud.com/8405, please give it a try.

Jinshan

Comment by Patrick Farrell (Inactive) [ 26/Nov/13 ]

Jinshan, Alex,

We just tested this patch with racer, and hit the following LBUG:

2013-11-26T17:06:49.095392-06:00 c1-0c0s1n1 LustreError: 23630:0:(cl_lock.c:1114:cl_use_try()) ASSERTION( result != -38 ) failed:
2013-11-26T17:06:49.095417-06:00 c1-0c0s1n1 LustreError: 23630:0:(cl_lock.c:1114:cl_use_try()) LBUG
2013-11-26T17:06:49.095442-06:00 c1-0c0s1n1 Pid: 23630, comm: cat
2013-11-26T17:06:49.120814-06:00 c1-0c0s1n1 Call Trace:
2013-11-26T17:06:49.120830-06:00 c1-0c0s1n1 [<ffffffff81005db9>] try_stack_unwind+0x169/0x1b0
2013-11-26T17:06:49.120842-06:00 c1-0c0s1n1 [<ffffffff81004849>] dump_trace+0x89/0x450
2013-11-26T17:06:49.146152-06:00 c1-0c0s1n1 [<ffffffffa02298d7>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
2013-11-26T17:06:49.146167-06:00 c1-0c0s1n1 [<ffffffffa0229e37>] lbug_with_loc+0x47/0xc0 [libcfs]
2013-11-26T17:06:49.146186-06:00 c1-0c0s1n1 [<ffffffffa037ee8b>] cl_use_try+0x29b/0x2d0 [obdclass]
2013-11-26T17:06:49.171468-06:00 c1-0c0s1n1 [<ffffffffa037f01d>] cl_enqueue_try+0x15d/0x320 [obdclass]
2013-11-26T17:06:49.171484-06:00 c1-0c0s1n1 [<ffffffffa037feaf>] cl_enqueue_locked+0x7f/0x1f0 [obdclass]
2013-11-26T17:06:49.196788-06:00 c1-0c0s1n1 [<ffffffffa0380ace>] cl_lock_request+0x7e/0x270 [obdclass]
2013-11-26T17:06:49.196815-06:00 c1-0c0s1n1 [<ffffffffa03860ce>] cl_io_lock+0x39e/0x5d0 [obdclass]
2013-11-26T17:06:49.196832-06:00 c1-0c0s1n1 [<ffffffffa03863a2>] cl_io_loop+0xa2/0x1b0 [obdclass]
2013-11-26T17:06:49.222102-06:00 c1-0c0s1n1 [<ffffffffa0811d8f>] ll_file_io_generic+0x3bf/0x5f0 [lustre]
2013-11-26T17:06:49.222126-06:00 c1-0c0s1n1 [<ffffffffa081249b>] ll_file_aio_read+0x23b/0x290 [lustre]
2013-11-26T17:06:49.247420-06:00 c1-0c0s1n1 [<ffffffffa08130aa>] ll_file_read+0x1fa/0x290 [lustre]
2013-11-26T17:06:49.247451-06:00 c1-0c0s1n1 [<ffffffff81143868>] vfs_read+0xc8/0x180
2013-11-26T17:06:49.247481-06:00 c1-0c0s1n1 [<ffffffff81143a25>] sys_read+0x55/0x90
2013-11-26T17:06:49.247497-06:00 c1-0c0s1n1 [<ffffffff8138d7ab>] system_call_fastpath+0x16/0x1b
2013-11-26T17:06:49.247509-06:00 c1-0c0s1n1 [<00002aaaaad98190>] 0x2aaaaad98190
2013-11-26T17:06:49.386314-06:00 c1-0c0s1n1 Kernel panic - not syncing: LBUG
2013-11-26T17:06:49.437010-06:00 c1-0c0s1n1 Pid: 23630, comm: cat Tainted: P 3.0.80-0.5.1_1.0501.7664-cray_ari_c #1
2013-11-26T17:06:49.437035-06:00 c1-0c0s1n1 Call Trace:
2013-11-26T17:06:49.437055-06:00 c1-0c0s1n1 [<ffffffff81005db9>] try_stack_unwind+0x169/0x1b0
2013-11-26T17:06:49.462343-06:00 c1-0c0s1n1 [<ffffffff81004849>] dump_trace+0x89/0x450
2013-11-26T17:06:49.462372-06:00 c1-0c0s1n1 [<ffffffff8100581c>] show_trace_log_lvl+0x5c/0x80
2013-11-26T17:06:49.462388-06:00 c1-0c0s1n1 [<ffffffff81005855>] show_trace+0x15/0x20
2013-11-26T17:06:49.462408-06:00 c1-0c0s1n1 [<ffffffff813829db>] dump_stack+0x79/0x84
2013-11-26T17:06:49.462419-06:00 c1-0c0s1n1 [<ffffffff81382a7a>] panic+0x94/0x1d2
2013-11-26T17:06:49.487630-06:00 c1-0c0s1n1 [<ffffffffa0229e9b>] lbug_with_loc+0xab/0xc0 [libcfs]
2013-11-26T17:06:49.487654-06:00 c1-0c0s1n1 [<ffffffffa037ee8b>] cl_use_try+0x29b/0x2d0 [obdclass]
2013-11-26T17:06:49.513026-06:00 c1-0c0s1n1 [<ffffffffa037f01d>] cl_enqueue_try+0x15d/0x320 [obdclass]
2013-11-26T17:06:49.513051-06:00 c1-0c0s1n1 [<ffffffffa037feaf>] cl_enqueue_locked+0x7f/0x1f0 [obdclass]
2013-11-26T17:06:49.513067-06:00 c1-0c0s1n1 [<ffffffffa0380ace>] cl_lock_request+0x7e/0x270 [obdclass]
2013-11-26T17:06:49.538364-06:00 c1-0c0s1n1 [<ffffffffa03860ce>] cl_io_lock+0x39e/0x5d0 [obdclass]
2013-11-26T17:06:49.538389-06:00 c1-0c0s1n1 [<ffffffffa03863a2>] cl_io_loop+0xa2/0x1b0 [obdclass]
2013-11-26T17:06:49.538410-06:00 c1-0c0s1n1 [<ffffffffa0811d8f>] ll_file_io_generic+0x3bf/0x5f0 [lustre]
2013-11-26T17:06:49.563613-06:00 c1-0c0s1n1 [<ffffffffa081249b>] ll_file_aio_read+0x23b/0x290 [lustre]
2013-11-26T17:06:49.563637-06:00 c1-0c0s1n1 [<ffffffffa08130aa>] ll_file_read+0x1fa/0x290 [lustre]
2013-11-26T17:06:49.588991-06:00 c1-0c0s1n1 [<ffffffff81143868>] vfs_read+0xc8/0x180
2013-11-26T17:06:49.589016-06:00 c1-0c0s1n1 [<ffffffff81143a25>] sys_read+0x55/0x90
2013-11-26T17:06:49.589034-06:00 c1-0c0s1n1 [<ffffffff8138d7ab>] system_call_fastpath+0x16/0x1b
2013-11-26T17:06:49.589046-06:00 c1-0c0s1n1 [<00002aaaaad98190>] 0x2aaaaad9818f


We're trying to get debug logs from that attempt now.

One note of caution: Due to a network problem on the system, we're getting a lot of EBUSYs as well. The LBUG seems more likely to be related to the patch, but I just wanted you to know.

Comment by Jinshan Xiong (Inactive) [ 26/Nov/13 ]

did you apply other patches?

Comment by Patrick Farrell (Inactive) [ 26/Nov/13 ]

This is a Cray version of 2.5 - At this point, it's extremely close to the recent 2.5 release, but with this patch and that from LU-4152, which should be unrelated.

We're still trying to get debug logs, I'll let you know if we can get something there.

Update: We've had to table our attempt to get debug logs for tonight. The soonest we'll have more is tomorrow, sorry.
We have a dump if you'd like to take a look, but the debug was set to default levels, so I don't know if it will help much.

Comment by Jinshan Xiong (Inactive) [ 27/Nov/13 ]

From the message, it seems like ->clo_use is not defined which is impossible. Can you please print out the lock information with cl_lock_print()?

Can you also provide me a test program you're running to reproduce this issue?

Comment by Alexey Lyashkov [ 27/Nov/13 ]

Jay,

The reproducer is likely to be racer.

Comment by Patrick Farrell (Inactive) [ 27/Nov/13 ]

Jinshan,

Alex is right, we're using racer. Our first priority this morning is starting testing on a clean system to see what happens; once that's going, I'll look at printing out the lock info.

Here's a more detailed description of how we ran racer, with some of the environment variables replaced with command line options:
racer.sh -t 3600 -T 7 -f 20 -l -d $TMPDIR/race.3
^-- -t is time, -T is threads, -f is files, -d is directory, and -l is an option we added telling it it's on Lustre and to run the Lustre-specific tests.

Comment by Alexey Lyashkov [ 27/Nov/13 ]

Patrick,

Could you try a test patch without the changes in osc? Alex Boyko has been working on a test to verify that all the races are solved by the patch, so we expect to have verification in the next few days.

Comment by Patrick Farrell (Inactive) [ 27/Nov/13 ]

Alex - Sure. Same patch, with only the LOV changes?

With the original patch as proposed by Jinshan, our testing on the system without the network issues is going well.

I left out one line of log from the earlier problem:
2013-11-26T17:06:48.862188-06:00 c1-0c0s1n1 LustreError: 23427:0:(lcommon_cl.c:1209:cl_file_inode_init()) Failure to initialize cl object [0x20a750c50:0xac35:0x0]: -16
We were getting these repeatedly, due to the network problem on our system.

Comment by Jinshan Xiong (Inactive) [ 27/Nov/13 ]

let me try to reproduce it.

Comment by Patrick Farrell (Inactive) [ 27/Nov/13 ]

Jinshan - We just tried for about 4 hours on ~20 clients with your full version of the patch and didn't hit any problems at all.

We're going to try without the OSC changes next.

Comment by Alexey Lyashkov [ 27/Nov/13 ]

Jay,

can you explain why you introduced OBD_FAIL_LOCK_STATE_WAIT_INTR but no tests use it?

Comment by Jinshan Xiong (Inactive) [ 27/Nov/13 ]

> Jinshan - We just tried for about 4 hours on ~20 clients with your full version of the patch and didn't hit any problems at all.
> We're going to try without the OSC changes next.

Sorry, I'm confused: with which version did you not see the problem, and which version caused the assertion?

Comment by Patrick Farrell (Inactive) [ 27/Nov/13 ]

Jinshan,

We tested your version of the patch yesterday on a system which had network problems that were causing EBUSY over and over. While testing on that system, we hit the assertion I described in this comment: https://jira.hpdd.intel.com/browse/LU-3889?focusedCommentId=72356&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-72356

This morning, we tested for four hours with your version of the patch, on a different system (without network problems), and have had no problems.

We are now testing the version of the patch without the OSC changes. We've been testing for about an hour, and have had no problems yet. We're going to test for a few more hours today, I will let you know if we see anything.

Comment by Patrick Farrell (Inactive) [ 27/Nov/13 ]

We did about 3 hours of testing on the version of the patch without OSC changes. No problems were seen.

We have a 24 hour general test run scheduled on one of our systems for this weekend. We're currently planning to test the version without OSC changes as suggested by Shadow, but if a new patch is generated, I could change the test run to test that instead.

Comment by Jinshan Xiong (Inactive) [ 01/Dec/13 ]

Hi Patrick, it's fine to take out the OSC changes.

Comment by Patrick Farrell (Inactive) [ 02/Dec/13 ]

Unfortunately, our test run failed due to unrelated reasons. We're going to do another later this week, I'll update with results as I have them.

Comment by Alexander Boyko [ 03/Dec/13 ]

I have added a regression test for this issue: http://review.whamcloud.com/8463. I think it should be included in the patch, but the test is based on the ASSERT at osc_lock_upcall().

Comment by Jinshan Xiong (Inactive) [ 11/Dec/13 ]

Hi Boyko, I merged your patch into patch 8405, please take a look and thank you for your work.

Jinshan

Comment by Jian Yu [ 19/Dec/13 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/69/ (2.4.2 RC1)
MDSCOUNT=4

racer test hit the same failure:
https://maloo.whamcloud.com/test_sets/06c7507e-6875-11e3-a9a3-52540035b04c

parallel-scale-nfsv4 test iorssf hit the same failure:
https://maloo.whamcloud.com/test_sets/d1809a06-6879-11e3-a9a3-52540035b04c

Comment by Jian Yu [ 20/Dec/13 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/69/ (2.4.2 RC1)
Distro/Arch: RHEL6.4/x86_64
MDSCOUNT=1

racer test hit the same failure:
https://maloo.whamcloud.com/test_sets/d15a7052-68ff-11e3-ab68-52540035b04c
https://maloo.whamcloud.com/test_sets/1cb24ab0-691f-11e3-8dc5-52540035b04c

Comment by Jian Yu [ 02/Jan/14 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/5
Distro/Arch: RHEL6.4/x86_64
MDSCOUNT=1

racer test hit the same failure:
https://maloo.whamcloud.com/test_sets/c3410662-7362-11e3-8412-52540035b04c

Comment by Bob Glossman (Inactive) [ 03/Jan/14 ]

in b2_5: http://review.whamcloud.com/8717

Comment by Peter Jones [ 04/Jan/14 ]

Landed for 2.5.1 and 2.6

Comment by James A Simmons [ 16/Jan/14 ]

Also hit this bug for 2.4

Comment by Niu Yawei (Inactive) [ 10/Feb/14 ]

for b2_4: http://review.whamcloud.com/9194

Comment by Nathaniel Clark [ 16/Jun/14 ]

Patch 8234 never made it into b2_5, while 7778 (which had a typo) did:

http://review.whamcloud.com/10731
