[LU-3889] LBUG: (osc_lock.c:497:osc_lock_upcall()) ASSERTION( lock->cll_state >= CLS_QUEUING ) Created: 05/Sep/13 Updated: 16/Jun/14 Resolved: 04/Jan/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0, Lustre 2.6.0, Lustre 2.4.2 |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.5.1 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Patrick Farrell (Inactive) | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | HB, mn4 | ||
| Environment: |
CentOS 6.4 running fairly recent master. |
||
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 10138 |
| Description |
|
This assertion is hit on a CentOS client system running master. It's also been noticed on Cray SLES clients running 2.4. This is fairly easy to reproduce on CentOS. I'll be attaching a log of this with debug=-1 set. (I was also running a special debug patch for this bug called rdebug, so you may see some extra output from that.)

Two things are needed: a reproducer script, and memory pressure on the system.

The reproducer is the following shell script - this was originally a test for a different bug, so I'm not sure if every step is needed - run in a folder with at least a few thousand files in it. [It may work with a smaller number of files; I'm just describing how I've reproduced it.]

    for idx in $(seq 0 10000); do
        time ls -laR > /dev/null
        touch somefile
        rm -f somefiles
        echo $idx: $(date +%T) $(grep MemFree /proc/meminfo)
    done

Memory pressure comes from the C program below - simply hold down enter and watch the test script output as free memory drops. Once you're down to a small amount free, the total amount of free memory will stop dropping. Then simply keep holding down enter to continue memory pressure, and the bug will happen after a few moments.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main()
    {
        int i;
        char *junk;

    start:
        i = 0;
        while (i < 50) {
            printf("Malloc!\n");
            junk = malloc(1024*1024*1024);
            junk[0] = i;
            i++;
        }
        printf("Mallocced 50 GB. Press enter to malloc another 50.\n");
        printf("Note: This seems to use roughly 10 MB of real memory each time.\n");
        getchar();
        goto start;
    }

Rahul Deshmukh of Xyratex is looking at this with us, and these are his initial thoughts:
In this case after successful enqueue and before we get reply from server
I will investigate more and update accordingly. |
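A minimal sketch of how the two pieces above might be driven together. The file names, the target directory, and the two-terminal split are illustrative assumptions, not part of the original report:

    # Terminal 1: build and start the memory-pressure helper (source as above,
    # saved as memhog.c - illustrative name); keep pressing Enter to allocate
    # another 50 GB of mostly-untouched address space each round.
    gcc -o memhog memhog.c
    ./memhog

    # Terminal 2: run the reproducer loop (the for-loop above, saved as
    # reproducer.sh - illustrative name) in a Lustre directory that already
    # contains a few thousand files.
    cd /mnt/lustre/some_populated_dir && sh reproducer.sh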
| Comments |
| Comment by Patrick Farrell (Inactive) [ 06/Sep/13 ] |
|
The system where this was replicated is running the patch from I may try the other, not accepted, patch for |
| Comment by Oleg Drokin [ 06/Sep/13 ] |
|
This patch might be worth a try: http://review.whamcloud.com/7569 (I hit this and other issues too and plan to try it). Meanwhile we reverted patch 7481 from the b2_4 tree pending investigation. |
| Comment by Patrick Farrell (Inactive) [ 06/Sep/13 ] |
|
Oleg, With patch 7569 in place, I can no longer reproduce this bug as described above. Thank you! |
| Comment by Patrick Farrell (Inactive) [ 06/Sep/13 ] |
|
Oleg: One more question. Did 7481 cause any regressions/problems you are aware of, or was it reverted only because it didn't resolve the issue? |
| Comment by Cory Spitz [ 09/Sep/13 ] |
|
Patrick, Yes, #7481 was removed for a reason. The change was removed from b2_4 with the commit description, "This uenxpectedly let do LU-7834 whic appears to be old problem." LU-7834 isn't a valid ticket though. |
| Comment by Andreas Dilger [ 23/Sep/13 ] |
|
Closing this as a duplicate of |
| Comment by Sarah Liu [ 01/Oct/13 ] |
|
Reopening this issue since the same error was hit on an FC18 client with build lustre-master #1687; build #1687 has the fix of
https://maloo.whamcloud.com/test_sets/3fe5c386-26f8-11e3-94b1-52540035b04c

mds console:

    14:29:29:LustreError: 4824:0:(cl_lock.c:1414:cl_unuse_try()) result = -108, this is unlikely!
    14:29:29:nfsd: non-standard errno: -108
    14:29:29:Lustre: lustre-OST0001-osc-ffff88007bd9dc00: Connection restored to lustre-OST0001 (at 10.10.4.253@tcp)
    14:29:29:nfsd: non-standard errno: -108
    14:29:29:LustreError: 2767:0:(osc_lock.c:511:osc_lock_upcall()) ASSERTION( lock->cll_state >= CLS_QUEUING ) failed:
    14:29:29:LustreError: 2767:0:(osc_lock.c:511:osc_lock_upcall()) LBUG
    14:29:29:Pid: 2767, comm: ptlrpcd_0
    14:29:29:
    14:29:29:Call Trace:
    14:29:29: [<ffffffffa0478895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
    14:29:29: [<ffffffffa0478e97>] lbug_with_loc+0x47/0xb0 [libcfs]
    14:29:29: [<ffffffffa0a05e4a>] osc_lock_upcall+0x44a/0x5f0 [osc]
    14:29:29: [<ffffffffa0a05a00>] ? osc_lock_upcall+0x0/0x5f0 [osc]
    14:29:29: [<ffffffffa09e67a6>] osc_enqueue_fini+0x106/0x240 [osc]
    14:29:29: [<ffffffffa09eb1e2>] osc_enqueue_interpret+0xe2/0x1e0 [osc]
    14:29:29: [<ffffffffa076bda4>] ptlrpc_check_set+0x2c4/0x1b40 [ptlrpc]
    14:29:29: [<ffffffffa079801b>] ptlrpcd_check+0x53b/0x560 [ptlrpc]
    14:29:29: [<ffffffffa079853b>] ptlrpcd+0x20b/0x370 [ptlrpc]
    14:29:29: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
    14:29:29: [<ffffffffa0798330>] ? ptlrpcd+0x0/0x370 [ptlrpc]
    14:29:29: [<ffffffff81096a36>] kthread+0x96/0xa0
    14:29:29: [<ffffffff8100c0ca>] child_rip+0xa/0x20
    14:29:30: [<ffffffff810969a0>] ? kthread+0x0/0xa0
    14:29:30: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
    14:29:30:
    14:29:30:Kernel panic - not syncing: LBUG
    14:29:30:Pid: 2767, comm: ptlrpcd_0 Not tainted 2.6.32-358.18.1.el6_lustre.x86_64 #1
    14:29:30:Call Trace:
    14:29:30: [<ffffffff8150de58>] ? panic+0xa7/0x16f
    14:29:31: [<ffffffffa0478eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
    14:29:31: [<ffffffffa0a05e4a>] ? osc_lock_upcall+0x44a/0x5f0 [osc]
    14:29:31: [<ffffffffa0a05a00>] ? osc_lock_upcall+0x0/0x5f0 [osc]
    14:29:31: [<ffffffffa09e67a6>] ? osc_enqueue_fini+0x106/0x240 [osc]
    14:29:31: [<ffffffffa09eb1e2>] ? osc_enqueue_interpret+0xe2/0x1e0 [osc]
    14:29:31: [<ffffffffa076bda4>] ? ptlrpc_check_set+0x2c4/0x1b40 [ptlrpc]
    14:29:31: [<ffffffffa079801b>] ? ptlrpcd_check+0x53b/0x560 [ptlrpc]
    14:29:31: [<ffffffffa079853b>] ? ptlrpcd+0x20b/0x370 [ptlrpc]
    14:29:31: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
    14:29:31: [<ffffffffa0798330>] ? ptlrpcd+0x0/0x370 [ptlrpc]
    14:29:31: [<ffffffff81096a36>] ? kthread+0x96/0xa0
    14:29:31: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
    14:29:32: [<ffffffff810969a0>] ? kthread+0x0/0xa0
    14:29:32: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
    14:29:32:Initializing cgroup subsys cpuset
    14:29:32:Initializing cgroup subsys cpu |
| Comment by Oleg Drokin [ 07/Oct/13 ] |
|
After some research, I believe the cause of this other occurrence is different from the first. Unfortunately we still do not have any crashdumps from our test cluster, which is unacceptable in my opinion. |
| Comment by Patrick Farrell (Inactive) [ 09/Oct/13 ] |
|
Oleg - Do you think the new patch from Jinshan at |
| Comment by Sarah Liu [ 28/Oct/13 ] |
|
Hit this bug in interop test between 2.4.1 server and 2.5 client:
https://maloo.whamcloud.com/test_sets/166bcee0-3eba-11e3-a21b-52540035b04c

server: 2.4.1 RHEL6 ldiskfs

client 1 console shows:

    00:35:41:Lustre: DEBUG MARKER: == racer test 1: racer on clients: client-32vm5,client-32vm6.lab.whamcloud.com DURATION=900 == 00:35:31 (1382772931)
    00:35:41:Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=1 /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre2/racer
    00:35:41:Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=1 /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/racer
    00:37:46:LustreError: 4025:0:(osc_lock.c:511:osc_lock_upcall()) ASSERTION( lock->cll_state >= CLS_QUEUING ) failed:
    00:37:46:LustreError: 4025:0:(osc_lock.c:511:osc_lock_upcall()) LBUG
    00:37:47:Pid: 4025, comm: ptlrpcd_1
    00:37:47:
    00:37:47:Call Trace:
    00:37:47: [<ffffffffa1214895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
    00:37:48: [<ffffffffa1214e97>] lbug_with_loc+0x47/0xb0 [libcfs]
    00:37:48: [<ffffffffa07b0e5a>] osc_lock_upcall+0x44a/0x5f0 [osc]
    00:37:48: [<ffffffffa05d3000>] ? lustre_swab_ldlm_reply+0x0/0x40 [ptlrpc]
    00:37:49: [<ffffffffa07b0a10>] ? osc_lock_upcall+0x0/0x5f0 [osc]
    00:37:49: [<ffffffffa07917a6>] osc_enqueue_fini+0x106/0x240 [osc]
    00:37:50: [<ffffffffa0796202>] osc_enqueue_interpret+0xe2/0x1e0 [osc]
    00:37:51: [<ffffffffa05c3e04>] ptlrpc_check_set+0x2c4/0x1b40 [ptlrpc]
    00:37:51: [<ffffffffa05ef20b>] ptlrpcd_check+0x53b/0x560 [ptlrpc]
    00:37:51: [<ffffffffa05ef72b>] ptlrpcd+0x20b/0x370 [ptlrpc]
    00:37:51: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
    00:37:52: [<ffffffffa05ef520>] ? ptlrpcd+0x0/0x370 [ptlrpc]
    00:37:52: [<ffffffff81096a36>] kthread+0x96/0xa0
    00:37:52: [<ffffffff8100c0ca>] child_rip+0xa/0x20
    00:37:53: [<ffffffff810969a0>] ? kthread+0x0/0xa0
    00:37:54: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
    00:37:54:
    00:37:54:Kernel panic - not syncing: LBUG
    00:37:54:Pid: 4025, comm: ptlrpcd_1 Not tainted 2.6.32-358.18.1.el6.x86_64 #1
    00:37:55:Call Trace:
    00:37:55: [<ffffffff8150da18>] ? panic+0xa7/0x16f
    00:37:55: [<ffffffffa1214eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
    00:37:56: [<ffffffffa07b0e5a>] ? osc_lock_upcall+0x44a/0x5f0 [osc]
    00:37:56: [<ffffffffa05d3000>] ? lustre_swab_ldlm_reply+0x0/0x40 [ptlrpc]
    00:37:56: [<ffffffffa07b0a10>] ? osc_lock_upcall+0x0/0x5f0 [osc]
    00:37:56: [<ffffffffa07917a6>] ? osc_enqueue_fini+0x106/0x240 [osc]
    00:37:57: [<ffffffffa0796202>] ? osc_enqueue_interpret+0xe2/0x1e0 [osc]
    00:37:57: [<ffffffffa05c3e04>] ? ptlrpc_check_set+0x2c4/0x1b40 [ptlrpc]
    00:37:58: [<ffffffffa05ef20b>] ? ptlrpcd_check+0x53b/0x560 [ptlrpc]
    00:37:58: [<ffffffffa05ef72b>] ? ptlrpcd+0x20b/0x370 [ptlrpc]
    00:37:59: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
    00:37:59: [<ffffffffa05ef520>] ? ptlrpcd+0x0/0x370 [ptlrpc]
    00:37:59: [<ffffffff81096a36>] ? kthread+0x96/0xa0
    00:38:00: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
    00:38:00: [<ffffffff810969a0>] ? kthread+0x0/0xa0
    00:38:01: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
    00:38:01:Initializing cgroup subsys cpuset |
| Comment by Ann Koehler (Inactive) [ 07/Nov/13 ] |
|
Cray is still seeing this LBUG in both 2.4.1 and 2.5 with the patches from http://review.whamcloud.com/#/c/7569/ applied. The rate of occurrence has increased since these patches landed, but there have been many other changes to the systems and testing, so I can't say the increase is related to the patches. Dumps are available if you want them. |
| Comment by Sarah Liu [ 09/Nov/13 ] |
|
Hit this error in racer test
https://maloo.whamcloud.com/test_sets/fca9c502-47e7-11e3-a445-52540035b04c

client console:

    /racer/racer.sh /mnt/lustre/racer
    15:10:34:LustreError: 4980:0:(vvp_io.c:1079:vvp_io_commit_write()) Write page 16966 of inode ffff88006806a6b8 failed -28
    15:10:34:LustreError: 4982:0:(vvp_io.c:1079:vvp_io_commit_write()) Write page 16966 of inode ffff88006806a6b8 failed -28
    15:10:34:LustreError: 16217:0:(vvp_io.c:1079:vvp_io_commit_write()) Write page 15 of inode ffff88007d0366b8 failed -28
    15:10:35:LustreError: 4631:0:(vvp_io.c:1079:vvp_io_commit_write()) Write page 0 of inode ffff88007a4fd638 failed -28
    15:10:35:LustreError: 4631:0:(vvp_io.c:1079:vvp_io_commit_write()) Skipped 1 previous similar message
    15:10:35:LustreError: 16904:0:(vvp_io.c:1079:vvp_io_commit_write()) Write page 1 of inode ffff88007c6fc678 failed -28
    15:10:35:LustreError: 18922:0:(vvp_io.c:1079:vvp_io_commit_write()) Write page 0 of inode ffff88007d73cb38 failed -28
    15:10:36:LustreError: 18922:0:(vvp_io.c:1079:vvp_io_commit_write()) Skipped 5 previous similar messages
    15:10:36:LustreError: 4613:0:(vvp_io.c:1079:vvp_io_commit_write()) Write page 0 of inode ffff880069823b38 failed -28
    15:10:37:LustreError: 4613:0:(vvp_io.c:1079:vvp_io_commit_write()) Skipped 3 previous similar messages
    15:10:37:LustreError: 17029:0:(vvp_io.c:1079:vvp_io_commit_write()) Write page 10571 of inode ffff880069ec6b38 failed -28
    15:10:37:LustreError: 17029:0:(vvp_io.c:1079:vvp_io_commit_write()) Skipped 5 previous similar messages
    15:10:38:LustreError: 4613:0:(vvp_io.c:1079:vvp_io_commit_write()) Write page 0 of inode ffff88002f769178 failed -28
    15:10:38:LustreError: 4613:0:(vvp_io.c:1079:vvp_io_commit_write()) Skipped 14 previous similar messages
    15:10:38:LustreError: 30383:0:(osc_lock.c:511:osc_lock_upcall()) ASSERTION( lock->cll_state >= CLS_QUEUING ) failed:
    15:10:40:LustreError: 30383:0:(osc_lock.c:511:osc_lock_upcall()) LBUG
    15:10:40:Pid: 30383, comm: ptlrpcd_1
    15:10:40:
    15:10:40:Call Trace:
    15:10:40: [<ffffffffa0834895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
    15:10:41: [<ffffffffa0834e97>] lbug_with_loc+0x47/0xb0 [libcfs]
    15:10:41: [<ffffffffa04a032a>] osc_lock_upcall+0x44a/0x5f0 [osc]
    15:10:41: [<ffffffffa0e242b0>] ? lustre_swab_ldlm_reply+0x0/0x40 [ptlrpc]
    15:10:42: [<ffffffffa049fee0>] ? osc_lock_upcall+0x0/0x5f0 [osc]
    15:10:42: [<ffffffffa04807a6>] osc_enqueue_fini+0x106/0x240 [osc]
    15:10:43: [<ffffffffa0485282>] osc_enqueue_interpret+0xe2/0x1e0 [osc]
    15:10:43: [<ffffffffa0e15034>] ptlrpc_check_set+0x2c4/0x1b40 [ptlrpc]
    15:10:43: [<ffffffffa0e4053b>] ptlrpcd_check+0x53b/0x560 [ptlrpc]
    15:10:44: [<ffffffffa0e40a5b>] ptlrpcd+0x20b/0x370 [ptlrpc]
    15:10:44: [<ffffffff81063990>] ? default_wake_function+0x0/0x20
    15:10:44: [<ffffffffa0e40850>] ? ptlrpcd+0x0/0x370 [ptlrpc]
    15:10:44: [<ffffffff81096a36>] kthread+0x96/0xa0
    15:10:44: [<ffffffff8100c0ca>] child_rip+0xa/0x20
    15:10:44: [<ffffffff810969a0>] ? kthread+0x0/0xa0
    15:10:45: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
    15:10:45:
    15:10:46:Kernel panic - not syncing: LBUG
    15:10:46:Pid: 30383, comm: ptlrpcd_1 Not tainted 2.6.32-358.23.2.el6.x86_64 #1
    15:10:46:Call Trace:
    15:10:46: [<ffffffff8150daac>] ? panic+0xa7/0x16f
    15:10:46: [<ffffffffa0834eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
    15:10:47: [<ffffffffa04a032a>] ? osc_lock_upcall+0x44a/0x5f0 [osc]
    15:10:48: [<ffffffffa0e242b0>] ? lustre_swab_ldlm_reply+0x0/0x40 [ptlrpc]
    15:10:48: [<ffffffffa049fee0>] ? osc_lock_upcall+0x0/0x5f0 [osc]
    15:10:48: [<ffffffffa04807a6>] ? osc_enqueue_fini+0x106/0x240 [osc]
    15:10:49: [<ffffffffa0485282>] ? osc_enqueue_interpret+0xe2/0x1e0 [osc]
    15:10:49: [<ffffffffa0e15034>] ? ptlrpc_check_set+0x2c4/0x1b40 [ptlrpc]
    15:10:49: [<ffffffffa0e4053b>] ? ptlrpcd_check+0x53b/0x560 [ptlrpc]
    15:10:49: [<ffffffffa0e40a5b>] ? ptlrpcd+0x20b/0x370 [ptlrpc]
    15:10:50: [<ffffffff81063990>] ? default_wake_function+0x0/0x20
    15:10:50: [<ffffffffa0e40850>] ? ptlrpcd+0x0/0x370 [ptlrpc]
    15:10:50: [<ffffffff81096a36>] ? kthread+0x96/0xa0
    15:10:50: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
    15:10:50: [<ffffffff810969a0>] ? kthread+0x0/0xa0
    15:10:51: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
    15:10:52:Initializing cgroup subsys cpuset |
| Comment by Cliff White (Inactive) [ 09/Nov/13 ] |
|
Hit this error attempting to test ZFS on Hyperion, running IOR single-shared-file.

    2013-11-08 17:12:17 LustreError: 78127:0:(osc_lock.c:511:osc_lock_upcall()) ASSERTION( lock->cll_state >= CLS_QUEUING ) failed:
    2013-11-08 17:12:17 LustreError: 78134:0:(osc_lock.c:511:osc_lock_upcall()) ASSERTION( lock->cll_state >= CLS_QUEUING ) failed:
    2013-11-08 17:12:17 LustreError: 78134:0:(osc_lock.c:511:osc_lock_upcall()) LBUG
    2013-11-08 17:12:17 Pid: 78134, comm: ptlrpcd_7
    2013-11-08 17:12:17
    2013-11-08 17:12:17 Call Trace:
    2013-11-08 17:12:17 [<ffffffffa056b895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
    2013-11-08 17:12:17 [<ffffffffa056be97>] lbug_with_loc+0x47/0xb0 [libcfs]
    2013-11-08 17:12:17 [<ffffffffa0dc532a>] osc_lock_upcall+0x44a/0x5f0 [osc]
    2013-11-08 17:12:17 [<ffffffffa0ae62b0>] ? lustre_swab_ldlm_reply+0x0/0x40 [ptlrpc]
    2013-11-08 17:12:17 [<ffffffffa0dc4ee0>] ? osc_lock_upcall+0x0/0x5f0 [osc]
    2013-11-08 17:12:17 [<ffffffffa0da57a6>] osc_enqueue_fini+0x106/0x240 [osc]
    2013-11-08 17:12:17 [<ffffffffa0daa282>] osc_enqueue_interpret+0xe2/0x1e0 [osc]
    2013-11-08 17:12:17 [<ffffffffa0ad7034>] ptlrpc_check_set+0x2c4/0x1b40 [ptlrpc]
    2013-11-08 17:12:17 [<ffffffffa0b0253b>] ptlrpcd_check+0x53b/0x560 [ptlrpc]
    2013-11-08 17:12:17 [<ffffffffa0b02a5b>] ptlrpcd+0x20b/0x370 [ptlrpc]
    2013-11-08 17:12:17 [<ffffffff81063990>] ? default_wake_function+0x0/0x20
    2013-11-08 17:12:17 [<ffffffffa0b02850>] ? ptlrpcd+0x0/0x370 [ptlrpc]
    2013-11-08 17:12:17 [<ffffffff81096a36>] kthread+0x96/0xa0
    2013-11-08 17:12:17 [<ffffffff8100c0ca>] child_rip+0xa/0x20
    2013-11-08 17:12:17 [<ffffffff810969a0>] ? kthread+0x0/0xa0
    2013-11-08 17:12:17 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
    2013-11-08 17:12:17
    2013-11-08 17:12:17 LustreError: dumping log to /tmp/lustre-log.1383959537.78134 |
| Comment by Cliff White (Inactive) [ 11/Nov/13 ] |
|
I am repeating the IOR test, but calling 'sync' on the clients every 180 seconds. So far I have not hit the LBUG; it was quite repeatable previously. |
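A minimal sketch of that periodic-sync workaround, assuming it is simply a loop left running on each client for the duration of the IOR job (the 180-second interval is from the comment above; everything else is illustrative):

    # On each client, flush dirty pages every 180 seconds while IOR runs
    while true; do
        sync
        sleep 180
    done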
| Comment by Nathaniel Clark [ 11/Nov/13 ] |
|
Fix issues with parallel-scale |
| Comment by Jinshan Xiong (Inactive) [ 12/Nov/13 ] |
|
There is almost no log for this issue. Is it possible to reproduce it with the latest build and collect logs? Finding a reproducible path would also be useful for verifying the patch in the future. |
| Comment by Patrick Farrell (Inactive) [ 12/Nov/13 ] |
|
Jinshan - Our earlier reproducer (found in the ticket description) was fixed by the earlier patches. Since getting those patches, Cray has only been hitting this unpredictably, during large test runs, and we haven't hit it with more than default debugging enabled. We've tried one of these runs with full debugging and didn't hit it. |
| Comment by Patrick Farrell (Inactive) [ 13/Nov/13 ] |
|
A bit of good news. I hit this with +dlmtrace and +rpctrace enabled on master from today (with an unrelated patch for I've uploaded the dump to ftp.whamcloud.com in Edit: Edit again: |
| Comment by Jinshan Xiong (Inactive) [ 13/Nov/13 ] |
|
Hi Patrick, Thank you, I'll take a look at the log. Can you please tell me what parameters you used to run the racer? Jinshan |
| Comment by Patrick Farrell (Inactive) [ 13/Nov/13 ] |
|
Jinshan - I ran it without any parameters. I simply edited it to change the time limit so it would run indefinitely and executed it (the racer.sh found in lustre/tests/racer/racer.sh). I hit the bug after a bit less than two hours with just +dlmtrace and +rpctrace. Earlier, I ran for about 12 hours with debug=-1 and didn't hit the bug. |
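A rough sketch of an equivalent setup, assuming racer is driven through its DURATION variable instead of editing the script; the debug flags are the ones named above, and the mount point and MDSCOUNT value are taken from the racer invocations logged elsewhere in this ticket:

    # Enable the extra client debug flags in use when the bug was hit
    lctl set_param debug=+dlmtrace
    lctl set_param debug=+rpctrace

    # Run racer for a long time against a Lustre mount (DURATION is in seconds)
    DURATION=86400 MDSCOUNT=1 /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/racer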
| Comment by Jinshan Xiong (Inactive) [ 13/Nov/13 ] |
|
I can reproduce this issue and I'm working on it now. |
| Comment by Alexander Boyko [ 26/Nov/13 ] |
|
Hi, the root cause is bad lock cleanup in the situation where the process gets a fatal signal. Here is the fault analysis:

    00000020:00000001:29.0:1385413027.287708:0:26152:0:(cl_lock.c:1366:cl_unuse_try()) Process entered
    00000020:00010000:29.0:1385413027.287709:0:26152:0:(cl_lock.c:150:cl_lock_trace0()) unuse lock: ffff880e6998b078@(3 ffff880e2ad1d0c0 1 2 0 2 1 0)(ffff880e6795ff08/1/0) at cl_unuse_try():1367
    00000020:00010000:29.0:1385413027.287721:0:26152:0:(cl_lock.c:150:cl_lock_trace0()) hold release lock: ffff880e6998b078@(4 ffff880e2ad1d0c0 1 0 0 3 0 0)(ffff880e6795ff08/1/0) at cl_lock_hold_release():907

5. Got reply from server

The main problem is that the lock state changes from ENQUEUED back to NEW before the reply has come, and that transition happens in the error handling of cl_enqueue_locked(). The cl_unuse_try() call in cl_enqueue_locked() was introduced by this patch: http://review.whamcloud.com/#/c/2654. Jinshan, what do you think? What is the best way to fix this? |
| Comment by Jinshan Xiong (Inactive) [ 26/Nov/13 ] |
|
Hi Boyko, I think you're right. The root cause of this issue is that a CLS_ENQUEUED lock was unused, which confused the code at osc_lock_upcall(). Thanks for looking at it. I pushed a patch at http://review.whamcloud.com/8405, please give it a try. Jinshan |
| Comment by Patrick Farrell (Inactive) [ 26/Nov/13 ] |
|
Jinshan, Alex,

We just tested this patch with racer, and hit the following LBUG:

    2013-11-26T17:06:49.095392-06:00 c1-0c0s1n1 LustreError: 23630:0:(cl_lock.c:1114:cl_use_try()) ASSERTION( result != -38 ) failed:

We're trying to get debug logs from that attempt now. One note of caution: due to a network problem on the system, we're getting a lot of EBUSYs as well. The LBUG seems more likely to be related to the patch, but I just wanted you to know. |
| Comment by Jinshan Xiong (Inactive) [ 26/Nov/13 ] |
|
Did you apply other patches? |
| Comment by Patrick Farrell (Inactive) [ 26/Nov/13 ] |
|
This is a Cray version of 2.5 - At this point, it's extremely close to the recent 2.5 release, but with this patch and that from We're still trying to get debug logs, I'll let you know if we can get something there. Update: We've had to table our attempt to get debug logs for tonight. The soonest we'll have more is tomorrow, sorry. |
| Comment by Jinshan Xiong (Inactive) [ 27/Nov/13 ] |
|
From the message, it seems like ->clo_use is not defined, which is impossible. Can you please print out the lock information with cl_lock_print()? Can you also provide me with the test program you're running to reproduce this issue? |
| Comment by Alexey Lyashkov [ 27/Nov/13 ] |
|
Jay, the reproducer is likely to be racer. |
| Comment by Patrick Farrell (Inactive) [ 27/Nov/13 ] |
|
Jinshan, Alex is right, we're using racer. Our first priority this morning is starting testing on a clean system to see what happens; once that's going, I'll look at printing out the lock info. Here's a more detailed description of how we're running racer: |
| Comment by Alexey Lyashkov [ 27/Nov/13 ] |
|
Patrick, could you try a test patch without the changes in osc? Alex Boyko has been working on a test to verify that all the races are solved by the patch, so we expect to have verification in the next few days. |
| Comment by Patrick Farrell (Inactive) [ 27/Nov/13 ] |
|
Alex - Sure. Same patch, just only the LOV changes? With the original patch as proposed by Jinshan, our testing on the system without the network issues is going well. I left out one line of log from the earlier problem: |
| Comment by Jinshan Xiong (Inactive) [ 27/Nov/13 ] |
|
let me try to reproduce it. |
| Comment by Patrick Farrell (Inactive) [ 27/Nov/13 ] |
|
Jinshan - We just tried for about 4 hours on ~20 clients with your full version of the patch and didn't hit any problems at all. We're going to try without the OSC changes next. |
| Comment by Alexey Lyashkov [ 27/Nov/13 ] |
|
Jay, can you explain why you introduced an OBD_FAIL_LOCK_STATE_WAIT_INTR but no tests use it? |
| Comment by Jinshan Xiong (Inactive) [ 27/Nov/13 ] |
Sorry I'm confused, with which version you didn't see the problem, and which version caused assertion? |
| Comment by Patrick Farrell (Inactive) [ 27/Nov/13 ] |
|
Jinshan, We tested your version of the patch yesterday on a system which had network problems that were causing EBUSY over and over. While testing on that system, we hit the assertion I described in this comment: https://jira.hpdd.intel.com/browse/LU-3889?focusedCommentId=72356&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-72356 This morning, we tested for four hours with your version of the patch, on a different system (without network problems), and have had no problems. We are now testing the version of the patch without the OSC changes. We've been testing for about an hour, and have had no problems yet. We're going to test for a few more hours today, I will let you know if we see anything. |
| Comment by Patrick Farrell (Inactive) [ 27/Nov/13 ] |
|
We did about 3 hours of testing on the version of the patch without OSC changes. No problems were seen. We have a 24 hour general test run scheduled on one of our systems for this weekend. We're currently planning to test the version without OSC changes as suggested by Shadow, but if a new patch is generated, I could change the test run to test that instead. |
| Comment by Jinshan Xiong (Inactive) [ 01/Dec/13 ] |
|
Hi Patrick, it's fine to take out the OSC changes. |
| Comment by Patrick Farrell (Inactive) [ 02/Dec/13 ] |
|
Unfortunately, our test run failed due to unrelated reasons. We're going to do another later this week, I'll update with results as I have them. |
| Comment by Alexander Boyko [ 03/Dec/13 ] |
|
I have added a regression test for this issue: http://review.whamcloud.com/8463. I think it should be included in the patch, but note that the test is based on the ASSERT at osc_lock_upcall(). |
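For context, Lustre regression tests of this kind are usually wired up through the fail_loc fault-injection mechanism. A very rough sketch of the general shape such a test might take; the fail_loc value, file path, and kill sequence are placeholders, and the actual test is the one in the review above:

    # Placeholder value for OBD_FAIL_LOCK_STATE_WAIT_INTR; the real one is defined in the patch
    FAIL_LOC_LOCK_STATE_WAIT_INTR=0x316
    lctl set_param fail_loc=$FAIL_LOC_LOCK_STATE_WAIT_INTR

    # Start I/O that enqueues an extent lock, then interrupt it with a fatal signal
    dd if=/mnt/lustre/f.lu3889 of=/dev/null bs=1M count=1 &
    pid=$!
    sleep 1
    kill -9 $pid

    # Disarm the injection and check that the client survived (no LBUG in osc_lock_upcall)
    lctl set_param fail_loc=0
    dmesg | grep -q "osc_lock_upcall.*LBUG" && echo "test failed"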
| Comment by Jinshan Xiong (Inactive) [ 11/Dec/13 ] |
|
Hi Boyko, I merged your patch into patch 8405, please take a look and thank you for your work. Jinshan |
| Comment by Jian Yu [ 19/Dec/13 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/69/ (2.4.2 RC1)

racer test hit the same failure:

parallel-scale-nfsv4 test iorssf hit the same failure: |
| Comment by Jian Yu [ 20/Dec/13 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/69/ (2.4.2 RC1)

racer test hit the same failure: |
| Comment by Jian Yu [ 02/Jan/14 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/5

racer test hit the same failure: |
| Comment by Bob Glossman (Inactive) [ 03/Jan/14 ] |
|
in b2_5: http://review.whamcloud.com/8717 |
| Comment by Peter Jones [ 04/Jan/14 ] |
|
Landed for 2.5.1 and 2.6 |
| Comment by James A Simmons [ 16/Jan/14 ] |
|
Also hit this bug for 2.4 |
| Comment by Niu Yawei (Inactive) [ 10/Feb/14 ] |
|
for b2_4: http://review.whamcloud.com/9194 |
| Comment by Nathaniel Clark [ 16/Jun/14 ] |
|
Patch 8234 never made it into b2_5 while 7778 with a typo did: |