[LU-3614] Kernel Panic "osc_lock_detach" Created: 22/Jul/13 Updated: 09/Oct/21 Resolved: 09/Oct/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Rustem Bikboulatov | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 1 |
| Labels: | None | ||
| Environment: | Linux 2.6.32-279.19.1.el6_lustre.x86_64 #1 SMP |
| Attachments: | |
| Severity: | 3 |
| Rank (Obsolete): | 9290 |
| Description |
|
Hello. Our OST servers periodically crash with a kernel panic:

general protection fault: 0000 [#1] SMP
Pid: 26235, comm: ldlm_bl_04 Not tainted 2.6.32-279.19.1.el6_lustre.x86_64 #1 Supermicro X8DTH-i/6/iF/6F/X8DTH
| Comments |
| Comment by Rustem Bikboulatov [ 31/Jul/13 ] |
|
Hello. Today we hit yet another kernel panic, this time on another OSS server; the output log is below. We have a crash dump file and are ready to send it for analysis if you need it. These problems concern us because we use Lustre for news television production, so any outage puts broadcasts at risk. Please help us diagnose and fix the problem. Thanks.

general protection fault: 0000 [#1] SMP
Pid: 8815, comm: ldlm_bl_28 Not tainted 2.6.32-279.19.1.el6_lustre.x86_64 #1 Supermicro X8DTH-i/6/iF/6F/X8DTH
| Comment by Rustem Bikboulatov [ 07/Aug/13 ] |
|
It happened again, on another server.

[root@n23 lustre_2.1.5]# crash /usr/lib/debug/lib/modules/2.6.32-279.19.1.el6_lustre.x86_64/vmlinux /var/crash/127.0.0.1-2013-08-07-17\:00\:44/vmcore
crash 6.0.4-2.el6
GNU gdb (GDB) 7.3.1
KERNEL: /usr/lib/debug/lib/modules/2.6.32-279.19.1.el6_lustre.x86_64/vmlinux

general protection fault: 0000 [#1] SMP
Pid: 26404, comm: ldlm_bl_10 Not tainted 2.6.32-279.19.1.el6_lustre.x86_64 #1 Supermicro X8DTH-i/6/iF/6F/X8DTH

crash> kmem -i
TOTAL SWAP  524286  2 GB
----

Hello, anybody there?
| Comment by Rustem Bikboulatov [ 09/Aug/13 ] |
|
Hello again! We installed a Lustre client on a separate server from the OSS; that server ran only the Lustre client and our application software. A day later we got the same symptom - a kernel panic. So the problem is most likely in the Lustre client alone.

[root@r01 ~]# crash /usr/lib/debug/lib/modules/2.6.32-279.19.1.el6_lustre.x86_64/vmlinux /var/crash/127.0.0.1-2013-08-08-20\:42\:27/vmcore
crash 6.0.4-2.el6
GNU gdb (GDB) 7.3.1
KERNEL: /usr/lib/debug/lib/modules/2.6.32-279.19.1.el6_lustre.x86_64/vmlinux

general protection fault: 0000 [#1] SMP
Pid: 3170, comm: ldlm_bl_00 Not tainted 2.6.32-279.19.1.el6_lustre.x86_64 #1 Supermicro X8DTH-i/6/iF/6F/X8DTH

crash> kmem -i
TOTAL SWAP  524286  2 GB
----
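For reference, a sketch of the commands typically run inside crash to pull more context out of such a vmcore; the dump directory is a placeholder and this is not a transcript of our exact session:

[root@r01 ~]# crash /usr/lib/debug/lib/modules/2.6.32-279.19.1.el6_lustre.x86_64/vmlinux /var/crash/<dump-dir>/vmcore
crash> mod -S              # load debuginfo so the Lustre module symbols resolve
crash> log                 # kernel ring buffer, including the general protection fault message
crash> bt                  # backtrace of the panicking task (ldlm_bl_00 in this dump)
crash> ps | grep ldlm_bl   # list the ldlm blocking callback threads
crash> kmem -i             # memory usage summary (the TOTAL SWAP line above comes from here)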
| Comment by Bruno Faccini (Inactive) [ 15/Nov/13 ] |
|
Running with the latest b2_1 build #220, I got this same crash with the reproducer of |
| Comment by Rustem Bikboulatov [ 15/Nov/13 ] |
|
Bruno, that's good news. We will wait for the patch. |
| Comment by Patrick Farrell (Inactive) [ 07/Jan/14 ] |
|
Bruno - any word on this one? For what it's worth, on the Cray side we think this may be a cl_lock-related bug, similar to

Rustem - can you share details of what you are running and what the setup is when you hit this bug? We would like to gather debugging information, but we do not have a reliable reproducer.
| Comment by Rustem Bikboulatov [ 14/Jan/14 ] |
|
Lustre Cluster Diagram |
| Comment by Rustem Bikboulatov [ 14/Jan/14 ] |
|
Patrick, here is the cluster configuration:

Lustre server MGS/MDS - mmp-2 (refer to the diagram "20140113 - Hardware Diagram v0.1_R3.gif" in the attachments)

Environment:

Mount points:
OSS:
MGS/MDS: /dev/lustre_mgs on /lustre/mgs type lustre (rw,noauto,_netdev,abort_recov)
Clients:

Stripe config:
[root@mmp-1 ~]# lfs getstripe /array1/.

kdump config:
core_collector makedumpfile -c --message-level 1 -d 31

Application software description (LRVfarm):
LRVfarm is software that processes media files (video+audio) and creates a proxy (low-resolution) video. Each task to be run is described by a small task file located on the Lustre file system. LRVfarm runs several threads (processes) - 8 processes on each client (r01, r02, r03, r04), for a total of 32 LRVfarm processes in the cluster. Each process uses file locking when performing tasks: an LRVfarm process locks a task file and performs the task, while other LRVfarm processes (on the local client and on remote clients) also try to lock that task file and fail while the task is running, because the file is already locked by another LRVfarm process (see the sketch at the end of this comment).

Periodically, all the clients crash (kernel panic) with different errors. Here are the crash statistics for the recent period:
Client r01 2013-12-30-13:27:09 ASSERTION( stripe < lio->lis_stripe_count ) failed:
Client r02 2014-01-13-16:50:05 ASSERTION( stripe < lio->lis_stripe_count ) failed:
Client r03 2013-11-06-20:25:26 ASSERTION( stripe < lio->lis_stripe_count ) failed:
Client r04 2013-12-08-06:19:11 cl_lock_mutex_get+0x2e/0xe0 [obdclass]

The last crash with the "osc_lock_detach" error occurred on October 13, 2013. At the time this ticket was created, "osc_lock_detach" was the most common error; perhaps the load on Lustre was different then. Now the most frequently occurring error is "ASSERTION( stripe < lio->lis_stripe_count ) failed" (
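For illustration only - a minimal sketch of the locking pattern described above, assuming each LRVfarm worker takes an exclusive advisory lock on its task file via flock(1) before processing it; the paths and the process_task command are hypothetical, not the real LRVfarm internals:

[root@r01 ~]# flock --nonblock /array1/tasks/task-0001.job -c "process_task /array1/tasks/task-0001.job" \
    || echo "task-0001 is already locked by another LRVfarm process"

With the clients mounted with the flock option, that lock is coherent across the whole cluster, so a second worker on any client (local or remote) fails immediately instead of processing the same task.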
| Comment by Rustem Bikboulatov [ 14/Apr/14 ] |
|
Today there was a kernel panic with "osc_lock_detach" again:
==================================================================
general protection fault: 0000 [#1] SMP
Pid: 23021, comm: ldlm_bl_37 Not tainted 2.6.32-279.19.1.el6_lustre.x86_64 #1 Supermicro X8DTH-i/6/iF/6F/X8DTH
==========================================================================

Prior to this we had changed the mount options for the Lustre clients on all servers: we removed the "flock" mode, because we suspected the crashes were associated with it. The mount options are now:
Clients (r01, r02, r03, r04, mmp-1, vn-1):
Clients (cln01, cln02, cln03, cln04):

I also want to note that the issue
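For reference only - a generic sketch of how the flock behaviour is selected at client mount time (placeholder MGS and filesystem names, not our actual configuration): -o flock gives cluster-coherent locks, -o localflock keeps locks local to a single client, and omitting both makes flock() calls fail on that mount:

# cluster-coherent flock across all clients (the mode we removed)
mount -t lustre mgsnode@tcp0:/lustre /mnt/lustre -o flock

# client-local flock only (locks are not seen by other clients)
mount -t lustre mgsnode@tcp0:/lustre /mnt/lustre -o localflock

# no flock option: flock() is not supported on this mount
mount -t lustre mgsnode@tcp0:/lustre /mnt/lustre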