[LU-7739] replay-single test 70b hangs with LBUG '(mgc_request.c:995:mgc_blocking_ast()) ASSERTION( atomic_read(&cld->cld_refcount) > 0 )' Created: 04/Feb/16 Updated: 14/Dec/21 Resolved: 14/Dec/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0, Lustre 2.9.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | James Nunez (Inactive) | Assignee: | WC Triage |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Environment: | autotest review-dne-part-2 |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
replay-single test_70b times out. In the MDS 2, MDS 3, and MDS 4 console logs, we see:

13:15:37:LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. quota=on. Opts:
13:15:37:LustreError: 25429:0:(mgc_request.c:995:mgc_blocking_ast()) ASSERTION( atomic_read(&cld->cld_refcount) > 0 ) failed:
13:15:37:LustreError: 25429:0:(mgc_request.c:995:mgc_blocking_ast()) LBUG
13:15:37:Pid: 25429, comm: ldlm_bl_01
13:15:37:
13:15:37:Call Trace:
13:15:37: [<ffffffffa0467875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
13:15:37: [<ffffffffa0467e77>] lbug_with_loc+0x47/0xb0 [libcfs]
13:15:37: [<ffffffffa0cff9d9>] mgc_blocking_ast+0x6e9/0x810 [mgc]
13:15:37: [<ffffffffa0758b57>] ldlm_cancel_callback+0x87/0x280 [ptlrpc]
13:15:37: [<ffffffffa07779ba>] ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
13:15:37: [<ffffffffa077c55c>] ldlm_cli_cancel+0x9c/0x3e0 [ptlrpc]
13:15:37: [<ffffffffa0cff3db>] mgc_blocking_ast+0xeb/0x810 [mgc]
13:15:37: [<ffffffffa0cff2f0>] ? mgc_blocking_ast+0x0/0x810 [mgc]
13:15:37: [<ffffffffa0780c90>] ldlm_handle_bl_callback+0x130/0x400 [ptlrpc]
13:15:37: [<ffffffffa0781ba1>] ldlm_bl_thread_main+0x481/0x710 [ptlrpc]
13:15:37: [<ffffffff810672b0>] ? default_wake_function+0x0/0x20
13:15:37: [<ffffffffa0781720>] ? ldlm_bl_thread_main+0x0/0x710 [ptlrpc]
13:15:37: [<ffffffff810a0fce>] kthread+0x9e/0xc0
13:15:37: [<ffffffff8100c28a>] child_rip+0xa/0x20
13:15:37: [<ffffffff810a0f30>] ? kthread+0x0/0xc0
13:15:37: [<ffffffff8100c280>] ? child_rip+0x0/0x20
13:15:37:
13:15:37:Kernel panic - not syncing: LBUG

In the past month, I can find only two occurrences of this error for test_70b. Logs at |
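For readers unfamiliar with this assertion, below is a minimal user-space sketch of the invariant it checks. This is an illustrative model only, not the Lustre source: struct demo_cld, demo_cld_get()/demo_cld_put(), and demo_blocking_ast() are hypothetical stand-ins for the config-log descriptor (cld), its cld_refcount, and mgc_blocking_ast() in mgc_request.c. The assertion failing indicates the blocking callback observed a cld_refcount that was no longer positive.

```c
/*
 * Illustrative model only -- NOT the Lustre source. It mimics the invariant
 * behind ASSERTION(atomic_read(&cld->cld_refcount) > 0) in mgc_blocking_ast():
 * a blocking callback must only run while the config-log descriptor still
 * holds at least one reference. All demo_* names are hypothetical.
 */
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct demo_cld {
	atomic_int cld_refcount;   /* models cld->cld_refcount */
	const char *cld_logname;   /* e.g. "lustre-MDT0000" */
};

static void demo_cld_get(struct demo_cld *cld)
{
	atomic_fetch_add(&cld->cld_refcount, 1);
}

static void demo_cld_put(struct demo_cld *cld)
{
	if (atomic_fetch_sub(&cld->cld_refcount, 1) == 1) {
		printf("freeing config log %s\n", cld->cld_logname);
		free(cld);
	}
}

/* Models the blocking callback: the lock still points at the descriptor,
 * so the reference taken when the lock was enqueued must still be held. */
static void demo_blocking_ast(struct demo_cld *cld)
{
	/* This is the invariant the LBUG reports as violated. */
	assert(atomic_load(&cld->cld_refcount) > 0);
	printf("cancelling config lock for %s (refs=%d)\n",
	       cld->cld_logname, atomic_load(&cld->cld_refcount));
}

int main(void)
{
	struct demo_cld *cld = malloc(sizeof(*cld));

	atomic_init(&cld->cld_refcount, 1);   /* creator's reference */
	cld->cld_logname = "lustre-MDT0000";

	demo_cld_get(cld);        /* reference taken when the lock is enqueued */
	demo_blocking_ast(cld);   /* fine: refcount > 0 */
	demo_cld_put(cld);        /* lock cancelled: drop its reference */
	demo_cld_put(cld);        /* creator drops the last reference: freed */

	/* Running demo_blocking_ast(cld) here would model the failure in this
	 * ticket: the callback racing with the release of the last reference,
	 * so the assertion (in-kernel, the LBUG) fires. */
	return 0;
}
```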
| Comments |
| Comment by James Nunez (Inactive) [ 23/Feb/16 ] |
|
A similar error occurred in sanity test_133g on unmount of MDS4. The LBUG/ASSERT is the same as above, but the stack trace differs:

13:44:11:Lustre: DEBUG MARKER: umount -d -f /mnt/mds4
13:44:11:LustreError: 5639:0:(osp_object.c:588:osp_attr_get()) lustre-MDT0000-osp-MDT0003:osp_attr_get update error [0x200000403:0x1:0x0]: rc = -5
13:44:11:LustreError: 5639:0:(osp_object.c:588:osp_attr_get()) Skipped 2 previous similar messages
13:44:11:LustreError: 5639:0:(mgc_request.c:995:mgc_blocking_ast()) ASSERTION( atomic_read(&cld->cld_refcount) > 0 ) failed:
13:44:11:LustreError: 5639:0:(mgc_request.c:995:mgc_blocking_ast()) LBUG
13:44:11:Pid: 5639, comm: umount
13:44:11:
13:44:11:Call Trace:
13:44:11: [<ffffffffa0467875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
13:44:11: [<ffffffffa0467e77>] lbug_with_loc+0x47/0xb0 [libcfs]
13:44:11: [<ffffffffa0cc48b9>] mgc_blocking_ast+0x6e9/0x810 [mgc]
13:44:11: [<ffffffffa074ba87>] ldlm_cancel_callback+0x87/0x280 [ptlrpc]
13:44:11: [<ffffffffa0472cf1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
13:44:11: [<ffffffffa07698da>] ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
13:44:11: [<ffffffffa076ccae>] ldlm_cli_cancel_list_local+0xee/0x290 [ptlrpc]
13:44:11: [<ffffffffa076d011>] ldlm_cancel_resource_local+0x1c1/0x290 [ptlrpc]
13:44:11: [<ffffffffa076d177>] ldlm_cli_cancel_unused_resource+0x97/0x280 [ptlrpc]
13:44:11: [<ffffffffa076d545>] ldlm_cli_hash_cancel_unused+0x35/0x40 [ptlrpc]
13:44:11: [<ffffffffa04787cb>] cfs_hash_for_each_relax+0x1eb/0x350 [libcfs]
13:44:11: [<ffffffffa076d510>] ? ldlm_cli_hash_cancel_unused+0x0/0x40 [ptlrpc]
13:44:11: [<ffffffffa076d510>] ? ldlm_cli_hash_cancel_unused+0x0/0x40 [ptlrpc]
13:44:11: [<ffffffffa076d510>] ? ldlm_cli_hash_cancel_unused+0x0/0x40 [ptlrpc]
13:44:11: [<ffffffffa047a6ac>] cfs_hash_for_each_nolock+0x8c/0x1d0 [libcfs]
13:44:11: [<ffffffffa076d4c6>] ldlm_cli_cancel_unused+0x166/0x1b0 [ptlrpc]
13:44:11: [<ffffffffa075ae98>] client_disconnect_export+0x228/0x460 [ptlrpc]
13:44:11: [<ffffffffa057af3a>] lustre_common_put_super+0x28a/0xb00 [obdclass]
13:44:11: [<ffffffffa05a0246>] server_put_super+0x116/0xcd0 [obdclass]
13:44:11: [<ffffffff811946eb>] generic_shutdown_super+0x5b/0xe0
13:44:11: [<ffffffff811947d6>] kill_anon_super+0x16/0x60
13:44:11: [<ffffffffa0572616>] lustre_kill_super+0x36/0x60 [obdclass]
13:44:11: [<ffffffff81194f77>] deactivate_super+0x57/0x80
13:44:11: [<ffffffff811b4f5f>] mntput_no_expire+0xbf/0x110
13:44:11: [<ffffffff811b5aab>] sys_umount+0x7b/0x3a0
13:44:11: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
13:44:11:
13:44:11:Kernel panic - not syncing: LBUG
13:44:11:Pid: 5639, comm: umount Not tainted 2.6.32-573.18.1.el6_lustre.g55ae312.x86_64 #1

Logs are at |
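The umount trace reaches the same assertion through ldlm_cli_cancel_unused() under client_disconnect_export(), which suggests the lock-cancel path and config-log teardown can touch the same descriptor concurrently. The two-thread sketch below is again a hypothetical user-space model (demo_* names, global_cld standing in for the pointer the lock carries), not the Lustre code or its fix: it shows the general "pin the descriptor under a lock before using it" pattern that keeps the refcount check from ever seeing zero, whereas the LBUG reported here points at a path where the callback used a descriptor whose last reference had already been dropped.

```c
/*
 * Hypothetical two-thread model of the umount-time interaction -- not the
 * Lustre source. A "cancel" thread (think umount -> ldlm_cli_cancel_unused ->
 * blocking AST) races with a "teardown" thread dropping the descriptor's
 * reference; pinning under cld_lock keeps the refcount assertion satisfied.
 */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

struct demo_cld {
	atomic_int refcount;
};

static pthread_mutex_t cld_lock = PTHREAD_MUTEX_INITIALIZER;
static struct demo_cld *global_cld;   /* models the pointer stored in the lock */

static void demo_cld_put(struct demo_cld *cld)
{
	if (atomic_fetch_sub(&cld->refcount, 1) == 1)
		free(cld);
}

/* Cancel path: pin the descriptor under cld_lock before touching it, so the
 * refcount check can never observe zero. */
static void *cancel_thread(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&cld_lock);
	struct demo_cld *cld = global_cld;
	if (cld)
		atomic_fetch_add(&cld->refcount, 1);   /* pin before use */
	pthread_mutex_unlock(&cld_lock);

	if (cld) {
		assert(atomic_load(&cld->refcount) > 0);   /* the LBUG's check */
		demo_cld_put(cld);                          /* drop the pin */
	}
	return NULL;
}

/* Teardown path: detach the descriptor and drop its reference. */
static void *teardown_thread(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&cld_lock);
	struct demo_cld *cld = global_cld;
	global_cld = NULL;
	pthread_mutex_unlock(&cld_lock);

	if (cld)
		demo_cld_put(cld);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	global_cld = malloc(sizeof(*global_cld));
	atomic_init(&global_cld->refcount, 1);   /* teardown's reference */

	pthread_create(&t1, NULL, cancel_thread, NULL);
	pthread_create(&t2, NULL, teardown_thread, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);

	/* If the cancel path used a stale pointer without pinning it, the
	 * callback could run after the final put -- the situation the
	 * ASSERTION(cld_refcount > 0) failure in this ticket points to. */
	return 0;
}
```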
| Comment by Jian Yu [ 18/Apr/16 ] |
|
Lustre Branch: master

insanity test 11 hit the same failure on MDS 2:

Lustre: DEBUG MARKER: mkdir -p /mnt/mds2; mount -t lustre /dev/lvm-Role_MDS/P2 /mnt/mds2
LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. quota=on. Opts:
LustreError: 31250:0:(mgc_request.c:995:mgc_blocking_ast()) ASSERTION( atomic_read(&cld->cld_refcount) > 0 ) failed:
LustreError: 31250:0:(mgc_request.c:995:mgc_blocking_ast()) LBUG
Pid: 31250, comm: ldlm_bl_01

Call Trace:
 [<ffffffffa0464875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa0464e77>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa044b8b9>] mgc_blocking_ast+0x6e9/0x810 [mgc]
 [<ffffffffa075ea87>] ldlm_cancel_callback+0x87/0x280 [ptlrpc]
 [<ffffffffa077c4da>] ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
 [<ffffffffa078107c>] ldlm_cli_cancel+0x9c/0x3e0 [ptlrpc]
 [<ffffffffa044b2bb>] mgc_blocking_ast+0xeb/0x810 [mgc]
 [<ffffffffa044b1d0>] ? mgc_blocking_ast+0x0/0x810 [mgc]
 [<ffffffffa0785400>] ldlm_handle_bl_callback+0x130/0x400 [ptlrpc]
 [<ffffffffa0786291>] ldlm_bl_thread_main+0x461/0x650 [ptlrpc]
 [<ffffffff81067670>] ? default_wake_function+0x0/0x20
 [<ffffffffa0785e30>] ? ldlm_bl_thread_main+0x0/0x650 [ptlrpc]
 [<ffffffff810a138e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff810a12f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
Kernel panic - not syncing: LBUG
Pid: 31250, comm: ldlm_bl_01 Not tainted 2.6.32-573.22.1.el6_lustre.x86_64 #1

Call Trace:
 [<ffffffff815394d1>] ? panic+0xa7/0x16f
 [<ffffffffa0464ecb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
 [<ffffffffa044b8b9>] ? mgc_blocking_ast+0x6e9/0x810 [mgc]
 [<ffffffffa075ea87>] ? ldlm_cancel_callback+0x87/0x280 [ptlrpc]
 [<ffffffffa077c4da>] ? ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
 [<ffffffffa078107c>] ? ldlm_cli_cancel+0x9c/0x3e0 [ptlrpc]
 [<ffffffffa044b2bb>] ? mgc_blocking_ast+0xeb/0x810 [mgc]
 [<ffffffffa044b1d0>] ? mgc_blocking_ast+0x0/0x810 [mgc]
 [<ffffffffa0785400>] ? ldlm_handle_bl_callback+0x130/0x400 [ptlrpc]
 [<ffffffffa0786291>] ? ldlm_bl_thread_main+0x461/0x650 [ptlrpc]
 [<ffffffff81067670>] ? default_wake_function+0x0/0x20
 [<ffffffffa0785e30>] ? ldlm_bl_thread_main+0x0/0x650 [ptlrpc]
 [<ffffffff810a138e>] ? kthread+0x9e/0xc0
 [<ffffffff8100c28a>] ? child_rip+0xa/0x20
 [<ffffffff810a12f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20

https://testing.hpdd.intel.com/test_sets/595a0f92-030b-11e6-b5f1-5254006e85c2 |
| Comment by Åke Sandgren [ 04/Nov/16 ] |
|
At umount, we sometimes see a crash almost identical to the one James Nunez reported above. This is on a production system. |