[LU-7739] replay-single test 70b hangs with LBUG '(mgc_request.c:995:mgc_blocking_ast()) ASSERTION( atomic_read(&cld->cld_refcount) > 0 )' Created: 04/Feb/16  Updated: 14/Dec/21  Resolved: 14/Dec/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0, Lustre 2.9.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

autotest review-dne-part-2


Issue Links:
Related
is related to LU-7298 replay-single test_70b: ASSERTION( dt... Closed
is related to LU-6844 replay-single test 70b failure: 'rund... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

replay-single test_70b times out. In the MDS 2, MDS 3, and MDS 4 console logs, we see:

13:15:37:LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. quota=on. Opts: 
13:15:37:LustreError: 25429:0:(mgc_request.c:995:mgc_blocking_ast()) ASSERTION( atomic_read(&cld->cld_refcount) > 0 ) failed: 
13:15:37:LustreError: 25429:0:(mgc_request.c:995:mgc_blocking_ast()) LBUG
13:15:37:Pid: 25429, comm: ldlm_bl_01
13:15:37:
13:15:37:Call Trace:
13:15:37: [<ffffffffa0467875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
13:15:37: [<ffffffffa0467e77>] lbug_with_loc+0x47/0xb0 [libcfs]
13:15:37: [<ffffffffa0cff9d9>] mgc_blocking_ast+0x6e9/0x810 [mgc]
13:15:37: [<ffffffffa0758b57>] ldlm_cancel_callback+0x87/0x280 [ptlrpc]
13:15:37: [<ffffffffa07779ba>] ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
13:15:37: [<ffffffffa077c55c>] ldlm_cli_cancel+0x9c/0x3e0 [ptlrpc]
13:15:37: [<ffffffffa0cff3db>] mgc_blocking_ast+0xeb/0x810 [mgc]
13:15:37: [<ffffffffa0cff2f0>] ? mgc_blocking_ast+0x0/0x810 [mgc]
13:15:37: [<ffffffffa0780c90>] ldlm_handle_bl_callback+0x130/0x400 [ptlrpc]
13:15:37: [<ffffffffa0781ba1>] ldlm_bl_thread_main+0x481/0x710 [ptlrpc]
13:15:37: [<ffffffff810672b0>] ? default_wake_function+0x0/0x20
13:15:37: [<ffffffffa0781720>] ? ldlm_bl_thread_main+0x0/0x710 [ptlrpc]
13:15:37: [<ffffffff810a0fce>] kthread+0x9e/0xc0
13:15:37: [<ffffffff8100c28a>] child_rip+0xa/0x20
13:15:37: [<ffffffff810a0f30>] ? kthread+0x0/0xc0
13:15:37: [<ffffffff8100c280>] ? child_rip+0x0/0x20
13:15:37:
13:15:37:Kernel panic - not syncing: LBUG

In the past month, I can find only two occurrences of this error for test_70b. Logs are at:
2016-01-28 15:21:30 - https://testing.hpdd.intel.com/test_sets/c296d92c-c620-11e5-b4e1-5254006e85c2
2016-02-03 19:34:24 - https://testing.hpdd.intel.com/test_sets/e4674cb8-caf7-11e5-be8d-5254006e85c2
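For context, the assertion at mgc_request.c:995 fires inside the MGC blocking AST when the config-log descriptor (cld) it is handed no longer holds a positive reference count, i.e. the descriptor appears to have been released (or is being torn down) while a blocking callback is still in flight. The following is a minimal, hypothetical C sketch of that kind of refcount invariant; the names cfg_log_desc, cfg_log_get and cfg_log_put are illustrative stand-ins, not the actual Lustre structures or functions.

/* Simplified sketch (not the Lustre source) of the reference-count
 * invariant that the LBUG above enforces: a descriptor handed to a
 * callback is expected to still hold at least one reference. */

#include <stdatomic.h>
#include <assert.h>

struct cfg_log_desc {            /* hypothetical stand-in for the MGC config-log descriptor */
        atomic_int refcount;     /* stand-in for cld->cld_refcount */
};

static void cfg_log_get(struct cfg_log_desc *cld)
{
        /* Mirrors the kind of check the assertion above performs:
         * taking a new reference is only legal while the count is
         * still positive. */
        assert(atomic_load(&cld->refcount) > 0);
        atomic_fetch_add(&cld->refcount, 1);
}

static void cfg_log_put(struct cfg_log_desc *cld)
{
        if (atomic_fetch_sub(&cld->refcount, 1) == 1) {
                /* Last reference gone; if a blocking callback is still
                 * queued at this point, it will trip the assertion. */
        }
}

int main(void)
{
        struct cfg_log_desc cld;

        atomic_init(&cld.refcount, 1);
        cfg_log_put(&cld);   /* last reference dropped, e.g. config log torn down */
        cfg_log_get(&cld);   /* late callback arrives: assertion trips, analogous to the LBUG */
        return 0;
}

In the crashes reported here the callback side is the ldlm_bl (blocking) thread or the umount path cancelling locks, which is why the traces below all reach mgc_blocking_ast() through lock cancellation.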



 Comments   
Comment by James Nunez (Inactive) [ 23/Feb/16 ]

A similar error occurred in sanity test_133g on unmount of MDS 4. The LBUG/ASSERTION is the same as above, but the stack trace differs:

13:44:11:Lustre: DEBUG MARKER: umount -d -f /mnt/mds4
13:44:11:LustreError: 5639:0:(osp_object.c:588:osp_attr_get()) lustre-MDT0000-osp-MDT0003:osp_attr_get update error [0x200000403:0x1:0x0]: rc = -5
13:44:11:LustreError: 5639:0:(osp_object.c:588:osp_attr_get()) Skipped 2 previous similar messages
13:44:11:LustreError: 5639:0:(mgc_request.c:995:mgc_blocking_ast()) ASSERTION( atomic_read(&cld->cld_refcount) > 0 ) failed: 
13:44:11:LustreError: 5639:0:(mgc_request.c:995:mgc_blocking_ast()) LBUG
13:44:11:Pid: 5639, comm: umount
13:44:11:
13:44:11:Call Trace:
13:44:11: [<ffffffffa0467875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
13:44:11: [<ffffffffa0467e77>] lbug_with_loc+0x47/0xb0 [libcfs]
13:44:11: [<ffffffffa0cc48b9>] mgc_blocking_ast+0x6e9/0x810 [mgc]
13:44:11: [<ffffffffa074ba87>] ldlm_cancel_callback+0x87/0x280 [ptlrpc]
13:44:11: [<ffffffffa0472cf1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
13:44:11: [<ffffffffa07698da>] ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
13:44:11: [<ffffffffa076ccae>] ldlm_cli_cancel_list_local+0xee/0x290 [ptlrpc]
13:44:11: [<ffffffffa076d011>] ldlm_cancel_resource_local+0x1c1/0x290 [ptlrpc]
13:44:11: [<ffffffffa076d177>] ldlm_cli_cancel_unused_resource+0x97/0x280 [ptlrpc]
13:44:11: [<ffffffffa076d545>] ldlm_cli_hash_cancel_unused+0x35/0x40 [ptlrpc]
13:44:11: [<ffffffffa04787cb>] cfs_hash_for_each_relax+0x1eb/0x350 [libcfs]
13:44:11: [<ffffffffa076d510>] ? ldlm_cli_hash_cancel_unused+0x0/0x40 [ptlrpc]
13:44:11: [<ffffffffa076d510>] ? ldlm_cli_hash_cancel_unused+0x0/0x40 [ptlrpc]
13:44:11: [<ffffffffa076d510>] ? ldlm_cli_hash_cancel_unused+0x0/0x40 [ptlrpc]
13:44:11: [<ffffffffa047a6ac>] cfs_hash_for_each_nolock+0x8c/0x1d0 [libcfs]
13:44:11: [<ffffffffa076d4c6>] ldlm_cli_cancel_unused+0x166/0x1b0 [ptlrpc]
13:44:11: [<ffffffffa075ae98>] client_disconnect_export+0x228/0x460 [ptlrpc]
13:44:11: [<ffffffffa057af3a>] lustre_common_put_super+0x28a/0xb00 [obdclass]
13:44:11: [<ffffffffa05a0246>] server_put_super+0x116/0xcd0 [obdclass]
13:44:11: [<ffffffff811946eb>] generic_shutdown_super+0x5b/0xe0
13:44:11: [<ffffffff811947d6>] kill_anon_super+0x16/0x60
13:44:11: [<ffffffffa0572616>] lustre_kill_super+0x36/0x60 [obdclass]
13:44:11: [<ffffffff81194f77>] deactivate_super+0x57/0x80
13:44:11: [<ffffffff811b4f5f>] mntput_no_expire+0xbf/0x110
13:44:11: [<ffffffff811b5aab>] sys_umount+0x7b/0x3a0
13:44:11: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
13:44:11:
13:44:11:Kernel panic - not syncing: LBUG
13:44:11:Pid: 5639, comm: umount Not tainted 2.6.32-573.18.1.el6_lustre.g55ae312.x86_64 #1

Logs are at:
2016-02-22 19:06:02 - https://testing.hpdd.intel.com/test_sets/102b46ac-d9e8-11e5-9e9f-5254006e85c2

Comment by Jian Yu [ 18/Apr/16 ]

Lustre Branch: master
Test Group: review-dne-part-2

insanity test 11 hit the same failure on MDS 2:

Lustre: DEBUG MARKER: mkdir -p /mnt/mds2; mount -t lustre                                  /dev/lvm-Role_MDS/P2 /mnt/mds2
LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. quota=on. Opts: 
LustreError: 31250:0:(mgc_request.c:995:mgc_blocking_ast()) ASSERTION( atomic_read(&cld->cld_refcount) > 0 ) failed:  
LustreError: 31250:0:(mgc_request.c:995:mgc_blocking_ast()) LBUG
Pid: 31250, comm: ldlm_bl_01

Call Trace:
 [<ffffffffa0464875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] 
 [<ffffffffa0464e77>] lbug_with_loc+0x47/0xb0 [libcfs] 
 [<ffffffffa044b8b9>] mgc_blocking_ast+0x6e9/0x810 [mgc]
 [<ffffffffa075ea87>] ldlm_cancel_callback+0x87/0x280 [ptlrpc] 
 [<ffffffffa077c4da>] ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc] 
 [<ffffffffa078107c>] ldlm_cli_cancel+0x9c/0x3e0 [ptlrpc] 
 [<ffffffffa044b2bb>] mgc_blocking_ast+0xeb/0x810 [mgc]
 [<ffffffffa044b1d0>] ? mgc_blocking_ast+0x0/0x810 [mgc]
 [<ffffffffa0785400>] ldlm_handle_bl_callback+0x130/0x400 [ptlrpc] 
 [<ffffffffa0786291>] ldlm_bl_thread_main+0x461/0x650 [ptlrpc] 
 [<ffffffff81067670>] ? default_wake_function+0x0/0x20
 [<ffffffffa0785e30>] ? ldlm_bl_thread_main+0x0/0x650 [ptlrpc] 
 [<ffffffff810a138e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff810a12f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20

Kernel panic - not syncing: LBUG
Pid: 31250, comm: ldlm_bl_01 Not tainted 2.6.32-573.22.1.el6_lustre.x86_64 #1
Call Trace:
 [<ffffffff815394d1>] ? panic+0xa7/0x16f
 [<ffffffffa0464ecb>] ? lbug_with_loc+0x9b/0xb0 [libcfs] 
 [<ffffffffa044b8b9>] ? mgc_blocking_ast+0x6e9/0x810 [mgc]
 [<ffffffffa075ea87>] ? ldlm_cancel_callback+0x87/0x280 [ptlrpc] 
 [<ffffffffa077c4da>] ? ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc] 
 [<ffffffffa078107c>] ? ldlm_cli_cancel+0x9c/0x3e0 [ptlrpc] 
 [<ffffffffa044b2bb>] ? mgc_blocking_ast+0xeb/0x810 [mgc]
 [<ffffffffa044b1d0>] ? mgc_blocking_ast+0x0/0x810 [mgc]
 [<ffffffffa0785400>] ? ldlm_handle_bl_callback+0x130/0x400 [ptlrpc] 
 [<ffffffffa0786291>] ? ldlm_bl_thread_main+0x461/0x650 [ptlrpc] 
 [<ffffffff81067670>] ? default_wake_function+0x0/0x20
 [<ffffffffa0785e30>] ? ldlm_bl_thread_main+0x0/0x650 [ptlrpc] 
 [<ffffffff810a138e>] ? kthread+0x9e/0xc0
 [<ffffffff8100c28a>] ? child_rip+0xa/0x20
 [<ffffffff810a12f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20

https://testing.hpdd.intel.com/test_sets/595a0f92-030b-11e6-b5f1-5254006e85c2

Comment by Åke Sandgren [ 04/Nov/16 ]

We are sometimes seeing a crash at umount that is almost identical to the one James Nunez reported above.
We are using tag 2.8.56 plus the fixes for LU-6808 on the servers, and 2.5.41_DDN on the servers.

This is on a production system.
