[LU-7444] Crash in mgc_blocking_ast Created: 17/Nov/15  Updated: 13/Oct/21  Resolved: 13/Oct/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Oleg Drokin Assignee: Zhenyu Xu
Resolution: Cannot Reproduce Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

Description

This is somewhat similar to the long-since-fixed LU-272, but not quite the same.

I got a couple of these today, and I am sure I saw it earlier too, all in replay-single test 74:

<4>[ 5300.248670] Lustre: DEBUG MARKER: == replay-single test 74: Ensure applications don't fail waiting for OST recovery == 12:30:47 (1447781447)
<4>[ 5302.637377] Lustre: Unmounted lustre-client
<4>[ 5303.061113] Lustre: Failing over lustre-OST0000
<4>[ 5303.061940] Lustre: Skipped 10 previous similar messages
<4>[ 5303.549967] Lustre: server umount lustre-OST0000 complete
<4>[ 5303.550731] Lustre: Skipped 10 previous similar messages
<3>[ 5314.676109] LustreError: 166-1: MGC192.168.10.216@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
<3>[ 5314.678289] LustreError: Skipped 10 previous similar messages
<6>[ 5317.035395] LDISKFS-fs (loop0): mounted filesystem with ordered data mode. quota=on. Opts: 
<6>[ 5320.676659] Lustre: MGS: Connection restored to 192.168.10.216@tcp (at 0@lo)
<6>[ 5320.677600] Lustre: Skipped 109 previous similar messages
<1>[ 5320.679609] BUG: unable to handle kernel paging request at ffff8800b32a2e78
<1>[ 5320.680562] IP: [<ffffffffa0bb5499>] mgc_blocking_ast+0x169/0x810 [mgc]
<4>[ 5320.681565] PGD 1a2e063 PUD 501067 PMD 69b067 PTE 80000000b32a2060
<4>[ 5320.682610] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
<4>[ 5320.683156] last sysfs file: /sys/devices/virtual/block/loop0/queue/scheduler
<4>[ 5320.683156] CPU 6 
<4>[ 5320.683156] Modules linked in: lustre ofd osp lod ost mdt mdd mgs osd_ldiskfs ldiskfs exportfs lquota lfsck jbd obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass ksocklnd lnet sha512_generic sha256_generic libcfs ext4 jbd2 mbcache virtio_console virtio_balloon i2c_piix4 i2c_core virtio_blk virtio_net virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache auth_rpcgss nfs_acl sunrpc be2iscsi bnx2i cnic uio cxgb3i libcxgbi ipv6 cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: speedstep_lib]
<4>[ 5320.683156] 
<4>[ 5320.683156] Pid: 26321, comm: ll_imp_inval Not tainted 2.6.32-rhe6.7-debug #1 Bochs Bochs
<4>[ 5320.683156] RIP: 0010:[<ffffffffa0bb5499>]  [<ffffffffa0bb5499>] mgc_blocking_ast+0x169/0x810 [mgc]
<4>[ 5320.683156] RSP: 0018:ffff880093b53b00  EFLAGS: 00010286
<4>[ 5320.683156] RAX: 0000000000000001 RBX: ffff880039a41db8 RCX: 0000000000000000
<4>[ 5320.683156] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8800b0f9cef0
<4>[ 5320.683156] RBP: ffff880093b53b40 R08: 0000000000000000 R09: 00000000fffffffc
<4>[ 5320.683156] R10: 0000000000000000 R11: 0000000000000002 R12: ffff8800b32a2df0
<4>[ 5320.683156] R13: 001110e400000000 R14: ffff88006df7bf18 R15: 0000002000000000
<4>[ 5320.683156] FS:  0000000000000000(0000) GS:ffff880006380000(0000) knlGS:0000000000000000
<4>[ 5320.683156] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<4>[ 5320.683156] CR2: ffff8800b32a2e78 CR3: 00000000b01a5000 CR4: 00000000000006e0
<4>[ 5320.683156] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[ 5320.683156] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[ 5320.683156] Process ll_imp_inval (pid: 26321, threadinfo ffff880093b50000, task ffff8800655e8440)
<4>[ 5320.683156] Stack:
<4>[ 5320.683156]  ffff88006df7bf60 ffff880039a41db8 ffff880093b53b20 ffffffff81530afe
<4>[ 5320.683156] <d> ffff880093b53b40 ffffffffa07b3041 ffff880039a41db8 0000000000000002
<4>[ 5320.683156] <d> ffff880093b53bc0 ffffffffa07b5ad7 ffff880093b53b60 ffff880039a41df0
<4>[ 5320.683156] Call Trace:
<4>[ 5320.683156]  [<ffffffff81530afe>] ? _spin_unlock+0xe/0x10
<4>[ 5320.683156]  [<ffffffffa07b3041>] ? unlock_res_and_lock+0x41/0x50 [ptlrpc]
<4>[ 5320.683156]  [<ffffffffa07b5ad7>] ldlm_cancel_callback+0x87/0x280 [ptlrpc]
<4>[ 5320.683156]  [<ffffffffa07d36ea>] ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
<4>[ 5320.683156]  [<ffffffffa07d823c>] ldlm_cli_cancel+0x9c/0x3e0 [ptlrpc]
<4>[ 5320.683156]  [<ffffffffa07c0a32>] cleanup_resource+0x142/0x370 [ptlrpc]
<4>[ 5320.683156]  [<ffffffffa045b86e>] ? cfs_hash_spin_lock+0xe/0x10 [libcfs]
<4>[ 5320.683156]  [<ffffffffa07c0c8f>] ldlm_resource_clean+0x2f/0x60 [ptlrpc]
<4>[ 5320.683156]  [<ffffffffa045b1ae>] cfs_hash_for_each_relax+0x1fe/0x380 [libcfs]
<4>[ 5320.683156]  [<ffffffffa07c0c60>] ? ldlm_resource_clean+0x0/0x60 [ptlrpc]
<4>[ 5320.683156]  [<ffffffffa07c0c60>] ? ldlm_resource_clean+0x0/0x60 [ptlrpc]
<4>[ 5320.683156]  [<ffffffffa045d14c>] cfs_hash_for_each_nolock+0x8c/0x1d0 [libcfs]
<4>[ 5320.683156]  [<ffffffffa07bcc00>] ldlm_namespace_cleanup+0x30/0xc0 [ptlrpc]
<4>[ 5320.683156]  [<ffffffffa0bb4487>] mgc_import_event+0x247/0x2a0 [mgc]
<4>[ 5320.683156]  [<ffffffffa0820f92>] ptlrpc_invalidate_import+0x312/0x990 [ptlrpc]
<4>[ 5320.683156]  [<ffffffffa0455701>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
<4>[ 5320.683156]  [<ffffffffa0822bc0>] ? ptlrpc_invalidate_import_thread+0x0/0x2e0 [ptlrpc]
<4>[ 5320.683156]  [<ffffffffa0822c08>] ptlrpc_invalidate_import_thread+0x48/0x2e0 [ptlrpc]
<4>[ 5320.683156]  [<ffffffff8109f82e>] kthread+0x9e/0xc0
<4>[ 5320.683156]  [<ffffffff8100c2ca>] child_rip+0xa/0x20
<4>[ 5320.683156]  [<ffffffff8109f790>] ? kthread+0x0/0xc0
<4>[ 5320.683156]  [<ffffffff8100c2c0>] ? child_rip+0x0/0x20
<4>[ 5320.683156] Code: 00 01 00 a9 00 00 01 00 74 0d f6 05 e4 ae 8b ff 10 0f 85 9b 02 00 00 a9 00 00 00 01 0f 85 d8 00 00 00 4d 85 e4 0f 84 07 02 00 00 <41> 8b 84 24 88 00 00 00 85 c0 0f 8e 3c 05 00 00 41 f6 84 24 fc 
<1>[ 5320.683156] RIP  [<ffffffffa0bb5499>] mgc_blocking_ast+0x169/0x810 [mgc]
<4>[ 5320.683156]  RSP <ffff880093b53b00>
<4>[ 5320.683156] CR2: ffff8800b32a2e78

Sample crashdump on my node: /exports/crashdumps/192.168.10.216-2015-11-17-12\:31\:13/

This is latest master plus http://review.whamcloud.com/#/c/16940/, though I have also hit the crash without that patch applied.

The first time this was hit in my testing was apparently in the first half of October.
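
For what it's worth, the Code bytes at the faulting RIP decode to mov 0x88(%r12),%eax, and R12 (ffff8800b32a2df0) plus 0x88 is exactly the CR2 address (ffff8800b32a2e78), so the AST is loading a field from an object whose page DEBUG_PAGEALLOC has already unmapped, i.e. a use-after-free on data reached through the lock. Below is a minimal user-space sketch of that failure pattern; the struct and function names are hypothetical illustrations, not the actual mgc code:

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-lock callback data that the
 * blocking AST reaches through the lock (illustrative names only,
 * not the real Lustre structures). */
struct cb_data {
        int refs;
        int flags;      /* stands in for the field at offset 0x88 */
};

struct lock {
        struct cb_data *ast_data;  /* bare pointer, no reference held */
};

/* Blocking-AST-like callback: dereferences ast_data unconditionally.
 * If a concurrent teardown path already freed the data, this read is
 * a use-after-free; with DEBUG_PAGEALLOC the freed page is unmapped,
 * so the kernel faults at the load instead of silently reading stale
 * memory. */
static void blocking_ast(struct lock *lk)
{
        printf("flags = %d\n", lk->ast_data->flags);
}

int main(void)
{
        struct lock lk;
        struct cb_data *data = calloc(1, sizeof(*data));

        lk.ast_data = data;   /* lock caches a raw pointer */
        free(data);           /* teardown frees the data first... */
        blocking_ast(&lk);    /* ...then the AST fires on cancel: UAF */
        return 0;
}

The usual remedy for this pattern is to take a reference on the callback data before the callback can run and drop it only afterwards; whether that is what is missing here would need confirmation from the crashdump.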



Comments
Comment by Nathaniel Clark [ 22/May/16 ]

Crash on master: replay-single/test_70c crashed on onyx-37vm3.
https://testing.hpdd.intel.com/test_sets/b6654596-1cc2-11e6-952a-5254006e85c2
