[LU-7444] Crash in mgc_blocking_ast Created: 17/Nov/15 Updated: 13/Oct/21 Resolved: 13/Oct/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Oleg Drokin | Assignee: | Zhenyu Xu |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This is somewhat similar to a long-since-fixed issue. I got a couple of these today, and I am sure I saw this earlier too, all in replay-single test 74:

<4>[ 5300.248670] Lustre: DEBUG MARKER: == replay-single test 74: Ensure applications don't fail waiting for OST recovery == 12:30:47 (1447781447)
<4>[ 5302.637377] Lustre: Unmounted lustre-client
<4>[ 5303.061113] Lustre: Failing over lustre-OST0000
<4>[ 5303.061940] Lustre: Skipped 10 previous similar messages
<4>[ 5303.549967] Lustre: server umount lustre-OST0000 complete
<4>[ 5303.550731] Lustre: Skipped 10 previous similar messages
<3>[ 5314.676109] LustreError: 166-1: MGC192.168.10.216@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
<3>[ 5314.678289] LustreError: Skipped 10 previous similar messages
<6>[ 5317.035395] LDISKFS-fs (loop0): mounted filesystem with ordered data mode. quota=on. Opts:
<6>[ 5320.676659] Lustre: MGS: Connection restored to 192.168.10.216@tcp (at 0@lo)
<6>[ 5320.677600] Lustre: Skipped 109 previous similar messages
<1>[ 5320.679609] BUG: unable to handle kernel paging request at ffff8800b32a2e78
<1>[ 5320.680562] IP: [<ffffffffa0bb5499>] mgc_blocking_ast+0x169/0x810 [mgc]
<4>[ 5320.681565] PGD 1a2e063 PUD 501067 PMD 69b067 PTE 80000000b32a2060
<4>[ 5320.682610] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
<4>[ 5320.683156] last sysfs file: /sys/devices/virtual/block/loop0/queue/scheduler
<4>[ 5320.683156] CPU 6
<4>[ 5320.683156] Modules linked in: lustre ofd osp lod ost mdt mdd mgs osd_ldiskfs ldiskfs exportfs lquota lfsck jbd obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass ksocklnd lnet sha512_generic sha256_generic libcfs ext4 jbd2 mbcache virtio_console virtio_balloon i2c_piix4 i2c_core virtio_blk virtio_net virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache auth_rpcgss nfs_acl sunrpc be2iscsi bnx2i cnic uio cxgb3i libcxgbi ipv6 cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: speedstep_lib]
<4>[ 5320.683156]
<4>[ 5320.683156] Pid: 26321, comm: ll_imp_inval Not tainted 2.6.32-rhe6.7-debug #1 Bochs Bochs
<4>[ 5320.683156] RIP: 0010:[<ffffffffa0bb5499>] [<ffffffffa0bb5499>] mgc_blocking_ast+0x169/0x810 [mgc]
<4>[ 5320.683156] RSP: 0018:ffff880093b53b00 EFLAGS: 00010286
<4>[ 5320.683156] RAX: 0000000000000001 RBX: ffff880039a41db8 RCX: 0000000000000000
<4>[ 5320.683156] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8800b0f9cef0
<4>[ 5320.683156] RBP: ffff880093b53b40 R08: 0000000000000000 R09: 00000000fffffffc
<4>[ 5320.683156] R10: 0000000000000000 R11: 0000000000000002 R12: ffff8800b32a2df0
<4>[ 5320.683156] R13: 001110e400000000 R14: ffff88006df7bf18 R15: 0000002000000000
<4>[ 5320.683156] FS: 0000000000000000(0000) GS:ffff880006380000(0000) knlGS:0000000000000000
<4>[ 5320.683156] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<4>[ 5320.683156] CR2: ffff8800b32a2e78 CR3: 00000000b01a5000 CR4: 00000000000006e0
<4>[ 5320.683156] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[ 5320.683156] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[ 5320.683156] Process ll_imp_inval (pid: 26321, threadinfo ffff880093b50000, task ffff8800655e8440)
<4>[ 5320.683156] Stack:
<4>[ 5320.683156] ffff88006df7bf60 ffff880039a41db8 ffff880093b53b20 ffffffff81530afe
<4>[ 5320.683156] <d> ffff880093b53b40 ffffffffa07b3041 ffff880039a41db8 0000000000000002
<4>[ 5320.683156] <d> ffff880093b53bc0 ffffffffa07b5ad7 ffff880093b53b60 ffff880039a41df0
<4>[ 5320.683156] Call Trace:
<4>[ 5320.683156] [<ffffffff81530afe>] ? _spin_unlock+0xe/0x10
<4>[ 5320.683156] [<ffffffffa07b3041>] ? unlock_res_and_lock+0x41/0x50 [ptlrpc]
<4>[ 5320.683156] [<ffffffffa07b5ad7>] ldlm_cancel_callback+0x87/0x280 [ptlrpc]
<4>[ 5320.683156] [<ffffffffa07d36ea>] ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
<4>[ 5320.683156] [<ffffffffa07d823c>] ldlm_cli_cancel+0x9c/0x3e0 [ptlrpc]
<4>[ 5320.683156] [<ffffffffa07c0a32>] cleanup_resource+0x142/0x370 [ptlrpc]
<4>[ 5320.683156] [<ffffffffa045b86e>] ? cfs_hash_spin_lock+0xe/0x10 [libcfs]
<4>[ 5320.683156] [<ffffffffa07c0c8f>] ldlm_resource_clean+0x2f/0x60 [ptlrpc]
<4>[ 5320.683156] [<ffffffffa045b1ae>] cfs_hash_for_each_relax+0x1fe/0x380 [libcfs]
<4>[ 5320.683156] [<ffffffffa07c0c60>] ? ldlm_resource_clean+0x0/0x60 [ptlrpc]
<4>[ 5320.683156] [<ffffffffa07c0c60>] ? ldlm_resource_clean+0x0/0x60 [ptlrpc]
<4>[ 5320.683156] [<ffffffffa045d14c>] cfs_hash_for_each_nolock+0x8c/0x1d0 [libcfs]
<4>[ 5320.683156] [<ffffffffa07bcc00>] ldlm_namespace_cleanup+0x30/0xc0 [ptlrpc]
<4>[ 5320.683156] [<ffffffffa0bb4487>] mgc_import_event+0x247/0x2a0 [mgc]
<4>[ 5320.683156] [<ffffffffa0820f92>] ptlrpc_invalidate_import+0x312/0x990 [ptlrpc]
<4>[ 5320.683156] [<ffffffffa0455701>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
<4>[ 5320.683156] [<ffffffffa0822bc0>] ? ptlrpc_invalidate_import_thread+0x0/0x2e0 [ptlrpc]
<4>[ 5320.683156] [<ffffffffa0822c08>] ptlrpc_invalidate_import_thread+0x48/0x2e0 [ptlrpc]
<4>[ 5320.683156] [<ffffffff8109f82e>] kthread+0x9e/0xc0
<4>[ 5320.683156] [<ffffffff8100c2ca>] child_rip+0xa/0x20
<4>[ 5320.683156] [<ffffffff8109f790>] ? kthread+0x0/0xc0
<4>[ 5320.683156] [<ffffffff8100c2c0>] ? child_rip+0x0/0x20
<4>[ 5320.683156] Code: 00 01 00 a9 00 00 01 00 74 0d f6 05 e4 ae 8b ff 10 0f 85 9b 02 00 00 a9 00 00 00 01 0f 85 d8 00 00 00 4d 85 e4 0f 84 07 02 00 00 <41> 8b 84 24 88 00 00 00 85 c0 0f 8e 3c 05 00 00 41 f6 84 24 fc
<1>[ 5320.683156] RIP [<ffffffffa0bb5499>] mgc_blocking_ast+0x169/0x810 [mgc]
<4>[ 5320.683156] RSP <ffff880093b53b00>
<4>[ 5320.683156] CR2: ffff8800b32a2e78

Sample crashdump on my node: /exports/crashdumps/192.168.10.216-2015-11-17-12\:31\:13/

This is latest master + http://review.whamcloud.com/#/c/16940/

Apparently this was first hit in the first half of October in my testing. |
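A quick sanity check on the oops data above (this analysis is my own inference, not part of the original report): the faulting instruction bytes `<41> 8b 84 24 88 00 00 00` decode to `mov eax, DWORD PTR [r12+0x88]`, and the fault address in CR2 is exactly R12 + 0x88. So the crash is a 4-byte read at offset 0x88 through the pointer in R12, consistent with that pointer referring to an already-unmapped page (DEBUG_PAGEALLOC unmaps freed pages, and the PTE line shows the page not present). A minimal check of that arithmetic:

```python
# Values taken verbatim from the oops in the Description above.
# The faulting instruction (from the "Code:" bytes, at the <41> marker)
# decodes to: mov eax, DWORD PTR [r12+0x88]
CR2 = 0xffff8800b32a2e78  # faulting (unreadable) address
R12 = 0xffff8800b32a2df0  # base pointer used by the faulting instruction

offset = CR2 - R12
print(hex(offset))  # 0x88 -- matches the 0x88 displacement in the instruction
```

Which struct field sits at offset 0x88 depends on the exact build; mapping it back to source would need the matching mgc module binary (e.g. via `crash` or `addr2line` on `mgc_blocking_ast+0x169`) against the crashdump noted below.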
| Comments |
| Comment by Nathaniel Clark [ 22/May/16 ] |
|
Crash on master. |