Lustre / LU-7444

Crash in mgc_blocking_ast

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • 3

    Description

      This is somewhat similar to the long-fixed LU-272, but not quite the same.

      I hit this a couple of times today, and I am sure I have seen it before as well, always in replay-single test 74:

      <4>[ 5300.248670] Lustre: DEBUG MARKER: == replay-single test 74: Ensure applications don't fail waiting for OST recovery == 12:30:47 (1447781447)
      <4>[ 5302.637377] Lustre: Unmounted lustre-client
      <4>[ 5303.061113] Lustre: Failing over lustre-OST0000
      <4>[ 5303.061940] Lustre: Skipped 10 previous similar messages
      <4>[ 5303.549967] Lustre: server umount lustre-OST0000 complete
      <4>[ 5303.550731] Lustre: Skipped 10 previous similar messages
      <3>[ 5314.676109] LustreError: 166-1: MGC192.168.10.216@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
      <3>[ 5314.678289] LustreError: Skipped 10 previous similar messages
      <6>[ 5317.035395] LDISKFS-fs (loop0): mounted filesystem with ordered data mode. quota=on. Opts: 
      <6>[ 5320.676659] Lustre: MGS: Connection restored to 192.168.10.216@tcp (at 0@lo)
      <6>[ 5320.677600] Lustre: Skipped 109 previous similar messages
      <1>[ 5320.679609] BUG: unable to handle kernel paging request at ffff8800b32a2e78
      <1>[ 5320.680562] IP: [<ffffffffa0bb5499>] mgc_blocking_ast+0x169/0x810 [mgc]
      <4>[ 5320.681565] PGD 1a2e063 PUD 501067 PMD 69b067 PTE 80000000b32a2060
      <4>[ 5320.682610] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
      <4>[ 5320.683156] last sysfs file: /sys/devices/virtual/block/loop0/queue/scheduler
      <4>[ 5320.683156] CPU 6 
      <4>[ 5320.683156] Modules linked in: lustre ofd osp lod ost mdt mdd mgs osd_ldiskfs ldiskfs exportfs lquota lfsck jbd obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass ksocklnd lnet sha512_generic sha256_generic libcfs ext4 jbd2 mbcache virtio_console virtio_balloon i2c_piix4 i2c_core virtio_blk virtio_net virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache auth_rpcgss nfs_acl sunrpc be2iscsi bnx2i cnic uio cxgb3i libcxgbi ipv6 cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: speedstep_lib]
      <4>[ 5320.683156] 
      <4>[ 5320.683156] Pid: 26321, comm: ll_imp_inval Not tainted 2.6.32-rhe6.7-debug #1 Bochs Bochs
      <4>[ 5320.683156] RIP: 0010:[<ffffffffa0bb5499>]  [<ffffffffa0bb5499>] mgc_blocking_ast+0x169/0x810 [mgc]
      <4>[ 5320.683156] RSP: 0018:ffff880093b53b00  EFLAGS: 00010286
      <4>[ 5320.683156] RAX: 0000000000000001 RBX: ffff880039a41db8 RCX: 0000000000000000
      <4>[ 5320.683156] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8800b0f9cef0
      <4>[ 5320.683156] RBP: ffff880093b53b40 R08: 0000000000000000 R09: 00000000fffffffc
      <4>[ 5320.683156] R10: 0000000000000000 R11: 0000000000000002 R12: ffff8800b32a2df0
      <4>[ 5320.683156] R13: 001110e400000000 R14: ffff88006df7bf18 R15: 0000002000000000
      <4>[ 5320.683156] FS:  0000000000000000(0000) GS:ffff880006380000(0000) knlGS:0000000000000000
      <4>[ 5320.683156] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      <4>[ 5320.683156] CR2: ffff8800b32a2e78 CR3: 00000000b01a5000 CR4: 00000000000006e0
      <4>[ 5320.683156] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      <4>[ 5320.683156] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      <4>[ 5320.683156] Process ll_imp_inval (pid: 26321, threadinfo ffff880093b50000, task ffff8800655e8440)
      <4>[ 5320.683156] Stack:
      <4>[ 5320.683156]  ffff88006df7bf60 ffff880039a41db8 ffff880093b53b20 ffffffff81530afe
      <4>[ 5320.683156] <d> ffff880093b53b40 ffffffffa07b3041 ffff880039a41db8 0000000000000002
      <4>[ 5320.683156] <d> ffff880093b53bc0 ffffffffa07b5ad7 ffff880093b53b60 ffff880039a41df0
      <4>[ 5320.683156] Call Trace:
      <4>[ 5320.683156]  [<ffffffff81530afe>] ? _spin_unlock+0xe/0x10
      <4>[ 5320.683156]  [<ffffffffa07b3041>] ? unlock_res_and_lock+0x41/0x50 [ptlrpc]
      <4>[ 5320.683156]  [<ffffffffa07b5ad7>] ldlm_cancel_callback+0x87/0x280 [ptlrpc]
      <4>[ 5320.683156]  [<ffffffffa07d36ea>] ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
      <4>[ 5320.683156]  [<ffffffffa07d823c>] ldlm_cli_cancel+0x9c/0x3e0 [ptlrpc]
      <4>[ 5320.683156]  [<ffffffffa07c0a32>] cleanup_resource+0x142/0x370 [ptlrpc]
      <4>[ 5320.683156]  [<ffffffffa045b86e>] ? cfs_hash_spin_lock+0xe/0x10 [libcfs]
      <4>[ 5320.683156]  [<ffffffffa07c0c8f>] ldlm_resource_clean+0x2f/0x60 [ptlrpc]
      <4>[ 5320.683156]  [<ffffffffa045b1ae>] cfs_hash_for_each_relax+0x1fe/0x380 [libcfs]
      <4>[ 5320.683156]  [<ffffffffa07c0c60>] ? ldlm_resource_clean+0x0/0x60 [ptlrpc]
      <4>[ 5320.683156]  [<ffffffffa07c0c60>] ? ldlm_resource_clean+0x0/0x60 [ptlrpc]
      <4>[ 5320.683156]  [<ffffffffa045d14c>] cfs_hash_for_each_nolock+0x8c/0x1d0 [libcfs]
      <4>[ 5320.683156]  [<ffffffffa07bcc00>] ldlm_namespace_cleanup+0x30/0xc0 [ptlrpc]
      <4>[ 5320.683156]  [<ffffffffa0bb4487>] mgc_import_event+0x247/0x2a0 [mgc]
      <4>[ 5320.683156]  [<ffffffffa0820f92>] ptlrpc_invalidate_import+0x312/0x990 [ptlrpc]
      <4>[ 5320.683156]  [<ffffffffa0455701>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      <4>[ 5320.683156]  [<ffffffffa0822bc0>] ? ptlrpc_invalidate_import_thread+0x0/0x2e0 [ptlrpc]
      <4>[ 5320.683156]  [<ffffffffa0822c08>] ptlrpc_invalidate_import_thread+0x48/0x2e0 [ptlrpc]
      <4>[ 5320.683156]  [<ffffffff8109f82e>] kthread+0x9e/0xc0
      <4>[ 5320.683156]  [<ffffffff8100c2ca>] child_rip+0xa/0x20
      <4>[ 5320.683156]  [<ffffffff8109f790>] ? kthread+0x0/0xc0
      <4>[ 5320.683156]  [<ffffffff8100c2c0>] ? child_rip+0x0/0x20
      <4>[ 5320.683156] Code: 00 01 00 a9 00 00 01 00 74 0d f6 05 e4 ae 8b ff 10 0f 85 9b 02 00 00 a9 00 00 00 01 0f 85 d8 00 00 00 4d 85 e4 0f 84 07 02 00 00 <41> 8b 84 24 88 00 00 00 85 c0 0f 8e 3c 05 00 00 41 f6 84 24 fc 
      <1>[ 5320.683156] RIP  [<ffffffffa0bb5499>] mgc_blocking_ast+0x169/0x810 [mgc]
      <4>[ 5320.683156]  RSP <ffff880093b53b00>
      <4>[ 5320.683156] CR2: ffff8800b32a2e78
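
As a rough cross-check of the register dump above, the following minimal sketch decodes the faulting instruction bytes (the `<41>` marker in the `Code:` line) and compares the resulting address with CR2. It also decodes the x86 page-fault error code from `Oops: 0000` and the present bit of the PTE. The helper names `decode_disp32` and `describe_error_code` are hypothetical and exist only for this illustration; all numeric values are copied from the log. The result is consistent with a kernel-mode read at offset 0x88 from R12 into a page that is not mapped, which under DEBUG_PAGEALLOC usually points at freed memory. That is an inference from the dump, not a confirmed root cause, and the report does not say which structure R12 refers to.

```python
# Values copied verbatim from the oops above.
R12 = 0xFFFF8800B32A2DF0
CR2 = 0xFFFF8800B32A2E78
PTE = 0x80000000B32A2060
ERROR_CODE = 0x0000  # from "Oops: 0000"

# Faulting instruction bytes, starting at the <41> marker in the Code: line.
FAULT_INSN = bytes.fromhex("418b842488000000")


def decode_disp32(insn: bytes) -> int:
    """Return the 32-bit displacement of a 'mov disp32(%r12), %eax' encoding.

    41 = REX.B prefix, 8b = MOV r32, r/m32,
    84 = ModRM (mod=10, reg=eax, rm=100 -> SIB follows),
    24 = SIB (no index, base=100, which REX.B turns into r12).
    """
    if insn[:4] != bytes.fromhex("418b8424"):
        raise ValueError("not the expected mov disp32(%r12), %eax encoding")
    return int.from_bytes(insn[4:8], "little")


def describe_error_code(code: int) -> dict:
    """Decode the low three bits of an x86 page-fault error code."""
    return {
        "page_present": bool(code & 0x1),  # 0 means the page was not present
        "write": bool(code & 0x2),         # 0 means the access was a read
        "user_mode": bool(code & 0x4),     # 0 means the fault came from kernel mode
    }


offset = decode_disp32(FAULT_INSN)
print(hex(offset), hex(R12 + offset), R12 + offset == CR2)
print(describe_error_code(ERROR_CODE))
print("PTE present bit:", PTE & 0x1)
```

R12 plus the decoded displacement equals CR2, so the faulting access is the load through R12 itself rather than a later dereference.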
      

      Sample crashdump on my node: /exports/crashdumps/192.168.10.216-2015-11-17-12\:31\:13/

      This is the latest master plus http://review.whamcloud.com/#/c/16940/,
      although I have also hit this without that patch.

      Apparently the very first time I hit this in my testing was in the first half of October.

          People

            bobijam Zhenyu Xu
            green Oleg Drokin