Lustre / LU-8332

sanity-hsm soft lock up in journal write


Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.9.0
    • Component/s: None
    • Labels: autotest review-dne
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      sanity-hsm hangs before any tests are run. From the suite_stdout log, we see that there is a problem with the ost4 setup:

      06:11:37:Starting ost4:   /dev/lvm-Role_OSS/P4 /mnt/lustre-ost4
      06:11:37:CMD: onyx-34vm8 mkdir -p /mnt/lustre-ost4; mount -t lustre   		                   /dev/lvm-Role_OSS/P4 /mnt/lustre-ost4
      06:11:37:CMD: onyx-34vm8 /usr/sbin/lctl get_param -n health_check
      06:11:37:CMD: onyx-34vm8 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/usr/lib64/lustre/tests//usr/lib64/lustre/tests:/usr/lib64/lustre/tests:/usr/lib64/lustre/tests//usr/lib64/lustre/tests/utils:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/compat-openmpi16/bin:/usr/bin:/bin:/usr/sbin:/sbin::/sbin:/usr/sbin:/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh set_default_debug \"vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck\" \"all -lnet -lnd -pinger\" 4 
      06:11:37:CMD: onyx-34vm8 e2label /dev/lvm-Role_OSS/P4 				2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
      06:11:37:CMD: onyx-34vm8 e2label /dev/lvm-Role_OSS/P4 				2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
      06:11:37:CMD: onyx-34vm8 sync; sync; sync
      07:11:20:********** Timeout by autotest system **********
      

      On OST4 (vm8), we see:

      06:12:07:[16441.962399] Lustre: DEBUG MARKER: e2label /dev/lvm-Role_OSS/P4 				2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
      06:12:07:[16442.192158] Lustre: DEBUG MARKER: sync; sync; sync
      06:12:07:[16456.204737] Lustre: lustre-OST0003: Export ffff88005981e800 already connecting from 10.2.4.131@tcp
      06:12:07:[16461.204570] Lustre: lustre-OST0003: Export ffff88005981e800 already connecting from 10.2.4.131@tcp
      06:12:07:[16461.207146] Lustre: Skipped 1 previous similar message
      06:12:07:[16466.204701] Lustre: lustre-OST0003: Export ffff88005981e800 already connecting from 10.2.4.131@tcp
      06:12:07:[16466.207151] Lustre: Skipped 1 previous similar message
      06:12:07:[16468.053003] BUG: soft lockup - CPU#0 stuck for 23s! [ll_ost00_003:29743]
      06:12:07:[16468.053003] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) ldiskfs(OE) dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic crct10dif_common ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ppdev parport_pc pcspkr virtio_balloon i2c_piix4 parport nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi cirrus syscopyarea sysfillrect sysimgblt drm_kms_helper ttm virtio_blk 8139too drm serio_raw ata_piix virtio_pci virtio_ring virtio i2c_core 8139cp mii floppy libata
      06:12:07:[16468.053003] CPU: 0 PID: 29743 Comm: ll_ost00_003 Tainted: G           OE  ------------   3.10.0-327.18.2.el7_lustre.x86_64 #1
      06:12:07:[16468.053003] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
      06:12:07:[16468.053003] task: ffff88004b1a5c00 ti: ffff88004cd40000 task.ti: ffff88004cd40000
      06:12:07:[16468.053003] RIP: 0010:[<ffffffff8163d5f2>]  [<ffffffff8163d5f2>] _raw_spin_lock+0x32/0x50
      06:12:07:[16468.053003] RSP: 0018:ffff88004cd435d0  EFLAGS: 00000287
      06:12:07:[16468.053003] RAX: 00000000000003e0 RBX: ffff88007b87e680 RCX: 0000000000000b42
      06:12:07:[16468.053003] RDX: 0000000000000b38 RSI: 0000000000000b38 RDI: ffff880077768ba0
      06:12:07:[16468.053003] RBP: ffff88004cd435d0 R08: c010000000000000 R09: 0035f32b60080000
      06:12:07:[16468.053003] R10: ffac0ce21cc2d802 R11: 0000000000004000 R12: ffff880035f32958
      06:12:07:[16468.053003] R13: ffff880035f32b60 R14: ffffffff8121255b R15: ffff88004cd435e0
      06:12:07:[16468.053003] FS:  0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
      06:12:07:[16468.053003] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      06:12:07:[16468.053003] CR2: 00007f7f78dc54a9 CR3: 000000000194a000 CR4: 00000000000006f0
      06:12:07:[16468.053003] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      06:12:07:[16468.053003] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      06:12:07:[16468.053003] Stack:
      06:12:07:[16468.053003]  ffff88004cd43658 ffffffffa0155bfc ffff88004cd43600 ffffffff81212e6c
      06:12:07:[16468.053003]  00000000811c11da ffff880077768800 0000000000000000 0000000000000025
      06:12:07:[16468.053003]  0000000000000004 0000000000000c35 ffff88007b8ee4b0 000000007608c8d8
      06:12:07:[16468.053003] Call Trace:
      06:12:07:[16468.053003]  [<ffffffffa0155bfc>] do_get_write_access+0x32c/0x4e0 [jbd2]
      06:12:07:[16468.053003]  [<ffffffff81212e6c>] ? __find_get_block+0xbc/0x120
      06:12:07:[16468.053003]  [<ffffffffa0155dd7>] jbd2_journal_get_write_access+0x27/0x40 [jbd2]
      06:12:07:[16468.053003]  [<ffffffffa05e6c0b>] __ldiskfs_journal_get_write_access+0x3b/0x80 [ldiskfs]
      06:12:07:[16468.053003]  [<ffffffffa0620ec7>] __ldiskfs_new_inode+0x447/0x1300 [ldiskfs]
      06:12:07:[16468.053003]  [<ffffffffa05f7038>] ldiskfs_mkdir+0x148/0x280 [ldiskfs]
      06:12:07:[16468.053003]  [<ffffffff811ea557>] vfs_mkdir+0xb7/0x160
      06:12:07:[16468.053003]  [<ffffffffa0c3a5c9>] simple_mkdir.isra.17.constprop.26+0x429/0x4c0 [osd_ldiskfs]
      06:12:07:[16468.053003]  [<ffffffffa0c4ebfe>] osd_seq_load_locked.isra.19+0x19a/0x6a0 [osd_ldiskfs]
      06:12:07:[16468.053003]  [<ffffffff811c13be>] ? kmem_cache_alloc_trace+0x1ce/0x1f0
      06:12:07:[16468.053003]  [<ffffffffa0c3a9c1>] osd_seq_load+0x361/0x520 [osd_ldiskfs]
      06:12:07:[16468.053003]  [<ffffffffa0c3eb96>] osd_obj_spec_lookup+0x66/0x300 [osd_ldiskfs]
      06:12:07:[16468.053003]  [<ffffffffa0c2bf07>] osd_oi_lookup+0x47/0x150 [osd_ldiskfs]
      06:12:07:[16468.053003]  [<ffffffffa0c2886b>] osd_fid_lookup+0x92b/0x1780 [osd_ldiskfs]
      06:12:07:[16468.053003]  [<ffffffff8163d59b>] ? _raw_spin_unlock_bh+0x1b/0x40
      06:12:07:[16468.053003]  [<ffffffffa096f0c2>] ? ksocknal_queue_tx_locked+0x132/0x4d0 [ksocklnd]
      06:12:07:[16468.053003]  [<ffffffff812fc6a2>] ? put_dec+0x72/0x90
      06:12:07:[16468.053003]  [<ffffffff811c13be>] ? kmem_cache_alloc_trace+0x1ce/0x1f0
      06:12:07:[16468.053003]  [<ffffffffa0c29715>] osd_object_init+0x55/0xf0 [osd_ldiskfs]
      06:12:07:[16468.053003]  [<ffffffffa07e42df>] lu_object_alloc+0xdf/0x310 [obdclass]
      06:12:07:[16468.053003]  [<ffffffffa07e46dc>] lu_object_find_try+0x16c/0x2b0 [obdclass]
      06:12:07:[16468.053003]  [<ffffffffa07e48cc>] lu_object_find_at+0xac/0xe0 [obdclass]
      06:12:07:[16468.053003]  [<ffffffffa0a41b06>] ? null_alloc_rs+0x176/0x330 [ptlrpc]
      06:12:07:[16468.053003]  [<ffffffffa07e5c58>] dt_locate_at+0x18/0xb0 [obdclass]
      06:12:07:[16468.053003]  [<ffffffffa07e7c65>] dt_find_or_create+0x55/0x8d0 [obdclass]
      06:12:07:[16468.053003]  [<ffffffff811c13be>] ? kmem_cache_alloc_trace+0x1ce/0x1f0
      06:12:07:[16468.053003]  [<ffffffffa0d69ebc>] ofd_seq_load+0x2ac/0x9c0 [ofd]
      06:12:07:[16468.053003]  [<ffffffffa0d61afa>] ofd_get_info_hdl+0x76a/0x14e0 [ofd]
      06:12:07:[16468.053003]  [<ffffffffa0a66f25>] tgt_request_handle+0x915/0x1320 [ptlrpc]
      06:12:07:[16468.053003]  [<ffffffffa0a134bb>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
      06:12:07:[16468.053003]  [<ffffffffa0a11078>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      06:12:07:[16468.053003]  [<ffffffff810b88b2>] ? default_wake_function+0x12/0x20
      06:12:07:[16468.053003]  [<ffffffff810af018>] ? __wake_up_common+0x58/0x90
      06:12:07:[16468.053003]  [<ffffffffa0a17570>] ptlrpc_main+0xaa0/0x1dd0 [ptlrpc]
      06:12:07:[16468.053003]  [<ffffffffa0a16ad0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
      06:12:07:[16468.053003]  [<ffffffff810a5acf>] kthread+0xcf/0xe0
      06:12:07:[16468.053003]  [<ffffffff810b47e6>] ? finish_task_switch+0x56/0x170
      06:12:07:[16468.053003]  [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140
      06:12:07:[16468.053003]  [<ffffffff81646318>] ret_from_fork+0x58/0x90
      06:12:07:[16468.053003]  [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140
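      In traces like this, the watchdog line identifies both the stall duration and the stuck service thread (here ll_ost00_003, pid 29743, stuck 23s spinning in _raw_spin_lock under do_get_write_access). A minimal sketch for pulling these reports out of a saved console log mechanically (the sample text is copied from the trace above; the helper is hypothetical, not part of any Lustre tooling):

```python
import re

# Sample console output, copied from the OST4 (vm8) trace above.
console = """\
[16468.053003] BUG: soft lockup - CPU#0 stuck for 23s! [ll_ost00_003:29743]
[16468.053003] CPU: 0 PID: 29743 Comm: ll_ost00_003 Tainted: G OE
"""

def find_soft_lockups(text):
    """Return (stall_seconds, comm, pid) for each soft-lockup report."""
    pat = re.compile(r"soft lockup - CPU#\d+ stuck for (\d+)s! \[([^:\]]+):(\d+)\]")
    return [(int(sec), comm, int(pid)) for sec, comm, pid in pat.findall(text)]

print(find_soft_lockups(console))
# → [(23, 'll_ost00_003', 29743)]
```

      Run against the full console log from the autotest results, this flags every lockup report and the thread names involved, which makes it easier to see whether the same jbd2 write-access path recurs across failures.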
      

      Logs for this failure are at:
      https://testing.hpdd.intel.com/test_sets/e6bdb3d2-3bc0-11e6-a0ce-5254006e85c2

      People

        Assignee: bzzz Alex Zhuravlev
        Reporter: jamesanunez James Nunez (Inactive)
        Votes: 0
        Watchers: 8
