Lustre / LU-11996

NMI watchdog in lfsck test 18


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.7
    • Components: None
    • Severity: 3
    • 9223372036854775807

    Description

      Going through maloo crash reports, there's an NMI watchdog case that seems to hit only on b2_10 and related review runs:

      [ 9012.939072] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity-lfsck test 18g: Find out orphan OST-object and repair it \(7\) =============================== 07:44:51 \(1548229491\)
      [ 9013.146289] Lustre: DEBUG MARKER: == sanity-lfsck test 18g: Find out orphan OST-object and repair it (7) =============================== 07:44:51 (1548229491)
      [ 9013.350706] Lustre: 20100:0:(osd_internal.h:1155:osd_trans_exec_op()) lustre-OST0000: opcode 2: before 256 < left 274, rollback = 2
      [ 9013.352818] Lustre: 20100:0:(osd_internal.h:1155:osd_trans_exec_op()) Skipped 16 previous similar messages
      [ 9013.354412] Lustre: 20100:0:(osd_handler.c:1673:osd_trans_dump_creds())   create: 0/0/0, destroy: 0/0/0
      [ 9013.356182] Lustre: 20100:0:(osd_handler.c:1673:osd_trans_dump_creds()) Skipped 16 previous similar messages
      [ 9013.357856] Lustre: 20100:0:(osd_handler.c:1680:osd_trans_dump_creds())   attr_set: 1/1/0, xattr_set: 2/274/0
      [ 9013.359530] Lustre: 20100:0:(osd_handler.c:1680:osd_trans_dump_creds()) Skipped 16 previous similar messages
      [ 9013.361235] Lustre: 20100:0:(osd_handler.c:1690:osd_trans_dump_creds())   write: 0/0/0, punch: 0/0/0, quota 3/3/0
      [ 9013.362959] Lustre: 20100:0:(osd_handler.c:1690:osd_trans_dump_creds()) Skipped 16 previous similar messages
      [ 9013.364612] Lustre: 20100:0:(osd_handler.c:1697:osd_trans_dump_creds())   insert: 0/0/0, delete: 0/0/0
      [ 9013.366183] Lustre: 20100:0:(osd_handler.c:1697:osd_trans_dump_creds()) Skipped 16 previous similar messages
      [ 9013.367853] Lustre: 20100:0:(osd_handler.c:1704:osd_trans_dump_creds())   ref_add: 0/0/0, ref_del: 0/0/0
      [ 9013.369531] Lustre: 20100:0:(osd_handler.c:1704:osd_trans_dump_creds()) Skipped 16 previous similar messages
      [ 9054.660354] LNet: Service thread pid 9025 was inactive for 40.01s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [ 9054.663155] Pid: 9025, comm: ll_ost_out00_00 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Mon Jan 7 20:20:29 UTC 2019
      [ 9054.664802] Call Trace:
      [ 9054.665266]  [<ffffffffc0c6b130>] ptlrpc_set_wait+0x4c0/0x920 [ptlrpc]
      [ 9054.666720]  [<ffffffffc0c6b60d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
      [ 9054.667862]  [<ffffffffc076c4f4>] fld_client_rpc+0x104/0x540 [fld]
      [ 9054.668923]  [<ffffffffc077131f>] fld_server_lookup+0x15f/0x320 [fld]
      [ 9054.670006]  [<ffffffffc0f61a81>] lfsck_orphan_it_next+0x8c1/0xc90 [lfsck]
      [ 9054.671267]  [<ffffffffc0f61ebe>] lfsck_orphan_it_load+0x6e/0x160 [lfsck]
      [ 9054.672419]  [<ffffffffc0a52198>] dt_index_walk+0xf8/0x430 [obdclass]
      [ 9054.673701]  [<ffffffffc0a5365c>] dt_index_read+0x44c/0x6b0 [obdclass]
      [ 9054.674822]  [<ffffffffc0cd8d22>] tgt_obd_idx_read+0x612/0x860 [ptlrpc]
      [ 9054.676042]  [<ffffffffc0cdb37a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      [ 9054.677249]  [<ffffffffc0c83e3b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      [ 9054.678554]  [<ffffffffc0c87582>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [ 9054.679653]  [<ffffffff902bdf21>] kthread+0xd1/0xe0
      [ 9054.680543]  [<ffffffff909255f7>] ret_from_fork_nospec_end+0x0/0x39
      [ 9054.681615]  [<ffffffffffffffff>] 0xffffffffffffffff
      [ 9054.682506] LustreError: dumping log to /tmp/lustre-log.1548229534.9025
      [ 9061.992373] LustreError: 7499:0:(fid_handler.c:337:__seq_server_alloc_meta()) srv-lustre-OST0001: Allocated super-sequence failed: rc = -11
      [ 9061.994622] LustreError: 7499:0:(fid_handler.c:337:__seq_server_alloc_meta()) Skipped 3569 previous similar messages
      [ 9072.297316] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [lfsck:7414]
      [ 9072.302126] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc dm_mod iosf_mbi crc32_pclmul ppdev ghash_clmulni_intel aesni_intel parport_pc lrw gf128mul glue_helper parport ablk_helper virtio_balloon cryptd pcspkr joydev i2c_piix4 i2c_core ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk 8139too ata_piix crct10dif_pclmul crct10dif_common crc32c_intel libata 8139cp
      [ 9072.315961]  serio_raw virtio_pci mii virtio_ring virtio floppy
      [ 9072.316970] CPU: 1 PID: 7414 Comm: lfsck Kdump: loaded Tainted: G           OE  ------------   3.10.0-862.14.4.el7_lustre.x86_64 #1
      [ 9072.318855] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [ 9072.319786] task: ffff9f2669f40000 ti: ffff9f2604b1c000 task.ti: ffff9f2604b1c000
      [ 9072.320985] RIP: 0010:[<ffffffff9055f6a9>]  [<ffffffff9055f6a9>] __write_lock_failed+0x9/0x20
      [ 9072.322414] RSP: 0018:ffff9f2604b1fbd0  EFLAGS: 00000297
      [ 9072.323282] RAX: ffff9f2642722000 RBX: ffff9f2604b1fbd8 RCX: 00000000007b1a63
      [ 9072.324431] RDX: 0000000000000000 RSI: ffff9f2667af8000 RDI: ffff9f2660f3c884
      [ 9072.325580] RBP: ffff9f2604b1fbd0 R08: 000000000001ba80 R09: ffffffffc0f5495d
      [ 9072.326732] R10: ffff9f267fd1ba80 R11: ffffe9e601710340 R12: 0000123100000000
      [ 9072.327877] R13: 00001cf600000000 R14: 00000000000e68c9 R15: 5c481b9300000001
      [ 9072.329023] FS:  0000000000000000(0000) GS:ffff9f267fd00000(0000) knlGS:0000000000000000
      [ 9072.330325] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 9072.331250] CR2: 00007fa602e2f000 CR3: 0000000035df6000 CR4: 00000000000606e0
      [ 9072.332402] Call Trace:
      [ 9072.332829]  [<ffffffff9091b527>] _raw_write_lock+0x17/0x20
      [ 9072.333758]  [<ffffffffc0f542d6>] lfsck_rbtree_cleanup.isra.87+0x46/0x140 [lfsck]
      [ 9072.334976]  [<ffffffffc0f54ac0>] lfsck_layout_slave_quit+0x290/0x340 [lfsck]
      [ 9072.336152]  [<ffffffffc0f5d842>] lfsck_layout_slave_double_scan+0x102/0x1110 [lfsck]
      [ 9072.337430]  [<ffffffff902d2010>] ? wake_up_state+0x20/0x20
      [ 9072.338344]  [<ffffffffc0f2bd1f>] lfsck_double_scan+0x5f/0x210 [lfsck]
      [ 9072.339438]  [<ffffffffc0a308c1>] ? lprocfs_counter_sub+0xc1/0x130 [obdclass]
      [ 9072.340605]  [<ffffffffc0f30946>] lfsck_master_engine+0x4a6/0x1360 [lfsck]
      [ 9072.341717]  [<ffffffff902d2010>] ? wake_up_state+0x20/0x20
      [ 9072.342626]  [<ffffffffc0f304a0>] ? lfsck_master_oit_engine+0x1190/0x1190 [lfsck]
      [ 9072.343838]  [<ffffffff902bdf21>] kthread+0xd1/0xe0
      [ 9072.344633]  [<ffffffff902bde50>] ? insert_kthread_work+0x40/0x40
      [ 9072.345627]  [<ffffffff909255f7>] ret_from_fork_nospec_begin+0x21/0x21
      [ 9072.346676]  [<ffffffff902bde50>] ? insert_kthread_work+0x40/0x40
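
      Reading the two stacks together: the ll_ost_out00_00 service thread is parked in a synchronous FLD RPC (ptlrpc_queue_wait under lfsck_orphan_it_next) while sequence allocation on the OST keeps failing with -11 (EAGAIN), and the lfsck thread is busy-spinning in __write_lock_failed from lfsck_rbtree_cleanup(). One plausible reading is that some thread is sitting on the lfsck rbtree rwlock for read while blocked indefinitely, so the cleanup path's write_lock() spins long enough for the NMI watchdog to fire; the traces alone do not prove the FLD-blocked thread is the actual lock holder. Below is a minimal, hypothetical userspace model of that pattern (illustrative names only; pthread_rwlock_t standing in for the kernel rwlock_t, pause() standing in for ptlrpc_queue_wait()), not actual Lustre source -- in the kernel, write_lock() busy-spins instead of sleeping, which is exactly what the watchdog reports as a soft lockup:

      /* Hypothetical sketch of the suspected lock interaction, not Lustre code.
       * Build with: cc -o lockdemo lockdemo.c -lpthread */
      #include <pthread.h>
      #include <stdio.h>
      #include <unistd.h>

      static pthread_rwlock_t tree_lock = PTHREAD_RWLOCK_INITIALIZER;

      /* Model of the OUT service thread: pin the tree for read, then block
       * on a synchronous RPC that never completes (the FLD lookup failing
       * with -EAGAIN forever). The read lock is never released. */
      static void *orphan_iterator(void *arg)
      {
              pthread_rwlock_rdlock(&tree_lock);
              pause();                        /* stand-in for ptlrpc_queue_wait() */
              pthread_rwlock_unlock(&tree_lock);
              return NULL;
      }

      /* Model of the lfsck thread: the cleanup path needs the write lock to
       * tear the tree down; with a reader stuck, it never gets it. In-kernel
       * this is a busy spin -> "soft lockup - CPU#1 stuck for 23s". */
      static void *rbtree_cleanup(void *arg)
      {
              pthread_rwlock_wrlock(&tree_lock);  /* hangs here forever */
              puts("cleanup got the lock (never reached)");
              pthread_rwlock_unlock(&tree_lock);
              return NULL;
      }

      int main(void)
      {
              pthread_t reader, writer;

              pthread_create(&reader, NULL, orphan_iterator, NULL);
              sleep(1);                       /* let the reader grab the lock */
              pthread_create(&writer, NULL, rbtree_cleanup, NULL);
              pthread_join(writer, NULL);     /* never returns: the demo hangs */
              return 0;
      }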
      

      Sample reports:
      https://testing.whamcloud.com/test_sessions/a4bbd650-ac77-45a7-9d70-1d0eb7d6cfcf
      https://testing.whamcloud.com/test_sessions/455af6f1-1cfe-4820-a7c2-c136ee3eceef

      The very first time it crashed like this was in this review patch: https://testing.whamcloud.com/test_sessions/b8c26f6b-4930-4aa8-937f-eedf046e3427, but the patch itself is unrelated (the failure was miscategorized as LU-11111, which it does not look related to).


          People

            Assignee: WC Triage (wc-triage)
            Reporter: Oleg Drokin (green)
            Votes: 0
            Watchers: 1
