[LU-11996] NMI watchdog in lfsck test 18 Created: 24/Feb/19  Updated: 24/Feb/19

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.7
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Oleg Drokin Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

Description

Going through Maloo crash reports, there's this NMI watchdog case that seems to hit only on b2_10 and related review runs:

[ 9012.939072] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity-lfsck test 18g: Find out orphan OST-object and repair it \(7\) =============================== 07:44:51 \(1548229491\)
[ 9013.146289] Lustre: DEBUG MARKER: == sanity-lfsck test 18g: Find out orphan OST-object and repair it (7) =============================== 07:44:51 (1548229491)
[ 9013.350706] Lustre: 20100:0:(osd_internal.h:1155:osd_trans_exec_op()) lustre-OST0000: opcode 2: before 256 < left 274, rollback = 2
[ 9013.352818] Lustre: 20100:0:(osd_internal.h:1155:osd_trans_exec_op()) Skipped 16 previous similar messages
[ 9013.354412] Lustre: 20100:0:(osd_handler.c:1673:osd_trans_dump_creds())   create: 0/0/0, destroy: 0/0/0
[ 9013.356182] Lustre: 20100:0:(osd_handler.c:1673:osd_trans_dump_creds()) Skipped 16 previous similar messages
[ 9013.357856] Lustre: 20100:0:(osd_handler.c:1680:osd_trans_dump_creds())   attr_set: 1/1/0, xattr_set: 2/274/0
[ 9013.359530] Lustre: 20100:0:(osd_handler.c:1680:osd_trans_dump_creds()) Skipped 16 previous similar messages
[ 9013.361235] Lustre: 20100:0:(osd_handler.c:1690:osd_trans_dump_creds())   write: 0/0/0, punch: 0/0/0, quota 3/3/0
[ 9013.362959] Lustre: 20100:0:(osd_handler.c:1690:osd_trans_dump_creds()) Skipped 16 previous similar messages
[ 9013.364612] Lustre: 20100:0:(osd_handler.c:1697:osd_trans_dump_creds())   insert: 0/0/0, delete: 0/0/0
[ 9013.366183] Lustre: 20100:0:(osd_handler.c:1697:osd_trans_dump_creds()) Skipped 16 previous similar messages
[ 9013.367853] Lustre: 20100:0:(osd_handler.c:1704:osd_trans_dump_creds())   ref_add: 0/0/0, ref_del: 0/0/0
[ 9013.369531] Lustre: 20100:0:(osd_handler.c:1704:osd_trans_dump_creds()) Skipped 16 previous similar messages
[ 9054.660354] LNet: Service thread pid 9025 was inactive for 40.01s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 9054.663155] Pid: 9025, comm: ll_ost_out00_00 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Mon Jan 7 20:20:29 UTC 2019
[ 9054.664802] Call Trace:
[ 9054.665266]  [<ffffffffc0c6b130>] ptlrpc_set_wait+0x4c0/0x920 [ptlrpc]
[ 9054.666720]  [<ffffffffc0c6b60d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
[ 9054.667862]  [<ffffffffc076c4f4>] fld_client_rpc+0x104/0x540 [fld]
[ 9054.668923]  [<ffffffffc077131f>] fld_server_lookup+0x15f/0x320 [fld]
[ 9054.670006]  [<ffffffffc0f61a81>] lfsck_orphan_it_next+0x8c1/0xc90 [lfsck]
[ 9054.671267]  [<ffffffffc0f61ebe>] lfsck_orphan_it_load+0x6e/0x160 [lfsck]
[ 9054.672419]  [<ffffffffc0a52198>] dt_index_walk+0xf8/0x430 [obdclass]
[ 9054.673701]  [<ffffffffc0a5365c>] dt_index_read+0x44c/0x6b0 [obdclass]
[ 9054.674822]  [<ffffffffc0cd8d22>] tgt_obd_idx_read+0x612/0x860 [ptlrpc]
[ 9054.676042]  [<ffffffffc0cdb37a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[ 9054.677249]  [<ffffffffc0c83e3b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[ 9054.678554]  [<ffffffffc0c87582>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[ 9054.679653]  [<ffffffff902bdf21>] kthread+0xd1/0xe0
[ 9054.680543]  [<ffffffff909255f7>] ret_from_fork_nospec_end+0x0/0x39
[ 9054.681615]  [<ffffffffffffffff>] 0xffffffffffffffff
[ 9054.682506] LustreError: dumping log to /tmp/lustre-log.1548229534.9025
[ 9061.992373] LustreError: 7499:0:(fid_handler.c:337:__seq_server_alloc_meta()) srv-lustre-OST0001: Allocated super-sequence failed: rc = -11
[ 9061.994622] LustreError: 7499:0:(fid_handler.c:337:__seq_server_alloc_meta()) Skipped 3569 previous similar messages
[ 9072.297316] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [lfsck:7414]
[ 9072.302126] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc dm_mod iosf_mbi crc32_pclmul ppdev ghash_clmulni_intel aesni_intel parport_pc lrw gf128mul glue_helper parport ablk_helper virtio_balloon cryptd pcspkr joydev i2c_piix4 i2c_core ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk 8139too ata_piix crct10dif_pclmul crct10dif_common crc32c_intel libata 8139cp
[ 9072.315961]  serio_raw virtio_pci mii virtio_ring virtio floppy
[ 9072.316970] CPU: 1 PID: 7414 Comm: lfsck Kdump: loaded Tainted: G           OE  ------------   3.10.0-862.14.4.el7_lustre.x86_64 #1
[ 9072.318855] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 9072.319786] task: ffff9f2669f40000 ti: ffff9f2604b1c000 task.ti: ffff9f2604b1c000
[ 9072.320985] RIP: 0010:[<ffffffff9055f6a9>]  [<ffffffff9055f6a9>] __write_lock_failed+0x9/0x20
[ 9072.322414] RSP: 0018:ffff9f2604b1fbd0  EFLAGS: 00000297
[ 9072.323282] RAX: ffff9f2642722000 RBX: ffff9f2604b1fbd8 RCX: 00000000007b1a63
[ 9072.324431] RDX: 0000000000000000 RSI: ffff9f2667af8000 RDI: ffff9f2660f3c884
[ 9072.325580] RBP: ffff9f2604b1fbd0 R08: 000000000001ba80 R09: ffffffffc0f5495d
[ 9072.326732] R10: ffff9f267fd1ba80 R11: ffffe9e601710340 R12: 0000123100000000
[ 9072.327877] R13: 00001cf600000000 R14: 00000000000e68c9 R15: 5c481b9300000001
[ 9072.329023] FS:  0000000000000000(0000) GS:ffff9f267fd00000(0000) knlGS:0000000000000000
[ 9072.330325] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9072.331250] CR2: 00007fa602e2f000 CR3: 0000000035df6000 CR4: 00000000000606e0
[ 9072.332402] Call Trace:
[ 9072.332829]  [<ffffffff9091b527>] _raw_write_lock+0x17/0x20
[ 9072.333758]  [<ffffffffc0f542d6>] lfsck_rbtree_cleanup.isra.87+0x46/0x140 [lfsck]
[ 9072.334976]  [<ffffffffc0f54ac0>] lfsck_layout_slave_quit+0x290/0x340 [lfsck]
[ 9072.336152]  [<ffffffffc0f5d842>] lfsck_layout_slave_double_scan+0x102/0x1110 [lfsck]
[ 9072.337430]  [<ffffffff902d2010>] ? wake_up_state+0x20/0x20
[ 9072.338344]  [<ffffffffc0f2bd1f>] lfsck_double_scan+0x5f/0x210 [lfsck]
[ 9072.339438]  [<ffffffffc0a308c1>] ? lprocfs_counter_sub+0xc1/0x130 [obdclass]
[ 9072.340605]  [<ffffffffc0f30946>] lfsck_master_engine+0x4a6/0x1360 [lfsck]
[ 9072.341717]  [<ffffffff902d2010>] ? wake_up_state+0x20/0x20
[ 9072.342626]  [<ffffffffc0f304a0>] ? lfsck_master_oit_engine+0x1190/0x1190 [lfsck]
[ 9072.343838]  [<ffffffff902bdf21>] kthread+0xd1/0xe0
[ 9072.344633]  [<ffffffff902bde50>] ? insert_kthread_work+0x40/0x40
[ 9072.345627]  [<ffffffff909255f7>] ret_from_fork_nospec_begin+0x21/0x21
[ 9072.346676]  [<ffffffff902bde50>] ? insert_kthread_work+0x40/0x40
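
The RIP in __write_lock_failed together with the _raw_write_lock frame suggests the lfsck thread is spinning for the write side of an lfsck rbtree rwlock while the read side stays held, possibly by the ll_ost_out00_00 thread in the earlier trace, which is blocked inside lfsck_orphan_it_next() waiting on an FLD RPC. Below is a minimal userspace sketch of that pattern (a pthread analogy only, not Lustre code; the rbtree_lock name and thread counts are invented for illustration). The key difference is that a kernel rwlock writer spins with preemption disabled rather than sleeping, so a long enough spin trips the soft-lockup watchdog, here after 23s.

/*
 * Userspace analogy of the suspected lockup: readers hold an rwlock
 * across a slow operation while a writer waits for exclusive access.
 * Build with: cc -pthread demo.c
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-in for the lfsck rbtree lock; name is illustrative only. */
static pthread_rwlock_t rbtree_lock = PTHREAD_RWLOCK_INITIALIZER;
static volatile int stop;

static void *reader(void *arg)
{
	(void)arg;
	while (!stop) {
		pthread_rwlock_rdlock(&rbtree_lock);
		usleep(1000);	/* hold the lock while "scanning", like an
				 * iterator stuck behind a slow RPC */
		pthread_rwlock_unlock(&rbtree_lock);
	}
	return NULL;
}

int main(void)
{
	pthread_t readers[4];
	int i;

	for (i = 0; i < 4; i++)
		pthread_create(&readers[i], NULL, reader, NULL);

	sleep(1);
	/*
	 * Writer side: pthread_rwlock_wrlock() sleeps until all readers
	 * drain. The kernel equivalent (_raw_write_lock) busy-spins in
	 * __write_lock_failed with preemption off; if no reader ever
	 * drops the lock, the CPU never schedules and the soft-lockup
	 * watchdog fires, as in the trace above.
	 */
	pthread_rwlock_wrlock(&rbtree_lock);
	printf("writer finally acquired the lock\n");
	pthread_rwlock_unlock(&rbtree_lock);

	stop = 1;
	for (i = 0; i < 4; i++)
		pthread_join(readers[i], NULL);
	return 0;
}

In this pthread version the writer eventually wakes once the readers let go. In the trace above there appears to be no drain: the reader side is itself stuck behind a failing FLD RPC (note the repeated __seq_server_alloc_meta() rc = -11 messages), so the lfsck writer spins indefinitely.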

Sample reports:
https://testing.whamcloud.com/test_sessions/a4bbd650-ac77-45a7-9d70-1d0eb7d6cfcf
https://testing.whamcloud.com/test_sessions/455af6f1-1cfe-4820-a7c2-c136ee3eceef

The very first time it crashed like this was in this review patch: https://testing.whamcloud.com/test_sessions/b8c26f6b-4930-4aa8-937f-eedf046e3427 but the patch itself is unrelated (the failure was miscategorized as LU-11111, but it does not look related).
