Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version/s: Lustre 2.12.4
    • Environment: Dell R740, CentOS 7.8, kernel 3.10.0-1127.el7.x86_64, lustre-2.12.4-1.el7.x86_64, zfs-0.7.13-1.el7.x86_64, spl-0.7.13-1.el7.x86_64
    • Severity: 3

    Description

      Hi Folks,

      We recently upgraded our Lustre ZFS servers at SUT and have been experiencing an issue with the ZFS filesystem crashing. Last week we upgraded from Lustre 2.10.5 (plus a dozen patches) and ZFS 0.7.9 to Lustre 2.12.4 and ZFS 0.7.13.

      Now if we import and mount our main zfs/lustre filesystem, then resume Slurm jobs and move on to starting the Slurm partitions, we'll hit a kernel panic on the MDS shortly after the partitions are up:


      May  8 20:12:37 warble2 kernel: VERIFY(dnode_add_ref(dn, (void *)(uintptr_t)tx->tx_txg)) failed
      May  8 20:12:37 warble2 kernel: PANIC at dnode.c:1635:dnode_setdirty()
      May  8 20:12:37 warble2 kernel: Showing stack for process 45209
      May  8 20:12:37 warble2 kernel: CPU: 7 PID: 45209 Comm: mdt01_123 Tainted: P           OE  ------------   3.10.0-1127.el7.x86_64 #1
      May  8 20:12:37 warble2 kernel: Hardware name: Dell Inc. PowerEdge R740/0JM3W2, BIOS 2.5.4 01/13/2020
      May  8 20:12:37 warble2 kernel: Call Trace:
      May  8 20:12:37 warble2 kernel: [<ffffffff9077ff85>] dump_stack+0x19/0x1b
      May  8 20:12:37 warble2 kernel: [<ffffffffc04d4f24>] spl_dumpstack+0x44/0x50 [spl]
      May  8 20:12:37 warble2 kernel: [<ffffffffc04d4ff9>] spl_panic+0xc9/0x110 [spl]
      May  8 20:12:37 warble2 kernel: [<ffffffff900c7780>] ? wake_up_atomic_t+0x30/0x30
      May  8 20:12:37 warble2 kernel: [<ffffffffc0c21073>] ? dbuf_rele_and_unlock+0x283/0x4c0 [zfs]
      May  8 20:12:37 warble2 kernel: [<ffffffffc04d0238>] ? spl_kmem_zalloc+0xd8/0x180 [spl]
      May  8 20:12:37 warble2 kernel: [<ffffffff90784002>] ? mutex_lock+0x12/0x2f
      May  8 20:12:37 warble2 kernel: [<ffffffffc0c31a2c>] ? dmu_objset_userquota_get_ids+0x23c/0x440 [zfs]
      May  8 20:12:37 warble2 kernel: [<ffffffffc0c40f39>] dnode_setdirty+0xe9/0xf0 [zfs]
      May  8 20:12:37 warble2 kernel: [<ffffffffc0c4120c>] dnode_allocate+0x18c/0x230 [zfs]
      May  8 20:12:37 warble2 kernel: [<ffffffffc0c2dd2b>] dmu_object_alloc_dnsize+0x34b/0x3e0 [zfs]
      May  8 20:12:37 warble2 kernel: [<ffffffffc1630032>] __osd_object_create+0x82/0x170 [osd_zfs]
      May  8 20:12:37 warble2 kernel: [<ffffffffc163027b>] osd_mksym+0x6b/0x110 [osd_zfs]
      May  8 20:12:37 warble2 kernel: [<ffffffff907850c2>] ? down_write+0x12/0x3d
      May  8 20:12:37 warble2 kernel: [<ffffffffc162b966>] osd_create+0x316/0xaf0 [osd_zfs]
      May  8 20:12:37 warble2 kernel: [<ffffffffc18ed9c5>] lod_sub_create+0x1f5/0x480 [lod]
      May  8 20:12:37 warble2 kernel: [<ffffffffc18de179>] lod_create+0x69/0x340 [lod]
      May  8 20:12:37 warble2 kernel: [<ffffffffc1622690>] ? osd_trans_create+0x410/0x410 [osd_zfs]
      May  8 20:12:37 warble2 kernel: [<ffffffffc1958173>] mdd_create_object_internal+0xc3/0x300 [mdd]
      May  8 20:12:37 warble2 kernel: [<ffffffffc194122b>] mdd_create_object+0x7b/0x820 [mdd]
      May  8 20:12:37 warble2 kernel: [<ffffffffc194b7b8>] mdd_create+0xdd8/0x14a0 [mdd]
      May  8 20:12:37 warble2 kernel: [<ffffffffc17d96d4>] mdt_create+0xb54/0x1090 [mdt]
      May  8 20:12:37 warble2 kernel: [<ffffffffc119ae94>] ? lprocfs_stats_lock+0x24/0xd0 [obdclass]
      May  8 20:12:37 warble2 kernel: [<ffffffffc17d9d7b>] mdt_reint_create+0x16b/0x360 [mdt]
      May  8 20:12:37 warble2 kernel: [<ffffffffc17dc963>] mdt_reint_rec+0x83/0x210 [mdt]
      May  8 20:12:37 warble2 kernel: [<ffffffffc17b9273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      May  8 20:12:37 warble2 kernel: [<ffffffffc17c46e7>] mdt_reint+0x67/0x140 [mdt]
      May  8 20:12:37 warble2 kernel: [<ffffffffc14af64a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      May  8 20:12:37 warble2 kernel: [<ffffffffc1488d91>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
      May  8 20:12:37 warble2 kernel: [<ffffffffc07dcbde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
      May  8 20:12:37 warble2 kernel: [<ffffffffc145447b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      May  8 20:12:37 warble2 kernel: [<ffffffffc1451295>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
      May  8 20:12:37 warble2 kernel: [<ffffffff900d3dc3>] ? __wake_up+0x13/0x20
      May  8 20:12:37 warble2 kernel: [<ffffffffc1457de4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      May  8 20:12:37 warble2 kernel: [<ffffffffc14572b0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
      May  8 20:12:37 warble2 kernel: [<ffffffff900c6691>] kthread+0xd1/0xe0
      May  8 20:12:37 warble2 kernel: [<ffffffff900c65c0>] ? insert_kthread_work+0x40/0x40
      May  8 20:12:37 warble2 kernel: [<ffffffff90792d1d>] ret_from_fork_nospec_begin+0x7/0x21
      May  8 20:12:37 warble2 kernel: [<ffffffff900c65c0>] ? insert_kthread_work+0x40/0x40
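
      For reference, the assertion that fires is the "dirty hold" taken near the end of dnode_setdirty(). Roughly, from the OpenZFS 0.7.x sources (paraphrased and trimmed, not verbatim):

      /* dnode.c (OpenZFS 0.7.x), paraphrased and trimmed */
      void
      dnode_setdirty(dnode_t *dn, dmu_tx_t *tx)
      {
              objset_t *os = dn->dn_objset;

              /* ... mark the dnode dirty and put it on this txg's dirty list ... */

              /*
               * Take a "dirty hold" so the dnode stays around until syncing
               * context has processed it.  dnode_add_ref() returns B_FALSE if
               * the dnode's hold count is already zero, and the VERIFY turns
               * that into the PANIC at dnode.c:1635 seen above.
               */
              VERIFY(dnode_add_ref(dn, (void *)(uintptr_t)tx->tx_txg));
              (void) dbuf_dirty(dn->dn_dbuf, tx);
              dsl_dataset_dirty(os->os_dsl_dataset, tx);
      }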

      This issue came up once last week, and twice tonight. We note there's a bit of chatter over at https://github.com/openzfs/zfs/issues/8705, but no real feedback yet, and it's been open for some time now. Are there any recommendations, from the Lustre developers' experience, on how we might mitigate this particular problem?
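
      For what it's worth, the reason that VERIFY can fail at all is visible in dnode_add_ref() itself: it refuses to take a new hold on a dnode whose hold count has already dropped to zero, i.e. one that is concurrently being torn down, which looks like the race being discussed in that openzfs issue. Again paraphrased from the 0.7.x sources:

      /* dnode.c (OpenZFS 0.7.x), paraphrased and trimmed */
      boolean_t
      dnode_add_ref(dnode_t *dn, void *tag)
      {
              mutex_enter(&dn->dn_mtx);
              if (refcount_is_zero(&dn->dn_holds)) {
                      /*
                       * No existing holds: the dnode may already be on its
                       * way out, so decline the new hold.  This B_FALSE is
                       * what trips the VERIFY in dnode_setdirty().
                       */
                      mutex_exit(&dn->dn_mtx);
                      return (B_FALSE);
              }
              VERIFY(1 < refcount_add(&dn->dn_holds, tag));
              mutex_exit(&dn->dn_mtx);
              return (B_TRUE);
      }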

      Right now we're cloning our server image to include ZFS 0.8.3 to see if that will help.


      Cheers,

      Simon


          People

            Assignee: Alex Zhuravlev (bzzz)
            Reporter: SC Admin (scadmin)