Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-10198

GPF llog_osd_declare_write_rec+0xb6/0x3d0

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.14.0, Lustre 2.12.5
    • None
    • 3
    • 9223372036854775807

    Description

      [ 1890.299266] LustreError: 19229:0:(mdd_device.c:263:llog_changelog_cancel()) lustre-MDD0000: cancel idx 23 of catalog [0x5:0xa:0x0]: rc = -2
      [ 1890.291625] Modules linked in: loop lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) mbcache lquota(OE) lfsck(OE) jbd2 obdecho(OE) mgc(OE) lov(OE) osc(OE) mdc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter vboxsf(OE) snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm iosf_mbi
      [ 1890.308621]  crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper snd_timer ppdev pcspkr snd ablk_helper soundcore cryptd virtio_net sg vboxvideo(OE) vboxguest(OE) ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops video drm i2c_piix4 parport_pc parport i2c_core nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic sr_mod cdrom ata_generic pata_acpi ahci libahci e1000 ata_piix crct10dif_pclmul crct10dif_common crc32c_intel libata serio_raw virtio_pci virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod
      [ 1890.317566] CPU: 3 PID: 19257 Comm: lctl Tainted: G           OE  ------------   3.10.0-693.2.2.el7_lustre.x86_64 #1
      [ 1890.320750] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      [ 1890.322062] task: ffff88007beddee0 ti: ffff880075950000 task.ti: ffff880075950000
      [ 1890.323310] RIP: 0010:[<ffffffffc07ba9b6>]  [<ffffffffc07ba9b6>] llog_osd_declare_write_rec+0xb6/0x3d0 [obdclass]
      [ 1890.325834] RSP: 0018:ffff8800759537a8  EFLAGS: 00010206
      [ 1890.327262] RAX: 5a5a5a5a5a5a5a5a RBX: ffff880067977000 RCX: 0000000000000000
      [ 1890.328622] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88007a9b2580
      [ 1890.329976] RBP: ffff8800759537d8 R08: 0000000000000001 R09: 000000000000ffef
      [ 1890.331375] R10: 0000000000002000 R11: 000000000000000f R12: ffff880075953d88
      [ 1890.332884] R13: ffff880067d70840 R14: ffff880067977078 R15: ffff8800671e1c00
      [ 1890.334374] FS:  00007f173df09740(0000) GS:ffff88007fd80000(0000) knlGS:0000000000000000
      [ 1890.336313] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [ 1890.338548] CR2: 00007fbcc671ae40 CR3: 000000006c198000 CR4: 00000000000406e0
      [ 1890.339733] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 1890.340838] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 1890.341975] Stack:
      [ 1890.343053]  ffff8800000029b9 ffff8800671e1200 ffffffffc1057100 ffff880075953d88
      [ 1890.344121]  ffff88006a054000 00000000000029b9 ffff880075953818 ffffffffc07af174
      [ 1890.345200]  ffff880067d70840 ffff8800671e1200 ffff880067d70840 ffff880075953d88
      [ 1890.346339] Call Trace:
      [ 1890.347426]  [<ffffffffc07af174>] llog_declare_write_rec+0x84/0x200 [obdclass]
      [ 1890.348452]  [<ffffffffc07af8f2>] llog_cancel_rec+0xe2/0x870 [obdclass]
      [ 1890.349732]  [<ffffffffc07b66fa>] llog_cat_cancel_records+0x13a/0x2e0 [obdclass]
      [ 1890.350742]  [<ffffffffc0662c47>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [ 1890.351750]  [<ffffffffc10143d4>] llog_changelog_cancel_cb+0xe4/0x1d0 [mdd]
      [ 1890.353074]  [<ffffffffc07b08f5>] llog_process_thread+0x875/0x1460 [obdclass]
      [ 1890.354407]  [<ffffffffc10142f0>] ? mdd_device_free+0x2a0/0x2a0 [mdd]
      [ 1890.355508]  [<ffffffffc07b159c>] llog_process_or_fork+0xbc/0x450 [obdclass]
      [ 1890.356457]  [<ffffffffc07b6ad9>] llog_cat_process_cb+0x239/0x250 [obdclass]
      [ 1890.357413]  [<ffffffffc07b08f5>] llog_process_thread+0x875/0x1460 [obdclass]
      [ 1890.358301]  [<ffffffffc07b68a0>] ? llog_cat_cancel_records+0x2e0/0x2e0 [obdclass]
      [ 1890.359194]  [<ffffffffc07b159c>] llog_process_or_fork+0xbc/0x450 [obdclass]
      [ 1890.360036]  [<ffffffffc07b68a0>] ? llog_cat_cancel_records+0x2e0/0x2e0 [obdclass]
      [ 1890.360885]  [<ffffffffc07b5e69>] llog_cat_process_or_fork+0x199/0x2a0 [obdclass]
      [ 1890.361731]  [<ffffffffc10142f0>] ? mdd_device_free+0x2a0/0x2a0 [mdd]
      [ 1890.362529]  [<ffffffffc10142f0>] ? mdd_device_free+0x2a0/0x2a0 [mdd]
      [ 1890.363344]  [<ffffffffc07b5f9e>] llog_cat_process+0x2e/0x30 [obdclass]
      ....
      
      

      I can reproduce this using this test https://review.whamcloud.com/#/c/29035/ in loop mode.

         10986:       4d 8b 7e 70             mov    0x70(%r14),%r15
         1098a:       4d 85 ff                test   %r15,%r15
         1098d:       0f 84 9c 02 00 00       je     10c2f <llog_osd_declare_write_rec+0x32f>
         10993:       44 89 d0                mov    %r10d,%eax
         10996:       48 c7 43 78 00 00 00    movq   $0x0,0x78(%rbx)
         1099d:       00
         1099e:       4c 8d 73 78             lea    0x78(%rbx),%r14
         109a2:       48 89 83 80 00 00 00    mov    %rax,0x80(%rbx)
         109a9:       49 8b 47 30             mov    0x30(%r15),%rax
         109ad:       48 85 c0                test   %rax,%rax
         109b0:       0f 84 7a 01 00 00       je     10b30 <llog_osd_declare_write_rec+0x230>
         109b6:       48 8b 40 08             mov    0x8(%rax),%rax
      
      
      

      I have done quick analyze, fault happened when accessing
      llog_handle->dt_object->do_body_ops
      it contains poison (rax) 0x5a5a5a5a5a..
      It is great that dt_declare_record_write check for NULL pointer
      LASSERT(dt->do_body_ops);
      maybe poison check is required here also? It will simplify analyze of future fails. 

      Attachments

        Issue Links

          Activity

            People

              tappro Mikhail Pershin
              aboyko Alexander Boyko
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: