Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-9304

BUG: Bad page state in process ll_ost_io01_013 pfn:1a01bcd kernel BUG at include/linux/scatterlist.h:65!

    XMLWordPrintable

Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • Lustre 2.10.0
    • None
    • 3
    • 9223372036854775807

    Description

      Running 4 Lustre Clients, 2 OSS nodes each with 1 zpool, and 1 mds.
      This OSS node:

      1. zpool status -v
        pool: ost0
        state: ONLINE
        scan: none requested
        config:

      NAME STATE READ WRITE CKSUM
      ost0 ONLINE 0 0 0
      draid1-0

      {any}

      ONLINE 0 0 0
      mpathaj ONLINE 0 0 0
      mpathai ONLINE 0 0 0
      mpathah ONLINE 0 0 0
      mpathag ONLINE 0 0 0
      mpathaq ONLINE 0 0 0
      mpathap ONLINE 0 0 0
      mpathak ONLINE 0 0 0
      mpathz ONLINE 0 0 0
      mpatham ONLINE 0 0 0
      mpathal ONLINE 0 0 0
      mpathao ONLINE 0 0 0
      spares
      $draid1-0-s0 AVAIL

      errors: No known data errors

      This build of zfs was from coral-prototype branch and Lustre was a Lustre Master from Dec 1st.

      We were running our file system aging utility: FileAger.py (1-2 copies on each of the 4 client nodes) along an IOR: mpirun -wdir /mnt/lustre/ -np 4 -rr -machinefile hosts -env I_MPI_EXTRA_FILESYSTEM=on -env I_MPI_EXTRA_FILESYSTEM_LIST=lustre /home/johnsali/wolf-3/ior/src/ior -a POSIX -F -N 4 -d 2 -i 1 -s 20000 -b 16MB -t 16MB -k -w -r

      While this was running it appears we hit this failure.

      [159898.950714] BUG: Bad page state in process ll_ost_io01_013 pfn:1a01bcd
      [159898.960045] page:ffffea006806f340 count:-1 mapcount:0 mapping: (null) index:0x0
      [159898.970667] page flags: 0x6fffff00000000()
      [159898.976808] page dumped because: nonzero _count
      [159898.983412] Modules linked in: nfsv3 nfs_acl raid10 osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) zfs(OE) zunicode(OE) zavl(OE) zcommon(OE) znvpair(OE) spl(OE) zlib_deflate lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) sha512_generic crypto_null rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache xprtrdma sunrpc ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ses dm_service_time enclosure intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd mpt3sas ipmi_devintf ipmi_ssif ipmi_si
      [159899.072452] raid_class sb_edac iTCO_wdt iTCO_vendor_support scsi_transport_sas sg edac_core pcspkr ipmi_msghandler wmi ioatdma mei_me mei lpc_ich shpchp i2c_i801 mfd_core acpi_pad acpi_power_meter dm_multipath dm_mod ip_tables ext4 mbcache jbd2 mlx4_ib mlx4_en ib_sa vxlan ib_mad ip6_udp_tunnel udp_tunnel ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper crct10dif_pclmul igb crct10dif_common ttm ptp crc32c_intel ahci pps_core drm mlx4_core libahci dca i2c_algo_bit libata i2c_core [last unloaded: zunicode]
      [159899.135473] CPU: 57 PID: 98747 Comm: ll_ost_io01_013 Tainted: G IOE ------------ 3.10.0-327.36.3.el7.x86_64 #1
      [159899.149461] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
      [159899.162801] ffffea006806f340 00000000424e76b3 ffff880f9e233908 ffffffff81636431
      [159899.172821] ffff880f9e233930 ffffffff81631645 ffffea006806f340 0000000000000000
      [159899.182870] 000fffff00000000 ffff880f9e233978 ffffffff811714dd fff00000fe000000
      [159899.192895] Call Trace:
      [159899.197269] [<ffffffff81636431>] dump_stack+0x19/0x1b
      [159899.204667] [<ffffffff81631645>] bad_page.part.59+0xdf/0xfc
      [159899.212639] [<ffffffff811714dd>] free_pages_prepare+0x16d/0x190
      [159899.220965] [<ffffffff81171e21>] free_hot_cold_page+0x31/0x140
      [159899.229171] [<ffffffff8117200f>] __free_pages+0x3f/0x60
      [159899.236690] [<ffffffffa100bad3>] osd_bufs_put+0x123/0x1f0 [osd_zfs]
      [159899.245372] [<ffffffffa118284a>] ofd_commitrw_write+0xea/0x1c20 [ofd]
      [159899.254234] [<ffffffffa1186f2d>] ofd_commitrw+0x51d/0xa40 [ofd]
      [159899.262551] [<ffffffffa0d538d5>] obd_commitrw+0x2ec/0x32f [ptlrpc]
      [159899.271488] [<ffffffffa0d2bf71>] tgt_brw_write+0xea1/0x1640 [ptlrpc]
      [159899.280509] [<ffffffff810c15cc>] ? update_curr+0xcc/0x150
      [159899.288372] [<ffffffff810be46e>] ? account_entity_dequeue+0xae/0xd0
      [159899.297010] [<ffffffffa0c82560>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
      [159899.306746] [<ffffffffa0d28225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
      [159899.316058] [<ffffffffa0cd41ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
      [159899.326348] [<ffffffffa0967128>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
      [159899.335679] [<ffffffffa0cd1d68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      [159899.345029] [<ffffffff810b8952>] ? default_wake_function+0x12/0x20
      [159899.353394] [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90
      [159899.361264] [<ffffffffa0cd8260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
      [159899.369596] [<ffffffffa0cd77c0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
      [159899.379160] [<ffffffff810a5b8f>] kthread+0xcf/0xe0
      [159899.385881] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
      [159899.394413] [<ffffffff81646a98>] ret_from_fork+0x58/0x90
      [159899.401653] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
      [159899.410157] Disabling lock debugging due to kernel taint
      [163012.964891] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0000 from 12345-192.168.1.8@o2ib inode [0x200000406:0x3c5:0x0] object 0x0:44785 extent [67108864-80752639]: client csum 7f08fe36, server csum f8fbfe4c
      [163012.990138] LustreError: Skipped 2 previous similar messages
      [163020.008131] LustreError: 168-f: BAD WRITE CHECKSUM: lsdraid-OST0000 from 12345-192.168.1.8@o2ib inode [0x200000406:0x3d6:0x0] object 0x0:44794 extent [83886080-100270079]: client csum 886feb33, server csum ccc0eb4a
      [163042.829796] -----------[ cut here ]-----------
      [163042.837389] kernel BUG at include/linux/scatterlist.h:65!
      [163042.845758] invalid opcode: 0000 1 SMP
      [163042.852645] Modules linked in: nfsv3 nfs_acl raid10 osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) zfs(OE) zunicode(OE) zavl(OE) zcommon(OE) znvpair(OE) spl(OE) zlib_deflate lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) sha512_generic crypto_null rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache xprtrdma sunrpc ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ses dm_service_time enclosure intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd mpt3sas ipmi_devintf ipmi_ssif ipmi_si
      [163042.944819] raid_class sb_edac iTCO_wdt iTCO_vendor_support scsi_transport_sas sg edac_core pcspkr ipmi_msghandler wmi ioatdma mei_me mei lpc_ich shpchp i2c_i801 mfd_core acpi_pad acpi_power_meter dm_multipath dm_mod ip_tables ext4 mbcache jbd2 mlx4_ib mlx4_en ib_sa vxlan ib_mad ip6_udp_tunnel udp_tunnel ib_core ib_addr sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper crct10dif_pclmul igb crct10dif_common ttm ptp crc32c_intel ahci pps_core drm mlx4_core libahci dca i2c_algo_bit libata i2c_core [last unloaded: zunicode]
      [163043.010335] CPU: 12 PID: 84956 Comm: ll_ost_io00_002 Tainted: G B IOE ------------ 3.10.0-327.36.3.el7.x86_64 #1
      [163043.025057] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
      [163043.038989] task: ffff880fc52bc500 ti: ffff880fc55bc000 task.ti: ffff880fc55bc000
      [163043.049639] RIP: 0010:[<ffffffffa0960fef>] [<ffffffffa0960fef>] cfs_crypto_hash_update_page+0x9f/0xb0 [libcfs]
      [163043.063453] RSP: 0018:ffff880fc55bfab8 EFLAGS: 00010202
      [163043.071687] RAX: 0000000000000002 RBX: ffff8810f6db9b80 RCX: 0000000000000000
      [163043.081918] RDX: 0000000000000020 RSI: 0000000000000000 RDI: ffff880fc55bfad8
      [163043.092095] RBP: ffff880fc55bfb00 R08: 00000000000195a0 R09: ffff880fc55bfab8
      [163043.103441] R10: ffff88103e807900 R11: 0000000000000001 R12: 3635343332313036
      [163043.113462] R13: 0000000033323130 R14: 0000000000000534 R15: 0000000000000000
      [163043.123487] FS: 0000000000000000(0000) GS:ffff88103ef00000(0000) knlGS:0000000000000000
      [163043.134599] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [163043.143101] CR2: 00007fce5afab000 CR3: 000000000194a000 CR4: 00000000001407e0
      [163043.153184] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [163043.163242] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [163043.173280] Stack:
      [163043.177580] 0000000000000002 0000000000000000 0000000000000000 0000000000000000
      [163043.188354] 00000000f43b381e 0000000000000000 ffff880fcc7d1301 ffff880e73ecc200
      [163043.199140] 0000000000000000 ffff880fc55bfb68 ffffffffa0d5345c ffff88202563f0a8
      [163043.209907] Call Trace:
      [163043.215455] [<ffffffffa0d5345c>] tgt_checksum_bulk.isra.33+0x35a/0x4e7 [ptlrpc]
      [163043.226242] [<ffffffffa0d2c21d>] tgt_brw_write+0x114d/0x1640 [ptlrpc]
      [163043.235986] [<ffffffff810c15cc>] ? update_curr+0xcc/0x150
      [163043.244558] [<ffffffff810be46e>] ? account_entity_dequeue+0xae/0xd0
      [163043.254271] [<ffffffffa0c82560>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
      [163043.264858] [<ffffffffa0d28225>] tgt_request_handle+0x915/0x1320 [ptlrpc]
      [163043.275043] [<ffffffffa0cd41ab>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
      [163043.286074] [<ffffffffa0967128>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
      [163043.296175] [<ffffffffa0cd1d68>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      [163043.306194] [<ffffffff810b8952>] ? default_wake_function+0x12/0x20
      [163043.315553] [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90
      [163043.324714] [<ffffffffa0cd8260>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
      [163043.334070] [<ffffffffa0cd77c0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
      [163043.344635] [<ffffffff810a5b8f>] kthread+0xcf/0xe0
      [163043.352181] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
      [163043.361606] [<ffffffff81646a98>] ret_from_fork+0x58/0x90
      [163043.369571] [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
      [163043.378772] Code: 89 43 38 48 8b 43 20 ff 50 c0 48 8b 55 d8 65 48 33 14 25 28 00 00 00 75 0d 48 83 c4 28 5b 41 5c 41 5d 41 5e 5d c3 e8 61 a0 71 e0 <0f> 0b 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00
      [163043.406113] RIP [<ffffffffa0960fef>] cfs_crypto_hash_update_page+0x9f/0xb0 [libcfs]
      [163043.416991] RSP <ffff880fc55bfab8>

      This happened fairly quickly. After this run I restarted the system and it happened again almost immediately.

      Attachments

        Issue Links

          Activity

            People

              utopiabound Nathaniel Clark
              jsalians_intel John Salinas (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: