Lustre / LU-15503

Crash in qsd_upd_thread trying to print a debug message.


Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.15.0
    • Lustre 2.15.0
    • None
    • 3

    Description

      Maloo is hitting this crash:

      [125926.340052] general protection fault: 0000 [#1] SMP PTI
      [125926.341277] CPU: 1 PID: 825968 Comm: lquota_wb_lustr Kdump: loaded Tainted: G        W  OE    --------- -  - 4.18.0-240.22.1.el8_3.x86_64 #1
      [125926.343699] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [125926.372056] RIP: 0010:string_nocheck+0x12/0x70
      [125926.373076] Code: 00 00 4c 89 e2 be 20 00 00 00 48 89 ef e8 86 93 00 00 4c 01 e3 eb 81 90 49 89 f2 48 89 ce 48 89 f8 48 c1 fe 30 66 85 f6 74 4f <44> 0f b6 0a 45 84 c9 74 46 83 ee 01 41 b8 01 00 00 00 48 8d 7c 37
      [125926.376578] RSP: 0018:ffffab2105dd3cb8 EFLAGS: 00010286
      [125926.377625] RAX: ffff9afe29483d9f RBX: ffff9afe29484000 RCX: ffff0a00ffffff04
      [125926.379021] RDX: 247c894800000028 RSI: ffffffffffffffff RDI: ffff9afe29483d9f
      [125926.380429] RBP: 247c894800000028 R08: 0000000000000055 R09: 0000000000000001
      [125926.381821] R10: ffff9afe29484000 R11: ffff9afe29483d4f R12: ffff0a00ffffff04
      [125926.383218] R13: ffffffffc159a59a R14: 0000000000000261 R15: ffffffffc159a59a
      [125926.384612] FS:  0000000000000000(0000) GS:ffff9afebfd00000(0000) knlGS:0000000000000000
      [125926.386200] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [125926.387345] CR2: 00007f172b497000 CR3: 000000009ac0a003 CR4: 00000000001606e0
      [125926.388746] Call Trace:
      [125926.394666]  string+0x40/0x50
      [125926.403630]  vsnprintf+0x33c/0x520
      [125926.404461]  libcfs_debug_msg+0x83d/0xb00 [libcfs]
      [125926.412242]  ? try_to_del_timer_sync+0x4d/0x80
      [125926.413177]  ? __next_timer_interrupt+0xf0/0xf0
      [125926.414185]  ? qsd_upd_thread+0x86e/0xd20 [lquota]
      [125926.415176]  qsd_upd_thread+0x86e/0xd20 [lquota]
      [125926.416136]  ? qsd_upd_add+0x100/0x100 [lquota]
      [125926.417086]  kthread+0x112/0x130
      [125926.417784]  ? kthread_flush_work_fn+0x10/0x10
      [125926.418703]  ret_from_fork+0x35/0x40
      [125926.419472] Modules linked in: dm_flakey nfsd nfs_acl obdecho(OE) ptlrpc_gss(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul dm_mod ghash_clmulni_intel pcspkr joydev virtio_balloon i2c_piix4 ip_tables ext4 mbcache jbd2 ata_generic ata_piix libata virtio_net crc32c_intel serio_raw net_failover failover virtio_blk [last unloaded: dm_flakey] 

      There is only a single print in that function, so I can only assume that list_entry returns garbage:

                              if (count % 7 == 0) {
                                      n = list_entry(&queue, struct qsd_upd_rec,
                                                     qur_link);
                                      CWARN("%s: The reintegration thread [%d] "
                                            "blocked more than %ld seconds\n",
                                            n->qur_qqi->qqi_qsd->qsd_svname,
                                            n->qur_qqi->qqi_qtype, count *
                                            cfs_time_seconds(QSD_WB_INTERVAL) / 10);
                              } 
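
The suspicion above can be illustrated outside the kernel. list_entry() is a container_of() conversion: it subtracts the member offset from whatever pointer it is given. If queue in the quoted code is the list head itself (an assumption, since its declaration is not shown here), then list_entry(&queue, ...) yields an address derived from the head rather than from any queued record, and dereferencing it reads unrelated memory. The sketch below uses userspace stand-ins modeled on the kernel list helpers; struct upd_rec and both function names are hypothetical and are not taken from the Lustre source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins modeled on the kernel's list helpers. */
struct list_head {
	struct list_head *next, *prev;
};

/* Integer arithmetic keeps the deliberately wrong case below well defined. */
#define list_entry(ptr, type, member) \
	((type *)((uintptr_t)(ptr) - offsetof(type, member)))
#define list_first_entry(head, type, member) \
	list_entry((head)->next, type, member)

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail(struct list_head *item, struct list_head *head)
{
	item->prev = head->prev;
	item->next = head;
	head->prev->next = item;
	head->prev = item;
}

/* Hypothetical record type, standing in for a queued update record. */
struct upd_rec {
	int qtype;
	struct list_head link;
};

/* Returns 1 only if list_entry(&queue, ...) lands on the queued record. */
int head_entry_is_record(void)
{
	struct list_head queue;
	struct upd_rec rec = { .qtype = 2 };
	struct upd_rec *n;

	list_init(&queue);
	list_add_tail(&rec.link, &queue);

	/* Same shape as the quoted code: the head itself is converted. */
	n = list_entry(&queue, struct upd_rec, link);
	return (uintptr_t)n == (uintptr_t)&rec;
}

/* Converts the first element instead, after checking the list is not empty. */
int first_entry_qtype(void)
{
	struct list_head queue;
	struct upd_rec rec = { .qtype = 2 };
	struct upd_rec *n;

	list_init(&queue);
	list_add_tail(&rec.link, &queue);

	assert(queue.next != &queue); /* an empty list has no first entry */
	n = list_first_entry(&queue, struct upd_rec, link);
	return n->qtype;
}
```

In this sketch head_entry_is_record() returns 0, because the pointer computed from the head never matches the record, while first_entry_qtype() returns the stored value. Whether the same reasoning applies to the real code depends on how queue is declared and filled in qsd_upd_thread.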

      Example reports:

      https://testing.whamcloud.com/test_sets/785c0e7b-cd04-422a-8bc3-9eaacc47d4b0

      https://testing.whamcloud.com/test_sets/43f81877-2c6c-411a-990a-911905b85a7f

      https://testing.whamcloud.com/test_sets/44640986-5ef4-48cc-a468-beefa26fcd3a

       

      So far, this has been observed only in RHEL8 testing.

            People

              ys Yang Sheng
              green Oleg Drokin