Lustre / LU-2936

nrs_svcpt2nrs()) ASSERTION( (!(hp) || (nrs_svcpt_has_hp(svcpt))) ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.4.0
    • Affects Version/s: Lustre 2.4.0
    • Severity: 3
    • Rank: 7054

    Description

      ORNL ran a full-scale system test today, and one of their OSSes crashed with the above assertion.

      We apparently never saw this before because our testing does not seriously exercise /proc/fs/lustre/health_check, even though many sites use it heavily.

      I was able to reproduce the issue with racer while running this line in parallel:

      while :; do cat /proc/fs/lustre/health_check ; done
      
      [305098.783912] LustreError: 1075:0:(ptlrpc_internal.h:165:nrs_svcpt2nrs()) ASSERTION( (!(hp) || (nrs_svcpt_has_hp(svcpt))) ) failed: 
      [305098.784415] LustreError: 1075:0:(ptlrpc_internal.h:165:nrs_svcpt2nrs()) LBUG
      [305098.784682] Pid: 1075, comm: cat
      [305098.784881] 
      [305098.784881] Call Trace:
      [305098.785248]  [<ffffffffa07b7915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      [305098.785524]  [<ffffffffa07b7f17>] lbug_with_loc+0x47/0xb0 [libcfs]
      [305098.785810]  [<ffffffffa11b0e22>] ptlrpc_nrs_req_poll_nolock+0xc2/0x1c0 [ptlrpc]
      [305098.786239]  [<ffffffffa1170696>] ptlrpc_svcpt_health_check+0x56/0x180 [ptlrpc]
      [305098.786664]  [<ffffffffa1170812>] ptlrpc_service_health_check+0x52/0x70 [ptlrpc]
      [305098.787079]  [<ffffffffa05d61fd>] ost_health_check+0x4d/0x90 [ost]
      [305098.787345]  [<ffffffffa0e4c8e7>] obd_proc_read_health+0x2a7/0x3b0 [obdclass]
      [305098.792327]  [<ffffffffa0e6f36c>] lprocfs_fops_read+0xec/0x1f0 [obdclass]
      [305098.793699]  [<ffffffffa0e6f280>] ? lprocfs_fops_read+0x0/0x1f0 [obdclass]
      [305098.793964]  [<ffffffff811e1cc5>] proc_reg_read+0x85/0xc0
      [305098.794200]  [<ffffffff8117b9e5>] vfs_read+0xb5/0x1a0
      [305098.794429]  [<ffffffff8117bb21>] sys_read+0x51/0x90
      [305098.794657]  [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
      [305098.794907] 
      [305098.992803] Kernel panic - not syncing: LBUG
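
A bounded variant of the polling loop above may be easier to script than an endless `while :`. The following is only a sketch: the function name `poll_health` and its two arguments are hypothetical, not part of Lustre, and the only path it is meant for is the /proc/fs/lustre/health_check file named in this report. It stops early if the file cannot be read.

```shell
# Hypothetical helper, not part of Lustre: read a health-check file
# a fixed number of times instead of looping forever.
#   $1 = file to poll (e.g. /proc/fs/lustre/health_check)
#   $2 = number of reads to perform
poll_health() {
    file=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        # Each read is issued by cat, matching "comm: cat" in the trace.
        # Give up with a non-zero status if the file is unreadable.
        cat "$file" >/dev/null 2>&1 || return 1
        i=$((i + 1))
    done
    return 0
}
```

On an OSS, something like `poll_health /proc/fs/lustre/health_check 100000 &` could be started first, with racer launched in the usual way while it runs, followed by `wait`. The iteration count is arbitrary; the original reproducer simply ran until the node hit the LBUG.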
      

      Crashdump and modules are in /exports/crashdumps/192.168.10.210-2013-03-09-00\:44\:57

      The problem was apparently introduced with the NRS code drop.

            People

              green Oleg Drokin