[LU-2936] nrs_svcpt2nrs()) ASSERTION( (!(hp) || (nrs_svcpt_has_hp(svcpt))) ) failed

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.4.0
    • Lustre 2.4.0
    • 3
    • 7054

    Description

      ORNL did a full scale system test today and one of their OSSes crashed with the above assertion.

      We apparently never saw this before because our testing does not seriously exercise /proc/fs/lustre/health_check, even though many sites use it heavily.

      I was able to reproduce the issue with racer while running this line in parallel:

      while :; do cat /proc/fs/lustre/health_check ; done
      
      [305098.783912] LustreError: 1075:0:(ptlrpc_internal.h:165:nrs_svcpt2nrs()) ASSERTION( (!(hp) || (nrs_svcpt_has_hp(svcpt))) ) failed: 
      [305098.784415] LustreError: 1075:0:(ptlrpc_internal.h:165:nrs_svcpt2nrs()) LBUG
      [305098.784682] Pid: 1075, comm: cat
      [305098.784881] 
      [305098.784881] Call Trace:
      [305098.785248]  [<ffffffffa07b7915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      [305098.785524]  [<ffffffffa07b7f17>] lbug_with_loc+0x47/0xb0 [libcfs]
      [305098.785810]  [<ffffffffa11b0e22>] ptlrpc_nrs_req_poll_nolock+0xc2/0x1c0 [ptlrpc]
      [305098.786239]  [<ffffffffa1170696>] ptlrpc_svcpt_health_check+0x56/0x180 [ptlrpc]
      [305098.786664]  [<ffffffffa1170812>] ptlrpc_service_health_check+0x52/0x70 [ptlrpc]
      [305098.787079]  [<ffffffffa05d61fd>] ost_health_check+0x4d/0x90 [ost]
      [305098.787345]  [<ffffffffa0e4c8e7>] obd_proc_read_health+0x2a7/0x3b0 [obdclass]
      [305098.792327]  [<ffffffffa0e6f36c>] lprocfs_fops_read+0xec/0x1f0 [obdclass]
      [305098.793699]  [<ffffffffa0e6f280>] ? lprocfs_fops_read+0x0/0x1f0 [obdclass]
      [305098.793964]  [<ffffffff811e1cc5>] proc_reg_read+0x85/0xc0
      [305098.794200]  [<ffffffff8117b9e5>] vfs_read+0xb5/0x1a0
      [305098.794429]  [<ffffffff8117bb21>] sys_read+0x51/0x90
      [305098.794657]  [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
      [305098.794907] 
      [305098.992803] Kernel panic - not syncing: LBUG
      

      Crashdump and modules are in /exports/crashdumps/192.168.10.210-2013-03-09-00\:44\:57

      The problem was apparently introduced with the NRS code drop.

          Activity

            jlevi Jodi Levi (Inactive) added a comment - Patch landed to master.

            nangelinas Nikitas Angelinas added a comment - As mentioned in Gerrit, this bug is addressed in the NRS follow-up patch as well, but '2' from the comment above that you have used is a better solution. Maybe '1' could be used to improve things on a future patch.
            green Oleg Drokin added a comment - patch in http://review.whamcloud.com/5665
            green Oleg Drokin added a comment -

            Ok, I have now checked the dump and the situation is clear.
            ptlrpc_svcpt_health_check calls ptlrpc_nrs_req_poll_nolock twice for every service partition that has anything pending: once with hp set to true and once with hp set to false.
            If this happens for a service that does not have hp ops registered (like ost_create), the assertion trips. The underlying NRS code seems to assume that the caller knows the service type, so asking for an hp request on a service that cannot have hp requests is treated as a no-no (which is a bit strange, considering that it's perfectly ok to check whether a service has any hp requests pending even for non-hp services).

            As such, the possible fixes are:
            1. Remove the assertion and the restrictions on caller knowledge of the underlying service when trying to fetch requests.
            2. Check in ptlrpc_svcpt_health_check that an hp request is actually available before trying to fetch it.
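
The second option can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the Lustre code or the landed patch: struct svcpt_model and the model_* functions are hypothetical stand-ins for the service partition, nrs_svcpt_has_hp(), ptlrpc_nrs_req_poll_nolock() and ptlrpc_svcpt_health_check(). The real health check also does more than count pending queues.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a ptlrpc service partition. */
struct svcpt_model {
    bool has_hp;      /* service registered high-priority (hp) ops */
    int  nr_regular;  /* queued regular requests */
    int  nr_hp;       /* queued high-priority requests */
};

/*
 * Models the precondition that tripped in nrs_svcpt2nrs(): asking for
 * the hp queue of a partition without hp support is a caller bug.
 */
static bool model_poll(const struct svcpt_model *svcpt, bool hp)
{
    assert(!hp || svcpt->has_hp);
    return hp ? svcpt->nr_hp > 0 : svcpt->nr_regular > 0;
}

/* Assumed helper: true only if an hp request can actually be fetched. */
static bool model_hp_pending(const struct svcpt_model *svcpt)
{
    return svcpt->has_hp && svcpt->nr_hp > 0;
}

/*
 * Option 2: check that an hp request is available before polling the
 * hp queue. Returns how many queues (0, 1 or 2) have pending work.
 */
static int model_health_check(const struct svcpt_model *svcpt)
{
    int pending = 0;

    if (model_hp_pending(svcpt) && model_poll(svcpt, true))
        pending++;
    if (model_poll(svcpt, false))
        pending++;
    return pending;
}
```

In this model, a partition with has_hp set to false (the ost_create-like case) never reaches the hp poll, so the assertion cannot fire from the health check path. Dropping the model_hp_pending() guard reproduces the failure mode described above.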


            People

              green Oleg Drokin
              green Oleg Drokin