nbp6-OST002f-osc-MDT0000: invalid setattr record, lsr_valid:0

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.6.0
    • Lustre 2.4.3
    • 3
    • 14392

    Description

      When mounting the MDT we start to see these errors, then the system becomes unresponsive.

      nbp6-mds login: LDISKFS-fs (dm-1): recovery complete
      LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. quota=on. Opts: 
      LDISKFS-fs (dm-2): recovery complete
      LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. quota=on. Opts: 
      Lustre: nbp6-MDT0000: Not available for connect from 10.153.1.57@o2ib233 (not set up)
      Lustre: nbp6-MDT0000: used disk, loading
      Lustre: nbp6-MDT0000: Not available for connect from 10.153.0.76@o2ib233 (not set up)
      Lustre: 2967:0:(mdt_handler.c:4969:mdt_process_config()) For interoperability, skip this mdt.group_upcall. It is obsolete.
      Lustre: 2967:0:(mdt_handler.c:4969:mdt_process_config()) For interoperability, skip this mdt.quota_type. It is obsolete.
      LustreError: 3069:0:(osp_sync.c:487:osp_sync_new_setattr_job()) nbp6-OST002f-osc-MDT0000: invalid setattr record, lsr_valid:0
      LustreError: 3069:0:(osp_sync.c:797:osp_sync_process_queues()) can't send: -22
      LustreError: 3069:0:(osp_sync.c:487:osp_sync_new_setattr_job()) nbp6-OST002f-osc-MDT0000: invalid setattr record, lsr_valid:0
      LustreError: 3069:0:(osp_sync.c:797:osp_sync_process_queues()) can't send: -22
      LustreError: 11-0: nbp6-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
      Lustre: nbp6-MDT0000: Will be in recovery for at least 5:00, or until 1083 clients reconnect
      LustreError: 3069:0:(osp_sync.c:797:osp_sync_process_queues()) can't send: -22
      LustreError: 3081:0:(osp_sync.c:487:osp_sync_new_setattr_job()) nbp6-OST0032-osc-MDT0000: invalid setattr record, lsr_valid:0
      LustreError: 3081:0:(osp_sync.c:487:osp_sync_new_setattr_job()) Skipped 212864 previous similar messages
      LustreError: 3069:0:(osp_sync.c:797:osp_sync_process_queues()) Skipped 229931 previous similar messages
      LustreError: 3081:0:(osp_sync.c:797:osp_sync_process_queues()) can't send: -22
      LustreError: 3069:0:(osp_sync.c:487:osp_sync_new_setattr_job()) nbp6-OST002f-osc-MDT0000: invalid setattr record, lsr_valid:0
      LustreError: 3069:0:(osp_sync.c:487:osp_sync_new_setattr_job()) Skipped 450953 previous similar messages
      LustreError: 3081:0:(osp_sync.c:797:osp_sync_process_queues()) Skipped 445853 previous similar messages
      LustreError: 3081:0:(osp_sync.c:797:osp_sync_process_queues()) can't send: -22
      LustreError: 3069:0:(osp_sync.c:487:osp_sync_new_setattr_job()) nbp6-OST002f-osc-MDT0000: invalid setattr record, lsr_valid:0
      LustreError: 3069:0:(osp_sync.c:487:osp_sync_new_setattr_job()) Skipped 930108 previous similar messages
      LustreError: 3081:0:(osp_sync.c:797:osp_sync_process_queues()) Skipped 925668 previous similar messages
      LustreError: 3081:0:(osp_sync.c:797:osp_sync_process_queues()) can't send: -22
      LustreError: 3069:0:(osp_sync.c:487:osp_sync_new_setattr_job()) nbp6-OST002f-osc-MDT0000: invalid setattr record, lsr_valid:0
      LustreError: 3069:0:(osp_sync.c:487:osp_sync_new_setattr_job()) Skipped 1897185 previous similar messages
      LustreError: 3081:0:(osp_sync.c:797:osp_sync_process_queues()) Skipped 1898934 previous similar messages
      LustreError: 3081:0:(osp_sync.c:487:osp_sync_new_setattr_job()) nbp6-OST0032-osc-MDT0000: invalid setattr record, lsr_valid:0
      LustreError: 3069:0:(osp_sync.c:797:osp_sync_process_queues()) can't send: -22
      LustreError: 3069:0:(osp_sync.c:797:osp_sync_process_queues()) Skipped 3829999 previous similar messages
      LustreError: 3081:0:(osp_sync.c:487:osp_sync_new_setattr_job()) Skipped 3851279 previous similar messages
      LustreError: 3081:0:(osp_sync.c:797:osp_sync_process_queues()) can't send: -22
      LustreError: 3069:0:(osp_sync.c:487:osp_sync_new_setattr_job()) nbp6-OST002f-osc-MDT0000: invalid setattr record, lsr_valid:0
      LustreError: 3069:0:(osp_sync.c:487:osp_sync_new_setattr_job()) Skipped 7683067 previous similar messages
      LustreError: 3081:0:(osp_sync.c:797:osp_sync_process_queues()) Skipped 7758692 previous similar messages
      LustreError: 3081:0:(osp_sync.c:487:osp_sync_new_setattr_job()) nbp6-OST0032-osc-MDT0000: invalid setattr record, lsr_valid:0
      LustreError: 3069:0:(osp_sync.c:797:osp_sync_process_queues()) can't send: -22
      LustreError: 3069:0:(osp_sync.c:797:osp_sync_process_queues()) Skipped 15494392 previous similar messages
      LustreError: 3081:0:(osp_sync.c:487:osp_sync_new_setattr_job()) Skipped 15461799 previous similar messages
      LustreError: 3069:0:(osp_sync.c:487:osp_sync_new_setattr_job()) nbp6-OST002f-osc-MDT0000: invalid setattr record, lsr_valid:0
      LustreError: 3081:0:(osp_sync.c:797:osp_sync_process_queues()) can't send: -22
      LustreError: 3081:0:(osp_sync.c:797:osp_sync_process_queues()) Skipped 31047376 previous similar messages
      LustreError: 3069:0:(osp_sync.c:487:osp_sync_new_setattr_job()) Skipped 30898154 previous similar messages
      INFO: task crond:3224 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      crond         D 0000000000000003     0  3224   2649 0x00000080
       ffff8805c5579d38 0000000000000086 ffff8805c5579d00 ffff8805c5579cfc
       0000000000000000 ffff88063fc24700 ffff880028296780 0000000000000500
       ffff880625ed85f8 ffff8805c5579fd8 000000000000fc40 ffff880625ed85f8
      Call Trace:
       [<ffffffff81540545>] schedule_timeout+0x215/0x2e0
       [<ffffffff81346555>] ? extract_entropy+0xe5/0x140
       [<ffffffff815401c3>] wait_for_common+0x123/0x180
       [<ffffffff81063be0>] ? default_wake_function+0x0/0x20
       [<ffffffff815402dd>] wait_for_completion+0x1d/0x20
       [<ffffffff810921c8>] synchronize_sched+0x58/0x60
       [<ffffffff81092150>] ? wakeme_after_rcu+0x0/0x20
       [<ffffffff8122205c>] install_session_keyring_to_cred+0x6c/0xd0
       [<ffffffff812221f3>] join_session_keyring+0x133/0x160
       [<ffffffff810dbff7>] ? audit_syscall_entry+0x1d7/0x200
       [<ffffffff81220df8>] keyctl_join_session_keyring+0x38/0x70
       [<ffffffff81221a20>] sys_keyctl+0x170/0x190
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      LustreError: 3081:0:(osp_sync.c:797:osp_sync_process_queues()) can't send: -22
      LustreError: 3069:0:(osp_sync.c:487:osp_sync_new_setattr_job()) nbp6-OST002f-osc-MDT0000: invalid setattr record, lsr_valid:0
      LustreError: 3069:0:(osp_sync.c:487:osp_sync_new_setattr_job()) Skipped 61763336 previous similar messages
      LustreError: 3081:0:(osp_sync.c:797:osp_sync_process_queues()) Skipped 62092386 previous similar messages
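
To get a feel for how fast the loop in the console log above is spinning, the "Skipped N previous similar messages" lines can be totalled per call site. This is only a rough sketch: the regular expression is written against the line format shown in this ticket, and the names `SKIPPED` and `skipped_by_site` are made up for illustration.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   LustreError: 3081:0:(osp_sync.c:487:osp_sync_new_setattr_job()) Skipped 212864 previous similar messages
SKIPPED = re.compile(
    r"LustreError: \d+:\d+:\((?P<site>[\w.]+:\d+:\w+)\(\)\) "
    r"Skipped (?P<count>\d+) previous similar messages"
)


def skipped_by_site(lines):
    """Sum the suppressed-message counts for each file:line:function call site."""
    totals = defaultdict(int)
    for line in lines:
        match = SKIPPED.search(line)
        if match:
            totals[match.group("site")] += int(match.group("count"))
    return dict(totals)
```

Feeding it the console output line by line (for example from a saved log file) gives one total per call site; lines that are not "Skipped" summaries are ignored.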
      

          Activity

            [LU-5188] nbp6-OST002f-osc-MDT0000: invalid setattr record, lsr_valid:0

            jlevi Jodi Levi (Inactive) added a comment - Patch landed to Master.
            di.wang Di Wang added a comment - http://review.whamcloud.com/10828
            di.wang Di Wang added a comment - This patch causes LU-5244. It seems osp_sync_new_setattr_job() returns 0 without issuing an RPC, so we should decrement opd_syn_rpc_in_flight and opd_syn_rpc_in_progress in this case.
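
The accounting problem described in the comment above can be illustrated with a small simulation. It is a sketch under stated assumptions, not the Lustre code: it assumes the caller reserves a slot in both counters before building the job, and that a slot is normally released only when an RPC completes. `SyncState` and `run` are hypothetical names, and both counters are released at the same point here purely for simplicity.

```python
from dataclasses import dataclass


@dataclass
class SyncState:
    rpc_in_flight: int = 0    # modelled on opd_syn_rpc_in_flight
    rpc_in_progress: int = 0  # modelled on opd_syn_rpc_in_progress


def run(valid_masks, release_on_skip):
    """Process one record per lsr_valid mask and return the final counters."""
    state = SyncState()
    for lsr_valid in valid_masks:
        # The caller reserves a slot before the job is built.
        state.rpc_in_flight += 1
        state.rpc_in_progress += 1

        rpc_sent = lsr_valid != 0  # invalid record: job returns 0, no RPC
        if rpc_sent:
            # Simulated completion of the RPC releases the slot.
            state.rpc_in_flight -= 1
            state.rpc_in_progress -= 1
        elif release_on_skip:
            # No RPC was issued, so no completion will ever release the
            # slot; release it immediately instead.
            state.rpc_in_flight -= 1
            state.rpc_in_progress -= 1
    return state
```

With `run([0x3, 0, 0, 0x1], release_on_skip=False)` both counters end at 2, one leaked slot per skipped record; with `release_on_skip=True` they return to 0.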

            adilger Andreas Dilger added a comment - Patch landed to master for 2.6.0.

            mhanafi Mahmoud Hanafi added a comment - The patch fixed our issue.

            adilger Andreas Dilger added a comment - I think skipping the invalid record is OK, since LFSCK 2 should fix up the ownership of the OST objects on its next scan.

            niu Niu Yawei (Inactive) added a comment -

            "Niu: regarding the LU-4345 - it appears that while there is now handling for this case instead of LASSERT, it's pointless because we enter eternal retrying loop so possibly we want to have some more handling there."

            Right, I think we should return 0 instead of -EINVAL, so that we just print an error message and the osp sync process can continue with the other records.
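
The difference between the two return values can be sketched with a toy model of the queue processor. This is a simulation under assumptions, not the code in osp_sync.c: it assumes a record whose job fails is retried in place, which is what the repeated "can't send: -22" messages (-22 is -EINVAL on Linux) suggest. `SetattrRecord`, `new_setattr_job`, `process_queue`, and the `max_attempts` cap are all hypothetical.

```python
import errno
from dataclasses import dataclass


@dataclass
class SetattrRecord:
    """Hypothetical stand-in for a setattr record in the sync log."""
    object_id: int
    lsr_valid: int  # bitmask of valid fields; 0 means nothing to apply


def new_setattr_job(rec, sent, skip_invalid):
    """Return 0 on success or a negative errno, in kernel style."""
    if rec.lsr_valid == 0:
        # Invalid record: either fail the job or report success and skip.
        return 0 if skip_invalid else -errno.EINVAL
    sent.append(rec.object_id)
    return 0


def process_queue(records, skip_invalid, max_attempts=20):
    """Process records in order; a failing record is retried in place.

    max_attempts only bounds this simulation so that it terminates.
    """
    sent, attempts, idx = [], 0, 0
    while idx < len(records) and attempts < max_attempts:
        attempts += 1
        rc = new_setattr_job(records[idx], sent, skip_invalid)
        if rc == 0:
            idx += 1  # record consumed, move to the next one
        # rc != 0 models "can't send": the same record is tried again
    return sent, attempts
```

For the records `[SetattrRecord(1, 0x3), SetattrRecord(2, 0), SetattrRecord(3, 0x3)]`, `skip_invalid=False` sends only object 1 and burns all 20 attempts on the invalid record, while `skip_invalid=True` sends objects 1 and 3 in 3 attempts.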
            pjones Peter Jones added a comment -

            Niu

            Could you please complete any follow-up work on this ticket?

            Thanks

            Peter

            mhanafi Mahmoud Hanafi added a comment - OK, thanks; we will build and install the patch.
            green Oleg Drokin added a comment -

            Hm, it appears that this is a bug in the LU-4345 patch that you have applied.

            A patch is at http://review.whamcloud.com/10706 - this should help your immediate problem.

            Niu: regarding LU-4345 - it appears that while there is now handling for this case instead of an LASSERT, it is pointless because we enter an eternal retry loop, so we possibly want some more handling there.

            People

              niu Niu Yawei (Inactive)
              mhanafi Mahmoud Hanafi