mgc to backup configuration on osd-based llogs

        Activity

          [LU-2059] mgc to backup configuration on osd-based llogs
          yujian Jian Yu added a comment -

          The above patches were merged into Lustre b2_5 branch for 2.5.4 release.

          yujian Jian Yu added a comment -

          Here are the back-ported patches for Lustre b2_5 branch:
          http://review.whamcloud.com/12427 (from http://review.whamcloud.com/10311)
          http://review.whamcloud.com/12428 (from http://review.whamcloud.com/10589)

          adilger Andreas Dilger added a comment -

          insanity.sh test_1 is also being skipped for ZFS filesystems due to this bug.

          adilger Andreas Dilger added a comment -

          sanity-quota test_7d is still being skipped for ZFS filesystems, reportedly because of this bug.

          utopiabound Nathaniel Clark added a comment -

          Port to b2_4: http://review.whamcloud.com/10771

          jlevi Jodi Levi (Inactive) added a comment -

          Patch landed to Master. Please reopen ticket if more work is needed.
          tappro Mikhail Pershin added a comment - - edited

          Patch in gerrit with the latest fixes related to 'params' log handling and with the previously excluded tests enabled:
          http://review.whamcloud.com/10311

          Test_1 in insanity is still excluded; it failed due to a deadlock of some kind and needs more investigation. It may be worth filing another bug for that, or reopening LU-3562.


          tappro Mikhail Pershin added a comment -

          The local config logs are created as expected. Meanwhile, there was a problem with the 'params' log: it can be empty, and that case shouldn't be treated as critical. I've enabled all of the previously excepted tests and they passed with both ldiskfs and zfs, except insanity test_1, which failed only with zfs, as follows:

          May 13 14:56:02 nodez kernel: LustreError: 13691:0:(mgc_request.c:516:do_requeue()) failed processing log: -5
          May 13 14:56:02 nodez kernel: LustreError: 13691:0:(mgc_request.c:516:do_requeue()) Skipped 2 previous similar messages
          May 13 14:56:15 nodez kernel: INFO: task mount.lustre:15680 blocked for more than 120 seconds.
          May 13 14:56:15 nodez kernel:      Tainted: P           ---------------    2.6.32 #11
          May 13 14:56:15 nodez kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
          May 13 14:56:15 nodez kernel: mount.lustre  D ffff880001e0f7e0  3592 15680  15679 0x00000080
          May 13 14:56:15 nodez kernel: ffff8800241167f0 0000000000000082 0000000000000000 ffffffff814ffbf6
          May 13 14:56:15 nodez kernel: ffff88002d7a3b38 ffffffff81a2d020 ffff880024116ac0 000000000000f7e0
          May 13 14:56:15 nodez kernel: ffff88001eaabfd8 ffff880024116ac0 ffff8800241167f0 ffff88001eaab6c8
          May 13 14:56:15 nodez kernel: Call Trace:
          May 13 14:56:15 nodez kernel: [<ffffffff814ffbf6>] ? __mutex_lock_slowpath+0x1f6/0x2a0
          May 13 14:56:15 nodez kernel: [<ffffffff814ff325>] ? schedule_timeout+0x1c5/0x280
          May 13 14:56:15 nodez kernel: [<ffffffff81048804>] ? task_rq_lock+0x54/0xa0
          May 13 14:56:15 nodez kernel: [<ffffffff810512cd>] ? set_cpus_allowed_ptr+0xad/0x1b0
          May 13 14:56:15 nodez kernel: [<ffffffff814ff325>] ? schedule_timeout+0x1c5/0x280
          May 13 14:56:15 nodez kernel: [<ffffffff814fe319>] ? wait_for_common+0x139/0x1a0
          May 13 14:56:15 nodez kernel: [<ffffffff81050f80>] ? default_wake_function+0x0/0x20
          May 13 14:56:15 nodez kernel: [<ffffffffa0eed770>] ? client_lwp_config_process+0x0/0x1540 [obdclass]
          May 13 14:56:15 nodez kernel: [<ffffffffa0e72d9c>] ? llog_process_or_fork+0x2bc/0x4c0 [obdclass]
          May 13 14:56:15 nodez kernel: [<ffffffffa0e72faf>] ? llog_process+0xf/0x20 [obdclass]
          May 13 14:56:15 nodez kernel: [<ffffffffa0ea9f79>] ? class_config_parse_llog+0x1e9/0x340 [obdclass]
          May 13 14:56:15 nodez kernel: [<ffffffffa0e78b02>] ? __llog_ctxt_put+0xf2/0x150 [obdclass]
          May 13 14:56:15 nodez kernel: [<ffffffffa027948c>] ? mgc_process_cfg_log+0x56c/0xe90 [mgc]
          May 13 14:56:15 nodez kernel: [<ffffffffa027bdfa>] ? mgc_process_log+0x59a/0x1390 [mgc]
          May 13 14:56:15 nodez kernel: [<ffffffffa0dc7f6f>] ? libcfs_debug_msg+0x3f/0x50 [libcfs]
          May 13 14:56:15 nodez kernel: [<ffffffffa027ccc5>] ? do_config_log_add+0xd5/0x5b0 [mgc]
          May 13 14:56:15 nodez kernel: [<ffffffffa0276dd0>] ? mgc_blocking_ast+0x0/0x800 [mgc]
          May 13 14:56:15 nodez kernel: [<ffffffffa0300240>] ? ldlm_completion_ast+0x0/0x910 [ptlrpc]
          May 13 14:56:15 nodez kernel: [<ffffffffa027db8f>] ? mgc_process_config+0x69f/0x1290 [mgc]
          May 13 14:56:15 nodez kernel: [<ffffffffa0ec1dc9>] ? lustre_process_log+0x219/0x1000 [obdclass]
          May 13 14:56:15 nodez kernel: [<ffffffffa0ec2106>] ? lustre_process_log+0x556/0x1000 [obdclass]
          May 13 14:56:15 nodez kernel: [<ffffffff811283e2>] ? kmem_cache_alloc_trace+0x1c2/0x1d0
          May 13 14:56:15 nodez kernel: [<ffffffffa0ee89c6>] ? lustre_start_lwp+0x146/0x700 [obdclass]
          May 13 14:56:15 nodez kernel: [<ffffffffa0ee8a1f>] ? lustre_start_lwp+0x19f/0x700 [obdclass]
          May 13 14:56:15 nodez kernel: [<ffffffffa0eead66>] ? server_start_targets+0xeb6/0x1d30 [obdclass]
          May 13 14:56:15 nodez kernel: [<ffffffffa0ebe4c6>] ? lustre_start_mgc+0x456/0x29b0 [obdclass]
          May 13 14:56:15 nodez kernel: [<ffffffffa0eb6af0>] ? class_config_llog_handler+0x0/0x1e50 [obdclass]
          May 13 14:56:15 nodez kernel: [<ffffffffa0ef1a3f>] ? server_fill_super+0xbff/0x15c0 [obdclass]
          May 13 14:56:15 nodez kernel: [<ffffffffa0ec0a20>] ? lustre_fill_super+0x0/0x520 [obdclass]
          May 13 14:56:15 nodez kernel: [<ffffffffa0ec0bf8>] ? lustre_fill_super+0x1d8/0x520 [obdclass]
          May 13 14:56:15 nodez kernel: [<ffffffffa0ec0a20>] ? lustre_fill_super+0x0/0x520 [obdclass]
          May 13 14:56:15 nodez kernel: [<ffffffffa0ec0a20>] ? lustre_fill_super+0x0/0x520 [obdclass]
          May 13 14:56:15 nodez kernel: [<ffffffff8113c9a9>] ? get_sb_nodev+0x59/0xb0
          May 13 14:56:15 nodez kernel: [<ffffffffa0eb8b5c>] ? lustre_get_sb+0x1c/0x30 [obdclass]
          May 13 14:56:15 nodez kernel: [<ffffffff8113bf16>] ? vfs_kern_mount+0x76/0x1b0
          May 13 14:56:15 nodez kernel: [<ffffffff8113c0c3>] ? do_kern_mount+0x53/0x130
          May 13 14:56:15 nodez kernel: [<ffffffff81157d32>] ? do_mount+0x302/0x8f0
          May 13 14:56:15 nodez kernel: [<ffffffff811583ba>] ? sys_mount+0x9a/0xf0
          May 13 14:56:15 nodez kernel: [<ffffffff81009f42>] ? system_call_fastpath+0x16/0x1b
          May 13 14:57:14 nodez kernel: LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
          May 13 14:57:14 nodez kernel: LustreError: Skipped 129 previous similar messages
          May 13 14:57:24 nodez kernel: LustreError: 11-0: lustre-MDT0000-osp-MDT0001: Communicating with 0@lo, operation mds_connect failed with -19.
          May 13 14:57:24 nodez kernel: LustreError: Skipped 51 previous similar messages
          May 13 14:58:15 nodez kernel: INFO: task ll_cfg_requeue:13691 blocked for more than 120 seconds.
          May 13 14:58:15 nodez kernel:      Tainted: P           ---------------    2.6.32 #11
          May 13 14:58:15 nodez kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
          May 13 14:58:15 nodez kernel: ll_cfg_requeu D ffff880001e0f7e0  4760 13691      2 0x00000080
          May 13 14:58:15 nodez kernel: ffff88001f96c040 0000000000000046 0000000000000000 ffff880001e0f920
          May 13 14:58:15 nodez kernel: 0000000000000018 ffffffff81a2d020 ffff88001f96c310 000000000000f7e0
          May 13 14:58:15 nodez kernel: ffff880028f99fd8 ffff88001f96c310 ffff88001f96c040 ffff880001e0f920
          May 13 14:58:15 nodez kernel: Call Trace:
          May 13 14:58:15 nodez kernel: [<ffffffff812bc54a>] ? string+0x3a/0xf0
          May 13 14:58:15 nodez kernel: [<ffffffff812bdbf7>] ? vsnprintf+0x357/0x600
          May 13 14:58:15 nodez kernel: [<ffffffff814ffb17>] ? __mutex_lock_slowpath+0x117/0x2a0
          May 13 14:58:15 nodez kernel: [<ffffffff8150031f>] ? mutex_lock+0x1f/0x40
          May 13 14:58:15 nodez kernel: [<ffffffffa027b8c3>] ? mgc_process_log+0x63/0x1390 [mgc]
          May 13 14:58:15 nodez kernel: [<ffffffff81501377>] ? _spin_lock_irqsave+0x17/0x20
          May 13 14:58:15 nodez kernel: [<ffffffff81068353>] ? lock_timer_base+0x33/0x70
          May 13 14:58:15 nodez kernel: [<ffffffff8106899e>] ? try_to_del_timer_sync+0x6e/0xd0
          May 13 14:58:15 nodez kernel: [<ffffffffa0dc7f6f>] ? libcfs_debug_msg+0x3f/0x50 [libcfs]
          May 13 14:58:15 nodez kernel: [<ffffffffa027ea8d>] ? mgc_requeue_thread+0x30d/0x7a0 [mgc]
          May 13 14:58:15 nodez kernel: [<ffffffff81050f80>] ? default_wake_function+0x0/0x20
          May 13 14:58:15 nodez kernel: [<ffffffffa027e780>] ? mgc_requeue_thread+0x0/0x7a0 [mgc]
          May 13 14:58:15 nodez kernel: [<ffffffff81077e9e>] ? kthread+0x8e/0xa0
          May 13 14:58:15 nodez kernel: [<ffffffff8100b0ea>] ? child_rip+0xa/0x20
          May 13 14:58:15 nodez kernel: [<ffffffff81077e10>] ? kthread+0x0/0xa0
          May 13 14:58:15 nodez kernel: [<ffffffff8100b0e0>] ? child_rip+0x0/0x20
          

          That looks like a deadlock; I'm not sure why it happens only with ZFS, though.
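
          For reference, a hedged sketch of how the locally stored copies of the config llogs can be inspected on an unmounted ldiskfs target (the device path and log names are examples, not taken from this ticket; a ZFS backend needs different tooling):

          # List the CONFIGS directory that holds the local config llog copies (read-only).
          debugfs -c -R 'ls -l CONFIGS' /dev/sdb1

          # Dump one log, e.g. the local 'params' copy, and decode its records.
          debugfs -c -R 'dump CONFIGS/params /tmp/params' /dev/sdb1
          llog_reader /tmp/params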


          adilger Andreas Dilger added a comment -

          I noticed that several conf-sanity.sh tests are being skipped for ZFS because of LU-2059 "no local config for ZFS", but I think that is no longer true?

          conf-sanity.sh: 5d, 19b, 21b, 27a
          insanity.sh: 1, 2, 4, 10

          There is also conf-sanity.sh test_50h that doesn't give a reason, and sanity-quota.sh test_7d.

          It would be good to submit a patch or two to enable these tests along with a Test-parameters: line that runs the relevant test ten times to see what the success rate is.
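
          As a hedged illustration of what such a patch could look like (the ALWAYS_EXCEPT layout and the Test-Parameters keywords below are assumptions, not verified against the current scripts or the autotest system):

          # In lustre/tests/conf-sanity.sh (and similarly insanity.sh), the ZFS-only
          # exclusions look roughly like this; enabling the tests means dropping the
          # LU-2059 entries from the list:
          if [ "$(facet_fstype ost1)" = "zfs" ]; then
                  # bug number for skipped test: LU-2059
                  ALWAYS_EXCEPT="$ALWAYS_EXCEPT 5d 19b 21b 27a"
          fi

          # Commit message trailer asking autotest to repeat the affected suites
          # against ZFS (keyword syntax assumed):
          #   Test-Parameters: mdtfilesystemtype=zfs ostfilesystemtype=zfs \
          #     testlist=conf-sanity,insanity,conf-sanity,insanity,conf-sanity,insanity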


          bzzz Alex Zhuravlev added a comment -

          Hmm, there were tests in conf-sanity to verify that an OST can start with an inactive MGS.
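
          For context, the check those tests perform boils down to something like the following (device and mount paths are hypothetical; a sketch, not the actual test code):

          # Take the MGS down so no configuration can be fetched over the network.
          umount /mnt/lustre-mgs

          # The OST should still mount, reading its configuration from the local
          # copies of the config llogs kept on the OSD backend.
          mount -t lustre /dev/sdc1 /mnt/lustre-ost0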

          tappro Mikhail Pershin added a comment -

          Well, I did check that when I did the initial patch for this ticket, which creates the local config llogs. Nevertheless, I agree this is the first step to take now, just to make sure it still works. Meanwhile, I tend to think you are right that the problem might be in locking, or in something else that causes the MGC to wait for the MGS.

          People

            jfc John Fuchs-Chesney (Inactive)
            bzzz Alex Zhuravlev
            Votes: 0
            Watchers: 12
