  Lustre / LU-10027

Unable to finish mount on MDS during rolling downgrade


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.1, Lustre 2.10.2, Lustre 2.10.3, Lustre 2.10.5
    • None
    • Environment: Older Version: b_ieel3_0, build 222; Newer Version: b2_10 build 26
    • Severity: 3

    Description

      The MDS hung while trying to mount the MDT after downgrading it from 2.10.1 RC1 to b_ieel3_0 build 222.
      This happens for both ldiskfs and ZFS.

      ldiskfs

      [root@onyx-63 ~]# mount -t lustre -o acl,user_xattr /dev/sdb1 /mnt/mds0
      mount.lustre: increased /sys/block/sdb/queue/max_sectors_kb from 512 to 32767
      mount.lustre: change scheduler of /sys/block/sdb[ 2004.864821] libcfs: loading out-of-tree module taints kernel.
      /queue/scheduler[ 2004.871947] libcfs: module verification failed: signature and/or required key missing - tainting kernel
       from cfq to deadline
      [ 2004.888945] LNet: HW CPU cores: 32, npartitions: 4
      [ 2004.898521] alg: No test for adler32 (adler32-zlib)
      [ 2004.904129] alg: No test for crc32 (crc32-table)
      [ 2009.942002] sha512_ssse3: Using AVX optimized SHA-512 implementation
      [ 2012.937978] Lustre: Lustre: Build Version: 2.7.19.10--PRISTINE-3.10.0-514.10.2.el7_lustre.x86_64
      [ 2012.975643] LNet: Added LNI 10.2.5.135@tcp [8/256/0/180]
      [ 2012.981679] LNet: Accept secure, port 988
      [ 2013.034692] LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. Opts: acl,user_xattr,user_xattr,errors=remount-ro,no_mbcache
      
      [ 2065.369959] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.2.51@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 2067.236139] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.4.18@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 2067.255509] LustreError: Skipped 1 previous similar message
      [ 2076.647185] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.4.19@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      
      [ 2090.368245] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.2.51@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 2101.644288] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.4.19@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 2101.663651] LustreError: Skipped 2 previous similar messages
      [ 2115.367485] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.2.51@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 2140.366559] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.2.51@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 2140.385928] LustreError: Skipped 4 previous similar messages
      
      [ 2160.784144] INFO: task ll_mgs_0002:21386 blocked for more than 120 seconds.
      [ 2160.791943] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 2160.800724] ll_mgs_0002     D ffffffffa0f78660     0 21386      2 0x00000080
      [ 2160.808691]  ffff88040aa43a10 0000000000000046 ffff880429e04e70 ffff88040aa43fd8
      [ 2160.818118]  ffff88040aa43fd8 ffff88040aa43fd8 ffff880429e04e70 ffff88081df59d48
      [ 2160.827408]  ffff88081df59d50 7fffffffffffffff ffff880429e04e70 ffffffffa0f78660
      [ 2160.836732] Call Trace:
      [ 2160.840486]  [<ffffffffa0f78660>] ? mgs_steal_client_llog_handler+0x1110/0x1110 [mgs]
      [ 2160.850251]  [<ffffffff8168bd09>] schedule+0x29/0x70
      [ 2160.856792]  [<ffffffff81689759>] schedule_timeout+0x239/0x2d0
      [ 2160.864288]  [<ffffffff810c7f05>] ? sched_clock_cpu+0x85/0xc0
      [ 2160.871684]  [<ffffffff810c1b15>] ? check_preempt_curr+0x75/0xa0
      [ 2160.879360]  [<ffffffff810c1b59>] ? ttwu_do_wakeup+0x19/0xd0
      [ 2160.886649]  [<ffffffffa0f78660>] ? mgs_steal_client_llog_handler+0x1110/0x1110 [mgs]
      [ 2160.896367]  [<ffffffff8168c0e6>] wait_for_completion+0x116/0x170
      [ 2160.904146]  [<ffffffff810c5080>] ? wake_up_state+0x20/0x20
      [ 2160.911392]  [<ffffffffa095a047>] llog_process_or_fork+0x1d7/0x590 [obdclass]
      [ 2160.920354]  [<ffffffffa095a414>] llog_process+0x14/0x20 [obdclass]
      [ 2160.928335]  [<ffffffffa0f80c9f>] mgs_find_or_make_fsdb+0x72f/0x7e0 [mgs]
      [ 2160.936899]  [<ffffffffa0f80db2>] mgs_check_index+0x62/0x2f0 [mgs]
      [ 2160.944786]  [<ffffffffa0f689fe>] mgs_target_reg+0x38e/0x1320 [mgs]
      [ 2160.952844]  [<ffffffffa0c74adb>] tgt_request_handle+0x8fb/0x11f0 [ptlrpc]
      [ 2160.961512]  [<ffffffffa0c18b8b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
      [ 2160.971041]  [<ffffffffa084fce8>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
      [ 2160.979619]  [<ffffffffa0c15c58>] ? ptlrpc_wait_event+0x98/0x330 [ptlrpc]
      [ 2160.988168]  [<ffffffff810ba2e8>] ? __wake_up_common+0x58/0x90
      [ 2160.995664]  [<ffffffffa0c1c4b0>] ptlrpc_main+0xc00/0x1f60 [ptlrpc]
      [ 2161.003618]  [<ffffffff81029569>] ? __switch_to+0xd9/0x4c0
      [ 2161.010710]  [<ffffffffa0c1b8b0>] ? ptlrpc_register_service+0x1070/0x1070 [ptlrpc]
      [ 2161.020100]  [<ffffffff810b06ff>] kthread+0xcf/0xe0
      [ 2161.026468]  [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
      [ 2161.034663]  [<ffffffff81696b98>] ret_from_fork+0x58/0x90
      [ 2161.041573]  [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
      [ 2176.641784] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.4.19@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 2176.662876] LustreError: Skipped 5 previous similar messages
      [ 2214.124574] LNet: Service thread pid 21386 was inactive for 200.36s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [ 2214.145286] Pid: 21386, comm: ll_mgs_0002
      [ 2214.150532] 
      [ 2214.150532] Call Trace:
      [ 2214.156430]  [<ffffffffa0f78660>] ? mgs_fsdb_handler+0x0/0x10a0 [mgs]
      [ 2214.164352]  [<ffffffff8168bd09>] schedule+0x29/0x70
      [ 2214.170643]  [<ffffffff81689759>] schedule_timeout+0x239/0x2d0
      [ 2214.177911]  [<ffffffff810c7f05>] ? sched_clock_cpu+0x85/0xc0
      [ 2214.185069]  [<ffffffff810c1b15>] ? check_preempt_curr+0x75/0xa0
      [ 2214.192491]  [<ffffffff810c1b59>] ? ttwu_do_wakeup+0x19/0xd0
      [ 2214.199520]  [<ffffffffa0f78660>] ? mgs_fsdb_handler+0x0/0x10a0 [mgs]
      [ 2214.207414]  [<ffffffff8168c0e6>] wait_for_completion+0x116/0x170
      [ 2214.214917]  [<ffffffff810c5080>] ? default_wake_function+0x0/0x20
      [ 2214.222514]  [<ffffffffa095a047>] llog_process_or_fork+0x1d7/0x590 [obdclass]
      [ 2214.231181]  [<ffffffffa095a414>] llog_process+0x14/0x20 [obdclass]
      [ 2214.238870]  [<ffffffffa0f80c9f>] mgs_find_or_make_fsdb+0x72f/0x7e0 [mgs]
      [ 2214.247145]  [<ffffffffa0f80db2>] mgs_check_index+0x62/0x2f0 [mgs]
      [ 2214.254738]  [<ffffffffa0f689fe>] mgs_target_reg+0x38e/0x1320 [mgs]
      [ 2214.262432]  [<ffffffffa0c74adb>] tgt_request_handle+0x8fb/0x11f0 [ptlrpc]
      [ 2214.270794]  [<ffffffffa0c18b8b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
      [ 2214.280035]  [<ffffffffa084fce8>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
      [ 2214.288315]  [<ffffffffa0c15c58>] ? ptlrpc_wait_event+0x98/0x330 [ptlrpc]
      [ 2214.296584]  [<ffffffff810ba2e8>] ? __wake_up_common+0x58/0x90
      [ 2214.303799]  [<ffffffffa0c1c4b0>] ptlrpc_main+0xc00/0x1f60 [ptlrpc]
      [ 2214.311475]  [<ffffffff81029569>] ? __switch_to+0xd9/0x4c0
      [ 2214.318290]  [<ffffffffa0c1b8b0>] ? ptlrpc_main+0x0/0x1f60 [ptlrpc]
      [ 2214.325964]  [<ffffffff810b06ff>] kthread+0xcf/0xe0
      [ 2214.332082]  [<ffffffff810b0630>] ? kthread+0x0/0xe0
      [ 2214.338288]  [<ffffffff81696b98>] ret_from_fork+0x58/0x90
      [ 2214.344981]  [<ffffffff810b0630>] ? kthread+0x0/0xe0
      [ 2214.351182] 
      

      ZFS:

      [root@onyx-78 ~]# mount -t lustre -o acl,user_xattr lustre-mdt1/mdt1 /mnt/mds0
      [ 2214.245461] LNet: HW CPU cores: 72, npartitions: 8
      [ 2214.253142] alg: No test for adler32 (adler32-zlib)
      [ 2214.258723] alg: No test for crc32 (crc32-table)
      [ 2219.277933] sha512_ssse3: Using AVX2 optimized SHA-512 implementation
      [ 2222.279787] Lustre: Lustre: Build Version: 2.7.19.10--PRISTINE-3.10.0-514.10.2.el7_lustre.x86_64
      [ 2222.309748] LNet: Added LNI 10.2.2.50@tcp [8/512/0/180]
      [ 2222.315678] LNet: Accept secure, port 988
      [ 2273.432429] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.2.52@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 2275.484816] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.4.16@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 2275.504279] LustreError: Skipped 1 previous similar message
      [ 2298.432064] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.2.52@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 2298.451527] LustreError: Skipped 1 previous similar message
      [ 2300.482820] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.4.16@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 2300.502281] LustreError: Skipped 1 previous similar message
      [ 2323.431994] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.2.52@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 2323.451546] LustreError: Skipped 1 previous similar message
      [ 2348.432062] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.2.52@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 2348.455763] LustreError: Skipped 3 previous similar messages
      [ 2373.432088] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.2.52@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 2373.455409] LustreError: Skipped 3 previous similar messages
      [ 2401.707691] INFO: task ll_mgs_0001:32818 blocked for more than 120 seconds.
      [ 2401.717283] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 2401.727712] ll_mgs_0001     D ffff8810583fae30     0 32818      2 0x00000080
      [ 2401.737119]  ffff88103479bba0 0000000000000046 ffff881058a91f60 ffff88103479bfd8
      [ 2401.746898]  ffff88103479bfd8 ffff88103479bfd8 ffff881058a91f60 ffff8810583fae28
      [ 2401.756643]  ffff8810583fae2c ffff881058a91f60 00000000ffffffff ffff8810583fae30
      [ 2401.766335] Call Trace:
      [ 2401.770395]  [<ffffffff8168cdf9>] schedule_preempt_disabled+0x29/0x70
      [ 2401.778859]  [<ffffffff8168aa55>] __mutex_lock_slowpath+0xc5/0x1c0
      [ 2401.786989]  [<ffffffff81689ebf>] mutex_lock+0x1f/0x2f
      [ 2401.793954]  [<ffffffffa11d829f>] mgs_fsc_attach+0x22f/0x600 [mgs]
      [ 2401.802032]  [<ffffffffa11b247a>] mgs_llog_open+0x1fa/0x430 [mgs]
      [ 2401.810052]  [<ffffffffa0fd6adb>] tgt_request_handle+0x8fb/0x11f0 [ptlrpc]
      [ 2401.818939]  [<ffffffffa0f7ab8b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
      [ 2401.828681]  [<ffffffffa0be7ce8>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
      [ 2401.837460]  [<ffffffffa0f77c58>] ? ptlrpc_wait_event+0x98/0x330 [ptlrpc]
      [ 2401.846192]  [<ffffffff810ba2e8>] ? __wake_up_common+0x58/0x90
      [ 2401.853882]  [<ffffffffa0f7e4b0>] ptlrpc_main+0xc00/0x1f60 [ptlrpc]
      [ 2401.862057]  [<ffffffffa0f7d8b0>] ? ptlrpc_register_service+0x1070/0x1070 [ptlrpc]
      [ 2401.871654]  [<ffffffff810b06ff>] kthread+0xcf/0xe0
      [ 2401.878227]  [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
      [ 2401.886661]  [<ffffffff81696b98>] ret_from_fork+0x58/0x90
      [ 2401.893816]  [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
      [ 2401.902243] INFO: task ll_mgs_0002:32819 blocked for more than 120 seconds.
      [ 2401.911156] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 2401.921038] ll_mgs_0002     D ffffffffa11c4660     0 32819      2 0x00000080
      [ 2401.930084]  ffff88103523ba10 0000000000000046 ffff881058a94e70 ffff88103523bfd8
      [ 2401.939561]  ffff88103523bfd8 ffff88103523bfd8 ffff881058a94e70 ffff88085a687748
      [ 2401.949014]  ffff88085a687750 7fffffffffffffff ffff881058a94e70 ffffffffa11c4660
      [ 2401.958448] Call Trace:
      [ 2401.962252]  [<ffffffffa11c4660>] ? mgs_steal_client_llog_handler+0x1110/0x1110 [mgs]
      [ 2401.972072]  [<ffffffff8168bd09>] schedule+0x29/0x70
      [ 2401.978673]  [<ffffffff81689759>] schedule_timeout+0x239/0x2d0
      [ 2401.986230]  [<ffffffff810c7f05>] ? sched_clock_cpu+0x85/0xc0
      [ 2401.993663]  [<ffffffff810c1b15>] ? check_preempt_curr+0x75/0xa0
      [ 2402.001376]  [<ffffffff810c1b59>] ? ttwu_do_wakeup+0x19/0xd0
      [ 2402.008685]  [<ffffffffa11c4660>] ? mgs_steal_client_llog_handler+0x1110/0x1110 [mgs]
      [ 2402.018404]  [<ffffffff8168c0e6>] wait_for_completion+0x116/0x170
      [ 2402.026180]  [<ffffffff810c5080>] ? wake_up_state+0x20/0x20
      [ 2402.033374]  [<ffffffffa0cbc047>] llog_process_or_fork+0x1d7/0x590 [obdclass]
      [ 2402.042291]  [<ffffffffa0cbc414>] llog_process+0x14/0x20 [obdclass]
      [ 2402.050215]  [<ffffffffa11ccc9f>] mgs_find_or_make_fsdb+0x72f/0x7e0 [mgs]
      [ 2402.058710]  [<ffffffffa11ccdb2>] mgs_check_index+0x62/0x2f0 [mgs]
      [ 2402.066494]  [<ffffffffa11b49fe>] mgs_target_reg+0x38e/0x1320 [mgs]
      [ 2402.074414]  [<ffffffffa0fd6adb>] tgt_request_handle+0x8fb/0x11f0 [ptlrpc]
      [ 2402.082981]  [<ffffffffa0f7ab8b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
      [ 2402.092385]  [<ffffffffa0be7ce8>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
      [ 2402.100831]  [<ffffffffa0f77c58>] ? ptlrpc_wait_event+0x98/0x330 [ptlrpc]
      [ 2402.109235]  [<ffffffff810ba2e8>] ? __wake_up_common+0x58/0x90
      [ 2402.116588]  [<ffffffffa0f7e4b0>] ptlrpc_main+0xc00/0x1f60 [ptlrpc]
      [ 2402.124421]  [<ffffffffa0f7d8b0>] ? ptlrpc_register_service+0x1070/0x1070 [ptlrpc]
      [ 2402.133680]  [<ffffffff810b06ff>] kthread+0xcf/0xe0
      [ 2402.139934]  [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
      [ 2402.148035]  [<ffffffff81696b98>] ret_from_fork+0x58/0x90
      [ 2402.154864]  [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
      [ 2402.162962] INFO: task ll_mgs_0003:32907 blocked for more than 120 seconds.
      [ 2402.171544] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 2402.181108] ll_mgs_0003     D ffff8810583fae30     0 32907      2 0x00000080
      [ 2402.189851]  ffff88102ae2bba0 0000000000000046 ffff881047b58000 ffff88102ae2bfd8
      [ 2402.199037]  ffff88102ae2bfd8 ffff88102ae2bfd8 ffff881047b58000 ffff8810583fae28
      [ 2402.208210]  ffff8810583fae2c ffff881047b58000 00000000ffffffff ffff8810583fae30
      

      Steps followed for the rolling upgrade/downgrade:
      1. Started with all clients and servers on b_ieel3_0 build 222 and built the Lustre file system.
      2. Upgraded the OSS to 2.10.1 RC1, remounted, and ran sanity.sh.
      3. Upgraded the MDS to 2.10.1 RC1, remounted, and ran sanity.sh.
      4. Upgraded the clients to 2.10.1 RC1, remounted, and ran sanity.sh.
      5. Downgraded the clients to b_ieel3_0 build 222, remounted, and ran sanity.sh.
      6. Downgraded the MDS to b_ieel3_0 build 222.
      For both ldiskfs and ZFS, the mount of the MDT at step 6 hangs; a rough sketch of that sequence follows below.
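
      A minimal sketch of the step-6 downgrade/remount sequence on the MDS, for reference only: the mount commands are taken from the logs above, while the unmount/downgrade commands, device names, and package names are assumptions about how the downgrade was performed, not the exact commands that were run.

      # On the MDS: unmount the MDT and unload the Lustre modules before swapping packages
      umount /mnt/mds0
      lustre_rmmod

      # Swap the 2.10.1 RC1 packages for the b_ieel3_0 build 222 packages
      # (package names are illustrative; the actual repo/packages may differ)
      yum -y downgrade lustre lustre-osd-ldiskfs kmod-lustre kmod-lustre-osd-ldiskfs

      # Remount the MDT -- this is the mount that never completes
      mount -t lustre -o acl,user_xattr /dev/sdb1 /mnt/mds0            # ldiskfs (onyx-63)
      # mount -t lustre -o acl,user_xattr lustre-mdt1/mdt1 /mnt/mds0   # zfs (onyx-78)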

            People

              Assignee: bfaccini Bruno Faccini (Inactive)
              Reporter: standan Saurabh Tandan (Inactive)
