Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.3.0, Lustre 2.1.4
    • Affects Version/s: Lustre 2.2.0
    • Labels: None
    • Severity: 3
    • 4502

    Description

      Lustre hung this morning; we are running an e2fsck on the MDT at the moment.

      Mounting the MDT with ldiskfs, we saw many large files like 'oi.XX.XX'. What are these files?
      Can you please help us debug the current problem?

      [root@weisshorn01 mdt]# ls
      capa_keys OBJECTS oi.16.15 oi.16.22 oi.16.3 oi.16.37 oi.16.44 oi.16.51 oi.16.59 oi.16.9
      CATALOGS oi.16.0 oi.16.16 oi.16.23 oi.16.30 oi.16.38 oi.16.45 oi.16.52 oi.16.6 PENDING
      CONFIGS oi.16.1 oi.16.17 oi.16.24 oi.16.31 oi.16.39 oi.16.46 oi.16.53 oi.16.60 ROOT
      fld oi.16.10 oi.16.18 oi.16.25 oi.16.32 oi.16.4 oi.16.47 oi.16.54 oi.16.61 seq_ctl
      last_rcvd oi.16.11 oi.16.19 oi.16.26 oi.16.33 oi.16.40 oi.16.48 oi.16.55 oi.16.62 seq_srv
      lost+found oi.16.12 oi.16.2 oi.16.27 oi.16.34 oi.16.41 oi.16.49 oi.16.56 oi.16.63
      lov_objid oi.16.13 oi.16.20 oi.16.28 oi.16.35 oi.16.42 oi.16.5 oi.16.57 oi.16.7
      NIDTBL_VERSIONS oi.16.14 oi.16.21 oi.16.29 oi.16.36 oi.16.43 oi.16.50 oi.16.58 oi.16.8
      [root@weisshorn01 mdt]# ls -l
      total 1957836
      -rw-r--r-- 1 root root 144 May 15 14:43 capa_keys
      -rwx------ 1 root root 2304 May 15 14:49 CATALOGS
      drwxrwxrwx 2 root root 4096 Jul 19 17:07 CONFIGS
      -rw-r--r-- 1 root root 8192 May 15 14:43 fld
      -rw-r--r-- 1 root root 392832 May 15 14:43 last_rcvd
      drwx------ 2 root root 16384 May 15 14:43 lost+found
      -rw-r--r-- 1 root root 576 May 15 14:43 lov_objid
      drwxrwxrwx 2 root root 4096 May 15 14:43 NIDTBL_VERSIONS
      drwxrwxrwx 2 root root 237568 Jul 20 08:56 OBJECTS
      -rw-r--r-- 1 root root 31494144 May 15 14:43 oi.16.0
      -rw-r--r-- 1 root root 26271744 May 15 14:43 oi.16.1
      -rw-r--r-- 1 root root 49315840 May 15 14:43 oi.16.10
      -rw-r--r-- 1 root root 29536256 May 15 14:43 oi.16.11
      -rw-r--r-- 1 root root 26890240 May 15 14:43 oi.16.12
      -rw-r--r-- 1 root root 20484096 May 15 14:43 oi.16.13
      -rw-r--r-- 1 root root 30490624 May 15 14:43 oi.16.14
      -rw-r--r-- 1 root root 33075200 May 15 14:43 oi.16.15
      -rw-r--r-- 1 root root 25034752 May 15 14:43 oi.16.16
      -rw-r--r-- 1 root root 43155456 May 15 14:43 oi.16.17
      -rw-r--r-- 1 root root 27435008 May 15 14:43 oi.16.18
      -rw-r--r-- 1 root root 21987328 May 15 14:43 oi.16.19
      -rw-r--r-- 1 root root 29138944 May 15 14:43 oi.16.2
      -rw-r--r-- 1 root root 21946368 May 15 14:43 oi.16.20
      -rw-r--r-- 1 root root 28278784 May 15 14:43 oi.16.21
      -rw-r--r-- 1 root root 28504064 May 15 14:43 oi.16.22
      -rw-r--r-- 1 root root 30584832 May 15 14:43 oi.16.23
      -rw-r--r-- 1 root root 27758592 May 15 14:43 oi.16.24
      -rw-r--r-- 1 root root 22654976 May 15 14:43 oi.16.25
      -rw-r--r-- 1 root root 24956928 May 15 14:43 oi.16.26
      -rw-r--r-- 1 root root 45015040 May 15 14:43 oi.16.27
      -rw-r--r-- 1 root root 27344896 May 15 14:43 oi.16.28
      -rw-r--r-- 1 root root 36724736 May 15 14:43 oi.16.29
      -rw-r--r-- 1 root root 23318528 May 15 14:43 oi.16.3
      -rw-r--r-- 1 root root 25739264 May 15 14:43 oi.16.30
      -rw-r--r-- 1 root root 26865664 May 15 14:43 oi.16.31
      -rw-r--r-- 1 root root 29147136 May 15 14:43 oi.16.32
      -rw-r--r-- 1 root root 28573696 May 15 14:43 oi.16.33
      -rw-r--r-- 1 root root 26796032 May 15 14:43 oi.16.34
      -rw-r--r-- 1 root root 30167040 May 15 14:43 oi.16.35
      -rw-r--r-- 1 root root 31641600 May 15 14:43 oi.16.36
      -rw-r--r-- 1 root root 21430272 May 15 14:43 oi.16.37
      -rw-r--r-- 1 root root 24567808 May 15 14:43 oi.16.38
      -rw-r--r-- 1 root root 29364224 May 15 14:43 oi.16.39
      -rw-r--r-- 1 root root 22032384 May 15 14:43 oi.16.4
      -rw-r--r-- 1 root root 41111552 May 15 14:43 oi.16.40
      -rw-r--r-- 1 root root 41889792 May 15 14:43 oi.16.41
      -rw-r--r-- 1 root root 34344960 May 15 14:43 oi.16.42
      -rw-r--r-- 1 root root 45531136 May 15 14:43 oi.16.43
      -rw-r--r-- 1 root root 34304000 May 15 14:43 oi.16.44
      -rw-r--r-- 1 root root 32129024 May 15 14:43 oi.16.45
      -rw-r--r-- 1 root root 30593024 May 15 14:43 oi.16.46
      -rw-r--r-- 1 root root 33566720 May 15 14:43 oi.16.47
      -rw-r--r-- 1 root root 31928320 May 15 14:43 oi.16.48
      -rw-r--r-- 1 root root 32591872 May 15 14:43 oi.16.49
      -rw-r--r-- 1 root root 29097984 May 15 14:43 oi.16.5
      -rw-r--r-- 1 root root 38350848 May 15 14:43 oi.16.50
      -rw-r--r-- 1 root root 24289280 May 15 14:43 oi.16.51
      -rw-r--r-- 1 root root 41656320 May 15 14:43 oi.16.52
      -rw-r--r-- 1 root root 35467264 May 15 14:43 oi.16.53
      -rw-r--r-- 1 root root 37556224 May 15 14:43 oi.16.54
      -rw-r--r-- 1 root root 32391168 May 15 14:43 oi.16.55
      -rw-r--r-- 1 root root 31694848 May 15 14:43 oi.16.56
      -rw-r--r-- 1 root root 35209216 May 15 14:43 oi.16.57
      -rw-r--r-- 1 root root 34750464 May 15 14:43 oi.16.58
      -rw-r--r-- 1 root root 33206272 May 15 14:43 oi.16.59
      -rw-r--r-- 1 root root 40476672 May 15 14:43 oi.16.6
      -rw-r--r-- 1 root root 26509312 May 15 14:43 oi.16.60
      -rw-r--r-- 1 root root 29929472 May 15 14:43 oi.16.61
      -rw-r--r-- 1 root root 34635776 May 15 14:43 oi.16.62
      -rw-r--r-- 1 root root 23273472 May 15 14:43 oi.16.63
      -rw-r--r-- 1 root root 34062336 May 15 14:43 oi.16.7
      -rw-r--r-- 1 root root 33783808 May 15 14:43 oi.16.8
      -rw-r--r-- 1 root root 33796096 May 15 14:43 oi.16.9
      drwxr-xr-x 2 root root 5906432 May 15 14:43 PENDING
      drwxr-xr-x 856 root root 36864 Jan 1 1970 ROOT
      -rw-r--r-- 1 root root 24 May 15 14:43 seq_ctl
      -rw-r--r-- 1 root root 24 May 15 14:43 seq_srv

    Attachments

    Issue Links

    Activity

            [LU-1648] MDS Crash
            emoly.liu Emoly Liu added a comment -

            Port for b2_1 is at http://review.whamcloud.com/4743

            yong.fan nasf (Inactive) added a comment -

            The patch has been landed to Lustre-2.3.

            Fabio, would you have a chance to verify it on your system? Thanks!

            yong.fan nasf (Inactive) added a comment -

            My current idea is to increase the credits for llog cancel to prevent a journal restart during the transaction. It may not be a perfect solution, but it should be workable. This is the patch:

            http://review.whamcloud.com/#change,3463

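            As an illustration only (this is not the actual content of the change above): the idea amounts to reserving enough jbd2 credits before touching the llog, so the cancel path never has to restart its handle mid-transaction. The function and credit constant below are made-up names used for the sketch.

            /* Hedged sketch, not the real Lustre patch: reserve generous credits
             * before doing any llog cancel work, so jbd2_journal_restart() is
             * never needed while llog locks are held.
             * LLOG_CANCEL_CREDITS and llog_cancel_one_rec() are hypothetical. */
            #include <linux/jbd2.h>
            #include <linux/err.h>

            #define LLOG_CANCEL_CREDITS 64   /* hypothetical, generous estimate */

            static int llog_cancel_one_rec(journal_t *journal)
            {
                    handle_t *handle;
                    int rc;

                    /* Ask for all credits before taking any llog lock. */
                    handle = jbd2_journal_start(journal, LLOG_CANCEL_CREDITS);
                    if (IS_ERR(handle))
                            return PTR_ERR(handle);

                    /* ... take lgh_lock, update the llog header/bitmap, destroy
                     * the plain llog object if it became empty ... */

                    /* With enough credits reserved, the destroy/truncate work
                     * does not run out of credits, so the handle never restarts
                     * (and never sleeps in start_this_handle()) under lgh_lock. */

                    /* ... release lgh_lock ... */

                    rc = jbd2_journal_stop(handle);
                    return rc;
            }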

            yong.fan nasf (Inactive) added a comment -

            Yes, it is LU-81. But LU-81 did not totally fix the deadlock of changelog adding vs. changelog canceling. There are other potential deadlock cases caused by the race between starting a journal handle and acquiring "lgh_lock".

            The original patch for LU-81 enforced the following sequence:
            1) start journal handle
            2) acquire "lgh_lock"

            But it ignored the case of a journal handle restart. In that case, the caller may be blocked while holding "lgh_lock". I have no idea how to resolve such a deadlock yet.
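
            A minimal sketch of that wait cycle, assuming the path shown in the trace below; the lock-taking steps are abbreviated to comments because the exact lgh_lock type varies between releases:

            /* Thread A: llog cancel / destroy path (compare the trace below) */
            handle = jbd2_journal_start(journal, credits);   /* 1) start handle */
            /* 2) acquire lgh_lock */
            /* truncating a large llog object runs out of credits inside
             * ldiskfs_truncate(), so the handle has to be restarted: */
            jbd2_journal_restart(handle, credits);   /* may block in
                                                      * start_this_handle() ... */
            /* ... while lgh_lock is still held, so any other thread that needs
             * lgh_lock (a changelog add, or another cancel on the same llog) is
             * stuck behind Thread A, and service threads pile up in
             * start_this_handle(). */
            /* 3) release lgh_lock, jbd2_journal_stop(handle)  -- never reached */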

            Jul 20 09:06:37 weisshorn02 kernel: Call Trace:
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa01432d4>] ? do_get_write_access+0x3b4/0x520 [jbd2]
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa0142072>] start_this_handle+0x282/0x500 [jbd2]
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffff81090a90>] ? autoremove_wake_function+0x0/0x40
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa01423c1>] jbd2_journal_restart+0xd1/0x130 [jbd2]
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa0ace8ba>] ldiskfs_truncate_restart_trans+0x8a/0xa0 [ldiskfs]
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa0ad3bfd>] ldiskfs_clear_blocks+0x9d/0x170 [ldiskfs]
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa0ad3de4>] ldiskfs_free_data+0x114/0x170 [ldiskfs]
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa0ad4023>] ldiskfs_free_branches+0x1e3/0x200 [ldiskfs]
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa0ad3f16>] ldiskfs_free_branches+0xd6/0x200 [ldiskfs]
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa0ad4629>] ldiskfs_truncate+0x5e9/0x660 [ldiskfs]
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa0ac3f3b>] ? __ldiskfs_handle_dirty_metadata+0x7b/0x100 [ldiskfs]
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa0acec7e>] ? ldiskfs_mark_iloc_dirty+0x36e/0x5d0 [ldiskfs]
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa0ad0b23>] ? ldiskfs_mark_inode_dirty+0x83/0x1f0 [ldiskfs]
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffff814ee34e>] ? mutex_lock+0x1e/0x50
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa0ad57c0>] ? ldiskfs_delete_inode+0x0/0x250 [ldiskfs]
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa0ad59a0>] ldiskfs_delete_inode+0x1e0/0x250 [ldiskfs]
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffff811915fe>] generic_delete_inode+0xde/0x1d0
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffff81191755>] generic_drop_inode+0x65/0x80
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffff811905d2>] iput+0x62/0x70
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa074e41e>] mds_obd_destroy+0x3ae/0x850 [mds]
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffff81177ee1>] ? __fput+0x1a1/0x210
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa049eca5>] llog_lvfs_destroy+0x545/0xbb0 [obdclass]
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa0141ed5>] ? start_this_handle+0xe5/0x500 [jbd2]
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa0496acf>] llog_cancel_rec+0x31f/0x600 [obdclass]
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa049aee9>] llog_cat_cancel_records+0x99/0x230 [obdclass]
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa063fbee>] llog_origin_handle_cancel+0x22e/0xa20 [ptlrpc]
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa03c4903>] ? cfs_alloc+0x63/0x90 [libcfs]
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa04d10df>] ? keys_fill+0x6f/0x1a0 [obdclass]
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa060da87>] ldlm_cancel_handler+0x157/0x4a0 [ptlrpc]
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa06363c1>] ptlrpc_server_handle_request+0x3c1/0xcb0 [ptlrpc]
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa03c44ce>] ? cfs_timer_arm+0xe/0x10 [libcfs]
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa03ceef9>] ? lc_watchdog_touch+0x79/0x110 [libcfs]
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa0630462>] ? ptlrpc_wait_event+0xb2/0x2c0 [ptlrpc]
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffff810519c3>] ? __wake_up+0x53/0x70
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa06373cf>] ptlrpc_main+0x71f/0x1210 [ptlrpc]
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa0636cb0>] ? ptlrpc_main+0x0/0x1210 [ptlrpc]
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa0636cb0>] ? ptlrpc_main+0x0/0x1210 [ptlrpc]
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa0636cb0>] ? ptlrpc_main+0x0/0x1210 [ptlrpc]
            Jul 20 09:06:37 weisshorn02 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20


            liang Liang Zhen (Inactive) added a comment -

            Many threads are stuck at "start_this_handle":

            Jul 20 09:00:34 weisshorn02 kernel: Call Trace:
            Jul 20 09:00:34 weisshorn02 kernel: [<ffffffff8127466d>] ? pointer+0xad/0xa60
            Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa0142072>] start_this_handle+0x282/0x500 [jbd2]
            Jul 20 09:00:34 weisshorn02 kernel: [<ffffffff812731ee>] ? number+0x2ee/0x320
            Jul 20 09:00:34 weisshorn02 kernel: [<ffffffff81090a90>] ? autoremove_wake_function+0x0/0x40
            Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa01424f0>] jbd2_journal_start+0xd0/0x110 [jbd2]
            Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa0af8b08>] ldiskfs_journal_start_sb+0x58/0x90 [ldiskfs]
            Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa05d0d41>] fsfilt_ldiskfs_start+0x91/0x480 [fsfilt_ldiskfs]
            Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa063fdaa>] llog_origin_handle_cancel+0x3ea/0xa20 [ptlrpc]
            Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa03c4903>] ? cfs_alloc+0x63/0x90 [libcfs]
            Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa04d10df>] ? keys_fill+0x6f/0x1a0 [obdclass]
            Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa060da87>] ldlm_cancel_handler+0x157/0x4a0 [ptlrpc]
            Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa06363c1>] ptlrpc_server_handle_request+0x3c1/0xcb0 [ptlrpc]
            Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa03c44ce>] ? cfs_timer_arm+0xe/0x10 [libcfs]
            Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa03ceef9>] ? lc_watchdog_touch+0x79/0x110 [libcfs]
            Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa0630462>] ? ptlrpc_wait_event+0xb2/0x2c0 [ptlrpc]
            Jul 20 09:00:34 weisshorn02 kernel: [<ffffffff810519c3>] ? __wake_up+0x53/0x70
            Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa06373cf>] ptlrpc_main+0x71f/0x1210 [ptlrpc]
            
            

            It looks like LU-81, but the fix should be in 2.2 already; any comment?


            yong.fan nasf (Inactive) added a comment -

            These oi.16.xx files are used to map the global identifier (FID) to the local identifier (ino# & generation) for the ldiskfs-based backend filesystem. They are used only on the server and are invisible to clients.
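
            For reference, a minimal sketch of what one such mapping record conceptually contains; the FID fields follow struct lu_fid, but the record layout shown here is illustrative, not the exact on-disk OI format:

            #include <stdint.h>

            /* Global, client-visible identifier (fields as in struct lu_fid). */
            struct lu_fid {
                    uint64_t f_seq;   /* sequence number */
                    uint32_t f_oid;   /* object id within the sequence */
                    uint32_t f_ver;   /* version */
            };

            /* One conceptual OI entry: FID -> local ldiskfs inode.
             * (Illustrative layout only, not the real on-disk record.) */
            struct oi_map_rec {
                    struct lu_fid fid;   /* key */
                    uint32_t      ino;   /* value: inode number */
                    uint32_t      gen;   /* value: inode generation */
            };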

            According to the current design and implementation, the OI file size/space cannot be shrunk. I am working on a patch in LU-1512 to slow down the rate at which they grow. On Lustre 2.3, you can remove those OI files by force and OI scrub will rebuild them automatically; usually the newly created OI files are smaller. For earlier releases, there is no better solution.

            Anyway, the OI file size/space issue should not hang the system. Have you seen any "-ENOSPC" error messages on the MDS when the system hung?


            fverzell Fabio Verzelloni added a comment -

            [root@weisshorn01 ~]# e2fsck -fp /dev/mapper/mds
            scratch-MDT0000: Deleted inode 11624457 has zero dtime. FIXED.
            scratch-MDT0000: Deleted inode 11624753 has zero dtime. FIXED.
            scratch-MDT0000: Deleted inode 11624777 has zero dtime. FIXED.
            scratch-MDT0000: Deleted inode 11624784 has zero dtime. FIXED.
            scratch-MDT0000: Deleted inode 11624869 has zero dtime. FIXED.
            scratch-MDT0000: Deleted inode 11624887 has zero dtime. FIXED.
            scratch-MDT0000: Deleted inode 11625059 has zero dtime. FIXED.
            scratch-MDT0000: Deleted inode 11625127 has zero dtime. FIXED.
            scratch-MDT0000: Deleted inode 11654190 has zero dtime. FIXED.
            scratch-MDT0000: Deleted inode 11654602 has zero dtime. FIXED.
            scratch-MDT0000: Deleted inode 11654713 has zero dtime. FIXED.
            scratch-MDT0000: Deleted inode 11654772 has zero dtime. FIXED.
            scratch-MDT0000: Deleted inode 11654866 has zero dtime. FIXED.
            scratch-MDT0000: Deleted inode 11655003 has zero dtime. FIXED.
            scratch-MDT0000: Deleted inode 11655022 has zero dtime. FIXED.
            scratch-MDT0000: Deleted inode 11655110 has zero dtime. FIXED.
            scratch-MDT0000: Deleted inode 11655125 has zero dtime. FIXED.
            scratch-MDT0000: Deleted inode 11655138 has zero dtime. FIXED.
            scratch-MDT0000: Deleted inode 11655142 has zero dtime. FIXED.
            scratch-MDT0000: Deleted inode 11655261 has zero dtime. FIXED.
            scratch-MDT0000: Deleted inode 11655323 has zero dtime. FIXED.
            scratch-MDT0000: Deleted inode 11655324 has zero dtime. FIXED.
            scratch-MDT0000: Deleted inode 11655337 has zero dtime. FIXED.
            scratch-MDT0000: Deleted inode 11655396 has zero dtime. FIXED.
            scratch-MDT0000: Deleted inode 15234080 has zero dtime. FIXED.
            scratch-MDT0000: Deleted inode 15255286 has zero dtime. FIXED.
            scratch-MDT0000: Deleted inode 15785601 has zero dtime. FIXED.
            scratch-MDT0000: Deleted inode 18362624 has zero dtime. FIXED.
            scratch-MDT0000: Symlink /ROOT/tast/RUN_nproma1_bcLAI/unit.21 (inode #12063425) is invalid.

            scratch-MDT0000: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
            (i.e., without -a or -p options)


            liang Liang Zhen (Inactive) added a comment -

            By the way, I think the crash is not related to these files. Could you please post the console output or any other information from the MDS about the crash so we can investigate it? Thanks

            liang Liang Zhen (Inactive) added a comment - - edited

            Please don't touch these files. So I'm correct on LU-1642: it is because of the OI leak (LU-1512). These files are actually supposed to be big and to grow forever, but the growth rate should become slower and slower, not like now, where they always grow at the same speed.

            Again, we need Fan Yong to comment on this; I have added him to the CC list. I believe he has some way to fix this, which might require you to run a tool to rebuild these files.


            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: fverzell Fabio Verzelloni
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved: