[LU-1648] MDS Crash Created: 20/Jul/12  Updated: 22/Apr/14  Resolved: 21/Aug/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.2.0
Fix Version/s: Lustre 2.3.0, Lustre 2.1.4

Type: Bug Priority: Critical
Reporter: Fabio Verzelloni Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:
    1. MDS HW
       • Linux XXXX.admin.cscs.ch 2.6.32-220.7.1.el6_lustre.g9c8f747.x86_64
         Architecture: x86_64
         CPU op-mode(s): 32-bit, 64-bit
         Byte Order: Little Endian
         CPU(s): 16
         Vendor ID: AuthenticAMD
         CPU family: 16
         64 GB RAM
         Interconnect IB 40Gb/s
       • MDT: LSI 5480 Pikes Peak, SLC SSDs

    2. OSS HW
       • Architecture: x86_64
         CPU op-mode(s): 32-bit, 64-bit
         Byte Order: Little Endian
         CPU(s): 32
         Vendor ID: GenuineIntel
         CPU family: 6
         64 GB RAM
         Interconnect IB 40Gb/s
       • OSTs ---> LSI 7900, SATA disks

    3. Router nodes
       • 12 Cray XE6 service nodes as routers - IB 40Gb/s

    4. Clients
       • ~1500 Cray XE6 nodes - Lustre 1.8.6

    5. Lustre config
       • 1 MDS + 1 failover (MDT on SSD array)
         12 OSSs - 6 OSTs per OSS (72 OSTs)

Lustre servers ---> 2.2.51.0
Lustre clients ---> 1.8.6 (~1500 nodes) / 2.2.51.0 (~20 nodes)


Attachments: File cluster_20_jul.log    
Issue Links:
Related
is related to LU-4794 MDS threads all stuck in jbd2_journal... Resolved
Severity: 3
Rank (Obsolete): 4502

 Description   

Lustre hung this morning; we are running an e2fsck on the MDT at the moment.

Mounting the MDT with ldiskfs, we saw many large files like 'oi.XX.XX'. What are these files?
Can you please help us debug the current problem?

[root@weisshorn01 mdt]# ls
capa_keys OBJECTS oi.16.15 oi.16.22 oi.16.3 oi.16.37 oi.16.44 oi.16.51 oi.16.59 oi.16.9
CATALOGS oi.16.0 oi.16.16 oi.16.23 oi.16.30 oi.16.38 oi.16.45 oi.16.52 oi.16.6 PENDING
CONFIGS oi.16.1 oi.16.17 oi.16.24 oi.16.31 oi.16.39 oi.16.46 oi.16.53 oi.16.60 ROOT
fld oi.16.10 oi.16.18 oi.16.25 oi.16.32 oi.16.4 oi.16.47 oi.16.54 oi.16.61 seq_ctl
last_rcvd oi.16.11 oi.16.19 oi.16.26 oi.16.33 oi.16.40 oi.16.48 oi.16.55 oi.16.62 seq_srv
lost+found oi.16.12 oi.16.2 oi.16.27 oi.16.34 oi.16.41 oi.16.49 oi.16.56 oi.16.63
lov_objid oi.16.13 oi.16.20 oi.16.28 oi.16.35 oi.16.42 oi.16.5 oi.16.57 oi.16.7
NIDTBL_VERSIONS oi.16.14 oi.16.21 oi.16.29 oi.16.36 oi.16.43 oi.16.50 oi.16.58 oi.16.8
[root@weisshorn01 mdt]# ls -l
total 1957836
-rw-r--r-- 1 root root 144 May 15 14:43 capa_keys
-rwx------ 1 root root 2304 May 15 14:49 CATALOGS
drwxrwxrwx 2 root root 4096 Jul 19 17:07 CONFIGS
-rw-r--r-- 1 root root 8192 May 15 14:43 fld
-rw-r--r-- 1 root root 392832 May 15 14:43 last_rcvd
drwx------ 2 root root 16384 May 15 14:43 lost+found
-rw-r--r-- 1 root root 576 May 15 14:43 lov_objid
drwxrwxrwx 2 root root 4096 May 15 14:43 NIDTBL_VERSIONS
drwxrwxrwx 2 root root 237568 Jul 20 08:56 OBJECTS
-rw-r--r-- 1 root root 31494144 May 15 14:43 oi.16.0
-rw-r--r-- 1 root root 26271744 May 15 14:43 oi.16.1
-rw-r--r-- 1 root root 49315840 May 15 14:43 oi.16.10
-rw-r--r-- 1 root root 29536256 May 15 14:43 oi.16.11
-rw-r--r-- 1 root root 26890240 May 15 14:43 oi.16.12
-rw-r--r-- 1 root root 20484096 May 15 14:43 oi.16.13
-rw-r--r-- 1 root root 30490624 May 15 14:43 oi.16.14
-rw-r--r-- 1 root root 33075200 May 15 14:43 oi.16.15
-rw-r--r-- 1 root root 25034752 May 15 14:43 oi.16.16
-rw-r--r-- 1 root root 43155456 May 15 14:43 oi.16.17
-rw-r--r-- 1 root root 27435008 May 15 14:43 oi.16.18
-rw-r--r-- 1 root root 21987328 May 15 14:43 oi.16.19
-rw-r--r-- 1 root root 29138944 May 15 14:43 oi.16.2
-rw-r--r-- 1 root root 21946368 May 15 14:43 oi.16.20
-rw-r--r-- 1 root root 28278784 May 15 14:43 oi.16.21
-rw-r--r-- 1 root root 28504064 May 15 14:43 oi.16.22
-rw-r--r-- 1 root root 30584832 May 15 14:43 oi.16.23
-rw-r--r-- 1 root root 27758592 May 15 14:43 oi.16.24
-rw-r--r-- 1 root root 22654976 May 15 14:43 oi.16.25
-rw-r--r-- 1 root root 24956928 May 15 14:43 oi.16.26
-rw-r--r-- 1 root root 45015040 May 15 14:43 oi.16.27
-rw-r--r-- 1 root root 27344896 May 15 14:43 oi.16.28
-rw-r--r-- 1 root root 36724736 May 15 14:43 oi.16.29
-rw-r--r-- 1 root root 23318528 May 15 14:43 oi.16.3
-rw-r--r-- 1 root root 25739264 May 15 14:43 oi.16.30
-rw-r--r-- 1 root root 26865664 May 15 14:43 oi.16.31
-rw-r--r-- 1 root root 29147136 May 15 14:43 oi.16.32
-rw-r--r-- 1 root root 28573696 May 15 14:43 oi.16.33
-rw-r--r-- 1 root root 26796032 May 15 14:43 oi.16.34
-rw-r--r-- 1 root root 30167040 May 15 14:43 oi.16.35
-rw-r--r-- 1 root root 31641600 May 15 14:43 oi.16.36
-rw-r--r-- 1 root root 21430272 May 15 14:43 oi.16.37
-rw-r--r-- 1 root root 24567808 May 15 14:43 oi.16.38
-rw-r--r-- 1 root root 29364224 May 15 14:43 oi.16.39
-rw-r--r-- 1 root root 22032384 May 15 14:43 oi.16.4
-rw-r--r-- 1 root root 41111552 May 15 14:43 oi.16.40
-rw-r--r-- 1 root root 41889792 May 15 14:43 oi.16.41
-rw-r--r-- 1 root root 34344960 May 15 14:43 oi.16.42
-rw-r--r-- 1 root root 45531136 May 15 14:43 oi.16.43
-rw-r--r-- 1 root root 34304000 May 15 14:43 oi.16.44
-rw-r--r-- 1 root root 32129024 May 15 14:43 oi.16.45
-rw-r--r-- 1 root root 30593024 May 15 14:43 oi.16.46
-rw-r--r-- 1 root root 33566720 May 15 14:43 oi.16.47
-rw-r--r-- 1 root root 31928320 May 15 14:43 oi.16.48
-rw-r--r-- 1 root root 32591872 May 15 14:43 oi.16.49
-rw-r--r-- 1 root root 29097984 May 15 14:43 oi.16.5
-rw-r--r-- 1 root root 38350848 May 15 14:43 oi.16.50
-rw-r--r-- 1 root root 24289280 May 15 14:43 oi.16.51
-rw-r--r-- 1 root root 41656320 May 15 14:43 oi.16.52
-rw-r--r-- 1 root root 35467264 May 15 14:43 oi.16.53
-rw-r--r-- 1 root root 37556224 May 15 14:43 oi.16.54
-rw-r--r-- 1 root root 32391168 May 15 14:43 oi.16.55
-rw-r--r-- 1 root root 31694848 May 15 14:43 oi.16.56
-rw-r--r-- 1 root root 35209216 May 15 14:43 oi.16.57
-rw-r--r-- 1 root root 34750464 May 15 14:43 oi.16.58
-rw-r--r-- 1 root root 33206272 May 15 14:43 oi.16.59
-rw-r--r-- 1 root root 40476672 May 15 14:43 oi.16.6
-rw-r--r-- 1 root root 26509312 May 15 14:43 oi.16.60
-rw-r--r-- 1 root root 29929472 May 15 14:43 oi.16.61
-rw-r--r-- 1 root root 34635776 May 15 14:43 oi.16.62
-rw-r--r-- 1 root root 23273472 May 15 14:43 oi.16.63
-rw-r--r-- 1 root root 34062336 May 15 14:43 oi.16.7
-rw-r--r-- 1 root root 33783808 May 15 14:43 oi.16.8
-rw-r--r-- 1 root root 33796096 May 15 14:43 oi.16.9
drwxr-xr-x 2 root root 5906432 May 15 14:43 PENDING
drwxr-xr-x 856 root root 36864 Jan 1 1970 ROOT
-rw-r--r-- 1 root root 24 May 15 14:43 seq_ctl
-rw-r--r-- 1 root root 24 May 15 14:43 seq_srv



 Comments   
Comment by Liang Zhen (Inactive) [ 20/Jul/12 ]

Please don't touch these files. So I was correct on LU-1642: this is caused by the OI leak (LU-1512). These files are actually supposed to be big and to keep growing, but the growth rate should become slower and slower, not stay constant the way it does now.

Again, we need Fan Yong to comment on this; I have added him to the CC list. I believe he has a way to fix this, which might require you to run a tool to rebuild these files.

Comment by Liang Zhen (Inactive) [ 20/Jul/12 ]

Btw, I think the crash is not related to these files. Could you please post the console output, or whatever other information you have from the MDS about the crash, so we can investigate it? Thanks

Comment by Fabio Verzelloni [ 20/Jul/12 ]

[root@weisshorn01 ~]# e2fsck -fp /dev/mapper/mds
scratch-MDT0000: Deleted inode 11624457 has zero dtime. FIXED.
scratch-MDT0000: Deleted inode 11624753 has zero dtime. FIXED.
scratch-MDT0000: Deleted inode 11624777 has zero dtime. FIXED.
scratch-MDT0000: Deleted inode 11624784 has zero dtime. FIXED.
scratch-MDT0000: Deleted inode 11624869 has zero dtime. FIXED.
scratch-MDT0000: Deleted inode 11624887 has zero dtime. FIXED.
scratch-MDT0000: Deleted inode 11625059 has zero dtime. FIXED.
scratch-MDT0000: Deleted inode 11625127 has zero dtime. FIXED.
scratch-MDT0000: Deleted inode 11654190 has zero dtime. FIXED.
scratch-MDT0000: Deleted inode 11654602 has zero dtime. FIXED.
scratch-MDT0000: Deleted inode 11654713 has zero dtime. FIXED.
scratch-MDT0000: Deleted inode 11654772 has zero dtime. FIXED.
scratch-MDT0000: Deleted inode 11654866 has zero dtime. FIXED.
scratch-MDT0000: Deleted inode 11655003 has zero dtime. FIXED.
scratch-MDT0000: Deleted inode 11655022 has zero dtime. FIXED.
scratch-MDT0000: Deleted inode 11655110 has zero dtime. FIXED.
scratch-MDT0000: Deleted inode 11655125 has zero dtime. FIXED.
scratch-MDT0000: Deleted inode 11655138 has zero dtime. FIXED.
scratch-MDT0000: Deleted inode 11655142 has zero dtime. FIXED.
scratch-MDT0000: Deleted inode 11655261 has zero dtime. FIXED.
scratch-MDT0000: Deleted inode 11655323 has zero dtime. FIXED.
scratch-MDT0000: Deleted inode 11655324 has zero dtime. FIXED.
scratch-MDT0000: Deleted inode 11655337 has zero dtime. FIXED.
scratch-MDT0000: Deleted inode 11655396 has zero dtime. FIXED.
scratch-MDT0000: Deleted inode 15234080 has zero dtime. FIXED.
scratch-MDT0000: Deleted inode 15255286 has zero dtime. FIXED.
scratch-MDT0000: Deleted inode 15785601 has zero dtime. FIXED.
scratch-MDT0000: Deleted inode 18362624 has zero dtime. FIXED.
scratch-MDT0000: Symlink /ROOT/tast/RUN_nproma1_bcLAI/unit.21 (inode #12063425) is invalid.

scratch-MDT0000: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(i.e., without -a or -p options)
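
For reference, the manual run that message asks for would look something like the following (same device as in the command above; -y is optional and auto-answers the repair prompts, so use it only if that matches your site policy):

[root@weisshorn01 ~]# e2fsck -f -y /dev/mapper/mds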

Comment by nasf (Inactive) [ 20/Jul/12 ]

These oi.16.xx files map the global identifier (FID) to the local identifier (ino# & gen) for the ldiskfs-based backend filesystem. They are used on the server only and are invisible to clients.

According to the current design and implementation, the OI file size/space cannot be shrunk. I am making a patch in LU-1512 to slow the rate of size/space growth. On Lustre 2.3, you can remove those OI files by force and OI Scrub will rebuild them automatically; the newly created OI files are usually smaller. For earlier releases there is no better solution.

Anyway, the OI file size/space issue should not hang the system. Did you see any "-ENOSPC" error messages on the MDS when the system hung?
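
For reference, the forced rebuild described above (Lustre 2.3 only) would look roughly like the sketch below. The device and mount-point paths are assumptions taken from the e2fsck output earlier in this ticket, and the oi_scrub parameter name is from later OI Scrub documentation, so verify the procedure against the 2.3 docs before running it on a production MDT:

# with the MDT stopped, mount the device as plain ldiskfs
[root@weisshorn01 ~]# mount -t ldiskfs /dev/mapper/mds /mnt/mdt
# remove the OI files by force; OI Scrub rebuilds them on the next Lustre mount
[root@weisshorn01 ~]# rm -f /mnt/mdt/oi.16.*
[root@weisshorn01 ~]# umount /mnt/mdt
# remount as Lustre and watch OI Scrub progress
[root@weisshorn01 ~]# mount -t lustre /dev/mapper/mds /mnt/lustre-mdt
[root@weisshorn01 ~]# lctl get_param osd-ldiskfs.*.oi_scrub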

Comment by Liang Zhen (Inactive) [ 20/Jul/12 ]

many threads are stuck at "start_this_handle":

Jul 20 09:00:34 weisshorn02 kernel: Call Trace:
Jul 20 09:00:34 weisshorn02 kernel: [<ffffffff8127466d>] ? pointer+0xad/0xa60
Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa0142072>] start_this_handle+0x282/0x500 [jbd2]
Jul 20 09:00:34 weisshorn02 kernel: [<ffffffff812731ee>] ? number+0x2ee/0x320
Jul 20 09:00:34 weisshorn02 kernel: [<ffffffff81090a90>] ? autoremove_wake_function+0x0/0x40
Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa01424f0>] jbd2_journal_start+0xd0/0x110 [jbd2]
Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa0af8b08>] ldiskfs_journal_start_sb+0x58/0x90 [ldiskfs]
Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa05d0d41>] fsfilt_ldiskfs_start+0x91/0x480 [fsfilt_ldiskfs]
Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa063fdaa>] llog_origin_handle_cancel+0x3ea/0xa20 [ptlrpc]
Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa03c4903>] ? cfs_alloc+0x63/0x90 [libcfs]
Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa04d10df>] ? keys_fill+0x6f/0x1a0 [obdclass]
Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa060da87>] ldlm_cancel_handler+0x157/0x4a0 [ptlrpc]
Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa06363c1>] ptlrpc_server_handle_request+0x3c1/0xcb0 [ptlrpc]
Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa03c44ce>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa03ceef9>] ? lc_watchdog_touch+0x79/0x110 [libcfs]
Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa0630462>] ? ptlrpc_wait_event+0xb2/0x2c0 [ptlrpc]
Jul 20 09:00:34 weisshorn02 kernel: [<ffffffff810519c3>] ? __wake_up+0x53/0x70
Jul 20 09:00:34 weisshorn02 kernel: [<ffffffffa06373cf>] ptlrpc_main+0x71f/0x1210 [ptlrpc]

It looks like LU-81, but the fix should already be in 2.2. Any comment?

Comment by nasf (Inactive) [ 25/Jul/12 ]

Yes, it is LU-81. But LU-81 did not completely fix the deadlock between changelog adding and changelog canceling; there are other potential deadlock cases caused by the race between starting a journal handle and acquiring "lgh_lock".

The original LU-81 patch enforced the following ordering:
1) start the journal handle
2) acquire "lgh_lock"

But it ignored the case where the journal handle has to be restarted. In that case the caller blocks in the journal restart while still holding "lgh_lock": the running transaction cannot commit until the other handles on it stop, yet those threads may themselves be waiting for "lgh_lock", so the two sides deadlock. I have no good way to resolve such a deadlock yet.

Jul 20 09:06:37 weisshorn02 kernel: Call Trace:
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa01432d4>] ? do_get_write_access+0x3b4/0x520 [jbd2]
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa0142072>] start_this_handle+0x282/0x500 [jbd2]
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffff81090a90>] ? autoremove_wake_function+0x0/0x40
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa01423c1>] jbd2_journal_restart+0xd1/0x130 [jbd2]
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa0ace8ba>] ldiskfs_truncate_restart_trans+0x8a/0xa0 [ldiskfs]
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa0ad3bfd>] ldiskfs_clear_blocks+0x9d/0x170 [ldiskfs]
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa0ad3de4>] ldiskfs_free_data+0x114/0x170 [ldiskfs]
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa0ad4023>] ldiskfs_free_branches+0x1e3/0x200 [ldiskfs]
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa0ad3f16>] ldiskfs_free_branches+0xd6/0x200 [ldiskfs]
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa0ad4629>] ldiskfs_truncate+0x5e9/0x660 [ldiskfs]
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa0ac3f3b>] ? __ldiskfs_handle_dirty_metadata+0x7b/0x100 [ldiskfs]
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa0acec7e>] ? ldiskfs_mark_iloc_dirty+0x36e/0x5d0 [ldiskfs]
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa0ad0b23>] ? ldiskfs_mark_inode_dirty+0x83/0x1f0 [ldiskfs]
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffff814ee34e>] ? mutex_lock+0x1e/0x50
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa0ad57c0>] ? ldiskfs_delete_inode+0x0/0x250 [ldiskfs]
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa0ad59a0>] ldiskfs_delete_inode+0x1e0/0x250 [ldiskfs]
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffff811915fe>] generic_delete_inode+0xde/0x1d0
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffff81191755>] generic_drop_inode+0x65/0x80
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffff811905d2>] iput+0x62/0x70
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa074e41e>] mds_obd_destroy+0x3ae/0x850 [mds]
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffff81177ee1>] ? __fput+0x1a1/0x210
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa049eca5>] llog_lvfs_destroy+0x545/0xbb0 [obdclass]
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa0141ed5>] ? start_this_handle+0xe5/0x500 [jbd2]
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa0496acf>] llog_cancel_rec+0x31f/0x600 [obdclass]
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa049aee9>] llog_cat_cancel_records+0x99/0x230 [obdclass]
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa063fbee>] llog_origin_handle_cancel+0x22e/0xa20 [ptlrpc]
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa03c4903>] ? cfs_alloc+0x63/0x90 [libcfs]
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa04d10df>] ? keys_fill+0x6f/0x1a0 [obdclass]
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa060da87>] ldlm_cancel_handler+0x157/0x4a0 [ptlrpc]
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa06363c1>] ptlrpc_server_handle_request+0x3c1/0xcb0 [ptlrpc]
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa03c44ce>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa03ceef9>] ? lc_watchdog_touch+0x79/0x110 [libcfs]
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa0630462>] ? ptlrpc_wait_event+0xb2/0x2c0 [ptlrpc]
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffff810519c3>] ? __wake_up+0x53/0x70
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa06373cf>] ptlrpc_main+0x71f/0x1210 [ptlrpc]
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa0636cb0>] ? ptlrpc_main+0x0/0x1210 [ptlrpc]
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa0636cb0>] ? ptlrpc_main+0x0/0x1210 [ptlrpc]
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffffa0636cb0>] ? ptlrpc_main+0x0/0x1210 [ptlrpc]
Jul 20 09:06:37 weisshorn02 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20

Comment by nasf (Inactive) [ 25/Jul/12 ]

My current idea is to increase the journal credits for llog cancel so that the journal handle does not need to be restarted within the transaction. It may not be a perfect solution, but it should be workable. This is the patch:

http://review.whamcloud.com/#change,3463
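
For anyone who wants to try the fix before it lands, a typical Gerrit checkout of that change would look roughly like this (the project path and patch-set number are assumptions; check the review page for the current patch set):

git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/63/3463/1
git cherry-pick FETCH_HEAD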

Comment by nasf (Inactive) [ 09/Aug/12 ]

The patch has landed in Lustre 2.3.

Fabio, would you have a chance to verify it on your system? Thanks!

Comment by Emoly Liu [ 05/Dec/12 ]

Port for b2_1 is at http://review.whamcloud.com/4743
