[LU-5585] MDS became unresponsive, clients hanging until MDS fail over Created: 04/Sep/14  Updated: 12/Aug/22  Resolved: 12/Aug/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.2
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Frederik Ferner (Inactive) Assignee: Zhenyu Xu
Resolution: Cannot Reproduce Votes: 0
Labels: None

Attachments: Text File cs04r-sc-mds03-01-logs-20140911.txt     File cs04r-sc-mds03-01-lustre-dk_after_umount.xz     Text File cs04r-sc-mds03-01-meminfo-20140909-1705.txt     HTML File cs04r-sc-mds03-01-meminfo-20141002-1841     HTML File cs04r-sc-mds03-01-meminfo-20141003-1412     PNG File cs04r-sc-mds03-01-memory.png     File cs04r-sc-mds03-01-messages.txt.xz     Text File cs04r-sc-mds03-01-slabinfo-20140909-1705.txt     HTML File cs04r-sc-mds03-01-slabinfo-20141002-1841     HTML File cs04r-sc-mds03-01-slabinfo-20141003-1412     File lustre-log.1410414046.22423.xz     File manual_dump.txt.xz    
Issue Links:
Related
is related to LU-5583 clients receive IO error after MDT fa... Open
is related to LU-5595 replay-single test_38: MDS 1 OOM Resolved
Severity: 3
Rank (Obsolete): 15580

 Description   

This morning some of our clients were hanging (others had not been checked at that time), the active MDS was unresponsive and flooding the console with stack traces. We had to fail over to the second MDS to get the file system back.

Looking at the system logs, we see a large number of these messages:
kernel: socknal_sd00_02: page allocation failure. order:2, mode:0x20
each followed by a stack trace; the full log is attached. Our monitoring shows that the memory was mainly used by buffers, but that had been the case for all of last week already and was stable, only slowly increasing. After the restart the memory used by buffers quickly increased to about 60% and currently seems to be stable around there...

Just before these page allocation failure messages we noticed a few client reconnect messages, but have not been able to find any network problems so far. Since the restart of the MDT, no unexpected client reconnects have been seen.

We are running Lustre 2.5.2 plus 4 patches, as recommended in LU-5529 and LU-5514.

We've been hammering the MDS a bit since the upgrade: creating files, stat'ing many files/directories from many clients, removing many files, etc. Still, I would expect the MDS not to fall over like this.

Is this a problem/memory leak in Lustre or something else? Could it be related to different compile options when compiling Lustre? We did compile the version on the MDS in house with these patches, and there is always a chance we didn't use quite the same compile-time options that the automatic build process would use...

What can we do to debug this further and avoid it in the future?



 Comments   
Comment by Peter Jones [ 05/Sep/14 ]

Bobijam

This ticket is perhaps related to the other one just assigned to you. Could you please advise?

Thanks

Peter

Comment by Andreas Dilger [ 05/Sep/14 ]

If you run into this case again, please try to log into the MDS and collect /proc/slabinfo and /proc/meminfo to see where all of the memory is allocated.

Ideally, you could also enable allocation debugging (lctl set_param debug=+malloc), increase the maximum debug log size (lctl set_param debug_mb=200), then unmount the MDS to see where it is freeing memory, and dump the debug log. Unfortunately this may not capture as much debug logging as one might want if the MDS doesn't have enough memory left to store the log itself.
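
Roughly, that sequence could look like the following sketch (the mount point /mnt/mdt and the output file names are placeholders, not taken from this system):

# on the MDS, ahead of a planned unmount or while the problem is building up
lctl set_param debug=+malloc        # add memory-allocation tracing to the debug mask
lctl set_param debug_mb=200         # enlarge the in-memory debug buffer

# capture the memory state while the MDT is still mounted
cat /proc/meminfo  > /tmp/meminfo-$(date +%Y%m%d-%H%M)
cat /proc/slabinfo > /tmp/slabinfo-$(date +%Y%m%d-%H%M)

# unmount the MDT, then dump the debug log showing where memory was freed
umount /mnt/mdt
lctl dk > /tmp/lustre-dk-after-umount.txt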

Comment by Frederik Ferner (Inactive) [ 05/Sep/14 ]

In this case we were not able to log into the MDS anymore once we noticed the problem, so we couldn't collect these. Equally, the serial console was unusable due to the large number of stack traces printed there.

Would it be worth setting the additional debugging and debug log size now and collecting the information in /proc/slabinfo and /proc/meminfo just before unmounting during a maintenance window, i.e. before it re-occurs? Then unmount and collect the debug log?

Comment by Andreas Dilger [ 05/Sep/14 ]

Frederik, it would definitely be useful to see what is in /proc/slabinfo and /proc/meminfo when the MDS is running low on memory. It may be best to just dump this periodically to another system so that it is captured as close to running out of memory as possible if you don't notice this in advance.
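
A minimal sketch of such periodic collection, run on the MDS (the host monitoring-host, the destination path and the 5-minute interval are all placeholders; cron or an existing monitoring agent would do just as well):

while true; do
    ts=$(date +%Y%m%d-%H%M)
    cat /proc/meminfo  > /tmp/meminfo-$ts
    cat /proc/slabinfo > /tmp/slabinfo-$ts
    # copy off-host so the data survives even if the MDS locks up
    scp -q /tmp/meminfo-$ts /tmp/slabinfo-$ts monitoring-host:/var/log/mds-memory/ || true
    sleep 300
done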

Comment by Frederik Ferner (Inactive) [ 10/Sep/14 ]

A quick update. Monitoring the memory usage on the MDS over the last week, we've not seen this issue again. See the attached memory usage graph: the original issue happened early on Thursday 3; before that, buffer memory usage seemed to only go up, while since then the memory/buffer usage has also decreased frequently.

Even though the memory usage wasn't that bad, I took the opportunity of a scheduled maintenance yesterday to collect a debug log just after unmounting the MDT, with malloc added to the debug mask and debug_mb increased. I've also collected meminfo/slabinfo just before unmounting the MDT; these are attached as well in case there is anything useful in there.

Comment by Frederik Ferner (Inactive) [ 11/Sep/14 ]

Andreas, All,

the original issue, where all/some clients appear to hang on most metadata operations, is back. This time the memory on the MDS doesn't look bad, so the memory pressure might have been something else.

Now that we're taking more time to debug it, the symptom appears to be that many clients hang, for example when doing 'ls -l' on some directories (not the top-level directory of the file system). The ls will eventually complete, but it takes long enough for users to phone us, have a detailed conversation about what they're doing and where, and for us to look into the machine, and it still hasn't completed... This happens even for directories with only 2 subdirectories and no files.

time ls -l /mnt/lustre03/i04
total 8
drwxrwxr-x+ 11 root       dls_sysadmin 4096 Aug 14 10:52 data
drwxrwsr-x+ 10 epics_user i04_data     4096 Aug  2  2011 epics

real	4m48.799s
user	0m0.001s
sys	0m0.001s

The MDS is reporting a number of these messages: Lustre: lock timed out (enqueued at 1410413746, 300s ago), along with a few threads completing after 200+s. System load is currently around 70; in top there are three processes at the top, usually in state 'D': flush-253:6, kcopyd, jbd2/dm-6-8 (see top output below).

It all seems to have started sometime last night with this: Sep 11 02:26:13 cs04r-sc-mds03-01 kernel: INFO: task mdt_rdpg00_000:12042 blocked for more than 120 seconds. There is nothing in the logs after that until 04:00, but it seems to have gotten worse after that.

I can't rule out hardware issues on the disk backend but so far have not found any error messages that confirm that.

Syslog from the MDS for the relevant time will be attached; here's an extract. There's nothing Lustre- or network-related in syslog on the clients or the OSSes.

Sep 11 04:00:10 cs04r-sc-mds03-01 kernel: LNet: Service thread pid 22232 completed after 217.24s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Sep 11 04:00:10 cs04r-sc-mds03-01 kernel: LNet: Skipped 7 previous similar messages
Sep 11 04:47:43 cs04r-sc-mds03-01 kernel: LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 10.144.140.46@o2ib  ns: mdt-lustre03-MDT0000_UUID lock: ffff880d7abef3c0/0x4a9a61dbe320f47a lrc: 3/0,0 mode: PR/PR res: [0x4a40692:0xb304ffb3:0x0].0 bits 0x13 rrc: 3 type: IBT flags: 0x60200000000020 nid: 10.144.140.46@o2ib remote: 0xc6d2a2809bd5a9f1 expref: 84365 pid: 20014 timeout: 4398798140 lvb_type: 0
Sep 11 04:47:43 cs04r-sc-mds03-01 kernel: LustreError: 28875:0:(client.c:1079:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff8809f7420400 x1478844569897188/t0(0) o104->lustre03-MDT0000@10.144.140.46@o2ib:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:N/0/ffffffff rc 0/-1
Sep 11 04:47:43 cs04r-sc-mds03-01 kernel: LustreError: 28875:0:(ldlm_lockd.c:662:ldlm_handle_ast_error()) ### client (nid 10.144.140.46@o2ib) returned 0 from blocking AST ns: mdt-lustre03-MDT0000_UUID lock: ffff880168665880/0x4a9a61dbe320f9dd lrc: 1/0,0 mode: --/CR res: [0x4a40695:0xb304ffb6:0x0].0 bits 0x5 rrc: 2 type: IBT flags: 0x64a01000000020 nid: 10.144.140.46@o2ib remote: 0xc6d2a2809bd5aa06 expref: 60513 pid: 12032 timeout: 4398949080 lvb_type: 0
Sep 11 04:49:10 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000: Client b4d423ad-3219-f806-0fd2-5a2845b5faad (at 10.144.140.46@o2ib) reconnecting
Sep 11 04:49:10 cs04r-sc-mds03-01 kernel: Lustre: Skipped 67 previous similar messages
Sep 11 05:08:41 cs04r-sc-mds03-01 kernel: LNet: Service thread pid 12048 completed after 321.32s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Sep 11 05:08:41 cs04r-sc-mds03-01 kernel: LNet: Skipped 22 previous similar messages
Sep 11 06:09:12 cs04r-sc-mds03-01 kernel: LNet: Service thread pid 12036 completed after 329.22s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Sep 11 06:09:12 cs04r-sc-mds03-01 kernel: LNet: Skipped 1 previous similar message
Sep 11 06:40:46 cs04r-sc-mds03-01 kernel: Lustre: lock timed out (enqueued at 1410413746, 300s ago)

We don't know what is triggering this. At the moment we're still running a few jobs that scan the file systems, copy data away and delete files, in addition to our normal user jobs, so I would expect the MDT to be a bit busier than usual, but not this bad.

Comment by Frederik Ferner (Inactive) [ 11/Sep/14 ]

MDT log from the time after failing over the MDT to this machine until now.

Comment by Frederik Ferner (Inactive) [ 11/Sep/14 ]

Two dump files collected while I suspect the problem was ongoing; manual_dump.txt was collected while I was experiencing the problem.

Comment by Frederik Ferner (Inactive) [ 11/Sep/14 ]

Ah, I think I may now have fixed this immediate problem.

We created an LVM snapshot just before extending the file system earlier this week and had kept it around, intending to keep it a little longer while we performed tests on the extended file system. However, while searching for information on the kcopyd process, I came across a post to dm-devel about the performance impact of kcopyd. Even though the post was from May 2007, we decided to remove the snapshot. The load average immediately started to drop and is now down to around 8, and client metadata performance has also recovered, nicely visible as a jump in file open rates on the MDT at the time the snapshot was removed...
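
For anyone following along, the check and cleanup were essentially the following (the volume group and snapshot names are placeholders, not our actual ones):

# list logical volumes; snapshots show their source volume in the Origin column
lvs

# remove the snapshot once it is no longer needed
lvremove /dev/<vg_name>/<snapshot_name>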

I guess the lesson is that snapshots can still have a very large performance impact.

Comment by Frederik Ferner (Inactive) [ 02/Oct/14 ]

This might have come back; this time there is no LVM snapshot involved.

Over the last few days buffer memory usage has steadily gone up again without going down as it used to; currently about 88% of the memory appears to be buffers, and it has been like this for over a day now. So far clients don't appear to be affected.

Most recent /proc/meminfo and /proc/slabinfo are attached. Below are the first few lines of slabtop output as well.

[bnh65367@cs04r-sc-mds03-01 ~]$ slabtop -o | head -20
 Active / Total Objects (% used)    : 40400048 / 44245203 (91.3%)
 Active / Total Slabs (% used)      : 2464035 / 2464417 (100.0%)
 Active / Total Caches (% used)     : 137 / 253 (54.2%)
 Active / Total Size (% used)       : 8144191.17K / 9226241.88K (88.3%)
 Minimum / Average / Maximum Object : 0.02K / 0.21K / 4096.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
29618648 28952513  97%    0.10K 800504       37   3202016K buffer_head            
6944364 5580655  80%    0.50K 992052        7   3968208K ldlm_locks             
3200268 2900338  90%    0.31K 266689       12   1066756K ldlm_resources         
1036350 444641  42%    0.12K  34545       30    138180K size-128               
904113 851578  94%    0.55K 129159        7    516636K radix_tree_node        
574240 331937  57%    0.19K  28712       20    114848K size-192               
360192 166594  46%    0.08K   7504       48     30016K mdd_obj                
311202 166140  53%    0.11K   9153       34     36612K lod_obj                
282840 165746  58%    0.25K  18856       15     75424K mdt_obj                
260792 257126  98%    0.50K  32599        8    130396K size-512               
202257 190027  93%    1.02K  67419        3    269676K ldiskfs_inode_cache    
175761  83015  47%    0.06K   2979       59     11916K size-64                
 78288  62865  80%    0.03K    699      112      2796K size-32                
[bnh65367@cs04r-sc-mds03-01 ~]$ 
Comment by Andreas Dilger [ 03/Oct/14 ]

What is a bit odd here is that there are 5.5M in-use ldlm_locks on 2.9M ldlm_resources, yet there are only 190K inodes in memory (166K objects). This implies there is something kind of strange happening in the DLM, since there should only be a single resource per MDT object. There should be at least one ldlm_resource for each ldlm_lock, though having more locks than resources is OK as multiple clients may lock the same resource, or a single client may lock different parts of the same resource.

One experiment you might do is to run lctl get_param ldlm.namespaces.*MDT*.lru_size on the clients to get the count of locks held by all of them, and then lctl set_param ldlm.namespaces.*MDT*.lru_size=clear on the clients to drop all their DLM locks. The set_param will cancel the corresponding locks on the server and flush the client metadata cache as a result, which may have a short-term negative impact on metadata performance, in case that matters for you.

The cancellation of locks on the clients should result in all of the ldlm_locks structures being freed on the MDS (or at least the sum of the locks on the clients should match the number of ACTIVE ldlm_locks allocated on the MDS). If that isn't the case, it seems we have some kind of leak in the DLM.
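
A sketch of that experiment, assuming pdsh access to the clients and that the wildcarded parameter name matches the MDT namespace on each client (the node list is a placeholder):

# sum the lock counts reported by all clients for the MDT namespace
pdsh -w client[001-300] "lctl get_param -n ldlm.namespaces.*MDT*.lru_size" 2>/dev/null \
    | awk '{sum += $2} END {print "total client locks:", sum}'

# drop all client DLM locks for that namespace; the server should then cancel
# the corresponding ldlm_locks and the slab usage on the MDS should shrink
pdsh -w client[001-300] "lctl set_param ldlm.namespaces.*MDT*.lru_size=clear"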

Comment by Frederik Ferner (Inactive) [ 03/Oct/14 ]

I've done that; unfortunately it didn't seem to free up much memory.

During the initial sweep of lctl get_param ldlm.namespaces.*MDT*.lru_size for this file system, adding up the numbers for all reachable clients (a few are currently unresponsive and are being looked at; we assume that is unrelated), we seem to have about 5.3M locks on the clients (corresponding to the most recent slabinfo snapshot showing 5.4M ldlm_locks).

After the lru_size=clear, both numbers dropped; now, about 20 minutes later, they are back at about 1.5M each.

Fresh meminfo/slabinfo about 20 minutes after clearing the locks are attached.

Comment by Frederik Ferner (Inactive) [ 03/Oct/14 ]

Unfortunately this started to severely affect file system performance, so we had to fail over. I was nearly in time to do a clean unmount, but not quite: by the time I had started typing the umount command, the MDS had frozen completely and I was not able to collect any debug log.

Since this is now a repeating feature of this file system, any idea how we could prevent it from recurring would be much appreciated. If there is anything we can do to help debug this, let us know; we'll do what we can.

Frederik

Comment by Oleg Drokin [ 07/Oct/14 ]

Do you have many clients on this system?

It's been a known problem in the past that if you let client LRUs grow uncontrollably, servers become somewhat memory starved.

One possible workaround is to set lru_size on the clients to something conservative like 100 or 200. Also, if you have mostly non-intersecting jobs on the clients that don't reuse the same files between different jobs, some sites forcefully drop the lock LRUs (and other caches) in between job runs.
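
A sketch of both suggestions, again assuming pdsh and root access on the clients (the node list, the value 200 and the use of drop_caches are illustrative only):

# cap the DLM lock LRU on every client at a conservative value
pdsh -w client[001-300] "lctl set_param ldlm.namespaces.*.lru_size=200"

# or, between job runs, drop the lock LRUs and the page/dentry/inode caches
pdsh -w client[001-300] "lctl set_param ldlm.namespaces.*.lru_size=clear; echo 3 > /proc/sys/vm/drop_caches"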

Comment by Peter Jones [ 12/Oct/14 ]

Bobijam

Could this be related to the issue reported in LU-5727?

Peter

Comment by Frederik Ferner (Inactive) [ 13/Oct/14 ]

Depends on your view; we've got just under 300 clients on this file system.

We'll try limiting the lru_size and will continue to monitor. Looking at LU-5727, I'm not sure how much that will give us.

Considering that we have been cleaning the file system, it is also entirely possible that we hit something similar to LU-5726, i.e. we almost certainly have run 'rm -rf' or similar in parallel on multiple clients. I will try to reproduce this tomorrow.
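
The reproduction attempt will be roughly along these lines (a sketch only; the scratch path and node list are placeholders):

# populate a throw-away directory tree with many files from one client, then
# delete it from several clients in parallel so the unlinks overlap
pdsh -w client[001-004] 'rm -rf /mnt/lustre03/scratch/cleanup-test'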
