[LU-4498] MDT thread hung, ls fails on directory Created: 16/Jan/14  Updated: 21/Mar/14  Resolved: 21/Mar/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.6
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Kit Westneat (Inactive) Assignee: Zhenyu Xu
Resolution: Duplicate Votes: 0
Labels: None

Attachments: File 2014-01-09-run4-client.out     File bt.tgz    
Severity: 3
Rank (Obsolete): 12303

 Description   

IU is running into an issue where running ls on a certain file causes clients to get evicted. It appears as if there is a hung MDT thread holding a lock on the file. After the MDT is rebooted, listing the directory and file works fine.

We were able to capture client debug logs and a backtrace of all the threads from the running system, but due to an issue with STONITH, we were unable to get a good vmcore from the system. Also, when we tried to get debug logs from the MDT, the log overflowed, even with a 20GB buffer.

We are currently waiting for the issue to reappear and will get debug logs on a quiesced system, as well as a good vmcore.

I'll upload the logs we have. Is there anything else we should be looking to get?



 Comments   
Comment by Kit Westneat (Inactive) [ 16/Jan/14 ]

Here's an example of the directory listing. I forgot to mention it takes hours for the MDT to actually evict the client.

ls -al /N/dc2/scratch/kmoriya/bggen_8.4_9.0-2013-12-11/0000/000100

Begin 2013-12-20_13:37:32
total 164
drwxr-xr-x 34 1012412 401 4096 Dec 11 20:59 .
drwxr-xr-x 202 1012412 401 36864 Dec 16 09:32 ..
drwxr-xr-x 2 1012412 401 4096 Dec 12 13:23 00003200
drwxr-xr-x 2 1012412 401 4096 Dec 12 13:24 00003201
drwxr-xr-x 2 1012412 401 4096 Dec 12 13:24 00003202
drwxr-xr-x 2 1012412 401 4096 Dec 12 13:24 00003203
drwxr-xr-x 2 1012412 401 4096 Dec 12 13:24 00003204
drwxr-xr-x 2 1012412 401 4096 Dec 12 13:24 00003205
drwxr-xr-x 2 1012412 401 4096 Dec 12 13:24 00003206
drwxr-xr-x 2 1012412 401 4096 Dec 12 13:24 00003207
drwxr-xr-x 2 1012412 401 4096 Dec 12 13:24 00003208
?--------- ? ? ? ? ? 00003209
drwxr-xr-x 2 1012412 401 4096 Dec 12 13:24 00003210
drwxr-xr-x 2 1012412 401 4096 Dec 12 13:24 00003211
drwxr-xr-x 2 1012412 401 4096 Dec 12 13:24 00003212
drwxr-xr-x 2 1012412 401 4096 Dec 12 13:24 00003213
drwxr-xr-x 2 1012412 401 4096 Dec 12 13:24 00003214
drwxr-xr-x 2 1012412 401 4096 Dec 12 13:24 00003215
drwxr-xr-x 2 1012412 401 4096 Dec 12 13:24 00003216
drwxr-xr-x 2 1012412 401 4096 Dec 12 13:24 00003217
drwxr-xr-x 2 1012412 401 4096 Dec 12 13:24 00003218
drwxr-xr-x 2 1012412 401 4096 Dec 12 13:24 00003219
drwxr-xr-x 2 1012412 401 4096 Dec 12 13:24 00003220
drwxr-xr-x 2 1012412 401 4096 Dec 12 13:24 00003221
drwxr-xr-x 2 1012412 401 4096 Dec 12 13:24 00003222
drwxr-xr-x 2 1012412 401 4096 Dec 12 13:24 00003223
drwxr-xr-x 2 1012412 401 4096 Dec 12 13:24 00003224
drwxr-xr-x 2 1012412 401 4096 Dec 12 13:24 00003225
drwxr-xr-x 2 1012412 401 4096 Dec 12 13:24 00003226
drwxr-xr-x 2 1012412 401 4096 Dec 12 13:24 00003227
drwxr-xr-x 2 1012412 401 4096 Dec 12 13:24 00003228
drwxr-xr-x 2 1012412 401 4096 Dec 12 13:24 00003229
drwxr-xr-x 2 1012412 401 4096 Dec 12 13:24 00003230
drwxr-xr-x 2 1012412 401 4096 Dec 12 07:42 00003231
End 2013-12-20_15:44:49
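
For triage, a listing like the one above can be scanned mechanically. The following is a minimal Python sketch, not part of the original report. It assumes the Begin/End markers use the timestamp format shown above and that an entry whose attributes could not be fetched starts with `?`, as ls prints it here. The function name `scan_listing` is hypothetical.

```python
from datetime import datetime

STAMP = "%Y-%m-%d_%H:%M:%S"  # assumed format of the Begin/End markers

def scan_listing(text):
    """Return (elapsed_seconds, names of entries whose attributes are missing)."""
    begin = end = None
    failed = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Begin" and len(fields) == 2:
            begin = datetime.strptime(fields[1], STAMP)
        elif fields[0] == "End" and len(fields) == 2:
            end = datetime.strptime(fields[1], STAMP)
        elif fields[0].startswith("?"):
            # ls prints '?' placeholders when it cannot stat an entry
            failed.append(fields[-1])
    elapsed = (end - begin).total_seconds() if begin and end else None
    return elapsed, failed

# Excerpt of the listing above
sample = """Begin 2013-12-20_13:37:32
drwxr-xr-x 2 1012412 401 4096 Dec 12 13:24 00003208
?--------- ? ? ? ? ? 00003209
drwxr-xr-x 2 1012412 401 4096 Dec 12 13:24 00003210
End 2013-12-20_15:44:49"""

elapsed, failed = scan_listing(sample)
print(elapsed, failed)
```

On this excerpt the sketch flags 00003209 and reports 7637 seconds (2 h 7 min 17 s) between the Begin and End markers.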

Comment by Cliff White (Inactive) [ 16/Jan/14 ]

Is there anything else you can tell us about the 'certain file'?
Is it always the same file, or type of file?
Have you run fsck on the MDT lately?

Comment by Kit Westneat (Inactive) [ 17/Jan/14 ]

It's happened three times on seemingly unrelated files. It looks like the last e2fsck was on June 12. FWIW, there don't seem to be any ldiskfs errors in the logs recently.

Comment by Peter Jones [ 17/Jan/14 ]

Bobijam

Could you please help with this one?

Thanks

Peter

Comment by Zhenyu Xu [ 20/Jan/14 ]

Fanyong,

Does it look like a dir hash collision issue?

Comment by nasf (Inactive) [ 20/Jan/14 ]

According to the client-side log, every getattr RPC is for a different file, so we cannot say that the "ls" ran into a hash collision. On the other hand, the MDT side shows some "open-create" operations during the "ls" processing. What are they for?

Comment by Zhenyu Xu [ 20/Jan/14 ]

The client log shows that the ls ran from 2014/01/09 23:46:11 to 2014/01/09 23:46:12, which took 1 second and does not match the report in the first comment (from 2013-12-20_13:37:32 to 2013-12-20_15:44:49). Do you have logs that cover the time span of the issue?
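
The mismatch can be checked with a simple interval-overlap test. This is a minimal Python sketch under stated assumptions: the helper names `parse` and `windows_overlap` are hypothetical, and the timestamps are the ones quoted in this ticket, each parsed in the format in which it was written.

```python
from datetime import datetime

def parse(stamp, fmt):
    """Parse a timestamp string using the given strptime format."""
    return datetime.strptime(stamp, fmt)

def windows_overlap(a_start, a_end, b_start, b_end):
    """True if the closed intervals [a_start, a_end] and [b_start, b_end] intersect."""
    return a_start <= b_end and b_start <= a_end

# Window covered by the uploaded client log
log_start = parse("2014/01/09 23:46:11", "%Y/%m/%d %H:%M:%S")
log_end = parse("2014/01/09 23:46:12", "%Y/%m/%d %H:%M:%S")

# Window of the slow ls reported in the first comment
hang_start = parse("2013-12-20_13:37:32", "%Y-%m-%d_%H:%M:%S")
hang_end = parse("2013-12-20_15:44:49", "%Y-%m-%d_%H:%M:%S")

covers = windows_overlap(log_start, log_end, hang_start, hang_end)
print(covers)
```

For these two windows the check is false: the log was captured about three weeks after the reported listing.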

Comment by Zhenyu Xu [ 21/Jan/14 ]

Would you mind trying this patch? http://review.whamcloud.com/8936 is a backport of the dir hash collision fix.

Comment by Peter Jones [ 20/Mar/14 ]

Bobijam

Could this be related to LU-4616, the root cause of which has now been established?

Peter

Comment by Zhenyu Xu [ 21/Mar/14 ]

Yes, I think it's related to LU-4616.

Comment by Peter Jones [ 21/Mar/14 ]

OK, so let's mark it as a duplicate of LU-4616, and reopen it or open a new ticket if it manifests itself again now that LU-4616 seems to have been addressed.
