Hi Hongchao,
We've run into a problem with this patch. After upgrading the MDT, we are starting to get data corruption on the system. Here is a description of the issue:
Another admin has been working on the test file system this morning.
At first she could run normal commands on the file system. Then
normal commands (like vi) added 112 MB of rubbish to files. Even a touch
on a non-existent file created a file of 112 MB.
Another admin logged in and his .Xauthority file had grown. He looked
at the binary data and it seemed that the additional data came from a
software package which is installed on the same Lustre file system,
i.e. the rubbish does not seem to be arbitrary data but appears to come
from another location.
root@iccn999:/software/all/tsm/sbin# touch gaga1
Wed Sep 11-14:38:21 (14/1012) - ACTIVE
root@iccn999:/software/all/tsm/sbin# ls -l gaga1
-rw-r--r-- 1 root root 116430464 Sep 11 14:38 gaga1
On another client the behaviour is different:
root@iccn996:/software/all/tsm/sbin# touch gaga2
touch: setting times of `gaga2': No such file or directory
Wed Sep 11-14:39:15 (5/41)
root@iccn996:/software/all/tsm/sbin# ls -l gaga2
-rw-r--r-- 1 root root 0 Sep 11 14:39 gaga2
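To check further clients for the same symptom without typing the commands by hand, something like the following could be used. This is only a rough sketch and not part of the patch: the function name touch_and_stat and the temporary file name are made up for illustration. It mimics what touch does (create the file, set its times) and returns the size that stat reports, which should be 0 on a healthy client.

```python
import os


def touch_and_stat(directory):
    """Create a new empty file like touch(1) and return its reported size.

    The file name is hypothetical; any name that does not exist yet works.
    The file is removed again before returning.
    """
    path = os.path.join(directory, "touchcheck_%d.tmp" % os.getpid())
    # O_EXCL makes sure we really create a new file, as in the gaga1 case.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    os.close(fd)
    try:
        # touch also sets the timestamps; this is the step that failed
        # with "No such file or directory" on the second client.
        os.utime(path, None)
        return os.stat(path).st_size
    finally:
        os.unlink(path)
```

Calling touch_and_stat("/software/all/tsm/sbin") on each client and comparing the result against 0 would show which clients are affected. An OSError from the call would correspond to the behaviour seen on iccn996.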
I will upload the lctl dk logs (with vfstrace and rpctrace). Is there any other information we should get?
I tested adding the degraded flag to the oscc and it seemed to fix the problem on my test system:
I ran into one issue where the OST ran out of inodes, which caused the lov_create threads on the MDT to deadlock. I'm not sure whether that is new behavior or whether it also happens in stock 2.1.6. I'll investigate.