Hi Hongchao,
We've run into a problem with this patch. After upgrading the MDT, we are starting to get data corruption on the system. Here is a description of the issue:
Another admin has been working on the test file system this morning.
At first she could run normal commands on the file system. Then
ordinary commands (like vi) began adding 112 MB of rubbish to files. Even a touch
of a non-existent file created a 112 MB file.
Another admin logged in and his .Xauthority had grown in size. He looked
at the binary data and it seemed that the additional data came from a
software package installed on the same Lustre file system,
i.e. the rubbish does not seem to be arbitrary data but appears to come from
another location.
root@iccn999:/software/all/tsm/sbin# touch gaga1
Wed Sep 11-14:38:21 (14/1012) - ACTIVE
root@iccn999:/software/all/tsm/sbin# ls -l gaga1
-rw-r--r-- 1 root root 116430464 Sep 11 14:38 gaga1
On another client the behaviour is different:
root@iccn996:/software/all/tsm/sbin# touch gaga2
touch: setting times of `gaga2': No such file or directory
Wed Sep 11-14:39:15 (5/41)
root@iccn996:/software/all/tsm/sbin# ls -l gaga2
-rw-r--r-- 1 root root 0 Sep 11 14:39 gaga2
I will upload the lctl dk logs (with vfstrace and rpctrace). Is there any other information we should get?
The fix that worked on my test system didn't work on the customer system, so it might have been a fluke.
I haven't seen that message in the new logs. Here are the latest server dk logs:
00000100:00100000:14.0:1381243911.199943:0:4598:0:(import.c:725:ptlrpc_connect_import()) @@@ (re)connect request (timeout 5) req@ffff8808181fb400 x1448249214211642/t0( 0) o8->pfscdat2-OST0000-osc-MDT0000@172.26.17.3@o2ib:28/4 lens 368/512 e 0 to 0 dl 0 ref 1 fl New:N/0/ffffffff rc 0/-1
00000100:00100000:13.0:1381243911.212619:0:4596:0:(client.c:1773:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:pfscdat2-MDT0000-mdtlov_UUID:-1: 1448249214211637:172.26.17.3@o2ib:-1
00020000:00020000:13.0:1381243911.212624:0:4596:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x10f5a sub-object on OST idx 0/1: rc = -11
00000100:00100000:13.0:1381243911.225308:0:4596:0:(client.c:1773:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:pfscdat2-MDT0000-mdtlov_UUID:-1: 1448249214211638:172.26.17.3@o2ib:-1
00020000:00020000:13.0:1381243911.225312:0:4596:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x18c7 sub-object on OST idx 0/1: rc = -11
00000100:00100000:13.0:1381243911.250579:0:4596:0:(client.c:1773:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:pfscdat2-MDT0000-mdtlov_UUID:-1: 1448249214211639:172.26.17.3@o2ib:-1
00020000:00020000:6.0:1381243911.250579:0:29944:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x10f5a sub-object on OST idx 0/1: rc = -5
00000100:00100000:13.0:1381243911.250586:0:4596:0:(client.c:1434:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:e76ce10d-1d27-cdc4-1091-649094c70331:4596:1448249214211641:172.26.17.1@o2ib:400
00020000:00020000:5.0:1381243911.250612:0:4755:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x18c7 sub-object on OST idx 0/1: rc = -5
00000100:00100000:13.0:1381243911.250638:0:4596:0:(client.c:1773:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:pfscdat2-MDT0000-mdtlov_UUID:0:1448249214211644:172.26.17.3@o2ib:13
00000100:00100000:5.0:1381243911.250741:0:4755:0:(service.c:1771:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt_02:94b08aa2-54d7-b32e-019d-2561ed3286b5+6:58211:x1447047461344560:12345-172.26.4.3@o2ib:101 Request procesed in 49643595us (49643620us total) trans 0 rc 301/301
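In case it helps with triage, here is a rough Python sketch for pulling the failing entries out of a dk dump like the one above. It assumes the dump has one entry per line and that each entry starts with colon-separated header fields followed by "(file:line:function())" and the message, as in the lines pasted here. The helper name failed_entries is made up for illustration, and the errno table only covers the two Linux values seen in this log (5 = EIO, 11 = EAGAIN).

```python
import re

# Linux errno names for the two return codes that appear in the log above.
ERRNO_NAMES = {5: "EIO", 11: "EAGAIN"}

# Assumed entry layout, inferred from the pasted lines:
# subsys:mask:cpu:timestamp:n:pid:n:(file:line:function()) message
LINE_RE = re.compile(
    r"^(?P<subsys>[0-9a-f]{8}):(?P<mask>[0-9a-f]{8}):(?P<cpu>[\d.]+):"
    r"(?P<time>\d+\.\d+):\d+:(?P<pid>\d+):\d+:"
    r"\((?P<file>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s*(?P<msg>.*)$"
)
RC_RE = re.compile(r"rc = (-?\d+)")


def failed_entries(lines):
    """Return (timestamp, pid, function, errno name) for entries with 'rc = <negative>'."""
    results = []
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if match is None:
            continue  # not a dk entry in the assumed layout
        rc = RC_RE.search(match.group("msg"))
        if rc is None or int(rc.group(1)) >= 0:
            continue  # no explicit negative return code in the message
        code = -int(rc.group(1))
        results.append((
            float(match.group("time")),
            int(match.group("pid")),
            match.group("func"),
            ERRNO_NAMES.get(code, str(code)),
        ))
    return results
```

Run against the entries above, this would list the lov_update_create_set() failures together with their pid, which makes it easier to see that the same fids fail first with -11 and then with -5.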