[LU-653] recovery-mds-scale (FLAVOR=OSS): dbench write failed on handle 11108 (Cannot send after transport endpoint shutdown) Created: 31/Aug/11  Updated: 16/Sep/11  Resolved: 16/Sep/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0
Fix Version/s: Lustre 2.1.0

Type: Bug Priority: Blocker
Reporter: Jian Yu Assignee: Oleg Drokin
Resolution: Fixed Votes: 0
Labels: None
Environment:

Lustre Tag: v2_1_0_0_RC1
Lustre Build: http://newbuild.whamcloud.com/job/lustre-master/274/
e2fsprogs Build: http://newbuild.whamcloud.com/job/e2fsprogs-master/42/
Distro/Arch: RHEL6/x86_64(in-kernel OFED, kernel version: 2.6.32-131.6.1.el6.x86_64)
ENABLE_QUOTA=yes
FAILURE_MODE=HARD
FLAVOR=OSS

MGS/MDS Nodes: fat-amd-1-ib

OSS Nodes: fat-amd-3-ib(active), fat-amd-4-ib(active)
\ /
OST1 (active in fat-amd-3-ib)
OST2 (active in fat-amd-4-ib)
OST3 (active in fat-amd-3-ib)
OST4 (active in fat-amd-4-ib)
OST5 (active in fat-amd-3-ib)
OST6 (active in fat-amd-4-ib)
fat-amd-2-ib(OST7)

Client Nodes: client-[1,2,4,5,12,13,15],fat-intel-4

Network Addresses:
fat-amd-1-ib: 192.168.4.132
fat-amd-2-ib: 192.168.4.133
fat-amd-3-ib: 192.168.4.134
fat-amd-4-ib: 192.168.4.135
client-1-ib: 192.168.4.1
client-2-ib: 192.168.4.2
client-4-ib: 192.168.4.4
client-5-ib: 192.168.4.5
client-12-ib: 192.168.4.12
client-13-ib: 192.168.4.13
client-15-ib: 192.168.4.15
fat-intel-4-ib: 192.168.4.131


Attachments: recovery-mds-scale.1314780314.log.tar.bz2, recovery-oss-scale.1314962198.log.tar.bz2, recovery-oss-scale.1315388587.log.tar.bz2, recovery-oss-scale.1315558735.log.tar.bz2, recovery-oss-scale.1315560362.log.tar.bz2
Severity: 3
Rank (Obsolete): 4903

 Description   

While running recovery-mds-scale with FLAVOR=OSS, it failed as follows:

<~snip~>
Client load failed during failover. Exiting
Found the END_RUN_FILE file: /home/yujian/test_logs/end_run_file
client-15-ib
Client load failed on node client-15-ib

client client-15-ib load stdout and debug files :
              /tmp/recovery-mds-scale.log_run_dbench.sh-client-15-ib
              /tmp/recovery-mds-scale.log_run_dbench.sh-client-15-ib.debug
2011-08-31 01:45:08 Terminating clients loads ...
Duration:                43200
Server failover period: 600 seconds
Exited after:           0 seconds
Number of failovers before exit:
mds1: 0 times
ost1: 0 times
ost2: 0 times
ost3: 0 times
ost4: 0 times
ost5: 0 times
ost6: 0 times
ost7: 0 times
Status: FAIL: rc=1

/tmp/recovery-mds-scale.log_run_dbench.sh-client-15-ib:

<~snip~>
   2      5122     0.00 MB/sec  execute  40 sec  latency 144211.791 ms
   2      5122     0.00 MB/sec  execute  41 sec  latency 145211.965 ms
   2      5122     0.00 MB/sec  execute  42 sec  latency 146212.143 ms
[5748] write failed on handle 11108 (Cannot send after transport endpoint shutdown)
Child failed with status 1

/tmp/recovery-mds-scale.log_run_dbench.sh-client-15-ib.debug:

2011-08-31 01:40:59: dbench run starting
+ mkdir -p /mnt/lustre/d0.dbench-client-15-ib
+ load_pid=3602
+ wait 3602
+ rundbench -D /mnt/lustre/d0.dbench-client-15-ib 2
touch: missing file operand
Try `touch --help' for more information.
+ '[' 1 -eq 0 ']'
++ date '+%F %H:%M:%S'
+ echoerr '2011-08-31 01:43:43: dbench failed'
+ echo '2011-08-31 01:43:43: dbench failed'
2011-08-31 01:43:43: dbench failed

Syslog on client node client-15-ib showed that:

Aug 31 01:42:51 client-15 kernel: Lustre: 2534:0:(client.c:2530:ptlrpc_replay_interpret()) @@@ Version mismatch during replay
Aug 31 01:42:51 client-15 kernel:  req@ffff880308397400 x1378646783695240/t908(908) o-1->lustre-OST0000_UUID@192.168.4.135@o2ib:6/4 lens 512/400 e 1 to 0 dl 1314780217 ref 2 fl Interpret:R/ffffffff/ffffffff rc -75/-1
Aug 31 01:42:55 client-15 kernel: Lustre: 2534:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1378646783702455 sent from lustre-OST0002-osc-ffff880326e33c00 to NID 192.168.4.134@o2ib has failed due to network error: [sent 1314780175] [real_sent 1314780175] [current 1314780175] [deadline 26s] [delay -26s]  req@ffff8803118e7000 x1378646783702455/t0(0) o-1->lustre-OST0002_UUID@192.168.4.134@o2ib:28/4 lens 368/512 e 0 to 1 dl 1314780201 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Aug 31 01:42:55 client-15 kernel: Lustre: 2534:0:(client.c:1778:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
Aug 31 01:43:10 client-15 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.4.135@o2ib. The ost_connect operation failed with -19
Aug 31 01:43:10 client-15 kernel: LustreError: Skipped 6 previous similar messages
Aug 31 01:43:15 client-15 kernel: Lustre: 2535:0:(import.c:526:import_select_connection()) lustre-OST0002-osc-ffff880326e33c00: tried all connections, increasing latency to 21s
Aug 31 01:43:15 client-15 kernel: Lustre: 2535:0:(import.c:526:import_select_connection()) Skipped 6 previous similar messages
Aug 31 01:43:32 client-15 kernel: Lustre: lustre-OST0002-osc-ffff880326e33c00: Connection restored to service lustre-OST0002 using nid 192.168.4.135@o2ib.
Aug 31 01:43:36 client-15 kernel: INFO: task flush-lustre-1:2690 blocked for more than 120 seconds.
Aug 31 01:43:36 client-15 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 31 01:43:36 client-15 kernel: flush-lustre- D 0000000000000002     0  2690      2 0x00000080
Aug 31 01:43:36 client-15 kernel: ffff88030e35d9a0 0000000000000046 0000000000000000 ffffffffa0303424
Aug 31 01:43:36 client-15 kernel: 0000000000000000 ffff88031f3fdc00 ffff88030e35d930 00000001002932fc
Aug 31 01:43:36 client-15 kernel: ffff880325e47078 ffff88030e35dfd8 000000000000f598 ffff880325e47078
Aug 31 01:43:36 client-15 kernel: Call Trace:
Aug 31 01:43:36 client-15 kernel: [<ffffffffa0303424>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
Aug 31 01:43:36 client-15 kernel: [<ffffffff8110d320>] ? sync_page+0x0/0x50
Aug 31 01:43:36 client-15 kernel: [<ffffffff814db3c3>] io_schedule+0x73/0xc0
Aug 31 01:43:36 client-15 kernel: [<ffffffff8110d35d>] sync_page+0x3d/0x50
Aug 31 01:43:36 client-15 kernel: [<ffffffff814dbada>] __wait_on_bit_lock+0x5a/0xc0
Aug 31 01:43:36 client-15 kernel: [<ffffffff8110d2f7>] __lock_page+0x67/0x70
Aug 31 01:43:36 client-15 kernel: [<ffffffff8108e140>] ? wake_bit_function+0x0/0x50
Aug 31 01:43:36 client-15 kernel: [<ffffffff81120d17>] ? __writepage+0x17/0x40
Aug 31 01:43:36 client-15 kernel: [<ffffffff81122272>] write_cache_pages+0x392/0x4a0
Aug 31 01:43:36 client-15 kernel: [<ffffffff81120d00>] ? __writepage+0x0/0x40
Aug 31 01:43:36 client-15 kernel: [<ffffffff811223a4>] generic_writepages+0x24/0x30
Aug 31 01:43:36 client-15 kernel: [<ffffffff811223d1>] do_writepages+0x21/0x40
Aug 31 01:43:36 client-15 kernel: [<ffffffff8119bbdd>] writeback_single_inode+0xdd/0x2c0
Aug 31 01:43:36 client-15 kernel: [<ffffffff8119bfde>] writeback_sb_inodes+0xce/0x180
Aug 31 01:43:36 client-15 kernel: [<ffffffff8119c13b>] writeback_inodes_wb+0xab/0x1b0
Aug 31 01:43:36 client-15 kernel: [<ffffffff8119c4db>] wb_writeback+0x29b/0x3f0
Aug 31 01:43:36 client-15 kernel: [<ffffffff814dac27>] ? thread_return+0x4e/0x777
Aug 31 01:43:36 client-15 kernel: [<ffffffff8107a1a2>] ? del_timer_sync+0x22/0x30
Aug 31 01:43:36 client-15 kernel: [<ffffffff8119c7c9>] wb_do_writeback+0x199/0x240
Aug 31 01:43:36 client-15 kernel: [<ffffffff8119c8d3>] bdi_writeback_task+0x63/0x1b0
Aug 31 01:43:36 client-15 kernel: [<ffffffff8108dfc7>] ? bit_waitqueue+0x17/0xd0
Aug 31 01:43:36 client-15 kernel: [<ffffffff81130bd0>] ? bdi_start_fn+0x0/0x100
Aug 31 01:43:36 client-15 kernel: [<ffffffff81130c56>] bdi_start_fn+0x86/0x100
Aug 31 01:43:36 client-15 kernel: [<ffffffff81130bd0>] ? bdi_start_fn+0x0/0x100
Aug 31 01:43:36 client-15 kernel: [<ffffffff8108dd96>] kthread+0x96/0xa0
Aug 31 01:43:36 client-15 kernel: [<ffffffff8100c1ca>] child_rip+0xa/0x20
Aug 31 01:43:36 client-15 kernel: [<ffffffff8108dd00>] ? kthread+0x0/0xa0
Aug 31 01:43:36 client-15 kernel: [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
Aug 31 01:43:43 client-15 kernel: Lustre: 2534:0:(import.c:1160:completed_replay_interpret()) lustre-OST0000-osc-ffff880326e33c00: version recovery fails, reconnecting
Aug 31 01:43:43 client-15 kernel: LustreError: 167-0: This client was evicted by lustre-OST0000; in progress operations using this service will fail.

Maloo report: https://maloo.whamcloud.com/test_sets/e68e0d04-d3b4-11e0-8d02-52540025f9af

Please refer to the attached recovery-mds-scale.1314780314.log.tar.bz2 for more logs.



 Comments   
Comment by Peter Jones [ 31/Aug/11 ]

Oleg is going to look into this one

Comment by Oleg Drokin [ 01/Sep/11 ]

After reviewing the logs, I had a discussion with Mike about some strange things in there.

First of all, there seems to be a problem where we wait for clients to replay long-committed transactions. This should not cause a failure like the one seen here.

It seems we need more logs, in particular "+inode +rpctrace +ha".

YuJian, can you please run this test several times with the additional logging specified above (make sure it propagates to the OSTs) and with the patch at http://review.whamcloud.com/1318 for the incorrect gap in sequence?

Comment by Jian Yu [ 01/Sep/11 ]

"YuJian can you please run this test several times with the additional logging specified above (make sure it propagates to OSTs) and with patch for the incorrect gap in sequence that is located at http://review.whamcloud.com/1318"

Lustre Build: http://build.whamcloud.com/job/lustre-reviews/1967/
Distro/Arch: RHEL6/x86_64 (in-kernel OFED)
ENABLE_QUOTA=yes
FAILURE_MODE=HARD
FLAVOR=OSS

recovery-mds-scale(FLAVOR=OSS) failed as follows:

==== Checking the clients loads AFTER  failover -- failure NOT OK
Client load failed on node client-12-ib, rc=1
Client load failed during failover. Exiting
Found the END_RUN_FILE file: /home/yujian/test_logs/end_run_file
client-15-ib
Client load failed on node client-15-ib

client client-15-ib load stdout and debug files :
              /tmp/recovery-mds-scale.log_run_dbench.sh-client-15-ib
              /tmp/recovery-mds-scale.log_run_dbench.sh-client-15-ib.debug

/tmp/recovery-mds-scale.log_run_dbench.sh-client-15-ib:

<~snip~>
   2      3122     0.00 MB/sec  execute 211 sec  latency 283720.897 ms
   2      3122     0.00 MB/sec  execute 212 sec  latency 284720.990 ms
[3811] unlink ./clients/client0/~dmtmp/PM/PMC184.TMP failed (Input/output error) - expected NT_STATUS_OK
ERROR: child 0 failed at line 3811
Child failed with status 1

/tmp/recovery-mds-scale.log_run_dbench.sh-client-15-ib.debug:

<~snip~>
2011-09-02 04:10:10: dbench run starting
+ mkdir -p /mnt/lustre/d0.dbench-client-15-ib
+ load_pid=15556
+ wait 15556
+ rundbench -D /mnt/lustre/d0.dbench-client-15-ib 2
touch: missing file operand
Try `touch --help' for more information.
+ '[' 1 -eq 0 ']'
++ date '+%F %H:%M:%S'
+ echoerr '2011-09-02 04:15:44: dbench failed'
+ echo '2011-09-02 04:15:44: dbench failed'
2011-09-02 04:15:44: dbench failed

Syslog on client-15-ib showed that:

Sep  2 04:06:11 client-15 kernel: Lustre: 13220:0:(client.c:2530:ptlrpc_replay_interpret()) @@@ Version mismatch during replay
Sep  2 04:06:11 client-15 kernel:  req@ffff880311f5e000 x1378836125083746/t16967(16967) o-1->lustre-OST0005_UUID@192.168.4.135@o2ib:28/4 lens 408/400 e 0 to 0 dl 1314961626 ref 2 fl Interpret:R/ffffffff/ffffffff rc -75/-1
Sep  2 04:06:11 client-15 kernel: Lustre: 13220:0:(client.c:2530:ptlrpc_replay_interpret()) @@@ Version mismatch during replay
Sep  2 04:06:11 client-15 kernel:  req@ffff8803127f4000 x1378836125084042/t16981(16981) o-1->lustre-OST0005_UUID@192.168.4.135@o2ib:28/4 lens 408/400 e 0 to 0 dl 1314961634 ref 2 fl Interpret:R/ffffffff/ffffffff rc -75/-1
Sep  2 04:07:19 client-15 kernel: INFO: task flush-lustre-1:13377 blocked for more than 120 seconds.
Sep  2 04:07:19 client-15 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep  2 04:07:19 client-15 kernel: flush-lustre- D 0000000000000001     0 13377      2 0x00000080
Sep  2 04:07:19 client-15 kernel: ffff8803139c39a0 0000000000000046 0000000000000000 ffffffffa08d7424
Sep  2 04:07:19 client-15 kernel: 0000000000000000 ffff88030a086300 ffff8803139c3930 0000000100cf4d29
Sep  2 04:07:19 client-15 kernel: ffff8803139c1b38 ffff8803139c3fd8 000000000000f598 ffff8803139c1b38
Sep  2 04:07:19 client-15 kernel: Call Trace:
Sep  2 04:07:19 client-15 kernel: [<ffffffffa08d7424>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
Sep  2 04:07:19 client-15 kernel: [<ffffffff8110d320>] ? sync_page+0x0/0x50
Sep  2 04:07:19 client-15 kernel: [<ffffffff814db3c3>] io_schedule+0x73/0xc0
Sep  2 04:07:19 client-15 kernel: [<ffffffff8110d35d>] sync_page+0x3d/0x50
Sep  2 04:07:19 client-15 kernel: [<ffffffff814dbada>] __wait_on_bit_lock+0x5a/0xc0
Sep  2 04:07:19 client-15 kernel: [<ffffffff8110d2f7>] __lock_page+0x67/0x70
Sep  2 04:07:19 client-15 kernel: [<ffffffff8108e140>] ? wake_bit_function+0x0/0x50
Sep  2 04:07:19 client-15 kernel: [<ffffffff81120d17>] ? __writepage+0x17/0x40
Sep  2 04:07:19 client-15 kernel: [<ffffffff81122272>] write_cache_pages+0x392/0x4a0
Sep  2 04:07:19 client-15 kernel: [<ffffffff81120d00>] ? __writepage+0x0/0x40
Sep  2 04:07:19 client-15 kernel: [<ffffffff811223a4>] generic_writepages+0x24/0x30
Sep  2 04:07:19 client-15 kernel: [<ffffffff811223d1>] do_writepages+0x21/0x40
Sep  2 04:07:19 client-15 kernel: [<ffffffff8119bbdd>] writeback_single_inode+0xdd/0x2c0
Sep  2 04:07:19 client-15 kernel: [<ffffffff8119bfde>] writeback_sb_inodes+0xce/0x180
Sep  2 04:07:19 client-15 kernel: [<ffffffff8119c13b>] writeback_inodes_wb+0xab/0x1b0
Sep  2 04:07:19 client-15 kernel: [<ffffffff8119c4db>] wb_writeback+0x29b/0x3f0
Sep  2 04:07:19 client-15 kernel: [<ffffffff814dac27>] ? thread_return+0x4e/0x777
Sep  2 04:07:19 client-15 kernel: [<ffffffff8107a1a2>] ? del_timer_sync+0x22/0x30
Sep  2 04:07:19 client-15 kernel: [<ffffffff8119c7c9>] wb_do_writeback+0x199/0x240
Sep  2 04:07:19 client-15 kernel: [<ffffffff8119c8d3>] bdi_writeback_task+0x63/0x1b0
Sep  2 04:07:19 client-15 kernel: [<ffffffff8108dfc7>] ? bit_waitqueue+0x17/0xd0
Sep  2 04:07:19 client-15 kernel: [<ffffffff81130bd0>] ? bdi_start_fn+0x0/0x100
Sep  2 04:07:19 client-15 kernel: [<ffffffff81130c56>] bdi_start_fn+0x86/0x100
Sep  2 04:07:19 client-15 kernel: [<ffffffff81130bd0>] ? bdi_start_fn+0x0/0x100
Sep  2 04:07:19 client-15 kernel: [<ffffffff8108dd96>] kthread+0x96/0xa0
Sep  2 04:07:19 client-15 kernel: [<ffffffff8100c1ca>] child_rip+0xa/0x20
Sep  2 04:07:19 client-15 kernel: [<ffffffff8108dd00>] ? kthread+0x0/0xa0
Sep  2 04:07:19 client-15 kernel: [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
Sep  2 04:08:16 client-15 kernel: Lustre: 13220:0:(import.c:1160:completed_replay_interpret()) lustre-OST0005-osc-ffff88030d16ec00: version recovery fails, reconnecting
Sep  2 04:08:16 client-15 kernel: LustreError: 167-0: This client was evicted by lustre-OST0005; in progress operations using this service will fail.

<~snip~>

Sep  2 04:14:22 client-15 kernel: Lustre: 13220:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1378836125097599 sent from lustre-OST0003-osc-ffff88030d16ec00 to NID 192.168.4.135@o2ib has failed due to network error: [sent 1314962062] [real_sent 1314962062] [current 1314962062] [deadline 26s] [delay -26s]  req@ffff8803145b7000 x1378836125097599/t0(0) o-1->lustre-OST0003_UUID@192.168.4.135@o2ib:28/4 lens 368/512 e 0 to 1 dl 1314962088 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Sep  2 04:14:22 client-15 kernel: Lustre: 13220:0:(client.c:1778:ptlrpc_expire_one_request()) Skipped 92 previous similar messages
Sep  2 04:14:40 client-15 kernel: Lustre: 13220:0:(client.c:2530:ptlrpc_replay_interpret()) @@@ Version mismatch during replay
Sep  2 04:14:40 client-15 kernel:  req@ffff88031387c800 x1378836125088753/t24049(24049) o-1->lustre-OST0003_UUID@192.168.4.134@o2ib:6/4 lens 512/400 e 0 to 0 dl 1314962107 ref 2 fl Interpret:R/ffffffff/ffffffff rc -75/-1
Sep  2 04:15:44 client-15 kernel: Lustre: 13220:0:(import.c:1160:completed_replay_interpret()) lustre-OST0003-osc-ffff88030d16ec00: version recovery fails, reconnecting
Sep  2 04:15:44 client-15 kernel: LustreError: 167-0: This client was evicted by lustre-OST0003; in progress operations using this service will fail.
Sep  2 04:15:44 client-15 kernel: LustreError: 13219:0:(client.c:1060:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff880312372400 x1378836125097748/t0(0) o-1->lustre-OST0003_UUID@192.168.4.134@o2ib:28/4 lens 296/352 e 0 to 0 dl 0 ref 1 fl Rpc:/ffffffff/ffffffff rc 0/-1
Sep  2 04:15:44 client-15 kernel: LustreError: 13219:0:(client.c:1060:ptlrpc_import_delay_req()) Skipped 4 previous similar messages
Sep  2 04:15:44 client-15 kernel: LustreError: 15565:0:(namei.c:1111:ll_objects_destroy()) obd destroy objid 0x466 error -5
Sep  2 04:15:44 client-15 kernel: Lustre: lustre-OST0003-osc-ffff88030d16ec00: Connection restored to service lustre-OST0003 using nid 192.168.4.134@o2ib.

Maloo report: https://maloo.whamcloud.com/test_sets/d53a0c1a-d55d-11e0-8d02-52540025f9af

Please refer to the attached recovery-oss-scale.1314962198.log.tar.bz2 for more logs.

Comment by Jian Yu [ 07/Sep/11 ]

Lustre Build: http://build.whamcloud.com/job/lustre-reviews/2042/
Distro/Arch: RHEL6/x86_64 (in-kernel OFED)
ENABLE_QUOTA=yes
FAILURE_MODE=HARD
FLAVOR=OSS

recovery-mds-scale(FLAVOR=OSS) failed: https://maloo.whamcloud.com/test_sets/bade20c4-d93e-11e0-8d02-52540025f9af

Please refer to the attached recovery-oss-scale.1315388587.log.tar.bz2 for more logs.

Comment by Jessica A. Popp (Inactive) [ 08/Sep/11 ]

Oleg,

Can you take a look at the updated logs from Yu Jian so that we can make a decision on RC2 tomorrow?

Thanks,

Jessica

Comment by Mikhail Pershin [ 08/Sep/11 ]

Jessica, Oleg, I have already looked. Another patch was pushed to Gerrit to gather more information. It will be tested today.

Comment by Jian Yu [ 09/Sep/11 ]

Lustre Build: http://build.whamcloud.com/job/lustre-reviews/2087/
Distro/Arch: RHEL6/x86_64 (in-kernel OFED)
ENABLE_QUOTA=yes
FAILURE_MODE=HARD
FLAVOR=OSS

recovery-mds-scale(FLAVOR=OSS) failed:
https://maloo.whamcloud.com/test_sets/a7591bee-dac9-11e0-8d02-52540025f9af
https://maloo.whamcloud.com/test_sets/57731e6e-dac8-11e0-8d02-52540025f9af

Please refer to the attached recovery-oss-scale.1315558735.log.tar.bz2 and recovery-oss-scale.1315560362.log.tar.bz2 for more logs.

Comment by Mikhail Pershin [ 13/Sep/11 ]

I found the cause of the issue. Let's look at the fsfilt code that gets the inode version:

static __u64 get_i_version(struct inode *inode)
{
#if (LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,27)) && defined(HAVE_EXT4_LDISKFS)
        return inode->i_version;
#else
        return EXT3_I(inode)->i_fs_version;
#endif
}

With RHEL6 and ext4, this code uses the inode->i_version field to store versions. However, ext4 also uses this field internally and can change it. This breaks recovery and causes client eviction.

We need to use i_fs_version on RHEL6 too; moreover, the series already contains a patch to support it. I've updated the review tracker with the fix, and the new build showed no sign of the issue.
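
The failure mode described above can be illustrated with a small, self-contained sketch. This is not Lustre code: every toy_* name and the simplified replay check are hypothetical, and the model assumes only what the comment states, namely that a version counter shared with the filesystem can change behind the recovery layer's back while a private counter cannot. The sketch returns -EOVERFLOW on a mismatch to mirror the "rc -75" logged next to "Version mismatch during replay" (EOVERFLOW is 75 on Linux).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an inode that carries two version counters. */
struct toy_inode {
    uint64_t i_version;    /* shared: the backing filesystem may change it */
    uint64_t i_fs_version; /* private: only the recovery layer writes it */
};

/* Which counter the recovery layer treats as the object version. */
enum toy_version_source { TOY_USE_I_VERSION, TOY_USE_I_FS_VERSION };

/* Read the object version from the chosen counter. */
static uint64_t toy_get_version(const struct toy_inode *inode,
                                enum toy_version_source src)
{
    assert(inode != NULL);
    return src == TOY_USE_I_VERSION ? inode->i_version
                                    : inode->i_fs_version;
}

/* Models the backing filesystem updating the inode for its own purposes. */
static void toy_fs_internal_update(struct toy_inode *inode)
{
    assert(inode != NULL);
    inode->i_version++;
}

/*
 * A replay is accepted only if the object is still at the version that
 * was recorded before the original operation.
 */
static int toy_replay_check(const struct toy_inode *inode,
                            enum toy_version_source src,
                            uint64_t pre_version)
{
    return toy_get_version(inode, src) == pre_version ? 0 : -EOVERFLOW;
}

/*
 * One scenario: record the version, let the filesystem touch the inode,
 * then attempt the replay against the recorded version.
 */
static int toy_replay_after_fs_update(enum toy_version_source src)
{
    struct toy_inode inode = { .i_version = 7, .i_fs_version = 7 };
    uint64_t pre_version = toy_get_version(&inode, src);

    toy_fs_internal_update(&inode);
    return toy_replay_check(&inode, src, pre_version);
}
```

In this model, toy_replay_after_fs_update(TOY_USE_I_VERSION) returns -EOVERFLOW because the shared counter moved between recording and replay, while toy_replay_after_fs_update(TOY_USE_I_FS_VERSION) returns 0 because nothing but the recovery layer writes the private counter.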

Comment by Peter Jones [ 14/Sep/11 ]

The latest version of this patch is confirmed as resolving this issue - http://review.whamcloud.com/#change,1342 . It should land subject to inspections.

Comment by Build Master (Inactive) [ 14/Sep/11 ]

Integrated in lustre-master » x86_64,server,el5,ofa #280
LU-653 Ignore last transno from clients with no outstanding transactions

Oleg Drokin : 280d8b6a1538f4ad9d2acdd045b970811e895c43
Files :

  • lustre/ldlm/ldlm_lib.c

Comment by Build Master (Inactive) [ 14/Sep/11 ]

Integrated in lustre-master » i686,client,el6,inkernel #280
LU-653 Ignore last transno from clients with no outstanding transactions

Oleg Drokin : 280d8b6a1538f4ad9d2acdd045b970811e895c43
Files :

  • lustre/ldlm/ldlm_lib.c

Comment by Build Master (Inactive) [ 14/Sep/11 ]

Integrated in lustre-master » x86_64,client,el5,inkernel #280
LU-653 Ignore last transno from clients with no outstanding transactions

Oleg Drokin : 280d8b6a1538f4ad9d2acdd045b970811e895c43
Files :

  • lustre/ldlm/ldlm_lib.c

Comment by Build Master (Inactive) [ 14/Sep/11 ]

Integrated in lustre-master » x86_64,server,el6,inkernel #280
LU-653 Ignore last transno from clients with no outstanding transactions

Oleg Drokin : 280d8b6a1538f4ad9d2acdd045b970811e895c43
Files :

  • lustre/ldlm/ldlm_lib.c

Comment by Build Master (Inactive) [ 15/Sep/11 ]

Integrated in lustre-master » x86_64,client,sles11,inkernel #280
LU-653 Ignore last transno from clients with no outstanding transactions

Oleg Drokin : 280d8b6a1538f4ad9d2acdd045b970811e895c43
Files :

  • lustre/ldlm/ldlm_lib.c

Comment by Build Master (Inactive) [ 15/Sep/11 ]

Integrated in lustre-master » x86_64,client,el6,inkernel #280
LU-653 Ignore last transno from clients with no outstanding transactions

Oleg Drokin : 280d8b6a1538f4ad9d2acdd045b970811e895c43
Files :

  • lustre/ldlm/ldlm_lib.c

Comment by Build Master (Inactive) [ 15/Sep/11 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #280
LU-653 Ignore last transno from clients with no outstanding transactions

Oleg Drokin : 280d8b6a1538f4ad9d2acdd045b970811e895c43
Files :

  • lustre/ldlm/ldlm_lib.c

Comment by Build Master (Inactive) [ 15/Sep/11 ]

Integrated in lustre-master » x86_64,server,el5,inkernel #280
LU-653 Ignore last transno from clients with no outstanding transactions

Oleg Drokin : 280d8b6a1538f4ad9d2acdd045b970811e895c43
Files :

  • lustre/ldlm/ldlm_lib.c

Comment by Build Master (Inactive) [ 15/Sep/11 ]

Integrated in lustre-master » x86_64,client,el5,ofa #280
LU-653 Ignore last transno from clients with no outstanding transactions

Oleg Drokin : 280d8b6a1538f4ad9d2acdd045b970811e895c43
Files :

  • lustre/ldlm/ldlm_lib.c

Comment by Build Master (Inactive) [ 15/Sep/11 ]

Integrated in lustre-master » i686,server,el6,inkernel #280
LU-653 Ignore last transno from clients with no outstanding transactions

Oleg Drokin : 280d8b6a1538f4ad9d2acdd045b970811e895c43
Files :

  • lustre/ldlm/ldlm_lib.c

Comment by Build Master (Inactive) [ 15/Sep/11 ]

Integrated in lustre-master » i686,server,el5,inkernel #280
LU-653 Ignore last transno from clients with no outstanding transactions

Oleg Drokin : 280d8b6a1538f4ad9d2acdd045b970811e895c43
Files :

  • lustre/ldlm/ldlm_lib.c

Comment by Build Master (Inactive) [ 15/Sep/11 ]

Integrated in lustre-master » i686,client,el5,ofa #280
LU-653 Ignore last transno from clients with no outstanding transactions

Oleg Drokin : 280d8b6a1538f4ad9d2acdd045b970811e895c43
Files :

  • lustre/ldlm/ldlm_lib.c

Comment by Build Master (Inactive) [ 15/Sep/11 ]

Integrated in lustre-master » i686,client,el5,inkernel #280
LU-653 Ignore last transno from clients with no outstanding transactions

Oleg Drokin : 280d8b6a1538f4ad9d2acdd045b970811e895c43
Files :

  • lustre/ldlm/ldlm_lib.c

Comment by Build Master (Inactive) [ 15/Sep/11 ]

Integrated in lustre-master » i686,server,el5,ofa #280
LU-653 Ignore last transno from clients with no outstanding transactions

Oleg Drokin : 280d8b6a1538f4ad9d2acdd045b970811e895c43
Files :

  • lustre/ldlm/ldlm_lib.c

Comment by Build Master (Inactive) [ 16/Sep/11 ]

Integrated in lustre-master » x86_64,server,el5,ofa #281
LU-653 i_version shouldn't be used for VBR

Oleg Drokin : 1e9326917af52f3d01920411465476154b2807d0
Files :

  • lustre/lvfs/fsfilt_ext3.c

Comment by Build Master (Inactive) [ 16/Sep/11 ]

Integrated in lustre-master » x86_64,client,el5,inkernel #281
LU-653 i_version shouldn't be used for VBR

Oleg Drokin : 1e9326917af52f3d01920411465476154b2807d0
Files :

  • lustre/lvfs/fsfilt_ext3.c

Comment by Build Master (Inactive) [ 16/Sep/11 ]

Integrated in lustre-master » x86_64,client,sles11,inkernel #281
LU-653 i_version shouldn't be used for VBR

Oleg Drokin : 1e9326917af52f3d01920411465476154b2807d0
Files :

  • lustre/lvfs/fsfilt_ext3.c

Comment by Build Master (Inactive) [ 16/Sep/11 ]

Integrated in lustre-master » x86_64,client,el6,inkernel #281
LU-653 i_version shouldn't be used for VBR

Oleg Drokin : 1e9326917af52f3d01920411465476154b2807d0
Files :

  • lustre/lvfs/fsfilt_ext3.c

Comment by Build Master (Inactive) [ 16/Sep/11 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #281
LU-653 i_version shouldn't be used for VBR

Oleg Drokin : 1e9326917af52f3d01920411465476154b2807d0
Files :

  • lustre/lvfs/fsfilt_ext3.c

Comment by Build Master (Inactive) [ 16/Sep/11 ]

Integrated in lustre-master » x86_64,server,el5,inkernel #281
LU-653 i_version shouldn't be used for VBR

Oleg Drokin : 1e9326917af52f3d01920411465476154b2807d0
Files :

  • lustre/lvfs/fsfilt_ext3.c

Comment by Build Master (Inactive) [ 16/Sep/11 ]

Integrated in lustre-master » i686,client,el6,inkernel #281
LU-653 i_version shouldn't be used for VBR

Oleg Drokin : 1e9326917af52f3d01920411465476154b2807d0
Files :

  • lustre/lvfs/fsfilt_ext3.c

Comment by Build Master (Inactive) [ 16/Sep/11 ]

Integrated in lustre-master » x86_64,server,el6,inkernel #281
LU-653 i_version shouldn't be used for VBR

Oleg Drokin : 1e9326917af52f3d01920411465476154b2807d0
Files :

  • lustre/lvfs/fsfilt_ext3.c

Comment by Build Master (Inactive) [ 16/Sep/11 ]

Integrated in lustre-master » i686,server,el5,inkernel #281
LU-653 i_version shouldn't be used for VBR

Oleg Drokin : 1e9326917af52f3d01920411465476154b2807d0
Files :

  • lustre/lvfs/fsfilt_ext3.c

Comment by Build Master (Inactive) [ 16/Sep/11 ]

Integrated in lustre-master » x86_64,client,el5,ofa #281
LU-653 i_version shouldn't be used for VBR

Oleg Drokin : 1e9326917af52f3d01920411465476154b2807d0
Files :

  • lustre/lvfs/fsfilt_ext3.c

Comment by Build Master (Inactive) [ 16/Sep/11 ]

Integrated in lustre-master » i686,server,el6,inkernel #281
LU-653 i_version shouldn't be used for VBR

Oleg Drokin : 1e9326917af52f3d01920411465476154b2807d0
Files :

  • lustre/lvfs/fsfilt_ext3.c

Comment by Build Master (Inactive) [ 16/Sep/11 ]

Integrated in lustre-master » i686,client,el5,ofa #281
LU-653 i_version shouldn't be used for VBR

Oleg Drokin : 1e9326917af52f3d01920411465476154b2807d0
Files :

  • lustre/lvfs/fsfilt_ext3.c

Comment by Build Master (Inactive) [ 16/Sep/11 ]

Integrated in lustre-master » i686,client,el5,inkernel #281
LU-653 i_version shouldn't be used for VBR

Oleg Drokin : 1e9326917af52f3d01920411465476154b2807d0
Files :

  • lustre/lvfs/fsfilt_ext3.c

Comment by Build Master (Inactive) [ 16/Sep/11 ]

Integrated in lustre-master » i686,server,el5,ofa #281
LU-653 i_version shouldn't be used for VBR

Oleg Drokin : 1e9326917af52f3d01920411465476154b2807d0
Files :

  • lustre/lvfs/fsfilt_ext3.c

Comment by Peter Jones [ 16/Sep/11 ]

Landed for 2.1

Generated at Sat Feb 10 01:09:08 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.