repeated "Read timed out" errors when recovering a large sized shards from S3 repository #149
With ES 1.4.x and plugin AWS 2.4.1, we have retries at the Snapshot/Restore level in ES, at the AWS plugin level (max_retries), and internally in the AWS SDK used by the plugin (max_retries if set, otherwise it will try each request 3 times). Setting a large number of retries therefore ends up with up to max_retries * max_retries tries for each request. If read timeouts still happen with that many retries, I think this could be a network or DNS issue. |
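For reference, a minimal sketch of where the plugin-level retry knob lives, via the snapshot repository API (the repository name, bucket, and the value 3 are placeholders):

```sh
# Register an S3 repository with an explicit plugin-level retry count.
# Each request can additionally be retried inside the AWS SDK, which is
# how the effective ceiling becomes roughly max_retries * max_retries.
curl -XPUT 'localhost:9200/_snapshot/my_s3_repo' -d '{
  "type": "s3",
  "settings": {
    "bucket": "my-snapshot-bucket",
    "max_retries": 3
  }
}'
```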
I've experienced the same problem, and very often, but with much smaller snapshots, in fact around 15 MB. And with ES 1.4.1 and plugin AWS 2.4.1 it looks like the timeout just takes longer to arise, but it's the same: |
for the sake of reference |
I encountered this yesterday while restoring a 20TB index with 128 shards from S3. S3 was plenty fast, but frequent timeouts (probably due to saturating our AWS Direct Connect link) caused shard recovery to reset immediately. No shards ever recovered successfully in several hours. We were using ES 1.4.2 and cloud-aws 2.4.1. |
elasticsearch-cloud-aws 2.4.1 uses AWS SDK version 1.7.13. In AWS SDK v1.8.10.2 a bug causing socket timeouts has been fixed as indicated in the changelog:
Version 2.5.0 of the plugin uses AWS SDK version 1.9.3, so please let us know if the bug still exists once you upgrade elasticsearch & the plugin to a newer version. |
@tlrx The plan is also to release a new version of the AWS plugin for 1.4 containing this fix. |
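A hedged sketch of the upgrade path on a 1.x node (the `--remove`/`--install` flags of the old plugin manager varied slightly across 1.x releases, so adjust to your version):

```sh
# Swap the plugin for the release that bundles the newer AWS SDK.
bin/plugin --remove cloud-aws
bin/plugin --install elasticsearch/elasticsearch-cloud-aws/2.5.0
```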
Still seeing timeouts with ES 1.5.1 and cloud-aws 2.5.0.
These are coming in pretty regularly, same as with 1.4.4 and 2.4.1. This is a 20TB index with 128 shards. The restore is ongoing, but so far there have been 4 timeouts and zero recovered shards. |
All the primary shards did eventually recover after about 19 hours. During this time there were 90 socket timeouts. |
Tested restore with ES 1.5.0 & AWS plugin 2.5.0. |
@tlrx are you sure AWS plugin 2.5.0 uses AWS SDK version 1.9.3? I just installed the plugin, and the bundled SDK jar doesn't match that version, whereas the latest available from Amazon is newer still. |
I'm having the same issues as everyone else. Running ES 1.5.2 and AWS plugin 2.5.0, and facing persistent read timeouts from S3 while restoring a 350GB index for the last 24 hrs. |
@esetnik it is version 1.9.23, there's a typo in my comment. |
Do you have any suggested workarounds for other ways to recover from S3? I can download the files manually but I'm not sure how to serve them locally to the recovery process. |
@esetnik I did not try it, but I think you can download the files (including the snapshot-* and metadata-* files and all the folders) from S3 and store them on a local disk. Then you can try to create a new FS repository that points to the root folder, as in the sketch below. |
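A minimal sketch of that workaround, assuming the repository contents are copied to `/mnt/snapshots` and the snapshot is named `snapshot_1` (both placeholders; depending on the ES version you may also need the directory whitelisted via `path.repo` in elasticsearch.yml):

```sh
# Pull the whole repository layout down from S3 (snapshot-*, metadata-*, indices/...).
aws s3 sync s3://my-snapshot-bucket/ /mnt/snapshots/

# Register a filesystem repository pointing at the local copy.
curl -XPUT 'localhost:9200/_snapshot/local_copy' -d '{
  "type": "fs",
  "settings": {
    "location": "/mnt/snapshots"
  }
}'

# Restore from the local repository instead of going over the network to S3.
curl -XPOST 'localhost:9200/_snapshot/local_copy/snapshot_1/_restore'
```

Note the caveat raised below: the FS repository location has to be reachable from every data node.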
I tried that workaround and it works. |
Can you elaborate please? |
Sorry, I should have been clearer, and I could be wrong. What I should have written is: I think the limitation is that every node should be able to store the whole snapshot data. |
I wonder if we should try to add more options to the client we are using. http://docs.aws.amazon.com/AWSSdkDocsJava/latest/DeveloperGuide/section-client-configuration.html Like what we are doing in elastic/elasticsearch#15080 for azure. |
I had a problem with elasticsearch 1.7.1 using AWS plugin 2.7.1 restoring about 300 GB of snapshots. My cluster is running in a VPC and was accessing S3 via a t1.micro NAT instance. The restore was going to take a few days. The NAT instance was a bottleneck on my network bandwidth. I created a separate cluster in the same VPC but put them into a public subnet. Now S3 was accessed through an Internet Gateway, thus bypassing the NAT instance. Then I saw network I/O and restore times inline with what I would have expected from S3--about 3-4 hours to restore the data. |
Closing the loop here (timeout settings for AWS, please track this ticket): elastic/elasticsearch#15854 |
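For anyone tracking this later, a hedged sketch of what a node-level timeout override might look like once that ticket lands (the `cloud.aws.read_timeout` setting name and the value are assumptions based on the linked discussion, not a confirmed released API):

```sh
# Append an AWS client read-timeout override to the node config and restart.
# The setting name below is an assumption taken from the linked ticket;
# verify it against the release notes of your version before relying on it.
cat <<'EOF' >> config/elasticsearch.yml
cloud.aws.read_timeout: 120s
EOF
```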
When restoring a large index (150GB split into 5 shards) from S3, "Read timed out" errors are raised from the S3 input stream repeatedly.
This issue is somewhat of a duplicate of elastic/elasticsearch#8280,
which led me to test the recovery process using elasticsearch 1.4.0 & AWS plugin 2.4.1.
The test failed across a wide range of 'max_retries' values.
error log: