I've been using AWS Elastic Load Balancers (ELBs) for some time now. Their ease of use and scalability are appealing.
But I learned something unpleasant about them: you can easily end up with misdirected traffic.
When you use an ELB, your users reach it via a CNAME (set up by you and served by your DNS servers) that points to a DNS name owned by Amazon (in the elb.amazonaws.com DNS zone). When a client looks up that name, Amazon returns the IP address of an ELB machine that routes the traffic to your EC2 instances. From time to time, in response to traffic conditions, those IP addresses change. To accommodate such changes, the TTL on the domain name is low (60 seconds).
The problem is that some clients do not honour the TTL and may keep using the old IP address after it has been disassociated from your ELB, by which time it may have been given to someone else's ELB. This means that your ELB may receive traffic meant for someone else, or that traffic meant for you may go to someone else.
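To make "honouring the TTL" concrete, here is a minimal sketch of a client-side cache that forgets a resolved address once the TTL has passed. It is only an illustration: the class name `TtlResolver` is hypothetical, and because Python's `socket.getaddrinfo` does not expose the real TTL, the sketch simply assumes the 60 seconds mentioned above. The operating system's resolver may also cache answers on its own.

```python
import socket
import time


class TtlResolver:
    """Cache DNS answers for at most ttl_seconds, then look them up again."""

    def __init__(self, ttl_seconds=60, resolve=None, clock=time.monotonic):
        # Assumption: a fixed TTL, since getaddrinfo does not report the real one.
        self.ttl_seconds = ttl_seconds
        # resolve(host) -> list of IP strings; defaults to the system resolver.
        self._resolve = resolve or self._system_resolve
        self._clock = clock
        self._cache = {}  # host -> (expiry_time, [ip, ...])

    @staticmethod
    def _system_resolve(host):
        infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
        # Each entry is (family, type, proto, canonname, sockaddr);
        # sockaddr[0] is the IP address as a string.
        return sorted({info[4][0] for info in infos})

    def lookup(self, host):
        now = self._clock()
        cached = self._cache.get(host)
        if cached is not None and now < cached[0]:
            return cached[1]  # still within the TTL
        # Expired or never seen: resolve again and restart the timer.
        addresses = self._resolve(host)
        self._cache[host] = (now + self.ttl_seconds, addresses)
        return addresses
```

A long-running client would call `lookup()` before opening each new connection instead of resolving the name once at startup and holding on to the IP address indefinitely.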
I found out about this because my instances behind an ELB suddenly started receiving 15K requests per minute destined for ping.syndic8.com (judging by the Host header), sent by Ping-O-Matic (judging by the User-Agent header) and other clients. A DNS lookup of that name showed their ELB address, not mine. This persisted for many hours, and impacted my application.
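Since the Host header is what gave the misdirected traffic away, one possible mitigation on the receiving side is to reject requests for hostnames you don't serve as early and cheaply as possible. The following is a rough sketch rather than anything ELB-specific: the function name `is_expected_host` and the hostname `www.example.com` are made up for illustration.

```python
def is_expected_host(host_header, allowed_hosts):
    """Return True if the Host header names one of our own hostnames."""
    if not host_header:
        return False
    host = host_header.strip().lower()
    if host.startswith("["):
        # IPv6 literal such as "[::1]:8080": keep the bracketed part only.
        host = host.split("]")[0] + "]"
    else:
        # Drop an optional ":port" suffix.
        host = host.split(":")[0]
    # Treat "example.com." and "example.com" as the same name.
    host = host.rstrip(".")
    return host in {h.lower() for h in allowed_hosts}


# Hypothetical set of names this application actually serves.
ALLOWED_HOSTS = {"www.example.com"}
```

A request carrying `Host: ping.syndic8.com` would fail this check and could be answered with a cheap error response instead of reaching the application. This does not stop the traffic from arriving, so it only limits the damage.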
This has been discussed in the forums at times: 28 Dec 2010, 28 Jan 2011, 21 Mar 2011. It looks like it doesn't happen (or doesn't get noticed) all that often, but it is treated as a known issue that you are somehow meant to be aware of, and that you should either accept or escalate to support.
I think this is a big deal:
- Clients get service errors, or have their traffic exposed to a third party. Sure, it's "their fault" for not honouring the TTL, but depending on the kind of application that you serve this could be a common bug, or outside the control of the person running the client. I imagine this is especially likely with scripted API clients.
- Being on the receiving end of misdirected traffic can affect your level of service. Sure, if you're using AutoScaling this may help your load and latency, but at a financial cost.
I think that at the very least AWS should wait longer before recycling IP addresses. Ideally they wouldn't recycle IPs at all -- roll on IPv6. Maybe they could add an ability to associate several elastic IPs, but I imagine that is hard.
I have various other issues with ELB:
- it is slow to add new servers,
- you can't have two ELBs pointing to the same instance (e.g. during an ELB migration),
- you can't inspect traffic before it hits the ELB,
- you can't block client IP addresses before they hit your servers,
- your connection closes after 60 seconds of idle time,
- you can't limit your instances to only receive traffic from your ELB (only from all ELBs),
- it's hard to know what logic the ELB applies to your connection,
but this one is pushing me over the edge -- time to reconsider some of the alternatives.