Always Get Better

Never stop looking for ways to improve

April 24th, 2011

There is a lot of debate among newcomers to Azure whether to use Azure SQL or file-based Table Storage for data persistence. Azure itself does not make the differences very clear, so let’s take a closer look at each option and try to understand the implications of each.

Azure SQL
Azure SQL is very similar to SQL Server Express. It is meant as a replacement for SQL Server Standard and Enterprise, but the feature set is not there yet. Reporting services, in particular, are in community technology preview (CTP) phase and at the time of writing are not ready for prime time.

Programmatically, Azure SQL is very similar to SQL in existing web applications; any ODBC classes you already wrote will work directly on Azure with no changes.

Size and cost are the major limitations with Azure SQL in its current incarnation. The largest supported data size is 50GB which runs for $499/month. Any databases larger than 50GB would need to be split across multiple instances – this requires knowledge of sharing at the application level and is not easy surgery to perform.

Table Storage
Table storage uses a key-value pair to retrieve data stored on Azure’s disk system, similar to the way memcache and MySQL work together to provide the requested data at fast speeds. Each storage container supports 100TB of data at incredibly cheap ($0.15/GB) rates.

Working with table storage involves accessing them directly from your application differently than you may be accustomed to with SQL. Going all-in ties you to the Azure platform – which is probably not a problem if you’re already developing for Azure as you will likely be trying to squeeze every ounce of performance out of all areas of the platform anyway.

Table storage does not support foreign key references, joins, or any of the other SQL-stuff we usually use. It is up to the programmer to compensate by making wide de-normalized tables and build their lists in memory. If you’re already building clustered applications, this is not a new design pattern as developers typically want to cache data in this manner.

Besides the larger space limits, table storage affords us automatic failover. Microsoft’s SLA guarantees we will always be able to access the data, and this is accomplished by replicating everything across at least three nodes. Compared to self-managing replication and failover with the SQL service, this is a huge advantage as it keeps the complexity out of our hands.

Difference in Evolution
If Azure SQL seems somewhat stunted compared to Table Storage, it’s not an accident: it is a newcomer who was not planned during the original construction of the Azure platform. Microsoft carefully considered the high-availabilty trends used for application development and found that the NoSQL way would most easily scale to their platform. Developer outrage prompted the company to develop Azure SQL, and its service offerings are improving rapidly.

Depending on your storage needs, your course of action may be to store as much data for cheap in Table Storage, and use SQL to index everything. Searching the SQL database will be incredibly fast, and can be done in parallel with any loads against persistent tables – everybody wins.

April 23rd, 2011
Fixed
Creative Commons License photo credit: Don Fulano

Amazon is in the news today for the failure their Elastic Block Storage (EBS) service suffered, resulting in loss of service and/or extreme latency for hundreds of sites including some of their largest customers like Foursquare and reddit. AWS has been widely regarded as the most stable and overall leader of the cloud providers, so it was a great shock to many observers that they were able to suffer such a large failure.

I think the failure is not surprising, but rather the fact that it hasn’t happened before now is surprising. It underscores my message that cloud computing is not magical but is in fact an abstraction over very real hardware. There are bound to be flaws and issues just as with “real” hosting options, the difference is the end customer has less control over the hardware, hosting and networking environment.

Not every business can afford the overhead of a large dedicated solution, so what to do?

Spread the Load
The key is redundancy. Start by spreading your content across the internet rather than relying on single server to cough up all of your visitors’ needs. Things like content delivery networks (CDNs) will reduce the incoming load on the server and help it stay online.

How can we tell if a website is offloading the right amount of content? Perform regular speed testing and identify problem areas using tools like YSlow.

Redundancy! Eliminate Single Points of Failure
Whenever you have a single system servicing part of your application, you expose the entire application to failure.

For example, suppose you have four Apache servers and a load balancer sending equal traffic to each. If one of the Apache servers fails, the other three are able to compensate for the loss with no downtime for your visitors. But what happens if the load balancer fails? Even though all four web servers are in fine working order, your site is knocked offline.

Some systems are difficult to cluster: replication schemes in the various SQL servers are a huge drain on performance – newer solutions like MySQL Cluster or DrizzleDB aim to solve this problem, but at extra expense in terms of configuration and application design.

The key to successful redundancy is in scripting your software in such a way that failures can be recovered from fast and automatically. Having a hot spare in the group isn’t useful if you need to reach an administrator at 4am to activate – by that point you’ve already lost your overseas customers for the day.

Twilio has an excellent summary of the engineering process that goes into creating a scalable cloud-ready infrastructure.

Avoid the Cloud? Never
Despite some public failures, cloud computing has not suffered any kind of blow. Large organizations will always want their own private non-cloud hosting, small sites will always be looking for an inexpensive VWS. The middle-tier which is serviced by the cloud will continue to see cost savings that greatly outweigh any physical hosting options available at that level.

Because of the low server cost, cloud computing allows smart customers the freedom to build necessary redundancy without breaking the bank. Even though this pays off big time when catastrophic failures happen, there are longer term benefits of improved overall response times to the end users even when the hosting is working well.

April 22nd, 2011

If your testing process involves copying files over to a development environment and hoping you have the most up-to-date version of everything, stop now. There is a better way, and it is continuous integration.

CI works by constantly and automatically testing and publishing everyone’s code into a staging environment. Any missing files or major problems are caught immediately so the team doesn’t run into major conflicts at the end of the development cycle. Although continuous integration does not eliminate code bugs, it improves the software quality by allowing developers to identify and fix problems immediately, resulting in a final product with much fewer defects.

Although there are a lot of different parts to a successful CI process, it is possible to start up slowly and add to the regime as your team becomes more comfortable with their improved success rates.

1. Establish a centralized source code repository
A solid source code manager (SCM) is the backbone of a professional development setup. There are so many tools available for every possible organization, both paid and free. I like git, but Subversion is used in a lot of environments. The key thing to look for is something that can track and rollback code changes over time and is shared between everyone working on the project.

2. Automate the build
After a good SCM has become part of development procedure, automate the build What this means, is every time someone commits to the SCM, have a tool like buildbot pull the fills and deploy them to a testing environment. From this stage on disallow anyone from making changes directly to the testing environment.

3. Automate Testing
Next add unit testing. When the build agent runs the unit tests, it should prevent the source code from going into the testing environment if any error conditions exist in the code. This will let the team make more daring changes to the source code without having to worry about time-consuming reversion testing.

4. Automate Deployment
At this point we’re already have an automated way of deploying our code to a testing area – take it a step further and automate your deployment process. This will let you get your work into the hands of your customers faster and involve them in the planning process earlier – all resulting in an improved product and higher likelihood of staying on-time and on-budget.

April 21st, 2011

The best way to keep visitors engaged in your website is by delivering your experience in as little time as possible. The average visitor will only stick around for a few seconds, so it is important to get them interacting with your content fast. The first thing to check for, of course, is any bottlenecks in the initial page generation. Once the web page is being generated quickly, we can turn our attention to the next biggest culprit: the connection to your client.

Downloading files directly from a web server is costly, even if you’re using an efficient server like nginx for static files.

A content delivery network (CDN) can help speed up the process by storing your content in data centres around the world so they get served to your visitors from locations that are physically close to them. This results in fewer network hops which makes the files download faster, and reduces the overall load on your web server so you can focus on doing more interesting dynamic application stuff.

At one point, CDN services were only available to companies with deep pockets and huge websites, but these days anyone can set up and use an inexpensive service with their regular hosting provider.

Check with your host to see if they offer a content delivery solution. The two providers I use for my blog, Media Temple and Rackspace both have excellent services. If you are using a WordPress site, check out W3 Total Cache, which provides an all-in-one package for managing your files and optimizing the overall speed of your site.

April 20th, 2011

Maintaining database schemas across development environments (especially in teams) and in production can be a real nightmare. Fortunately there are a number of solutions which make database management easier.

Migrations
This can be done manually or automatically. As database changes are made by developers, scripts are generated which can be run against a master database to bring it in line with the developer’s version. The most basic way to accomplish this is by writing a script manually, but frameworks like Django and Rails have built-in migration tools which manage this process. Rails in particular allows developers to move back and forth between snapshots of database schemas.

Evolutions
Evolutionary systems detect database schema changes against program code definitions. As of April 2011, Play Framework supports Evolutions.

Schema Versions
Microsoft SQL Server supports schema versions; wherein the underlying data remains the same, but multiple versions of the database schema rest on top and can be accessed simultaneously. This keeps older versions of the application or supporting clients working with the existing data set.

Keep Tracking…
Managing database changes can be a challenge for organizations of any size. The correct tool depends on a wide range of factors including your project size, number of team members, release schedule.

What kind of tools and processes do you use to manage database changes?

April 19th, 2011
Tortoise
Creative Commons License photo credit: Eric Kilby

Why is my website loading so slowly?!?

There are a few common culprits behind website speed issues. When diagnosing problems, the best bet is to start at the worst performers and move up. Some suggestions, in order from slowest to fastest, are:

1. Internet Traffic
If your web page is downloading anything over the internet during each page request, stop right now. This is the most expensive operation you can perform. Example: Downloading a photo from Flickr and loading it into memory in order to determine its width and height dimensions.

2. Network Traffic
Local network traffic is generally very fast, but still involves transmitting information outside your computer. In some cases, such as web clusters with a shared session cache, the network performance cost is worth it for the overall application.

3. Database
Databases are fast, particularly when the data you need is already stored in a memory cache – which you generally can’t control. When paired with a key-value memory store like memcache, the majority of your database calls can come straight from memory.

4. Disk I/O
Even with the incredible access times found in today’s hard drives, reading and writing from the disk is an expensive operation (and why databases lose points, except for their memory caching abilities). Sometimes reading from disk is the better choice – YMMV.

5. Script Caching
Implement a tool like xcache (PHP). This will keep your code in binary bytecode format which is much faster to execute since it doesn’t have to be re-processed by the web server.

April 18th, 2011

MySQL+Memcache have been bedfellows for awhile and at this point are the de facto standard for highly-available, scalable websites. Even with other SQL and NoSQL solutions starting to become popular, this pair holds on as the winner for LAMP programmers. Is the complexity of working with this technology pair worth the investment?

Read vs Write
Traditional relational databases place the burden of computation on read operations. In mainframe environments with powerful servers and relatively few users, this made sense. Database normalization prevents redundancy, and data can be joined together when needed to produce the desired results.

In a web application with 1,000,000 users, the normalized transactional model does not perform. Generally speaking it is way faster to make two queries to a small subset of data rather than attempt expensive joins in a client-facing web site.

Enter memcache: by storing the result of our SQL queries in memory, we improve the speed of subsequent requests by pulling the data from memory as well as avoiding a hit to the database entirely, freeing it to process urgent or real-time requests.

Anatomy of an SQL Query
When we run an SQL query, we are actually asking the server to perform a lot of work:

  1. Break down the query into object references: The DBMS needs to understand which tables, columns, and filters you are using by tokenizing your SQL (by breaking out names from the keywords like SELECT, FROM, WHERE).
  2. Identify which indexes (if any) are most relevant to the data: This is harder with more complex queries which must be broken down or which depend on outer tables for their values.
  3. Read the source tables from the hard drive: Most DBMS implementations include some kind of memory caching which partially avoids this expensive read step, but some disk IO is a normal part of operation
  4. Join Columns: If we specify a join, especially a LEFT or RIGHT join, the DBMS has to create a pseudo table from the joined sources in memory before it can do any additional processing.
  5. Sub-selects: Any sub-select statements need to be processed. Depending on how the statement was written, this needs to be done for every row returned in the result set.
  6. Filter and sort: Anything in the ‘WHERE’ clause needs to be filtered out of the result set. This is where we are going to start seeing performance improvements by narrowing our result set.
  7. Aggregation: Once the database has its final result set it can do all of the aggregation we ask of it, both calculations and grouping

As we can imagine, this can be a time-consuming process. If it is repeated thousands of times in a short period, we will see significant performance loss.

Anatomy of a Cache Request
By contrast, when we perform a request for data from a cache like memcache, we do this:

  1. Check the index for the presence of the supplied key
  2. If a result exists, return it

In the case of memcache, this happens entirely in memory with no hit to the database whatsoever, resulting in a blindingly fast result set.

Speed Over Persistance
The reason “>memcache and MySQL work well as a pair is because they provide the tools needed to have reliable, persistent transactional storage (through MySQL) along with lightning fast data retrieval (through Memcache) especially for rarely-changing results.