Scalability Rules – Review

Just read the book “Scalability Rules: 50 Principles for Scaling Web Sites” by Martin L. Abbott and Michael T. Fisher. I’d like to start out by saying that having the opportunity to meet the authors of this book was an honor. I only wish that I had read the book before meeting them. I’m inspired by this book; the length of this blog post should be a testament to that.

The book was an easy read and spot on. You can read the whole book in one sitting; I did, one Sunday afternoon. Many times I marveled at how we’re all coming to the same realizations, at different companies. I’ve been living this through experience working at a very fast-growing Internet company with millions of customers, dozens of SaaS-based services, and several data centers, some international. What I liked about this book was the affirmation of beliefs I share with those I work with. I could demonstrate examples of nearly all of these rules across our array of online services. There were plenty of aha moments! This book is a great introduction to many (all the important ones?) advanced web application scalability topics. If you think you already know them all, think again. Give this book a read. If you’re already an advanced level web app architect, you’ll breeze over much of it, then get an eye-opening surprise or three.

I’d like to reinforce how much I enjoyed the affirmation of my own beliefs, and the eye openers. Never before have I seen all of these principles/rules/beliefs (whatever you want to call them) together in one easily reference-able book. I’m going to buy many copies of this and hand them out at work, with the instruction: We should all know these rules, inside and out, through our combined experiences, and this book sums them all up. This is a must-have reference to keep on the desk.

Scalability Rules is very modern, in that it discusses the very latest in large scale web application trends. These aren’t the principles from 2000 or 2005; this is the culmination of all the latest trends, up to 2010 and 2011. Seriously, back in 2005, this stuff hadn’t surfaced yet. Some of the horizontal scaling principles existed, but none of the more modern sharding, NoSQL, page-cache, object-cache, CDN, and more had enough sustained experience behind them for all of us to know if it was all really worth the trouble. Very few sites in 2005 required much more than 2 or 3 web servers behind a load balancer and a database. I anticipated the growth that was about to happen, but it was hard to really know what it’s like until you live it.

Here are my brief comments on each of the rules:

Rule 1 Don’t Over Engineer The Solution
Reminds me of the KISS principle. So true.

Rule 2 Design Scale Into the Solution (D-I-D Process)
DID is a good rule of thumb to tell you how far out to design your system for scale. Have to consider how fast your system is going to grow, so you know how much to design/buy for upon deployment. Always need to ask yourself how well the solution will work at 1000 users, 10K users, 100K users, 1M+ users. If it’s going to take 5+ years to reach 100K users, you don’t have to design every scalability rule into your system at launch. Refer to Rule #1.

Rule 3 Simplify the Solution 3 Times Over
Keep in mind, the business usually has more optimistic growth expectations. The Pareto Principle, or 80/20 rule, says you can get 80% of the business value from 20% of the requirements. When time to delivery is important, remember the Pareto Principle. Earmark areas of your system that need attention in the future to scale, and don’t paint yourself into a corner. Make sure to use modular design principles, so you can scale later.

Rule 4 Reduce DNS Lookups
This is a kind of contradiction to the area in the book where it says to use multiple hostnames for CDN, like cache1.cdnprovider.tld and cache2.cdnprovider.tld. If you were to take this principle too far, or at face value, you might have engineers/architects trying to use IP addresses instead of hostnames. This would lead to portability problems for your application and I would not recommend this. DNS lookups are inevitable. I’ll change this rule for my own use to “Be conscious about the DNS architecture around your application”. Always use a DNS system that is distributed with anycast technology with many POPs around the world. The last thing you need is for a delayed DNS query response to hang your application!

Also, if your application relies on a 3rd party for a mashup-style page, try to make sure the 3rd party pieces are loaded post-page, or even better, with AJAX after page load, so they are not slowing down the main load of your site. I put this piece of advice here, because the DNS lookup on this 3rd party could be the cause of the delay. It might even be fast for you, and slow for others, depending on your proximity to the DNS authority for the 3rd party domain. If you’re using WordPress or some other CMS style application, it’s fun to enable all the social plugins and such. Watch out, though: the more 3rd party plugins you enable, the more susceptible you are to having long page loads due to these.

Rule 5 Reduce Objects Where Possible
YES! The best way to handle the effects of latency is to reduce round trips. This is the YSlow, PageSpeed, WebPagetest rule #1. Proper use of combined js, css and image sprites will give your app the biggest bang for the buck in web app performance. If you’re using WordPress, you can use the W3 Total Cache plugin to automate some of this for you. If you’re designing your own large scale web application, then you need to design this yourself. I really liked how the authors addressed the concern of the browsers doing 2 or 6 concurrent connections. This points out the fact that if you combine 50 images into 1 sprite, and it’s the only page dependency, you’re not taking advantage of the concurrent download capability of the browsers! Seeing as how different browsers have different concurrency settings, and different users have differing latency conditions, there is no perfect formula. The best way is to experiment with different browsers from different locations until you find the happy medium. Here is a good tool to do this: http://www.webpagetest.org/. In fact, if you click on the “Page Speed” tab after running a test from here, many of the rules from this book appear on the list. This is yet another example of multiple companies coming to the same realizations about web application performance and scalability at the same time. I’d like to reiterate that the rules in this book combine nearly every rule I’ve heard (and experienced) over the years into a single, short, easily digestible reference.

Rule 6 Use Homogenous Networks
This was a bit of an aha moment for me. It seems like common sense now that I hear it, but I hadn’t thought of this before. Not exactly, anyway. I’m sure others I work with have. All of our network equipment comes from the same vendor. We’re often tempted to stray by cost reduction or shiny new features, but the convenience of a single vendor from a supportability perspective has maintained loyalty. I like the authors’ new perspective (at least new for me), that homogenous network equipment gets along much better. I liked the analogy to the browsers, which all implement the same HTTP standards, but somehow seem to display just a little differently. It’s never quite as interoperable as you might think.

Rule 7 Design to Clone Things (X Axis)
Been doing this for years. Write to the master and read from the slaves is old school. It seems to come more naturally for the Linux side of the house, but the Windows guys are doing it too.

Rule 8 Design to Split Different Things (Y axis)
Breaking your application apart by service and/or resource is becoming more natural now. The teachings of SOA have been ringing for years. I know I need modular designs. Sometimes to bring down costs for launch, you might have several services running on a single logical layer (e.g. the admin GUI and the display running on the same web server(s)), and a modular design up front makes splitting these out later a snap, for example when you have the revenue stream to justify the capital expense of more servers.

Rule 9 Design to Split Similar Things (Z axis)
Now we’re getting into some advanced topics! Whoot whoot, this was an exciting rule. For most, the concept of sharding their database is not instinctive. We need to find a way to turn this around, because it’s a lot easier to design for sharding up front than it is to go back and retrofit your application years later! I’ve done both. In some cases, when users are independent of each other and not sharing or interacting with each other’s data, you can POD your system. 1M users per POD, for example. This provides a level of isolation protection. Say you have 10M users across 10 PODs. If something goes wrong in 1 POD, 10% of your users are affected, not 100%! Even if you don’t have 10M users, say you have 10K users, you may want a POD architecture for this isolation protection, even if it costs more. Depends on what’s important to you.
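To make the POD idea concrete, here is a minimal sketch. The modulo routing, the NUM_PODS constant, and the function names are my own assumptions for illustration; a real system might assign users to PODs by range or by a lookup table instead.

```python
# Hypothetical sketch: route each user to one of a fixed number of PODs.
NUM_PODS = 10  # assumed value, matching the 10-POD example above


def pod_for_user(user_id: int, num_pods: int = NUM_PODS) -> int:
    """Return the index of the POD that owns this user's data (simple modulo routing)."""
    return user_id % num_pods


def affected_fraction(failed_pods: int, num_pods: int = NUM_PODS) -> float:
    """Fraction of users affected when some PODs fail, assuming users are spread evenly."""
    return failed_pods / num_pods
```

With 10 PODs, affected_fraction(1) comes out to 0.1, which is the "10% of your users, not 100%" isolation described above.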

Rule 10 Design Your Solution to Scale Out – Not Just Up
This seems so clear, we’ve been living this rule for years. It’s funny, though, how even though we know this, there are still some areas we struggle to apply it correctly. Mainly in database systems that deal with customer sales data. It’s so much easier if all the data about your customers is in one place. For applications, the product we’re selling, though, absolutely, design to scale out every time.

Rule 11 Use Commodity Systems (Goldfish not Thoroughbreds)
Same comment as rule 10, but this time with storage. It’s so hard to make the cutover from the high end vendor storage solution to grass roots commodity filesystems, like HDFS. For everything else, like web servers, database servers, caching servers, application servers, go commodity.

Rule 12 Scale Out Your Data Centers
This was an aha moment for me. Once pointed out, it seems like common sense, but I hadn’t thought of it like this before. I thought about running an application in multiple data centers hot-hot-hot for performance reasons (get the app closer to the user), but had not thought about it from a risk reduction with cost saving perspective before. Here’s the basic premise: If you need 10 servers to handle peak load, and you need multi-dc reliability, then the inclination is to run hot-cold in DC1 and DC2 with 10 servers in each DC. That’s 20 servers. Instead, can you run hot-hot-hot across 3 DCs with 5 servers in each!? If so, this is only 15 servers, with the same level of redundancy/reliability (you can lose a DC and still have the 10 servers needed for peak), and reduced cost, therefore greater scalability. Fabulous idea. The only con to this is that running a web application hot-hot-hot across 3 data centers is a kind of holy grail in computing these days. If it requires ring replication of the databases with eventual consistency, this comes with some trade-offs. In other words, easier said than done.
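The server math above can be sketched as a small function. This is just my own illustration of the arithmetic, assuming load spreads evenly across the surviving data centers and that you only need to survive the loss of one DC; the function name is made up.

```python
import math


def servers_needed(peak_servers: int, num_dcs: int) -> int:
    """Total servers required so that losing any one DC still leaves peak capacity.

    Assumes traffic can be spread evenly over the remaining data centers.
    """
    if num_dcs < 2:
        raise ValueError("need at least 2 data centers to survive losing one")
    # The surviving (num_dcs - 1) data centers must carry the whole peak load.
    per_dc = math.ceil(peak_servers / (num_dcs - 1))
    return per_dc * num_dcs
```

For a peak load of 10 servers, servers_needed(10, 2) gives 20 and servers_needed(10, 3) gives 15, matching the numbers in the example.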

Rule 13 Design to Leverage the Cloud
This is a fabulous consideration, depending on the scale of your application. If you’re a startup, this is a great way to get off the ground and even scale up to a degree. Watch out though, there is a trade-off mentioned in another part of the book; a myth about the use of virtualization. It’s in Rule #11, where vendors will try to sell you a super computer for a sharp mark-up and say you can virtualize and run many systems on one. The problem is with the overhead of the virtualization, and every CPU/core added to a single system comes with some loss of performance. You know how you can’t necessarily get a job done 2x as fast with 10 people as you can with 5, because of the communication and coordination overhead? Same thing goes for virtualization. For max capacity, use lots of commodity systems, instead. This is the problem with Rule #13, and the authors of the book know it. This is why they recommend leveraging the Cloud for batch jobs, or other temporary computational needs, and not as the premise of your main application. This is true if you’re designing a massively scalable, world class web application. Back to the startup idea, I see nothing wrong with getting your new product live using Cloud services. It’s a great way to gain entry into the market on a minimal budget. It’s not until you reach the level that you need multiple racks of equipment full time that you need to switch over to colo or running your own data center. It’s going to be a major project to convert, but if you don’t have the hundreds of thousands or millions of dollars of capital to start, Cloud compute is the way to go.

Rule 14 Use Databases Appropriately
This is an area that needs some deep study. We’re getting into NoSQL and other tactics, making appropriate choices for what technology to use where. If you’re a hammer, everything looks like a nail. If you’re a DBA, every problem gets solved with the relational database. If you’re an architect, you need to understand all the choices, have lots of tools in the toolbelt, and pick the right solution for the right problem. Knowing when to use NoSQL tactics, and which one to use, is key. Getting to know the NoSQL tactics and how to use them appropriately is one area that many developers need to study, especially considering most of us were schooled before 2005, when web application development consisted of a programming language on the web server, talking to a database. It’s hard to break that paradigm and think in terms of NoSQL.

Rule 15 Firewalls, Firewalls, Everywhere!
I loved the house analogy, with no protection on the items in the front yard, deadbolt on the front door, privacy locks on the interior bedroom and bathroom doors, no locks on the closets.

Rule 16 Actively Use Log Files
Sometimes the amount of engineering and maintenance that goes into centralizing log files seems like overkill, but then when you need it for troubleshooting, it’s certainly convenient. This is often an afterthought in a system. You start out with a few servers and it isn’t hard to admin it with a few PuTTY windows tailing all the log files. Then, when the number of servers grows into the hundreds, this is no longer feasible. I think it’s ok to postpone the design of this until your system reaches that critical size, but good to remember that you’re going to need to make time for it when it does. Could easily be an oversight, and then an “oh crap” moment when you realize it needs to be done.

Rule 17 Don’t Check Your Work
I never insert into a database, then select to make sure it worked. Never; it never even crossed my mind to do this. Where I do get into trouble, though, is with systems that write to the master and read from the slaves. Let’s say you do an “UPDATE tblename SET col=col+1 WHERE col2=1”, and now you need to display the resulting table! We’re not checking our work, we’re selecting data we need for immediate display after doing an update. The result of “SELECT col FROM tblename WHERE col>2” from one of the slaves might not reflect the update yet, because it hasn’t been replicated! To accommodate this problem, design the application to be able to read from the master too. The design principle is to read from the slaves, unless you just wrote to the database. One of the better design patterns I’ve seen is where this is abstracted in the DB abstraction layer. Once a write hits a database, all subsequent reads to this same database for this page load will hit the master. This is SO much simpler than trying to make the main application logic pick which database tier to use. Refer to rule #1. You might be able to squeeze a tad more performance by putting this into the application logic, because your app logic can know whether subsequent SELECTs are affected by the previous write, but the complexity trade-off of tracking this isn’t worth it. In fact, the application logic shouldn’t know anything about the master/slave tiers.
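Here is a rough sketch of that "read from the master after a write" pattern in a DB abstraction layer. The class name and the SELECT-prefix check are my own simplifications, and I’m using two in-memory sqlite3 databases as stand-ins for a master and a lagging slave; this is not any particular library’s API.

```python
import random
import sqlite3


class ReadAfterWriteRouter:
    """Hypothetical DB layer; create one instance per page load."""

    def __init__(self, master, replicas):
        self.master = master
        self.replicas = list(replicas)
        self.wrote = False  # flips to True on the first write of this page load

    def execute(self, sql, params=()):
        # Naive classification: anything that is not a SELECT counts as a write.
        is_read = sql.lstrip().upper().startswith("SELECT")
        if not is_read:
            self.wrote = True
        if is_read and not self.wrote:
            conn = random.choice(self.replicas)  # safe to read from a slave
        else:
            conn = self.master  # writes, and all reads after a write
        return conn.execute(sql, params)


def demo():
    """Simulate replication lag: the slave never receives the UPDATE."""
    master = sqlite3.connect(":memory:")
    replica = sqlite3.connect(":memory:")
    for conn in (master, replica):
        conn.execute("CREATE TABLE tblename (col INTEGER, col2 INTEGER)")
        conn.execute("INSERT INTO tblename VALUES (1, 1)")
    db = ReadAfterWriteRouter(master, [replica])
    before = db.execute("SELECT col FROM tblename").fetchone()[0]  # from the slave
    db.execute("UPDATE tblename SET col = col + 1 WHERE col2 = 1")  # to the master
    after = db.execute("SELECT col FROM tblename").fetchone()[0]  # now from the master
    return before, after
```

The application code only ever calls execute(); it never has to know which tier answered, which is the whole point.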

Rule 18 Stop Redirecting Traffic
I like how they said there are perfectly valid uses of redirects. I’ve used the POST-REDIRECT-GET method many many times. It lets users hit the back button w/o the browser asking if you’d like to repost data. This “Would you like to repost” can be detrimental in some cases. For example, right after posting to make a purchase, reposting might purchase it again! In other cases, it’s just highly annoying to the user to get bothered with that question. The meta tag refresh, or worse javascript redirect, should almost never be used. Good point.

I had an aha moment in this section! I’ve always been a fan of .htaccess, because it’s so simple and flexible; you can add rules on the fly w/o restarting Apache! The aha moment came when the authors said that enabling .htaccess means the server has to look for it, in every directory, on every page load, and then, if found, load the rules into memory. This is overhead! I’ve already got a few systems in mind that I need to go adjust, but first to the load test lab to put this theory to the test. It makes sense logically that this overhead would affect overall capacity and performance, but I wonder if it’s a measurable difference. Probably depends on the directory structure depth and the number of files getting loaded per page. Still, I wonder. Have I overlooked an obvious performance detractor all these years!?

Rule 19 Relax Temporal Constraints
This is the holy grail. Facebook is doing it, right? Have you ever noticed that you can post something, then have it disappear, reappear, disappear, then reappear, on subsequent page loads? They don’t care. Eventual consistency is fine. They’re replicating data to nodes all over the world. There is no way they could rely on ACID compliance. It’s really really hard to figure out how to relax temporal constraints, but I get the point. This is one area I need to study to better understand how to do it right.

Rule 20 Leverage CDNs
Yes, yes and more yes. Have to. Been doing this for years, including the cache1.cdnprovider.tld and cache2.cdnprovider.tld trick. It’s interesting how some lower cost CDN options are starting to pop up that start-ups and even individuals or small business with low budget can take advantage of. It’s simple and second nature for us with enterprise level Akamai accounts…. however, even now, with dozens of web applications, we’re still converting apps to leverage CDN years later, or updating apps to further leverage CDN.

One of the difficulties with this is in the deployment. If you have a deployment, and your static content has to be deployed to the origin/CDN first, then your app, think about the complexities that arise here. If you’re deploying a new version of a static file to the CDN, you need to either expire the CDN cache with a tool, rename the file, else use a whatever.jpg?version=1.1 style in the src attribute. I’d recommend using the ?version=1.1 style, because this can be automated to match your release/version number in your application. Now, if you deploy to the origin/CDN first, then deploy your application, you have to consider the interim. Somebody is going to load the new file in the old application. ARGH! Because of this complexity, I’d recommend

Rule 21 Use Expires Headers
This is YSlow/PageSpeed 101, yet when examining almost any web application, it’s easy to find inadvertent oversights in this area. This is a big bang for the buck exercise, because it’s usually really easy to correct.

Rule 22 Cache Ajax Calls
I have a neat example of this: Imagine a Web-Based Email system that fetches message headers with AJAX for display. Click on one, and AJAX loads the message into display. Next the user clicks to go back to the message index. Wouldn’t it be great if the message index was cached in memory, so it doesn’t have to go back to the server to get it again!?

Rule 23 Leverage Page Caches
Absolutely, especially where the pages are dynamic, but don’t change frequently. For example, on a blog, where new articles and comments only come in slowly.

Rule 24 Utilize Application Caches
This area confused me a bit at first. What is application cache? Is it offline browser cache? Or is it opcode cache, like APC for PHP? Or does it just mean getting application data loaded into memory? I’ll have to come back to this topic later to get a better understanding. This may not be the authors’ intention, but I’d highly recommend APC for byte code caching in PHP applications. It’s amazing the performance improvement to be had with this, especially in high concurrency conditions.

Rule 25 Make Use of Object Caches
This can be done with APC also, to store things like DB query answers that don’t change frequently. For example, if you have app configuration data in the database, use APC to hold these values in memory on the web servers and only update once a minute in a separate process. This removes the need to do a round trip to the database to fetch values that rarely change. Under high concurrency conditions, querying the database over and over again for the same information is an awful waste of round trip transactions. APC lets you store it in shared memory space on the web server for high speed fetch.
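To show the shape of the idea, here is a minimal in-process object cache with a 60-second lifetime. This is a Python stand-in I made up for illustration, not APC itself, and it refreshes lazily on the next read after expiry rather than from a separate process as described above.

```python
import time


class TTLCache:
    """Hypothetical in-memory object cache with a time-to-live per entry."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the expiry logic can be tested
        self._store = {}  # key -> (expires_at, value)

    def get(self, key, loader):
        """Return the cached value, calling loader() only when missing or expired."""
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: no round trip to the database
        value = loader()  # cache miss: one round trip, then remember the answer
        self._store[key] = (now + self.ttl, value)
        return value
```

Usage would look like cache.get("app_config", load_config_from_db), where load_config_from_db is whatever (hypothetical) function runs the real query. However many requests arrive within the minute, the database only sees one of them.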

Rule 26 Put Object Caches on Their Own “Tier”
Hmmm, not sure I agree with this one. Take the example above of the application configuration values loaded into APC memory cache. Let’s say you have 10 web servers. Isn’t it going to be better to have these values in memory on each web server, rather than needing to do a round trip over the ethernet to fetch them? I get the independent scaling advantage, but at the cost of network round trip!?

Another problem with caches on their own tier is availability. With memcached, if you have say 3 servers, then the web servers and application servers are configured for the 3 servers and do a modulus calculation on the key to pick which of the 3 servers to use. If one of the 3 servers goes down, then all of the web servers and application servers are busted. You need to be able to quickly reconfigure the whole system for running on 2 memcached servers. This is a major availability flaw. Now, on the other hand, if each application server and web server maintains its own independent memcached cache, and one goes down, you can just take it out of rotation and the rest of the system is unaffected!
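The modulus calculation can be sketched like this. The CRC32 hash and the server names are assumptions I picked for the example; real memcached clients vary in how they hash keys. The second function measures how many keys land on a different server once the list shrinks from 3 to 2.

```python
import zlib


def pick_server(key: str, servers: list) -> str:
    """Pick a cache server with a simple hash-modulus scheme."""
    return servers[zlib.crc32(key.encode("utf-8")) % len(servers)]


def remapped_fraction(keys, before, after):
    """Fraction of keys that map to a different server after the list changes."""
    moved = sum(1 for k in keys if pick_server(k, before) != pick_server(k, after))
    return moved / len(keys)


# Hypothetical server names for illustration.
THREE = ["cache1", "cache2", "cache3"]
TWO = ["cache1", "cache2"]
```

With an evenly distributed hash, you would expect roughly two thirds of all keys to move when going from 3 servers to 2, including about half of the keys that lived on the two healthy servers. So even after the reconfiguration, most of the cache is effectively cold.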

Rule 27 Learn Aggressively
This is a key philosophy for any team to succeed. Be hyper-critical, never content; there is always a better way. Burn me once, shame on you, burn me twice, shame on me. Don’t fall victim to the same problem over and over again. From ITIL training, separate incident management from problem management. Incident management’s goal is to restore service as fast as possible. Problem management’s goal is to find root cause and resolve it, or work around it. Problem management is a must-have discipline that I would recommend be run in a centralized and consistent fashion.

Rule 28 Don’t Rely on QA to Find Mistakes
Amen. We’re all responsible for quality assurance. This means best practices like unit testing, frequent integration and peer reviews are paramount. Any developer that hands over code to the testers that doesn’t perform the core functionality should be subject to public humiliation. A bug that takes an odd sequence of steps to reproduce is acceptable; after all, that’s why we have testers.

Rule 29 Failing to Design for Rollback Is Designing for Failure
Oh man, did I have about 3 deja vu experiences reading this section. Although, I have to say, full compliance with this rule is easier said than done. I’ve done deployments where, even following the simple rules, we still couldn’t convert the new data back to the old schema after hours of running. The scenario of deploying at night, when an unexpected problem will have the least impact, only to hit a problem in the morning when load arrives, is very real. One of the biggest problems that arises from this is that the engineers stay up all night to deploy it, then in the morning, you have to wake them up super early to solve a very complex problem on very little sleep. Because of this, I would recommend that many deployments be done at 5am, if you’re on the West coast. This is 8am East coast, just before the East coast rush hour. What you do is send your teams home early the day before and have them all get a good night’s sleep, and come in super early. Do the deployment, and stay on to monitor it throughout the day. If something goes wrong, the team is fresh and ready to handle it. The ability to roll back and lose up to an hour’s worth of data is a far better choice than to roll back and lose several hours’ worth of data.

Rule 30 Discuss and Learn from Failures
How is this different than Learn Aggressively? I’d recommend applying good ol’ fashioned ITIL incident and problem management here. It’s conducive to ensuring that incidents are resolved quickly, problems are documented, and take-aways are followed up on.

Rule 31 Be Aware of Costly Relationships
Interesting… what was once the rule, fully normalized data, is now considered a trade-off. In order to be able to shard data, you might need to relax some of the foreign key constraints and referential integrity.

Very interesting. Reminds me of a recent story in DB design. Lesson in DB design: Be very wary of auto-incrementing id columns and what data type they are. A signed 32-bit integer is only good to about 2 billion (2^31 - 1). Even worse, some DB systems will actually roll the id to min-int once it hits max-int! Think about what this is going to do to your application! So, assuming this happens to you by accident due to a poor design choice, what do you do? You want to convert the column to a 64-bit unsigned integer, but can’t, because you now have negative values. So, first, convert to a 64-bit signed integer. To do this, any foreign key constraints you have must be modified to match, so you have to remove them first, change the primary key in the main table, and then the foreign keys in other tables, then add the foreign key constraints back in. Yuck! Now what!? In order to get the negative values to positive, so you don’t have to live with this forever, you have to change each id value to the next available positive integer. Hopefully you have cascading updates (ON UPDATE CASCADE) turned on for the foreign keys in your database, so all the foreign keys change automatically. This is the kind of costly relationship the authors are talking about. They are very convenient in times like this. They’re also very convenient in many cases, like cascading deletes, in your application design. Lack of referential integrity almost always leads to data inconsistency. This is a pretty tough trade-off to live with.
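To put numbers on that story, here is a small sketch of the rollover and of the "move negatives to the next available positive id" cleanup. It’s a plain Python simulation with function names I made up; it doesn’t touch a real database, and the wrap-around behavior is the worst case described above, not something every DB system does.

```python
INT32_MAX = 2**31 - 1  # largest signed 32-bit value, a little over 2 billion
INT32_MIN = -(2**31)   # where a wrapping counter lands after overflow


def next_id_wrapping(current: int) -> int:
    """Simulate a signed 32-bit auto-increment that rolls over to min-int."""
    nxt = current + 1
    return INT32_MIN if nxt > INT32_MAX else nxt


def remap_negative_ids(ids):
    """Map each negative id to the next free positive id, oldest rollover first."""
    next_free = max((i for i in ids if i > 0), default=0) + 1
    mapping = {}
    for old in sorted(i for i in ids if i < 0):
        mapping[old] = next_free
        next_free += 1
    return mapping
```

The mapping returned by remap_negative_ids is what you’d apply to the primary key once the column is 64-bit, with the cascading update carrying the change through to the foreign keys.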

Rule 32 Use the Right Type of Database Locks
Many moons ago, I got certified in MySQL, back in the 4.1 days, when they only had 1 certification. As an application developer, this training came in handy SO many times. I can tell you, developers that lack a strong understanding of DB principles are at a significant disadvantage. Developers don’t need to be full DBAs, and don’t need to know the intricacies of replication, for example, but do need to understand DB engines and SQL. In my experience, developers need to be able to write their application without a full time DBA working side by side. In fact, systems that only allow developers to call stored procedures written by DBAs grossly slow down application development. There is too much back and forth, the developer doesn’t understand the data structure, and the DBA doesn’t understand what the application is trying to do. I’m not saying it can’t be done, but if you don’t nearly pair program the developer and DBA, it’s going to be stumble after stumble.

Rule 33 Pass on Using Multiphase Commits
Totally agree. This is where it’s good for the DBA and Developer to be on the same page. Think about how easy it would be for a DBA to turn on multi-phase commits, and the developer to have no idea what this even is. This is what happens when the DBA doesn’t have a full understanding of the application requirements and implements standards he *thinks* are right for the system. Yet another reason why Developers need an education on the DB systems they’re interacting with: it gives them the right questions to ask the DBA and the ability to pinpoint topics of debate.

Rule 34 Try Not to Use “Select For Update”
I’ve never personally been inclined to select for update. I’ll have to think about this one and examine systems, and question developers and DBAs, to see if I can find use of this.

Rule 35 Don’t Select Everything
This is an old lesson…. and it amazes me how often I still see SELECT * FROM… When writing code, it’s just so much easier to type SELECT * than it is to type out a whole string of column names, so I understand why I still see it. I might argue that there are some cases where it’s ok…. for example, if code working on the result set needs nearly all the data returned, and the codebase uses fetch_assoc, there is no threat of having a new column poison the result set. I agree, best not to be in the practice of SELECT *, though. I’m just wondering, to what level of scrutiny should I enforce this in others? I tend to relax a bit on this, I have other battles to fight.

Rule 36 Design Using Fault Isolative “Swim Lanes”
I really like the call to standardize on our terminology. I use all the terms listed for nearly the same definition the authors gave, except swimlane. I’ve used swimlane to describe a sequence of interactions between components, but never to describe a POD. lol. I’m very familiar with the concept it describes, however. This is an area that can take a lifetime to master. These problems didn’t exist 6 years ago, at least not for me, and now it’s more prevalent than ever.

It’s really amazing how fast the world’s Internet computer systems have evolved. Phenomenal. I couldn’t have even dreamed of this 15 years ago. 10 years ago, I guess I may have fantasized about such a thing, but it’s just amazing the new world we live in. Anyway, had a moment there, back to the review:

Rule 37 Never Trust Single Points of Failure
Amazing how much effort goes into identifying and eliminating SPOFs, and yet we can still have crashes that send admins scrambling to figure out how “the unsinkable ship is sinking”! Have to continue to work super hard at this, and test the theories. Many of the applications that I oversee perform failure testing. If the app is supposed to gracefully degrade when services go offline, test it. Turn off the database (in the test env, of course) and see. A nifty trick: Inject an /etc/hosts file entry to force a db hostname to a bogus IP address. Do this for all of your dependent services, not just the DB. Do it often.

Rule 38 Avoid Putting Systems in Series
I don’t see how we can avoid this, but I had an aha moment reading this. Even though we can’t avoid it, we can work to minimize it, and be very aware of it. I really liked the web server -> app server -> db server example, because it enlightened me to the uptime SLA problem with this. If we’re shooting for say 99.9% uptime, and the web server has 99.9%, the app server has 99.9% and the db server has 99.9%, then the overall system uptime is: 0.999 * 0.999 * 0.999 ≈ 0.997, or only about 99.7%!
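The uptime arithmetic generalizes to any number of tiers in series. Here’s a quick sketch, assuming the tiers fail independently (my assumption; correlated failures would change the numbers) and with function names of my own choosing.

```python
import math


def series_availability(availabilities):
    """Overall availability of components in series: every tier must be up."""
    return math.prod(availabilities)


def downtime_hours_per_year(availability, hours_per_year=8760):
    """Expected hours of downtime in a year at the given availability."""
    return (1 - availability) * hours_per_year
```

Three tiers at 99.9% each work out to an expected 26 or so hours of downtime a year, versus under 9 hours for a single 99.9% component.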

Rule 39 Ensure You Can Wire On and Off Functions
I use this concept a lot, and in every way described, the most common being the DB markdown.

Rule 40 Strive For Statelessness
For me, an application needs statelessness. The most common reason for needing state is the session on the web tier. I like a NoSQL store, such as memcached, for session state. Recall the problem I described earlier with a multi-server memcached tier and the configuration nightmare that occurs if you lose a memcached server. Yeah, that makes this problem harder.

Rule 41 Maintain Sessions in the Browser When Possible
I’m not even close to being convinced on this. In fact, I’ve seen in many apps this horrible thing called VIEWSTATE that gets passed around from page to page, and it seems to get bigger and bigger and bigger. This is usually found in performance review of an application and the Windows app developer is stunned and then figures out how to turn this off. Seeing as how it’s super important to have a very fast website, it’s rarely going to be a good idea to maintain sessions in the browser.

Rule 42 Make Use of a Distributed Cache for States
See comment on Rule 40. I would recommend this rule 42 over rule 41, b/c page response time is almost always more critical.

Rule 43 Communicate Asynchronously As Much As Possible
Yes, this is a great topic, and also a bit of a holy grail. Imagine you’re loading a web page, and to do so, you need to do 50 SQL queries. A typical web application will perform all 50 in sequence. Wouldn’t it be better if you could fork, create 5 threads, and complete all 50 queries up to 5 times faster!? See http://gearman.org/ for a study on this topic. Gearman is a little on the immature side, because it has no authentication mechanism and no built-in SSL, but I’ve seen it used to solve a real world problem. I can describe this in detail, but won’t for the sake of brevity.
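To make the 50-queries idea concrete, here is a sketch using a plain thread pool from the Python standard library, not Gearman. The run_query function is a hypothetical stand-in that just sleeps to simulate waiting on the database. It assumes the queries are independent of each other, and the 5x speedup is a best case.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_query(query_id):
    """Hypothetical stand-in for one SQL query; sleeps to simulate I/O wait."""
    time.sleep(0.01)
    return query_id * 2


def run_all(query_ids, workers=5):
    """Run independent queries on a pool of threads, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_query, query_ids))


# 50 queries spread over 5 worker threads instead of run one after another.
results = run_all(range(50))
```

This only pays off when the queries spend most of their time waiting on the database. If query 2 needs the result of query 1, they still have to run in sequence.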

Rule 44 Ensure Your Message Bus Can Scale
Yes. I’ve learned a few lessons the hard way on this one. Thanks for the reminder!

Rule 45 Avoid Overcrowding Your Message Bus
Good point.

Rule 46 Be Wary of Scaling Through 3rd Parties
Constant struggle between DIY and buy. The basic rule of thumb is: if it’s part of your core functionality, better DIY. There are certainly times it makes sense to buy. For example: servers. Sure, there are some out there building their own servers, but this requires a department dedicated to it, and then lots of haggling with the component vendors. There is no way out of dealing with vendors altogether. Be wary is good advice. Have to remember the salesmen are trying to make sales quotas, and will say just about anything to talk you into it.

Rule 47 Purge, Archive, and Cost-justify Storage
So true. Hard to do, takes a lot of engineering to get it just right, but there are opportunities for massive cost savings when done right. I haven’t seen a proper implementation of the tiered storage concept, except for this: local disk -> NFS -> tape. I mean I’ve never seen one that keeps frequently accessed data on high speed SAN, medium on NFS, infrequent on JBOD, and even older data compressed. This would be a whole lot of moving data around, and there is going to be overhead in maintaining all these different tiers and making sure the whole system is highly available. Certainly a kind of holy grail concept. Been dreaming of this for years!

Rule 48 Remove Business Intelligence from Transaction Processing
Agreed! You saw my previous statement about stored procedures. This is another example of hammer/nail. DBAs always want to solve every problem with the database. Can’t blame them, that’s what they know how to do. It’s always good to consider the option, sometimes it’s best to let the database do it, but I agree, most of the time, have the application do the business logic.

Rule 49 Design Your Application to Be Monitored
Too funny, I’ve used both, back in the startup days: Nagios and Cacti. Have since outgrown them. Good points about app monitor-ability. A very interesting problem is identifying true root cause. Imagine a problem that has a chain reaction and triggers many alerts. Sometimes you have to navigate through these to find patient zero.

Rule 50 Be Competent
What a great 50th rule! This is fantastic, because it reminds everyone to have knowledge around all the components. The true technical leaders will find a way to be generalists in all, specialists in many. Yet, as a company gets bigger, the roles get more and more granular to separate the duties. Product Marketing Manager, Product Manager, Dev Manager, QA Manager, Developer, QA, DBA, Network Architect, Network Engineer, Network Operator, Architect, Windows Engineering, Windows Operations, Linux Engineering, Linux Operations are all different people now. In the startup days, people wore many hats. I find the best technologists have some startup experience where they wore many hats and had their hands in everything. Once you grow and get into the big company structure, this separation of duties occurs. The software developer never touches the physical hardware, nor has access to the production server. The operators know the production systems, but not the software and communication flows. It can certainly be a challenge for all employees to remain competent in this environment. To address this, I’d suggest implementing a shadowing program where employees are required to learn the roles of other employees.

Wow, did this take a long time to write. Took me as long to write this blog post as it did to read the book. The book inspired me. It was very thought provoking and was right up the alley of what I work on every day. I very much enjoyed this book, in case you couldn’t tell. Thank you Marty and Fish! It’s 2AM AZ time, and I have work at 9! Gotta go – glad I got this review posted before the work week.

Dave

79 thoughts on “Scalability Rules – Review”

DaveK

September 17, 2011 at 5:00 pm

@Marty – You’re welcome. Thanks for writing this book! And thanks for commenting on my blog, it’s not every day I get a book’s author to write a message on my blog. This is better than an autograph!

@Bill – The 97 things is pretty neat and more geared towards software architecture. It’s a good complement to Scalability Rules, IMO. The Architecture of OS is pretty neat, I’ve used most of these apps. Software architects should study these to have an arsenal of tactics to deploy when dreaming up new applications. Your examples are very software centric, whereas Marty’s book is very systems centric. I like both topics. Systems architecture is the 50K foot view, and software architecture is the 10K foot view.
