Archive for June 2008
As of today, we have had 88 responses. While I’m sure the sample size and sampling method were far from perfect, the survey did produce some interesting numbers that we wanted to share.
Note that the following graphs don’t represent all of the data collected from our survey. This is due to the fact that I made the questionable choice of making all the answers text fields. While this allowed us to see a lot of interesting and unexpected responses that a simple set of checkboxes would not have yielded, it also meant that processing the data was a pretty labor-intensive task. If you are curious about the answers from other questions, email me and I’ll crunch the data and post a graph.
Ultimately, these answers only led to more questions. If you’d like to help us out, we have another (much shorter!) tools survey you can fill out. Thanks!
Note: for the next two questions, I have excluded projects that have no unit tests
This last table is a bit strange, but I still think it’s interesting. You need to look at the relative number of 1-person, 2-person, etc., teams above to really understand it in context. The table shows code coverage by team size. So, for instance, the top left cell says that “37% of 1-person teams have code coverage equal to or greater than 10%” while the top right cell says “4% of 1-person teams have code coverage equal to or greater than 100%”. Hopefully that makes some sense.
[Table: code coverage % by team size]
We have spent a bit of time looking into various Ruby messaging systems. We briefly posted about the speed of Ruby messaging in the past and promised some more detailed numbers. We will share a bit of code to run some basic tests on various Ruby messaging systems, and benchmark the performance. We are sure you can do more to get even more accurate results, but these were good enough for our purposes and we thought we should share them.
We decided to put X messages onto each messaging system and then take them off again. Our baseline for the best performance was a standard Ruby in-memory queue. To reduce the effects of initial loading, we ran a small staging pass against each of the queues and threw away the first set of results. We ended up comparing SQS, Starling, and Beanstalk. We ran tests using local queue servers running on the same machine as the tests (LocalStarlingQueue, BeanstalkClient). We also ran the queue servers remotely on an EC2 server (BeanstalkClientRemote, RemoteStarlingQueue). We are only showing results for 10 SQS messages, because it was so slow with any sizable number of messages. We quickly found that both Beanstalk and Starling were over 10x faster than SQS on the remote servers, and far faster still on local servers. Surprisingly, running Starling and Beanstalk between multiple EC2 instances is almost as fast as having local queue servers, and completely leaves SQS in the dust (1000x faster).
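The approach described above can be sketched roughly as follows. This is a minimal illustration, not the harness we actually used: the MemoryQueue wrapper, the helper names, and the warm-up size of 10 are all assumptions, and only the in-memory baseline is shown. A wrapper for another messaging system would need to expose the same put/take pair.

```ruby
require "benchmark"

# Hypothetical wrapper around Ruby's built-in thread-safe Queue.
class MemoryQueue
  def initialize
    @queue = Queue.new
  end

  def put(message)
    @queue.push(message)
  end

  def take
    @queue.pop
  end
end

def mean(values)
  values.sum / values.size.to_f
end

# Population standard deviation of the samples.
def std_dev(values)
  m = mean(values)
  Math.sqrt(values.sum { |v| (v - m)**2 } / values.size)
end

# Times every put, then every take; returns per-message wall-clock seconds.
def time_queue(queue, count)
  put_times  = Array.new(count) { |i| Benchmark.realtime { queue.put("message #{i}") } }
  take_times = Array.new(count) { Benchmark.realtime { queue.take } }
  { put: put_times, take: take_times }
end

# Runs a small staging pass first and throws its timings away.
def run(queue, count, warmup: 10)
  time_queue(queue, warmup)
  time_queue(queue, count)
end

# Prints mean, std dev, and the slowdown relative to the baseline timings.
def report(name, timings, baseline)
  puts "#{name}:::::"
  %i[put take].each do |op|
    printf("mean time for %s: %.6f\n", op, mean(timings[op]))
    printf("std dev for %s: %.6f\n", op, std_dev(timings[op]))
  end
  %i[put take].each do |op|
    puts "#{op} mean is #{mean(timings[op]) / mean(baseline[op])} slower than MemoryQueue"
  end
end

baseline = run(MemoryQueue.new, 1_000)
report("MemoryQueue", baseline, baseline)
```

Timing each message individually is what makes a per-operation mean and standard deviation possible; timing the whole batch would only give a total.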
When we started running larger tests we found some interesting results. Compare the two 100-message sets of tests, in which we run remote queue servers. The first set shows that Starling is faster by a decent amount, mostly because taking messages off the queue is significantly faster. This is because on Starling removing an item from a queue is a single operation, while on Beanstalk it is two operations: get and remove. This made Starling seem better for our needs, since in our app any job taken should be considered completed. Once we moved to messaging between EC2 instances, though, the overhead of the extra operation all but disappeared, because the internal network between EC2 machines is very fast.
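One way to see the one-operation versus two-operation difference is to divide the mean take time by the mean put time for each queue. The short sketch below does that with the mean times copied from the two 100-message result sets in this post; the hash layout and the helper name are just for illustration.

```ruby
# Mean seconds per operation, copied from the two 100-message result sets.
RESULTS = {
  "local client to EC2" => {
    "BeanstalkClientRemote" => { put: 0.113252, take: 0.227982 },
    "RemoteStarlingQueue"   => { put: 0.132354, take: 0.112539 }
  },
  "EC2 client to EC2" => {
    "BeanstalkClientRemote" => { put: 0.000353, take: 0.000527 },
    "RemoteStarlingQueue"   => { put: 0.000773, take: 0.000805 }
  }
}.freeze

# How many "puts" one take costs, rounded to two decimals.
def take_to_put_ratio(times)
  (times[:take] / times[:put]).round(2)
end

RESULTS.each do |setup, queues|
  queues.each do |name, times|
    puts format("%-20s %-22s take/put = %.2f", setup, name, take_to_put_ratio(times))
  end
end
```

Over the slow link, a Beanstalk take costs about two puts (ratio 2.01) while a Starling take costs a bit less than one (0.85), which is consistent with two round trips versus one. Between EC2 instances the Beanstalk ratio drops to about 1.49, and its absolute take time (0.000527s) is lower than Starling’s (0.000805s).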
The speed of Beanstalk between EC2 instances was one of the reasons we eventually went with it as our messaging choice. The other isn’t really shown in our results here, but running tests on Starling over time shows that the system begins to slow down with use. I am assuming that this is related to Starling persisting queues to disk, and the overhead related to persistence. In fact, I found that if I killed the Starling server and cleared its persistent storage, Starling would return to its original performance. We currently have no need for persisting our queues, so taking a performance hit to support that feature was the final reason we went with Beanstalk.
We also looked at ActiveMessaging, but ultimately decided not to take the time to implement and test it. I would love to see others take our messaging test harness and put other messaging systems through their paces. We have decided to share the code we used to generate the results you see below. Feel free to contact us if you have any questions, or if you add any other queues to the tests.
UPDATE: I updated the results to include runs with 10,000 and finally 100,000 messages, because there was some interest in seeing those numbers. There is also some interesting discussion about the results going on at the beanstalk Google group.
Running 10 messages on all systems

Queue type                  user     system      total        real
MemoryQueue:            0.000000   0.000000   0.000000 (  0.000067)
LocalStarlingQueue:     0.000000   0.010000   0.010000 (  0.015040)
BeanstalkClient:        0.010000   0.000000   0.010000 (  0.005700)
SQS1:                   0.120000   0.050000   0.170000 ( 10.608450)
BeanstalkClientRemote:  0.020000   0.030000   0.050000 (  4.263844)
RemoteStarlingQueue:    0.010000   0.020000   0.030000 (  3.366750)

MemoryQueue:::::
mean time for put : 0.000020
std dev for put: 0.000014
mean time for take: 0.000013
std dev for take: 0.000001
put mean is 1.0 slower than MemoryQueue
take mean is 1.0 slower than MemoryQueue

LocalStarlingQueue:::::
mean time for put : 0.000867
std dev for put: 0.0002
mean time for take: 0.001340
std dev for take: 0.000639
put mean is 43.3659117997616 slower than MemoryQueue
take mean is 101.259459459459 slower than MemoryQueue

BeanstalkClient:::::
mean time for put : 0.000166
std dev for put: 0.000026
mean time for take: 0.000288
std dev for take: 0.000020
put mean is 8.2777115613826 slower than MemoryQueue
take mean is 21.7963963963964 slower than MemoryQueue

SQS1:::::
mean time for put : 1.081836
std dev for put: 2.264929
mean time for take: 0.670880
std dev for take: 0.045992
put mean is 54082.8140643623 slower than MemoryQueue
take mean is 50700.4522522523 slower than MemoryQueue

BeanstalkClientRemote:::::
mean time for put : 0.101100
std dev for put: 0.007356
mean time for take: 0.202382
std dev for take: 0.012278
put mean is 5054.16805721097 slower than MemoryQueue
take mean is 15294.6 slower than MemoryQueue

RemoteStarlingQueue:::::
mean time for put : 0.111742
std dev for put: 0.008853
mean time for take: 0.155392
std dev for take: 0.110507
put mean is 5586.18355184744 slower than MemoryQueue
take mean is 11743.4378378378 slower than MemoryQueue

100 messages, Remote Queue servers on EC2 (client and tests running locally)

Queue type                  user     system      total        real
MemoryQueue:            0.000000   0.000000   0.000000 (  0.000165)
BeanstalkClientRemote:  0.130000   0.250000   0.380000 ( 33.909095)
RemoteStarlingQueue:    0.080000   0.170000   0.250000 ( 22.677569)

MemoryQueue:::::
mean time for put : 0.000015
std dev for put: 0.000006
mean time for take: 0.000013
std dev for take: 0.000002
put mean is 1.0 slower than MemoryQueue
take mean is 1.0 slower than MemoryQueue

BeanstalkClientRemote:::::
mean time for put : 0.113252
std dev for put: 0.004145
mean time for take: 0.227982
std dev for take: 0.031950
put mean is 7479.36954810266 slower than MemoryQueue
take mean is 17757.1864438254 slower than MemoryQueue

RemoteStarlingQueue:::::
mean time for put : 0.132354
std dev for put: 0.188740
mean time for take: 0.112539
std dev for take: 0.003803
put mean is 8740.89418989135 slower than MemoryQueue
take mean is 8765.5156917363 slower than MemoryQueue

100 messages, Remote Queue servers on EC2 (client and tests running on a separate EC2 instance)

Queue type                  user     system      total        real
MemoryQueue:            0.020000   0.000000   0.020000 (  0.006392)
BeanstalkClientRemote:  0.010000   0.000000   0.010000 (  0.793841)
RemoteStarlingQueue:    0.010000   0.000000   0.010000 (  1.067932)

MemoryQueue:::::
mean time for put : 0.000009
std dev for put: 0.000102
mean time for take: 0.000030
std dev for take: 0.000769
put mean is 1.0 slower than MemoryQueue
take mean is 1.0 slower than MemoryQueue

BeanstalkClientRemote:::::
mean time for put : 0.000353
std dev for put: 0.002849
mean time for take: 0.000527
std dev for take: 0.002545
put mean is 38.4510216814849 slower than MemoryQueue
take mean is 17.5400444232905 slower than MemoryQueue

RemoteStarlingQueue:::::
mean time for put : 0.000773
std dev for put: 0.004554
mean time for take: 0.000805
std dev for take: 0.006630
put mean is 84.2330369677118 slower than MemoryQueue
take mean is 26.7816119308266 slower than MemoryQueue

10,000 messages, Remote Queue servers on EC2 (client and tests running on a separate EC2 instance)

Queue type                  user     system      total        real
MemoryQueue:            0.040000   0.000000   0.040000 (  0.127432)
BeanstalkClientRemote:  0.390000   0.090000   0.480000 (  7.646054)
RemoteStarlingQueue:    0.070000   0.020000   0.090000 ( 10.685410)

MemoryQueue:::::
mean time for put : 0.000024
std dev for put: 0.001314
mean time for take: 0.000015
std dev for take: 0.000685
put mean is 1.0 slower than MemoryQueue
take mean is 1.0 slower than MemoryQueue

BeanstalkClientRemote:::::
mean time for put : 0.000283
std dev for put: 0.002114
mean time for take: 0.000526
std dev for take: 0.002925
put mean is 11.7913700313271 slower than MemoryQueue
take mean is 35.8050617861459 slower than MemoryQueue

RemoteStarlingQueue:::::
mean time for put : 0.000602
std dev for put: 0.004539
mean time for take: 0.000546
std dev for take: 0.004831
put mean is 25.0511140001964 slower than MemoryQueue
take mean is 37.2107648228277 slower than MemoryQueue

100,000 messages, Remote Queue servers on EC2 (client and tests running on a separate EC2 instance)

Queue type                  user     system      total        real
MemoryQueue:            0.260000   0.040000   0.300000 (  0.677368)
BeanstalkClientRemote:  3.200000   0.940000   4.140000 ( 76.989950)
RemoteStarlingQueue:    0.820000   0.240000   1.060000 (110.507879)

MemoryQueue:::::
mean time for put : 0.000019
std dev for put: 0.000915
mean time for take: 0.000018
std dev for take: 0.001125
put mean is 1.0 slower than MemoryQueue
take mean is 1.0 slower than MemoryQueue

BeanstalkClientRemote:::::
mean time for put : 0.000274
std dev for put: 0.002125
mean time for take: 0.000531
std dev for take: 0.003302
put mean is 14.6932435272862 slower than MemoryQueue
take mean is 30.0050748059956 slower than MemoryQueue

RemoteStarlingQueue:::::
mean time for put : 0.000592
std dev for put: 0.006346
mean time for take: 0.000577
std dev for take: 0.004349
put mean is 31.7037894886494 slower than MemoryQueue
take mean is 32.5781596463523 slower than MemoryQueue
Tip 6: Don’t be dogmatic
There are a lot of best practices for testing that may or may not apply to your situation. Should you have one assertion per test? Should you use mocks and stubs? Should you use Test Driven Development? Or Behavior Driven Development? Should you do interaction or state-based testing? While all of these practices have real benefits, remember that their applicability and value depend largely on your project, schedule, and team.
Don’t be afraid to play, but don’t feel like you need to convert everything to the one, true way to test. It’s fine to have a suite that mixes and matches these best practices. In other words, context is king.
Tip 5: Improve your tests over time
Here’s a terrible idea – decide you are going to spend a whole week building a test suite for your project. First of all, you’ll likely just get frustrated and burn out on testing. Secondly, you’ll probably write bad tests at first, so even if you get a bunch of tests written, you’re going to need to go back and rewrite them once you figure out how slow, brittle, or unreadable they are.
As they say, the best writing is rewriting. You should try out new techniques (and rewrite old test code). But it’s OK to have patchwork tests.
You just found out fixtures suck? (they do). Or that those ‘setup’ methods make your tests less readable? Are you excited about using mocks? Great, apply your new technique to some new tests, rewrite a few old tests, and call it a day. Don’t try to rewrite your whole suite, because you’ll be kicking yourself when you rewrite your suite again after you decide technique X isn’t perfect in all cases.
Just like in production code, good practices take a while to bake and prove themselves. See how maintainable, easy to understand, and easy to read the tests written with a new technique are. You can always move more tests over.
Tip 4: Always write one test
When writing new code, it’s easy to avoid testing because it seems so daunting to test all the functionality. Rather than thinking of testing as an all-or-nothing proposition, try to write just one good test for the new functionality.
You’ll find that having just one test is much, much better than having no tests at all. Why? First of all, it’ll catch catastrophic errors, even if it doesn’t catch bugs in edge cases. Secondly, writing even one test may force you to refactor your production code slightly to make it more testable (which in turn, makes future tests easier to write). Finally, it gives you “test momentum”. If you have no tests, you’ll be inclined to delay testing, since there is more overhead to get started. But if you already have just one test in place, it’ll be much easier to add tests as you think of them (and to write regression tests as you find bugs).
By the way, don’t worry about testing at exactly the right level. Having one functional test is way better than having no tests at all. You can always come back and break the “bigger” test down into more targeted, precise tests.
Tip 3: Test code isn’t production code
Another common mistake is to treat test code just like production code. For instance, you’d like your production code to be as DRY as possible. But in test code, it’s actually more important for tests to be readable and independent than to be DRY. As a result, you’ll want your tests to be more “moist” than DRY. Specifically, you’ll want to use literals a lot more in test code than you would in production.
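To make “moist” a little more concrete, here is a small sketch using Minitest. The discounted_total method and both test classes are made up for illustration; the point is only the contrast between a test that hides its values behind shared constants and one that spells them out as literals.

```ruby
require "minitest/autorun"

# Hypothetical production code: a cart total with a percentage discount.
def discounted_total(prices, percent_off)
  subtotal = prices.sum
  subtotal - (subtotal * percent_off / 100.0)
end

# DRY style: the reader has to look up three constants to follow the test.
class DryDiscountTest < Minitest::Test
  PRICES = [10, 20].freeze
  PERCENT_OFF = 10
  EXPECTED_TOTAL = 27.0

  def test_discount
    assert_equal EXPECTED_TOTAL, discounted_total(PRICES, PERCENT_OFF)
  end
end

# "Moist" style: inputs and expected value are literals inside each test.
class MoistDiscountTest < Minitest::Test
  def test_ten_percent_off_a_30_dollar_cart_is_27
    assert_equal 27.0, discounted_total([10, 20], 10)
  end

  def test_zero_percent_off_returns_the_subtotal
    assert_equal 30.0, discounted_total([10, 20], 0)
  end
end
```

The moist tests repeat the `[10, 20]` literal, and that repetition is the trade being made: each test can be read, changed, or deleted without touching anything else.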
In general, the most important properties of good tests are:
Independent – No test should affect the outcome of any other test. Put another way, you should be able to run your tests in any order and always have the same outcome. A corollary of this is that setup/teardown methods are evil (both because they increase dependence and because they decrease readability).
Readable – The intent of each test should be immediately obvious (both by its name and by its code).
Fast – Each test should run as quickly as possible, so the entire suite is also fast. The faster the suite, the more you’ll run the tests, and the greater benefit you’ll get (because you’ll catch regressions quickly).
Precise – Each test should focus on testing one thing (and only one thing) well*. Ideally, if a test fails, you should know exactly what part of your production code broke by just glancing at the name of the test. Also, if your tests are precise, it’s less likely that a change in your code will require you to change many different tests. In practice, precise tests are short and only have one assertion or expectation per test.
*Note: this doesn’t apply to integration tests, which should make sure all components play nicely together.
We’re trying to get a better feel for what types of web-service tools would be useful to Ruby hackers. You can help us out by filling out our survey. Thanks!
Tip 2: Most code is not written to be tested
Another thing you’ll find when you start testing is that your production code is not very testable. This shouldn’t be surprising – if there were no tests previously, there was no reason to design for testability. This will make your first tests way harder to write and less valuable (i.e. they are less likely to catch real bugs).
There are a few tricks to get around this. First, try testing only new code or just test a smaller side project to start to get the hang of it. When you’re ready to start testing your legacy application, try the following.
1. Write a few very high-level tests. These tests will likely exercise almost the whole system and will interact with the application at the highest-level interface.
2. Refactor out one component of the application so it is more decoupled and testable
3. Continually run your high-level tests to make sure you haven’t broken anything major
4. Write more focused tests for the component you pulled out in step #2
5. Go back to step #2
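Here is a rough sketch of what one trip through that loop might look like, using Minitest. Everything in it is hypothetical: an imagined invoice script whose line parsing has been pulled out into a LineItemParser class, a high-level test that goes through the top-level invoice_total method, and focused tests for the extracted piece.

```ruby
require "minitest/autorun"

# Step 2: a component pulled out of the larger (hypothetical) invoice code
# so it can be exercised without the rest of the system.
class LineItemParser
  # Parses "name,quantity,unit_price" into a hash.
  def parse(line)
    name, quantity, unit_price = line.strip.split(",")
    { name: name, quantity: Integer(quantity), unit_price: Float(unit_price) }
  end
end

# The remaining top-level entry point, now delegating to the parser.
def invoice_total(csv_text, parser: LineItemParser.new)
  csv_text.each_line.sum do |line|
    item = parser.parse(line)
    item[:quantity] * item[:unit_price]
  end
end

# Steps 1 and 3: a high-level test through the top-level interface,
# rerun after every refactoring to catch major breakage.
class InvoiceHighLevelTest < Minitest::Test
  def test_totals_a_two_line_invoice
    assert_equal 25.0, invoice_total("widget,2,5.0\ngadget,1,15.0\n")
  end
end

# Step 4: focused tests for the component pulled out in step 2.
class LineItemParserTest < Minitest::Test
  def test_parses_name_quantity_and_unit_price
    assert_equal({ name: "widget", quantity: 2, unit_price: 5.0 },
                 LineItemParser.new.parse("widget,2,5.0\n"))
  end

  def test_rejects_a_non_numeric_quantity
    assert_raises(ArgumentError) { LineItemParser.new.parse("widget,two,5.0") }
  end
end
```

The high-level test stays in place while the focused tests accumulate, so each later extraction starts with a safety net already there.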
Again, stick with it. As you write more tests, your application will be more testable (bonus: it’ll likely be easier to understand, more loosely coupled, easier to refactor, and more DRY as well!). As it becomes more testable, it’ll be easier to write additional tests. This creates a positive loop where things get better and easier as you go.
We’re big on automated testing here at Devver, but I know a lot of companies aren’t as into it. There’s been plenty written about all the reasons you should be writing tests, but over the next week or so, I’ll give you some tips on how to get started (and if you’ve already got some tests, how to improve and expand your test suite).
I can’t claim to have come up with these best practices, so I’ll litter this post with links to those resources that have taught me something.
A quick word about terminology. When I say “tests” I mean any type of automated tests, which may include unit, functional, integration or any other types of tests. When I say “production code” I simply mean the code that goes into the actual product – i.e. the code being tested.
Tip 1: You’ll probably suck at testing
Writing tests can be frustrating at first. It is usually a lot harder and more time consuming than you’d expect. Unfortunately, some developers assume that the cost of writing tests is fixed and conclude that the benefits can’t possibly justify the time spent – so they quit writing tests.
Writing test code is an art unto itself. There is a whole new set of tricks and skills to learn and it’s difficult to do correctly right away. Stick with it. The better you get, the faster you’ll write tests, and the more your tests will pay off.
Jay Fields recently wrote something interesting:
“Problems with tests are often handled by creating band-aids such as your own test case subclass that hides an underlying problem, testing frameworks that run tests in parallel, etc. To be clear, running tests in parallel is a good thing. However, if you have a long running build because of underlying issues and you solve it by running the tests in parallel.. that’s a band-aid, not a solution. The poorly written tests may take 10 minutes right now. If you run the tests in parallel it might take 2 minutes today, but when you are back to 10 minutes you now have ~5 times as many problematic tests. That’s not a good position to be in.”
This is really relevant to Devver since our first tool will be an easy-to-use, distributed unit test runner.
So, if you have a long-running test suite, is using Devver a solution or a band-aid? It depends on the reason your tests take a long time to execute.
The key phrase Jay uses is “if you have a long running build because of underlying issues.” Clearly, in some cases, having a long test suite is justified. On my machine, the Rubinius suite takes about one minute to run. That’s not a bad thing – they have tons and tons of specs (5675 examples, 20924 expectations, to be exact).
Another example is Rails tests – your integration and functional tests may be slow because they hit the DB and the full Rails stack. It doesn’t make sense to speed them up using stubs, as you should in Rails unit tests, because the whole point of those tests is to make sure your entire application works together.
In these cases, your suite would truly provide less value if you cut or changed tests, but using a distributed test runner like Devver is a huge win – you get feedback much, much more quickly.
On the other hand, as Jay mentions, there are cases where running your tests in parallel might give you a temporary speed up, but ultimately you’re just hiding real problems with your suite. For instance, your unit tests might be too high-level or you might be depending on some slow external system that could be stubbed out.
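As a rough illustration of stubbing out a slow external system, here is a sketch using Minitest and a hand-rolled stub. All of the class names are invented for this example, and the `sleep 2` simply stands in for network latency.

```ruby
require "minitest/autorun"

# Hypothetical slow dependency: stands in for a call to a remote service.
class RemoteRateService
  def rate_for(currency)
    sleep 2 # simulated network latency
    { "EUR" => 0.5 }.fetch(currency)
  end
end

# Production code takes its dependency as an argument, so a test can
# hand it something faster.
class PriceConverter
  def initialize(rate_service)
    @rate_service = rate_service
  end

  def convert(amount, currency)
    amount * @rate_service.rate_for(currency)
  end
end

# Hand-rolled stub: same interface, fixed answer, no network and no sleep.
class StubRateService
  def rate_for(_currency)
    0.5
  end
end

class PriceConverterTest < Minitest::Test
  def test_converts_an_amount_using_the_rate
    converter = PriceConverter.new(StubRateService.new)
    assert_equal 5.0, converter.convert(10, "EUR")
  end
end
```

A unit test written against RemoteRateService would spend two seconds waiting for every assertion. Running such tests in parallel would hide that cost, while the stub removes it.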
Can Devver be both a band-aid and a solution?
Let’s say that you have a test suite that really has underlying problems. What can Devver do for you?
Well, in the short term, we can be that band-aid. As long as you realize that using Devver’s test runner isn’t a fix for your suite’s underlying issues, there’s nothing wrong with getting faster feedback.
However, I think we can do more. As time goes on, we can build Devver tools that actually help you fix the underlying problems by measuring and tracking the quality of your tests over time. For instance, in the future we could integrate with RCov and Heckle and track metrics like application LOC to test LOC ratio, test execution speed, and average number of assertions/expectations per test. If you have ideas as to what other metrics and tools might help you improve the quality of your tests, let us know.