Modern lambda architectures built on technologies such as Spark, Storm and Hadoop allow both batch and stream processing to be done on huge quantities of data at incredible speed. If you're fortunate enough to be doing greenfield development involving big data then you should definitely be looking at these.
But what if you have a big legacy batch system? Should you aim to migrate everything to a new system?
Well, maybe not. Or at least not all at once.
I'm Dan Hanley, CTO at ActiveStandards, where we use these technologies to provide highly robust and scalable Digital Quality Management solutions to global enterprises.
I'd like to share a few things I learned on the way to getting our first Spark-based project running in production.
Don't boil the ocean – yes, it's important to have the big vision of where you want to be, but the real trick is breaking the journey down into small steps that deliver value along the way. In our case this meant building a much scaled-down version of our full production system. It delivers only a small subset of the functionality, but it let us develop and test various topologies and iron out our deployment processes.
Storage tiers are different. Not just different from relational databases, but very different from one another. The architecture and schema you'll use with a column DB like Cassandra will be very different from what you'll have with a document DB like MongoDB. Or maybe you just need a file system like HDFS or S3? You'll need to figure out which is the best fit for your purpose.
Use empirical evidence and keep asking questions. In our traditional batch system we shard the MySQL data across instances by UUID – this gives a uniform volume and load distribution across the cluster, with performance equal to that of the slowest instance. Initially we partitioned our Cassandra data the same way. After a conversation with DataStax we tried a different partitioning scheme – and the tests showed that for Cassandra it was far more appropriate to partition on a much higher-level key, effectively siloing all data for a given client onto one or two boxes in the cluster.
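The difference between the two schemes is easy to see in a toy model. The sketch below (plain Python, not Cassandra – the node count, client count and row volumes are all made up for illustration) hashes a partition key to one of eight nodes, standing in for the token ring, and compares partitioning on each row's UUID against partitioning on the client id:

```python
# Toy comparison of two partitioning schemes (hypothetical numbers):
#  - partition on the row UUID: each client's data spreads across the cluster
#  - partition on the client id: each client's data is siloed onto one node
import hashlib
import random
import uuid
from collections import defaultdict

NODES = 8  # assumed cluster size for the sketch

def node_for(partition_key: str) -> int:
    # Stand-in for the token ring: hash the partition key to a node.
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return int(digest, 16) % NODES

rng = random.Random(42)  # seeded so the sketch is repeatable
rows = [
    (f"client-{c}", str(uuid.UUID(int=rng.getrandbits(128), version=4)))
    for c in range(10)
    for _ in range(1000)
]

by_uuid = defaultdict(set)    # client -> nodes touched, keyed on row UUID
by_client = defaultdict(set)  # client -> nodes touched, keyed on client id
for client, row_id in rows:
    by_uuid[client].add(node_for(row_id))
    by_client[client].add(node_for(client))

# UUID keys scatter every client over (nearly) all nodes;
# client-id keys pin each client to exactly one node.
print(max(len(nodes) for nodes in by_uuid.values()))
print(max(len(nodes) for nodes in by_client.values()))
```

Neither result is "right" in the abstract – uniform spread is exactly what we wanted for MySQL, while siloing turned out to be the better fit for our Cassandra workload. That's the point of testing.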
Test your topologies. Should you use a few big machines? Many smaller ones? Co-locate the Spark nodes with the data-storage nodes? Have a symmetric or asymmetric Spark-to-data ratio? Secondary indices? Denormalised and redundant data? There is no easy, one-size-fits-all answer. What works for me will quite likely be different for you. The only way to be sure is to test your options rigorously and optimise for your own use case.
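Testing an option often just means timing both sides of a trade-off on representative data. As a minimal sketch (the data shapes and sizes here are invented, and a real test would use your actual storage tier rather than in-memory Python), here's the denormalisation question reduced to a measurement: scanning normalised records versus keeping a redundant lookup index.

```python
# Toy "measure, don't guess" harness: compare a join-style scan against
# a denormalised, redundant index on the same (made-up) data set.
import timeit

orders = [(f"order-{i}", f"client-{i % 100}") for i in range(10_000)]

def find_by_scan(order_id):
    # Option A: scan the normalised records on every lookup.
    return next(client for oid, client in orders if oid == order_id)

# Option B: maintain a denormalised index keyed by order id.
index = {oid: client for oid, client in orders}

def find_by_index(order_id):
    return index[order_id]

scan_time = timeit.timeit(lambda: find_by_scan("order-9999"), number=200)
index_time = timeit.timeit(lambda: find_by_index("order-9999"), number=200)
print(f"scan: {scan_time:.5f}s  index: {index_time:.5f}s")
```

The same pattern – two candidate configurations, one workload, wall-clock numbers – applies just as well to machine sizing or co-location questions, only with cluster-level runs instead of microbenchmarks.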