The amount of data businesses have to process increases every day. Eventually, traditional approaches become impractical because of time constraints or the sheer size of the dataset. However, as datasets have grown, other ways to process and analyze them have come to light. One specific problem involved 14 datasets dumped nightly to a remote site, where they are read in, converted, and merged into a single dataset. A traditional approach took about 2 days to process one dataset. The new approach, using Ruby, AMQP, and RabbitMQ, takes less than 24 hours to process, convert, and merge all 14 datasets.
This faster approach uses RabbitMQ (written in Erlang) as the “middleman”: one set of Ruby workers publishes the legacy data, and another set of Ruby workers converts and merges it into a single dataset. Ruby was chosen for the workers because conversion code from the old approach could be reused, which saved time in developing the new system.
When the nightly load arrives, workers are started to publish the necessary information from each dataset to its respective queue in RabbitMQ. The workers waiting to convert and merge the data are sent batches of messages with that information; they keep processing until no messages are left in the queue. If a worker goes down or fails for any reason, the messages it did not process successfully are recovered by the message queue, ensuring that all queue items are processed.
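The publish, process, acknowledge, and recover cycle described above can be sketched in plain Ruby. This is only an illustration under stated assumptions: SketchQueue is a hypothetical in-memory stand-in for a single broker queue, not RabbitMQ client code, and convert, drain, and the sample records are invented for the example. A real worker would talk to RabbitMQ through an AMQP client library and rely on the broker to redeliver unacknowledged messages.

```ruby
# Hypothetical in-memory stand-in for one broker queue (NOT RabbitMQ client code).
class SketchQueue
  def initialize
    @ready = []     # messages waiting to be delivered
    @unacked = {}   # delivery tag => message delivered but not yet acknowledged
    @next_tag = 0
  end

  def publish(message)
    @ready.push(message)
  end

  # Deliver the next message and remember it until the worker acknowledges it.
  def pop
    return nil if @ready.empty?
    tag = (@next_tag += 1)
    @unacked[tag] = @ready.shift
    [tag, @unacked[tag]]
  end

  # The worker finished this message; forget it.
  def ack(tag)
    @unacked.delete(tag)
  end

  # Put an unacknowledged message back in line, as a broker would
  # after a consumer fails.
  def requeue(tag)
    message = @unacked.delete(tag)
    @ready.push(message) if message
  end

  def empty?
    @ready.empty? && @unacked.empty?
  end
end

# Hypothetical conversion step standing in for the reused legacy code.
def convert(record)
  { id: record[:id], name: record[:name].to_s.strip.upcase }
end

# Consume until the queue is drained, merging converted records by id.
# fail_on_id simulates a single worker failure on that record; a record
# that failed every time would loop forever in this simplified sketch.
def drain(queue, merged, fail_on_id: nil)
  already_failed = false
  loop do
    delivery = queue.pop
    break if delivery.nil?
    tag, record = delivery
    begin
      if record[:id] == fail_on_id && !already_failed
        already_failed = true
        raise "simulated worker failure"
      end
      merged[record[:id]] = convert(record)
      queue.ack(tag)      # acknowledge only after the work succeeded
    rescue StandardError
      queue.requeue(tag)  # unprocessed message goes back on the queue
    end
  end
  merged
end

queue = SketchQueue.new
[{ id: 1, name: " alice " }, { id: 2, name: "bob" }, { id: 3, name: "carol" }]
  .each { |record| queue.publish(record) }

merged = drain(queue, {}, fail_on_id: 2)
```

The point of the design is that a message is acknowledged only after it has been converted and merged, so a failure partway through leaves the message recoverable instead of lost.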
When the next nightly dump comes around, the message queues are filled again and the workers repeat their workflow. The new system processes the nightly dataset dumps on time and gives the clients using it up-to-date information.