Use Dataflows to concurrently process batches from multiple datasources

Hi folks,

My problem is similar to Prakash's earlier post. I am trying to increase the throughput of a system by using database replication and then partitioning long-running queries so that the results come back from multiple database nodes in parallel. I tried using actors, but after further research the dataflow queues seemed like a more appropriate abstraction, seeing as they use actors under the hood anyway.

The trick is that I need to aggregate the results of multiple queries into one collection and process them together as a single set, but of course while that processing is happening the database sits idle.

Seeing as the bottleneck in this problem is reading from the database, I want to keep the database busy all the time, so I would like to start the next parallel query while the results of the first one are being processed.

I started off with the DataflowOperatorTest example because I really need to write this in plain Java; introducing Groovy to the project is not really an option. But I'm struggling to figure out how to spawn multiple threads that populate the shared collection and then notify the parent thread when the batch job is complete.
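To make the intended flow concrete, here is a rough plain-Java sketch of it. Note that it swaps in `CompletableFuture` from `java.util.concurrent` instead of GPars dataflow operators, so it only illustrates the shape of the problem and is not a GPars solution. The names `BatchPipeline`, `Partition`, `fetchBatch` and `run` are made up for illustration, and the `query` function stands in for whatever actually reads from a replica node.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical helper: all names here are illustrative, not part of GPars.
final class BatchPipeline {

    /** One slice of a long-running query, bound to one replica node (assumed shape). */
    static final class Partition {
        final int node;
        final int from;
        final int to;

        Partition(int node, int from, int to) {
            this.node = node;
            this.from = from;
            this.to = to;
        }
    }

    /**
     * Starts every partition query of one batch on the pool and returns a future
     * that completes with the merged rows once all partitions have finished.
     */
    static <R> CompletableFuture<List<R>> fetchBatch(
            List<Partition> batch,
            Function<Partition, List<R>> query,
            ExecutorService dbPool) {
        List<CompletableFuture<List<R>>> parts = new ArrayList<>();
        for (Partition p : batch) {
            parts.add(CompletableFuture.supplyAsync(() -> query.apply(p), dbPool));
        }
        return CompletableFuture
                .allOf(parts.toArray(new CompletableFuture[0]))
                .thenApply(ignored -> {
                    // Merge in partition order; every part is already complete here.
                    List<R> merged = new ArrayList<>();
                    for (CompletableFuture<List<R>> part : parts) {
                        merged.addAll(part.join());
                    }
                    return merged;
                });
    }

    /**
     * Processes batches one after another on the calling thread, but kicks off
     * the queries for batch i + 1 before processing batch i, so the database
     * nodes stay busy while the previous result set is being processed.
     */
    static <R> void run(
            List<List<Partition>> batches,
            Function<Partition, List<R>> query,
            Consumer<List<R>> processor,
            int dbThreads) {
        if (batches.isEmpty()) {
            return;
        }
        ExecutorService dbPool = Executors.newFixedThreadPool(dbThreads);
        try {
            CompletableFuture<List<R>> current = fetchBatch(batches.get(0), query, dbPool);
            for (int i = 0; i < batches.size(); i++) {
                List<R> rows = current.join(); // the "batch complete" notification
                if (i + 1 < batches.size()) {
                    current = fetchBatch(batches.get(i + 1), query, dbPool); // prefetch next batch
                }
                processor.accept(rows); // overlaps with the prefetch above
            }
        } finally {
            dbPool.shutdown();
        }
    }
}
```

A caller would pass something like `BatchPipeline.<Row>run(batches, p -> readFromNode(p), rows -> process(rows), nodeCount)`, where `Row`, `readFromNode` and `process` are placeholders for the real row type, the JDBC read and the processing step. Each partition fills its own list and the lists are merged only after all of them complete, which avoids having several threads write into one shared collection.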

I've attached a diagram of a simplified scenario for clarification.

I'd also like to say that I'm very impressed with the GPars library and am delighted to have an opportunity to try it out. It has given me access to solutions I would never otherwise have been able to consider.

Thanks in advance for any ideas!