I noticed a post by Scott Davis a few weeks ago complaining of a performance issue with GPars:
http://gpars-user-mailing-list.19372.n3.nabble.com/I-m-clearly-doing-it-wrong-td889194.html#a889194

I spent an entire weekend refactoring an application using GPars and found certain cases where the GPars code was significantly slower than the serial code, especially when handling a large succession of very fast method calls. I think the problem may lie in withPool itself. Taking Scott's lead, I've distilled it down to a much simpler use case:

SCRIPT
===
@Grab(group='org.codehaus.gpars', module='gpars', version='0.10')
import static groovyx.gpars.GParsPool.withPool

long start

def nums = (1L..10000000L)
withPool {
    println "serially summing numbers inside a withPool"
    start = System.currentTimeMillis()
    println nums.sum()
    println "time: ${System.currentTimeMillis() - start}ms"
}

println ""
println "serially summing numbers outside a withPool"
start = System.currentTimeMillis()
println nums.sum()
println "time: ${System.currentTimeMillis() - start}ms"

withPool {
    println ""
    println "parallel summing numbers inside a withPool"
    start = System.currentTimeMillis()
    println nums.sumParallel()
    println "time: ${System.currentTimeMillis() - start}ms"
}

withPool {
    println ""
    println "serially summing numbers inside a withPool (again)"
    start = System.currentTimeMillis()
    println nums.sum()
    println "time: ${System.currentTimeMillis() - start}ms"
}

println ""
println "serially summing numbers outside a withPool (again)"
start = System.currentTimeMillis()
println nums.sum()
println "time: ${System.currentTimeMillis() - start}ms"

OUTPUT
===
serially summing numbers inside a withPool
50000005000000
time: 8392ms

serially summing numbers outside a withPool
50000005000000
time: 2717ms

parallel summing numbers inside a withPool
50000005000000
time: 8750ms

serially summing numbers inside a withPool (again)
50000005000000
time: 8171ms

serially summing numbers outside a withPool (again)
50000005000000
time: 2936ms

===

As you'll see, there seems to be significant overhead when executing code inside a GParsPool (the code is almost 3x slower). While I recognize this is a micro-benchmark, I've had similar issues in more complicated code.

After profiling the code a bit, I determined that a significant amount of CPU time in the withPool blocks is devoted to ForkJoinPool.awaitNanos calls. This may just be a side effect of how awaitNanos is reported to a profiling tool, but it could also have something to do with the way ReentrantLocks are being allocated and managed. Any insights are appreciated.
Hi Cedric,
I'm glad to see you using GPars. I can identify two independent questions in your email:

1. Why is sequential code inside the withPool() block slower than the same code used outside of it?
2. Why is parallel code slower than sequential?

Both issues originate in some inevitable overhead associated with orchestrating parallelism, and they only show up when cheap operations are measured, like summing up numbers.

Let's address the issues in turn.

1. We pay the price for having a Groovy category applied to the code. Try replacing the withPool() block with e.g. a use(TimeCategory) block, thus removing all GPars dependencies from the code, and you'll see an identical slowdown for the sequential calculation - the cost of having a category-based DSL at your disposal.

use(TimeCategory) {
    println ""
    println "serially summing numbers inside a TimeCategory block"
    start = System.currentTimeMillis()
    println nums.sum()
    println "time: ${System.currentTimeMillis() - start}ms"
}

OUTPUT:

serially summing numbers inside a withPool
50000005000000
time: 6761ms

serially summing numbers inside a TimeCategory block
50000005000000
time: 8966ms

serially summing numbers outside a withPool
50000005000000
time: 3326ms

2. The parallel algorithm needs a tree-like parallel data structure (a Parallel Array) to be built from the number range you provide, while the sequential algorithm can start right away. Thus the parallel algorithm is seriously handicapped: you measure not only the parallel algorithm itself, but also the time to build a tree out of the number range, the cost of which is proportional to the size of the interval. This extra initial overhead can't be paid off later if the operation we're parallelizing is as cheap as summing up numbers. Separating the two steps shows the longed-for speed-up of parallel operations:

withPool {
    println ""
    println "parallel summing numbers inside a withPool using PA, ignoring PA build time"
    final def pnums = GParsPoolUtil.getParallel(nums) //building the PA
    start = System.currentTimeMillis()
    println pnums.sum()
    println "time: ${System.currentTimeMillis() - start}ms"
}

OUTPUT:

serially summing numbers outside a withPool
50000005000000
time: 3326ms

parallel summing numbers inside a withPool
50000005000000
time: 6511ms

parallel summing numbers inside a withPool using PA, ignoring PA build time
50000005000000
time: 1981ms

This is the disadvantage of using a language whose default data structures are inherently sequential. Lists or hash sets can't be efficiently worked on concurrently directly. We can only envy Clojure developers here.

The takeaway - parallelize operations on a collection only if:
a) the operation you perform on each item is long enough to justify the initial cost of the data structure conversion, or
b) you'll chain multiple operations on the converted collection - as in nums.parallel.filter{...}.filter{...}.map{...}.sum() (http://www.gpars.org/guide/guide/3.%20Data%20Parallelism.html#3.2%20Map-Reduce)

One more reason to avoid parallelization of fine-grained operations, like summing up numbers, in Groovy is tracked at http://jira.codehaus.org/browse/GPARS-40 . See the comments. In the current version Groovy internally synchronizes method calls, making concurrent method invocations block one another. This is likely to be addressed in Groovy 1.8 and 2.0.

I hope this helps clarify the issue.

Regards,

Vaclav
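P.S. If the slowdown really is just the category dispatch, even a do-nothing category ought to reproduce it. A minimal sketch - EmptyCategory is a name made up for this test and has no connection to GPars:

class EmptyCategory {
    // Never called; its mere presence routes every method call inside the
    // use() block through Groovy's thread-local category lookup.
    static void noop(Object self) { }
}

def nums = (1L..10000000L)

def start = System.currentTimeMillis()
println nums.sum()
println "outside any category: ${System.currentTimeMillis() - start}ms"

use(EmptyCategory) {
    start = System.currentTimeMillis()
    println nums.sum()
    println "inside an empty category block: ${System.currentTimeMillis() - start}ms"
}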
--
E-mail: [hidden email]
Blog: http://www.jroller.com/vaclav
LinkedIn: http://www.linkedin.com/in/vaclavpech
Hi Vaclav,
I believe one clear consequence of your brilliant analysis is that GPars should stop using TimeCategory silently. The 4x performance degradation reported by Cedric is a crazy price to pay for the ability to use TimeCategory, which he doesn't need at all in his code.

Alex
I assume Vaclav's point is not about TimeCategory itself but about the use of any category. As long as we use a category-based DSL we have to pay the performance price. Could withPool be implemented without categories?
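For reference, the underlying JSR-166y machinery can be driven without any Groovy category at all. A rough sketch - the package and factory names are from the jsr166y snapshot bundled around GPars 0.10, so treat them as assumptions to verify:

import jsr166y.forkjoin.ForkJoinPool
import jsr166y.forkjoin.ParallelLongArray

// Build the parallel array and sum it directly on a fork/join pool.
// No category is ever put in scope, so the surrounding sequential
// Groovy code keeps its normal dispatch speed.
def pool = new ForkJoinPool(Runtime.runtime.availableProcessors())
def pla = ParallelLongArray.createFromCopy((1L..10000000L) as long[], pool)
println pla.sum()
pool.shutdown()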
Johannes

Sent from my iPod
Thanks for the insights. This is exactly what I was looking for.
In regard to the first point, I guess this is where something like the compile-time traits used in Scala would come in handy. If I understood correctly, I think Guillaume mentioned they were looking into incorporating traits into a future Groovy release. I believe Groovy++ supports traits as well, but I've not yet used them in practice, so I'm not sure they would help.

In regard to the second point, it makes sense that the PA construction would be expensive for such a large range. But it also seems this translation is currently both atomic and single-threaded. I'm not familiar with the internals of ParallelArrays, so this may be a completely left-field inquiry, but is there a way to buffer and stream the elements of a parallel array as it's being constructed? For instance, if the parallel array were half-populated, it seems we should be able to start applying the parallel algorithm to the contents so far and stream the remaining elements into our algorithm via a buffer as they are made available. The difference would be this:

[1, 2, .., 100000000] -> makeParallel([1, 2, .., 100000000]) -> sumParallel([1, 2, ..., 100000000])

versus

[1, 2, .., 100000000] -> makeParallelStream(list: [1, 2, .., 100000000], bufferSize: 20000) -> sumParallel([1, 2, .., 20000]), sumParallel([20001, 20002, .., 40000]), ... -> getSumOfSumParallelSums([200010000, 600010000, ...])

I don't know if this would be any faster, but it would at least light up the remainder of the cores earlier in the pipeline; a rough chunked approximation is sketched at the end of this message.

Also, it might be worthwhile to provide some prefabricated parallel versions of the standard Groovy collections. If I could declare a ParallelIntRange out of the box, for example, it might save some time versus creating a simple list and then copying it into a parallel version.
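Here's roughly what I mean, scaled down to fixed-size chunks rather than true streaming - the chunk size of 100000 is an arbitrary guess, not a tuned value:

@Grab(group='org.codehaus.gpars', module='gpars', version='0.10')
import static groovyx.gpars.GParsPool.withPool

// Sum 1..10,000,000 by summing 100 sub-ranges in parallel. Each worker
// runs a plain sequential loop, so only the small list of 100 ranges
// needs converting into a parallel array.
final long CHUNK = 100000L
def chunks = (0L..<100L).collect { i -> (i * CHUNK + 1)..((i + 1) * CHUNK) }

withPool {
    def start = System.currentTimeMillis()
    def total = chunks.collectParallel { range ->
        long s = 0L
        long hi = range.to as long
        for (long n = range.from as long; n <= hi; n++) s += n
        s
    }.sum()
    println "$total in ${System.currentTimeMillis() - start}ms"  // expect 50000005000000
}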
In reply to this post by Alex Tkachman
Hi Alex, Cedric and Johannes,
the time category is no longer in use in withPool blocks. It's the GParsPoolUtil class, which holds all the xxxParallel methods, that is applied as a category and causes the slowdown of the sequential algorithm. Unfortunately, I can't currently think of a better way to provide a scoped parallel collection DSL.

It is perhaps worth mentioning that the category doesn't cause any slowdown to the parallel algorithm, just the sequential one, since the parallel collection machinery is implemented in Java (the JSR-166y lib) and so is not influenced by a Groovy category being in scope. Also, since categories are thread-local, the pooled threads processing our collection in parallel do not pay the category penalty. It is only the surrounding sequential Groovy code that suffers.

Trying the ParallelEnhancer class, which uses the MOP instead of a category, gives about the same performance characteristics as the category-based approach.

Vaclav
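P.S. For completeness, this is roughly how the MOP-based alternative is used - based on the 0.10 API, so double-check the import path:

import groovyx.gpars.ParallelEnhancer

def nums = (1L..10000000L).toList()
// Enhance just this one instance through the metaclass. No category is in
// scope, so surrounding sequential code dispatches at normal speed - but,
// as noted above, the enhanced calls themselves show similar overhead.
ParallelEnhancer.enhanceInstance(nums)
println nums.sumParallel()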
In reply to this post by Cedric Hurst
Hi Cedric,
I'm glad to be of service.
Well, traits are still something to long for in Groovy, I'm afraid.

As for the second point - this is definitely a valid thought. GPars uses the JSR-166y (aka Java 7 concurrency) library to do the heavy lifting, so we can't really influence the way the ParallelArray gets built: it simply takes an array and builds a PA out of it. I've done some further investigation, only to realize that it is most likely the range-to-list conversion that causes the big up-front delay. In order to build a parallel array, JSR-166y needs to start off from a Java array, and it's up to us (GPars) to build an array out of the range, which unfortunately is an inherently sequential process. After changing the benchmark so that the nums variable holds a list instead of a range, the parallel processing gets much faster - on par in speed with the "sequential algorithm outside a withPool", which looks like a good result given the PA creation overhead.

Now, I also realized we could lower the GPars overhead a bit for some scenarios (without impact on the discussed case), which I'll hopefully look into next week.
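The benchmark change amounts to paying the range-to-list conversion before the timed region - a sketch along the lines of Cedric's original script:

@Grab(group='org.codehaus.gpars', module='gpars', version='0.10')
import static groovyx.gpars.GParsPool.withPool

// Convert the range to a list up front, outside the measurement, so the
// timed region covers only the PA build from an existing list plus the
// parallel sum itself.
def nums = (1L..10000000L).toList()

withPool {
    def start = System.currentTimeMillis()
    println nums.sumParallel()
    println "time: ${System.currentTimeMillis() - start}ms"
}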
Such data structures would certainly be very beneficial for concurrency in Groovy, especially if they were built around the JSR-166y parallel collections, so that we could avoid reimplementing the parallel collection machinery.

Cheers,

Vaclav