serious withPool performance slowdown


serious withPool performance slowdown

Cedric Hurst
I noticed a post by Scott Davis a few weeks ago complaining of a performance issue with GPars:

http://gpars-user-mailing-list.19372.n3.nabble.com/I-m-clearly-doing-it-wrong-td889194.html#a889194

I spent an entire weekend refactoring an application to use GPars and found certain cases where the GPars code was significantly slower than the serial code, especially when handling a large succession of very fast method calls.  I think the problem may lie in withPool itself.  Taking Scott's lead, I've distilled it down to a much simpler use case:

SCRIPT
===
@Grab(group='org.codehaus.gpars', module='gpars', version='0.10')
import static groovyx.gpars.GParsPool.withPool

long start

def nums = (1L..10000000L)
withPool {
        println "serially summing numbers inside a withPool"
        start = System.currentTimeMillis()
        println nums.sum()
        println "time: ${System.currentTimeMillis()-start}ms"
}

println ""
println "serially summing numbers outside a withPool"
start = System.currentTimeMillis()
println nums.sum()
println "time: ${System.currentTimeMillis()-start}ms"


withPool {
        println ""
        println "paralell summing numbers inside a withPool"
        start = System.currentTimeMillis()
        println nums.sumParallel()
        println "time: ${System.currentTimeMillis()-start}ms"
}

withPool {
        println ""
        println "serially summing numbers inside a withPool (again)"
        start = System.currentTimeMillis()
        println nums.sum()
        println "time: ${System.currentTimeMillis()-start}ms"
}

println ""
println "serially summing numbers outside a withPool (again)"
start = System.currentTimeMillis()
println nums.sum()
println "time: ${System.currentTimeMillis()-start}ms"

OUTPUT
===
serially summing numbers inside a withPool
50000005000000
time: 8392ms

serially summing numbers outside a withPool
50000005000000
time: 2717ms

parallel summing numbers inside a withPool
50000005000000
time: 8750ms

serially summing numbers inside a withPool (again)
50000005000000
time: 8171ms

serially summing numbers outside a withPool (again)
50000005000000
time: 2936ms

===

As you can see, there seems to be significant overhead when executing code inside a GParsPool block (the code is almost 3x slower).  While I recognize this is a micro-benchmark, I've had similar issues in more complicated code.

After profiling the code a bit, I determined that a significant amount of CPU time in the withPool blocks is devoted to ForkJoinPool.awaitNanos calls.  This may just be a side effect of how awaitNanos is reported to the profiling tool, but it could also have something to do with the way ReentrantLocks are being allocated and managed.  Any insights are appreciated.

Re: serious withPool performance slowdown

Vaclav
Administrator
Hi Cedric,

I'm glad to see you using GPars. I can identify two independent questions in your email:
1. Why is sequential code inside the withPool() block slower than the same code outside of it?
2. Why is parallel code slower than sequential code?

Both issues originate in the inevitable overhead associated with orchestrating parallelism, and they show up only when cheap operations are measured, like summing up numbers.

Let's address the issues in turn.

1. We pay the price for having a Groovy category applied to the code. Try replacing the withPool() block with, e.g., a use(TimeCategory) block, thus removing all GPars dependencies from the code, and you'll see an identical slowdown for the sequential calculation - the cost of having a category-based DSL at your disposal.

use(groovy.time.TimeCategory) {
    println ""
    println "serially summing numbers inside a TimeCategory block"
    start = System.currentTimeMillis()
    println nums.sum()
    println "time: ${System.currentTimeMillis() - start}ms"
}

OUTPUT:

serially summing numbers inside a withPool
50000005000000
time: 6761ms

serially summing numbers inside a TimeCategory block
50000005000000
time: 8966ms

serially summing numbers outside a withPool
50000005000000
time: 3326ms
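
The practical consequence: since the category only penalizes the sequential Groovy code running inside the block, cheap sequential work can be kept outside withPool, entering the pool only for the parallel calls. A minimal sketch (reusing the nums range from your script):

import static groovyx.gpars.GParsPool.withPool

def nums = (1L..10000000L)
def serialSum = nums.sum()                  // no category in scope here - full speed
withPool {
    // only the code inside this block pays the category overhead
    assert nums.sumParallel() == serialSum
}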

2. To work, the parallel algorithm needs a tree-like parallel data structure (a ParallelArray) to be built from the number range you provide, while the sequential algorithm can start right away. Thus the parallel algorithm is seriously handicapped. You measure not only the parallel algorithm itself, but also the time to build a tree out of the number range, the cost of which is proportional to the size of the interval. This extra up-front overhead can't be recouped later if the operation we're parallelizing is cheap, like summing up numbers. Separating the two steps shows the longed-for speed-up of parallel operations:

withPool {
    println ""
    println "paralell summing numbers inside a withPool using PA ignoring PA build time"
    final def pnums = GParsPoolUtil.getParallel(nums)    //building the PA
    start = System.currentTimeMillis()
    println pnums.sum()
    println "time: ${System.currentTimeMillis() - start}ms"
}


OUTPUT:

serially summing numbers outside a withPool
50000005000000
time: 3326ms

parallel summing numbers inside a withPool
50000005000000
time: 6511ms

parallel summing numbers inside a withPool using PA, ignoring PA build time
50000005000000
time: 1981ms

This is the disadvantage of using a language whose default data structures are inherently sequential. Lists or hash sets can't be worked on concurrently in an efficient way directly. We can only envy Clojure developers here.

The takeaway - parallelize operations on a collection only if:
a) the operation you perform on each item is long enough to justify the initial cost of the data-structure conversion,
or
b) you'll chain multiple operations on the converted collection, e.g. nums.parallel.filter{...}.filter{...}.map{...}.sum() - see the sketch below and http://www.gpars.org/guide/guide/3.%20Data%20Parallelism.html#3.2%20Map-Reduce
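
A minimal sketch of case b), assuming the nums range from the script above (the filter and map closures are just placeholders):

import static groovyx.gpars.GParsPool.withPool

def nums = (1L..10000000L)
withPool {
    // pay the range-to-PA conversion once via .parallel, then chain
    // several parallel operations before the final reduction
    println nums.parallel.filter { it % 2 == 0 }.map { it * 2 }.sum()
}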

One more reason to avoid parallelizing fine-grained operations, like summing up numbers, in Groovy is tracked at http://jira.codehaus.org/browse/GPARS-40 - see the comments. In its current version, Groovy internally synchronizes method calls, making concurrent method invocations block one another. This is likely to be addressed in Groovy 1.8 and 2.0.

I hope this helps clarify the issue.

Regards,

Vaclav

Re: serious withPool performance slowdown

Alex Tkachman
Hi Vaclav,

I believe that one of the clear consequences of your brilliant analysis is that GPars should stop using TimeCategory silently. The 4x performance degradation reported by Cedric is a crazy price for the ability to use TimeCategory, which he doesn't need at all in his code.

Alex


Re: serious withPool performance slowdown

Johannes Link
I assume Vaclav's point is not about TimeCategory itself but about the use of any category. As long as we use a category-based DSL, we have to pay the performance price. Could withPool be implemented without categories?

Johannes

Sent from my iPod


Re: serious withPool performance slowdown

Cedric Hurst
Thanks for the insights.  This is exactly what I was looking for.

In regard to the first point, I guess this is where something like the
compile-time traits used in Scala would come in handy.  If I
understood correctly, I think Guillaume mentioned they were looking
into incorporating traits into a future Groovy release.  I believe
Groovy++ supports traits, as well, but I've not yet used them in
practice so I'm not sure they would help.

In regard to the second point, it makes sense that the PA construction would be expensive for such a large range.  But it also seems this translation is currently both atomic and single-threaded.  I'm not familiar with the internals of ParallelArrays, so this may be a completely left-field inquiry, but is there a way to buffer and stream the elements of a parallel array as it's being constructed?  For instance, if the parallel array were half-populated, it seems we should be able to start applying the parallel algorithm on the contents thus far and stream the remaining elements into our algorithm via a buffer as they are made available.  The difference would be this:

[1, 2, .., 100000000] -> makeParallel([1, 2, .., 100000000]) -> sumParallel([1, 2, .., 100000000])

versus

[1, 2, .., 100000000] -> makeParallelStream(list: [1, 2, .., 100000000], bufferSize: 20000) -> sumParallel([1, 2, .., 20000]), sumParallel([20001, 20002, .., 40000]), ... -> getSumOfSumParallelSums([200010000, 600010000, ...])

Don't know if this would be any faster, but it would at least light up the remainder of the cores earlier in the pipeline.
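
For what it's worth, the shape of the idea can already be approximated with the existing GPars API by chunking by hand - a sketch only (chunkedSum and the 20000 chunk size are made up here, and this says nothing about whether it is actually faster):

import static groovyx.gpars.GParsPool.withPool

def chunkedSum(range, int chunkSize) {
    // split the range into fixed-size sub-ranges
    def chunks = []
    for (long lo = range.from; lo <= range.to; lo += chunkSize) {
        chunks << (lo..([lo + chunkSize - 1, range.to].min()))
    }
    def total = 0L
    withPool {
        // sum each chunk in parallel, then sum the partial sums sequentially
        total = chunks.collectParallel { it.sum() }.sum()
    }
    total
}

assert chunkedSum(1L..10000000L, 20000) == 50000005000000L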

Also, it might be worthwhile to provide some prefabricated parallel versions of the standard Groovy collections.  If I could declare a ParallelIntRange out of the box, for example, it might save some time versus creating a simple list and then copying it into a parallel version.


Re: serious withPool performance slowdown

Vaclav
Administrator
In reply to this post by Alex Tkachman
Hi Alex, Cedric and Johannes,

the time category is no longer in use in withPool blocks. It's the GParsPoolUtil class, holding all the xxxParallel methods, that is applied as a category and causes the slow-down of the sequential algorithm. Unfortunately, I can't currently think of a better way to provide a scoped parallel collection DSL.

It is perhaps worth mentioning that the category doesn't cause any slowdown to the parallel algorithm, just the sequential one, since the parallel collection machinery is implemented in Java (the JSR-166y lib) and so is not influenced by a Groovy category being in scope. Also, since categories are thread-local, the pooled threads processing our collection in parallel don't pay the category penalty. It is only the surrounding sequential Groovy code that suffers.

Trying the ParallelEnhancer class, which uses the MOP instead of a category, gives about the same performance characteristics as the category-based approach.
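
For reference, the ParallelEnhancer experiment looks roughly like this (a sketch - the enhancement goes through the metaclass of the concrete instance, so no category scope is needed around the call):

import groovyx.gpars.ParallelEnhancer

def nums = (1L..10000000L).toList()
ParallelEnhancer.enhanceInstance(nums)   // adds the xxxParallel() methods via the MOP
println nums.sumParallel()               // no withPool/category block required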

Vaclav



Re: serious withPool performance slowdown

Vaclav
Administrator
In reply to this post by Cedric Hurst
Hi Cedric,

On Sat, Oct 23, 2010 at 5:26 PM, Cedric Hurst <[hidden email]> wrote:

> Thanks for the insights.  This is exactly what I was looking for.

I'm glad to be of service.

> In regard to the first point, I guess this is where something like the compile-time traits used in Scala would come in handy.  If I understood correctly, I think Guillaume mentioned they were looking into incorporating traits into a future Groovy release.  I believe Groovy++ supports traits, as well, but I've not yet used them in practice, so I'm not sure they would help.

Well, traits are still something to long for in Groovy, I'm afraid.

> In regard to the second point, it makes sense that the PA construction would be expensive for such a large range.  But it also seems this translation is currently both atomic and single-threaded.  I'm not familiar with the internals of ParallelArrays, so this may be a completely left-field inquiry, but is there a way to buffer and stream the elements of a parallel array as it's being constructed?  For instance, if the parallel array were half-populated, it seems we should be able to start applying the parallel algorithm on the contents thus far and stream the remaining elements into our algorithm via a buffer as they are made available.  The difference would be this:
>
> [1, 2, .., 100000000] -> makeParallel([1, 2, .., 100000000]) -> sumParallel([1, 2, .., 100000000])
>
> versus
>
> [1, 2, .., 100000000] -> makeParallelStream(list: [1, 2, .., 100000000], bufferSize: 20000) -> sumParallel([1, 2, .., 20000]), sumParallel([20001, 20002, .., 40000]), ... -> getSumOfSumParallelSums([200010000, 600010000, ...])
>
> Don't know if this would be any faster, but it would at least light up the remainder of the cores earlier in the pipeline.

This is definitely a valid thought. GPars uses the JSR-166y (aka Java 7 concurrency) library to do the heavy lifting, so we can't really influence the way the ParallelArray gets built. It simply takes an array and builds a PA out of it.
I've done some further investigation, only to realize that it is most likely the range-to-list conversion that causes the big up-front delay. In order to build a parallel array, JSR-166y needs to start off from a Java array, and it's up to us (GPars) to build an array out of the range, which unfortunately is an inherently sequential process.
After changing the benchmark so that the nums variable holds a list instead of a range, the parallel processing gets much faster. It gets on par in speed with the "sequential algorithm outside a withPool", which looks like a good result given the PA creation overhead.
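
Concretely, the change to the benchmark amounts to this (a sketch against the original script):

import static groovyx.gpars.GParsPool.withPool

def nums = (1L..10000000L).toList()   // materialize the range up front
withPool {
    long start = System.currentTimeMillis()
    println nums.sumParallel()        // the PA is now built from a list
    println "time: ${System.currentTimeMillis() - start}ms"
}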

Now, I also realized we could lower the GPars overhead a bit for some scenarios (without impact on the discussed case), which I'll hopefully look into next week.


> Also, it might be worthwhile to provide some prefabricated parallel versions of the standard Groovy collections.  If I could declare a ParallelIntRange out of the box, for example, it might save some time versus creating a simple list and then copying it into a parallel version.

Such data structures would certainly be very beneficial for concurrency in Groovy, especially if they were built around the JSR-166y parallel collections, so that we could avoid reimplementing the parallel collection machinery.

Cheers,

Vaclav