Accelerating crawler using GPars


Accelerating crawler using GPars

Alexander Fedulov_
I am writing a small Grails-based crawler. Essentially it parses some pages on the Internet, extracts objects of my own ProductItem type and stores them in the database using GORM. The process can be illustrated by the following image (also attached):

[Inline image 1: Crawler.png]

Obviously, network activity in the form of HTTP requests is the bottleneck – approximately 10 HTTP requests have to be made in the background for each HTML parsing method call. Furthermore, as I have to work with a rather large number of URLs, I do not want to accumulate results in a single list of ProductItems; I would rather call "productItem.save()" and store newly created items in the database as soon as they are produced by the parser. The following code snippet describes my current approach:

ProductInfoParser's methods:
getHtml(urlString) // makes an HTTP request and returns the corresponding HTML
parse(html) // parses the provided HTML and returns an object of type ProductItem

ProductItemController:
def crawl() {
    def crawler = new Crawler()
    def productLinks = crawler.getProductLinks()
    productLinks.each { url ->
        def productInfoParser = new ProductInfoParser()
        def html = productInfoParser.getHtml(url)
        def productItem = productInfoParser.parse(html)
        createOrUpdateProductItem(productItem)
        …
        //Saving productItem
    }
}


I am new to GPars and, given the variety of available approaches, I am a little lost as to which one to choose in my case. I do see that my problem fits, for instance, the dataflow model, but I do not understand how I can express that 10-to-1 relation of HTTP request executions to parsing with it. I'm hoping someone here can point me in the right direction.




Re: Accelerating crawler using GPars

Mario Garcia
I'm still a newbie with GPars, but I think this is more of an actors scenario than a dataflow scenario. I would use dataflow if the final result depended on partial, isolated results arriving at different times, but in your case each step depends on the previous one, so I see no gain there.

Of course you can process each iteration in parallel (as I think Russel mentioned).

But I see it more like an actor crawling the URLs, with each URL's result sent to a pool of "HTML" actors, and every processed HTML page sent on to another pool of "ProductItem_x" actors. Of course this could work even better if those actors were on different machines, but I think remote actors in GPars are still emerging (this should be explained a bit, Vaclav ;) ).
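Very roughly, something like this (only a sketch, untested, reusing your ProductInfoParser and Crawler classes; in real code you would also stop the actors once the URL list is exhausted):

import static groovyx.gpars.actor.Actors.actor

// one persister actor => all database writes happen on a single thread
def persister = actor {
    loop { react { productItem -> createOrUpdateProductItem(productItem) } }
}

// a small pool of CPU-bound "parser" actors
def parsers = (0..<4).collect {
    actor {
        loop {
            react { html -> persister << new ProductInfoParser().parse(html) }
        }
    }
}

// a larger pool of I/O-bound "fetcher" actors
def fetchers = (0..<10).collect {
    actor {
        loop {
            react { url ->
                def html = new ProductInfoParser().getHtml(url)
                // hand the downloaded page over to one of the parsers
                parsers[Math.abs(html.hashCode()) % parsers.size()] << html
            }
        }
    }
}

// scatter the URLs round-robin over the fetcher pool
new Crawler().getProductLinks().eachWithIndex { url, i ->
    fetchers[i % fetchers.size()] << url
}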

I hope this helps


Re: Accelerating crawler using GPars

Vaclav
Hi,

In general, I would say that almost any of the abstractions in GPars could be utilized to help you with your problem. I can imagine a solution based on actors, CSP, parallel collections and others. I typically approach such problems with dataflow operators (you may like to check out one of my earlier posts on this: http://www.jroller.com/vaclav/entry/dataflow_speculations).

Your case, however, is very linear - you take one url, process it and finally end up with a single entity to persist. In that perspective dataflow or actors seem like a slight overkill to me. You may consider replacing each() with eachParallel() in a context of a fixed-sized thread pool and be done.

The persistence layer may not like multiple threads storing data into the database concurrently, though. If that is the case, you can use a SyncDataflowQueue to submit ProductItems for persistence to a dedicated persisting thread:

import groovyx.gpars.dataflow.SyncDataflowQueue
import static groovyx.gpars.GParsPool.withPool

def crawl() {
    def crawler = new Crawler()
    def productLinks = crawler.getProductLinks()

    final queueToPersist = new SyncDataflowQueue()

    // A single dedicated thread persists the items, so GORM never sees concurrent writes
    final t = Thread.start {
        initializeDBSessionForThread()
        def item
        while ((item = queueToPersist.val) != null) {
            createOrUpdateProductItem(item)
        }
        destroyDBSession()
    }

    withPool(10) {
        productLinks.eachParallel { url ->
            def productInfoParser = new ProductInfoParser()
            def html = productInfoParser.getHtml(url)
            def productItem = productInfoParser.parse(html)
            queueToPersist << productItem
        }
        queueToPersist << null  // Indicate EOF to the persisting thread
    }
    t.join()
}

Cheers,

Vaclav


Re: Accelerating crawler using GPars

Alexander Fedulov
Hi Vaclav,

Probably I was not clear enough in explaining my problem, but I see a bit more parallelism in it.

Vaclav wrote
Your case, however, is very linear - you take one url, process it and
finally end up with a single entity to persist. In that perspective
dataflow or actors seem like a slight overkill to me. You may consider
replacing each() with eachParallel() in a context of a fixed-sized thread
pool and be done.
Let me try to explain what I mean on the basis of the following image:

http://gpars-user-mailing-list.19372.n3.nabble.com/file/n4024713/Concurency_crawler.png

To simplify the context I make the following assumptions: 1) HTTP requests are purely I/O-bound and their CPU usage is negligibly small. 2) In the presented picture there is only one core busy parsing the pages.

Maybe I misunderstand the mechanics of the solution that you propose. As I understand it, the piece of code below will in fact process URLs in parallel, but the fetching and parsing operations will always be executed sequentially, causing a 200ms (request) + 20ms (parsing) = 220ms delay per URL and forcing every CPU core to basically idle for 200ms for every task it gets. I want to design a solution where the CPUs are always fully loaded crunching pages, not waiting for network I/O.

withPool(10) {
    productLinks.eachParallel { url ->
        def productInfoParser = new ProductInfoParser()
        def html = productInfoParser.getHtml(url)
        def productItem = productInfoParser.parse(html)
        queueToPersist << productItem
    }
    queueToPersist << null  // Indicate EOF to the persisting thread
}
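
Roughly, what I would like to end up with is something along these lines (only a sketch, untested; I am not sure the operator/maxForks combination below is the idiomatic way to express it):

import groovyx.gpars.dataflow.DataflowQueue
import groovyx.gpars.group.DefaultPGroup

def urls      = new DataflowQueue()   // URL strings go in here
def htmlPages = new DataflowQueue()   // fetched HTML pages
def items     = new DataflowQueue()   // parsed ProductItems, ready to persist

def ioGroup  = new DefaultPGroup(20)  // wide pool: these threads mostly block on the network
def cpuGroup = new DefaultPGroup(4)   // roughly one thread per core for the parsing work

// many concurrent fetchers; they occupy the I/O pool, not the parsing cores
ioGroup.operator(inputs: [urls], outputs: [htmlPages], maxForks: 20) { url ->
    bindOutput new ProductInfoParser().getHtml(url)
}

// a few parsers keep the cores busy with whatever HTML has already arrived
cpuGroup.operator(inputs: [htmlPages], outputs: [items], maxForks: 4) { html ->
    bindOutput new ProductInfoParser().parse(html)
}

new Crawler().getProductLinks().each { urls << it }
// a single consumer would then read items.val and save to the database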

I also had some doubts regarding multithreaded persistence, but I think I first need to check the Grails (in particular GORM) documentation to clarify this issue.

Re: Accelerating crawler using GPars

Dierk König
Hi,

looks like a concurrent producer-consumer problem to me.
My personal preference would be to use kanbanflow ;-)

cheers
Dierk


Re: Accelerating crawler using GPars

Alexander Fedulov_
Hehe. So much for an unbiased opinion  =)

I have seen the video of your presentation of Kanban - pretty cool visualisation - liked it a lot =). I have not read the usage documentation yet, though. Which source would you recommend? The GPars docs? And one more question: do I get it right that the loop is optional and it can simply work in one direction, producing results?

Regards,
Alex


Re: Accelerating crawler using GPars

Alexander Fedulov
Hi Dierk,
forget about the loop - it was a silly question =). It seems I was more focused on the visualization than on what you were actually saying in the video.

Meanwhile I have figured out a preliminary solution based on a combination of withPool(){}, DataflowQueue and Dataflow tasks. The problem is that it chokes after around a thousand requests. I need to double-check it, but it seems that I am exhausting TCP connection resources. If that is the case, KanbanFlow might indeed be a perfect solution. I will post my findings later.

Regards,
Alex


Re: Accelerating crawler using GPars

Dierk König
great!

keep us posted
Dierk


Re: Accelerating crawler using GPars

Alexander Fedulov
Hi all,
I finally got some time to work on this problem again. Here are some of my findings. First of all, the throttling problem was caused by the way Apache HttpClient works out of the box: for every HTTP request a new TCP connection was established. When processing finished and the sockets were closed, these connections were not completely terminated but got stuck in the TIME_WAIT state. So at some point my system ran out of free ephemeral ports and the crawler got throttled. For those who do not know what TIME_WAIT is, or want to find out why it exists and how to work around it, here is an excellent article explaining all the details: http://www.serverframework.com/asynchronousevents/2011/01/time-wait-and-its-design-implications-for-protocols-and-scalable-servers.html
My way around the problem was simply to make sure that a connection pool is used. For that purpose I used the AsyncHTTPBuilder implementation, which internally uses MultiThreadedHttpConnectionManager from Apache HttpClient. So, here is my shot at the problem:


import static groovyx.gpars.dataflow.Dataflow.task
import static groovyx.net.http.ContentType.TEXT
import groovyx.gpars.GParsPool
import groovyx.gpars.dataflow.DataflowQueue
import groovyx.net.http.AsyncHTTPBuilder

final fileDirectory = System.getProperty("java.io.tmpdir")
// This file contains the list of item ids that have to be crawled
final fileName = "item_ids.txt"
final file = new File(fileDirectory, fileName)
final baseUri = 'http://www.somesite.com/products/'

def ids = []
file.eachLine { ids << it }
println "Overall ${ids.size()} items to be processed"

// Fetched pages travel from the fetcher task to the parser task through this queue
def buffer = new DataflowQueue()

def fetcher = task {
    println "TASK1 started"
    try {
        // Make sure the threadPool option is assigned as
        // java.util.concurrent.Executors.newFixedThreadPool(20) and not just as 20 –
        // this is a bug in HTTPBuilder
        def http = new AsyncHTTPBuilder(
                threadPool: java.util.concurrent.Executors.newFixedThreadPool(20),
                uri: baseUri,
                contentType: TEXT)

        GParsPool.withPool(20) {
            ids.eachParallel { id ->
                def resultFuture = http.get(path: id) { resp, reader ->
                    assert resp.statusLine.statusCode == 200
                    def html = reader.text
                    println "got async response for item ID: ${id}"
                    return [uri: baseUri + '/' + id, content: html]
                }
                // Wait until the asynchronous request has completed
                while (!resultFuture.done) {
                    Thread.sleep(20)
                }
                buffer << resultFuture.get()
            }
        }
        println "TASK1 finished"
    } catch (ex) {
        println "Unexpected exception while fetching: ${ex.class.name} : ${ex.message}"
        ex.printStackTrace()
    }
}

def parser = task {
    println "TASK2 started"
    int count = 0
    def productInfoParser = new ProductInfoParser()
    try {
        // Expect exactly one page per item id
        0.upto(ids.size() - 1) {
            println "COUNT: ${count}"
            def input = buffer.val
            count++
            productInfoParser.parse(input["content"], input["uri"])
        }
    } catch (ex) {
        println "Unexpected exception while parsing: ${ex.class.name} : ${ex.message}"
        ex.printStackTrace()
    }
    println "TASK2 finished"
    println "Number of parsed products: $count"
}.join()
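
For reference, the same connection pooling can also be configured directly on commons-httpclient 3.x, without HTTPBuilder on top (just a sketch from memory, untested; the parameter names are worth double-checking against the 3.1 javadoc):

import org.apache.commons.httpclient.HttpClient
import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager
import org.apache.commons.httpclient.methods.GetMethod

// a single shared manager keeps a bounded pool of TCP connections alive and reuses them,
// so finished requests do not leave thousands of sockets behind in TIME_WAIT
def connectionManager = new MultiThreadedHttpConnectionManager()
connectionManager.params.maxTotalConnections = 20
connectionManager.params.defaultMaxConnectionsPerHost = 20

def client = new HttpClient(connectionManager)

String fetch(HttpClient client, String url) {
    def get = new GetMethod(url)
    try {
        assert client.executeMethod(get) == 200
        return get.responseBodyAsString
    } finally {
        get.releaseConnection()   // returns the connection to the pool instead of closing it
    }
}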

What bothers me a bit is that, as I understand it, using an AsyncHTTPBuilder inside GParsPool.withPool will cause 20 redundant threads to be started. There will be roughly 20 GPars threads -> 20 AsyncHTTPBuilder threads, communicating with each other pairwise and sequentially. But I could not figure out how to avoid this.
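
One idea I am toying with (only a sketch, untested): drop the GParsPool layer entirely and let the AsyncHTTPBuilder's own pool do the fan-out, collecting the futures and draining them into the buffer from a single loop:

// submit all requests up front; AsyncHTTPBuilder's own 20 threads do the actual work,
// the executor simply queues whatever it cannot run yet
def futures = ids.collect { id ->
    http.get(path: id) { resp, reader ->
        assert resp.statusLine.statusCode == 200
        [uri: baseUri + '/' + id, content: reader.text]
    }
}

// drain the futures in submission order and feed the parser task
futures.each { buffer << it.get() }   // Future.get() blocks until that response has arrived

The drawback is that the futures are drained in submission order rather than completion order, but for this workload that should not matter much.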

And now I need some more help – I cannot figure out why this program does not terminate after the processing is finished and "Number of parsed products: X" has been printed =). Can anyone give a hint?

Re: Accelerating crawler using GPars

Alexander Fedulov
Can anyone give me a reference to some extended guide that would explain the mechanics of how dataflow tasks are terminated and what might cause them to run indefinitely? The info in the Dataflow user guide did not help, unfortunately.

Re: Accelerating crawler using GPars

Russel Winder-3

Dataflow tasks will only run indefinitely if the computation in them is
either deadlocked or livelocked. Unless there is a bug in the GPars
framework.

Any chance of sharing the code so it can be looked at by another set of
eyes?

Or better still, a small example of the problem that can be turned into
an integration test and added to the GPars code base.

Thanks.

--
Russel.

Re: Accelerating crawler using GPars

Alexander Fedulov
Hi Russel,

Here is a version that does not rely on my business logic but behaves exactly the same in terms of termination: TestHanging.groovy (http://gpars-user-mailing-list.19372.n3.nabble.com/file/n4024727/TestHanging.groovy). The output of the program ends like this:
...
TASK2 finished
Currently in: main
TASK1 finished

I have STS as my IDE. The version of HTTPBuilder is 0.5.1 - the one that comes with the REST plugin installed in Grails. The weird part is that the same code executed not as a Groovy script but run in the Groovy console terminates just fine. The output in that case is:
...
TASK2 finished
Currently in: Thread-3
TASK1 finished

Regards,
Alex


Re: Accelerating crawler using GPars

Russel Winder-3
Alexander,


Duly copied into a file locally. Sadly, trying to run it using an out-of-the-box Groovy 2.1.0-SNAPSHOT I get:

/home/users/russel/Progs/OddsByLanguage/GPars/AlexanderFedulov/TestHanging.groovy: 9: unable to resolve class groovyx.net.http.AsyncHTTPBuilder
 @ line 9, column 1.
   import groovyx.net.http.AsyncHTTPBuilder
   ^

/home/users/russel/Progs/OddsByLanguage/GPars/AlexanderFedulov/TestHanging.groovy: 4: unable to resolve class groovyx.net.http.ContentType
 @ line 4, column 1.
   import static groovyx.net.http.ContentType.TEXT
   ^

2 errors

Clearly I am either just missing a dependency or I am being daft.

--
Russel.

Re: Accelerating crawler using GPars

Alexander Fedulov
Russel,

could you try to import the dependency from here?
http://repository.codehaus.org/org/codehaus/groovy/modules/http-builder/http-builder/0.5.1/
(the one that I use)

or from the location described here:
http://groovy.codehaus.org/modules/http-builder/download.html

Re: Accelerating crawler using GPars

Russel Winder-3

I added the lines:

@GrabResolver(name = 'Codehaus', root = 'http://repository.codehaus.org')
@Grab('org.codehaus.groovy.modules.http-builder:http-builder:0.5.1')

to the script and now the program just seems to do nothing, but at least
it compiles…


org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
General error during conversion: Error grabbing Grapes -- [download failed: xml-apis#xml-apis;1.3.03!xml-apis.jar]

java.lang.RuntimeException: Error grabbing Grapes -- [download failed: xml-apis#xml-apis;1.3.03!xml-apis.jar]
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
        at org.codehaus.groovy.reflection.CachedConstructor.invoke(CachedConstructor.java:77)
        at org.codehaus.groovy.reflection.CachedConstructor.doConstructorInvoke(CachedConstructor.java:71)
        at org.codehaus.groovy.runtime.callsite.ConstructorSite$ConstructorSiteNoUnwrap.callConstructor(ConstructorSite.java:81)
        at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallConstructor(CallSiteArray.java:57)
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callConstructor(AbstractCallSite.java:182)
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callConstructor(AbstractCallSite.java:190)
        at groovy.grape.GrapeIvy.getDependencies(GrapeIvy.groovy:411)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:601)
        at org.codehaus.groovy.runtime.callsite.PogoMetaMethodSite$PogoCachedMethodSite.invoke(PogoMetaMethodSite.java:231)
        at org.codehaus.groovy.runtime.callsite.PogoMetaMethodSite.callCurrent(PogoMetaMethodSite.java:52)
        at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallCurrent(CallSiteArray.java:49)
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:133)
        at groovy.grape.GrapeIvy.resolve(GrapeIvy.groovy:546)
        at groovy.grape.GrapeIvy$resolve$0.callCurrent(Unknown Source)
        at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallCurrent(CallSiteArray.java:49)
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:133)
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:153)
        at groovy.grape.GrapeIvy.resolve(GrapeIvy.groovy:515)
        at groovy.grape.GrapeIvy$resolve.callCurrent(Unknown Source)
        at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallCurrent(CallSiteArray.java:49)
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:133)
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:149)
        at groovy.grape.GrapeIvy.grab(GrapeIvy.groovy:254)
        at groovy.grape.Grape.grab(Grape.java:141)
        at groovy.grape.GrabAnnotationTransformation.visit(GrabAnnotationTransformation.java:291)
        at org.codehaus.groovy.transform.ASTTransformationVisitor$3.call(ASTTransformationVisitor.java:319)
        at org.codehaus.groovy.control.CompilationUnit.applyToSourceUnits(CompilationUnit.java:900)
        at org.codehaus.groovy.control.CompilationUnit.doPhaseOperation(CompilationUnit.java:564)
        at org.codehaus.groovy.control.CompilationUnit.processPhaseOperations(CompilationUnit.java:540)
        at org.codehaus.groovy.control.CompilationUnit.compile(CompilationUnit.java:517)
        at groovy.lang.GroovyClassLoader.doParseClass(GroovyClassLoader.java:283)
        at groovy.lang.GroovyClassLoader.parseClass(GroovyClassLoader.java:260)
        at groovy.lang.GroovyClassLoader.parseClass(GroovyClassLoader.java:244)
        at groovy.lang.GroovyClassLoader.parseClass(GroovyClassLoader.java:185)
        at groovy.lang.GroovyShell$2.run(GroovyShell.java:206)
        at groovy.lang.GroovyShell$2.run(GroovyShell.java:204)
        at java.security.AccessController.doPrivileged(Native Method)
        at groovy.lang.GroovyShell.run(GroovyShell.java:204)
        at groovy.lang.GroovyShell.run(GroovyShell.java:150)
        at groovy.ui.GroovyMain.processOnce(GroovyMain.java:557)
        at groovy.ui.GroovyMain.run(GroovyMain.java:344)
        at groovy.ui.GroovyMain.process(GroovyMain.java:330)
        at groovy.ui.GroovyMain.processArgs(GroovyMain.java:119)
        at groovy.ui.GroovyMain.main(GroovyMain.java:99)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:601)
        at org.codehaus.groovy.tools.GroovyStarter.rootLoader(GroovyStarter.java:106)
        at org.codehaus.groovy.tools.GroovyStarter.main(GroovyStarter.java:128)

1 error




--
Russel.

Re: Accelerating crawler using GPars

Vaclav
Hi Alexander,

your app cannot stop because there are still active threads in the thread pool that you created for your AsyncHTTPBuilder. Either there is a way to close/stop the AsyncHTTPBuilder, or you have to call shutdown() on the pool itself.
GroovyConsole doesn't stop the owning OS process, and thus no apparent problems were observed there.
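
In code, something along these lines (a sketch; the key point is keeping a reference to the executor you pass in, so it can be shut down once fetching is done):

// keep a handle on the pool so it can be shut down explicitly at the end
def httpPool = java.util.concurrent.Executors.newFixedThreadPool(20)
def http = new AsyncHTTPBuilder(threadPool: httpPool, uri: baseUri, contentType: TEXT)
try {
    // ... submit the requests and fill the buffer as before ...
} finally {
    http.shutdown()       // if your HTTPBuilder version exposes it
    httpPool.shutdown()   // otherwise shutting the executor down is what lets the JVM exit
}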

Vaclav




Re: Accelerating crawler using GPars

Alexander Fedulov
In reply to this post by Russel Winder-3
Hi Russel,

unfortunately, due to the way the internal dependencies of HTTPBuilder are structured (versioning), it is almost impossible to make it work with Grab. In case you need to use it, I would recommend importing the libraries directly. It is a pity, because it is a really nice wrapper around HttpClient's boilerplate and it also offers some cool additional features, but I get the feeling that Thom Nichols does not work on this project anymore. It would be cool if someone in the community took the project over.

Re: Accelerating crawler using GPars

Alexander Fedulov
In reply to this post by Vaclav
Hi Vaclav,

thank you for the hint. Quickly skimming through the documentation and adding http.shutdown() to my code solved the problem. Unfortunately, resource release is not shown in the "getting started" example for the AsyncHTTPBuilder (http://groovy.codehaus.org/modules/http-builder/doc/async.html).

Alex

Re: Accelerating crawler using GPars

Vaclav




Right. Their "getting started" example ends up hanging.

Vaclav