Discussion:
ShingleAnalyzerWrapper in PyLucene
marco turchi
2017-01-28 22:06:19 UTC
Dear All,
I need to use the ShingleAnalyzerWrapper in PyLucene.

I have built the analyzer the same way as in Lucene:
self.analyzer = ShingleAnalyzerWrapper(WhitespaceAnalyzer(), 2, 4, " ", True, False, None)

and I have used it inside QueryParser:
query = QueryParser("source", self.analyzer).parse("welcome world is at on")

The output is:
source:welcome source:world source:is source:at source:on

I have run the same code in Java, and the output is what I would expect:
source:welcome source:welcome world source:welcome world is source:welcome
world is at source:world source:world is source:world is at source:world is
at on source:is content:is at source:is at on source:at source:at on
source:on

Do you have any idea what I'm doing wrong in PyLucene?

Thanks a lot in advance for your help
Marco
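
For reference, the terms Marco expects can be reproduced without Lucene. The following is a minimal plain-Python sketch, not PyLucene code: the `shingles` function and its parameter names are hypothetical, chosen to mirror the minimum shingle size, maximum shingle size, token separator, and output-unigrams arguments passed to ShingleAnalyzerWrapper above. It assumes whitespace tokenization and ignores the filler-token argument.

```python
def shingles(tokens, min_size=2, max_size=4, sep=" ", output_unigrams=True):
    """Return word n-grams, grouped by the position of their first token."""
    terms = []
    for start in range(len(tokens)):
        if output_unigrams:
            # The single token at this position comes first.
            terms.append(tokens[start])
        for size in range(min_size, max_size + 1):
            end = start + size
            if end > len(tokens):
                break  # not enough tokens left for a shingle of this size
            terms.append(sep.join(tokens[start:end]))
    return terms


terms = shingles("welcome world is at on".split())
print(" ".join("source:" + t for t in terms))
```

With sizes 2 to 4 and unigrams enabled, the five input tokens yield 14 terms, which matches the number of clauses in the Java output quoted above.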
Andi Vajda
2017-01-29 02:10:33 UTC
Please, help me help you by including two simple programs that I can run to
reproduce the problem. One in Java producing the output you expect, one in
Python producing the output you're reporting.

Thanks !

Andi..
marco turchi
2017-01-29 11:50:17 UTC
Dear Andi,
Please find attached the Java and Python code. Both programs create an
index with two records using the shingle analyzer and then query it,
printing the query and its terms.

Thanks a lot for your help
Marco
Andi Vajda
2017-01-29 18:14:22 UTC
It looks like you attached only the Python program; I received only one attachment.

Andi..
<TestShingle.py>
marco turchi
2017-01-29 18:24:22 UTC
It is strange, because I can see the attached files in the email I sent
you.

I am attaching the Java code again. In case it does not come through this
time either, you can download it from this link:
https://www.dropbox.com/s/o7ocygrdv8dqksl/CopyOfTest.java?dl=0
The file is called CopyOfTest.java

Thanks a lot!
Marco
Andi Vajda
2017-01-29 18:53:25 UTC
Indeed. No attachment was received here. Probably some security feature somewhere. The link you included should be good enough.

Thanks !

Andi..
Andi Vajda
2017-01-29 20:38:30 UTC
I haven't tried to run your programs yet, but one difference I noticed
is that in Python you do:
analyzer = ShingleAnalyzerWrapper(WhitespaceAnalyzer(), 2, 6, ' ', True, False, None)
and in Java you do:
analyzer = new ShingleAnalyzerWrapper(new WhitespaceAnalyzer(), 2, 4, " ", true, false, null);

The numeric parameters are not the same: 2, 6 vs 2, 4.
Please use the same values in both versions and let us know if that solves
the problem.
Thanks !

Andi..
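
To see what the 2, 6 vs 2, 4 difference would change for this query, here is a small plain-Python sketch. It is not PyLucene code: `shingle_set` is a hypothetical helper that assumes whitespace tokenization and unigram output, and it only compares the sets of terms that the two size ranges would produce.

```python
def shingle_set(tokens, min_size, max_size, sep=" "):
    """Return the set of unigrams plus all shingles of min_size..max_size words."""
    sizes = [1] + list(range(min_size, max_size + 1))
    return {
        sep.join(tokens[i:i + n])
        for n in sizes
        for i in range(len(tokens) - n + 1)  # empty range when n > len(tokens)
    }


tokens = "welcome world is at on".split()
java_terms = shingle_set(tokens, 2, 4)    # sizes used in the Java program
python_terms = shingle_set(tokens, 2, 6)  # sizes used in the Python program
extra = python_terms - java_terms
print(sorted(extra))
```

Under these assumptions, the larger maximum only adds the single five-word shingle for this five-token query and removes nothing, so aligning the values makes the two programs comparable but would not by itself account for a unigram-only result.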
marco turchi
2017-01-29 20:45:09 UTC
Hi Andi,
While I was changing the parameter value, I noticed another problem. I
have fixed it, and it works now.

Thanks a lot and sorry for bothering you!
Marco