Discussion:
Tokenizer text source
Marc Jeurissen
2016-10-25 14:34:13 UTC
Hi,

I have a custom Analyzer and Tokenizer which I'm trying to migrate from
PyLucene 4.10 to 6.2.

The problem is that it is no longer possible to grab the text source from
either the createComponents method or the Tokenizer constructor. The
documentation says the Tokenizer has a field 'input' which contains the
text source, but in PyLucene a Tokenizer does not seem to have an
attribute 'input'.

Any idea how I can address the text source?

analyzer = MyAnalyzer()   # 'createComponents' sets MyTokenizer
config = IndexWriterConfig(analyzer)
config.setOpenMode(IndexWriterConfig.OpenMode.CREATE)
store = SimpleFSDirectory(....)
writer = IndexWriter(store, config)
doc = Document()
doc.add(Field("title", "value of testing", TextField.TYPE_NOT_STORED))
writer.addDocument(doc)   # calls incrementToken of MyTokenizer, but I
                          # need the text source to create my tokens
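(For context, the reason the constructor no longer sees the text: since Lucene 5.x the indexing machinery assigns the reader to the tokenizer per document, via setReader, after construction, so the source is only available once reset() runs. A minimal pure-Python sketch of that contract follows; MyTokenizer here is a plain mock for illustration, not an actual Lucene subclass.)

```python
import io

class MyTokenizer:
    """Plain-Python stand-in for a Lucene Tokenizer: the indexing
    machinery sets self.input (a Reader) before reset() is called,
    and incrementToken() is then called once per token."""

    def __init__(self):
        self.input = None      # assigned later, per document (setReader)
        self._tokens = iter(())
        self._token = None

    def reset(self):
        # Read the full text source from the 'input' reader here,
        # not in __init__, because 'input' is only assigned afterwards.
        text = self.input.read()
        self._tokens = iter(text.split())

    def incrementToken(self):
        try:
            self._token = next(self._tokens)
            return True
        except StopIteration:
            return False

tok = MyTokenizer()
tok.input = io.StringIO("value of testing")   # what the pipeline does
tok.reset()
out = []
while tok.incrementToken():
    out.append(tok._token)
print(out)  # ['value', 'of', 'testing']
```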

Thank you
--
Marc Jeurissen | UAntwerpen
Kind regards,

Marc Jeurissen

Bibliotheek UAntwerpen
Stadscampus - S.A.085
Prinsstraat 9 - 2000 Antwerpen
***@uantwerpen.be
T +32 3 265 49 71
<http://anet.be>
Andi Vajda
2016-10-27 10:49:23 UTC
Post by Marc Jeurissen
I have a custom Analyzer and Tokenizer which I'm trying to migrate from
PyLucene 4.10 to 6.2.
The problem is that it is no longer possible to grab the text source from
either the createComponents method or the Tokenizer constructor. The
documentation says the Tokenizer has a field 'input' which contains the text
source, but in PyLucene a Tokenizer does not seem to have an attribute 'input'.
Any idea how I can address the text source?
I have now extended JCC with the capability of explicitly requesting a
wrapper for a non-public field, such as 'input', which is a protected field.
That field is then available as an attribute on the corresponding Python
wrapper class.

I then added
    org.apache.lucene.analysis.Tokenizer:input
to the list of explicitly requested wrappers in PyLucene's Makefile.
>>> from lucene import *
>>> initVM()
<jcc.JCCEnv object at 0x10028a0f0>
>>> from org.apache.lucene.analysis import Tokenizer
>>> Tokenizer.input
<attribute 'input' of 'Tokenizer' objects>
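(Conceptually, what the requested wrapper adds is an attribute descriptor on
the wrapper class, so the protected Java field reads like a plain Python
attribute. A pure-Python sketch of that idea, with no JVM involved; the
FieldDescriptor class and the _java_fields storage are made up for
illustration and are not JCC internals:)

```python
class FieldDescriptor:
    """Stand-in for the descriptor a requested field wrapper provides:
    attribute access on the Python object is forwarded to the
    otherwise inaccessible underlying field."""
    def __init__(self, name):
        self.name = name
    def __get__(self, obj, objtype=None):
        if obj is None:
            return self   # class-level access, like Tokenizer.input above
        return obj._java_fields[self.name]

class Tokenizer:
    # what adding Tokenizer:input to the wrapper list effectively yields
    input = FieldDescriptor("input")
    def __init__(self, reader):
        self._java_fields = {"input": reader}

t = Tokenizer("the text source")
print(t.input)           # instance access reads the wrapped field
print(Tokenizer.input)   # class access shows the descriptor object
```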

This is available from svn trunk rev 1766805.

To get this new feature, svn update to HEAD on trunk and:
- rebuild jcc
- rebuild pylucene

If you have questions, don't hesitate to ask (but subscribe to
pylucene-dev@ first so that your message doesn't sit in a moderation queue).

Thanks !

Andi..