Discussion:
Accessing protected elements in PythonTokenizer
Roxana Danger
2015-07-10 06:59:58 UTC
Permalink
Hello,
I am trying to construct a custom PythonTokenizer (see below), but I
am getting the error "attribute 'reader' of 'Tokenizer' objects is not
readable" when accessing it in the reset method.
reader is a protected member of Tokenizer. I assumed it would be
exposed through PythonTokenizer, since it is passed to the superclass in the
constructor. Am I wrong?
Thanks, best regards,
Roxana

class ComposerTokenizer(PythonTokenizer):

    def __init__(self, input):
        PythonTokenizer.__init__(self, input)
        self.reset()

    def incrementToken(self):
        if self.index < len(self.finaltokens):
            self.clearAttributes()
            offsetAttr = OffsetAttributeImpl()
            offsetAttr.setOffset( ... )
            self.index = self.index + 1
            return True
        else:
            return False

    def reset(self):
        s = ''
        ch = self.reader.read()  # this access raises the error
        while ch != -1:
            s = s + ch
            ch = self.reader.read()
        self.index = 0
        self.finaltokens = ...  # processing s to extract self.finaltokens
Andi Vajda
2015-07-10 11:00:25 UTC
Permalink
Post by Roxana Danger
Hello,
I am trying to construct a custom PythonTokenizer (see below), but I
am getting the error "attribute 'reader' of 'Tokenizer' objects is not
readable" when accessing it in the reset method.
reader is a protected member of Tokenizer. I assumed it would be
exposed through PythonTokenizer, since it is passed to the superclass in the
constructor. Am I wrong?
You're right, but there is no accessor for the reader object stored on the
Java side that would make it usable from the Python side.
You can either:
- add a getReader() method to the PythonTokenizer Java class that returns
it (and rebuild PyLucene after 'make clean')
- store the 'input' variable that is passed to your constructor on the
Python side, on your ComposerTokenizer instance. That 'input' is the
reader (at least, it's passed on to the Tokenizer Java class)

The first option is probably safer, as it doesn't assume that
Tokenizer(reader) leaves the reader unchanged before storing it.
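
For illustration, the second option could look roughly like the sketch below.
To keep it runnable without PyLucene, PythonTokenizer here is a stand-in stub
and StringReaderStub is a hypothetical reader whose read() returns one
character code or -1 at the end, in the manner of java.io.Reader. The
attribute name _input_reader and the whitespace split are assumptions too;
the original post elides the actual token processing. Written for Python 3.

```python
# Stand-in for PyLucene's PythonTokenizer so the sketch runs without a JVM.
# In real code, import the actual class from PyLucene instead.
class PythonTokenizer(object):
    def __init__(self, input):
        pass  # the real class hands `input` to the Java Tokenizer

    def clearAttributes(self):
        pass


class StringReaderStub(object):
    """Hypothetical reader: read() returns a character code, or -1 at the end."""

    def __init__(self, text):
        self._text = text
        self._pos = 0

    def read(self):
        if self._pos >= len(self._text):
            return -1
        code = ord(self._text[self._pos])
        self._pos += 1
        return code


class ComposerTokenizer(PythonTokenizer):
    def __init__(self, input):
        PythonTokenizer.__init__(self, input)
        # Keep a Python-side reference, since self.reader is not readable.
        # (_input_reader is a name made up for this sketch.)
        self._input_reader = input
        self.reset()

    def reset(self):
        chars = []
        ch = self._input_reader.read()
        while ch != -1:
            chars.append(chr(ch))  # read() yields an int, not a string
            ch = self._input_reader.read()
        s = ''.join(chars)
        self.index = 0
        # Placeholder processing: a plain whitespace split (assumption).
        self.finaltokens = s.split()

    def incrementToken(self):
        if self.index < len(self.finaltokens):
            self.clearAttributes()
            self.index = self.index + 1
            return True
        return False


tok = ComposerTokenizer(StringReaderStub("hello lucene world"))
count = 0
while tok.incrementToken():
    count += 1
```

This only sees the reader that was passed to the constructor, so it carries
the caveat described above: if the Java side wraps or replaces the reader,
the Python-side reference will not reflect that.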

Andi..
Roxana Danger
2015-07-10 13:05:07 UTC
Permalink
Hi Andi,
Thank you very much. I will use the first solution.
Best regards.
Roxana