Discussion:
[jira] [Created] (PYLUCENE-32) pylucene CharArraySet jvm error
Alex (JIRA)
2014-10-16 10:46:33 UTC
Permalink
Alex created PYLUCENE-32:
----------------------------

Summary: pylucene CharArraySet jvm error
Key: PYLUCENE-32
URL: https://issues.apache.org/jira/browse/PYLUCENE-32
Project: PyLucene
Issue Type: Question
Environment:

I added a customized Lucene analyzer class to the Lucene core in PyLucene. This class has Google Guava as a dependency, because of the array-handling functions available in com.google.common.collect.Iterables in Guava. When I tried to index using this analyzer, I got the following error:

Traceback (most recent call last):
  File "C:\IndexFiles.py", line 78, in
    lucene.initVM()
JavaError: java.lang.NoClassDefFoundError: org/apache/lucene/analysis/CharArraySet
Java stacktrace:
java.lang.NoClassDefFoundError: org/apache/lucene/analysis/CharArraySet
Caused by: java.lang.ClassNotFoundException: org.apache.lucene.analysis.CharArraySet
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)

Even the example indexing code from Lucene in Action, which I tried earlier and which worked, now returns the same error when I rerun it after adding this class. I am not too familiar with the CharArraySet class, although I can see the problem comes from it. How do I handle this? Thanks

Reporter: Alex






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
Andi Vajda
2014-10-16 22:40:30 UTC
Permalink
There is no class org/apache/lucene/analysis/CharArraySet in the current
version of Lucene Core (4.10.1), but there is a CharArraySet class in the
util sub-package of the analysis package:
org/apache/lucene/analysis/util/CharArraySet

You are probably mixing versions of Lucene Core and your own code that are
not compatible.

Andi..
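[Editorial note] One quick way to see which package CharArraySet actually lives in is to list the entries of the lucene-core jar on the classpath. The sketch below builds a toy in-memory jar mimicking the Lucene 4.x layout (an assumption for demonstration only; on a real install, open the actual lucene-core jar instead):

```python
# Check which package a class lives in inside a jar. The jar contents here
# are a toy stand-in built in memory; real lucene-core jars lay classes out
# the same way, with many more entries.
import io
import zipfile

def find_class(jar_bytes, simple_name):
    """List jar entries whose file name is <simple_name>.class."""
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as zf:
        return [n for n in zf.namelist()
                if n.endswith("/%s.class" % simple_name)]

# Toy jar mimicking Lucene 4.x, where the class moved to the util package:
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("org/apache/lucene/analysis/util/CharArraySet.class", b"")

print(find_class(buf.getvalue(), "CharArraySet"))
# → ['org/apache/lucene/analysis/util/CharArraySet.class']
```

Against a real jar, pass `zipfile.ZipFile("lucene-core-4.10.1.jar")` contents instead; if the class shows up under analysis/util rather than analysis, the code was compiled against a different Lucene version than the one on the classpath.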
Alex (JIRA)
2014-10-17 06:41:34 UTC
Permalink
[ https://issues.apache.org/jira/browse/PYLUCENE-32?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174781#comment-14174781 ]

Alex commented on PYLUCENE-32:
------------------------------

Thanks Andi. But I am using PyLucene version 3.6.2. I think the problem has to do with JVM instantiation caused by Java-Python array incompatibilities, but I don't know how to solve this. Below are the Java files I added as classes to the Lucene core; perhaps you will have a better understanding of what the issue is:

The lemmatizer:
/*
* Lemmatizing library for Lucene
* Copyright (C) 2010 Lars Buitinck
*
* This program is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program. If not, see <http://www.gnu.org/licenses/>.
*/

package englishlemma;

import java.io.*;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;
import edu.stanford.nlp.tagger.maxent.TaggerConfig;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;

/**
* An analyzer that uses an {@link EnglishLemmaTokenizer}.
*
* @author Lars Buitinck
* @version 2010.1006
*/
public class EnglishLemmaAnalyzer extends Analyzer {
    private MaxentTagger posTagger;

    /**
     * Construct an analyzer with a tagger using the given model file.
     */
    public EnglishLemmaAnalyzer(String posModelFile) throws Exception {
        this(makeTagger(posModelFile));
    }

    /**
     * Construct an analyzer using the given tagger.
     */
    public EnglishLemmaAnalyzer(MaxentTagger tagger) {
        posTagger = tagger;
    }

    /**
     * Factory method for loading a POS tagger.
     */
    public static MaxentTagger makeTagger(String modelFile) throws Exception {
        TaggerConfig config = new TaggerConfig("-model", modelFile);
        // The final argument suppresses a "loading" message on stderr.
        return new MaxentTagger(modelFile, config, false);
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader input) {
        return new EnglishLemmaTokenizer(input, posTagger);
    }
}


The tokenizer for the lemmatizer:
/*
* Lemmatizing library for Lucene
* Copyright (c) 2010-2011 Lars Buitinck
*
* This program is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program. If not, see <http://www.gnu.org/licenses/>.
*/

package englishlemma;

import java.io.*;
import java.util.*;
import java.util.regex.*;
import com.google.common.collect.Iterables;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.process.Morphology;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

/**
* A tokenizer that retrieves the lemmas (base forms) of English words.
* Relies internally on the sentence splitter and tokenizer supplied with
* the Stanford POS tagger.
*
* @author Lars Buitinck
* @version 2011.0122
*/
public class EnglishLemmaTokenizer extends TokenStream {
    private Iterator<TaggedWord> tagged;
    private PositionIncrementAttribute posIncr;
    private TaggedWord currentWord;
    private TermAttribute termAtt;
    private boolean lemmaNext;

    /**
     * Construct a tokenizer processing the given input and a tagger
     * using the given model file.
     */
    public EnglishLemmaTokenizer(Reader input, String posModelFile)
            throws Exception {
        this(input, EnglishLemmaAnalyzer.makeTagger(posModelFile));
    }

    /**
     * Construct a tokenizer processing the given input using the given tagger.
     */
    public EnglishLemmaTokenizer(Reader input, MaxentTagger tagger) {
        super();

        lemmaNext = false;
        posIncr = addAttribute(PositionIncrementAttribute.class);
        termAtt = addAttribute(TermAttribute.class);

        List<List<HasWord>> tokenized = MaxentTagger.tokenizeText(input);
        tagged = Iterables.concat(tagger.process(tokenized)).iterator();
    }

    /**
     * Consumers use this method to advance the stream to the next token.
     * The token stream emits inflected forms and lemmas interleaved (form1,
     * lemma1, form2, lemma2, etc.), giving lemmas and their inflected forms
     * the same PositionAttribute.
     */
    @Override
    public final boolean incrementToken() throws IOException {
        if (lemmaNext) {
            // Emit a lemma
            posIncr.setPositionIncrement(1);
            String tag = currentWord.tag();
            String form = currentWord.word();
            termAtt.setTermBuffer(Morphology.stemStatic(form, tag).word());
        } else {
            // Emit inflected form, if not filtered out.

            // 0 because the lemma will come in the same position
            int increment = 0;
            for (;;) {
                if (!tagged.hasNext())
                    return false;
                currentWord = tagged.next();
                if (!unwantedPOS(currentWord.tag()))
                    break;
                increment++;
            }

            posIncr.setPositionIncrement(increment);
            termAtt.setTermBuffer(currentWord.word());
        }

        lemmaNext = !lemmaNext;
        return true;
    }

    private static final Pattern unwantedPosRE = Pattern.compile(
        "^(CC|DT|[LR]RB|MD|POS|PRP|UH|WDT|WP|WP\\$|WRB|\\$|\\#|\\.|\\,|:)$"
    );

    /**
     * Determines if words with a given POS tag should be omitted from the
     * index. Defaults to filtering out punctuation and function words
     * (pronouns, prepositions, "the", "a", etc.).
     *
     * @see <a href="http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CQP-HTMLDemo/PennTreebankTS.html">The Penn Treebank tag set</a> used by Stanford NLP
     */
    protected boolean unwantedPOS(String tag) {
        return unwantedPosRE.matcher(tag).matches();
    }
}

Meanwhile, the tokenizer depends on Google Guava for its iterable/array handling, while the lemmatizer depends on the Stanford POS tagger.
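[Editorial note] The token-emission order in the incrementToken() method quoted above can be mirrored in a few lines of Python, which makes the interleaving easier to see. This is a stand-in, not the Lucene API; lemma_of below is a toy substitute for Morphology.stemStatic:

```python
# Python stand-in for the incrementToken() logic quoted above: tokens come
# out interleaved as (term, position_increment) pairs, inflected form then
# lemma, with unwanted-POS words skipped.
import re

# Same filter pattern as the Java unwantedPosRE.
UNWANTED_POS = re.compile(
    r"^(CC|DT|[LR]RB|MD|POS|PRP|UH|WDT|WP|WP\$|WRB|\$|\#|\.|\,|:)$")

def lemma_of(word, tag):
    # Toy lemmatizer for demonstration only: strip a plural "s" from nouns.
    return word[:-1] if tag == "NNS" and word.endswith("s") else word

def interleave(tagged):
    """Yield (term, increment) pairs the way the quoted incrementToken() does."""
    increment = 0
    for word, tag in tagged:
        if UNWANTED_POS.match(tag):
            increment += 1             # count the skipped word
            continue
        yield word, increment          # inflected form
        yield lemma_of(word, tag), 1   # lemma
        increment = 0

print(list(interleave([("the", "DT"), ("cats", "NNS"), ("sleep", "VBP")])))
# → [('cats', 1), ('cat', 1), ('sleep', 0), ('sleep', 1)]
```

Note that "the" (tag DT) is dropped and its position skip is folded into the increment of the next emitted form, exactly as in the Java loop.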

Thanks.
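[Editorial note] Since the analyzer pulls in extra jars (Guava, the Stanford tagger), those jars must be on the classpath handed to the JVM at startup. PyLucene's initVM accepts a classpath argument; the jar file names below are hypothetical, and the snippet only demonstrates building the path string:

```python
# Sketch: assemble a JVM classpath from the PyLucene base classpath plus the
# extra jars the custom analyzer needs. Jar names here are placeholders.
import os

def build_classpath(base, extra_jars):
    # os.pathsep picks the right separator: ';' on Windows, ':' elsewhere.
    return os.pathsep.join([base] + list(extra_jars))

cp = build_classpath("lucene-core-3.6.2.jar",
                     ["guava.jar", "stanford-postagger.jar"])
print(cp)
# In a real PyLucene session you would then call:
#   lucene.initVM(classpath=os.pathsep.join([lucene.CLASSPATH, cp]))
```

Even with the classpath right, every jar must match the Lucene version PyLucene itself was built against (3.6.2 here), or class-loading errors like the one above will persist.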
Alex (JIRA)
2014-11-04 14:29:33 UTC
Permalink
[ https://issues.apache.org/jira/browse/PYLUCENE-32?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174781#comment-14174781 ]

Alex edited comment on PYLUCENE-32 at 11/4/14 2:29 PM:
-------------------------------------------------------

Resolved


Alex (JIRA)
2014-11-04 14:30:33 UTC
Permalink
[ https://issues.apache.org/jira/browse/PYLUCENE-32?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex updated PYLUCENE-32:
-------------------------
Environment: Resolved (was: the original issue description, quoted in full above)
Alex (JIRA)
2014-11-04 14:30:34 UTC
Permalink
[ https://issues.apache.org/jira/browse/PYLUCENE-32?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex closed PYLUCENE-32.
------------------------
Resolution: Fixed
