Discussion:
Total number of terms in index for collection frequency
Gmehlin Floran
2016-08-18 14:31:19 UTC
Hi,

I am struggling to compute the collection frequency of a term (PyLucene 4.10.1).
So far, I can get the collection count of each term in a document with:

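# Assumes lucene.initVM() has been called and LUCENE_INDEX / docID are defined.
from java.io import File
from org.apache.lucene.index import IndexReader, Term, TermsEnum
from org.apache.lucene.store import SimpleFSDirectory
from org.apache.lucene.util import BytesRefIterator

reader = IndexReader.open(SimpleFSDirectory(File(LUCENE_INDEX)))
termVector = reader.getTermVector(docID, "contents")
termsEnumvar = termVector.iterator(None)
termsref = BytesRefIterator.cast_(termsEnumvar)
cf_dict = {}
try:
    while termsref.next():
        termval = TermsEnum.cast_(termsref)
        fg = termval.term().utf8ToString()
        # collection count: occurrences of the term across all documents
        cf = reader.totalTermFreq(Term("contents", termval.term()))
        cf_dict[fg] = cf
except StopIteration:
    pass  # end of the term enumeration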

I would like cf_dict to hold the relative "frequency" instead of the raw count. For this, I need to divide each count by the total number of term occurrences in the index (all tokens, not only the distinct terms).

Does anyone know how to get this total?

Thank you for your help,

Floran
Dirk Rothe
2016-08-18 20:02:28 UTC
Hi Floran,

We loop over all Lucene documents, apply the appropriate analyzer,
then iterate over the tokens and collect the distinct ones. It is probably
pretty inefficient, but you also get the frequency of each unique token.
That is handy for checking Zipf's law:
https://en.wikipedia.org/wiki/Zipf%27s_law
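
A minimal sketch of that loop, in plain Python so it runs without an index. The `analyze` function is only a stand-in for a real Lucene analyzer, and `collection_frequencies` and the sample `docs` are hypothetical names made up for illustration; in practice the texts would come from the stored fields of each document.

```python
from __future__ import division  # true division on Python 2 as well

import re
from collections import Counter


def analyze(text):
    # Stand-in for a Lucene analyzer (assumption): lowercase the text
    # and split it on runs of non-word characters.
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]


def collection_frequencies(docs):
    """Return (counts, total, rel) for an iterable of document strings."""
    counts = Counter()
    for text in docs:
        counts.update(analyze(text))
    # Total number of token occurrences, not the number of distinct terms.
    total = sum(counts.values())
    # Relative collection frequency: count divided by the total.
    rel = {term: n / total for term, n in counts.items()} if total else {}
    return counts, total, rel


# Hypothetical sample collection.
docs = ["the quick brown fox", "the lazy dog", "The fox"]
counts, total, rel = collection_frequencies(docs)
```

Here `total` is the denominator Floran asked about. As far as I know, Lucene 4.x also exposes `IndexReader.getSumTotalTermFreq("contents")`, which should give the same per-field total without looping over the documents; it returns -1 when the statistic is not stored, so it is worth checking against your own index first.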

--dirk
