[PATCH] jcc: add python3 support
Ruediger Meier
2017-03-09 13:34:46 UTC
Hi,

I did some work to port jcc to python3, see
https://github.com/rudimeier/jcc

There are two interesting branches, py2 and py3:
py2 should still work for python2 >= 2.7 without any behavior change;
py3 completes experimental python3 support (but is still incompatible
with python2).


Regarding the py2 branch:
- fixes many (but not yet all) python3 incompatibilities
- should still work for python2 >= 2.7 without any behavior change
- passes lucene's test suite completely (pylucene-4.10.1/java-1.7
and pylucene-6.4.1/java-1.8)
- removes compatibility for python < 2.7 (though it would not be hard
to keep it)
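
For readers unfamiliar with this kind of porting work, the following is a minimal sketch of code that runs unchanged on both python 2.7 and python 3. It is not taken from the jcc patches; the names PY3, text_type, describe and safe_int are hypothetical and only illustrate the general pattern.

```python
import sys

PY3 = sys.version_info[0] >= 3

# Python 3 has no 'unicode' builtin, so pick the text type once.
if PY3:
    text_type = str
else:
    text_type = unicode  # noqa: F821 (only evaluated on python 2)


def describe(value):
    """Return 'text', 'bytes' or 'other' on either python version."""
    if isinstance(value, text_type):
        return "text"
    if isinstance(value, bytes):  # 'bytes' is an alias of 'str' on 2.7
        return "bytes"
    return "other"


def safe_int(s):
    """Parse an int, using the 'except ... as e' form valid on 2.7 and 3."""
    try:
        return int(s)
    except ValueError as e:  # 'except ValueError, e' would fail on python 3
        del e
        return None
```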


Regarding the py3 branch:
- based on py2, it adds a few more patches for python3 support (> 3.1),
but still in a way that is incompatible with python2
- still fails about 25% of pylucene tests
- Note: for testing, I ported pylucene itself to python3 only trivially,
with something like '2to3 -w $(find -name "*.py")'


Please comment on this. I'd like to get python3 support upstream.

Some more notes:

My patches were inspired by the old python-3.1 port
http://svn.apache.org/repos/asf/lucene/pylucene/branches/python_3/jcc/

I've refactored/rebased it and split the huge patches into many
smaller ones while keeping python2 support. Almost all commits in
my py2 branch are trivial to review and independent of each other. So
I would be glad if you would merge as many of them as you like.

Since the py3 branch is still not 100% correct (I guess there are still
some bytes/unicode problems), I would be glad if somebody would help to
get it running.


Cheers,
Rudi
Andi Vajda
2017-03-09 16:07:22 UTC
Thank you for your contribution. I have not looked at it yet but you're now the second contributor with python3 support. The one thing I suspect is missing is proper python 3.x (x > 3 ?) <-> java string conversions. In these versions of python the internal string representation was changed to be more clever about how many bytes to use per unicode char based on the data of the string. One would want to take advantage of that to minimize conversions between both languages. Support for earlier versions of python 3 is irrelevant and not necessary.
You can take a look at the PyICU sources (github) for the kinds of conversion functions I'm referring to.
(function PyUnicode_FromUnicodeString and reverse in common.cpp: https://github.com/ovalhub/pyicu/blob/master/common.cpp)
Also, support for python 2 is not necessary in the new branch, as python 2 is being retired in a few years.
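
To illustrate the string-representation point, here is a minimal pure-python sketch. It assumes the change referred to is the flexible string representation of PEP 393 (CPython 3.3 and later), and it only models the idea; it does not use the C API that jcc or PyICU would use. The function names storage_width and java_char_count are hypothetical.

```python
def storage_width(s):
    """Bytes per character that CPython 3.3+ picks for s under PEP 393."""
    highest = max(map(ord, s)) if s else 0
    if highest < 0x100:
        return 1  # every code point fits in Latin-1
    if highest < 0x10000:
        return 2  # every code point fits in the Basic Multilingual Plane
    return 4      # at least one code point needs the full 4 bytes


def java_char_count(s):
    """Number of UTF-16 code units (Java 'char' values) needed for s."""
    return len(s.encode("utf-16-le")) // 2


for sample in ("abc", "caf\u00e9", "\u20ac", "\U0001F600"):
    print(ascii(sample), storage_width(sample), java_char_count(sample))
```

Strings of width 1 or 2 map one-to-one onto Java chars, so a converter can widen or copy them directly; only width-4 strings need surrogate pairs, which is where the two lengths differ.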

I have no time right now to spend quality time on jcc/python3 support but it's been increasingly on my mind lately and I hope to spend some time on this soon. At that point, I'll take a look at your patches and the other contributor's (see list archives) as well.

Many thanks for your contribution !

Andi..
Rüdiger Meier
2017-03-11 03:12:27 UTC
Post by Andi Vajda
Thank you for your contribution. I have not looked at it yet but
you're now the second contributor with python3 support.
Ah, this one was hard to find. It looks very similar to mine and has
some nice build fixes, which I imported too. Otherwise, my fork already
addresses some more issues now.
Post by Andi Vajda
The one thing
I suspect is missing is proper python 3.x (x > 3 ?) <-> java string
conversions. In these versions of python the internal string
representation was changed to be more clever about how many bytes to
use per unicode char based on the data of the string. One would want
to take advantage of that to minimize conversions between both
languages.
Support for earlier versions of python 3 is irrelevant and
not necessary.
So maybe you could benefit from the py2 branch I mentioned, which
already carefully removes python < 2.7 support and handles all the
trivial but labour-intensive tasks on the way to python3.
Post by Andi Vajda
You can take a look at the PyICU sources (github) for
the kinds of conversion functions I'm referring to. (function
PyUnicode_FromUnicodeString and reverse in common.cpp:
https://github.com/ovalhub/pyicu/blob/master/common.cpp)
Thanks, I will have a look at it.

Actually, I now have all test-suite problems fixed; most of the failures
were caused by python3 incompatibilities in pylucene itself.

BUT some more interesting questions arose ... about the jcc interface.
Speaking in examples:

1.
Should JArray('byte')("x") still work in python3 or should we require
JArray('byte')(b"x")? Currently JArray('byte')(U"x") is not supported.

2.
Currently JArray_byte.string_() returns an object of class str on
python2, which is effectively "bytes". Should we keep it that way? The
original old python3 port added a new bytes_() method to get "bytes" and
changed string_() to return unicode. This looks reasonable but is
incompatible for users. We could instead keep the string_() function as
is and add a new unicode_() method.

I've already implemented both alternatives for 1. and 2.; I just need a
decision on which way we want to go.
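
The two questions can be sketched in pure python 3 as follows. This is only a model of the alternatives, not jcc code: to_java_bytes, the accept_text flag, FakeByteArray, OldPortStyle and CompatStyle are hypothetical names, and the use of UTF-8 for the text conversions is an assumption.

```python
def to_java_bytes(value, accept_text=False):
    """Convert a python 3 value to signed Java byte values (-128..127)."""
    if isinstance(value, str):
        if not accept_text:
            # Question 1, strict variant: only b"x" is accepted.
            raise TypeError("expected bytes, got str; pass b'...' instead")
        value = value.encode("utf-8")  # assumption: UTF-8 for implicit text
    if not isinstance(value, (bytes, bytearray)):
        raise TypeError("expected bytes or bytearray")
    return [b - 256 if b > 127 else b for b in value]


class FakeByteArray(object):
    """Pure-python stand-in for a Java byte[]; not the real JArray class."""

    def __init__(self, value, accept_text=False):
        self._data = to_java_bytes(value, accept_text)

    def _raw(self):
        return bytes(b & 0xFF for b in self._data)


class OldPortStyle(FakeByteArray):
    """Question 2, first alternative: string_() returns text."""

    def bytes_(self):
        return self._raw()

    def string_(self):
        return self._raw().decode("utf-8")


class CompatStyle(FakeByteArray):
    """Question 2, second alternative: string_() keeps returning bytes."""

    def string_(self):
        return self._raw()

    def unicode_(self):
        return self._raw().decode("utf-8")
```

With the second alternative, existing callers of string_() keep getting bytes on both python versions, at the cost of a method name that no longer matches the python3 meaning of "string".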

cu,
Rudi
Ruediger Meier
2017-03-14 20:11:02 UTC
FYI I have finished the jcc port and pylucene as well.

jcc here: https://github.com/rudimeier/jcc
(works for both python 2 and 3)

pylucene here: https://github.com/rudimeier/pylucene


Note that the pylucene repository actually contains 3 ported versions
(3.x, 4.x and 6.x) in different branches. All tests succeed (tested on
Linux). Only a few minor manual changes were needed (besides running the
2to3 script).

cu,
Rudi
Andi Vajda
2017-03-14 21:45:36 UTC
Thank you !

Andi..