python-tokenizer: a translation of Python’s tokenize.py library for Racket
Danny Yoo <hashcollision.org>
This is a fairly close translation of the tokenize.py library from Python.
The main function, generate-tokens, consumes an input port and produces a sequence of tokens.
For example:
> (require (planet dyoo/python-tokenizer))
> (define sample-input (open-input-string "def d22(a, b, c=2, d=2, *k): pass"))
> (define tokens (generate-tokens sample-input))
> (for ([t tokens]) (printf "~s ~s ~s ~s\n" (first t) (second t) (third t) (fourth t)))
NAME "def" (1 0) (1 3)
NAME "d22" (1 4) (1 7)
OP "(" (1 7) (1 8)
NAME "a" (1 8) (1 9)
OP "," (1 9) (1 10)
NAME "b" (1 11) (1 12)
OP "," (1 12) (1 13)
NAME "c" (1 14) (1 15)
OP "=" (1 15) (1 16)
NUMBER "2" (1 16) (1 17)
OP "," (1 17) (1 18)
NAME "d" (1 19) (1 20)
OP "=" (1 20) (1 21)
NUMBER "2" (1 21) (1 22)
OP "," (1 22) (1 23)
OP "*" (1 24) (1 25)
NAME "k" (1 25) (1 26)
OP ")" (1 26) (1 27)
OP ":" (1 27) (1 28)
NAME "pass" (1 29) (1 33)
ENDMARKER "" (2 0) (2 0)
1 API
(require (planet dyoo/python-tokenizer:1:=0))
(generate-tokens inp)
 → (sequenceof (list/c symbol? string? (list/c number? number?) (list/c number? number?) string?))
  inp : input-port?
Each token is a list of the form (list token-type text start-pos end-pos current-line), where:
token-type: one of the following symbols: 'NAME, 'NUMBER, 'STRING, 'OP, 'COMMENT, 'NL, 'NEWLINE, 'DEDENT, 'INDENT, 'ERRORTOKEN, or 'ENDMARKER. The only difference between 'NEWLINE and 'NL is that 'NEWLINE occurs only when the indentation level is at 0.
text: the string content of the token.
start-pos: the line and column at which the token starts, as a list of two numbers.
end-pos: the line and column at which the token ends, as a list of two numbers.
current-line: the line of input that the tokenizer is currently processing.
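As an illustration of the shape of each token, here is a small sketch that pulls apart the five elements with match; the input string and variable names are only for illustration:

(require racket/match
         (planet dyoo/python-tokenizer))

;; Destructure each five-element token produced by generate-tokens.
(define inp (open-input-string "x = 42"))
(for ([t (generate-tokens inp)])
  (match t
    [(list type text (list start-line start-col)
                     (list end-line end-col)
                     current-line)
     (printf "~a ~s spans ~a:~a to ~a:~a\n"
             type text start-line start-col end-line end-col)]))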
The last token produced, under normal circumstances, will be 'ENDMARKER.
If a recoverable error occurs, generate-tokens will produce single-character tokens with the 'ERRORTOKEN type until it can recover.
Unrecoverable errors occur when the tokenizer encounters an end-of-file in the middle of a multi-line string or statement, or when an indentation level is inconsistent. On an unrecoverable error, generate-tokens will raise an exn:fail:token or exn:fail:indentation error.
(struct exn:fail:token exn:fail (loc)
    #:extra-constructor-name make-exn:fail:token)
  loc : (list/c number? number?)
(struct exn:fail:indentation exn:fail (loc)
    #:extra-constructor-name make-exn:fail:indentation)
  loc : (list/c number? number?)
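Both exception structures carry a loc with the line and column at which tokenization failed. As a hedged sketch (the unterminated triple-quoted string below is only an illustration of an unrecoverable input), a tokenization pass can be guarded with with-handlers:

(require (planet dyoo/python-tokenizer))

;; Report, rather than propagate, the two unrecoverable tokenizer errors.
(define (tokenize-or-report str)
  (with-handlers ([exn:fail:token?
                   (lambda (e)
                     (printf "token error at ~s\n" (exn:fail:token-loc e)))]
                  [exn:fail:indentation?
                   (lambda (e)
                     (printf "indentation error at ~s\n" (exn:fail:indentation-loc e)))])
    (for ([t (generate-tokens (open-input-string str))])
      (void))))

(tokenize-or-report "s = '''never closed")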
2 Translator Comments
The translation is a fairly direct one; I wrote an auxiliary package to deal with the while loops, which proved invaluable during the translation of the code. It may be instructive to compare the source here to that of tokenize.py.
Here are some points I observed while doing the translation:
Mutation pervades the entirety of the tokenizer’s main loop. The main reason is that Python’s while is a statement: it has no return value and does not carry loop variables, so the loop communicates values from one part of the code to another through mutation, often across wildly distant locations.
Racket makes a syntactic distinction between variable definition (define) and mutation (set!). I had to deduce which variables were intended to be temporaries, and hopefully I have not introduced any errors along the way.
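As a rough illustration (this is not the actual helper package, just the shape of the idea), a minimal while macro together with define and set! is enough to mimic the Python structure:

;; A minimal while form: loop while the test holds, for effect only.
(define-syntax-rule (while test body ...)
  (let loop ()
    (when test
      body ...
      (loop))))

;; Python:  pos = 0
;;          while pos < max: ... ; pos = pos + 1
(define pos 0)            ; define introduces the variable
(define max-pos 5)
(while (< pos max-pos)
  (printf "pos = ~a\n" pos)
  (set! pos (add1 pos)))  ; set! mutates it, as the Python loop body does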
In some cases, Racket has finer-grained type distinctions than Python. Python does not use a separate type to represent individual characters, and instead uses a length-1 string. In this translation, I’ve used characters where I think they’re appropriate.
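For example, where the Python code compares a length-1 string slice, the translation compares an actual character; the string below is only illustrative:

(define s "(a, b)")
;; Python:  s[0] == "("      (a length-1 string comparison)
(char=? (string-ref s 0) #\()     ; => #t, comparing real characters
(string=? (substring s 0 1) "(")  ; the string-level equivalent, also #t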
Most uses of raw strings in Python can be translated to uses of the at-exp reader.
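For instance, a Python raw string such as r'\d+' can be written under the at-exp reader without doubling the backslash; this is only a sketch of the idea:

#lang at-exp racket

;; The braced text is passed through literally, so \d needs no extra escaping.
;; Python:  NUMBER_RE = re.compile(r'\d+')
(define number-re @pregexp{\d+})

(regexp-match number-re "abc123")  ; => '("123")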
Generators in Racket and in Python are pretty similar, though the Racket documentation could do a better job of documenting them.
When dealing with generators in Racket, what one usually wants to produce is a generic sequence. For that reason, the Racket documentation should place more emphasis on in-generator rather than on the raw generator form.
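A small sketch of the difference: wrapping the yielding body in in-generator produces a sequence that for can consume directly, which is usually what callers want:

(require racket/generator)

;; in-generator returns a sequence; yield supplies its elements.
(define (squares-up-to n)
  (in-generator
   (for ([i (in-range n)])
     (yield (* i i)))))

(for ([sq (squares-up-to 4)])
  (displayln sq))   ; prints 0, 1, 4, 9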
Python heavily overloads the in operator. Its expressiveness makes it easy to write code with; on the flip side, its flexibility makes it a little harder to know what a given use actually means.
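For example, a single Python in corresponds to several different Racket operations depending on the operand types; these are common correspondences rather than an exhaustive list:

;; Python: 2 in [1, 2, 3]
(member 2 '(1 2 3))                          ; => '(2 3), a true value
;; Python: "ab" in "xabz"
(regexp-match? (regexp-quote "ab") "xabz")   ; => #t
;; Python: key in some_dict
(hash-has-key? (hash 'a 1 'b 2) 'a)          ; => #t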
Regular expressions, on the whole, match well between the two languages. Minor differences in syntax are potholes: Racket’s regular expression matcher has no implicit anchor at the beginning of the input (unlike Python’s re.match), and Racket’s regexps are more sensitive to escape characters.
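For example, Python's re.match implicitly anchors at the start of the input, while Racket's regexp-match searches anywhere unless told otherwise:

(regexp-match #rx"[0-9]+" "abc123")   ; => '("123"), found mid-string
(regexp-match #rx"^[0-9]+" "abc123")  ; => #f, the explicit ^ mimics re.match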
Python’s regexp engine returns a single match object that supports several accessors. Racket, on the other hand, requires the user to choose between getting the position of the match with regexp-match-positions, or getting the textual content with regexp-match.
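For example, where Python code calls .group() or .span() on one match object, the Racket translation picks one of two functions up front:

(regexp-match #rx"[0-9]+" "abc123def")            ; => '("123")
(regexp-match-positions #rx"[0-9]+" "abc123def")  ; => '((3 . 6))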