Bug #5278

REXML -- Malformed comment

Added by tbf (Thomas Fritzsche) about 9 years ago. Updated almost 8 years ago.

Status: Closed
Priority: Normal
Target version:
ruby -v: ruby 1.9.2p290 (2011-07-09 revision 32553) [x86_64-darwin11.1.0]
Backport:
[ruby-core:39289]

Description

Hi Ruby-Team,

I use the REXML library for XML parsing, with the Kanjidic2 XML file: http://www.csse.monash.edu.au/~jwb/kanjidic2/ (I did not attach the file because it is too large.)
It works with version 1.8.7, but with 1.9.2 a ParseException ("Malformed comment") is raised in lib/rexml/parsers/baseparser.rb.

My code looks like this:

require 'rexml/document'
require 'rexml/streamlistener'
class KanjiListener
  include REXML::StreamListener
end

f = File.new("kanji.xml","rb")
list = KanjiListener.new

REXML::Document.parse_stream(f, list)

The XML file from the link above has a comment section that looks like this:

...
<!-- Version 1.6 - April 2008
This is the DTD of the XML-format kanji file combining information from
the KANJIDIC and KANJD212 files. It is intended to be largely self-
documenting, with each field being accompanied by an explanatory
comment.
-->
...

Strangely, the parser fails at "self-", the hyphen that ends the line before "documenting".

The issue comes up here (about line 345):
...
if md[0][2] == ?-
  md = @source.match( COMMENT_PATTERN, true )

  case md[1]
  when /--/, /-$/
    raise REXML::ParseException.new("Malformed comment", @source)
  end

...

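The MatchData md[1] contains the complete comment, and then the regular expression /-$/ matches. In Ruby, $ matches at the end of every line, not only at the end of the string, so the "-" directly before the newline in "self-" is enough to trigger the error.
From debugging, I guess the original buffer is read with "readline" and somehow still includes the end-of-line markers.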
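
The behaviour of /-$/ on a multi-line comment can be checked in isolation, without REXML. The following is only a minimal sketch: the string is a shortened copy of the comment text, and the \z variant is shown for comparison as an assumption about what the check may have intended, not as the actual fix applied to REXML.

```ruby
# Shortened copy of the comment body, including the line break after "self-".
comment = "It is intended to be largely self-\ndocumenting, with each field\n"

# `$` anchors at the end of any line, so the "-" before the newline matches.
line_anchored = !!(comment =~ /-$/)

# `\z` anchors only at the very end of the string (assumed intent of the check).
string_anchored = !!(comment =~ /-\z/)

puts "/-$/ matches:   #{line_anchored}"
puts "/-\\z/ matches: #{string_anchored}"
```

With this input the first pattern matches and the second does not, which is consistent with the parser rejecting a comment that is otherwise well-formed.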

I tried opening the original file with different newline parameters, but nothing helped. I also tried different Ruby versions (including today's 1.9.3-head); all of 1.9 seems to have the problem, while 1.8 works.
In the meantime I have switched to the Nokogiri XML parser, which works without problems on 1.9.x, and I would expect REXML to be able to parse this file too. As a test, I changed a single character in the file so that /-$/ no longer matches "self-" in the original XML, and then it works.

Thank you in advance for your help.