Bug #1965
Closed
The strange thing in Iconv under Windows (GBK)
Description
=begin
I have a file encoded in UTF-8; this is its content:
#掉
config
When I read it and then match it with =~/ab/, it raises: ArgumentError: invalid byte sequence in GBK.
There is something strange:
irb> s=IO.readlines('test.utf8').join
=> "#鎺\x89\nconfig"
irb> gbk=Iconv.conv('gbk','utf-8',s)
=> "#掉\nconfig"
irb> utf=Iconv.conv('utf-8','gbk',gbk)
=> "#鎺塡nconfig"
irb> s==utf
=> false # in Ruby 1.8.7 this returns true
irb> s=~/ab/
ArgumentError: invalid byte sequence in GBK
irb> utf=~/ab/
=> nil
my environment:
ruby 1.9.1p129 (2009-05-12 revision 23412) [i386-mswin32]
Windows XP,GBK,chcp=>936
=end
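The error above can be reproduced without Iconv or a GBK console. The following is a minimal sketch, assuming a current Ruby where Encoding::GBK is available; the helper name match_result is made up for illustration. It tags the file's raw bytes as GBK, which is what a default read does on a GBK system, and compares that with tagging them as UTF-8.

```ruby
# Raw bytes of the file: "#", then E6 8E 89 (the UTF-8 encoding of 掉), "\n", "config".
raw = "#\xE6\x8E\x89\nconfig".b

# Tagging the bytes as GBK mimics reading the file with a GBK default encoding.
as_gbk  = raw.dup.force_encoding(Encoding::GBK)
# Tagging them as UTF-8 reflects what the file really contains.
as_utf8 = raw.dup.force_encoding(Encoding::UTF_8)

# In GBK, E6 8E forms one character, but 89 followed by "\n" is not a valid pair.
p as_gbk.valid_encoding?   # false
p as_utf8.valid_encoding?  # true

# Hypothetical helper: return the match result, or the error text if matching raises.
def match_result(str)
  str =~ /ab/
rescue ArgumentError => e
  "#{e.class}: #{e.message}"
end

p match_result(as_gbk)
p match_result(as_utf8)
```

The regexp engine refuses to scan a string whose bytes are invalid in its tagged encoding, which is why only the GBK-tagged copy fails.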
Updated by naruse (Yui NARUSE) about 15 years ago
=begin
This seems to be caused by the iconv library.
Please try another iconv.dll.
=end
Updated by phoenix (junchen wu) about 15 years ago
=begin
Maybe I need to try another iconv.dll to make s==utf return true, but then both s=~/ab/ and utf=~/ab/ will raise ArgumentError: invalid byte sequence in GBK.
I want to read a string from my UTF-8 file and match it against a regexp without raising an error. This works fine on Linux, but not on my GBK Windows system.
=end
Updated by naruse (Yui NARUSE) about 15 years ago
=begin
Oh I see.
You should use s=IO.readlines('test.utf8',:encoding=>'utf-8').join
or s=IO.read('test.utf8',:encoding=>'utf-8')
=end
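As a rough illustration of the suggestion above, the sketch below writes the same bytes to a throwaway file and reads them back with an explicit encoding. The temporary directory and the variable names are assumptions made for the example, not part of the original report.

```ruby
require 'tmpdir'

# Throwaway copy of the reporter's file: "#掉\nconfig" stored as UTF-8 bytes.
path = File.join(Dir.mktmpdir, 'test.utf8')
File.binwrite(path, "#\xE6\x8E\x89\nconfig".b)

# Pass the encoding explicitly instead of relying on the platform default.
s     = IO.read(path, :encoding => 'utf-8')
lines = IO.readlines(path, :encoding => 'utf-8').join

# The same can be expressed in the open mode string.
via_mode = File.open(path, 'r:utf-8') { |f| f.read }

p s.encoding         # the string is tagged as UTF-8
p s.valid_encoding?  # true
p(s =~ /ab/)         # nil: no match, and no ArgumentError
```

All three reads yield equal strings tagged as UTF-8, regardless of the console code page.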
Updated by phoenix (junchen wu) about 15 years ago
=begin
Thanks so much, it works fine now!
Is there some setting to make IO read all files using :encoding=>'utf-8' by default, or should IO check the file encoding and set it automatically before reading?
Rails reads files using File.read(); if :encoding=>'utf-8' must be added to every file read, there will be a lot of work to do ;-)
Sorry for my poor knowledge of Ruby usage, and thanks for your patience!
=end
Updated by naruse (Yui NARUSE) about 15 years ago
- Status changed from Open to Closed
=begin
> Is there some setting to make the IO read all files using :encoding=>'utf-8' by default
Encoding.default_external gives the default.
Rails may use this and set it to UTF-8, so you shouldn't change it yourself.
The following give detailed information:
http://yokolet.blogspot.com/2009/07/design-and-implementation-of-ruby-m17n.html
http://blog.grayproductions.net/articles/understanding_m17n
http://github.com/candlerb/string19/blob/361c7d9acf1745006fb3f35e94a1ee844d0bff07/string19.rb
> should the IO check the file encoding and auto set this before read it?
'EncDet' is meant for this, but it is not merged yet because of a naming problem.
The following are written in Japanese, but you can see the candidate names.
If you have a good name, please suggest it.
http://redmine.ruby-lang.org/issues/show/973
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-dev/33628
Anyway, if you know the encoding of a file, specifying it explicitly is the safest way.
=end
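To show how Encoding.default_external affects a read that gives no encoding, here is a small sketch. It temporarily sets the default to GBK only to simulate the reporter's Windows setup and restores it afterwards; it is not a recommendation to change the default in application code, and the file path and variable names are assumptions for the example.

```ruby
require 'tmpdir'

# Throwaway UTF-8 file with the content from the report.
path = File.join(Dir.mktmpdir, 'test.utf8')
File.binwrite(path, "#\xE6\x8E\x89\nconfig".b)

original = Encoding.default_external
begin
  # Simulate a GBK system (code page 936) for the duration of this block only.
  Encoding.default_external = Encoding::GBK

  implicit = File.read(path)                       # tagged with default_external
  explicit = File.read(path, :encoding => 'utf-8') # tagged as requested
ensure
  # Always restore the process-wide default.
  Encoding.default_external = original
end

p implicit.encoding         # GBK, taken from default_external
p implicit.valid_encoding?  # false: the UTF-8 bytes are not valid GBK
p explicit.encoding         # UTF-8
p explicit.valid_encoding?  # true
```

The explicit read is unaffected by the default, which matches the advice that naming the encoding is the safest approach when it is known.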