Feature #3595
There's no encoding to differentiate a stream of binary data from an 8-bit ASCII string
Status: Closed
Added by dreamcat4 (Dreamcat Four) over 14 years ago. Updated over 13 years ago.
Description
=begin
Hi,
It would help if we could have a separate encoding for binary strings, to mark them as binary data and not lump them in with ASCII.
Why can't we do this?
As things stand, I have to re-open String and add an attribute for that. It's silly.
=end
Updated by runpaint (Run Paint Run Run) over 14 years ago
=begin
The encoding name 'ASCII-8BIT' is a bit of a misnomer (pun unintentional), as it has little to do with ASCII. It indicates, in effect, the absence of an encoding, whereby one byte always constitutes one character and all byte sequences are valid. Therefore, it is precisely the right encoding to associate with binary data, and it even has 'BINARY' as an alias. Strings containing ASCII text, i.e. characters in the range 0-127, should be associated with the 'US-ASCII' encoding.
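For example, on 1.9 one would expect something like this (a rough irb sketch; output approximate):
irb(main):001:0> Encoding.find('BINARY')
=> #<Encoding:ASCII-8BIT>
irb(main):002:0> [0x00, 0xff, 0x42].pack('C*').encoding
=> #<Encoding:ASCII-8BIT>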
=end
Updated by naruse (Yui NARUSE) over 14 years ago
- Status changed from Open to Rejected
=begin
For an octet stream, use ASCII-8BIT; it means an ASCII-compatible octet string.
If it were not ASCII compatible, we couldn't simply write something like "\x00\x00".force_encoding("ASCII-8BIT") == "\x00".
So it is intentional that the name 'ASCII-8BIT' contains the term 'ASCII'.
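For instance, because ASCII-8BIT is ASCII compatible, a comparison against a plain ASCII literal can be expected to just work, whereas a non-ASCII-compatible encoding (UTF-16LE is used below purely as an example) would not (rough sketch):
irb(main):001:0> "GIF89a".force_encoding("ASCII-8BIT")[0, 3] == "GIF"
=> true
irb(main):002:0> "GIF".encode("UTF-16LE") == "GIF"
=> false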
=end
Updated by dreamcat4 (Dreamcat Four) over 14 years ago
=begin
Sorry, but I asked for an encoding for BINARY data. Not 'octets'. Not 'ASCII-anything'. You seem to have misunderstood.
01101110110 ?
11011110 ?
Think about it: the correct encoding BINARY should only ever be a numerical one, which isn't represented anywhere in Encoding.list.
=end
Updated by jballanc (Joshua Ballanco) over 14 years ago
=begin
In the world of encodings, a String in Ruby is an array of bytes, and those bytes may represent a character, part of a multibyte character, or an invalid character. Which of those three possibilities applies depends on the byte and the encoding. For a 'US-ASCII' String there will never be multibyte characters, so the only options are that a byte is a valid or an invalid character. In 'US-ASCII', any byte from 0b00000000 to 0b01111111 (0x00 to 0x7F) is a valid character, and any other byte is invalid. For an 'ASCII-8BIT' String, there are no multibyte characters and there are no invalid characters. Therefore, any byte you put into an 'ASCII-8BIT' string can be retrieved unmolested.
I think you may misunderstand what an encoding is. When you say that the correct encoding should "only ever be a numerical one", well, bytes are always numerical, and Strings are arrays of bytes. An encoding just says what those numbers should mean, and 'ASCII-8BIT' just says that they mean what they are: bytes. If, on the other hand, you are looking to have a String where the characters themselves are restricted to '1' and '0', well, in what encoding? Keep in mind that in 'US-ASCII' encoding '0' is actually 0b00110000 (0x30) and '1' is actually 0b00110001 (0x31).
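A tiny illustration of that difference (behaviour as expected on 1.9):
irb(main):001:0> "\x80".force_encoding("US-ASCII").valid_encoding?
=> false
irb(main):002:0> "\x80".force_encoding("ASCII-8BIT").valid_encoding?
=> true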
=end
Updated by dreamcat4 (Dreamcat Four) over 14 years ago
=begin
Let me re-phrase this another way:
It's simply a very poor assumption to say that the encoding "BINARY" is an alias of ASCII-8BIT.
It's true that Ruby would store the string as an octet stream, identically to an 8-bit ASCII octet stream. However, the interpretation of the string in a running program is entirely different. In the one case, you're telling the program that the octet stream can be displayed as 8-bit ASCII data (or can be transcoded to some other human-readable encoding).
In the other case, a binary string is always only an octet stream of binary data. It has a real and unique encoding which is not connected to the mappable domain of the other encodings.
In other words, 8-bit ASCII is a mapping of glyphs. Raw binary data has no such ASCII glyphs and no mapping to those ASCII glyphs. If you think implementing a numerical conversion is unnecessary, then fine, don't implement one. A string of unencoded binary data can always be transcoded to "" (an empty string). Then there's no requirement to map the bits (or a hexadecimal representation) numerically onto an ASCII table.
=end
Updated by dreamcat4 (Dreamcat Four) over 14 years ago
=begin
And let's face it, if Encoding::BINARY were its own separate encoding, that shouldn't really hurt anybody, given the definition of what binary data is. If a Ruby programmer wants to continue using 8-bit ASCII strings, that shouldn't get broken. Looking at it from that (my) perspective, there's not really much of an obvious downside.
You should only ever display a BINARY-encoded string with String#force_encoding(), IMHO.
=end
Updated by spatulasnout (B Kelly) over 14 years ago
=begin
Dreamcat Four wrote:
> It's simply a very poor assumption to say that the encoding "BINARY" is
> an alias of ASCII-8BIT.
>
> It's true that Ruby would store the string as an octet stream, identically
> to an 8-bit ASCII octet stream. However, the interpretation of the string
> in a running program is entirely different. In the one case, you're telling
> the program that the octet stream can be displayed as 8-bit ASCII data
> (or can be transcoded to some other human-readable encoding).
>
> In the other case, a binary string is always only an octet stream of
> binary data. It has a real and unique encoding which is not connected to
> the mappable domain of the other encodings.
>
> In other words, 8-bit ASCII is a mapping of glyphs. Raw binary data has
> no such ASCII glyphs and no mapping to those ASCII glyphs. If you think
> implementing a numerical conversion is unnecessary, then fine, don't
> implement one. A string of unencoded binary data can always be
> transcoded to "" (an empty string). Then there's no requirement to map
> the bits (or a hexadecimal representation) numerically onto an ASCII
> table.
My understanding is that ASCII-8BIT exists in recognition of
the fact that, over the past few decades, it has been common for
binary file formats to contain ASCII tags.
$ hexdump -C q2dm1.bsp
00000000 49 42 53 50 26 00 00 00 bc df 0e 00 7a 32 00 00 |IBSP&.......z2..|
00000010 a0 00 00 00 20 bc 00 00 d8 b2 01 00 d0 cb 00 00 |.... ...........|
$ hexdump -C 5th.wav
00000000 52 49 46 46 8e c8 08 00 57 41 56 45 66 6d 74 20 |RIFF....WAVEfmt |
00000010 10 00 00 00 01 00 01 00 44 ac 00 00 88 58 01 00 |........D....X..|
00000020 02 00 10 00 64 61 74 61 18 c8 08 00 00 00 00 00 |....data........|
$ hexdump -C quake062-ware1.jpg
00000000 ff d8 ff e0 00 10 4a 46 49 46 00 01 02 01 00 48 |......JFIF.....H|
00000010 00 48 00 00 ff ee 00 0e 41 64 6f 62 65 00 64 80 |.H......Adobe.d.|
$ hexdump -C logo_a.bmp
00000000 42 4d 36 24 00 00 00 00 00 00 36 04 00 00 28 00 |BM6$......6...(.|
00000010 00 00 80 00 00 00 40 00 00 00 01 00 08 00 00 00 |......@.........|
$ hexdump -C pak0.pak
00000000 50 41 43 4b 62 5a f4 0a c0 3a 03 00 0a 05 01 08 |PACKbZ...:......|
00000010 00 00 00 00 ff 00 ff 00 40 01 c8 00 00 00 00 0f |........@.......|
$ hexdump -C q2source-3.21.zip
00000000 50 4b 03 04 14 00 02 00 08 00 ca 9e 96 2b fe 68 |PK...........+.h|
00000010 a4 9a b4 09 00 00 b4 15 00 00 1c 00 00 00 71 75 |..............qu|
$ hexdump -C quake2.exe
00000000 4d 5a 90 00 03 00 00 00 04 00 00 00 ff ff 00 00 |MZ..............|
00000010 b8 00 00 00 00 00 00 00 40 00 00 00 00 00 00 00 |........@.......|
$ hexdump -C quake2
00000000 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 04 00 |.ELF............|
00000010 02 00 03 00 01 00 00 00 c0 9c 04 08 34 00 00 00 |............4...|
In practical terms, it's not clear to me that there's much benefit
in having a separate binary encoding which explicitly denies
ASCII compatibility.
It's true that there are plenty of binary formats lacking any
ASCII tags in their structure. But the ability to parse such
formats is not being hampered by ASCII-8BIT's ability to
(optionally) treat the data like ASCII.
Regards,
Bill
=end
Updated by naruse (Yui NARUSE) over 14 years ago
=begin
Bill has explained the reason.
Please show us your use case;
I can't understand what is troubling you.
=end
Updated by dreamcat4 (Dreamcat Four) over 14 years ago
=begin
Hi,
Well, I was unaware of this. In that case the argument Bill makes can be seen as an issue. Reading a file with the IO object would read the ASCII tags, and you wouldn't know what to do. The tags map to both 7-bit ASCII and 8-bit ASCII anyway.
It seems that the correct thing to do when reading a file through an IO object is to set the encoding to Encoding::BINARY and ignore the ASCII tags, unless the ASCII tag says it's a text file, in which case set the encoding to ASCII. That's pretty easy, really.
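For what it's worth, reading in binary mode already tags the result that way (a quick sketch; "foo.gif" is just a placeholder file name):
irb(main):001:0> IO.binread("foo.gif").encoding
=> #<Encoding:ASCII-8BIT>
irb(main):002:0> File.open("foo.gif", "rb") { |f| f.read(3) }.encoding
=> #<Encoding:ASCII-8BIT>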
What prompted me to report this:
Translating data from a Ruby Hash object and simple Ruby types into a Plist representation, to give users a standard and appropriate way to differentiate between their Ruby strings which are textual (ASCII or Unicode) and their persistent binary data. A StringIO object is clearly not intended to represent a stream of binary data, since you have declared a specific Encoding::BINARY.
There is simply no compelling argument for why Encoding::BINARY should be an alias of 8-bit ASCII.
=end
Updated by naruse (Yui NARUSE) over 14 years ago
=begin
2010/7/22 Dreamcat Four redmine@ruby-lang.org:
> Well, I was unaware of this. In that case the argument Bill makes can be seen as an issue.
> Reading a file with the IO object would read the ASCII tags, and you wouldn't know what to do.
> The tags map to both 7-bit ASCII and 8-bit ASCII anyway.
That is to say, people often treat part of a binary string as ASCII.
> It seems that the correct thing to do when reading a file through an IO object is to set
> the encoding to Encoding::BINARY and ignore the ASCII tags.
> Unless the ASCII tag says it's a text file, then set the encoding to ASCII.
> That's pretty easy, really.
Ruby is a practical language; we want to write
ruby -e'p IO.binread("foo.gif")[0,3]=="GIF"'
and don't want to write
ruby -e'p IO.binread("foo.gif")[0,3].force_encoding("US-ASCII")=="GIF"'
> What prompted me to report this:
> Translating data from a Ruby Hash object and simple Ruby types into a Plist representation,
> to give users a standard and appropriate way to differentiate between their Ruby strings
> which are textual (ASCII or Unicode) and their persistent binary data.
> A StringIO object is clearly not intended to represent a stream of binary data,
> since you have declared a specific Encoding::BINARY.
> There is simply no compelling argument for why Encoding::BINARY should be an alias of 8-bit ASCII.
There are three kinds of encoding in your case: Unicode, ASCII, and BINARY.
You could map them as follows (see the sketch below):
Unicode: the related Unicode encoding
ASCII: US-ASCII
BINARY: ASCII-8BIT
Or does something bad happen when you map them this way?
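For instance, something along these lines (the method and node names below are made up for illustration, not an existing API):
require 'base64'

def plist_node_for(str)
  case str.encoding
  when Encoding::ASCII_8BIT                  # a.k.a. Encoding::BINARY
    [:data, Base64.encode64(str)]            # raw bytes -> <data> (base64)
  when Encoding::US_ASCII
    [:string, str]
  else
    [:string, str.encode("UTF-8")]           # any other text -> <string> as UTF-8
  end
end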
--
NARUSE, Yui
naruse@airemix.jp
=end
Updated by spatulasnout (B Kelly) over 14 years ago
=begin
Dreamcat Four wrote:
> It seems that the correct thing to do when reading a file through an
> IO object is to set the encoding to Encoding::BINARY and ignore the
> ASCII tags, unless the ASCII tag says it's a text file, in which case
> set the encoding to ASCII. That's pretty easy, really.
But one doesn't want to ignore the tags when they denote the
structure of the file.
Here's an excerpt from a simple WAV file parser I had written
several years ago while using ruby 1.8.4, which still works on
1.9.2 thanks to ASCII-8BIT.
class WAVParseError < StandardError; end
class NotRIFFFormat < WAVParseError; end
class NotWAVEFormat < WAVParseError; end

def read_chunk_header(file)
  chunk_name = file.read(4)
  len = file.read(4).unpack("V").first
  [chunk_name, len]
end

def parse_wav(file)
  riff, riff_len = read_chunk_header(file)
  raise NotRIFFFormat unless riff == 'RIFF'
  riff_end = file.tell + riff_len
  wave = file.read(4)
  raise NotWAVEFormat unless wave == 'WAVE'
  while file.tell < riff_end
    chunk_name, len = read_chunk_header(file)
    fpos = file.tell
    yield file, chunk_name, len if block_given?
    file.seek(fpos + len)
  end
end

if $0 == __FILE__
  # by way of example, just print the chunk names and lengths
  ARGV.each do |fname|
    File.open(fname, "rb") do |io_|
      puts fname
      begin
        parse_wav(io_) do |io, chunk_name, len|
          puts "%4s %08x" % [chunk_name, len]
        end
      rescue StandardError => ex
        warn "error: #{ex.message}"
      end
    end
  end
end
~~~~~~~~~~~~~~~~~~~~~~~~~
$ ruby -v parse_wav.rb m:/snd/startrek/trezap.wav
ruby 1.8.4 (2005-12-24) [i386-mswin32]
m:/snd/startrek/trezap.wav
fmt 00000010
data 0000b9f1
LIST 00000058
cue 0000001c
LIST 00000038
$ ruby19 -v parse_wav.rb m:/snd/startrek/trezap.wav
ruby 1.9.2dev (2010-07-06) [i386-mswin32_100]
m:/snd/startrek/trezap.wav
fmt 00000010
data 0000b9f1
LIST 00000058
cue 0000001c
LIST 00000038
The above just lists the chunks; but an extended version of
the parser decided whether to parse certain chunks in more
detail with logic like the following:
case chunk_name
when 'fmt ' then handle_fmt_chunk(io, len)
when 'data' then handle_data_chunk(io, len)
end
So we definitely don't wish to ignore the chunk names.
> What prompted me to report this:
>
> Translating data from a Ruby hash object and simple Ruby types into
> a Plist representation. To give users a standard and appropriate
> way to differentiate between their Ruby strings which are either
> textual (ascii or unicode), and their persistent binary data.
Could you use Encoding::ASCII instead of ASCII-8BIT in this case,
to differentiate between ascii vs. binary?
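For instance, something like this (just a sketch) keeps the two cases distinguishable by encoding alone:
ascii_str  = "GIF".encode("US-ASCII")
binary_str = "\xff\xd8\xff".force_encoding("ASCII-8BIT")
ascii_str.encoding   # => #<Encoding:US-ASCII>
binary_str.encoding  # => #<Encoding:ASCII-8BIT>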
Regards,
Bill
=end
Updated by jballanc (Joshua Ballanco) over 14 years ago
=begin
On Jul 21, 2010, at 11:26 PM, Dreamcat Four wrote:
> Issue #3595 has been updated by Dreamcat Four.
>
> Hi,
> Well, I was unaware of this. In that case the argument Bill makes can be seen as an issue. Reading a file with the IO object would read the ASCII tags, and you wouldn't know what to do. The tags map to both 7-bit ASCII and 8-bit ASCII anyway.
Keep in mind, there isn't really any such thing as 8-bit ASCII. The ASCII standard only defines 128 characters. This is why ASCII-8BIT == BINARY. The "ASCII" part implies that there are no multibyte characters, the "8BIT" implies that there are no invalid character bytes (even though there are only, strictly speaking, 7 bits worth of meaningful characters).
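Which is also how 1.9 itself reports it (sketch; output approximate):
irb(main):001:0> Encoding::BINARY == Encoding::ASCII_8BIT
=> true
irb(main):002:0> Encoding::ASCII_8BIT.names
=> ["ASCII-8BIT", "BINARY"]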
> What prompted me to report this:
> Translating data from a Ruby hash object and simple Ruby types into a Plist representation. To give users a standard and appropriate way to differentiate between their Ruby strings which are either textual (ascii or unicode), and their persistent binary data.
I'm assuming that you're referring to OS X's plists? And I'm also assuming that you want to differentiate between dumping <string> nodes versus <data> nodes, yes? If so, then I'm afraid that encoding is not your issue. The issue is that Ruby (ab)uses String both for "strings" in the traditional sense and as "just someplace to put an arbitrary array of bytes".
This is something that I've been dealing with a lot lately, and I'd been meaning to put forward a more thought-out proposal. In Cocoa, there is NSString and friends for traditional strings, and NSData for arbitrary-length data. I think a divide like this could be useful for Ruby in the future. In particular, as I said before, the issue is not specific encodings; it's that encodings don't mean anything for binary data.
Think of it this way: What are the contents of a "String" object? Characters, right? The fact that these characters are stored as bytes is an implementation detail, and one that, if we could all just move toward a common encoding like Unicode, most programmers could forget about. Even today, the only reason you would need to know anything about encodings is if you were consuming characters from a source outside your control. More importantly, though, is that if you change the bytes as a consequence of a change to the encoding, the content of the String (i.e. the array of characters it represents) does not change. On the other hand, if you're overloading a String to store bytes, then a change in encoding is catastrophic. That's because what you were really storing wasn't a "String".
To get to the root of your problem: if you are using RubyCocoa or MacRuby, NSData is there for you (and you'll probably get improved performance to boot). If you want to do something in pure Ruby to generate plists with <data> nodes, I would suggest creating a custom "Data" class for this purpose, and leaving the topic of encodings alone.
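A minimal sketch of such a wrapper (the class and method names here are invented for illustration, not an existing library; plist <data> nodes carry base64 text):
require 'base64'

# Hypothetical wrapper that marks a blob of bytes as "data", not text.
class BinaryData
  attr_reader :bytes

  def initialize(bytes)
    @bytes = bytes.dup.force_encoding(Encoding::ASCII_8BIT)
  end

  def to_plist_node
    "<data>#{Base64.encode64(@bytes).strip}</data>"
  end
end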
Cheers,
Josh
=end
Updated by jballanc (Joshua Ballanco) over 14 years ago
=begin
On Jul 21, 2010, at 11:41 AM, Dreamcat Four wrote:
> Issue #3595 has been updated by Dreamcat Four.
>
> And let's face it, if Encoding::BINARY were its own separate encoding, that shouldn't really hurt anybody, given the definition of what binary data is. If a Ruby programmer wants to continue using 8-bit ASCII strings, that shouldn't get broken.
Oh, one more thing... there is no such thing as an ASCII-8BIT string that isn't data. The high-bit values don't map to any characters. There are derivative encodings that use these values for other characters, but they have their own names, like "ISO8859_2". If someone wants an ASCII string, they'll use US-ASCII. I'd think it's safe to assume that ASCII-8BIT always implies binary data.
=end
Updated by jballanc (Joshua Ballanco) over 14 years ago
=begin
On Jul 22, 2010, at 1:46 AM, Joshua Ballanco wrote:
> On Jul 21, 2010, at 11:41 AM, Dreamcat Four wrote:
>> Issue #3595 has been updated by Dreamcat Four.
>>
>> And let's face it, if Encoding::BINARY were its own separate encoding, that shouldn't really hurt anybody, given the definition of what binary data is. If a Ruby programmer wants to continue using 8-bit ASCII strings, that shouldn't get broken.
>
> Oh, one more thing... there is no such thing as an ASCII-8BIT string that isn't data. The high-bit values don't map to any characters. There are derivative encodings that use these values for other characters, but they have their own names, like "ISO8859_2". If someone wants an ASCII string, they'll use US-ASCII. I'd think it's safe to assume that ASCII-8BIT always implies binary data.
To illustrate the point:
$ irb1.9
irb(main):001:0> "\xAA".force_encoding('ASCII-8BIT').encode('UTF-8')
Encoding::UndefinedConversionError: "\xAA" from ASCII-8BIT to UTF-8
	from (irb):1:in `encode'
	from (irb):1
	from /usr/local/bin/irb1.9:12:in `<main>'
irb(main):002:0> "\xAA".force_encoding('US-ASCII').encode('UTF-8')
Encoding::InvalidByteSequenceError: "\xAA" on US-ASCII
	from (irb):2:in `encode'
	from (irb):2
	from /usr/local/bin/irb1.9:12:in `<main>'
irb(main):003:0> "\xAA".force_encoding('ISO8859-2').encode('UTF-8')
=> "Ş"
Notice how, in the first case, you get an "UndefinedConversionError"? That's because there's no character that corresponds to this value in the ASCII-8BIT encoding. In the second case, you get an "InvalidByteSequenceError" because you've put an 8-bit value into an ASCII string, which is only valid for 7 bits. Finally, using one of the "Extended ASCII" encodings, you can successfully convert to Unicode, because now this value, 0xAA, actually represents a character. So, in other words, ASCII-8BIT is already what you're looking for.
=end
Updated by rogerdpack (Roger Pack) over 14 years ago
=begin
I think his original complaint was that if he's passing back data from a method, like
def give_me_your_binary_data
end
they can't pass back a string that is marked as BINARY encoding, since querying for its encoding returns "ASCII-8BIT".
See the previous discussion (which I think never quite terminated -- any more thoughts there?)
Maybe it should be renamed so that its default name is GENERIC, with ASCII-8BIT as an alias.
Thoughts?
-r
=end
Updated by jballanc (Joshua Ballanco) over 14 years ago
=begin
On Jul 22, 2010, at 12:42 PM, Roger Pack wrote:
> I think his original complaint was that if he's passing back data from a method, like
>
>   def give_me_your_binary_data
>   end
>
> they can't pass back a string that is marked as BINARY encoding, since querying for its encoding returns "ASCII-8BIT".
> See the previous discussion (which I think never quite terminated -- any more thoughts there?)
> Maybe it should be renamed so that its default name is GENERIC, with ASCII-8BIT as an alias.
> Thoughts?
If this is the case, then I feel we're arguing semantics. There is no such thing as an "ASCII-8BIT" string (where by "string" I mean an array of characters that can be printed). Byte values 128-255 in an "ASCII-8BIT" string cannot be printed as characters or converted to any other encoding. Personally, I'd be in favor of making "BINARY" the default name and "ASCII-8BIT" the alias, but it shouldn't matter.
- Josh
=end