Project

General

Profile

Actions

Feature #20031

closed

Regexp using greedy quantifier and unions on a big string uses a lot of memory

Added by FabienChaynes (Fabien Chaynes) 7 months ago. Updated 6 months ago.

Status:
Feedback
Assignee:
-
Target version:
-
[ruby-core:115550]

Description

Trying to match on some regexp using a greedy quantifier on any character (.*) and unions (foo|bar) uses a lot of memory if the string we want to match on is big.

Reproduction code

puts RUBY_DESCRIPTION

unless File.exist?("/tmp/fake_email.eml")
  content = "Accept-Language: fr-FR, en-US#{"0" * 50_000_000}"
  File.write("/tmp/fake_email.eml", content)
end

content = File.read("/tmp/fake_email.eml")

def print_size(context, sike_kb)
  puts "#{context}: #{(sike_kb / 1024.0).round(1)} MB"
end

print_size("baseline", `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i).last)

print_size("content", content.bytesize / 1024)

REGEXP_UNION = Regexp.union(
  "Accept-Language:",
  "Thread-Topic:",
).freeze
reg = /.*#{REGEXP_UNION}/
p reg

GC.start
print_size("before match?", `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i).last)
allocated = GC.stat(:total_allocated_objects)

p content.match(reg)

p GC.stat(:total_allocated_objects) - allocated
GC.start
print_size("after match?", `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i).last)
p "---"

Output:

$ /usr/bin/time -v ruby full_repro.rb
ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux]
baseline: 71.3 MB
content: 47.7 MB
/.*(?-mix:Accept\-Language:|Thread\-Topic:)/
before match?: 71.3 MB
#<MatchData "Accept-Language:">
14
after match?: 71.6 MB
"---"
  Command being timed: "ruby full_repro.rb"
  User time (seconds): 1.15
  System time (seconds): 0.74
  Percent of CPU this job got: 100%
  Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.89
  Average shared text size (kbytes): 0
  Average unshared data size (kbytes): 0
  Average stack size (kbytes): 0
  Average total size (kbytes): 0
  Maximum resident set size (kbytes): 2423108
  Average resident set size (kbytes): 0
  Major (requiring I/O) page faults: 0
  Minor (reclaiming a frame) page faults: 609575
  Voluntary context switches: 52
  Involuntary context switches: 12
  Swaps: 0
  File system inputs: 0
  File system outputs: 0
  Socket messages sent: 0
  Socket messages received: 0
  Signals delivered: 0
  Page size (bytes): 4096
  Exit status: 0

strace:

$ sudo strace -p 588834
strace: Process 588834 attached
[...]
mmap(NULL, 249856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8e92dd3000
mremap(0x7f8e92dd3000, 249856, 495616, MREMAP_MAYMOVE) = 0x7f8e92d5a000
mremap(0x7f8e92d5a000, 495616, 987136, MREMAP_MAYMOVE) = 0x7f8e92c69000
mremap(0x7f8e92c69000, 987136, 1970176, MREMAP_MAYMOVE) = 0x7f8e92a88000
mremap(0x7f8e92a88000, 1970176, 3936256, MREMAP_MAYMOVE) = 0x7f8e926c7000
mremap(0x7f8e926c7000, 3936256, 7868416, MREMAP_MAYMOVE) = 0x7f8e91f46000
mremap(0x7f8e91f46000, 7868416, 15732736, MREMAP_MAYMOVE) = 0x7f8e91045000
mremap(0x7f8e91045000, 15732736, 31461376, MREMAP_MAYMOVE) = 0x7f8e8a8af000
mremap(0x7f8e8a8af000, 31461376, 62918656, MREMAP_MAYMOVE) = 0x7f8e86cae000
mremap(0x7f8e86cae000, 62918656, 125833216, MREMAP_MAYMOVE) = 0x7f8e7f4ad000
mremap(0x7f8e7f4ad000, 125833216, 251662336, MREMAP_MAYMOVE) = 0x7f8e704ac000
mremap(0x7f8e704ac000, 251662336, 503320576, MREMAP_MAYMOVE) = 0x7f8e524ab000
mremap(0x7f8e524ab000, 503320576, 1006637056, MREMAP_MAYMOVE) = 0x7f8e164aa000
mremap(0x7f8e164aa000, 1006637056, 2013270016, MREMAP_MAYMOVE) = 0x7f8d9e4a9000
mremap(0x7f8d9e4a9000, 2013270016, 4026535936, MREMAP_MAYMOVE) = 0x7f8cae4a8000
mmap(NULL, 12500992, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8e92224000
munmap(0x7f8cae4a8000, 4026535936)      = 0
munmap(0x7f8e92224000, 12500992)        = 0
writev(1, [{iov_base="#<MatchData \"Accept-Language:\">", iov_len=31}, {iov_base="\n", iov_len=1}], 2) = 32
[...]

We can see that the memory retained by Ruby is low (71.6 MB), but content.match(reg) used 2_423_108 kbytes (~2.4 GB) in the regex engine.

Findings

To reproduce we need:

  • a greedy quantifier on any character (.*)
  • a union with at least two members (foo|bar). More members won't increase the memory consumption further
  • to perform the match on a long string

The memory seems to increase linearly against the string size:

String size Memory consumed
11.9 MB 629_988 kbytes
23.8 MB 1_235_976 kbytes
47.7 MB 2_423_076 kbytes

I was able to reproduce it on old Ruby versions as well.

@byroot (Jean Boussier) helped me to put all this together and according to his investigation the memory allocation would be coming from regexec.c:4100 (https://github.com/ruby/ruby/blob/85092ecd6f5c4d12d0cb1d6dfa7040337a4f558b/regexec.c#L4100):

size_t match_cache_buf_length = (num_match_cache_points >> 3) + (num_match_cache_points & 7 ? 1 : 0) + 1;
uint8_t* match_cache_buf = (uint8_t*)xmalloc(match_cache_buf_length * sizeof(uint8_t));

It seems surprising to use that much memory compared to the string size. Is it expected?

Updated by mame (Yusuke Endoh) 7 months ago

I think this behavior is expected. The regex engine stretches the stack by the number of charaters in the input string because of backtracking. The size of the stack is visible as memory usage.

Simply put, /.*(foo|bar)/ is inefficient because it looks (by recursive calls) for the last occurrence of /foo|bar/ in the string. If the first occurrence is acceptable, then /.*?(foo|bar)/ would be much efficient.

Note that the code at regexec.c:4100 is unrelated to this problem, or at least not the dominant factor. This is code for ReDoS protection, introduced in Ruby 3.2, which allocates only 1/8 of the string size at most. The memory consumption in question is reproduced in Ruby 3.1, where this code does not exist.

Updated by Eregon (Benoit Daloze) 7 months ago

I played around a bit and both time and memory seem linear in the size of the input string on CRuby.
However the memory usage is indeed huge, ~2GB for 50MB string, ~4GB for 100MB string, ~8GB for 200MB string, etc.

I tried it on TruffleRuby which uses TRegex (a DFA regexp engine) for curiosity, and that uses 235MB for 50MB string, 334MB for 100MB string, 530MB for 200MB string.
(with /usr/bin/time -v ruby --experimental-options --engine.Compilation=false file.rb to exclude the JIT memory to get more stable numbers).
Using this modified benchmark to pass the input size.

Updated by mame (Yusuke Endoh) 6 months ago

  • Tracker changed from Bug to Feature
  • Status changed from Open to Feedback
  • ruby -v deleted (ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux])
  • Backport deleted (3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN)

This is not a bug. If anyone creates a patch to improve this, we may consider it. We could reduce the stack size consumed, or add a /.*(foo|bar)/ pattern-specific optimization, if this pattern is frequent.

Actions

Also available in: Atom PDF

Like0
Like0Like1Like1