Fix ‘[[:cc:]]*literal’ regex failing to match ‘literal’ (bug#24020)

The regex engine tries to optimise Kleene star by avoiding backtracking when it can detect that the star’s operand cannot match what follows it in the pattern. For example, when ‘[[:alpha:]]*1’ is matched against ‘foo’, the engine first tries the longest match for ‘[[:alpha:]]*’, namely ‘foo’, which is the entire string. The literal digit one still present in the pattern does not match the remaining empty string. Normally, backtracking would then try a shorter match for the character class (namely ‘fo’, leaving ‘o’ in the string), but since the engine knows that whatever is put back into the string cannot possibly match the literal digit one, no backtracking is attempted.

In regexes of the form ‘[[:CC:]]*X’, the optimisation can be applied if the character class CC does not match character X. In the above example, this holds because digit one is not in the alpha character class.

This test is performed by the mutually_exclusive_p function, but it did not check the class bits of a charset opcode. As a result, it assumed that character classes never match multibyte characters. For example, it would incorrectly conclude that ‘[[:alpha:]]’ doesn’t match ‘ż’.

This, in turn, led to the aforementioned Kleene star optimisation being incorrectly applied in patterns such as ‘[[:graph:]]*☠’, which should match ‘☠’ but doesn’t. This can be tested by evaluating

    (string-match-p "[[:graph:]]*☠" "☠")

which should return 0 but instead yields nil.

The issue affects any class which matches multibyte characters, i.e. if ‘[[:cc:]]’ matches a multibyte character X then ‘[[:cc:]]*X’ will fail to match ‘X’.

* src/regex.c (executing_charset): New function for executing the charset and charset_not opcodes. It checks the character against the existing bitmap, range table and class bits. It also advances the pointer in the regex bytecode past the parsed opcode.
(CHARSET_LOOKUP_RANGE_TABLE_RAW, CHARSET_LOOKUP_RANGE_TABLE): Removed. Code now included in executing_charset.
(mutually_exclusive_p, re_match_2_internal): Changed to take advantage of the executing_charset function.

* test/src/regex-tests.el: New file with tests for character class matching.
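
The optimisation and the bug described in the message above can be illustrated with a small, self-contained C model. This is only a sketch under stated assumptions, not the code in src/regex.c: every name in it (toy_charset, toy_char_classes, toy_charset_matches, toy_match_star_literal) is hypothetical, the charset is reduced to an ASCII bitmap plus class bits, and the classification of non-ASCII characters is hard-coded for the two characters the message mentions. The use_class_bits flag selects between the pre-fix behaviour of the exclusivity check (false) and the fixed one (true).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* All names below are hypothetical; this is a toy model, not src/regex.c.  */

enum { TOY_CLASS_ALPHA = 1u << 0, TOY_CLASS_GRAPH = 1u << 1 };

struct toy_charset {
  uint8_t bitmap[16];   /* one bit per ASCII character (0..127) */
  unsigned class_bits;  /* classes to consult for non-ASCII characters */
};

/* Assumed classification of the two non-ASCII characters used in the
   commit message; a real engine would consult character properties.  */
static unsigned
toy_char_classes (uint32_t c)
{
  switch (c)
    {
    case 0x017C: return TOY_CLASS_ALPHA | TOY_CLASS_GRAPH;  /* ż */
    case 0x2620: return TOY_CLASS_GRAPH;                    /* ☠ */
    default:     return 0;
    }
}

/* Mark the ASCII characters LO..HI as members of CS.  */
static void
toy_charset_add_range (struct toy_charset *cs, unsigned lo, unsigned hi)
{
  for (unsigned c = lo; c <= hi && c < 128; c++)
    cs->bitmap[c / 8] |= (uint8_t) (1u << (c % 8));
}

/* Does CS match C?  With USE_CLASS_BITS false, non-ASCII characters are
   never considered members, which models the pre-fix exclusivity check.  */
static bool
toy_charset_matches (const struct toy_charset *cs, uint32_t c,
                     bool use_class_bits)
{
  if (c < 128)
    return (cs->bitmap[c / 8] >> (c % 8)) & 1;
  return use_class_bits && (cs->class_bits & toy_char_classes (c)) != 0;
}

/* Match the pattern CS*LIT at the start of S (LEN code points).
   Return the number of characters consumed, or -1 if there is no match.  */
static int
toy_match_star_literal (const struct toy_charset *cs, uint32_t lit,
                        const uint32_t *s, int len, bool use_class_bits)
{
  assert (len >= 0);

  /* Greedy part: the matcher itself always honours class bits.  */
  int i = 0;
  while (i < len && toy_charset_matches (cs, s[i], true))
    i++;

  /* The optimisation: if CS cannot match LIT, giving characters back
     cannot help, so backtracking is skipped.  */
  bool skip_backtracking = !toy_charset_matches (cs, lit, use_class_bits);

  for (;;)
    {
      if (i < len && s[i] == lit)
        return i + 1;
      if (skip_backtracking || i == 0)
        return -1;
      i--;  /* give one character back and retry the literal */
    }
}
```

In this model, a graph charset (ASCII 33..126 plus TOY_CLASS_GRAPH) matched with literal U+2620 against the one-character string ‘☠’ returns -1 when use_class_bits is false, mirroring the nil result from the message, and 1 when it is true. The ‘[[:alpha:]]*1’ against ‘foo’ case returns -1 either way, since skipping backtracking is legitimate there.
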
author: Michal Nazarewicz <mina86@mina86.com> 2016-07-18 15:59:26 +0200
committer: Michal Nazarewicz <mina86@mina86.com> 2016-07-25 23:52:27 +0200
commit: 6dc6b0079ed3632ed9082bc79d8cb6fc96d33f43 (patch)
tree: ffad067337d44c6b559f474fb7421fa7fcf892c6 /test
parent: b176d169347925d57ca63ab63b85d92e49a53c81 (diff)
1 files changed, 92 insertions, 0 deletions
diff --git a/test/src/regex-tests.el b/test/src/regex-tests.el
new file mode 100644
index 0000000000..00165ab051
--- /dev/null
+++ b/test/src/regex-tests.el
@@ -0,0 +1,92 @@
+;;; regex-tests.el --- tests for regex.c functions -*- lexical-binding: t -*-
+
+;; Copyright (C) 2015-2016 Free Software Foundation, Inc.
+
+;; This file is part of GNU Emacs.
+
+;; GNU Emacs is free software: you can redistribute it and/or modify
+;; it under the terms of the GNU General Public License as published by
+;; the Free Software Foundation, either version 3 of the License, or
+;; (at your option) any later version.
+
+;; GNU Emacs is distributed in the hope that it will be useful,
+;; but WITHOUT ANY WARRANTY; without even the implied warranty of
+;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+;; GNU General Public License for more details.
+
+;; You should have received a copy of the GNU General Public License
+;; along with GNU Emacs.  If not, see <http://www.gnu.org/licenses/>.
+
+;;; Code:
+
+(require 'ert)
+
+(ert-deftest regex-word-cc-fallback-test ()
+  "Test that ‘[[:cc:]]*x’ matches ‘x’ (bug#24020).
+
+Test that a regex of the form \"[[:cc:]]*X\", where CC is
+a character class which matches a multibyte character X, matches
+string \"X\".
+
+For example, ‘[[:word:]]*\u2620’ regex (note: \u2620 is a word
+character) must match a string \"\u2620\"."
+  (dolist (class '("[[:word:]]" "\\sw"))
+    (dolist (repeat '("*" "+"))
+      (dolist (suffix '("" "b" "bar" "\u2620"))
+        (dolist (string '("" "foo"))
+          (when (not (and (string-equal repeat "+")
+                          (string-equal string "")))
+            (should (string-match (concat "^" class repeat suffix "$")
+                                  (concat string suffix)))))))))
+
+(defun regex--test-cc (name matching not-matching)
+  (should (string-match-p (concat "^[[:" name ":]]*$") matching))
+  (should (string-match-p (concat "^[[:" name ":]]*?\u2622$")
+                          (concat matching "\u2622")))
+  (should (string-match-p (concat "^[^[:" name ":]]*$") not-matching))
+  (should (string-match-p (concat "^[^[:" name ":]]*\u2622$")
+                          (concat not-matching "\u2622")))
+  (with-temp-buffer
+    (insert matching)
+    (let ((p (point)))
+      (insert not-matching)
+      (goto-char (point-min))
+      (skip-chars-forward (concat "[:" name ":]"))
+      (should (equal (point) p))
+      (skip-chars-forward (concat "^[:" name ":]"))
+      (should (equal (point) (point-max)))
+      (goto-char (point-min))
+      (skip-chars-forward (concat "[:" name ":]\u2622"))
+      (should (or (equal (point) p) (equal (point) (1+ p)))))))
+
+(ert-deftest regex-character-classes ()
+  "Perform sanity test of regexes using character classes.
+
+Go over all the supported character classes and test whether the
+classes and their inversions match what they are supposed to
+match.  The test is done using `string-match-p' as well as
+`skip-chars-forward'."
+  (let (case-fold-search)
+    (regex--test-cc "alnum" "abcABC012łąka" "-, \t\n")
+    (regex--test-cc "alpha" "abcABCłąka" "-,012 \t\n")
+    (regex--test-cc "digit" "012" "abcABCłąka-, \t\n")
+    (regex--test-cc "xdigit" "0123aBc" "łąk-, \t\n")
+    (regex--test-cc "upper" "ABCŁĄKA" "abc012-, \t\n")
+    (regex--test-cc "lower" "abcłąka" "ABC012-, \t\n")
+
+    (regex--test-cc "word" "abcABC012\u2620" "-, \t\n")
+
+    (regex--test-cc "punct" ".,-" "abcABC012\u2620 \t\n")
+    (regex--test-cc "cntrl" "\1\2\t\n" ".,-abcABC012\u2620 ")
+    (regex--test-cc "graph" "abcłąka\u2620-," " \t\n\1")
+    (regex--test-cc "print" "abcłąka\u2620-, " "\t\n\1")
+
+    (regex--test-cc "space" " \t\n\u2001" "abcABCł0123")
+    (regex--test-cc "blank" " \t" "\n\u2001")
+
+    (regex--test-cc "ascii" "abcABC012 \t\n\1" "łą\u2620")
+    (regex--test-cc "nonascii" "łą\u2622" "abcABC012 \t\n\1")
+    (regex--test-cc "unibyte" "abcABC012 \t\n\1" "łą\u2622")
+    (regex--test-cc "multibyte" "łą\u2622" "abcABC012 \t\n\1")))
+
+;;; regex-tests.el ends here