• Let you define text patterns.
  • Validate data entered by users.
  • Extract parts from longer strings.
  • Convert data into a new format.
  • Used for find-and-replace operations.
  • Let you search your own code.
  • What are regular expressions?
    • Tools for searching and matching parts of text.
    • Symbols representing a text pattern.
    • Formal language interpreted by a regular expression processor.
    • Used for matching, searching and replacing text.
    • Used by programming languages.
    • Regular expressions are sometimes referred to as “regex” or “regexp”.
  • Usage Examples.
    • Test if a credit card number has the correct number of digits.
    • Test if an email address is in a valid format.
    • Search a document for either “color” or “colour”.
    • Replace occurrences of “Bob”, “Bobby”, or “B.” with “Robert”.
    • Count how many times “training” is preceded by “computer”, “video” or “online”.
  • Use symbols to describe a pattern.
  • “Matches” is the key term.
  • A regular expression matches text if it correctly describes the text.
  • Text matches a regular expression if it is correctly described by the expression.
  • Choosing Regular Expression Engine
    • C/C++
    • Java
    • JavaScript
    • .NET
    • Perl
    • PHP
    • Python
    • Ruby
    • Unix
    • Apache
    • MySQL
  • RegEx Editors
    • grep, egrep
    • TextMate
    • Atom
    • Sublime
    • Notepad++
    • RegexBuddy
    • RegexMagic
    • regexr.com
    • regex101.com
    • regexpal.com
  • Notation Conventions and Modes
    • Regular Expression
      • /abc/
        • Inside two forward slashes.
          • Tells us there is an expression.
      • / open
      • / close
      • Use without forward slashes.
    • Text String
      • “abc”
        • Matches literal character in text string.
        • Use without quotes.
    • Flags indicate the different modes used with a regex.
  • Modes
    • Standard: /re/
    • Global: /re/g
    • Case insensitive: /re/i
    • Multiline: /re/m
      • By default, ^ and $ match only at the start and end of the whole text, not at each line.
  • Grep
    • g/re/p
      • The global flag used to come at the beginning.
    • Global Regular Expression Print
    • Often used as a verb.
  • Literal Characters
    • /car/ matches “car”
    • /car/ matches the first three letters of “carnival”
    • Similar to searching in a word processor.
    • Simplest match there is.
    • Case-sensitive (by default)
      • Good advice: write all regular expressions as case-sensitive.
        • More compatible and helps avoid mistakes.
      • /car/i
        • This would be case insensitive.
    • Spaces are also characters.
    • Standard (non-global) matching.
      • Earliest (leftmost) match is always preferred.
      • /zz/ matches the first set of Zs in “pizzazz.”
    • Global Matching
      • All matches found throughout text.
      • Use Global flag.
        • /zz/g
          • Keeps going until it gets to the end of the string.
      • Regex moves from left to right.
      • /cat/
        • The cow, camel, and cat communicated.
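The difference between standard, global, and case-insensitive matching can be sketched in Python. This is an illustration only: it assumes Python's `re` module as the engine, where `re.search` behaves like a standard match, `re.findall` like the `g` flag, and `re.IGNORECASE` like the `i` flag. The variable names are made up for the example.

```python
import re

text = "pizzazz"

# Standard (non-global) matching: only the earliest (leftmost) match is returned.
first = re.search(r"zz", text)
first_span = first.span()

# Global matching: keep going until the end of the string.
all_matches = re.findall(r"zz", text)

# Case-insensitive mode, the counterpart of /car/i.
insensitive = re.search(r"car", "CARNIVAL", re.IGNORECASE)

print(first_span, all_matches, insensitive.group())
```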
  • Metacharacters
    • Characters with special meaning.
    • Transform literal characters into powerful expressions.
    • Only a few to learn.
    • \ . * + {} [] ^ $ | ? () : ! =
    • Can have more than one meaning.
    • /zz/ .* .+ [abc] $ ? (abc) \d{2,3}/
  • Wildcard Metacharacter
    • .
      • Any character except newline.
      • /h.t/ matches “hat”, “hot” and “hit” but not “heat”
        • Broadest match possible.
        • Most common metacharacter.
        • Most common mistake.
          • /9.00/ matches “9.00”, “9500” and “9-00”.
  • A good regular expression should match the text you want to target and only that text.
    • /h.t/
      • matches “h t”
      • matches “h#t”
      • Line breaks do not match.
    • /.a.a.a/
      • Should match “banana”, “papaya”
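A short Python sketch of the wildcard behaviour described above, assuming the `re` module and using `re.fullmatch` so that each sample string has to be matched as a whole. The sample lists are illustrative.

```python
import re

words = ["hat", "hot", "hit", "heat", "h t", "h#t", "h\nt"]
# The dot stands for exactly one character, and never for a newline by default.
dot_matches = [w for w in words if re.fullmatch(r"h.t", w)]

prices = ["9.00", "9500", "9-00"]
# The common mistake: an unescaped dot also accepts "5" and "-".
loose = [p for p in prices if re.fullmatch(r"9.00", p)]

fruit = [f for f in ["banana", "papaya", "apple"] if re.fullmatch(r".a.a.a", f)]

print(dot_matches, loose, fruit)
```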
  • Escaping Metacharacters
    • \
      • Escape the next character.
      • What is escaping?
        • Treat the metacharacter that follows as a literal character.
    • Allows use of metacharacters as literal characters.
    • Match a period with /\./
    • /9\.00/ matches “9.00”, not “9500” or “9-00”
    • Match a backslash by escaping it with another backslash.
      • (\\)
    • Only for metacharacters.
    • Literal characters should never be escaped; escaping may give them a special meaning.
    • Quotation marks are not metacharacters; they do not need to be escaped.
    • /9.00/g
      • /9\.00/g
        • Matches 9.00
    • /h.._export\.txt/g
      • Matches all of the following:
        • his_export.txt her_export.txt
          • Should escape literal period with \.
        • /resume.\.txt/g
          • resume1.txt resume2.txt resume3_txt.zip
            • The above expression matches the first two text files.
        • /\//g
          • Matches a literal forward slash; it is escaped because slashes delimit the expression.
    • A space is a character
    • Tabs
      • (\t)
        • This is a control character.
    • Line returns (\r, \n, \r\n)
    • /a\tb/g
    • /c\rd/g
      • Matches this:
        • abc def
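The escaping rules above can be tried out with a small Python sketch. It assumes the `re` module; `re.escape` is a standard helper that escapes a literal string for you, and the sample strings are taken from the notes or invented for illustration.

```python
import re

prices = ["9.00", "9500", "9-00"]
# With the dot escaped, only a literal period is accepted.
strict = [p for p in prices if re.fullmatch(r"9\.00", p)]

files = "resume1.txt resume2.txt resume3_txt.zip"
# One wildcard character, then a literal ".txt".
found = re.findall(r"resume.\.txt", files)

# re.escape produces an escaped pattern from a plain string.
escaped = re.escape("9.00")

# \t inside the pattern matches a tab character.
tab_match = re.search(r"a\tb", "a\tb")

print(strict, found, escaped, tab_match is not None)
```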
  • Define a Character Set
    • [
      • Begin a character set.
    • ]
      • End a character set.
    • Any one of several characters.
    • Only one character.
    • Order of characters in the set does not matter.
    • /[aeiou]/
      • Matches any one vowel.
    • /gr[ea]y/
      • Matches “grey” and “gray”
        • Only a single character.
    • /gr[ea]t/ does not match “great”
    • /#[-123456789]/
      • Contestant#1
      • Contestant#2
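A minimal Python sketch of character sets, assuming the `re` module. It shows that a set stands for exactly one character, which is why `gr[ea]t` cannot cover both the "e" and the "a" in "great". The sample strings are illustrative.

```python
import re

# Any one vowel, found one character at a time.
vowels = re.findall(r"[aeiou]", "regular")

# One position, two allowed letters.
spellings = re.findall(r"gr[ea]y", "grey gray groy")

# The set matches a single character, so "great" (e AND a) is not matched.
great = re.search(r"gr[ea]t", "great")

print(vowels, spellings, great)
```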
  • Character Ranges
    • Helps with character sets.
    • /[0123456789]/
    • /[abcdefghijklmnopqrstuvwxyz]/
    • /[ABCDEFGHIJKLMNOPQRSTUVWXYZ]/
      • -
        • Indicates a range of characters.
          • Includes all characters between two characters.
            • Can do with letters and symbols.
          • Only a metacharacter inside a character set; a dash otherwise.
          • [0-9]
          • [A-Za-z]
            • These two ranges match the character sets above.
          • Caution
            • [50-99] is not all numbers from 50 to 99.
            • It is the same as [0-9].
            • Character set including 5, 0-9, and 9.
            • Not a number range, it is a character range.
          • Matching phone number for example
            • 555-666-7890
              • To match this:
                • /[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]/
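The range rules can be checked with a short Python sketch, assuming the `re` module. The phone pattern spells out one `[0-9]` per digit (3-3-4) to fit the sample number, and the second part shows that `[50-99]` is a set of single characters rather than a number range.

```python
import re

phone = re.compile(r"[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]")
phone_ok = phone.fullmatch("555-666-7890") is not None

# [50-99] means: the character 5, the range 0-9, and the character 9.
# It therefore matches any single digit, including "7" and "3".
odd_set = re.findall(r"[50-99]", "50 and 73")

letters = re.findall(r"[A-Za-z]", "R2-D2")

print(phone_ok, odd_set, letters)
```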
  • Negative Metacharacters
    • ^
      • Negate a character set.
    • Not any one of several characters.
      • It can’t be there.
    • Add ^ as first character inside character set.
    • /[^aeiou]/ matches any one non-vowel character (consonants, but also digits, spaces and punctuation)
    • /see[^mn]/ matches:
      • “seek” and “sees”
        • Does not match:
          • “seem” or “seen”
          • “see.” and “see ” are also matched.
            • Caution
              • Does not match “see”; the negated set must still match one character.
    • [^abcdefghijklmnopqrstuvwxyz]
    • /[^a-zA-Z]/g
      • Finds every non-letter character, including spaces.
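A small Python sketch of negated sets, assuming the `re` module. It illustrates the caution above: a negated set still has to consume one character, so "see" on its own does not match. The sample strings are illustrative.

```python
import re

samples = ["seek", "sees", "seem", "seen", "see.", "see ", "see"]
# [^mn] must match ONE character that is neither m nor n.
matched = [s for s in samples if re.search(r"see[^mn]", s)]

# Every character that is not a letter, including the spaces.
non_letters = re.findall(r"[^a-zA-Z]", "It's 5 pm")

print(matched, non_letters)
```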
  • Metacharacters Inside Character Sets
    • Most metacharacters inside character sets are already escaped.
    • You do not need to escape them again.
    • /h[a.]t/ matches
      • “hat” and “h.t”
        • NOT “hot”
    • Exceptions: ] - ^ \
    • /var[[(][0-9][\])]/
      • Finds var(3) var[4]
    • /file[0\-\\_]1/
    • /2013[-/]10[-/]05/
    • /file[0\-\\_]1/g
      • file01 file-1 file\1 file_1
  • Shorthand Character Sets
    • \d
      • Digit
        • [0-9]
    • \w
      • Word Character
        • [a-zA-Z0-9_]
    • \s
      • Whitespace
        • [ \t\r\n]
    • \D
      • Not digit
        • [^0-9]
    • \W
      • Not word character
        • [^a-zA-Z0-9_]
    • \S
      • Not whitespace
        • [^ \t\r\n]
      • Very old expression tools in Unix might not have these.
    • Caution
      • Underscore is a word character!
      • Hyphen is not a word character!
    • /\d\d\d\d/ matches “1984”, but not “text”
    • /\w\w\w/ matches “ABC”, “123” and “1_A”
    • /\w\s\w\w/ matches “I am”, but not “Am I”
    • /[\w\-]/ matches any word character or hyphen (useful)
    • /[^\d]/ same as /\D/ and /[^0-9]/
      • Caution
        • /[^\d\s]/ IS NOT THE SAME AS /[\D\S]/
        • /[^\d\s]/ = NOT (digit OR space character)
        • /[\D\S]/ = EITHER NOT digit OR NOT space character
    • /[^\d\s]/
      • Finds any character that is neither a digit nor whitespace.
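The shorthand sets, and the caution about `[^\d\s]` versus `[\D\S]`, can be sketched in Python. This assumes the `re` module, whose `\d`, `\w` and `\s` also accept some non-ASCII characters by default; for the ASCII samples used here that makes no difference.

```python
import re

year = re.fullmatch(r"\d\d\d\d", "1984") is not None
not_year = re.fullmatch(r"\d\d\d\d", "text") is not None

# Underscore is a word character, hyphen is not.
word3 = [s for s in ["ABC", "123", "1_A", "a-b"] if re.fullmatch(r"\w\w\w", s)]

# Neither a digit nor whitespace.
neither = re.findall(r"[^\d\s]", "a1 b2")

# Non-digit OR non-whitespace: every character satisfies at least one of the two.
either = re.findall(r"[\D\S]", "a1 b2")

print(year, not_year, word3, neither, either)
```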
  • Repetition Metacharacters
    • *
      • Preceding (item) zero or more times.
    • +
      • Preceding (item) one or more times.
    • ?
      • Preceding (item) zero or one time.
    • /.+/ matches any string of characters except a line return.
    • /Good .+\./ matches “Good morning.”, “Good day.”, “Good evening.” and “Good night.”
    • /\d+/
      • matches “90210”
    • /\s[a-z]+ed\s/
      • matches lowercase words ending in “ed”
    • /apples*/ matches
      • “apple”, “apples” and “applesssssss”
    • /apples+/ matches
      • “apples” and “applesssss”, but not “apple”
    • /apples?/ matches “apple” and “apples”, but not “applessssss”
    • /\d\d\d+/ matches
      • numbers with three digits or more.
    • /colou?r/ matches
      • “color” and “colour”
    • /.+/
      • Anything but a new line gets matched.
      • /Good .+\./g
        • Matches:
          • Good morning.
          • Good day.
          • Good evening.
          • Good night.
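A Python sketch of the three repetition metacharacters, assuming the `re` module and using `re.fullmatch` so that each word must be matched in full. The word lists are illustrative.

```python
import re

words = ["apple", "apples", "applesssss"]
star = [w for w in words if re.fullmatch(r"apples*", w)]  # zero or more "s"
plus = [w for w in words if re.fullmatch(r"apples+", w)]  # one or more "s"
opt = [w for w in words if re.fullmatch(r"apples?", w)]   # zero or one "s"

spell = re.findall(r"colou?r", "color colour")

# Two digits followed by one or more digits: three digits or more.
long_numbers = re.findall(r"\d\d\d+", "12 123 90210")

greetings = re.findall(r"Good .+\.", "Good morning.\nGood night.")

print(star, plus, opt, spell, long_numbers, greetings)
```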
  • Quantified Repetition.
    • {
      • Start quantified repetition of preceding item.
    • }
      • End quantified repetition of preceding item.
    • Specify the number of times something is repeated.
      • {min,max}
        • min and max are whole numbers.
        • min must always be included and can be zero.
        • max is optional.
    • Three syntaxes
      • \d{4,8} matches numbers with four to eight digits.
      • \d{4} matches numbers with exactly four digits (min is max).
      • \d{4,} matches numbers with four or more digits (max is infinite).
    • \d{0,} is the same as \d*
    • \d{1,} is the same as \d+
    • /\d{3}-\d{3}-\d{4}/
      • Matches most US phone numbers.
    • /A{1,2} bonds/ matches
      • “A bonds” and “AA bonds”
        • NOT “AAA bonds”
    • /\d\. \w{5}\s/g
      • 1. a
      • 2. ab
      • 3. abc
      • 4. abcd
      • 5. abcde
      • 6. abcdef
        • The above expression matches only line 5 (“5. abcde”).
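The quantifier syntaxes can be sketched in Python, assuming the `re` module. The numbered list is assumed to have one item per line, so that the `\s` at the end of the pattern can match the line break.

```python
import re

phone_ok = re.fullmatch(r"\d{3}-\d{3}-\d{4}", "555-666-7890") is not None

# fullmatch is used so that "AAA bonds" is rejected as a whole string.
bonds = [s for s in ["A bonds", "AA bonds", "AAA bonds"]
         if re.fullmatch(r"A{1,2} bonds", s)]

# {4,} means four or more.
four_plus = re.findall(r"\d{4,}", "123 1234 12345")

lines = "1. a\n2. ab\n3. abc\n4. abcd\n5. abcde\n6. abcdef\n"
# Exactly five word characters followed by whitespace: only item 5 qualifies.
five = re.findall(r"\d\. \w{5}\s", lines)

print(phone_ok, bonds, four_plus, five)
```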
  • Greedy Expressions
    • /\d+\w+\d+/
      • 01_FY_07_report_99.xls
    • /“.+”, “.+”/
      • “Milton”, “Waddams”, “Initech, Inc.”
      • “Milton”, “Waddams”
      • “Milton”, “Waddams”, “Initech, Inc.”
    • Standard repetition quantifiers are greedy.
    • The expression tries to match the longest possible string.
    • Defers to achieving the overall match.
    • /.+\.jpg/
      • matches “filename.jpg”
    • The + is greedy, but “gives back” the “.jpg” to make the match.
    • Gives back as little as possible.
      • /.*[0-9]+/
        • matches “Page 266”
        • .* matches “Page 26”
        • [0-9]+ portion matches only “6”
        • Matches as much as possible before giving control to the next part of the expression.
      • / .*[0-9]+/
        • End of string doesn’t match wildcard anymore.
          • Page 266
    • Regular expressions are eager.
      • They try to return a match to you as soon as possible.
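Greedy behaviour can be made visible in Python by wrapping each part of the expression in a capture group. This sketch assumes the `re` module; the groups are added only to show which part matched what, and straight quotes are used in the name string for simplicity.

```python
import re

# .* grabs as much as it can and gives back only one digit to [0-9]+.
m = re.search(r"(.*)([0-9]+)", "Page 266")
greedy_parts = m.groups()

names = '"Milton", "Waddams", "Initech, Inc."'
# The greedy .+ runs through to the last place where the rest can still match.
greedy_names = re.search(r'".+", ".+"', names).group()

# The + gives back ".jpg" so that the overall match succeeds.
jpg = re.search(r"(.+)\.jpg", "filename.jpg").group(1)

print(greedy_parts, greedy_names, jpg)
```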
  • Lazy Expression
    • ?
      • Make preceding quantifier lazy.
    • *? - zero or more times, as few as possible.
    • +? - one or more times, as few as possible.
      • Instructs the quantifier to use a lazy strategy instead of a greedy one.
    • Still defers to overall match.
    • Not necessarily faster or slower.
    • /.*?[0-9]+/
      • Page 266
        • Still matches the whole thing here, but .*? matches only “Page ” and [0-9]+ matches “266”.
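The same idea with lazy quantifiers, as a Python sketch assuming the `re` module. The capture groups are added only to show how the text is divided between the parts of the expression.

```python
import re

# Lazy .*? takes as little as possible, so [0-9]+ receives all three digits.
m = re.search(r"(.*?)([0-9]+)", "Page 266")
lazy_parts = m.groups()
whole = m.group()

names = '"Milton", "Waddams", "Initech, Inc."'
# Lazy .+? stops at the first closing quote that lets the match succeed.
lazy_names = re.search(r'".+?", ".+?"', names).group()

print(lazy_parts, whole, lazy_names)
```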
  • Grouping Metacharacters
    • (
      • Start grouped expression.
    • )
      • End grouped expression.
    • Group portions of the expression.
    • Apply repetition operators to a group.
    • Create a group of alternation expressions.
    • Captures group for use in matching and replacing.
    • /(abc)+/ matches
      • “abc” and “abcabcabc”
      • /(in)?dependent/ matches
        • “independent” and “dependent”
      • /run(s)?/ same as /runs?/
    • Regexr can replace parts of text you find.
    • $1, $2 refer to the captured groups in the replacement text.
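Grouping and captured groups can be sketched in Python, assuming the `re` module. Note that Python writes group references in the replacement as `\1`, `\2` rather than `$1`, `$2`; the name string is illustrative.

```python
import re

repeated = re.fullmatch(r"(abc)+", "abcabcabc") is not None

# The group makes the whole prefix "in" optional.
optional = [w for w in ["independent", "dependent"]
            if re.fullmatch(r"(in)?dependent", w)]

# Capture two words and swap them in the replacement.
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", "Waddams, Milton")

print(repeated, optional, swapped)
```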
  • Alternation Metacharacter
    • |
      • Match previous or next expression.
    • | is an OR
      • Either match expression on left or match expression on right.
      • Ordered, leftmost expression gets precedence.
    • Multiple choices can be daisy-chained.
    • Group alternation expressions to keep them distinct.
    • /apple|orange/
      • matches “apple” and “orange”
    • /abc|def|ghi|jkl/
      • matches “abc”, “def”, “ghi” and “jkl”
    • /apple(juice|sauce)/ is not the same as /applejuice|sauce/
    • /w(ei|ie)rd/
      • matches “weird” and “wierd”
    • /(AA|BB|CC){4}/
      • matches “AABBAACC” and “CCCCBBBB”
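A Python sketch of alternation with the `|` metacharacter, assuming the `re` module. It contrasts grouped and ungrouped alternation; the sample strings are illustrative.

```python
import re

fruit = re.findall(r"apple|orange", "apple pie and orange juice")

samples = ["applejuice", "applesauce", "sauce"]
# Grouped: "apple" followed by either "juice" or "sauce".
grouped = [s for s in samples if re.fullmatch(r"apple(juice|sauce)", s)]
# Ungrouped: either "applejuice" or just "sauce".
ungrouped = [s for s in samples if re.fullmatch(r"applejuice|sauce", s)]

weird = [w for w in ["weird", "wierd", "wrd"] if re.fullmatch(r"w(ei|ie)rd", w)]
pairs = [s for s in ["AABBAACC", "CCCCBBBB", "AABB"]
         if re.fullmatch(r"(AA|BB|CC){4}", s)]

print(fruit, grouped, ungrouped, weird, pairs)
```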
  • Efficiency Using Alternation
    • /(peanut|peanutbutter)/g
    • /peanut(butter)??/g
      • matches “peanut”, but not “butter”.
    • /(abc|def|ghi|jkl)/
    • /(three|see|thee|tree)/
        • I think those are thin trees.
    • Put simplest and most efficient expression first.
    • /\w+_\d{2,4}|\d{4}_export|export_\d{2}/
    • /export_\d{2}|\d{4}_export|\w+_\d{2,4}/
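Because alternation is ordered, the position of each alternative changes the result. The following Python sketch, assuming the `re` module, shows this on the "peanut" example and on the sample sentence; it illustrates ordering only and makes no claim about speed.

```python
import re

text = "peanutbutter"

# The leftmost alternative that succeeds wins, even if a longer one exists.
short_first = re.search(r"peanut|peanutbutter", text).group()
long_first = re.search(r"peanutbutter|peanut", text).group()

# An optional group is greedy by default and lazy with a second "?".
greedy_opt = re.search(r"peanut(butter)?", text).group()
lazy_opt = re.search(r"peanut(butter)??", text).group()

found = re.findall(r"three|see|thee|tree", "I think those are thin trees.")

print(short_first, long_first, greedy_opt, lazy_opt, found)
```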
  • Start and End Anchors
    • ^
      • Start of string/line.
    • $
      • End of string/line
    • \A
      • Start of string, never start of line.
    • \Z
      • End of string, never end of line.
    • Reference a position, not a character.
    • They are zero-width.
    • /^apple/ or /\Aapple/
    • /apple$/ or /apple\Z/
    • /^apple$/ or /\Aapple\Z/
      • JavaScript does not support \A and \Z.
      • Have to use a PCRE-style engine for those.
    • /^def/
      • matches only the “def” at the start of the string
        • defabcdefghi
    • /def$/
      • Check if at the end of the string
        • defabcdefghidef
    • /\w+@\w+\.[a-z]{3}/g
      • matches “someone@nowhere.com” inside “someone@nowhere.com-blahblahblah”, because the expression is not anchored.
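The anchors can be tried out in Python, which supports `^`, `$`, `\A` and `\Z`. This is a sketch assuming the `re` module; it records match positions to show that an anchor refers to a position rather than a character.

```python
import re

text = "defabcdefghidef"
# Only the "def" at position 0 satisfies ^, only the last one satisfies $.
at_start = [m.start() for m in re.finditer(r"^def", text)]
at_end = [m.start() for m in re.finditer(r"def$", text)]

email = r"\w+@\w+\.[a-z]{3}"
candidate = "someone@nowhere.com-blahblahblah"
# Unanchored: the pattern is happy with a part of the string.
loose = re.search(email, candidate).group()
# Anchored at both ends: the whole string has to fit.
strict = re.search(r"\A" + email + r"\Z", candidate)

print(at_start, at_end, loose, strict)
```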
  • Line Breaks and Multiline Mode
    • Using a Perl-compatible (PCRE) engine here.
    • /^[a-z ]+/g
    • Single-line mode
      • default mode.
        • ^ and $ do not match at line breaks.
        • \A and \Z do not match at line breaks.
    • Multiline mode
      • ^ and $ match start and end of lines.
      • \A and \Z still do not match at line breaks.
        • /m
          • Tells the engine to evaluate the expression in multiline mode.


  • /^[a-z ]+$/gm
    • Matches everything here except “sweet peas (2)”: milk apple juice sweet peas (2)
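Single-line versus multiline mode can be compared in Python, where `re.MULTILINE` plays the role of the `m` flag. The sketch assumes the `re` module and assumes the sample list has one item per line.

```python
import re

shopping = "milk\napple juice\nsweet peas (2)"

# Default mode: ^ and $ see only the start and end of the whole string.
single = re.findall(r"^[a-z ]+$", shopping)

# Multiline mode: ^ and $ also match at each line break.
multi = re.findall(r"^[a-z ]+$", shopping, re.MULTILINE)

# \A ignores multiline mode and still means "start of string".
string_start = re.findall(r"\A[a-z ]+", shopping, re.MULTILINE)

print(single, multi, string_start)
```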
  • Word Boundaries
    • \b
      • Word boundary (start/end of word)
    • \B
      • Not word boundary
    • Reference position and not actual character.
    • Before first word character in string.
    • After last word character in string.
    • Between word character and non-word character.
    • Word Characters: [A-Za-z0-9_]
    • /\b\w+\b/ finds four matches in “This is a test.”
    • /\b\w+\b/ matches
      • “abc_123”, only part of “top-notch”
    • /\bNew\bYork\b/ does not match “New York”
    • /\bNew\b \bYork\b/ matches “New York”
    • /\B\w+\B/ finds two matches in “This is a test.”: “hi” and “es”.
    • /\ba\b/g
      • looks for the letter “a” standing alone as a word, with a word boundary on each side of it.
    • /\b\w+s\b/
      • We picked apples.
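The word-boundary examples above can be reproduced with a Python sketch, assuming the `re` module. The patterns are raw strings so that `\b` is read as a word boundary and not as a backspace character.

```python
import re

sentence = "This is a test."
words = re.findall(r"\b\w+\b", sentence)
inner = re.findall(r"\B\w+\B", sentence)

# Underscore is a word character, hyphen is not.
hyphen = re.findall(r"\b\w+\b", "abc_123 top-notch")

# \b matches a position, so it cannot stand in for the space itself.
no_space = re.search(r"\bNew\bYork\b", "New York")
with_space = re.search(r"\bNew\b \bYork\b", "New York")

plural = re.findall(r"\b\w+s\b", "We picked apples.")

print(words, inner, hyphen, plural)
```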
