Understanding Regular Expression Syntax and Special Characters

This article provides a solid foundation for getting started with regex and includes a reference guide at the end.

Regular expressions (RegEx, or regex) are powerful tools used in programming and text processing to search, match, and manipulate patterns within strings. To harness their full potential, it's crucial to understand the syntax and special characters they employ. In this learning guide, we will explore each special character, provide resources for further study, and use clever analogies and teaching methods to enhance comprehension.

Before moving forward I want to thank 🦉 Regex Buddy 🦉 from https://www.regular-expressions.info/. Many of the links (🔗) provided are straight from this site, which has remained my go-to for regex resources since I found it.

Enjoy this 👆 website's vast amount of regex relics. Keep it handy as you read this article: it serves as a reference, a regex tester, and a supplemental resource, so you can test and fiddle with the expressions explained throughout and take the concepts even further.

  1. The Caret Symbol (^): The caret symbol is used in regular expressions to indicate matching at the beginning of a string. For example, the pattern "^active" will only match strings that start with "active." Think of the caret as an anchor that locks the match to the start of the string. 🔗

  2. The Dollar Symbol ($): The dollar symbol signifies matching at the end of a string. When using the pattern "state$", it will only match strings that end with "state." Visualize the dollar symbol as a hook that grabs the match at the string's conclusion. 🔗

  3. Character Classes ([]): Character classes let us specify a set of characters, any one of which can match at a particular position. Within the square brackets, we can list individual characters or ranges. For instance, "[3a-c]" matches either the digit '3' or any lowercase letter from 'a' to 'c'. 🔗

  4. Negated Character Classes ([^]): A negated character class, written with a caret right after the opening bracket as in "[^x-z1]", matches any single character except those listed within the brackets. In this example, it matches any character that is not 'x', 'y', 'z', or '1'. Think of it as an "anti-match" operator. 🔗

  5. Alternation (|): The pipe symbol (|) allows us to specify alternative patterns. For example, the regex "A|S" matches either an 'A' or an 'S'. Imagine the pipe as a fork in the road, offering different paths to choose from. 🔗

  6. Grouping (()): Parentheses serve two purposes: they group part of a pattern into a single unit, and they capture the text that the group matches so it can be retrieved later. For instance, the pattern "(8097ba)" matches the sequence "8097ba" and captures it as group 1. Consider the parentheses as a container that captures and labels a specific portion of the string. 🔗

  7. Escaping Special Characters (\): To match a literal special character, such as a dot (.), dollar ($), or caret (^), we need to escape it with a backslash (\). For example, to match the exact sequence "ac..ve", we write the pattern "ac\.\.ve". Visualize the backslash as a protective shield preserving the character's literal meaning. 🔗
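
As a quick check of the seven constructs above, here is a minimal Python sketch that tries each pattern against one string that should match and one that should not. The `PATTERNS` table, the `check` helper, and all sample strings are made up for illustration.

```python
import re

# Each entry: pattern, a string expected to match, a string expected not to match.
# The sample strings are illustrative assumptions, not taken from the article.
PATTERNS = {
    "caret": (r"^active", "active state", "inactive state"),
    "dollar": (r"state$", "active state", "state of play"),
    "char class": (r"[3a-c]", "xyz3", "xyz"),
    "negated class": (r"[^x-z1]", "xyz1!", "xyz1"),
    "alternation": (r"A|S", "USA", "usa"),
    "grouping": (r"(8097ba)", "id-8097ba", "id-8097"),
    "escaping": (r"ac\.\.ve", "ac..ve", "active"),
}


def check(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in text."""
    return re.search(pattern, text) is not None


# For every construct, record (matched the good string, matched the bad string).
results = {
    name: (check(pattern, good), check(pattern, bad))
    for name, (pattern, good, bad) in PATTERNS.items()
}

for name, (hit, miss) in results.items():
    print(f"{name:14} good string matched={hit}  bad string matched={miss}")
```

Each row should report that the first string matched and the second did not. Swapping in your own strings is an easy way to test your intuition about a construct.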

Mastering special characters in regular expressions gives you the ability to craft precise and powerful patterns for pattern matching and text manipulation. Use the provided resources and analogies to deepen your understanding, and practice applying these concepts to real-world scenarios. Regular expressions may seem daunting at first, but with practice, they become a valuable tool in your programming arsenal.

Regex In Action

Here are examples of how we can use RegEx with JavaScript and Python.

JavaScript Example:

const inputString = "active state 3b";
const caretRegex = /^active/;
const dollarRegex = /state$/;
const charClassRegex = /[3a-c]/;
const negatedCharClassRegex = /[^x-z1]/;
const alternationRegex = /A|S/;
const groupingRegex = /(8097ba)/;
const escapedCharRegex = /ac\.\.ve/;

console.log(caretRegex.test(inputString)); // Output: true
console.log(dollarRegex.test(inputString)); // Output: false (the string ends with "3b", not "state")
console.log(charClassRegex.test(inputString)); // Output: true
console.log(negatedCharClassRegex.test(inputString)); // Output: true
console.log(alternationRegex.test(inputString)); // Output: false
console.log(groupingRegex.exec(inputString)); // Output: null ("8097ba" does not occur in the string)
console.log(escapedCharRegex.test(inputString)); // Output: false

Python Example:

import re

input_string = "active state 3b"
caret_regex = re.compile('^active')
dollar_regex = re.compile('state$')
char_class_regex = re.compile('[3a-c]')
negated_char_class_regex = re.compile('[^x-z1]')
alternation_regex = re.compile('A|S')
grouping_regex = re.compile('(8097ba)')
escaped_char_regex = re.compile(r'ac\.\.ve')  # raw string keeps the backslashes intact

print(caret_regex.search(input_string))  # Output: <re.Match object; span=(0, 6), match='active'>
print(dollar_regex.search(input_string))  # Output: None (the string ends with "3b", not "state")
print(char_class_regex.search(input_string))  # Output: <re.Match object; span=(0, 1), match='a'>
print(negated_char_class_regex.search(input_string))  # Output: <re.Match object; span=(0, 1), match='a'>
print(alternation_regex.search(input_string))  # Output: None
print(grouping_regex.search(input_string))  # Output: None ("8097ba" does not occur in the string)
print(escaped_char_regex.search(input_string))  # Output: None

In both examples, we define regular expressions using each language's regex syntax. We then apply them to the input string with test() or exec() in JavaScript and search() in Python, and check for a match. The output comments show the result of each operation.

Note: The examples assume the input string provided is representative of the entire text being processed. In practice, you may need to modify the regular expressions and input string depending on your specific requirements.
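
Python's search() returns a match object or None rather than true or false, so checking for a match takes one extra step. Here is a minimal sketch of that step; the helper name `regex_test` is hypothetical and only meant to mirror JavaScript's test().

```python
import re

input_string = "active state 3b"


def regex_test(pattern: str, text: str) -> bool:
    """Rough Python counterpart of JavaScript's RegExp.test() (hypothetical helper)."""
    return re.search(pattern, text) is not None


starts_with_active = regex_test(r"^active", input_string)
ends_with_state = regex_test(r"state$", input_string)

# A match object also reports where the match was found and what text it covered.
first = re.search(r"[3a-c]", input_string)
first_span = first.span() if first else None
first_text = first.group() if first else None

print(starts_with_active)  # True
print(ends_with_state)     # False: the string ends with "3b"
print(first_span, first_text)  # (0, 1) a
```

Note that the character class finds the 'a' at the very start of "active" first, because search() returns the leftmost match.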

Let's break each of these code examples down and dig a bit deeper. After reading, the best way to move forward is to start making up your own examples with practical use cases. Use the reference guide below, and hit me up anytime with questions or free reviews.

Regular Expressions Explained:

  1. caretRegex = /^active/

    • ^ is the caret symbol, indicating matching at the beginning of a string.

    • active is the literal string to match.

    • Description: This regex matches strings that start with "active".

  2. dollarRegex = /state$/

    • state is the literal string to match.

    • $ is the dollar symbol, indicating matching at the end of a string.

    • Description: This regex matches strings that end with "state".

  3. charClassRegex = /[3a-c]/

    • [3a-c] is a character class, matching any character that is either '3', 'a', 'b', or 'c'.

    • Description: This regex matches any occurrence of '3', 'a', 'b', or 'c' in the string.

  4. negatedCharClassRegex = /[^x-z1]/

    • [^x-z1] is a negated character class, matching any character that is not 'x', 'y', 'z', or '1'.

    • Description: This regex matches any character except 'x', 'y', 'z', or '1' in the string.

  5. alternationRegex = /A|S/

    • A|S is an alternation, matching either 'A' or 'S'.

    • Description: This regex matches either 'A' or 'S' in the string.

  6. groupingRegex = /(8097ba)/

    • (8097ba) is a capturing group, capturing and matching the sequence of characters "8097ba".

    • Description: This regex captures and matches the sequence "8097ba" in the string.

  7. escapedCharRegex = /ac\.\.ve/

    • ac\.\.ve is a sequence of literal characters, where \. matches a literal dot character.

    • Description: This regex matches the exact sequence "ac..ve" in the string, where the dot is escaped to match a literal dot.
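
To make the last two items more concrete, here is a minimal Python sketch showing how a captured group can be read back and why escaping matters. The sample strings are assumptions made up for illustration; `re.escape()` is Python's standard helper for escaping a literal string.

```python
import re

# A capturing group both matches "8097ba" and stores it for later use.
match = re.search(r"(8097ba)", "order 8097ba shipped")
captured = match.group(1) if match else None

# re.escape() adds the backslashes needed to treat every character literally.
literal = "ac..ve"
escaped_pattern = re.escape(literal)

matches_literal = re.search(escaped_pattern, "ac..ve") is not None
matches_active = re.search(escaped_pattern, "active") is not None

# Without escaping, each dot matches any character, so "active" matches too.
unescaped_matches_active = re.search(literal, "active") is not None

print(captured)                  # 8097ba
print(escaped_pattern)           # ac\.\.ve
print(matches_literal)           # True
print(matches_active)            # False
print(unescaped_matches_active)  # True
```

The last line is the reason the backslashes are needed: an unescaped dot is a wildcard, so "ac..ve" quietly matches "active" as well.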

This article provides you with a solid foundation of regular expressions (RegEx, or regex). As stated earlier, the best way to move forward is to start making up your own examples with practical use cases. Use the reference guide below, and hit me up anytime with questions or free reviews.

REGEX REFERENCE GUIDE

Regex

Matches ...

/^active/

Matches strings starting with "active".

/state$/

Matches strings ending with "state".

/[3a-c]/

Matches any occurrence of '3', 'a', 'b', or 'c'.

/[^x-z1]/

Matches any character except 'x', 'y', 'z', or '1'.

/(8097ba)/

Captures and matches the sequence "8097ba".

/^\d{3}-\d{3}-\d{4}$/ 

 ... a phone number in the format "###-###-####". 

/[a-zA-Z]+[0-9]+/   

 ... a sequence of letters followed by a sequence of digits. 

/\w/g     

 ... a single word character (a letter, digit, or underscore), globally.

/\W/g     

 ... a single non-word character (anything other than a letter, digit, or underscore), globally.

/^.{3}$/  

 ... a string consisting of exactly three characters. 

/^\S+$/    

 ... a string consisting of one or more non-whitespace characters. 

/\b\d{3}\b/     

 ... a three-digit number as a whole word. 

/^\d{1,2}:\d{2}$/ 

 ... a time in the format "hh:mm" (24-hour format). 

/[0-9]{2,4}/ 

 ... a two to four-digit number. 

/\b\d{2,4}\b/ 

 ... a whole number consisting of two to four digits. 

/\b\w+\b/ 

 ... a whole word consisting of one or more alphanumeric characters. 

/\b\w{5}\b/ 

 ... a word consisting of exactly five alphanumeric characters. 

/^\w+$/    

 ... a word consisting of one or more alphanumeric characters. 

/\b[A-Z]+\b/ 

 ... a word consisting of one or more uppercase letters. 

/^$/      

 ... an empty string. 

/\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/

 ... an IP-style address in the format "x.x.x.x" where each 'x' is one to three digits (the pattern does not check that each number is between 0 and 255).

/[^\d\W]/  

 ... any letter or underscore (a word character that is not a digit).

/./g      

 ... any character except a line break, globally (add the s flag to make the dot match newlines too).

/[^0-9]/  

 ... any character except a digit. 

/[^\dA-F]/i 

 ... any character that is not a hexadecimal digit (case-insensitive). 

/[^\w-]/    

 ... any character that is not a word character or a hyphen. 

/[^aeiou]/ 

 ... any character that is not a lowercase vowel (consonants, but also digits, spaces, and punctuation).

/[aeiou]/ 

 ... any lowercase vowel character. 

/[^A-Za-z0-9]/  

 ... any non-alphanumeric character. 

/[^\d\sA-Za-z]/    

 ... any non-alphanumeric, non-whitespace character. 

/[.+*?{]/

 ... any of the special characters ".", "+", "*", "?", or "{".

/[a-zA-Z]/ 

 ... any uppercase or lowercase letter. 

/([A-Z])\w+/g   

 ... capitalized words globally and captures the first letter. 

/[A-Z][a-z]+/   

 ... capitalized words. 

/^\d+(,\d+)*$/

 ... comma-separated numbers, allowing zero or more occurrences. 

/(.)(?=.*\1)/

 ... a character that appears again later in the string.

/(.)\1+/   

 ... consecutive repeated characters. 

/\$\d+\.\d{2}/

 ... currency amounts in the format "$x.xx". 

/\d{4}-\d{2}-\d{2}/  

 ... dates in the format "yyyy-mm-dd". 

/(\d+)\.(\d+)/

 ... decimal numbers in the format "x.y" and captures the whole and fractional parts separately. 

/[\w-]+@[a-z]+\.[a-z]{2,}/i

 ... email addresses case-insensitively. 

/\S+@\S+\.\S+/

 ... email addresses. 

/[A-Z]{3}/ 

 ... exactly three uppercase letters. 

/\.([^.]+)$/

 ... file extensions and captures the extension without the dot. 

/\b[A-Za-z]{5}\b/  

 ... five-letter words. 

/[^aeiou\s]{4}/i  

 ... four consecutive characters that are neither vowels nor whitespace, case-insensitively (digits and punctuation count too).

/^([01]{8}\s?){4}$/    

 ... groups of four 8-bit binary numbers separated by optional whitespace. 

/^\d+-\d+$/    

 ... hyphenated numbers in the format "x-y". 

/\d+\.\d+\.\d+\.\d+/

 ... IP addresses in the format "x.x.x.x". 

/(\w+)\s*=\s*(\w+)/

 ... key-value pairs in the format "key=value" and captures the key and value separately. 

/\w+\s*=\s*\w+/

 ... key-value pairs in the format "key=value" with optional whitespace around the equals sign. 

/(\w+)\s*=\s*['"](.*?)['"]/

 ... key-value pairs where the value is enclosed in single or double quotes and captures the key and value separately. 

/(?!^)[a-z]/g    

 ... lowercase letters except one at the very start of the string (add the m flag to skip the first character of each line).

/(\d+)-(\d+)/   

 ... number ranges in the format "x-y" and captures the starting and ending numbers separately. 

/^(\d+)-\1-\1$/   

 ... numbers repeated three times separated by hyphens. 

/\d+(?!\w)/    

 ... numbers that are not followed by a word character. 

/\d+(?!\d)/    

 ... numbers that are not followed by another digit. 

/[^.!?,]+/g  

 ... one or more characters that are not periods, exclamation marks, question marks, or commas, globally. 

/^\d+/    

 ... one or more digits at the start of the string. 

/\d+\+{2,}/

 ... one or more digits followed by two or more plus (+) characters. 

/\d+/g    

 ... one or more digits globally. 

/[a-z]+/i  

 ... one or more letters in either case.

/[^\w]/g   

 ... a single non-word character, globally.

/[^\w\s]/g 

 ... a single character that is neither a word character nor whitespace, globally.

/[^\d]+/g 

 ... one or more non-digit characters globally. 

/\D+/g     

 ... one or more non-digit characters globally. 

/[^\d\s]+/g 

 ... one or more non-digit, non-whitespace characters globally. 

/[^\s]+/g   

 ... one or more non-whitespace characters globally. 

/[A-Za-z]+/  

 ... one or more uppercase or lowercase letters. 

/\s+/g     

 ... one or more whitespace characters globally. 

/^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)[a-zA-Z\d]{8,}$/

 ... passwords with at least eight characters, containing at least one lowercase letter, one uppercase letter, and one digit. 

/(\d{3}-?){2}\d{4}/  

 ... phone numbers in the format "xxx-xxx-xxxx" or "xxxxxxxxxx". 

/([A-Z][a-z]+)\s+\1/ 

 ... repeated capitalized words separated by whitespace. 

/(\b\w+\b)\s+\1/  

 ... repeated consecutive words separated by whitespace. 

/\b(\w+)\b(?=.*\b\1\b)/

 ... repeated words in a sentence. 

/(\w+),\1/  

 ... repeated words separated by a comma. 

/^(\w+)\s+\1$/  

 ... repeated words separated by whitespace. 

/^(\w+ )+\w+\.$/

 ... sentences that end with a period, where each word is separated by a space. 

/([A-Za-z]+)\s+is\s+([A-Za-z]+)/  

 ... statements of the form "X is Y" and captures the word before and the word after "is" (for "Apple is a fruit." it captures "Apple" and "a").

/^([a-z]+ )+[a-z]+$/  

 ... sentences where each word is separated by a space, with only lowercase letters. 

/[^.]+/  

 ... sequences of one or more characters that are not periods.

/^#\w{6}$/  

 ... a hashtag (#) followed by exactly six word characters; a loose check for hexadecimal color codes, since \w also accepts non-hex letters and underscores.

/^[A-Za-z0-9]{8}$/  

 ... strings consisting of exactly eight alphanumeric characters. 

/^[\w\s]*$/

 ... strings consisting of only alphanumeric characters and whitespace. 

/^[^0-9]*$/

 ... strings that do not contain any digits. 

/\((.*?)\)/g

 ... text enclosed in parentheses globally, using non-greedy matching. 

/\[\w+\]/

 ... text enclosed in square brackets, e.g., "[abc]". 

/\bcat\b/     

 ... the exact word "cat" (whole word match). 

/\bword\b/ 

 ... the exact word "word" (whole word match). 

/[^\d\s]{3}/g    

 ... three consecutive characters that are neither digits nor whitespace characters, globally. 

/[^aeiou]{3}/     

 ... three consecutive characters that are not lowercase vowels (consonants, but also digits, spaces, and punctuation).

/[a-zA-Z]{3}\d+/  

 ... three consecutive letters followed by one or more digits. 

/a{3,5}/   

 ... three to five consecutive 'a' characters. 

/\b\d{3}\b/  

 ... three-digit numbers as whole words. 

/^\d{1,2}:\d{2}(?::\d{2})?(?:\s?[AP]M)?$/i  

 ... time in the format "hh:mm:ss AM/PM" or "hh:mm AM/PM", case-insensitively. 

/(\d+):(\d+)/  

 ... time in the format "hh:mm" and captures the hour and minute separately. 

/[a-z]{2}\d?/   

 ... two lowercase letters followed by an optional digit. 

/(\d{3}-\d{2}){2}/   

 ... two occurrences of a 3-digit number followed by a hyphen and a 2-digit number. 

/(\d{4}-\d{2}-\d{2}){2}/   

 ... two dates in the format "yyyy-mm-dd" written back to back with nothing between them, such as "2023-01-012023-01-02".

/([A-Z][a-z]+){2,}/   

 ... two or more capitalized words run together with no spaces, such as "HelloWorld".

/[aeiou]{2,}/i 

 ... two or more consecutive vowels case-insensitively. 

/\s{2,}/g  

 ... two or more consecutive whitespace characters globally. 

/(\d{2}-){2}\d{2}/  

 ... two-digit numbers separated by hyphens, such as "xx-xx-xx". 

/(?!^)[A-Z]/g  

 ... uppercase letters except one at the very start of the string (add the m flag to skip the first character of each line).

/^https?:\/\/[^\s]+$/  

 ... URLs starting with "http://" or "https://". 

/\/{2}.+/    

 ... URLs starting with two consecutive slashes. 

/\b\d+\b/     

 ... whole numbers (positive integers). 

/\b[A-Z]+\b/    

 ... whole words consisting of only uppercase letters. 

/^[A-Za-z]{4,}$/  

 ... words consisting of at least four consecutive letters. 

/(.)\w*\1/

 ... words that start and end with the same letter. 

/([a-z])\w*\1/i

 ... words where the first and last letter are the same, case-insensitively. 

/(\w)\w+\1/  

 ... words where the first and last letter are the same. 

/[^\d\s]{3,5}/g  

 ... words with 3 to 5 consecutive non-digit, non-whitespace characters. 

/[^\W\d_]{5,}/g  

 ... runs of five or more consecutive letters, globally.

/\w{5,}/  

 ... words with five or more consecutive alphanumeric characters. 

/[xyz]*/

 ... zero or more occurrences of 'x', 'y', or 'z'. 
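
The guide writes its patterns as JavaScript literals. As a minimal sketch of how a few of them could be tried in Python, the example below drops the surrounding slashes, maps the i flag to re.IGNORECASE, and uses findall() as a stand-in for the g flag. The variable names and sample strings are assumptions made up for illustration.

```python
import re

# Patterns taken from the reference guide, minus JavaScript's /.../ delimiters.
PHONE = re.compile(r"^\d{3}-\d{3}-\d{4}$")
NUMBER_RANGE = re.compile(r"(\d+)-(\d+)")
DOUBLED_WORD = re.compile(r"(\b\w+\b)\s+\1")
# JavaScript's i flag corresponds to re.IGNORECASE.
TIME = re.compile(r"^\d{1,2}:\d{2}(?::\d{2})?(?:\s?[AP]M)?$", re.IGNORECASE)

phone_ok = PHONE.search("555-123-4567") is not None
phone_bad = PHONE.search("5551234567") is not None

# Capturing groups come back as a tuple of strings.
range_match = NUMBER_RANGE.search("pages 10-25")
start, end = range_match.groups() if range_match else (None, None)

doubled = DOUBLED_WORD.search("this is is a typo")
doubled_text = doubled.group() if doubled else None

time_ok = TIME.search("9:05 pm") is not None
time_bad = TIME.search("9:5") is not None

# Python has no g flag; findall() returns every non-overlapping match instead.
all_numbers = re.findall(r"\d+", "3 apples, 12 pears, 7 plums")

print(phone_ok, phone_bad)   # True False
print(start, end)            # 10 25
print(doubled_text)          # is is
print(time_ok, time_bad)     # True False
print(all_numbers)           # ['3', '12', '7']
```

Trying each pattern against one string that should match and one that should not is a quick way to confirm it does what its description says.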

I hope this has helped some fresh regexers (horrible name, lol)... learners of regex. Never hesitate to hit me up with questions, comments, jobs, or anything tech related!!! Please ❤️ if you find value and subscribe to the newsletter for more articles about React, Web Development, AI-Language Models (ChatGPT), React Native, Typescript, TailwindCSS, Tutorials, Learning Aids, and more!!!

Jon Christie

jonchristie.net