Regular Expressions


A regular expression is a pattern of characters, written in a specific syntax, that describes what to look for in a piece of text. Regular expressions are typically used to search text for matches or to test whether the text fits the pattern.

Matching a Single Word

PHP offers several built-in functions for working with regular expressions. Let's look at an example of matching a single word using preg_match():

	
    <?php
        $string = 'Cheesecakes taste good.';
        $pattern = '/cake/';
        echo(preg_match($pattern, $string));
    ?>
	
	
    1
	

preg_match() returns 1 when the pattern is found and 0 when it is not. Here the 1 tells us that cake was found in our string, as part of the word Cheesecakes.

In general, the syntax for preg_match() is:

	
    preg_match($pattern, $string);
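
preg_match() also accepts an optional third argument that receives the matched text. The sketch below reuses the string from the example above; the variable names ($found, $missing, and so on) are only illustrative.

```php
<?php
$string = 'Cheesecakes taste good.';

// The optional third argument receives the matched text.
$found = preg_match('/cake/', $string, $matches);
echo $found . "\n";       // 1
echo $matches[0] . "\n";  // cake

// No match: preg_match() returns 0 and the array is left empty.
$missing = preg_match('/pie/', $string, $noMatches);
echo $missing . "\n";     // 0
```

The pattern is always wrapped in delimiters, here forward slashes, which is why it is written as /cake/ rather than cake.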
	

Using Meta Characters

Regular expressions aren't limited to just straight text. Our search patterns can also use meta characters, which are characters that have a special meaning inside a regular expression.

For example, a dot . character will match any single character (except a newline, by default):

	
    <?php
        $string = 'Cheesecakes taste good.';
        $pattern = '/t.ste/';
        echo(preg_match($pattern, $string));
    ?>
	
	
    1
	

Notice how, even though our pattern read t.ste, it was still able to match the string taste? The . matched the a.

Here's a list of the most common meta characters and escape sequences you can use in your PHP regular expressions.

  • .: Any character except a newline
  • ?: Makes the previous character optional
  • \w: A word character (a letter, digit, or underscore)
  • \W: A non-word character
  • \d: A digit
  • \D: A non-digit character
  • \s: A whitespace character
  • \S: A non-whitespace character
  • \b: A word boundary (the beginning or end of a word)
  • \B: A position that is not a word boundary
  • \0: A NUL character
  • \n: A new line character
  • \f: A form feed character
  • \r: A carriage return character
  • \t: A tab character
  • \v: Any vertical whitespace character (including the vertical tab)
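
A few of the escapes from the list above in action. This is a minimal sketch; the sample sentence and variable names are made up for illustration.

```php
<?php
$text = 'Order 66 shipped to bay 7.';

// \d matches a single digit; preg_match() stops at the first one.
preg_match('/\d/', $text, $digit);
echo $digit[0] . "\n"; // 6

// \b anchors the match to word boundaries, so "bay" must stand alone.
$wholeWord = preg_match('/\bbay\b/', $text);
echo $wholeWord . "\n"; // 1

// "ship" is followed by more word characters, so the closing \b fails.
$partialWord = preg_match('/\bship\b/', $text);
echo $partialWord . "\n"; // 0

// \s matches whitespace; \S\s\S needs a non-space on both sides of it.
preg_match('/\S\s\S/', $text, $around);
echo $around[0] . "\n"; // r 6
```

The patterns are written in single quotes so that PHP passes the backslashes through to the regular expression engine unchanged.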

Using Pattern Modifiers

A pattern modifier changes the way a pattern match works. A modifier is just a character placed after the closing delimiter of the pattern that tells PHP to alter its behavior.

Here's how to do a case-insensitive search for every occurrence of a string inside another string:

	
    <?php
        $pattern = "/apple/i";
        $text = "Apples can be different colors, like a red apple, a yellow apple, and a green apple.";

        $matches = preg_match_all($pattern, $text, $array);
        echo($matches . " matches were found.");
        print_r($array);
    ?>
	
	
    4 matches were found.
    Array ( [0] => Array ( [0] => Apple [1] => apple [2] => apple [3] => apple ) )
	

We found the word apple in our string four times. The first match is the uppercase Apple, which was included because the i at the end of the pattern instructs PHP to ignore case.

We also used a new function, preg_match_all(). It works like preg_match() except that it keeps searching after the first match: it returns the number of matches found and stores the matched text in the array passed as the third argument. Here's the basic syntax for preg_match_all():

	
    preg_match_all($pattern, $text, $array);
	

Here are some of the modifiers you can use:

  • i: This makes the searching case-insensitive
  • m: This makes ^ and $ match at the start and end of every line instead of only at the start and end of the whole string
  • s: This makes dot . characters match all characters, including newlines

Note that the g and o modifiers found in some other languages are not available in PHP, and using them triggers an unknown modifier warning. To find every match, call preg_match_all() instead of adding g.
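
To see what the m and s modifiers change, here is a small sketch run against a two-line string. The sample text and variable names are assumptions chosen for illustration.

```php
<?php
$text = "first line\nsecond line";

// Without m, ^ only matches at the very start of the string.
$withoutM = preg_match('/^second/', $text);
echo $withoutM . "\n"; // 0

// With m, ^ also matches right after each newline.
$withM = preg_match('/^second/m', $text);
echo $withM . "\n"; // 1

// Without s, the dot refuses to match the newline.
$withoutS = preg_match('/line.second/', $text);
echo $withoutS . "\n"; // 0

// With s, the dot matches the newline as well.
$withS = preg_match('/line.second/s', $text);
echo $withS . "\n"; // 1
```

Modifiers can be combined, so a pattern ending in /ims applies all three at once.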

Matching using Sets

Sets allow you to match against a set of characters that you enclose between square brackets. You can think of a set as a dot . for which you define exactly which characters can be matched.

Let's look at a set that matches any of the digits 1, 2, or 3:

	
    <?php
        $string = 'I pay 15% in tips.';
        $pattern = '/[123]/';
        echo(preg_match($pattern, $string));
    ?>
	
	
    1
	

A match was found because the number 1 as part of 15% matched one of the characters we defined in the set, [123].

You can also do the inverse of the set by negating it. This is done by placing a ^ character right after the opening bracket. Let's match any character that is not one of those three numbers:

	
    <?php
        $string = 'I pay 15% in tips.';
        $pattern = '/[^123]/';
        echo(preg_match($pattern, $string));
    ?>
	
	
    1
	

As expected, there was a match: even the very first character, the I, is not one of the three numbers in our set.

Alternatively, you can express a set as a range between two values. For example, this is how you can match any digit using a set:

	
    $number = '/[0-9]/';
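
Here is a short sketch of ranges at work on the tipping sentence used earlier. The variable names are illustrative only.

```php
<?php
$string = 'I pay 15% in tips.';

// Two range sets side by side match two digits in a row.
preg_match('/[0-9][0-9]/', $string, $digits);
echo $digits[0] . "\n"; // 15

// Ranges can be combined inside one set: here, any letter of either case.
$letterCount = preg_match_all('/[a-zA-Z]/', $string, $letters);
echo $letterCount . "\n"; // 10
```

The range [0-9] is simply shorthand for the set [0123456789].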
	

Matching Words and Sentences

In addition to matching individual characters and sets of characters, you can also match words and sentences using symbols that represent repetition. This takes advantage of the fact that words and sentences are simply repetitions of letters, with some punctuation.

Let's say we define a sentence as a string that begins with a capital letter. So far this makes our pattern look like this:

	
    <?php
        $pattern = '/[A-Z]/';
    ?>
	

After the capital letter, we want to match whatever comes next. If you remember, this is done with the dot . symbol. Our pattern now looks like this:

	
    <?php
        $pattern = '/[A-Z]./';
    ?>
	

Next, we need to tell PHP to keep matching the previous symbol. In other words, we want to keep matching anything after the first capital letter. This is done with the plus + sign. This pattern now matches a capital letter followed by one or more of any character:

	
    <?php
        $pattern = '/[A-Z].+/';
    ?>
	

To end the sentence, we require a period, question mark, or exclamation point. The period and question mark are escaped with a backslash because they are meta characters. Now our pattern looks like this:

	
    <?php
        $pattern = '/[A-Z].+(\.|\?|!)/';
    ?>
	

We just have one final issue to resolve. As written, this pattern matches multiple sentences as one giant sentence, because .+ is greedy: it consumes as much text as it can, so the match runs from the first capital letter to the last punctuation mark. We need to tell PHP to end a match as soon as possible and then continue searching for more matches. We do this by placing a question mark ? directly after the quantifier, which makes it lazy.

Finally, our regular expression can match sentences:

	
    <?php
        $pattern = '/[A-Z].+?(\.|\?|!)/';
        $text = "Hello. I am sample text! How are you?";

        $matches = preg_match_all($pattern, $text, $array);
        echo($matches . " matches were found.");
        print_r($array);
    ?>
	
	
    3 matches were found.
    Array ( [0] => Array ( [0] => Hello. [1] => I am sample text! [2] => How are you? ) [1] => Array ( [0] => . [1] => ! [2] => ? ) )
	

Now you can access each individual match found by accessing the different elements in the array.
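
To make the effect of the extra question mark concrete, this sketch runs the pattern with and without it on the same sample text. The variable names are illustrative.

```php
<?php
$text = "Hello. I am sample text! How are you?";

// Greedy: .+ grabs as much as possible, so everything up to the
// last punctuation mark is returned as a single match.
$greedyCount = preg_match_all('/[A-Z].+(\.|\?|!)/', $text, $greedy);
echo $greedyCount . "\n";   // 1
echo $greedy[0][0] . "\n";  // Hello. I am sample text! How are you?

// Lazy: .+? stops at the first punctuation mark it can reach.
$lazyCount = preg_match_all('/[A-Z].+?(\.|\?|!)/', $text, $lazy);
echo $lazyCount . "\n";     // 3
echo $lazy[0][1] . "\n";    // I am sample text!
```

In both results, index 0 holds the full matches and index 1 holds whatever the parentheses captured, which here is the closing punctuation mark.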

Here are the repetition-related symbols, called quantifiers:

  • +: This repeats the previous character, set, or group one or more times
  • *: This repeats the previous character, set, or group zero or more times
  • ?: This repeats the previous character, set, or group zero or one time
  • {a}: This repeats the previous character, set, or group exactly a times
  • {a,b}: This repeats the previous character, set, or group between a and b times (write it without a space after the comma)
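
The quantifiers above can be tried out with a few short patterns. This is a minimal sketch; the sample strings and variable names are made up for illustration.

```php
<?php
// {4} demands exactly four digits in a row.
preg_match('/[0-9]{4}/', 'Released in 1995.', $year);
echo $year[0] . "\n"; // 1995

// {2,3} takes two or three repetitions, and as many as it can get.
preg_match('/o{2,3}/', 'zoooom', $os);
echo $os[0] . "\n"; // ooo

// ? makes the "u" optional, so both spellings match.
$american = preg_match('/colou?r/', 'color');
$british = preg_match('/colou?r/', 'colour');
echo $american . $british . "\n"; // 11

// * allows zero repetitions, while + needs at least one.
$star = preg_match('/ca*t/', 'ct');
$plus = preg_match('/ca+t/', 'ct');
echo $star . $plus . "\n"; // 10
```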

Testing and Validation

Another popular use case for regular expressions is testing whether or not a string matches a pattern. This is great for validating that a string is in the specific format we want, such as a valid email address, username, password, or URL.

To test an entire string for a match, you must begin the regular expression pattern with ^ and end it with $. The ^ anchors the match to the start of the string and the $ anchors it to the end. Note that $ also matches just before a single trailing newline; use \z in its place, or add the D modifier, if that matters.

Let's practice by trying to validate a username for a website. Let's define a valid username using the following two rules:

  1. Alphanumeric characters plus periods and dashes only
  2. 6 to 12 characters in length

The first rule can be fulfilled using this pattern:

	
    <?php
        $pattern = '/^[a-zA-Z0-9.-]$/';
    ?>
	

Pretty simply put, we have defined three ranges here, a-z, A-Z, and 0-9, plus our two additional characters . and -. Inside a set the dot is a literal period, and the dash is literal because it comes last. Combined, this lets the pattern check for lowercase letters, uppercase letters, numbers, periods, and dashes. On its own, though, this pattern only matches a string that is exactly one character long.

Now let's enforce the second rule, the length limit, in our pattern. Using one of the quantifiers we learned about above, we have our final pattern:

	
    <?php
        $pattern = '/^[a-zA-Z0-9.-]{6,12}$/';
        $array = ['Username',
                    'U.sername',
                    'U53RN4M3',
                    'user',
                    '[email protected]',
                    'usernamezzzzz'];
        foreach ($array as $value) {
            echo(preg_match($pattern, $value));
        }
    ?>
	
	
    1
    1
    1
    0
    0
    0
	

Just like that, we are able to define a pattern, test it against the entire string, and get back 1 or 0 indicating whether or not the string met those rules. Now you can add some logic to send the user a message if their username isn't up to par!
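
One way to turn this check into reusable logic is to wrap it in a function. The helper name isValidUsername() below is hypothetical, not a PHP built-in, and this is a sketch rather than a complete validation routine.

```php
<?php
// Hypothetical helper: returns true only for a clean match.
function isValidUsername(string $username): bool
{
    // preg_match() returns 1, 0, or false on error, so compare with === 1.
    return preg_match('/^[a-zA-Z0-9.-]{6,12}$/', $username) === 1;
}

foreach (['Username', 'U.sername', 'user'] as $candidate) {
    $verdict = isValidUsername($candidate) ? 'valid' : 'invalid';
    echo $candidate . ' is ' . $verdict . "\n";
}
```

Comparing against 1 with === gives a real boolean and treats an error in the pattern the same way as a failed match.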

Search and Replace

Yet another popular use for regular expressions is performing search and replace operations. Instead of a typical search and replace function, where you replace one fixed string with another, preg_replace() takes a regular expression and replaces every match with your new string.

	
    <?php
        $pattern = '/(\w+) (\w+) (\w+)/';
        $text = 'i love you';
        $replacement = '$3 $2 $1';
        echo(preg_replace($pattern, $replacement, $text));
    ?>
	
	
    you love i
	

To understand what is going on here, you need a new concept in regular expressions called groups. By wrapping three different patterns in parentheses, we have defined three different groups, which can be referenced in the replacement string as $1, $2, and $3.

Now we can create a new string as the replacement that will define the final output.

Because the replacement string just references the three groups in reverse order, the output was simply the three words from the original string in reverse order. That is why i love you became you love i.

Groups are a powerful way to create a new string using another string where you can fully control the final output.
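
As another sketch of the same idea, groups can rearrange the parts of a date. The sample text and the year-month-day format are assumptions made for illustration.

```php
<?php
// Three groups capture the year, month, and day.
$pattern = '/(\d+)-(\d+)-(\d+)/';
$text = 'Released on 2024-03-15.';

// Reference the groups in a different order to rearrange the date.
$result = preg_replace($pattern, '$3/$2/$1', $text);
echo $result . "\n"; // Released on 15/03/2024.

// The same groups are available to preg_match() as $parts[1], $parts[2], ...
preg_match($pattern, $text, $parts);
echo $parts[1] . "\n"; // 2024
```

The replacement string is written in single quotes so that PHP does not try to treat $3, $2, and $1 as variables.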

Conclusion

Regular expressions are available in pretty much every programming language because of how powerful and flexible they are. They let you do everything from matching a single character to matching entire words and sentences, testing and validating input, and performing search and replace operations.