Every Startup Should Learn Regular Expressions (Regex)

In a previous article I explained how you can use Google search operators to find leads.

A lot of people don’t use them because they have never heard of them.

There are 2 other advanced computer skills that every startup should know:

  • Regular expressions aka Regex
  • Basic Unix Commands

Believe me when I tell you this.

Learning regular expressions will save you hours and hundreds of dollars.

This is why every startup should learn regular expressions.

Problem 1: Cleaning a Massive Excel File

You have a massive Excel file with user data.

Maybe a product list or a contacts list.

Perhaps you want to send an email to all your contacts.

To maximize your reach you might want to make sure all the email addresses are formatted correctly.

I have seen this too many times:

  • johndoe@gmailcom
  • marysmithyahoo.com

These emails will bounce because they have missing characters.

You can spend a few hours cleaning this data manually.

OR

You can learn Regular Expressions (Regex) and it will take you a few minutes.

Problem 2: Cleaning Up Data From Google Searches

Say that you want a list of all countries.

You could copy/paste from Wikipedia into Excel and spend 1-2 hours cleaning up whatever you pasted. That is, if Excel doesn’t crash, since Excel is not great at handling things pasted from the web.

OR

You can learn Regular Expressions (Regex) and it will take you a few minutes.

The Long Way Solution

You could hire a virtual assistant to clean up this data.

Maybe they don’t know Regular Expressions, so it could still take them hours. At least they are not your hours.
Perhaps they know Regex, so they could do it faster. But they might be in a different time zone. Like way different.

You want the data now!

The Most Efficient and Fastest Solution: Regex

Learn Regular Expressions And Automate

A Regular Expression (aka regex) is “a sequence of characters that define a search pattern”:

  • A sequence of characters.
  • That define a search pattern.

A regex can find the characters that fit a pattern like this.

Example: marysmithyahoo.com

  • Find a sequence of characters: any character up to “yahoo”
  • That make a search pattern: any word before “yahoo” that doesn’t have the “@”.
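
To make the idea concrete, here is a minimal Python sketch that flags addresses like the two broken ones from Problem 1. The pattern is a deliberately simple assumption (some text, an “@”, some text, a period, some text) and not a full email validator. The names LOOKS_LIKE_EMAIL and find_suspicious, and the address ana@example.com, are made up for this example.

```python
import re

# Deliberately simple shape check: text, "@", text, ".", text.
# Assumption for illustration only; this is not a complete email validator.
LOOKS_LIKE_EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def find_suspicious(addresses):
    """Return the addresses that do not fit the simple shape above."""
    return [a for a in addresses if not LOOKS_LIKE_EMAIL.match(a.strip())]

contacts = ["johndoe@gmailcom", "marysmithyahoo.com", "ana@example.com"]
suspicious = find_suspicious(contacts)
print(suspicious)  # ['johndoe@gmailcom', 'marysmithyahoo.com']
```

The first address is flagged because nothing after the “@” contains a period, and the second because it has no “@” at all.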

Requirements For Regex

  • A (good) text editor (not Word or Google Docs)
  • Attention to detail (a lot of attention)

The Best Text Editors For Regex

  • Textmate (only for Mac)
  • Sublime Text 2 (for Mac and Windows)
  • Vim (if you dare to)

Regular Expressions With Sublime Text

Since you might not have a Mac, we will use Sublime Text. You can download it for free for Mac and Windows.

Basics of Regular Expressions

Remember the concept:

  • A sequence of characters.
  • That define a search pattern.

1. Regex to match a text

Open Sublime and copy/paste this:

apple
application
This is an app
another apple
apple
capptain

  • Press Ctrl+F (Windows) or Cmd+F (Mac) to open the Find panel.
  • Enable Regex with the button that has a period and star: .*
  • Type: app
  • It matches the pattern “app” everywhere.

Even if you don’t have Regex enabled, it will do the same. So just wait for the magic…
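
If you are curious how the same search looks outside a text editor, here is a small sketch in Python using its built-in re module. Python is my choice for illustration here; the steps in this article only need Sublime.

```python
import re

# The same sample text pasted into Sublime above.
text = """apple
application
This is an app
another apple
apple
capptain"""

# A plain word is already a valid regex: it matches itself wherever it appears.
matches = re.findall(r"app", text)
print(len(matches))  # 6
```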

2. Magic Spells of Regex

  • \d Matches any digit
  • . A period matches any character (except a line break)
  • \. Matches a literal period
  • [A-Za-z0-9] Matches any one character from A to Z, a to z, or 0 to 9
  • + Matches one or more repetitions of whatever comes right before it
  • \s Matches any whitespace character (space, tab, line break)
  • ^ Matches the start of a line
  • $ Matches the end of a line
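
Here is a hedged sketch that tries each spell from the list above on tiny made-up strings, using Python’s re module. The sample strings are invented for illustration. Other regex flavors, including the one in Sublime, can differ in small details, so treat the results as a guide rather than a guarantee.

```python
import re

sample = "Room 42, v1.5"

digits = re.findall(r"\d", sample)                # each single digit
numbers = re.findall(r"\d+", sample)              # + groups repeated digits
any_char = re.findall(r"1.5", "1.5 1x5")          # . matches any character
literal_dot = re.findall(r"1\.5", "1.5 1x5")      # \. matches only a period
words = re.findall(r"[A-Za-z0-9]+", sample)       # runs of letters and digits
spaces = re.findall(r"\s", "a b\tc")              # a space and a tab
# re.MULTILINE makes ^ and $ work per line, as they do in a text editor.
starts = re.findall(r"^a", "ab\nac", flags=re.MULTILINE)
ends = re.findall(r"b$", "ab\ncb", flags=re.MULTILINE)

print(digits)       # ['4', '2', '1', '5']
print(numbers)      # ['42', '1', '5']
print(any_char)     # ['1.5', '1x5']
print(literal_dot)  # ['1.5']
print(words)        # ['Room', '42', 'v1', '5']
```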

3. Please do not drop out yet

I know this sounds like “code”. Might as well be in Martian.

Believe me. Regular expressions will save you hours and hundreds of dollars.

4. Get a list of countries from Wikipedia

Say that you need a list of all countries where Spanish is an official language.

Open this website in Google Chrome:

https://en.wikipedia.org/wiki/List_of_countries_where_Spanish_is_an_official_language

I know this is a short list, but the same steps work on much larger files.

  • Right click on the list and hit “Inspect Element”.
  • Click on the magnifying glass and click on a row in the table.
  • In the bottom section that shows the HTML code, click on elements until you find one that encloses the whole table.
  • Right click on that element.
  • Copy/paste it into Sublime.

5. Use Regex To Clean Up This Code

  • Go to the Sublime top menu and click Find
  • Go to it again and click Replace
  • Enable the Regex button

Looking at this code, we need to remove everything up to where the country name is.

For example, to get “Mexico” we need to remove all the code up to the word “Mexico”.

I also see that there are some HTML lines that can be easily removed with Find and Replace:

  • Find short HTML line
  • Copy/Paste into Find
  • Replace With: (leave blank)
  • Replace All

The Find and Replace panels will close.

The fastest way to bring them back is to learn the shortcuts. Go to the top menu again to see what the shortcuts are for Find and Replace.

6. Remove all code up to the Country

We now see that every line starts with: td style

And right before the country there is a: title="

We could use a Regex to remove all this:

^.+title.+">

Look how awesome this is: it found the first match and it outlines all 21 matches.

Here is how it works:

  • ^ goes to the start of the line
  • . matches any character
  • + one or more times
  • title matches the word title
  • .+ matches one or more of any character after the word title
  • " matches the quote character
  • > matches the greater-than character

Replace With: Leave empty
Replace All.
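
The same Find and Replace step can be sketched in Python. The two HTML rows below are hypothetical and simplified, since the real markup copied from Wikipedia will be longer and messier. The point is only to show what the pattern keeps and what it throws away.

```python
import re

# Hypothetical, simplified rows; the real copied HTML will differ.
html = (
    '<td style="text-align:left"><a href="/wiki/Mexico" title="Mexico">Mexico</a></td>\n'
    '<td style="text-align:left"><a href="/wiki/Spain" title="Spain">Spain</a></td>'
)

# re.MULTILINE lets ^ match at the start of every line, like in Sublime.
# Replacing with "" is the same as leaving "Replace With" empty.
cleaned = re.sub(r'^.+title.+">', "", html, flags=re.MULTILINE)
print(cleaned)
# Mexico</a></td>
# Spain</a></td>
```

Because + grabs as much as it can, the match runs up to the last "> on each line, which is why the country name is the first thing left.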

We removed a lot of code. We are almost done.

Let’s remove everything from the less-than sign “<” to the end of the line:

<.+

  • < matches less than character
  • . any character (after the <)
  • + one or more times
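
As a quick sketch of this step in Python, assuming each line now looks like a country name followed by leftover closing tags (the sample rows are made up):

```python
import re

# Made-up leftovers after the previous replace.
rows = "Mexico</a></td>\nSpain</a></td>"

# The period does not match a line break, so each match stops at the end of its line.
names = re.sub(r"<.+", "", rows)
print(names)
# Mexico
# Spain
```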

Now let’s remove the empty lines manually.

Wrong!

Another Regex

^\s

  • ^ goes to the start of the line
  • \s matches any whitespace

Replace With: Leave empty
Replace All
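
Here is a minimal Python sketch of the same cleanup, using a made-up list with blank lines in it. Note one assumption worth checking on your own data: this pattern also removes a single leading space or tab from any indented line, not only empty lines.

```python
import re

# Made-up list with empty lines between the countries.
listing = "Mexico\n\nSpain\n\n\nColombia\n"

# On an empty line, the only character after the line start is the line break
# itself, which counts as whitespace, so the whole empty line disappears.
compact = re.sub(r"^\s", "", listing, flags=re.MULTILINE)
print(compact)
# Mexico
# Spain
# Colombia
```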

The final result is a neat list of countries.

Using Regex for other things

  • Clean up a list of emails
  • Clean up a list of products
  • Extract emails from a list
  • Etc…
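
For instance, extracting emails from a block of text could look roughly like the Python sketch below. The pattern is a common simplification that I am assuming is good enough for a first pass; it will miss some unusual but valid addresses. The sample text and the name EMAIL are invented for this example. The {2,} part means “two or more repetitions”, a close cousin of the + spell.

```python
import re

# Simplified email shape: name characters, "@", domain, ".", two or more letters.
# Assumption for illustration; real-world addresses can be more varied.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

notes = "Contact ana@example.com or sales@example.org; ignore marysmithyahoo.com."
found = EMAIL.findall(notes)
print(found)  # ['ana@example.com', 'sales@example.org']
```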

Like the gecko likes to say…

15 minutes of Regex could save you
15% or more on operating costs
