Using the Regular Expressions Module

Lorne Gaetz edited this page Mar 31, 2014 · 3 revisions

Starting on or about version 2.11.9, Superfecta includes a module that allows users to scrape web pages for Caller ID Name (CNAM) information. This page is a guide to configuring the module's fields for this purpose. The module is also useful for querying a REST URL that returns only the CNAM.

The URL Field

The first required field is the URL. This must be the complete URL, including the phone number, that yields a web page containing the Caller ID Name information. In place of the phone number, substitute the text string "$thenumber" (without quotes). For example, suppose I want to do a reverse number lookup using the yellowpages.com site. After navigating through the options and entering the phone number, the resulting URL looks like this:

http://www.yellowpages.com/reversephonelookup?fap_terms%5Bphone%5D=%28$thenumber&fap_terms%5Bsearchtype%5D=phone

Note that the URL contains $thenumber in place of the actual phone number.
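
The placeholder substitution can be pictured with a short Python sketch. The helper name build_lookup_url and the sample number are hypothetical and are not part of Superfecta; the module's own code may do this differently.

```python
# Hypothetical illustration of the $thenumber placeholder; not Superfecta's code.
URL_TEMPLATE = (
    "http://www.yellowpages.com/reversephonelookup"
    "?fap_terms%5Bphone%5D=%28$thenumber&fap_terms%5Bsearchtype%5D=phone"
)


def build_lookup_url(template: str, number: str) -> str:
    """Replace the literal text $thenumber with the number being looked up."""
    return template.replace("$thenumber", number)


# The sample number below is made up for the example.
print(build_lookup_url(URL_TEMPLATE, "5551234567"))
```

Every occurrence of the placeholder is replaced, so a URL that needs the number in more than one place would presumably work the same way.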

Regular Expression field

This field requires one or more Perl-compatible regular expressions (without delimiters or escapes) that will uniquely match the Caller ID Name in the web page loaded from the URL defined above. The name part of the regex must be enclosed in parentheses. To figure out what the regex needs to be, you need to know what text the URL returns. First put a single random character in the regex field, save the settings, set the debug level to "ALL", and trigger a debug lookup on a known good number. The debug output will include a text display of the returned page in a text box titled "Orignal Raw Returned Data:". Search through this data for the expected name and take note of the characters before and after the name. Using the example URL above, the returned name data looks something like this:

<div class="vcard" id="fapcard" style="display: none">
<div class="fn org">ACME Taxi</div>
<div class="adr">

The regex to return just the "ACME Taxi" part would look like this:

<div class="fn org">(.+?)</div>

Note that parentheses and wildcard characters are used in place of the name we want returned. Some sites use different formats depending on whether the phone number is residential or business, or whether it is a sponsored result, toll-free, etc. It may be necessary to go through this process with several different regexes.
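
Before saving a regex in the module, it can help to try it against the captured page text. The sketch below uses Python's re module as a stand-in for the Perl-compatible engine, which behaves the same way for these simple patterns. The second pattern and the function first_name_match are made up to illustrate trying several regexes in turn; they are not taken from any real site or from Superfecta.

```python
import re

# Page fragment copied from the example above.
page = """<div class="vcard" id="fapcard" style="display: none">
<div class="fn org">ACME Taxi</div>
<div class="adr">"""

# Regexes are tried in order; the first one that matches wins.
patterns = [
    r'<div class="fn org">(.+?)</div>',
    r'<span class="name">(.+?)</span>',  # hypothetical alternative page format
]


def first_name_match(text, regexes):
    """Return the text captured by the first regex that matches, else None."""
    for regex in regexes:
        match = re.search(regex, text)
        if match:
            return match.group(1)  # the part enclosed in parentheses
    return None


print(first_name_match(page, patterns))  # ACME Taxi
```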

A simpler case is using this module to query a REST URL that returns only the CNAM, with no extraneous characters to ignore. In this case, a regex that captures all characters looks like this:

(.*)
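
As a rough check of the catch-all regex, the following Python sketch applies it to a made-up response body. The function extract_cnam and the sample text are assumptions for illustration, and Python's re module stands in for the Perl-compatible engine.

```python
import re


def extract_cnam(body: str) -> str:
    """Capture everything the catch-all regex matches in the response body."""
    # (.*) can match an empty string, so the search always finds a match.
    match = re.search(r"(.*)", body)
    return match.group(1)


# Made-up response body from a hypothetical CNAM-only REST service.
print(extract_cnam("ACME Taxi"))  # ACME Taxi
```

Without the 's' option the wildcard stops at the first newline, so a trailing line break in the response is not captured.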

Options

This field is used for regex options. The help text is fairly straightforward, but one option that can be useful is 's'. Suppose that in the example above we needed a more specific regex to prevent false matches under certain circumstances. We might want it to include the lines preceding and following the name line. To do that, we would put a '.*?' wildcard in place of each newline and use the 's' option to permit the wildcard to match a newline. The regex would look like this:

<div class="vcard" id="fapcard" style="display: none">.*?<div class="fn org">(.*?)</div>.*?<div class="adr">
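
The effect of the 's' option can be seen in a small Python sketch, where the re.DOTALL flag plays the role that 's' plays in a Perl-compatible regex. The page text is the fragment from the earlier example; the variable names are illustrative only.

```python
import re

# Page fragment copied from the earlier example; note the newlines between lines.
page = """<div class="vcard" id="fapcard" style="display: none">
<div class="fn org">ACME Taxi</div>
<div class="adr">"""

regex = (
    r'<div class="vcard" id="fapcard" style="display: none">.*?'
    r'<div class="fn org">(.*?)</div>.*?<div class="adr">'
)

# Without the option, '.' cannot cross a newline, so the regex fails to match.
without_s = re.search(regex, page)

# re.DOTALL is Python's equivalent of the 's' option.
with_s = re.search(regex, page, re.DOTALL)

print(without_s)        # None
print(with_s.group(1))  # ACME Taxi
```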