Recently, I set upon creating an e-mail list for a bi-monthly newsletter that my company Evolcon is going to send with updates on everythings related to Qlik and becoming a more data-driven organization. Of course, I used Qlik to parse and clean the e-mail list, and in the process, I used some not so common scripting functions that are fun to use.
The first step was to extract the e-mail information from Outlook and Gmail. You can easily find sites that will lead you through the process in Google. I used a Google script found on this blog and the Outlook extraction was pretty straight forward.
Parsing with subfield()
Once I extracted all the e-mail addresses from my e-mail and copied them to an Excel file, I started to write my script in QlikView to parse and clean the e-mails. The first challenge I encountered was parsing the domain from the e-mail. That’s a pretty simple job with the subfield() function, and the following function did the job:
SubField(Email,'@',2) as Domain,
The subfield() function takes the string you want to parse as its first parameter, the string that you want to use as the separator or delimiter as its second parameter, and the index of the substring that you want the function to give you as its third parameter.
The above function worked to return the domain, but in the process of testing, I noticed that several fields contained various e-mails separated by a semi-colon (;). A lesser known feature of the subfield() function is its ability to create a new row for each subfield in a string if you do not define its third parameter, or the subfield index.
So when we apply the following load script,
Load Email_To, subfield(Email_From,';') as Email_From Resident Email;
to the following table,
|email@example.comfirstname.lastname@example.org; email@example.com; firstname.lastname@example.org|
Qlik creates the following table with 3 separate rows.
Advanced parsing with iterno()
This solution works fine if you need to parse only one field, but what if you have the following table with 2 columns that you need to parse.
|email@example.comfirstname.lastname@example.org; email@example.com; firstname.lastname@example.org||Elvis Presley; Marlyn Monroe; Karl Pover|
Digging a little deeper into Qlik’s ETL function list, I used the iterno() function.
Load Email_To, SubField(Email_From,';',IterNo()) as Email_From, SubField(Name_From,';',IterNo()) as Name_From, Resident Email While len(subfield(Email_From,';',IterNo())) >0;
The iterno() function along with the while clause makes Qlik repeat the load statement for the same table row until the condition in the while clause is false. The first time a row is load the iterno() returns 1 and then each additional time the same row is loaded, the value returned by iterno() is incremented by 1.
If you apply the subfield() function using a a separator that doesn’t exist in the string or you refer to an index that doesn’t exist then the subfield() function returns a NULL value. If I let you in on the fact that I prefer to use the len(Field)>0 instead of isnull(Field) to test for NULL or empty values then you can understand that this load statement will load the same row if only one e-mail is found in the Email_From field or there are no more semi-colon (;) separated e-mails left to load.
Therefore the result of the above load statement will be the following table:
Filtering with wildmatch()
You are almost finished. The final point to consider is filtering out all those automatic e-mail replies. The list of e-mails you would have to filter could be extensive so instead you can use one of my functions, wildmatch(), and to avoid hardcoding the filters into the script, you should create the list of filtered words in Excel and use the concat() function to define the parameters of the wildmatch() dynamically.
The list of filtered words in the table below
with the following script
Filters: LOAD chr(39) & concat(Filters,chr(39) & ',' & chr(39)) & chr(39) as Filter FROM Mailing_List.xlsx (ooxml, embedded labels, table is Filtros) Group by 1;
let vFilter = peek('Filter',0,'Filters');
Emails_Filtered: Load distinct * Resident Emails Where not wildmatch(Email,$(vFilter));
will result in the where clause being PASTE WHERE CLAUSE
Where not wildmatch(Email,'*=*','*reply*','*responder*','*support*','*marketing*','*auto*');
and allow you to easily filter all e-mails you don’t want to include in your e-mailing list.
You now have a parsed and clean e-mailing list ready to send your newsletter with the help of subfield(), iterno(), while, and wildmatch() functions. The best part of doing it in Qlik is that is completely reusable and as you add to the filter list it will get better with time.
If you want to try it out yourself, I’ve included the Qlik application and Excel file with a list of common filters in English and Spanish below in this post and in the downloads page. If you have the personal edition, I’ve included the script file that you can use. All you have to do is open the Edit Script window (Ctrl + E) in a new Qlik application, go to the file menu Insert -> Script File and select MailingListSample.qvs. Be sure to save the Qlik file in the same folder as Mailing List Sample.xlsx.
Here are the files.
Hope to see you around,