Verma’s Research Targets Detection of Phishing Emails and Opinion Spam
NSF-Funded Study Uses Natural Language Processing Techniques
Phishing emails hit inboxes every day.
In these emails, the sender tries to steal sensitive information from Internet users. Phishing causes both direct and indirect damage, including the additional equipment, software and manpower needed to combat it. Estimates of the yearly damage range from several hundred million to billions of dollars.
A three-year grant from the National Science Foundation’s Secure and Trustworthy Cyberspace Program to University of Houston’s Rakesh Verma will support research to identify phishing emails and opinion spam by using natural language processing techniques.
Verma will use a new approach that examines the body of an email to automatically identify phishing attempts and help protect Internet users.
“If you look at an email, it basically has three parts, the header, the body and the links. The novelty of this analysis is that no one we know of has looked at the content of the body of an email in identification of phishing emails,” said Verma, a professor of computer science in the College of Natural Sciences and Mathematics.
When analyzing the body of the email, Verma uses natural language processing techniques, such as looking at what kinds of verbs are used in the email. “We are not only using syntactic techniques on individual words; we’re also looking at what the concept is,” he said. Word-sense disambiguation programs allow the researchers to identify the intended concept when a verb has multiple meanings.
“Using the concept, we look at all the words that can refer to that concept. That way, we are able to make our classifiers very robust,” Verma said. “Phishers are constantly trying to change and adapt their technique. Regardless of how they change, we should be able to catch them.”
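The idea of matching on concepts rather than individual words can be sketched in a few lines of Python. This is only an illustration, not Verma's implementation: the `CONCEPTS` lexicon, its entries and the `concepts_in` helper are hypothetical, and the sketch skips sense disambiguation entirely by assuming each verb has a single meaning.

```python
import re

# Hypothetical, hand-written concept lexicon (an assumption for this sketch).
# A real system would derive it from a lexical resource and would apply
# word-sense disambiguation before mapping a verb to a concept.
CONCEPTS = {
    "confirm": {"confirm", "verify", "validate", "authenticate"},
    "update": {"update", "renew", "refresh"},
    "click": {"click", "follow", "visit"},
}

# Invert the lexicon: surface word -> concept label.
WORD_TO_CONCEPT = {word: concept
                   for concept, words in CONCEPTS.items()
                   for word in words}


def concepts_in(text):
    """Return the set of concept labels whose words appear in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {WORD_TO_CONCEPT[t] for t in tokens if t in WORD_TO_CONCEPT}


# Two differently worded requests map to the same concept.
a = concepts_in("Please verify your account now.")
b = concepts_in("Kindly authenticate your account today.")
print(a, b)
```

Because both messages reduce to the same concept label, a classifier built on concepts would treat them alike even though the sender swapped one verb for another.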
An alternate technique uses groups of words to help identify phishing emails.
“We took a small data set and did a statistical analysis to find out what patterns are more common in phishing emails but not as common in good emails,” Verma said. “Using those patterns, our classification accuracy is more than 95 percent, using only the text.” Once Verma’s technique of analyzing the body of an email is integrated with analysis of the header and links, accuracy reaches 99 percent.
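A minimal sketch of this kind of analysis, under stated assumptions: the "patterns" here are adjacent word pairs, the threshold `min_ratio` and the add-one smoothing are choices made for the example, and the six sample emails are invented for illustration. None of these details come from Verma's study.

```python
import re
from collections import Counter


def bigrams(text):
    """Set of adjacent lowercase word pairs in a text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return set(zip(tokens, tokens[1:]))


def doc_freq(emails):
    """Count how many emails contain each word pair."""
    counts = Counter()
    for email in emails:
        counts.update(bigrams(email))
    return counts


def phishing_patterns(phish, legit, min_ratio=2.0):
    """Word pairs seen in at least two phishing emails and at least
    min_ratio times more often (by smoothed rate) than in good emails."""
    p, g = doc_freq(phish), doc_freq(legit)
    patterns = set()
    for pair, n in p.items():
        phish_rate = n / len(phish)
        legit_rate = (g[pair] + 1) / (len(legit) + 1)  # add-one smoothing
        if n >= 2 and phish_rate / legit_rate >= min_ratio:
            patterns.add(pair)
    return patterns


def looks_phishy(email, patterns):
    """Flag an email that contains any of the learned word pairs."""
    return bool(bigrams(email) & patterns)


# Invented toy data, for illustration only.
phish = ["Verify your account now",
         "Please verify your account today",
         "Your account is suspended"]
legit = ["Your account statement is attached",
         "Your account manager called",
         "Lunch at noon today"]

patterns = phishing_patterns(phish, legit)
print(patterns)
```

In this toy data, "your account" appears in both kinds of email and is discarded, while "verify your" appears only in the phishing samples and is kept. A real data set would be far larger, and the reported accuracy figures cannot be reproduced from a sketch like this.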
Verma’s goal is to create a plug-in that can be downloaded and installed on browsers and that will automatically scan emails and flag those that are phishing attempts.
In addition to detection of phishing emails, Verma is working on detection of opinion spam.
Many websites are dedicated to user reviews of services and products. These reviews often include planted opinions, such as falsely positive opinions planted by the service or product provider or falsely negative opinions planted by a competitor.
Verma is using statistical analysis to create a classifier that automatically determines whether reviews on opinion sites are genuine or opinion spam.
“We are currently able to identify 91-92 percent of the opinions into the correct category. Our goal is to reach more than 95 percent,” he said. “It becomes tricky because we have found that sometimes genuine, honest opinions can be very shallow, and there is nothing to tie them to the service/product provider.”
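To make the idea of a statistical review classifier concrete, here is a small sketch. The article does not say which statistical method Verma uses; a word-count naive Bayes classifier is swapped in purely as an assumption, and the `NaiveBayes` class, the `tokenize` helper and the four sample reviews are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a review into lowercase words."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)   # reviews per label
        self.word_counts = {}               # label -> word frequencies
        for text, label in zip(texts, labels):
            self.word_counts.setdefault(label, Counter()).update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus smoothed log likelihood of each known word.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for word in tokenize(text):
                if word in self.vocab:
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy reviews, for illustration only.
texts = ["best hotel ever amazing amazing stay",
         "worst place ever avoid avoid",
         "room was clean and the staff helped with bags",
         "breakfast ended at ten and parking cost extra"]
labels = ["spam", "spam", "genuine", "genuine"]

model = NaiveBayes().fit(texts, labels)
print(model.predict("amazing amazing best ever"))
print(model.predict("the staff helped with parking"))
```

A word-count model like this leans on concrete detail to recognize a genuine review, which hints at the difficulty Verma describes: a short, shallow but honest opinion offers little evidence either way.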
- Lauren Abbott, College of Natural Sciences and Mathematics