During this year’s PAN @ CLEF evaluation lab on digital forensics, three of Anchormen’s data scientists participated in the Author Profiling task. The goal was to predict an author’s gender based on a series of tweets (text) and images.
For this article, we sit down with Luka Stout, a Machine Learning Engineer at Anchormen, who agreed to share his experience with us.
Can you tell us a bit more about PAN, the task that you chose, and who you worked on it with?
Luka Stout: PAN is a series of scientific events on digital text forensics and stylometry (the study of variations in writing style between writers). The event has been organized every year for a while now, and we usually participate when we have the time. I worked on this together with Chris Pool and Robert Musters, both very skilled Data Scientists at Anchormen.
This year’s tasks were author identification, author obfuscation, and author profiling. There wasn’t any special reason why we chose the author profiling task specifically. Chris has some experience with it and thought it could be interesting to work on.
What is author profiling and where do you see applications of the research?
Luka Stout: Author profiling distinguishes between classes of authors by studying their sociolect (the dialect of a social group), that is, how language is shared by people. This helps in identifying aspects of a profile such as gender, age, native language, or personality type. Author profiling is a problem of growing importance in fields such as forensics, security, and marketing.
For example, from a forensic linguistics perspective one would like to be able to know the linguistic profile of the author of a harassing text message (language used by a certain type of people) and identify certain characteristics (language as evidence). Similarly, from a marketing viewpoint, companies may be interested in knowing, based on the analysis of blogs and online product reviews, the demographics of their target groups.
What kind of data did you have to work with?
Luka Stout: The training set consisted of text in three different languages and images that were posted by the authors. For each author, we had 100 tweets and 10 images. There were 3,000 authors each for English and Spanish, and 1,500 authors for Arabic. In total, the models we developed analyzed 750,000 tweets and 75,000 images.
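Side Note: The totals quoted above follow directly from the per-author counts. The short check below simply redoes that arithmetic; the variable names are ours and are only there for illustration.

```python
# Authors per language, as stated in the interview.
authors_per_language = {"English": 3000, "Spanish": 3000, "Arabic": 1500}

TWEETS_PER_AUTHOR = 100
IMAGES_PER_AUTHOR = 10

total_authors = sum(authors_per_language.values())
total_tweets = total_authors * TWEETS_PER_AUTHOR
total_images = total_authors * IMAGES_PER_AUTHOR

print(total_authors, total_tweets, total_images)  # 7500 750000 75000
```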
What was the goal that you set out to accomplish and how did you do it?
Luka Stout: The goal was to develop a model that can infer the gender of the author based on
1) their tweets,
2) their images,
and 3) a combination of the two.
The inclusion of 3 different languages raised a further question – can a single model architecture work well across various languages?
In the end, we built two models: one for the text-based task, which used text classification, and another for the image-based task. We applied both traditional techniques (tf-idf and Naïve Bayes) and deep learning techniques (Recurrent Neural Networks and Convolutional Neural Networks) in order to reach the highest possible accuracy.
Side Note: If you follow our “Data Science Digest” series, you will remember that we recently talked about Natural Language Processing (NLP), and that one of the main areas we focused on was text classification.
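Side Note: For readers who have not seen the traditional techniques before, here is a minimal sketch of one common way to pair tf-idf with a multinomial Naïve Bayes classifier, written in plain Python. It is not the pipeline the team used: the tokenizer, the smoothing choices, the function names, the example texts, and the placeholder labels "a" and "b" are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Naive whitespace tokenizer; real tweets would need more careful handling.
    return text.lower().split()

def fit_idf(docs):
    """Smoothed inverse document frequency for every token seen in training."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + count)) + 1 for tok, count in df.items()}

def tfidf(doc, idf):
    """Map a document to {token: term count * idf}; unseen tokens are dropped."""
    counts = Counter(tokenize(doc))
    return {tok: c * idf[tok] for tok, c in counts.items() if tok in idf}

def train_nb(docs, labels, idf, alpha=1.0):
    """Multinomial Naive Bayes over tf-idf weights with additive smoothing."""
    weights = defaultdict(lambda: defaultdict(float))  # label -> token -> summed weight
    docs_per_label = Counter(labels)
    for doc, label in zip(docs, labels):
        for tok, w in tfidf(doc, idf).items():
            weights[label][tok] += w
    model = {}
    for label, n_label in docs_per_label.items():
        total = sum(weights[label].values()) + alpha * len(idf)
        log_prior = math.log(n_label / len(docs))
        log_likelihood = {
            tok: math.log((weights[label].get(tok, 0.0) + alpha) / total)
            for tok in idf
        }
        model[label] = (log_prior, log_likelihood)
    return model

def predict(doc, idf, model):
    """Return the label with the highest log-posterior score."""
    vec = tfidf(doc, idf)
    scores = {
        label: log_prior + sum(w * log_likelihood[tok] for tok, w in vec.items())
        for label, (log_prior, log_likelihood) in model.items()
    }
    return max(scores, key=scores.get)

# Invented toy data with placeholder class labels.
train_docs = [
    "love this sunny beach day",
    "beach day with friends",
    "football match tonight",
    "great football goal tonight",
]
train_labels = ["a", "a", "b", "b"]

idf = fit_idf(train_docs)
model = train_nb(train_docs, train_labels, idf)
print(predict("beach with friends", idf, model))
print(predict("football goal", idf, model))
```

In a setting like the one described here, each "document" could be the concatenation of an author's tweets, so that one prediction is made per author; that choice is also an assumption on our side.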
So, the moment of truth. What were the results?
Luka Stout: We achieved a combined accuracy of 76.2% when predicting an author’s gender from text alone. Although all models were tested, the deep learning one gave the best results. What is interesting is that there was a significant discrepancy between the languages, with Spanish being the least accurate. This might be because additional preprocessing was done for English and Arabic, but it could just as well be because of unknown variables, such as language patterns that we don’t understand.
On the image side, the results were a bit disappointing for us. We had difficulty determining an author’s gender from images alone, especially when the images showed random objects rather than the authors themselves. Because of that, we tried focusing only on selfies. We built a classifier that can distinguish between selfies and other types of images with 96% accuracy, and then put the selfies through a prediction model to determine whether the user is male or female. Images that were not selfies, on the other hand, were run through a random prediction model, which gave insignificant results. The accuracy levels were 62.3% for Arabic, 65.8% for English, and 62.3% for Spanish.
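Side Note: The two-stage image approach described above can be sketched as a simple routing step. This is only an illustration of the idea: `route_image`, the stub selfie detector, and the stub gender model are hypothetical stand-ins, not the classifiers the team trained.

```python
import random

LABELS = ("female", "male")

def route_image(image, is_selfie, selfie_gender_model, rng):
    """Send selfies to the gender model; guess at random for everything else."""
    if is_selfie(image):
        return selfie_gender_model(image)
    return rng.choice(LABELS)

# Hypothetical stubs standing in for the trained classifiers.
def stub_is_selfie(image):
    return image["kind"] == "selfie"

def stub_gender_model(image):
    # Placeholder: a real model would look at the pixels, not a stored hint.
    return image["label_hint"]

rng = random.Random(0)  # seeded so the random branch is reproducible
images = [
    {"kind": "selfie", "label_hint": "female"},
    {"kind": "tree"},
]
predictions = [
    route_image(img, stub_is_selfie, stub_gender_model, rng) for img in images
]
print(predictions)
```

Every image that falls into the random branch is right only about half of the time when the classes are balanced, which is one way to see why authors who post few selfies pull the overall image accuracy down.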
Do you mind sharing the full paper and results with our readers?
Luka Stout: Sure! Below you can see the table with the results and you can find the full paper here.
What was the biggest difficulty that you guys faced when working on the project?
Luka Stout: I guess the most difficult aspect is the same as with any other data project: the amount of data that you have to work with. The quality and quantity of available data is always the first concern, because that is what gives you an indication of how accurate (and useful) your results will be in the end. In this case we had only 100 tweets and 10 images per author available and, as you know, Twitter has a limit on the number of characters you can use in a tweet. All of this meant that we didn’t have much to work with, but I think the models still performed rather well.
Another issue was with the image classification. Because the images varied so much (e.g. you can have 1 selfie, 1 picture of a tree, a house, a beach, an inspirational quote, your pet, etc.), it was difficult to classify them in a meaningful way. This also shows in the lower image results that we got. I’m curious to see how other teams performed in that area.