However, our starting point will always be SVR with token unigrams, this being the best performing combination.

For those techniques where hyperparameters need to be selected, we used a leave-one-out strategy on the test material.
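One possible reading of such a leave-one-out selection is sketched below. It assumes that, for every hyperparameter setting, we already know whether each author was classified correctly. The function name `loo_select` and the setting labels `C=1` and `C=10` are hypothetical and serve only as illustration.

```python
def loo_select(correct):
    """Leave-one-out hyperparameter selection (illustrative sketch).

    correct: dict mapping a setting label to a list of booleans, one per
    author (same author order for every setting), True = correct class.
    For each author, the setting with the highest accuracy on all OTHER
    authors is chosen and then applied to the held-out author.
    """
    settings = sorted(correct)  # fixed order gives deterministic tie-breaking
    n_authors = len(correct[settings[0]])
    chosen, outcomes = [], []
    for i in range(n_authors):
        # Number of correct decisions per setting, excluding author i
        best = max(settings, key=lambda s: sum(correct[s]) - correct[s][i])
        chosen.append(best)
        outcomes.append(correct[best][i])
    return chosen, sum(outcomes) / n_authors


# Toy outcomes for four hypothetical authors under two hypothetical settings
toy = {
    "C=1": [True, True, False, True],
    "C=10": [True, False, True, True],
}
chosen, accuracy = loo_select(toy)
```

In this toy case each setting on its own is correct for three of the four authors, yet the leave-one-out procedure scores lower. The selected setting is not tuned on the author it is applied to, so the resulting accuracy is a more conservative estimate.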

Then we describe our experimental data and the evaluation method (Section 3), after which we proceed to describe the various author profiling strategies that we investigated (Section 4).

For each blogger, metadata is present, including the blogger's self-provided gender, age, industry and astrological sign. Figure 4 shows that the male population contains some more extreme exponents than the female population.

The age component of the system is described in Nguyen et al. Apart from normal tokens like words, numbers and dates, it is also able to recognize a wide variety of emoticons. However, we do observe different behaviour when reversing the signs.

In this paper, we start modestly, by attempting to derive just the gender of the authors automatically, purely on the basis of the content of their tweets, using author profiling techniques.

For the bigrams (Figure 2), we see much the same picture, although there are differences in the details.

To test that, we would have to experiment with new feature types, modeling exactly the difference between the normalized and the original form.

In effect, this N is a further hyperparameter, which we varied from 1 to the total number of components (usually as many as there are authors), using a stepsize of 1 from 1 to 10, and then slowly increasing the stepsize to a maximum of 20 for higher values. We aimed for users.
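The stepping scheme for N can be sketched as follows. Only the stepsize of 1 up to 10 and the maximum stepsize of 20 come from the description above. The rate at which the stepsize grows (the `growth` parameter), the function name and the total of 500 components are assumptions made purely for illustration.

```python
def candidate_component_counts(total, fine_until=10, max_step=20, growth=15):
    """Candidate values for N, from 1 up to `total` (illustrative sketch).

    Stepsize 1 up to `fine_until`; beyond that the stepsize increases by 1
    for every `growth` further components (an ASSUMED rule), capped at
    `max_step`. `total` itself is always included as the last candidate.
    """
    values = []
    n = 1
    while n <= total:
        values.append(n)
        if n < fine_until:
            step = 1
        else:
            step = min(max_step, 1 + (n - fine_until) // growth)
        n += step
    if values and values[-1] != total:
        values.append(total)
    return values


# Hypothetical total of 500 components
candidates = candidate_component_counts(500)
steps = [b - a for a, b in zip(candidates, candidates[1:])]
```

Such a schedule keeps the search fine-grained where a single extra component is most likely to matter, and coarse where neighbouring values of N behave almost identically.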

However, even with purely lexical features, 4. For all feature types, we used only those features which were observed with at least 5 authors in our whole collection (for skip bigrams, 10 authors). The authors apply logistic and linear regression on counts of token unigrams occurring at least 10 times in their corpus.
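The author-count threshold on features can be illustrated with a minimal sketch. The data layout (a dictionary from author id to a list of extracted features) and the function name `select_features` are assumptions. Only the idea of counting distinct authors per feature comes from the text.

```python
from collections import Counter


def select_features(author_features, min_authors=5):
    """Keep features observed with at least `min_authors` distinct authors.

    author_features: dict mapping an author id to an iterable of features
    (e.g. token unigrams) extracted from that author's tweets.
    """
    author_counts = Counter()
    for features in author_features.values():
        # set(): an author counts once per feature, however often it is used
        author_counts.update(set(features))
    return {f for f, n in author_counts.items() if n >= min_authors}


# Toy data for three hypothetical authors
toy = {
    "a1": ["ik", "ben", "ik"],
    "a2": ["ik", "jij"],
    "a3": ["ben", "ik"],
}
selected = select_features(toy, min_authors=2)
```

Counting authors rather than raw occurrences prevents a single prolific author from keeping an otherwise idiosyncratic feature in the set. For skip bigrams, the same function would be called with `min_authors=10`.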

For the measurements with PCA, the number of principal components provided to the classification system is learned from the development data. When running the underlying systems 7.

Another interesting group of authors is formed by the misclassified ones. Trigrams: Three adjacent tokens. As scaling is not possible when there are columns with constant values, such columns were removed first.
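The reason constant columns have to go first becomes clear in a small sketch. We assume here that scaling means standardization to zero mean and unit variance; the text does not specify the scaling scheme, so this is an assumption, as are the function names. A constant column has a standard deviation of zero, so standardizing it would divide by zero.

```python
from statistics import mean, pstdev


def drop_constant_columns(rows):
    """Return (filtered_rows, kept_indices), dropping columns that never vary."""
    n_cols = len(rows[0])
    kept = [j for j in range(n_cols) if len({row[j] for row in rows}) > 1]
    return [[row[j] for j in kept] for row in rows], kept


def standardize(rows):
    """Scale every column to zero mean and unit variance (ASSUMED scheme)."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]  # nonzero once constant columns are gone
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


# Toy feature matrix: three authors, three features; the middle one is constant
rows = [[1.0, 5.0, 0.0], [3.0, 5.0, 2.0], [5.0, 5.0, 4.0]]
filtered, kept = drop_constant_columns(rows)
scaled = standardize(filtered)
```

The indices in `kept` should be stored, so that the same columns can be selected from the vectors of unseen authors.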

However, the high dimensionality of our vectors presented us with a problem. When we look at his tweets, we see a kind of financial blog, which is an exception in the population we have in our corpus. For each setting and author, the systems report both a selected class and a floating point score, which can be used as a confidence score.

The second classification system was Linguistic Profiling (LP; van Halteren), which was specifically designed for authorship recognition and profiling.

For the unigrams, SVR reaches its peak. Recognition accuracy as a function of the number of principal components provided to the systems, using normalized character 5-grams. Results: In this section, we will present the overall results of the gender recognition.

Roughly speaking, it classifies on the basis of noticeable over- and underuse of specific features. Apparently, in our sample, politics is a male thing.
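The notion of noticeable over- and underuse can be made concrete with a rough sketch. It is not the LP system itself. It only assumes that use is measured as a relative frequency and compared with a reference population by means of a standardized deviation. The function name, the threshold and the toy frequencies are hypothetical.

```python
from statistics import mean, pstdev


def usage_profile(author_freqs, reference_freqs, threshold=1.0):
    """Features the author noticeably over- or underuses (illustrative).

    author_freqs: dict feature -> relative frequency for one author.
    reference_freqs: dict feature -> list of relative frequencies, one per
    reference author. Returns feature -> standardized deviation for all
    features whose absolute deviation reaches `threshold`:
    positive = overuse, negative = underuse.
    """
    profile = {}
    for feature, observed in reference_freqs.items():
        sd = pstdev(observed)
        if sd == 0:
            continue  # no variation in the reference, deviation undefined
        z = (author_freqs.get(feature, 0.0) - mean(observed)) / sd
        if abs(z) >= threshold:
            profile[feature] = z
    return profile


# Toy relative frequencies for two hypothetical features
reference = {"politiek": [0.01, 0.02, 0.03], "haha": [0.05, 0.05, 0.05]}
author = {"politiek": 0.06, "haha": 0.05}
profile = usage_profile(author, reference)
```

A classifier along these lines would compare such a profile with the typical profiles of the classes to be recognized. How that comparison is carried out is left open here.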

We expect that the performance with TiMBL can be improved greatly with the development of a better hyperparameter selection mechanism. SVR now already reaches its peak