Although LP performs worse than it could on fixed numbers of principal components, its more detailed confidence score allows a better hyperparameter selection, on average selecting around 9 principal components, whereas TiMBL chooses from a wide range of numbers, generally far lower than is optimal.

This means that the content of the n-grams is more important than their form. In this way, we derived a classification score for each author without the system having any direct or indirect access to the actual gender of the author. Apparently, in our sample, politics is a male thing.

Unigrams are most closely mirrored by the character n-grams, as could already be suspected from the content of these two feature types.

We did a quick spot check with another author, a girl who plays soccer and is therefore also often misclassified; here, the PCA version agrees with the original unigrams and misclassifies her even more strongly.

From this point on in the discussion, we will present female confidence as positive numbers and male as negative. LP keeps its peak at 10, but it is now even lower than for the token n-grams. This apparently colours not only the discussion topics, which might be expected, but also the general language use.
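The sign convention introduced above (female positive, male negative) can be illustrated with a minimal sketch. The function names and the label strings `"female"` and `"male"` are hypothetical, chosen only for illustration; the scale of the systems' actual scores is not specified here.

```python
def signed_confidence(gender: str, confidence: float) -> float:
    """Map a (gender, confidence) pair onto a single signed axis.

    Convention taken from the text: female is positive, male is negative.
    The label strings are illustrative assumptions.
    """
    if confidence < 0:
        raise ValueError("confidence must be non-negative")
    if gender == "female":
        return confidence
    if gender == "male":
        return -confidence
    raise ValueError(f"unknown gender label: {gender!r}")


def predicted_gender(score: float) -> str:
    """Read the prediction back from the sign of the score."""
    if score > 0:
        return "female"
    if score < 0:
        return "male"
    # A score of exactly zero carries no preference (an assumption of this sketch).
    return "undecided"
```

With this convention, a single number carries both the predicted class (its sign) and the strength of the prediction (its magnitude).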

We then progressed to the selection of individual users. SVR now already reaches its peak. Then we outline how we evaluated the various strategies (Section 3).

If we look at the rest of the top males (Table 2), we may see more varied topics, but the wide recognizability stays. The creators themselves used it for various classification tasks, including gender recognition (Koppel et al.).

Then follow the results (Section 5), and Section 6 concludes the paper. There is an extreme number of misspellings (even for Twitter), which may possibly confuse the systems' models.

However, we cannot conclude that what is wiped away by the normalization (use of diacritics, capitals and spacing) holds no information for the gender recognition. We aimed for users. If no cue is found in a user's profile, no gender is assigned. In this paper, we start modestly, by attempting to derive just the gender of the authors automatically, purely on the basis of the content of their tweets, using author profiling techniques.
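To make concrete what kind of information such a normalization removes, the following is a minimal sketch that strips diacritics, capitals and spacing variation. It is an assumption about what a normalization of this kind could look like, not the procedure actually used; the function name `normalize` is hypothetical.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Remove diacritics, capitals, and spacing variation from a text."""
    # Decompose accented characters (e.g. "é" becomes "e" plus a combining accent).
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, which removes the diacritics.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Fold capitals to lower case.
    lowered = stripped.lower()
    # Collapse every run of whitespace into a single space and trim the ends.
    return re.sub(r"\s+", " ", lowered).strip()
```

For example, `normalize("Hé,  dat is   ÉÉN   idee!")` yields `"he, dat is een idee!"`: the accents, the emphatic capitals and the extra spaces, all of which an author may use deliberately, are no longer visible to the classifier.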

In the following sections, we first present some previous work on gender recognition (Section 2). When we look at his tweets, we see a kind of financial blog, which is an exception in the population we have in our corpus.

This corpus has been used extensively since. SVR tends to place him clearly in the male area with all the feature types, with unigrams at the extreme. SVR with PCA, on the other hand, is less convinced, and even classifies him as female for unigrams.

