Improving natural language processing with demographic-aware models

Word associations vary across different demographics, allowing researchers to build better natural language processing models if they can account for demographics.


Understanding the associations that form in the mind is crucial to understanding how humans acquire language over a lifetime of learning. Word associations are believed to mirror the mental model of conceptual connections in the human mind. They begin forming early in life, as language is acquired and one learns from one's environment how concepts relate to each other. These associations also shift and morph as a person gains new life experiences; for example, older people associate “sleep” with “awake,” instead of “bed” or “dream,” which are the top choices for younger age groups.

Computational linguistics has traditionally taken a “one-size-fits-all” approach, with most models remaining agnostic to the speakers behind the language. With the introduction and adoption of Web 2.0, there has been an exponential increase in the availability of digital user-centric data in the form of blogs, microblogs, and other forms of online participation. Such data can often be augmented with demographic or other user-focused attributes. This enables computational linguists to go beyond generic corpus-based metrics of word association and extract associations that pertain to given demographic groups, which previously would not have been possible without administering time-consuming and resource-intensive word association surveys.

Michigan researchers, including Prof. Rada Mihalcea, research fellow Carmen Banea, and graduate student Aparna Garimella, have found that word associations vary across different demographics, and that researchers can build better natural language processing models if they account for them. Their research consisted of building a new dataset of word association responses for approximately 300 stimulus words, collected from more than 800 respondents of different genders (male/female) and from different locations (India and the United States), for a total of 176,097 responses. They collected the data via an online survey structured into two sections: word association and demographic information. Each participant was presented with a set of 50 stimulus words at a time, along with a demographic questionnaire of seven questions covering gender, age, location, occupation, ethnicity, education, and income.

The results showed significant variations in the word associations made by these groups. For example, for the stimulus word “bath,” the most frequent response for both American and Indian men was “water,” while for Indian women it was “soap” and for American women “bubble.”
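As a toy illustration of how such per-group associations can be tallied from survey data, the following sketch groups responses by demographic attributes. The field names and records are invented; only the “bath” responses mirror the example reported above.

```python
from collections import Counter

# Invented survey rows in the spirit of the dataset described above:
# each row pairs a stimulus word, a free response, and demographic fields.
rows = [
    {"stimulus": "bath", "response": "water",  "gender": "male",   "country": "US"},
    {"stimulus": "bath", "response": "water",  "gender": "male",   "country": "India"},
    {"stimulus": "bath", "response": "bubble", "gender": "female", "country": "US"},
    {"stimulus": "bath", "response": "bubble", "gender": "female", "country": "US"},
    {"stimulus": "bath", "response": "soap",   "gender": "female", "country": "India"},
    {"stimulus": "bath", "response": "soap",   "gender": "female", "country": "India"},
]

def top_response(rows, stimulus, **demo):
    """Most frequent response to `stimulus` among rows matching the
    given demographic attributes (e.g. gender="female", country="US")."""
    counts = Counter(
        row["response"]
        for row in rows
        if row["stimulus"] == stimulus
        and all(row[k] == v for k, v in demo.items())
    )
    return counts.most_common(1)[0][0] if counts else None

print(top_response(rows, "bath", gender="female", country="US"))     # bubble
print(top_response(rows, "bath", gender="female", country="India"))  # soap
```

Filtering before counting keeps each demographic slice independent, which is exactly what exposes the group-specific associations that a pooled corpus would average away.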

The researchers proposed a new demographic-aware word association model based on the skip-gram neural network architecture. They showed that this method outperforms generic methods and previously proposed models of word association, demonstrating that it is useful to account for the demographics of the people behind the language when performing automatic word association.
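One way to make a skip-gram model demographic-aware is to combine each cue word's vector with a learned vector for the respondent's demographic group before predicting the associated word. The sketch below illustrates this idea with negative sampling on toy data; it is a minimal illustration of the general approach, not the authors' implementation, and all names, dimensions, and training data are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DemographicSkipGram:
    def __init__(self, vocab_size, n_groups, dim=16, lr=0.1):
        self.W_word = rng.normal(0.0, 0.1, (vocab_size, dim))  # input word vectors
        self.W_demo = rng.normal(0.0, 0.1, (n_groups, dim))    # demographic vectors
        self.W_out = rng.normal(0.0, 0.1, (vocab_size, dim))   # output (response) vectors
        self.lr = lr
        self.vocab_size = vocab_size

    def assoc_score(self, cue, response, group):
        """Probability-like association strength of `response` given `cue` + group."""
        h = self.W_word[cue] + self.W_demo[group]
        return float(sigmoid(h @ self.W_out[response]))

    def step(self, cue, response, group, n_neg=3):
        # The input representation mixes the cue word with the demographic group.
        h = self.W_word[cue] + self.W_demo[group]
        # One positive pair plus a few random negatives (excluding the positive).
        pairs = [(response, 1.0)] + [
            (int(w), 0.0)
            for w in rng.integers(0, self.vocab_size, n_neg)
            if int(w) != response
        ]
        grad_h = np.zeros_like(h)
        for w, label in pairs:
            g = sigmoid(h @ self.W_out[w]) - label  # log-loss gradient
            grad_h += g * self.W_out[w]
            self.W_out[w] -= self.lr * g * h
        self.W_word[cue] -= self.lr * grad_h
        self.W_demo[group] -= self.lr * grad_h

# Toy training: group 0 (say, American women) answers "bath" -> "bubble",
# group 1 (say, Indian women) answers "bath" -> "soap".
BATH, WATER, SOAP, BUBBLE = 0, 1, 2, 3
model = DemographicSkipGram(vocab_size=4, n_groups=2)
for _ in range(500):
    model.step(BATH, BUBBLE, group=0)
    model.step(BATH, SOAP, group=1)
```

After training, `model.assoc_score(BATH, BUBBLE, 0)` exceeds `model.assoc_score(BATH, SOAP, 0)`, and vice versa for group 1: the shared word vectors capture what the groups agree on, while the group vectors capture demographic-specific shifts in association.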

They regard this as a first step toward demographic-aware NLP, and plan to address more advanced NLP tasks while accounting for demographics. Applications that stand to gain from considering demographically situated text include information retrieval (which relies heavily on word association and similarity), demographic-aware keyword extraction, and dialogue personalization, since users can be presented with information that is more customized and relevant to them. Equitable natural language processing is another potential application, as demographic-aware methods can counter imbalances or biases that appear in text datasets, particularly those obtained from authors with homogeneous traits.

The research was published in the proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2017).