Have you ever wondered about the language used on social media? Or whether the words we use reflect our personality traits? If your answer is yes, you are in luck.
Background
The Positive Psychology Center at the University of Pennsylvania created what is known as the World Well-Being Project (WWBP). This project is developing scientific techniques for measuring psychological well-being and physical health based on the analysis of language in social media. Computer scientists, psychologists, and statisticians are putting their heads together to study the psychosocial processes that affect health and happiness, and are exploring the potential for replacing expensive survey methods. In 2013, WWBP published a study entitled “Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach”. In one of the largest studies of its kind to date, WWBP analyzed over 700 million words, phrases, and topic instances collected from the Facebook messages of 75,000 volunteers. To analyze the messages, they used two different methods to find language linked to demographic and psychological attributes:
- Differential Language Analysis (DLA): a method designed by WWBP to identify the most distinguishing language features for any given attribute.
- Linguistic Inquiry and Word Count (LIWC): a popular tool used in psychology to analyze the ways people use words in their daily lives. This can provide rich information about their beliefs, fears, thinking patterns, social relationships, and personalities.
In this particular study, WWBP also had volunteers take standard personality tests (the Big 5 Factor Model) to determine which words are associated with particular personality traits. Combining all of these, they were able to link social media language to personality, gender, and age; for gender, the language alone predicted a user's gender with 91.9% accuracy. Now let’s get our hands dirty and take a look at how they gathered the Facebook status updates and built their visualization charts.
The Data
The complete dataset consisted of approximately 19 million Facebook status updates written by all participants. The team at WWBP restricted their analysis to Facebook users who met four criteria:
- Indicated English as a primary language
- Had written at least 1,000 words in their status updates
- Were under the age of 65
- Indicated both gender (male or female) and age
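The four criteria above amount to a simple filter over participant records. Here is a minimal sketch of that filter in Python. The `Participant` record, its field names, and the sample rows are all hypothetical, made up for illustration; the study's actual data schema is not described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Participant:
    # Hypothetical record layout, not the study's real schema.
    primary_language: str
    word_count: int            # total words across all status updates
    age: Optional[int]         # None if the user did not indicate an age
    gender: Optional[str]      # "male", "female", or None if not indicated


def meets_criteria(p: Participant) -> bool:
    """Return True if the participant passes all four inclusion criteria."""
    return (
        p.primary_language == "English"
        and p.word_count >= 1000
        and p.age is not None
        and p.age < 65
        and p.gender in ("male", "female")
    )


# Made-up sample rows, one per way of failing (plus one that passes).
participants = [
    Participant("English", 2500, 24, "female"),   # passes
    Participant("English", 400, 31, "male"),      # too few words
    Participant("Spanish", 5000, 19, "female"),   # English not primary
    Participant("English", 1200, 70, "male"),     # 65 or older
    Participant("English", 1800, None, "male"),   # age not indicated
]

eligible = [p for p in participants if meets_criteria(p)]
print(f"{len(eligible)} of {len(participants)} participants are eligible")
```

Keeping the criteria in one function makes it easy to see which rule excludes a given user.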
Language of Gender:
Female language features are shown on top with male language features below. The size of the word indicates the strength of the correlation; the color indicates relative frequency of usage. Underscores (_) connect words of multiword phrases.
- Females in this study (top) used more emotion words (e.g., ‘excited’) and first-person singulars, and they mentioned more psychological and social processes (e.g., ‘love you’ and ‘<3’, a heart).
- Males used more swear words and object references (e.g., ‘xbox’, ‘black ops’, ‘wishes he’).
Language of Age:
As you can see in Figure 3 above, there are subtle changes of topics progressing from one age group to the next. There are also clear distinctions in word choice, such as the use of slang, emoticons, and Internet speak in the 13 to 18 age group. We see a school-related topic for 13 to 18 year olds (e.g. ‘school’, ‘homework’, ‘ugh’) and a college-related topic for 19 to 22 year olds (e.g. ‘semester’, ‘college’, ‘register’). In the 23 to 29 age group, Internet speak fades and work topics appear (e.g. ‘at work’, ‘new job’). In the 30 to 65 age group, the words focus more on family and friends (e.g. ‘daughter’, ‘my son’, ‘my kids’, and ‘my fb friends’). In general, you will see a progression from school to college to work to family when looking at the major topics across all age groups.
Standard Frequency of Topics and Words Based on Age:
In Fig. 4A, the graph shows the relative frequency of the most selective topics for each age group as a function of age. Fig. 4B reinforces this progression by showing a similar pattern for other social topics. Fig. 4C shows that the use of ‘we’ increases after the age of 22, whereas the use of ‘I’ decreases. This suggests the increasing importance of friendships and relationships as people age.
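To make "relative frequency as a function of age" concrete, here is a small Python sketch that computes, for each age, the share of all tokens that are ‘I’ or ‘we’. The function name `word_share_by_age`, the whitespace tokenizer, and the two sample updates are assumptions made for illustration; the study's charts are built from far more data and its own preprocessing.

```python
from collections import defaultdict


def word_share_by_age(updates, words):
    """updates: iterable of (age, text) pairs.

    Returns {age: {word: fraction of that age's tokens equal to word}}.
    Tokenization is a plain lowercase whitespace split (a simplification).
    """
    totals = defaultdict(int)                       # tokens seen per age
    hits = defaultdict(lambda: defaultdict(int))    # tracked-word counts per age
    for age, text in updates:
        tokens = text.lower().split()
        totals[age] += len(tokens)
        for tok in tokens:
            if tok in words:
                hits[age][tok] += 1
    return {
        age: {w: hits[age][w] / totals[age] for w in words}
        for age in totals
        if totals[age] > 0
    }


# Made-up status updates, not data from the study.
updates = [
    (16, "i hate homework i want sleep"),
    (35, "we took the kids out and we loved it"),
]

shares = word_share_by_age(updates, {"i", "we"})
for age in sorted(shares):
    print(age, shares[age])
```

Plotting these shares against age, one line per word, would give a rough analogue of the kind of curve described for Fig. 4C.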
Language of Personality:
The researchers dug into how our language and personality coincide. They analyzed the words used by participants and organized them based on the personality of each participant. Here’s a quick refresher of the 5 Personality Factors:
- Extroversion: how you interact with people
- Neuroticism: how you deal with emotions
- Agreeableness: how you feel towards others
- Conscientiousness: how organized and dependable you are
- Openness: how curious and open-minded you are about new experiences and knowledge
We can see at the top left that socially related categories, like party topics, emerge as a key distinguishing feature for Extroverts. The results also suggest that Introverts are interested in Japanese media (e.g. ‘anime’, ‘manga’, ‘internet’, and Japanese-style emoticons: ˆ_ˆ). The bottom left of Figure 5 above shows that people high in Neuroticism commonly used phrases like ‘sick of’, ‘depressed’, and ‘I hate’. The bottom right shows language related to emotional stability (low Neuroticism): these individuals wrote about enjoyable social activities that foster harmony or create a greater emotional balance, such as ‘sports’, ‘vacation’, ‘beach’, ‘church’, ‘team’, and a family time topic.
In Figure 6 below (bottom right), people low in Openness use shortened words in their status updates (e.g. ‘2day’, ‘ur’, ‘every 1’). People high in Openness (bottom left) use creative words (e.g. ‘art’, ‘universe’, ‘music’, ‘writing’, and ‘soul’). It may not match your own experience, but people low in Conscientiousness (middle-right) use very explicit words in their updates, whereas people high in Conscientiousness (middle-left) use phrases such as ‘to work’, ‘ready for’, and ‘great day’.
As you can see, social media platforms such as Facebook and Twitter are very favorable resources for the study of people: status updates and tweets are expressive, personal, and full of emotional content. Remember a few things:
- Language, in general, is unbiased and measurable behavioral data.
- Facebook language specifically allows researchers to observe individuals as they present their true self to the online world.
DLA Method
Figure 1
As you can see from Figure 1 above, DLA operates in two steps:
- Feature extraction: extracting language features from the text. These are (a) words and phrases: sequences of 1 to 3 words found in a string of text, emoticons, and groups of two or more words that correspond to some conventional way of saying things; and (b) topics: groups of related words found by automatically analyzing large collections of unlabeled text.
- Correlational analysis: correlating those words, phrases, and topics with gender, age, and personality.
Since they found thousands of significantly correlated words, visualization charts were key to understanding their research (see Fig. 2).
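The two DLA steps described above can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not WWBP's implementation: the tokenizer is a simple regular expression, topics are skipped entirely, a plain Pearson correlation stands in for whatever statistical model the study used, and the four sample messages and their labels are made up. All function names (`extract_ngrams`, `relative_frequencies`, `pearson`, `correlate_features`) are hypothetical.

```python
import math
import re
from collections import Counter


def extract_ngrams(text, max_n=3):
    """Step 1: count every sequence of 1 to max_n tokens in the text.

    Multiword phrases are joined with underscores, as in the word clouds.
    Returns (ngram counts, number of tokens).
    """
    # Simplified tokenizer: lowercase words/numbers, plus the '<3' heart.
    tokens = re.findall(r"[a-z0-9']+|<3", text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts["_".join(tokens[i:i + n])] += 1
    return counts, len(tokens)


def relative_frequencies(text, max_n=3):
    """Normalize ngram counts by the number of tokens the user wrote."""
    counts, total = extract_ngrams(text, max_n)
    return {gram: c / total for gram, c in counts.items()} if total else {}


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def correlate_features(texts, attribute):
    """Step 2: correlate each feature's relative frequency with an attribute."""
    freqs = [relative_frequencies(t) for t in texts]
    vocab = set().union(*freqs)
    return {
        gram: pearson([f.get(gram, 0.0) for f in freqs], attribute)
        for gram in vocab
    }


# Made-up messages, one per user; 1 = female, 0 = male (illustrative labels).
texts = [
    "love you so much <3",
    "so excited love my family",
    "playing xbox all night",
    "xbox and black ops tonight",
]
is_female = [1, 1, 0, 0]

scores = correlate_features(texts, is_female)
for gram in ("love", "<3", "xbox", "black_ops"):
    print(gram, round(scores[gram], 3))
```

In a word cloud like the ones in this post, each feature's correlation would set the size of the word; with only four messages the numbers here are meaningless beyond showing the mechanics.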
Fascinating, right?
Would it be possible to obtain this data?! I am doing a similar study, and I’d appreciate it.
Best,
Craig Cook