Over the last few years, there has been a significant increase in the use of Twitter to share updates, seek help and report emergencies during a disaster. Social media platforms can be instrumental in keeping track of events like damage to personal property or injuries during natural disasters. However, algorithms that monitor social media posts to signal the occurrence of natural disasters must be swift so that relief operations can be mobilized immediately.
A team of researchers led by Dr. Ruihong Huang, assistant professor in the Department of Computer Science and Engineering at Texas A&M University, has developed a novel weakly supervised approach that can train machine learning algorithms quickly to recognize tweets related to disasters.
“Because of the sudden nature of disasters, there’s not much time available to build an event recognition system,” said Huang. “Our goal is to be able to detect life-threatening events using individual social media messages and recognize similar events in the affected areas.”
The researchers described their findings in the proceedings of the Association for the Advancement of Artificial Intelligence’s 34th Conference on Artificial Intelligence.
Texts on social media platforms, like Twitter, can be categorized using standard algorithms called classifiers. A classifier sorts data into labeled classes or categories, much as the spam filters used by email service providers scan incoming emails and classify them as either “spam” or “not spam” based on prior knowledge of what spam and non-spam messages look like.
Classifiers are an integral part of many machine learning systems, which make predictions based on carefully labeled sets of data. In the past, machine learning algorithms have been used for event detection based on tweets or a burst of words within tweets. To build a reliable classifier, human annotators have to manually label large numbers of data instances one by one, which usually takes several days and sometimes weeks or months.
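To make the spam-filter analogy concrete, here is a minimal sketch of a classifier that learns from a handful of labeled messages. It uses a simple naive Bayes word-count model, which is a common textbook choice and not the method described by the researchers; the function names and the example messages are hypothetical.

```python
import math
from collections import Counter

def train(examples):
    """Count words per label from (text, label) pairs. Hypothetical helper."""
    word_counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts

def classify(text, model):
    """Return the label with the highest naive Bayes log-probability."""
    word_counts, label_counts = model
    vocab = set().union(*word_counts.values())
    total = sum(label_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        n = sum(counts.values())
        # Start from the prior, then add one smoothed term per word.
        score = math.log(label_counts[label] / total)
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / (n + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up training messages, for illustration only.
examples = [
    ("win free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting moved to noon", "not spam"),
    ("lunch at noon tomorrow", "not spam"),
]
model = train(examples)
print(classify("free prize inside", model))      # spam
print(classify("lunch meeting at noon", model))  # not spam
```

The quality of such a model depends directly on how many labeled examples it sees, which is why manual labeling becomes the bottleneck.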
The researchers also found that it is essentially impossible to find a keyword that has only one meaning on social media, because the meaning depends on the context of the tweet. For example, if the word “dead” is used as a keyword, it will pull in tweets on a variety of topics, such as a dead phone battery or the television series “The Walking Dead.”
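The ambiguity is easy to reproduce with a plain keyword filter. The sketch below is illustrative only: the tweets are invented, and `matches_keyword` is a hypothetical helper, not part of the researchers' system.

```python
import re

def matches_keyword(text, keyword):
    """True if the keyword appears as a whole word, ignoring case."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None

# Invented tweets, for illustration only.
tweets = [
    "Two people found dead after the flooding on our street",
    "My phone battery is dead again",
    "Catching up on The Walking Dead tonight",
    "Power is out across the neighborhood",
]

matched = [t for t in tweets if matches_keyword(t, "dead")]
for tweet in matched:
    print(tweet)
```

Three of the four tweets match the keyword, but only the first concerns a disaster. The filter has no way to tell them apart, and it also misses the fourth tweet, which may be relevant but uses different words.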
“We have to be able to know which tweets that contain the predetermined keywords are relevant to the disaster and separate them from the tweets that contain the correct keywords but are not relevant,” said Huang.
To build more reliable labeled datasets, the researchers first used an automatic clustering algorithm to put the tweets containing the keywords into small groups. Next, a domain expert looked at the context of the tweets in each group to identify whether the group was relevant to the disaster. The labeled tweets were then used to train the classifier to recognize the relevant tweets.
Using data gathered from the most impacted time periods for hurricanes Harvey and Florence, the researchers found that their data labeling method and overall weakly supervised system took one to two person-hours, compared with the 50 person-hours required to carefully annotate thousands of tweets by hand for the supervised approach.
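The cluster-then-label idea can be sketched in a few lines. The article does not name the clustering algorithm the researchers used, so the greedy word-overlap grouping below is a stand-in chosen for brevity; the threshold, the function names, the tweets and the expert's judgments are all hypothetical.

```python
def tokens(text):
    return set(text.lower().split())

def jaccard(a, b):
    """Word overlap between two token sets, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def greedy_cluster(texts, threshold=0.3):
    """Put each text in the first cluster whose first member is similar enough."""
    clusters = []  # each cluster is a list of indices into texts
    for i, text in enumerate(texts):
        for cluster in clusters:
            if jaccard(tokens(text), tokens(texts[cluster[0]])) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def propagate_labels(clusters, cluster_labels):
    """Copy one judgment per cluster to every tweet in that cluster."""
    return {i: label
            for cluster, label in zip(clusters, cluster_labels)
            for i in cluster}

# Invented keyword-matched tweets, for illustration only.
tweets = [
    "my phone battery is dead",
    "family found dead after flood waters rose",
    "ugh my phone battery is dead again",
    "two found dead after flood waters rose overnight",
]
clusters = greedy_cluster(tweets)
# One relevance judgment per cluster, as a domain expert might give it.
labels = propagate_labels(clusters, [False, True])
print(clusters)  # [[0, 2], [1, 3]]
print(labels)    # {0: False, 2: False, 1: True, 3: True}
```

Here two judgments label four tweets. The saving grows with cluster size, which is the intuition behind reviewing groups rather than individual messages.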
Despite the classifier’s overall good performance, they also observed that the system still missed several tweets that were relevant but used a different vocabulary than the predetermined keywords.
“Users can be very creative when discussing a particular type of event using the predefined keywords, so the classifier would have to be able to handle those types of tweets,” said Huang. “There’s room to further improve the system’s coverage.”
In the future, the researchers will look to explore how to extract information about the user’s location so first responders will know exactly where to dispatch their resources.
Other contributors to this research include Wenlin Yao, a doctoral student supervised by Huang in the computer science and engineering department; Dr. Ali Mostafavi and Cheng Zhang from the Zachry Department of Civil and Environmental Engineering; and Shiva Saravanan, who interned in the Natural Language Processing Lab at Texas A&M as a local high school student and is currently attending Princeton University.
This work is supported by funds from the National Science Foundation.