1.Statistical Semantic Classification of Crisis Information
Prashant Khare, Miriam Fernandez, Harith Alani
{prashant.khare, miriam.fernandez, h.alani}@open.ac.uk
Knowledge Media Institute, The Open University, UK
2.Motivation
People of NSW, be careful because there's fires spreading! Stay safe everyone!
Hundreds of volunteers in Mexico tried to unearth children they hoped were still alive beneath a school's ruins
Two trucks and one car in the water after a road collapse at Hwy 287 and Dillon. #cowx #boulderflood
CRISIS
Wildfire
Floods
Earthquake
3.Motivation
Challenges
A flood of data is generated. For example, on average over a million tweets were generated during Hurricane Harvey (2017).
Tweet traffic increased by 500% during the 2011 Japan earthquake.
It is almost impossible to absorb and process this sheer volume manually.
In addition, characteristics of social media posts such as short length, colloquialisms, and syntactic issues make the data harder to process.
4.Motivation
Relevant and Non-Relevant
5.Motivation
FEMA launched an initiative to use public social media data for situational awareness purposes [1].
[1] https://www.dhs.gov/sites/default/files/publications/privacy-pia-FEMA-OUSM-April2016.pdf
Image source – fema.gov
6.Previous Efforts - Identifying Crisis Related Information
ML Classification Methods:
Supervised Approaches: Often making use of n-grams, linguistic features, and/or statistical features of tweets.
Unsupervised Approaches: Keyword processing and clustering.
Semantic Models:
Representation of the information emerging from Crisis Events, providing faceted search of crisis related information.
7.Hypothesis and Aim
Hypothesis:
Semantics can establish a consistency within crisis relevant information and enhance the discriminative power of classifiers.
Complement the previous approaches by investigating the impact of semantic features in ML classification along with statistical features.
8.Method
Collect data from CrisisLex.org, a collection of crisis-oriented tweets.
Extract Statistical Features.
Semantic Enrichment of tweets via annotation using Babelfy API.
Expand the semantics by incorporating hypernyms through BabelNet.
Filter out the less informative and abstract features (using a hypernym hierarchy).
Classify using a Support Vector Machine (SVM).
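The annotate, expand, and filter steps above can be sketched roughly as follows. This is only an illustration: the lookup tables stand in for Babelfy annotations and BabelNet hypernyms, the depth values are hypothetical, and filtering by a depth range is an assumed reading of "using a hypernym hierarchy", not the authors' exact procedure.

```python
import re

# Toy stand-ins for Babelfy annotations and BabelNet hypernyms
# (assumed values for illustration, not real API output).
ANNOTATIONS = {"burn": "burn", "area": "area", "map": "map",
               "monday": "Monday", "night": "night"}
HYPERNYMS = {"burn": ["destroy"], "area": ["region"], "map": ["representation"],
             "Monday": ["weekday", "day_of_the_week"], "night": ["period"]}
# Hypothetical depth of each concept in the hypernym hierarchy.
DEPTH = {"burn": 5, "area": 4, "map": 6, "Monday": 9, "night": 6,
         "destroy": 4, "region": 3, "representation": 2,
         "weekday": 8, "day_of_the_week": 7, "period": 2}


def annotate(tweet):
    """Return the concepts found for the tokens of a tweet."""
    tokens = re.findall(r"[a-z_]+", tweet.lower())
    return [ANNOTATIONS[t] for t in tokens if t in ANNOTATIONS]


def expand(concepts):
    """Append the hypernyms of every concept, without duplicates."""
    expanded = list(concepts)
    for concept in concepts:
        for hypernym in HYPERNYMS.get(concept, []):
            if hypernym not in expanded:
                expanded.append(hypernym)
    return expanded


def filter_by_depth(concepts, min_depth=3, max_depth=7):
    """Keep only concepts whose hierarchy depth lies in [min_depth, max_depth]."""
    return [c for c in concepts if min_depth <= DEPTH.get(c, -1) <= max_depth]


tweet = "#HighParkFire burn area map as of Monday night 10 p.m."
features = filter_by_depth(expand(annotate(tweet)))
```

The resulting concept list would then be combined with the statistical features before training the classifier.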
9.Semantic Enrichment of Tweets
Let us consider this tweet from the Colorado Wildfire 2012:
#HighParkFire burn area map as of Monday night 10 p.m. http://t.co/1guBTcXX
Annotated Tokens: burn, area, map, Monday, night
Semantically enriched (augmented hypernyms):
burn: destroy
area: region
map: representation
Monday: weekday, day_of_the_week
night: period
Image source: http://babelnet.org/about
11.Semantic Features: Annotations, Expansion, & Filtering
BabelNet:
Extracted 4 million relations by iteratively querying for hypernyms.
Using a directed graph and betweenness centrality, ‘Entity’ (SynSet ID ‘bn:00031027n’) was found to be the most abstract concept.
Using shortest paths, the maximum depth of nodes was found to be 21.
Using Information Gain, most of the informative concepts were found between depths 3 and 7.
12.Semantic Features: Annotations, Expansion, & Filtering
Features plotted against Levels/Information Gain – Training Data for Colorado Wildfire Test case
13.Data
CrisisLexT26 (3206 tweets – 1667 related, 1539 not related)
9 events, predominantly in English
14.Features
Statistical Features (SF)
Number of Nouns, Verbs, Pronouns
Tweet Length
Number of Words, Hashtags
Readability: Gunning Fog Index, using average sentence length (ASL) and percentage of complex words (PCW): 0.4*(ASL + PCW)
Unigrams
Semantic Features (SemF)
Semantic Annotation Features (SemAF)
Semantic Expansion Features (SemEF)
Semantic Filtering Features (SemFF)
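A minimal sketch of a few of the statistical features, including the readability formula 0.4*(ASL + PCW). Two things here are assumptions rather than details from the slides: syllables are estimated by counting vowel groups, and a word counts as complex when it has three or more syllables. Part-of-speech counts are left out because they need a tagger.

```python
import re


def count_syllables(word):
    """Estimate syllables as the number of vowel groups (a crude heuristic)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def gunning_fog(text):
    """Compute 0.4 * (ASL + PCW) for a piece of text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    asl = len(words) / len(sentences)  # average sentence length
    complex_words = [w for w in words if count_syllables(w) >= 3]
    pcw = 100.0 * len(complex_words) / len(words)  # percentage of complex words
    return 0.4 * (asl + pcw)


def statistical_features(tweet):
    """Return a small dictionary of surface-level features for one tweet."""
    return {
        "length": len(tweet),
        "num_words": len(tweet.split()),
        "num_hashtags": len(re.findall(r"#\w+", tweet)),
        "fog": gunning_fog(tweet),
    }


example = statistical_features("Stay safe #nsw")
```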
15.Experiment
Classifier – Support Vector Machine with Linear Kernel
Designed two types of experiments:
Crisis Classification Model (train and test on all 9 crisis events)
Create following classifier models
SF
SF + SemAF
SF + SemAF + SemEF
SF + SemFF (statistical features and filtered semantic annotations, along with hypernyms)
Cross Crisis Classification (train on 8 crisis events and test on 9th crisis event)
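The cross-crisis setup can be pictured as a leave-one-event-out split. The sketch below assumes each sample is an (event, tweet, label) tuple; the event names and tweets are placeholders, not entries from the dataset.

```python
def leave_one_event_out(samples):
    """Yield (held_out_event, train, test), holding out one event at a time."""
    events = sorted({event for event, _, _ in samples})
    for held_out in events:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test


# Placeholder samples: (event, tweet text, label).
SAMPLES = [
    ("event_a", "tweet 1", "related"),
    ("event_a", "tweet 2", "not related"),
    ("event_b", "tweet 3", "related"),
    ("event_c", "tweet 4", "not related"),
]

splits = list(leave_one_event_out(SAMPLES))
```

With 9 events this produces 9 train/test pairs, each training on 8 events and testing on the remaining one.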
Statistical Features (SF), Semantic Annotation Features (SemAF), Semantic Expansion Features (SemEF), Semantic Filtering Features (SemFF)
16.Experiment
Crisis Classification (10-fold cross-validation)
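For reference, a 10-fold split can be sketched as below. It assumes the samples have already been shuffled and partitions their indices into contiguous folds; whether the original experiment used stratified folds is not stated, so none is applied here.

```python
def k_fold_indices(n_samples, k=10):
    """Return k (train_indices, test_indices) pairs covering all samples."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train_idx, test_idx))
        start += size
    return folds


folds = k_fold_indices(3206, k=10)
```

Each sample appears in exactly one test fold, and the classifier is trained on the other nine folds each time.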
17.Experiment
Cross-Crisis Classification
18.Results and Observations
Based on Information Gain, the numbers of hashtags, nouns, and pronouns are the most relevant statistical features.
Beyond the statistical features, semantic expansion surfaced concepts such as ‘Happening’ and ‘Event’ (hypernyms of the concepts ‘Incident’, ‘Fire’, ‘Crisis’, ‘Disaster’, and ‘Death’ in BabelNet) among the top 10 attributes.
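Information Gain for a discrete feature is the class entropy minus the weighted entropy remaining once the feature value is known. A minimal sketch, assuming features are discretised (for example, presence or absence of a concept):

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    result = 0.0
    for value in set(labels):
        p = labels.count(value) / n
        result -= p * log2(p)
    return result


def information_gain(feature_values, labels):
    """Entropy of labels minus the weighted entropy after splitting on the feature."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == value]
        gain -= len(subset) / n * entropy(subset)
    return gain


labels = ["related", "related", "not related", "not related"]
perfect = information_gain([1, 1, 0, 0], labels)  # feature separates the classes
useless = information_gain([1, 0, 1, 0], labels)  # feature carries no signal
```

Ranking features by this score is one way to obtain a top-10 attribute list of the kind described above.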
19.Results and Observations
Post A –“RT @LarimerCounty: #HighParkFire burn area map as of Monday night 10 p.m.”
Post B –“Colorado wildfires their worst in a decade http://t.co/RtfLmfds”
Post C – “RT @RedCross: Thanks to generosity of volunteer blood donors there is currently enough blood on the shelves to meet demand.#BostonMarathon”
Post A: Misclassified in SF but correctly classified in SF+SemAF. ‘Burn’ did not occur in the training data, but the annotation ‘Fire’ did.
Post B: Misclassified in SF+SemAF but correctly classified in SF+SemAF+SemEF. ‘Fire’, a high-IG feature, is a hypernym of the original annotation ‘wildfire’, which was not a ranked feature in the training data.
Post C: Misclassified in SF+SemAF+SemEF but correctly classified in SF+SemFF. ‘Thanks’ and ‘Meet’ expanded to abstract features with very low discriminative power, such as ‘Virtue’ and ‘Desire’. Excluding them raised the discriminative power of more informative concepts such as ‘Volunteer’ and ‘Benefactor’ (hypernym of ‘donor’).
20.Take Away
There is potential in mixing statistical and semantic features for classification.
The most noteworthy improvement is achieved when the hybrid model is used to classify entirely new data.
Semantic expansion can also result in noise.
Filtering can help in addressing the noise.
In future, we should extend the data to more crisis events (currently 5 types of events) and increase the sample size of each event.