
WARNING: Experimental Layer

Social Media Activity Analysis (SMAA) is a system that integrates Social Media analysis into Early Warning Systems. This integration allows the collection of Social Media data to be automatically triggered by flood risk warnings determined by a hydro-meteorological model.

This layer is available in both the EFAS and GloFAS systems; there are only slight differences in the formatting of the feature information. The data are exactly the same on both platforms.


Data Flow


Description of Tasks

Step.1. A data collection is triggered on the basis of weather forecasts. The data are collected from the Twitter public streaming API (if query results exceed 1% of all tweets, rate limitations apply). The data collection is limited to a set of administrative areas, determined by the intersection of RRA areas (flood probability within 48 hours with high likelihood) or EFI areas (accumulated rainfall, 48-hour threshold > 0.5) with administrative areas (GADM, including NUTS for the EU, as defined for all products in EFAS and GloFAS). The collection runs from the trigger until 2 days after the peak (or, for EFI, until 2 days after no extreme rain is foreseen).
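
The area selection in Step 1 could be implemented roughly as in the sketch below. It is a minimal illustration, assuming hypothetical input shapefiles (rra_warnings.shp, efi_warnings.shp, admin_areas.shp) and column names (likelihood, efi_48h), not the operational schema.

import geopandas as gpd
from shapely.ops import unary_union

def triggered_admin_areas(rra_path, efi_path, admin_path):
    """Return the administrative areas for which a data collection is triggered."""
    rra = gpd.read_file(rra_path)      # RRA polygons: flood probability within 48 h
    efi = gpd.read_file(efi_path)      # EFI polygons: accumulated rainfall, 48 h
    admin = gpd.read_file(admin_path)  # GADM administrative areas (NUTS for the EU)

    # Keep only warning polygons passing the thresholds described in Step 1
    # (column names are assumptions for illustration).
    rra = rra[rra["likelihood"] == "high"]
    efi = efi[efi["efi_48h"] > 0.5]

    # Merge all triggering geometries and select the intersecting admin areas.
    warning_union = unary_union(list(rra.geometry) + list(efi.geometry))
    return admin[admin.intersects(warning_union)]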

Step.2. The likelihood that each tweet's text is about a flood is estimated with machine-learning models, on a scale from 0 to 1. For some languages in-house trained models already exist (EN, ES, FR, DE, IT, RO, AR, PT); for others, model training is ongoing (TH, TA, ID, JP, RU); for the remaining languages, a multilingual classifier based on Facebook LASER embeddings will be used.
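
A possible shape of the per-language dispatch in Step 2 is sketched below. The FloodClassifier interface and its predict() method are assumptions for illustration, not the actual SMAA model API.

from typing import Dict, Protocol

# Languages with in-house trained models, as listed in Step 2.
IN_HOUSE_LANGS = {"en", "es", "fr", "de", "it", "ro", "ar", "pt"}

class FloodClassifier(Protocol):
    def predict(self, text: str) -> float: ...   # flood relevance probability in [0, 1]

def flood_relevance(text: str, lang: str,
                    in_house: Dict[str, FloodClassifier],
                    laser_fallback: FloodClassifier) -> float:
    """Use a language-specific model where available, otherwise the LASER-based fallback."""
    model = in_house.get(lang.lower()) if lang.lower() in IN_HOUSE_LANGS else None
    return model.predict(text) if model is not None else laser_fallback.predict(text)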

Step.3. Each tweet is geotagged using either a location mentioned in its text or the coordinates of the tweet itself. If the extracted location is outside the administrative areas for which the collection was triggered, the tweet is discarded from further analysis.
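
The spatial filter in Step 3 amounts to a point-in-polygon test. A minimal sketch with shapely, assuming the triggered areas are available as shapely polygons:

from shapely.geometry import Point

def within_triggered_areas(lon: float, lat: float, triggered_areas) -> bool:
    """Return True if the extracted tweet location falls inside any triggered area."""
    point = Point(lon, lat)
    return any(area.contains(point) for area in triggered_areas)

# Tweets for which this returns False are discarded from further analysis.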

Step.4. For each area, geotagged tweets are grouped into 3 categories: low (flood relevance probability < 0.2), mid (0.2–0.8) and high (> 0.8).
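
The bucketing in Step 4 follows directly from the thresholds above; a minimal sketch:

def bucket(flood_prob: float) -> str:
    """Map a flood relevance probability to the low / mid / high bucket."""
    if flood_prob < 0.2:
        return "low"
    if flood_prob <= 0.8:
        return "mid"
    return "high"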

Step.5. According to the ratio between the number of tweets in the 'mid' and 'high' buckets (with a mandatory minimum of more than 10 tweets in any case), the administrative area is 'coloured' grey (low: ratio > 1), orange (high > 5 × mid) or red (high > 9 × mid).
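
A sketch of the colouring rule in Step 5, assuming the minimum of more than 10 tweets applies to the combined mid and high counts and that areas below the orange threshold are coloured grey; the operational thresholds may differ.

from typing import Optional

def colour_area(mid: int, high: int) -> Optional[str]:
    """Colour an administrative area from its mid/high tweet counts."""
    if mid + high <= 10:        # mandatory minimum: more than 10 tweets
        return None             # not enough activity to colour the area
    if high > 9 * mid:
        return "red"
    if high > 5 * mid:
        return "orange"
    return "grey"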

Step.6. For regions coloured red or orange, a script extracts the 5 most representative tweets. The algorithm sorts tweets by an index calculated as the product of 3 indices (multiplicity Idx1, centrality Idx2, flood relevance probability Idx3), to ensure we extract tweets that are relevant (Idx3) and strongly represented (Idx1), but diverse enough from one another (Idx2). Heuristics might change.
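
The ranking in Step 6 can be sketched as follows, assuming each tweet record already carries the three indices; the field names are illustrative and, as noted above, the heuristics may change.

def representative_tweets(tweets, n=5):
    """Return the n tweets with the highest Idx1 * Idx2 * Idx3 score."""
    score = lambda t: t["idx1"] * t["idx2"] * t["idx3"]   # multiplicity * centrality * relevance
    return sorted(tweets, key=score, reverse=True)[:n]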

Step.7. Every hour, a shapefile with the polygons representing the areas and a JSON file with the most representative tweets are sent to the GloFAS interface, where they are processed for display.
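
The hourly export in Step 7 could look roughly like the sketch below; the file names and the JSON layout are assumptions, not the actual interface contract with the GloFAS map viewer.

import json

def export_hourly(areas, representative, out_dir="."):
    """Write the coloured polygons (shapefile) and the representative tweets (JSON)."""
    areas.to_file(f"{out_dir}/smaa_areas.shp")          # `areas` is a geopandas GeoDataFrame
    with open(f"{out_dir}/smaa_tweets.json", "w", encoding="utf-8") as fh:
        json.dump(representative, fh, ensure_ascii=False, indent=2)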

Step.8. The collections are switched ON or OFF each time a new EFAS/GloFAS forecast becomes available. New collections are created, and existing collections are either switched OFF or extended in time according to the new forecasts.
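
A minimal sketch of the collection life-cycle in Step 8, assuming a simple registry mapping each administrative area to the end time of its collection; the operational bookkeeping is more involved.

def update_collections(active: dict, triggered: dict) -> dict:
    """Reconcile running collections with the areas triggered by the new forecast.

    `active` and `triggered` both map an area id to a collection end time.
    """
    for area_id, end in triggered.items():
        # Create new collections, or extend existing ones in time.
        active[area_id] = max(active.get(area_id, end), end)
    for area_id in list(active):
        if area_id not in triggered:
            del active[area_id]     # switch OFF collections that are no longer triggered
    return active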


N.B. Please note that you may find some tweets that are FALSE positives, describing an impact or something loosely related (e.g. a loss → lost → lost cat / lost person, or a description of war victims, etc.). False positives are acceptable (the model never got a discharge wrong???); the open issue is how to prevent such tweets from being selected as the representative ones (without using stopwords).


