French presidential election

Datasets Description

An extract from the Twitter stream about the French election
# Context

This dataset grew out of a test of the Twitter streaming API, used to filter and collect tweets on a specific topic, in this case the [French election][1]. The script used for the data collection is available in this [GitHub repository][2].

Since 18 March, the [French election][3] has entered the final stretch before the first round of voting on 23 April 2017. The candidates are:

- M. Nicolas DUPONT-AIGNAN
- Mme Marine LE PEN
- M. Emmanuel MACRON
- M. Benoît HAMON
- Mme Nathalie ARTHAUD
- M. Philippe POUTOU
- M. Jacques CHEMINADE
- M. Jean LASSALLE
- M. Jean-Luc MÉLENCHON
- M. François ASSELINEAU
- M. François FILLON

The idea was to collect data from the Twitter API periodically. The acquisition process evolved as follows:

- Versions 1, 2 and 3: every hour, a Python script listens to the Twitter API stream for 10 minutes, over a period of 3 weeks.
- Version 4+: the new versions are based on a new data structure and start after the French Constitutional Council validated the list of candidates on 18 March 2017. The data are stored in a dbsqlite file and will be updated as often as I can (at least every week).

For versions 7+, the first results from the various notebooks suggest that 5 candidates (Macron, Fillon, Le Pen, Mélenchon, Hamon) are more present on Twitter (in terms of mentions), so for these candidates we will add the [Google Trends][4] analytics for web searches in France. These data will be a compilation of the interest over the past 7 days (with an hourly timestamp) and over the past 30 days (with a daily timestamp). For both scales we will store:

- The interest over time (numbers represent search interest relative to the highest point on the chart for the given region and time. A value of 100 is the peak popularity for the term. A value of 50 means that the term is half as popular. Likewise, a score of 0 means the term was less than 1% as popular as the peak).
- The interest by subregion (shows which term ranked highest in each region during the specified time frame. Values are scaled from 0 to 100, where 100 is the region with peak popularity, a value of 50 is a region where the term is half as popular, and a value of 0 means the term was less than 1% as popular as the peak).

# Content

In this dbsqlite file, you will find a data table that contains, for every row:

===============Common===============

- the index of the line
- the language of the tweet
- for each candidate: mention_candidatename, whether the candidate or their associated account was mentioned (0 or 1)
- the tweet
- the timestamp in milliseconds

===============Version 4+===============

- the day
- the hour (London timezone)
- the username of the user who posted the tweet
- the user's location (as given in their profile)
- whether the tweet is a retweet or a quote (0 or 1)
- the username that was retweeted
- the original tweet (the one retweeted or quoted)

# Acknowledgements

This election is gonna be intense.

# Inspiration

The first version of the dataset was just a test to collect the data and see the first pieces of work the community could do with it. The new versions are (I think and hope) suited to deep text analytics.

[1]:,_2017
[2]:
[3]:,_2017
[4]:,%2Fm%2F0fqmlm,%2Fm%2F02rdgs,%2Fm%2F04zzm99,%2Fm%2F0551nw
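
# Example

As a rough illustration of how the data table described under Content might be queried, here is a minimal Python sketch that counts mentions per candidate by summing the 0/1 mention flags. Everything concrete in it is an assumption: the table name `data`, the column names (`lang`, `tweet`, `timestamp_ms`, `mention_macron`, and so on) and the three placeholder rows are invented for the demo, and an in-memory database stands in for the real dbsqlite file. Check the actual schema before reusing any of it.

```python
import sqlite3

# Hypothetical candidate suffixes; the real column names in the file may differ.
CANDIDATES = ["macron", "fillon", "lepen"]


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the dbsqlite file (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE data (
            "index" INTEGER PRIMARY KEY,
            lang TEXT,
            tweet TEXT,
            timestamp_ms INTEGER,
            mention_macron INTEGER,
            mention_fillon INTEGER,
            mention_lepen INTEGER
        )
        """
    )
    # Placeholder rows for illustration only, not taken from the dataset.
    rows = [
        (0, "fr", "placeholder tweet A", 1490000000000, 1, 0, 0),
        (1, "fr", "placeholder tweet B", 1490000060000, 1, 1, 0),
        (2, "en", "placeholder tweet C", 1490000120000, 0, 0, 1),
    ]
    conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    return conn


def mention_counts(conn: sqlite3.Connection, candidates: list[str]) -> dict[str, int]:
    """Sum each 0/1 mention flag to get the number of tweets per candidate."""
    # Column names come from a fixed list in this script, not from user input.
    sums = ", ".join(f"SUM(mention_{name})" for name in candidates)
    row = conn.execute(f"SELECT {sums} FROM data").fetchone()
    return dict(zip(candidates, row))


conn = build_demo_db()
counts = mention_counts(conn, CANDIDATES)
print(counts)  # {'macron': 2, 'fillon': 1, 'lepen': 1}
```

To try this against the real file, replace `build_demo_db()` with `sqlite3.connect(...)` pointing at the downloaded file, and run `PRAGMA table_info(data)` first to see the actual column names.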