I completed data scraping of the contents of a social media site and performed a sentiment analysis regarding two new privacy laws in the spring of 2016.
Duration: 3 weeks
Methods Employed: Visual Analytics, Data Scrapping, Data Visualization
Tweets concerning two privacy laws from spring 2016 were collected using Tweepy through the Twitter API.
The Rule 41 tweets were collected between April 30th, 2016 and May 2nd, 2016. After cleaning out anything not in English or directly relevant, that left me with 569 tweets for that dataset. I left Retweets in the dataset, because I was looking at the frequency of opinions rather than only original tweets. The assumption was that tweets that are seen more than once are shared because the person agrees with the sentiment. The Evan’s law’s tweets are from April 13, 2016, and there were 332 usable tweets from that dataset. The server the data mining was being conducted on had an error April 14-15 while it was running unattended, and later data collected was not relevant enough to add to the dataset. As a niche area of interest, there was not a great quantity of data available, so handled it more akin to qualitative data.
“Textalyzers”: Evan’s law is a New York legislation lobbied for by the group Distracted Operators Risk Casualties. In 2014 the Supreme Court determined that police require a warrant to search cellular phones. What Evan’s Law, also known as NY A8613A or 6325A, would allow is for police officers to field test cellular devices following a motor vehicle accident or collision, to determine if the driver is in violation of any driving laws. While on the surface, this may seem fairly benign, the bill itself has many clauses in it that cause concern, such as that consent is given for this testing by merely operating a motor vehicle in the state of New York. If the driver refuses to allow the police officer access to their phone, their license will immediately be suspended.
Rule 41: As a result of the Playpen case the FBI was handling recently, where they were trying to catch a child pornography ring, the Supreme Court approved an amendment to rule 41 dealing with search and seizure. With these amendments that go into effect December 1, 2016, the FBI will be able to request for a judge “to issue a warrant to use remote access to search electronic storage media” if “the district where the media or information is located has been concealed through technological means” or “the media are protected computers that have been damaged without authorization and are located in five or more districts”. Also contrary to public image of federal surveillance, in order to obtain “a warrant to use remote access to search electronic storage media and seize or copy electronically stored information, the officer must make reasonable efforts to serve a copy of the warrant and receipt on the person whose property was searched or who possessed the information that was seized or copied. Service may be accomplished by any means, including electronic means”.
One of the more effective methods I used on the data set was gathering statistics about the data set through wordcounter.net, which allows you to search for trends in your dataset of one word as well as short phrases, depending on the user’s preference. Popular keywords used in the tweets about Rule 41, from the 6,937 words the text was cleaned down to, are 163 mentions of “hacking powers”, 52 mentions of “scary new route”, 24 mentions of “weird complicated bureaucratic” and 31 mentions of “government hacking”. “Watchdogs” was also used 49 times and “dystopian” was used 40 times.
Specific fears iterated in the data set were 7 tweets regarding concern with this giving the FBI the ability to hack innocent botnet victims. 15 tweets were concerned with the implications this has for customers storing sensitive information in the cloud. 6 tweets consider this the largest expansion of “search and seizure” laws in history.
For the Evan’s law data set, the data set was comprised of many more retweets. The most common being “Distracted driving clampdown: Cops could get ‘textalyzers‘” with 66 retweets. “People will absolutely hate this new tech, but it might save their lives: DORCs made it happen.” Was the second most common retweet with 16. Then there were a few individuals expressing concern such as “But if it’s mandatory where a breathalyzer isn’t????”.

The tool I ended up using was from the University of Arizona, and can be found here: http://wordcloud.cs.arizona.edu/index.html the option selected is “Sentiment Green- Red”. The more red a word is the higher value it has in negative sentiment, and the more green it is, the higher value in positive sentiment. Also the size represents it’s prevalence in the dataset. There were also many options to sort by emotional sentiments, though I found those to be less useful for my purposes with this dataset. The above image shows the top 50 words from the data set, and the bottom shows the top 150.

The first sentiment analysis tool I utilized with this dataset was Strider on April 30, 2016, with the initial 229 tweets about Rule 41, and the 332 tweets about Textalyzers. The Rule 41 tweets were given a sentiment score of -3.8, and Evan’s Law a sentiment score of -1.9. This lead me to believe there might be a statistical difference once I acquired more data. However, once I had acquired the full data set, Rule 41 got boosted up to a 2.4, or positive emotion. It seems unlikely any human that read the tweets would agree with that.
Python NLTK Text Classification didn’t fare much better. Jigsaw did not particularly work well with any of the three formats I managed to create of my datasets. So Jason Davies and I had a reunion, and I made word clouds. I knew I could not complete Data Visualization and Visual Analytics, and submit word clouds as my final project. So I went digging about academic papers for visualization techniques that might lend themselves well to these types of datasets and purpose.
A tool I found incredibly fascinating to explore this data set with (though not in class on a small screen) is Textexture. It was able to utilize the effectiveness of word counts in so far as capturing semantics, but provided much more information and context than a word cloud. This tool allows for the user to view a network of related words, and drill down to specific pathways. Furthermore the text is displayed on the side, and is reactive to these queries. With such a large dataset, the initial web is rather messy and a bit difficult to navigate, but after the user drills down to relevant connections (a few are suggested as filters at the bottom) can be quite useful to view trends. Some of the Rule 41 tweets are uploaded to this link here: http://textexture.com/index.php?text_id=74063

There were many limitations to this study, both with the dataset itself and the analytic tools. None of the sentiment analysis tools used were able to pick up the literary and cultural references the dataset was laden with. Additionally, the tool (and most ethical ones that could be used) only allowed for the collection of public tweets. Given that the topic is privacy, that gives a skewed sample for examining privacy sentiments. Also, the type of person who is writing these tweets, that the text analytic tools rated as college graduate level, are not representative of the public at large.
Some questions this data set didn’t answer were: what caused there to be less negative tweets about Evan’s law than Rule 41? Both certainly had negative opinions, but the Evan’s laws tweets never got to the “ALRIGHT, WOULD SOMEONE *PLEASE* MURDER WHOEVER IS BEHIND THIS RULE 41 FUCKSHIT? #LastRT You’ll be a hero to the human race.” Level that Rule 41 did, and instead stayed a lot more balanced or undecided. Is this because one was a federal law and the other a state law? Is it because of the stigma the word “hacking” and misconceptions about it have with the general population? Does it have to do with the EFF backing of the concerns of Rule 41? Are people already used to giving up personal information to the cops when they are pulled over, so this isn’t as far of a stretch? Is it because the consequences of texting and driving are more tangible? The size and scope of this project doesn’t extend far enough to answer any of those questions. More data would need to be gathered from more sources about both of these laws, as well as other historic privacy laws.
The fact that text analytics and content analysis lends to the assumption that these individuals are educated members of the populace. Yet there still seems to be misconceptions at large regarding these tweets. I’m not sure if it is because of the influences of personal biases, or influence of another with strong biases, or the fact that legal documents themselves can be a bit intimidating and difficult to decipher. But the data from this project got me thinking about how Visual Analytics tools could be helpful in assisting educated individuals to have a greater understanding of legal documents. I saw a few academic studies that have been conducted since the 1970s, but those have primarily been focused at changing the language used in legal documents. But as discussed on many instances in Human Computer Interactions classes, people don’t tend to read paragraphs of text. Legal documents tend to be very long and tedious, even if the language was simple, many people still wouldn’t read them due to lack of time or interest. Is there a way to use Visual Analytics and Gamification to help educate people about Laws, outside of the US Government course in High School that is likely a relic in most adult’s minds? So much of the field seems to be creating tools for law enforcement to catch people in violation of laws. But could there be some use for it in law education as well?
Sample Tweets, chosen for variety to quickly demonstrate the contents of the dataset.
Rule 41-
- “ALRIGHT, WOULD SOMEONE *PLEASE* MURDER WHOEVER IS BEHIND THIS RULE 41 FUCKSHIT? #LastRT You’ll be a hero to the human race.”
- “FBI Hacking. Rule 41: One Rule to Hack Them All”
- “RT @mattblaze: If ever there was a job for the Internet Outrage Machine, the Rule 41 changes are it. Allows hacking botnet VICTIMS. “
- “Federal Rules of Criminal Procedure #Rule41 means malware victims may also be #hacked by the FBI too. #digitalrights”
- “Frightening expanse of snooping powers in the land of the free”
- “With public opinion turning against crypto backdoors, the government wants to engage in even more expansive hacking”
- What does Rule 34 look like for Rule 41?
- “When #Anonymous says “hack the planet” we’re not suggesting the FBI should do it!”
- “People both inside and outside of the United States should be equally concerned about this proposal.”
- “Not good at all! America founded on distrust of Govt (king). It’s what USA Constitution is all about. Giving it up? “
- “The proposed Rule 41 amendment is potentially the largest expansion of the G’s search & seizure power in our history.”
- “Broad increase of government hacking abilities proposed to Rule 41 of Federal Rules of Criminal Procedure #Orwellian”
- “SCOTUS Approves Broader Hacking Powers For FBI. What Could Go Wrong?”
- “This just gets scarier and scarier”
- “If passed Rule 41 cont.w/blanket warrant hacks every1 is suspect til proven innocent.”
- “Supreme Court Approves Rule 41 Changes, Putting FBI Closer To Searching Any Computer Anywhere With A Single Warrant”
- “Change to Rule 41 could make consumers feel less comfortable storing sensitive data on cloud services like Google’s”
- “I do not support Rule 41 and you should not support it either.”
- “With Rule 41, Little-Known Committee Proposes to Grant New Hacking Powers to the Government”
- “Rule 41 change nets Supreme Court thumbs-up, but where’s it going next?”
- “Geekity: Rule 41: For Child Porn on the Net, More Government Tyranny”
- “Fundamental changes in rules wrt US government hacking powers requiring legislation disguised as procedural changes”

Leave a comment