Abstract
Video watching has emerged as one of the most frequent media activities on the Internet. Yet, little is known about how users watch online video. Using two distinct YouTube datasets, a set of random YouTube videos crawled from the Web and a set of videos watched by participants tracked through a YouTube developer app, this paper examines whether and how indicators of collective preferences and reactions are associated with the view duration of videos. We show that video view duration is positively associated with the video's view count, the number of likes per view, and the negative sentiment in the comments. These metrics and reactions have significant power to predict how long individuals watch a video. Our findings provide a more precise understanding of user engagement with video content in social media, beyond view count.
Introduction
Video watching is perhaps the most popular web-based activity, carried out through video hosting and sharing services such as YouTube, Facebook, Netflix, Vimeo, and others. As of 2015, YouTube alone had more than 1 billion viewers every day, watching hundreds of millions of hours of content, and video is forecast to represent 80 percent of all Internet traffic by 2019 (Cisco 2015). Yet, little is known about how users engage with and watch online video. We use two distinct datasets from YouTube to investigate how users' engagement in watching a video (i.e., view duration) is associated with other video metrics such as the number of views, likes, and comments, and the sentiment of comments.

A number of research efforts have investigated view count as a key indicator of the popularity or quality of a video, particularly looking at its relationships with other popularity or preference metrics (e.g., the number of likes and comments). For example, the number of comments, favourites, and ratings and the average rating are significant predictors of video view counts on YouTube; the sequence and structure of comments are strongly associated with view counts; and view counts can be predicted from socially shared viewing behaviours around the content, such as how many times a video was rewound or fast-forwarded and the duration of the session, in a tool that allows people to watch videos together in sync and in real time.

Although views, likes, comments, and other such measures can be considered indicators of general popularity and preferences, there has been growing interest in using deeper post-click user engagement (e.g., how long a user watched a video) to estimate relevance and interest more accurately and to improve ranking and recommendation. For example, YouTube has started to use dwell time (the length of time a user spends on a video, e.g., the video watching session length) instead of click events to better measure engagement with video content. Beyond video, Facebook uses dwell time on external links to combat clickbait: stories with arousing headlines that attract users to click and share more than usual, but that are not consumed in depth (El-Arini and Tang 2014).
Data Collection
For the Individual Logs dataset, our view duration dependent variable was computed differently. In this case, we used an individual, but approximate, view duration measurement: the user's dwell time on each video's page, as measured by the extension, served as an approximation of the actual view time for the video by that user.

We collected the data by automating queries and keyword-based searches to gather videos and their corresponding comments. Python scripts using the YouTube APIs were used to extract information about each video (comments and their timestamps). We collected up to 1,000 comments per video (YouTube allows a maximum of 1,000 comments per video to be accessed through the APIs) and used keywords such as 'Federer', 'Nadal', 'Obama', etc., to collect the data for specific topics. The timestamp and author name of each comment were also collected. The final dataset used for the sentiment analysis had more than 3,000 videos and more than 7 million comments.

We performed pre-processing on the collected comments. YouTube comments are written in several languages, depending on the demography of the commenters. To simplify the sentiment analysis, we modified the data collection scripts to collect only English comments. From the collected English comments, only comments in the standard UTF-8 encoding were selected, in order to remove comments with unwanted characters.

The steps below explain the procedure for collecting the comments, with their respective timestamps and author names, for the keywords specified by the user; code sketches of the collection pipeline and the sentiment pre-processing follow the list. In steps 2-4, the Google APIs for YouTube are used to configure the query with the number of videos to be fetched, the language of interest for comments, the search keyword, and how the comments are to be sorted. Step 5 collects the IDs of the videos related to the specified keyword. Steps 6 and 7 collect the comments associated with these videos and extract the timestamps, author names, and comment text from the comment entries. All the comments for a single keyword are aggregated into one dataset, which is used as the test set:
- Step 1: Prompt the user to specify the search keyword (keywords) and number of videos (numVideos)
- Step 2: Set maxNumVideos = min(50, numVideos) per query (Google limits the maximum number of videos fetched in one iteration to 50)
- Step 3: Set up the YouTube client to use the YouTubeService() API to communicate with the YouTube servers
- Step 4: Use the YouTubeVideoQuery() API to set the query parameters such as language, search keyword, etc.
- Step 5: Perform successive queries to get the videoID of each video related to the keyword
- Step 6: Collect the comments associated with each videoID using the GetYouTubeVideoCommentFeed() API (maximum limit of comments per video is 1000)
- Step 7: Extract the comments with their respective timestamps and author names
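The client calls named in the steps (YouTubeService(), YouTubeVideoQuery(), GetYouTubeVideoCommentFeed()) belong to the old gdata client library, which has since been deprecated. Below is a minimal sketch of Steps 1-7 written against the current YouTube Data API v3 via google-api-python-client; the API key, file names, and helper names are illustrative assumptions, not the paper's original scripts.

```python
# Sketch of the comment-collection pipeline (Steps 1-7) using the YouTube Data API v3.
# API_KEY is a placeholder; obtain a real key from the Google Cloud Console.
from googleapiclient.discovery import build

API_KEY = "YOUR_API_KEY"
youtube = build("youtube", "v3", developerKey=API_KEY)  # Step 3: set up the client

def search_video_ids(keyword, num_videos):
    """Steps 1-5: fetch up to num_videos video IDs matching the keyword."""
    video_ids, page_token = [], None
    while len(video_ids) < num_videos:
        batch = min(50, num_videos - len(video_ids))  # Step 2: API caps one page at 50
        resp = youtube.search().list(
            q=keyword, part="id", type="video",          # Step 4: query parameters
            maxResults=batch, pageToken=page_token,
            relevanceLanguage="en",                      # prefer English results
        ).execute()
        video_ids += [item["id"]["videoId"] for item in resp.get("items", [])]
        page_token = resp.get("nextPageToken")
        if not page_token:
            break
    return video_ids

def fetch_comments(video_id, limit=1000):
    """Steps 6-7: collect (timestamp, author, text) triples for one video."""
    comments, page_token = [], None
    while len(comments) < limit:
        resp = youtube.commentThreads().list(            # raises HttpError if comments
            part="snippet", videoId=video_id,            # are disabled for the video
            maxResults=100, textFormat="plainText",
            pageToken=page_token,
        ).execute()
        for item in resp.get("items", []):
            s = item["snippet"]["topLevelComment"]["snippet"]
            comments.append((s["publishedAt"], s["authorDisplayName"], s["textDisplay"]))
        page_token = resp.get("nextPageToken")
        if not page_token:
            break
    return comments[:limit]
```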
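For the pre-processing described above, a simple filter plus the VADER analyzer (Hutto and Gilbert 2014), the sentiment model cited in the references, would look roughly as follows. The exact "standard UTF-8" rule is not specified in the paper, so a strict ASCII round-trip stands in for it here as an assumed heuristic.

```python
# Hedged sketch of the comment pre-processing and sentiment scoring.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def clean_english(comments):
    # Keep (timestamp, author, text) triples whose text survives a strict ASCII
    # round-trip; an assumed stand-in for the paper's unwanted-character filter.
    return [c for c in comments
            if c[2] == c[2].encode("ascii", "ignore").decode("ascii")]

def sentiment_scores(comments):
    # VADER's compound score ranges from -1 (most negative) to +1 (most positive).
    return [analyzer.polarity_scores(text)["compound"] for _, _, text in comments]
```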
Over the 105 days of observation, a given video trended for at most 14 days, and 604 videos appeared in the YouTube trending video list only once. Another noteworthy point concerns the correlations: the correlation between likes and comment count is 0.71, while the correlation between dislikes and comment count is 0.83. This suggests that more people get involved in the conversation when they dislike a video than when they like it; in many of these cases, the video may be controversial, fake news, etc.
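As a minimal sketch, the correlation check above can be reproduced with pandas, assuming the trending-video snapshots sit in a CSV with 'likes', 'dislikes', and 'comment_count' columns (the file name and column names are illustrative, not from the paper).

```python
# Pearson correlations between engagement metrics on the trending dataset.
import pandas as pd

trending = pd.read_csv("trending_videos.csv")  # hypothetical file name
print(trending["likes"].corr(trending["comment_count"]))     # reported: 0.71
print(trending["dislikes"].corr(trending["comment_count"]))  # reported: 0.83
```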
The plot shows a very strong relationship between views and likes, with a correlation of 0.82. Because log10 is applied to the x-axis and a few videos in the YouTube trending list have 0 likes, we pass the variable (likes + 1) instead of likes into the scale_x_log10() function. This avoids undefined values, since log10(0) is negative infinity; as a result, a value of 1 on the x-axis of the plot represents 0 likes. There are many outliers on the y-axis at x = 1; many of those video authors may have disabled ratings on their videos, so users cannot like or dislike them. Another observation is that beyond 10^4 = 10,000 likes, the variance of likes decreases as views increase.
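A minimal sketch of this plot in Python follows, reusing the assumed trending CSV from the previous snippet. Plotting likes + 1 on a base-10 log axis mirrors the scale_x_log10() workaround: zero-like videos land at x = 1 instead of producing infinite values.

```python
# Views vs. likes on a log10 x-axis, with the (likes + 1) shift for zero-like videos.
import pandas as pd
import matplotlib.pyplot as plt

trending = pd.read_csv("trending_videos.csv")  # hypothetical file name
plt.scatter(trending["likes"] + 1, trending["views"], s=5, alpha=0.3)
plt.xscale("log")                              # base-10 log scale on the x-axis
plt.xlabel("likes + 1 (log10 scale)")
plt.ylabel("views")
plt.title("Views vs. likes for trending videos")
plt.show()
```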
References
- Alhabash, S.; Baek, J.-h.; Cunningham, C.; and Hagerstrom, A. 2015. To Comment or Not to Comment?: How Virality, Arousal Level, and Commenting Behavior on YouTube Videos Affect Civic Behavioral Intentions. Computers in Human Behavior.
- Arapakis, I.; Lalmas, M.; Cambazoglu, B. B.; Marcos, M.-C.; and Jose, J. M. 2014. User Engagement in Online News: Under the Scope of Sentiment, Interest, Affect, and Gaze. Journal of the Association for Information Science and Technology.
- Baym, N. K. 2013. Data Not Seen: The Uses and Shortcomings of Social Media Metrics. First Monday.
- Berger, J., and Milkman, K. L. 2012. What Makes Online Content Viral? Journal of Marketing Research.
- Chatzopoulou, G.; Sheng, C.; and Faloutsos, M. 2010. A First Step Towards Understanding Popularity in YouTube. In Proc. of INFOCOM.
- Cisco. 2015. Global IP Traffic Forecast. http://www.cisco.com/c/en/us/solutions/service-provider/visual-networking-index-vni/index.html.
- Cramer, H. 2015. Effects of Ad Quality & Content-Relevance on Perceived Content Quality. In Proc. of CHI.
- De Choudhury, M.; Sundaram, H.; John, A.; and Seligmann, D. D. 2009. What Makes Conversations Interesting?: Themes, Participants and Consequences of Conversations in Online Social Media. In Proc. of WWW.
- El-Arini, K., and Tang, J. 2014. News Feed FYI: Click-baiting. http://newsroom.fb.com/news/2014/08/news-feed-fyi-click-baiting.
- Hutto, C., and Gilbert, E. 2014. VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. In Proc. of ICWSM.