Characterizing Web-based Video Sharing Workloads
S. Mitra, M. Agrawal, A. Yadav, N. Carlsson, D. Eager, and A. Mahanti,
ACM Transactions on the Web, Vol. 5, No. 2 (May 2011), pp. 8:1-8:27.
Video sharing services that allow ordinary Web
users to upload video clips of their choice and watch video clips
uploaded by others have recently become very popular.
This paper identifies invariants in
video sharing workloads, through comparison of the
workload characteristics of four popular video sharing services.
Our traces contain meta-data on approximately 1.8 million
videos which together have been viewed approximately 6 billion times.
Using these traces, we study the similarities and differences in use
of several Web 2.0 features such as ratings, comments, favorites,
and propensity of uploading content.
In general, we find that active contribution, such as video uploading
and rating of videos, is much less prevalent than passive use.
While uploaders in general are skewed with respect to the number
of videos they upload, the fraction of multi-time uploaders is
found to differ by a factor of two between two of the sites.
The distributions of life-time measures of video popularity are found
to have heavy-tailed forms that are similar across the four sites.
Finally, we consider implications for system design of the
identified invariants. To gain further insight into caching in
video sharing systems, and the relevance to caching of life-time
popularity measures, we gathered an additional data set tracking
views to a set of approximately 1.3 million videos from one of the
services, over a twelve week period.
We find that life-time popularity measures have some relevance for
large cache (hot set) sizes (i.e., a hot set defined according to one
of these measures is indeed relatively "hot"), but that this
relevance substantially decreases as cache size decreases, owing to
churn in video popularity.
Datasets
The datasets used in our paper are made available here for use by
the wider research community. The datasets consist of publicly available
meta-data associated with videos from the Dailymotion,
Veoh, Metacafe,
and Yahoo! Video Web sites.
If you use our datasets in your research, please drop Siddharth Mitra
a line at "sidmitra DOT del AT gmail dot com", and include a
reference to our paper in your work.
Dailymotion Music Category (collected 22 March, 2008; 1,194,186 videos, cf. Section 4.1.1)
- Download Dailymotion data file.
- Format: VID | SID | VIEWS | AGE | SRCPGNP | RATING | RATINGCOUNT | COMMENTS | DURATION | FAVORITED
VID is the video identifier,
SID is the source identifier,
VIEWS is the number of views to the video,
AGE is the number of minutes since the video was uploaded,
RATING is the average rating assigned to the video,
RATINGCOUNT is the number of times the video was rated,
COMMENTS is the number of comments made to the video,
DURATION is the playback length of the video in units of seconds, and
FAVORITED is the number of times the video was marked as a favourite.
Note that SRCPGNP is a crawler-generated field identifying the
page from which the data was collected (i.e., it is not video meta-data).
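As a rough illustration of reading this format, the sketch below splits each record on the "|" separator and maps the values to the field names listed above. It assumes one record per line and that fields may be padded with spaces; neither detail is confirmed here, so check a few lines of the downloaded file first. The function name parse_records and the sample values are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Sequence

# Column order copied from the Format line of the Dailymotion data set.
DAILYMOTION_FIELDS = (
    "VID", "SID", "VIEWS", "AGE", "SRCPGNP", "RATING",
    "RATINGCOUNT", "COMMENTS", "DURATION", "FAVORITED",
)


def parse_records(
    lines: Iterable[str], fields: Sequence[str], sep: str = "|"
) -> Iterator[Dict[str, str]]:
    """Yield one dict per well-formed line; values are kept as strings.

    Assumption: one record per line, fields separated by `sep`.
    Lines with the wrong number of fields (including blank lines)
    are skipped rather than guessed at.
    """
    for line in lines:
        values = [value.strip() for value in line.strip().split(sep)]
        if len(values) != len(fields):
            continue
        yield dict(zip(fields, values))


# Made-up sample line, for illustration only (not taken from the data set).
sample = ["x1abc | u42 | 1500 | 20160 | 3 | 4.5 | 12 | 7 | 215 | 9"]
for record in parse_records(sample, DAILYMOTION_FIELDS):
    print(record["VID"], int(record["VIEWS"]), int(record["AGE"]))
```

Values are left as strings because the numeric formatting of each column is not specified above; convert them once the actual file contents have been inspected. The same function can be reused for the other files by passing their own field lists.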
Yahoo! (collected 13-15 March, 2008; 99,207 videos, cf. Section 4.1.2)
- Download Yahoo! data file.
- Format: VID | SID | VIEWS | AGE | CATEGORY | RATING | RATINGCOUNT | COMMENTS | DURATION
VID is the video identifier,
SID is the source identifier,
VIEWS is the number of views to the video,
AGE is the number of days since the video was uploaded,
RATING is the average rating assigned to the video,
RATINGCOUNT is the number of times the video was rated,
COMMENTS is the number of comments made to the video,
and DURATION is the playback length of the video in units of seconds.
Metacafe (collected April 2008; 239,250 videos, cf. Section 4.1.3)
- Download Metacafe data file.
- Format: VID | DURATION | COMMENTS | RATINGS | AGE | VIEWS
VID is the video identifier,
DURATION is the playback length of the video in seconds,
COMMENTS is the number of comments to the video,
RATINGS is the average rating assigned to the video,
AGE is the age of the video (in minutes) measured
as time since upload, and
VIEWS is the number of views to the video.
Veoh (collected 18 March 2008; 269,531 videos, cf. Section 4.1.3)
- Download Veoh data file.
- Format: VID | AGE | DURATION | VIEWS | RATING | RATINGCOUNT
VID is the video identifier,
AGE is the number of days since the video was uploaded,
DURATION is the playback length of the video in units of seconds,
VIEWS is the number of views to the video,
RATING is the average rating assigned to the video, and
RATINGCOUNT is the number of times the video was rated.
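Note that AGE is given in minutes for Dailymotion and Metacafe but in days for Yahoo! and Veoh, so ages should be converted to a common unit before comparing sites. The following is a minimal sketch of such a conversion; the site keys and function names are hypothetical and not part of the data files.

```python
MINUTES_PER_DAY = 24 * 60

# Number of AGE units per day, following the field descriptions above:
# minutes for Dailymotion and Metacafe, days for Yahoo! and Veoh.
# The lowercase site keys are an assumption made for this example.
AGE_UNITS_PER_DAY = {
    "dailymotion": MINUTES_PER_DAY,
    "metacafe": MINUTES_PER_DAY,
    "yahoo": 1,
    "veoh": 1,
}


def age_in_days(site: str, age: float) -> float:
    """Convert a site's AGE value to days."""
    return age / AGE_UNITS_PER_DAY[site]


def views_per_day(site: str, views: int, age: float):
    """Average views per day since upload, or None if the age is not positive."""
    days = age_in_days(site, age)
    if days <= 0:
        return None
    return views / days


print(age_in_days("dailymotion", 2880))   # 2880 minutes
print(views_per_day("veoh", 300, 3))      # 300 views over 3 days
```

Average views per day is only one simple way to compare videos of different ages; it is shown here to illustrate the unit conversion, not as the measure used in the paper.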
Dailymotion Longitudinal Dataset (cf. Section 6.3)
- Download Dailymotion longitudinal data file.
- Format: VID | AGE | VIEWS_1 | VIEWS_2 | VIEWS_3 | ... | VIEWS_13
VID is the video identifier,
AGE is the number of minutes since the video was uploaded as measured at time of first meta-data collection,
VIEWS_1 is the number of views to the video at time of first meta-data collection,
VIEWS_2 is the number of views to the video one week following first collection,
VIEWS_3 is the number of views to the video two weeks following first collection,
and so on. In total, we have 12 snapshots of view counts, each exactly one week apart.
The first meta-data collection occurred between 13 and 20 July 2008. The file has a header row providing information on the data file's various columns.
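Because the VIEWS columns are cumulative counts, per-week views are obtained by differencing consecutive snapshots. The sketch below shows this, along with one simple way to measure how "hot" a hot set chosen by life-time views is during the following week. It assumes "|"-separated rows with integer counts and that the caller has already skipped the header row. The function names are hypothetical, and the hot-set measure is an illustration rather than the exact analysis in the paper.

```python
from typing import Dict, List, Tuple


def parse_longitudinal_row(line: str, sep: str = "|") -> Tuple[str, int, List[int]]:
    """Split one data row into (VID, AGE, [VIEWS_1, VIEWS_2, ...]).

    Assumption: AGE and all view counts are integers. Works for
    however many VIEWS columns the row contains.
    """
    parts = [part.strip() for part in line.split(sep)]
    return parts[0], int(parts[1]), [int(part) for part in parts[2:]]


def weekly_increments(views: List[int]) -> List[int]:
    """Views gained between each pair of consecutive snapshots."""
    return [later - earlier for earlier, later in zip(views, views[1:])]


def hot_set_share(snapshots: Dict[str, List[int]], week: int, size: int) -> float:
    """Share of the views gained between snapshot `week` and `week + 1`
    (0-based indices) that went to the `size` videos with the most
    cumulative views at snapshot `week`."""
    gained = {vid: v[week + 1] - v[week] for vid, v in snapshots.items()}
    total = sum(gained.values())
    if total == 0:
        return 0.0
    hot = sorted(snapshots, key=lambda vid: snapshots[vid][week], reverse=True)[:size]
    return sum(gained[vid] for vid in hot) / total


# Made-up values, for illustration only.
vid, age, views = parse_longitudinal_row("x1 | 5000 | 100 | 150 | 175")
print(vid, weekly_increments(views))

snapshots = {"a": [1000, 1010], "b": [10, 100], "c": [500, 500]}
print(hot_set_share(snapshots, week=0, size=1))
```

In the made-up example, the video with the most life-time views ("a") receives only a small share of the next week's views, which is the kind of churn effect a small hot set is exposed to.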