apprecommender_with_popcon.txt

TL;DR
------------------------------------------------------------------------

This is a proposal for popcon to calculate and publish an additional aggregate
data set based on user submissions. popcon data is also very valuable, and in
our opinion it is currently underused. It is also very sensitive, due to being
provided by individual Debian users, so in this message we describe exactly how
we propose to create this dataset, and which measures we took to avoid leaking
information that could be used to identify users.

Our primary objective is to publish package installation clusters, meaning
basically sets of packages that are usually installed together. Our goal is to
use this data for package recommendation, but it could also be used for other
ends; for example, one could use that data to identify packages that are
usually used together, and prioritize system upgrade tests that involve those
packages.

Introduction
------------------------------------------------------------------------

Recommender systems are widely used to help people find what they need, mainly
when they have so many options that it is hard for them to know all of their
options. Systems with a large number of options, like Netflix and Amazon, use
recommender systems to help their users to find good options for them. These
systems identify the user choices, then find other users that makes similar
choices, then take the options chosen by these similar users that the original
user did not choose yet and recommend these options to the user. This
recommendation strategy, which uses data from other users with similar choices,
is known as collaborative.

AppRecommender is an package recommender for Debian which was created by Tassia
Camões Araújo. AppRecommender basically recommends new packages based on the
ones the user has already installed. In order to do that, AppRecommender
implements a few strategies, amongst them the content based, based on the
package metadata alone (description + debtags), and a collaborative strategy.
Tassia made a study with users to identify the quality of the strategies, and
as a result she found that a collaborative strategy produces better results
than the content based strategy.

The collaborative strategy of AppRecommender was implemented using
popularity-contest data privately, but in practice today AppRecommender has
only content based strategies. In order to use a collaborative strategy we
would need to have the popularity-contest data on the user's computer, so this
strategy was removed because such data could not be properly distributed.

AppRecommender is currently the subject of two Google Summer of Code projects,
and the idea of using the popularity-contest data resurfaced.

This is a proposal to use popularity-contest data on AppRecommender without
breaking the privacy terms, and to contribute with popularity-contest by adding
a new source of statistics to the popularity-contest server that can be used by
other applications.

How popularity-contest submissions look like
-----------------------------------------------------------------------

A popularity-contest submission is provided with the following format:

POPULARITY-CONTEST-0 TIME:1469645265 ID:596876bc37dd6d3f5cd97feebeb555e0 ARCH:amd64 POPCONVER:1.64 VENDOR:Debian
1469620800 1465819200 gnome-themes-standard /usr/lib/x86_64-linux-gnu/gtk-2.0/2.10.0/engines/libadwaita.so
1469620800 1465516800 libnl-route-3-200 /usr/lib/x86_64-linux-gnu/libnl-route-3.so.200.22.0
1469620800 1465516800 libruby2.3 /usr/lib/ruby/2.3.0/rdoc/generator/template/darkfish/fonts/Lato-Regular.ttf
1469620800 1465516800 dconf-gsettings-backend /usr/lib/x86_64-linux-gnu/gio/modules/libdconfsettings.so
1469620800 1465516800 python-dbus /usr/lib/python2.7/dist-packages/_dbus_glib_bindings.x86_64-linux-gnu.so
1469620800 1465516800 cinnamon /usr/lib/x86_64-linux-gnu/cinnamon/libcinnamon.so
1469620800 1465516800 libclutter-1.0-0 /usr/lib/x86_64-linux-gnu/libclutter-1.0.so.0.2600.0
1469620800 1465516800 udev /lib/systemd/systemd-udevd
1469620800 1466942400 cups-daemon /usr/sbin/cupsd

Each line of a popularity-contest submission has the following information:

* atime: The most recent time that a package file was accessed.
* ctime: the last time a package file was updated or created.
* package name: The name of the package
* mru file: The most recent used file on the package.

This data basically lists the most recent applications a user has  used. This
data is generated daily, and submitted to the popularity-contest server. With
that information for a huge number of users, popularity-contest can them
publish some statistics about the use of some packages, such as the number of
users that have a given package and the most popular packages on the
distribution.

Proposal: clustering popcon data
------------------------------------------------------------------------

Another possible use of the popcon data would be clustering. In statistics, a
cluster is a subset of the items in a dataset that have similar characteristics
according to predefined criteria.

We propose preprocessing the popcon data to create clusters of packages that
are usually installed and used together. Given an user, we can find the
clusters of packages that are more similar to the user's packages, and then
recommend the packages in those clusters that the user doesn't already have
installed.

How the clusterization works
------------------------------------------------------------------------

The first step is to create a list of all possible packages. We are currently
using binary packages from unstable and stable. For a small example, assume we
would only consider the following list of packages: git, gedit, vagrant, and
vim.

The second step on the process is mapping packages and popcon submissions.  For
each submission, we generate a list of 0 and 1 values, based on whether that
submission contains or does not contain the corresponding package in the
package list. For example:

  Packages:           git,  gedit,  vagrant, vim.
  submission 1:       1     0       1        0
  submission 2:       0     0       1        1

We can seee that submission 1 only contains git and vagrant, and not gedit and
vim. submission 2 has vagrant and vim, but not git nor gedit.

This way each submission is transformed into a binary vector. At the end, this
will produce a matrix where each row represents a submission and each column, a
package.

  1     0       1      0
  0     0       1      1

This matrix is them used to actually create the user clusters.  The actual
clustering process is a little complicated, but you can review the actual code
on the patch attached to this bug report:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=XXXXXX

TODO: submit a bug report to popularity-contest with the clusterization script.
Add the number above.

The process produces two data files. The first file has all the coordinates of
the calculated clusters. This is used to place a new user on a cluster, i.e.
identify how similar a given user is to each cluster.

The clusters coordinates are on cartesian plane, where the dimensions are the
Debian packages, for this example, are the git, gedit, vagrant and vim,
therefore a cluster its a number group that represents how this cluster are
nearest from these packages.

The coordinates of the calculated clusters file have the following format for
each line, where each line represents one cluster:

  0.3; 0.15; 0.7; 0.42

The second lists the clusters in which each package appears. For example:

  vim: 0:3, 1:2, 3:1

This says that vim is on clusters 0, 1 and 3. It also records the number of
times the package appears on each given cluster. For example, on cluster 0, 3
users have the package vim.

The main reason behind these files is to be able to identify to which cluster
an user belongs to, also then based on that, calculate which packages can be
recommended.

Privacy risks involved with publishing the cluster data
------------------------------------------------------------------------

The most relevant privacy risk involved in publishing the cluster data is
allowing popcon users to be identifier through analysis of the cluster data.
This is a problem that can involve several risks to users, because if a user
has been identified by a malicious person, this person can attack the user, one
type of attack its using a security flaw in one of the user packages.  Other
risk is: if an user is identified, like use these informations to identify the
user into another system.

Another risk is an attacker modifying the cluster data while it is being
transferred to the user, with a man in the middle attack, or via a direct
attack on the server. The both attacks are methods when the attacker can
grant access to cluster data, and then its possible to him makes changes on
the clusters data and manipulates the recommendation. An example of this
manipulation its the attacker makes that all clusters have only the packages
that he wants to be recommended.

Measures taken to minimize those privacy risks
------------------------------------------------------------------------

To minimize the risk of identifying an users, some measures are taken.
One of these measures is the anonymization, that involves removing any
identifying information from the data. Therefore the only data of users that
is available on clusters is their list of packages.

Even then, some extra measures are taken to protect the users, such as not
considering packages used by less than a given percentage of users. This avoids
identifying users of packages that have very few users.

Another measure is to randomly sort popularity-contest submissions before using
them to generate the clusters, and removing a portion of this submissions. This
reduces the chances of identifying an individual user, because the clusters
does not contain data from of all users.

Yet another measure is randomization, where a random package is added for each
user submission. Although this decreases the clusters precision, this also
improves the users privacy, because this action adds a degree of uncertainty
about users' packages.

To avoid tampering of the cluster data between the server and user machines,
the cluster data is first hashed with SHA256, and the resulting release file
with the hashes is gpg-signed by the popcon server (just like secure APT
repositories). After downloading the data, AppRecommender first validates the
gpg signature of the release file, and then validates the hashes of the data
files.