Notes
Recursing the Rabbit Hole
https://notes.pairml.com/
Tue, 25 Oct 2022 04:24:02 +0000
Deep Learning without Poor Local Minima
<h3 id="importance">Importance</h3>
<p>It has long been an open question why training a deep neural network is tractable. Specifically, we know that training a deep neural network is a non-convex problem with possible local minima. We also know that stochastic gradient descent is prone to getting stuck in such local minima. What has confounded researchers for a long time is <strong>why, while searching such a huge solution space, a neural network does not get stuck in “local minima” more often</strong> and learn bad solutions. This work tries to shed some light on these open questions.</p>
<h3 id="prior-works">Prior works</h3>
<p>There is an earlier paper in the field, published back in 1989, that proves similar statements but under much tighter assumptions. This work builds on it using different techniques and relaxes the assumptions, relying on only 2 of the original 7 used in the prior work.</p>
<h3 id="interesting-results">Interesting Results</h3>
<p>Part of the main result says “Every critical point that is not a global minimum is a saddle point”, which is interesting because it can be loosely translated to “there are no local maxima” which can be quite unintuitive to imagine at first.</p>
<blockquote>
<p>There are no local maxima.</p>
</blockquote>
<h3 id="prerequisites">Prerequisites</h3>
<p>A brief understanding of the following topics would be good to have:</p>
<ul>
<li>Linear Algebra. duh!</li>
<li>First order and second order necessary conditions for local minima.</li>
<li>Familiarity with graph interpretation of neural networks.</li>
</ul>
<h3 id="contributions">Contributions</h3>
<p>This paper proves the following statements for the squared loss function of deep linear networks of any depth and any widths:</p>
<ul>
<li>The function is non-convex and non-concave.</li>
<li>Every local minimum is a global minimum.</li>
<li>Every critical point that is not a global minimum is a saddle point.</li>
<li>“Bad” saddle points, i.e. saddle points where the Hessian has no negative eigenvalue, exist for deeper networks.</li>
</ul>
<p>Shallow networks don’t have this issue of bad saddle points. This work further proves the same 4 statements for deep non-linear neural networks via a reduction.</p>
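<p>To make the first and third statements concrete, here is a minimal sketch on a toy case that is not taken from the paper: a scalar two-layer linear network with loss <script type="math/tex">L(w_1, w_2) = (w_1 w_2 - 1)^2</script>. The function names are illustrative assumptions. The sketch checks that the midpoint of two global minima has a higher loss (so the function is not convex) and that the critical point at the origin has Hessian eigenvalues of both signs (so it is a saddle point).</p>

```python
import math

def loss(w1, w2):
    """Squared loss of a scalar two-layer linear network fitting x=1 to y=1."""
    return (w1 * w2 - 1.0) ** 2

def gradient(w1, w2):
    """Gradient of the loss with respect to (w1, w2)."""
    r = w1 * w2 - 1.0
    return (2.0 * r * w2, 2.0 * r * w1)

def hessian_eigenvalues(w1, w2):
    """Eigenvalues of the symmetric 2x2 Hessian, in ascending order."""
    a = 2.0 * w2 ** 2            # d2L/dw1^2
    d = 2.0 * w1 ** 2            # d2L/dw2^2
    b = 4.0 * w1 * w2 - 2.0      # d2L/dw1dw2
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
    return (mean - radius, mean + radius)

# Two global minima and their midpoint: the loss is higher at the midpoint,
# so the function cannot be convex.
print(loss(1.0, 1.0), loss(-1.0, -1.0), loss(0.0, 0.0))

# The origin is a critical point whose Hessian has eigenvalues of both signs,
# i.e. a saddle point rather than a local minimum.
print(gradient(0.0, 0.0), hessian_eigenvalues(0.0, 0.0))
```

<p>In this toy case the only critical point that is not a global minimum is the origin, which is consistent with the statements listed above.</p>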
<h2 id="references">REFERENCES:</h2>
<p><small><a href="https://papers.nips.cc/paper/2016/hash/f2fc990265c712c49d51a18a32b39f0c-Abstract.html" target="_blank">Deep Learning without Poor Local Minima</a></small></p>
Tue, 10 Nov 2020 00:00:00 +0000
https://notes.pairml.com/2020/11/10/deep-learning-without-poor-local-minima/
machine-learning, mathematics, papers, theorems
Generative Adversarial Networks
<h3 id="introduction">Introduction</h3>
<p>The basic <strong>adversarial</strong> framework of the GAN architecture can be broken down into the following <strong>two players</strong>:</p>
<ul>
<li>A <strong>generative</strong> model <script type="math/tex">G</script>, that tries to capture the latent data distribution.</li>
<li>A <strong>discriminative</strong> model <script type="math/tex">D</script>, that estimates the probability that a sample came from training data rather than <script type="math/tex">G</script>.</li>
</ul>
<p>The framework is adversarial in the sense that the training procedure for <script type="math/tex">G</script> tries to <strong>maximize the probability of <script type="math/tex">D</script> making a mistake</strong>. The framework thus corresponds to a minimax two-player game.</p>
<h3 id="related-generative-models">Related Generative Models</h3>
<ul>
<li>Restricted Boltzmann Machines</li>
<li>Deep Boltzmann Machines</li>
<li>Deep Belief Networks</li>
<li>Denoising Autoencoders</li>
<li>Contractive Autoencoders</li>
<li>Generative Stochastic Network</li>
</ul>
<h3 id="notations">Notations</h3>
<ul>
<li>Easiest to implement GANs when the models are <strong>multilayer perceptrons</strong> for both generator and discriminator.</li>
<li><script type="math/tex">p_g</script> is the generator’s distribution over data <script type="math/tex">x</script>.</li>
<li><script type="math/tex">p_z(z)</script> is a prior over the input noise variables and <script type="math/tex">G(z; \theta_g)</script> is the mapping to data space.</li>
<li><script type="math/tex">G</script> is a differentiable function with parameters <script type="math/tex">\theta_g</script>.</li>
<li><script type="math/tex">D</script> is another differentiable function that outputs a scalar.</li>
<li><script type="math/tex">D(x)</script> represents the probability that <script type="math/tex">x</script> came from the training data rather than from <script type="math/tex">G</script>. <script type="math/tex">D</script> is trained to maximize the probability of assigning the correct label to both training examples and samples from <script type="math/tex">G</script>.</li>
<li><script type="math/tex">G</script> is simultaneously trained to minimize <script type="math/tex">\log(1-D(G(z)))</script>.</li>
</ul>
<h3 id="optimization-objective">Optimization Objective</h3>
<p>The training framework between <script type="math/tex">D</script> and <script type="math/tex">G</script> can be represented by a two-player minimax game with value function <script type="math/tex">V(D, G)</script>,</p>
<script type="math/tex; mode=display">\min_G \max_D V(D, G) = \mathbb{E}_{x\sim p_{data}(x)} [\log D(x)] + \mathbb{E}_{z\sim p_z(z)} [\log(1 - D(G(z)))]</script>
<p><img src="/assets/2019-10-29-generative-adversarial-networks/fig-2-Generative-Adversarial-Network-GAN.png?raw=true" alt="GAN Architecture" /></p>
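<p>As a rough illustration of the value function, the sketch below estimates <script type="math/tex">V(D, G)</script> from a batch by replacing the two expectations with sample means. The function name and the example discriminator outputs are made up for illustration; no actual networks are involved.</p>

```python
import math

def value_function(d_real, d_fake):
    """Monte Carlo estimate of V(D, G).

    d_real: discriminator outputs D(x) on samples x drawn from the data.
    d_fake: discriminator outputs D(G(z)) on generated samples.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that cannot tell the two apart outputs 0.5 everywhere.
confused = value_function([0.5, 0.5], [0.5, 0.5])
# A confident, correct discriminator pushes V towards its maximum of 0.
confident = value_function([0.99, 0.98], [0.01, 0.02])
print(confused, confident)
```

<p>The discriminator tries to push this estimate up, while the generator tries to pull the second term down.</p>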
<h3 id="implementation-details">Implementation Details</h3>
<ul>
<li><script type="math/tex">G</script> and <script type="math/tex">D</script> are trained iteratively, one after the other.</li>
<li><script type="math/tex">D</script> is not optimized to completion in the inner loop, as that would be computationally prohibitive and would lead to overfitting on a finite dataset.</li>
<li>Instead, training alternates between <script type="math/tex">k</script> steps of optimizing <script type="math/tex">D</script> and one step of optimizing <script type="math/tex">G</script>.</li>
<li>This keeps <script type="math/tex">D</script> near its optimum, so long as <script type="math/tex">G</script> changes slowly.</li>
<li>Early in learning, when <script type="math/tex">G</script> is poor, <script type="math/tex">D</script> can reject samples with high confidence, which causes <script type="math/tex">\log(1-D(G(z)))</script> to saturate.</li>
<li>Instead of minimizing <script type="math/tex">\log(1-D(G(z)))</script>, maximize <script type="math/tex">\log(D(G(z)))</script> for stronger gradients early in learning.</li>
</ul>
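<p>The saturation argument in the last two points can be checked numerically. The sketch below assumes, purely for illustration, that the discriminator ends in a sigmoid, so that <script type="math/tex">D(G(z)) = \sigma(a)</script> for a logit <script type="math/tex">a</script>, and compares the gradient of each generator objective with respect to that logit.</p>

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)), the term G minimizes in the original game."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da of log(sigmoid(a)), the term G maximizes in the alternative."""
    return 1.0 - sigmoid(a)

# a is the discriminator's logit on a generated sample, so D(G(z)) = sigmoid(a).
# Early in training D rejects fakes confidently, i.e. a is very negative.
for a in (-6.0, -3.0, 0.0):
    print(a, abs(grad_saturating(a)), abs(grad_non_saturating(a)))
```

<p>For very negative logits the first gradient is close to zero while the second stays close to one, which is the reason for swapping objectives early in training.</p>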
<p><img src="/assets/2019-10-29-generative-adversarial-networks/fig-1-gan-algorithm.png?raw=true" alt="Algorithm for GAN training" /></p>
<h3 id="theoretical-results">Theoretical Results</h3>
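<p>For a fixed <script type="math/tex">G</script>, the optimal discriminator can be found by maximizing the objective function pointwise w.r.t. <script type="math/tex">D(x)</script>. Writing both expectations as integrals over <script type="math/tex">x</script>, the integrand at each <script type="math/tex">x</script> is of the form,</p>
<script type="math/tex; mode=display">f(y) = a\,\log\,y + b\,\log\,(1-y)</script>
<p>with <script type="math/tex">y = D(x)</script>, <script type="math/tex">a = p_{data}(x)</script> and <script type="math/tex">b = p_g(x)</script>. Differentiating w.r.t. <script type="math/tex">y</script> gives,</p>
<script type="math/tex; mode=display">\frac{df(y)}{dy} = \frac{a}{y} - \frac{b}{1-y}</script>
<p>Since we are maximising this and <script type="math/tex">f</script> is concave in <script type="math/tex">y</script>, the maximum is at the point where the derivative is zero, i.e.,</p>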
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\frac{df(y)}{dy} &= 0 \\
\frac{a}{y} - \frac{b}{1-y} &= 0 \\
\frac{a}{y} &= \frac{b}{1-y} \\
y &= \frac{a}{a+b}
\end{align} %]]></script>
<p>So the optimal discriminator for a fixed <script type="math/tex">G</script> is given by,</p>
<script type="math/tex; mode=display">D_G^*(x) = \frac{p_{data}(x)}{p_{data}(x)+p_{g}(x)}</script>
<p>For this maximized <script type="math/tex">D</script>, the optimization objective can be rewritten as,</p>
<script type="math/tex; mode=display">C(G) = \mathbb{E}_{x\sim p_{data}} \left[\log \frac{p_{data}(x)}{p_{data}(x)+p_{g}(x)}\right] + \mathbb{E}_{x\sim p_g} \left[\log\frac{p_{g}(x)}{p_{data}(x)+p_{g}(x)}\right]</script>
<p>We can show that this expression is minimized for <script type="math/tex">p_g=p_{data}</script>. At <script type="math/tex">p_g=p_{data}</script>, the value of <script type="math/tex">D_G^*(x)</script> is <script type="math/tex">1/2</script> and <script type="math/tex">C(G) = -\log\,4</script>.</p>
<p>To see that this is the minimum possible value, consider the following rewriting of the <script type="math/tex">C(G)</script> expression above,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
C(G) &= \mathbb{E}_{x\sim p_{data}} \left[\log \frac{p_{data}(x)}{p_{data}(x)+p_{g}(x)}\right] + \mathbb{E}_{x\sim p_g} \left[\log\frac{p_{g}(x)}{p_{data}(x)+p_{g}(x)}\right] + 2\log\,2 - \log\,4 \\
&= - \log\,4 + \mathbb{E}_{x\sim p_{data}} \left[\log \frac{p_{data}(x)}{p_{data}(x)+p_{g}(x)}\right] + \log\,2 + \mathbb{E}_{x\sim p_g} \left[\log\frac{p_{g}(x)}{p_{data}(x)+p_{g}(x)}\right] + \log\,2 \\
&=- \log\,4 + \mathbb{E}_{x\sim p_{data}} \left[\log \frac{p_{data}(x)}{p_{data}(x)+p_{g}(x)} + \log\,2\right] + \mathbb{E}_{x\sim p_g} \left[\log\frac{p_{g}(x)}{p_{data}(x)+p_{g}(x)} + \log\,2\right] \\
&=- \log\,4 + \mathbb{E}_{x\sim p_{data}} \left[\log \frac{p_{data}(x)}{\frac{p_{data}(x)+p_{g}(x)}{2}}\right] + \mathbb{E}_{x\sim p_g} \left[\log\frac{p_{g}(x)}{\frac{p_{data}(x)+p_{g}(x)}{2}}\right] \\
&= -\log\,4 + KL\left(p_{data}\,\middle\|\,\frac{p_{data}+p_g}{2}\right) + KL\left(p_{g}\,\middle\|\,\frac{p_{data}+p_g}{2}\right)\\
&= -\log\,4 + 2\cdot JSD(p_{data}\,\|\,p_g)
\end{align} %]]></script>
<p>The last term is the Jensen-Shannon divergence between the two distributions, which is always non-negative and zero only when the two distributions are equal. So <script type="math/tex">C^* = -\log\,4</script> is the global minimum of <script type="math/tex">C(G)</script>, attained at <script type="math/tex">p_g=p_{data}</script>, i.e. when the generative model perfectly replicates the data distribution.</p>
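<p>The identity relating <script type="math/tex">C(G)</script> to the Jensen-Shannon divergence can be sanity-checked on small discrete distributions. The sketch below is only a numerical check under that discrete assumption; the helper names and the example distributions are illustrative.</p>

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the mixture (p + q) / 2."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def c_of_g(p_data, p_g):
    """C(G) evaluated with the optimal discriminator p_data / (p_data + p_g)."""
    total = 0.0
    for pd, pg in zip(p_data, p_g):
        if pd > 0:
            total += pd * math.log(pd / (pd + pg))
        if pg > 0:
            total += pg * math.log(pg / (pd + pg))
    return total

p_data = [0.5, 0.3, 0.2]
p_g = [0.2, 0.2, 0.6]
# Both sides of the identity, then the value when the generator is perfect.
print(c_of_g(p_data, p_g), -math.log(4.0) + 2.0 * jsd(p_data, p_g))
print(c_of_g(p_data, p_data), -math.log(4.0))
```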
<h3 id="complexity-comparison-of-generative-models">Complexity Comparison of Generative Models</h3>
<p><img src="/assets/2019-10-29-generative-adversarial-networks/fig-3-comparison-of-generative models.png?raw=true" alt="" /></p>
<h3 id="disadvantages">Disadvantages</h3>
<ul>
<li>There is no explicit representation of <script type="math/tex">p_g(x)</script></li>
<li><script type="math/tex">G</script> must be synchronized well with <script type="math/tex">D</script> during training. <script type="math/tex">D</script> may become too strong, leading to zero gradient for <script type="math/tex">G</script>, or <script type="math/tex">D</script> may be too weak, which causes <script type="math/tex">G</script> to collapse too many values of <script type="math/tex">z</script> to the same value of <script type="math/tex">x</script>, leaving too little diversity to model <script type="math/tex">p_{data}</script>.</li>
</ul>
<h3 id="follow-up-citations">Follow-up Citations</h3>
<ul>
<li>RBMs and DBMs
<ul>
<li>A fast learning algorithm for deep belief nets by Hinton et al.</li>
<li>Deep Boltzmann machines by Salakhutdinov et al.</li>
<li>Information processing in dynamical systems: Foundations of harmony theory by Smolensky</li>
</ul>
</li>
<li>MCMC
<ul>
<li>Better mixing via deep representations by Bengio et al.</li>
<li>Deep generative stochastic networks trainable by backprop by Bengio et al.</li>
</ul>
</li>
<li>Encodings
<ul>
<li>What is the best multi-stage architecture for object recognition? by Jarrett et al.</li>
<li>Generalized denoising auto-encoders as generative models by Bengio et al.</li>
<li>Deep sparse rectifier neural networks by Glorot et al.</li>
<li>Maxout networks by Goodfellow et al.</li>
</ul>
</li>
<li>Optimizations
<ul>
<li>Auto-encoding variational Bayes by Kingma et al.</li>
<li>Stochastic backpropagation and approximate inference in deep generative models by Rezende et al.</li>
</ul>
</li>
<li>Learning deep architectures for AI by Bengio Y.</li>
</ul>
<h2 id="references">REFERENCES:</h2>
<p><small><a href="https://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf" target="_blank">Generative Adversarial Nets by Goodfellow et al.</a></small><br /></p>
Tue, 29 Oct 2019 00:00:00 +0000
https://notes.pairml.com/2019/10/29/generative-adversarial-networks/
gan, machine-learning, papers, privacy-gans
Privacy Preserving Predictive Modeling GANs
<h3 id="introduction">Introduction</h3>
<p>In the current era of increasing data sharing and ubiquitous machine learning, there is very little focus on the privacy of the subjects whose data is being shared at such a large scale. While most released data promises anonymization of PII, there is much work in the literature pointing out that simple anonymization techniques fail to mask users if there is access to an auxiliary dataset in which such features are not masked.</p>
<p>To address this issue, privacy preservation for datasets in the public domain has been studied, with the aim of ensuring public trust in sharing information, which is ultimately the root of building amazing machine learning models and drawing insights from big data.</p>
<p>This paper discusses a new GAN architecture that tries to achieve this very objective of anonymizing private data, but in a deep learning setting where the work is overseen by an objective function for the encoder. This objective embeds the notion of anonymizing sensitive columns while trying to maximize the predictivity of non-sensitive data columns.</p>
<h3 id="about-the-paper">About the paper</h3>
<p>The GAN framework presented consists of three components (explained in detail in later sections):</p>
<ul>
<li>Encoder, which learns the representation of the data.</li>
<li>Ally, which predicts the desired variables.</li>
<li>Adversary, which predicts the sensitive data.</li>
</ul>
<p>The objective of the GAN framework is two-fold:</p>
<ul>
<li>learn a low-dimensional representation of data points (users in this case) that excels at a desired classification task (here, whether a user will answer a quiz question correctly)</li>
<li>prevent an adversary from recovering sensitive data (each user’s identity in this case)</li>
</ul>
<h3 id="background">Background</h3>
<p>The Netflix dataset is one of the famous datasets used in various starter tutorials on recommendation systems. It is also at the centre of a well-known privacy breach involving seemingly harmless data. While the company had anonymized the data to prevent a breach of privacy, it was later discovered that users can be located with good accuracy using auxiliary data from related sources such as IMDb, which does not anonymize its data. Similarly, there are cases in the insurance domain where reverse engineering has been shown to be viable despite anonymization.</p>
<p>It has similarly been shown that one can identify anonymous users in online social networks with up to 90% accuracy. These results point to the fact that it is possible for an attacker to uncover sensitive attributes from user data.</p>
<p>One popular line of work that has gained traction in this field is Differential Privacy (DP), which proposes adding random noise (generally from a Laplace distribution) to the raw data, where the noise level controls the trade-off between predictive quality and user privacy. However, it has been found that this mechanism also reduces the utility of the data for predictive modeling and increases the sample complexity.</p>
<p>The GAN model presented in this particular work is an effort to achieve the privacy (to prevent de-anonymization) while preserving the predictive aspects of the dataset (to overcome the drawbacks of techniques like DP).</p>
<h3 id="contributions">Contributions</h3>
<p>The authors apply this GAN architecture in an online MOOC setting. The objectives of the work include:</p>
<ul>
<li>Use the student data to predict whether or not they will answer a quiz correctly.</li>
<li>Ensure that an adversary trained on the encoded data cannot accurately recover sensitive data such as user identity.</li>
</ul>
<p><img src="/assets/2019-10-14-learning-informative-and-private-representations/fig-1-gan-architecture.png?raw=true" alt="GAN Architecture" /></p>
<p>The work is different from DP in two key aspects:</p>
<ul>
<li>It is data-dependent, i.e., it learns representations from user data</li>
<li>Directly uses raw user data without relying on feature engineering</li>
</ul>
<p>The objective of the GAN is to generate representations that minimize the loss function of the ally while maximizing the loss function of the adversary.</p>
<p>One key advantage mentioned about the architecture is that it is model agnostic, i.e. each module can instantiate a specific differentiable function (e.g. a neural network) based on the needs of the particular application.</p>
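<p>One plausible way to combine the two competing terms into a single encoder objective is a weighted difference controlled by a parameter <script type="math/tex">\alpha</script>, like the one discussed in the findings further down. The exact form below is an assumption for illustration, not a transcription of the paper’s loss, and the function name is hypothetical.</p>

```python
def encoder_objective(ally_loss, adversary_loss, alpha):
    """Hypothetical weighted objective for the encoder (to be minimized).

    alpha close to 1 favours the ally's prediction task; alpha close to 0
    favours hiding the sensitive attribute from the adversary.
    """
    if alpha > 1.0 or 0.0 > alpha:
        raise ValueError("alpha must lie in [0, 1]")
    # Low ally loss is rewarded, high adversary loss is rewarded.
    return alpha * ally_loss - (1.0 - alpha) * adversary_loss

# Same losses, different emphasis between utility and privacy.
print(encoder_objective(0.4, 1.2, alpha=1.0))
print(encoder_objective(0.4, 1.2, alpha=0.5))
print(encoder_objective(0.4, 1.2, alpha=0.0))
```

<p>Under this sketch, the ally and the adversary would each still minimize their own loss, and only the encoder sees the combined objective.</p>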
<h3 id="algorithm">Algorithm</h3>
<p><img src="/assets/2019-10-14-learning-informative-and-private-representations/fig-2-algorithm.png?raw=true" alt="" /></p>
<h3 id="datasets-and-objectives">Datasets and Objectives</h3>
<p>The paper presents empirical results on a dataset from the course “Networks: Friends, Money and Bytes” on the Coursera MOOC platform. The course has a total of 92 in-video quiz questions across 20 lectures, and each lecture has 4-5 videos. A total of 314,632 clickstreams were recorded for 3976 unique students.</p>
<p>Two types of data are collected about students:</p>
<ul>
<li>Video-watching clickstream: behavior is recorded as a sequence of clickstreams based on actions available in the scrub bar.</li>
<li>Question submissions: answers submitted by a student to an in-video question</li>
</ul>
<p>The final <strong>objective</strong> is defined as a mapping from a student’s interaction (clickstream) on a video to their performance on questions (data acquired from question submissions).</p>
<p>The data collected can have both time-varying and static attributes. Time-varying attributes include the series of clickstreams before a question is answered, while the static attributes include metrics like the fraction of the course completed, the amount of time spent etc.</p>
<h3 id="metrics">Metrics</h3>
<ul>
<li>Accuracy on binary prediction of questions answered</li>
<li>AUC-ROC curve to assess the tradeoff between true and false positive rates</li>
<li>K Ranks and Mean average precision at K (MAP@K) to measure performance of privacy preservation</li>
</ul>
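<p>For the privacy metric, a minimal sketch of MAP@K is given below, assuming that for each sample the adversary produces a ranked list of candidate user ids and that exactly one of them is correct. The function names and toy rankings are illustrative. Under this reading, a lower adversary MAP@K would indicate better privacy preservation.</p>

```python
def average_precision_at_k(ranked_ids, true_id, k):
    """AP@K when exactly one candidate (the true identity) is relevant."""
    for rank, candidate in enumerate(ranked_ids[:k], start=1):
        if candidate == true_id:
            return 1.0 / rank
    return 0.0

def map_at_k(rankings, true_ids, k):
    """Mean of AP@K over all queries."""
    scores = [average_precision_at_k(r, t, k) for r, t in zip(rankings, true_ids)]
    return sum(scores) / len(scores)

# Each inner list is the adversary's ranking of candidate user ids for one sample.
rankings = [["u3", "u1", "u2"], ["u2", "u3", "u1"], ["u1", "u2", "u3"]]
true_ids = ["u1", "u1", "u1"]
print(map_at_k(rankings, true_ids, k=2))
```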
<h3 id="baselines">Baselines</h3>
<ul>
<li>Only one baseline is included in the work: the Laplace mechanism from Differential Privacy (DP), which simply adds Laplace noise to the data.</li>
</ul>
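<p>A minimal sketch of such a Laplace-noise baseline follows. It assumes the usual calibration where the noise scale is sensitivity divided by <script type="math/tex">\epsilon</script>; the <code>sensitivity</code> parameter, the function names, and the feature values are assumptions for illustration, and the paper’s exact setup may differ.</p>

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(values, sensitivity, epsilon, rng):
    """Add Laplace noise with scale sensitivity / epsilon to every value."""
    scale = sensitivity / epsilon
    return [v + laplace_noise(scale, rng) for v in values]

rng = random.Random(0)
features = [0.2, 0.5, 0.9]
# Small epsilon: strong privacy, heavy noise. Large epsilon: the opposite.
print(laplace_mechanism(features, sensitivity=1.0, epsilon=0.1, rng=rng))
print(laplace_mechanism(features, sensitivity=1.0, epsilon=10.0, rng=rng))
```

<p>The second call perturbs the features far less than the first, which is the trade-off between privacy and predictive utility that this baseline exposes.</p>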
<h3 id="findings-and-conclusions">Findings and Conclusions</h3>
<ul>
<li>The new architecture outperforms DP on the task of predicting question answers. It actually performs slightly better than the original features themselves.</li>
<li>With parameter <script type="math/tex">\alpha \to 1</script> in the GAN architecture, the encoder is biased towards prediction rather than obfuscation of sensitive data, which is theoretically expected.</li>
<li>A larger <script type="math/tex">\epsilon</script> in DP means adding a smaller noise component to the actual data, and models show better predictive performance under such a setting.</li>
<li>Larger encoding dimensions preserve more information about both the prediction target and the sensitive data for identical <script type="math/tex">\alpha</script> values. This confirms that the size of the representation controls the amount of information it contains.</li>
<li>Raw clickstream data with an LSTM performs better than the hand-crafted features in terms of the tradeoff between prediction quality and user privacy.</li>
</ul>
<h3 id="follow-up-citations">Follow-up Citations</h3>
<ul>
<li>J. Bennett, S. Lanning et al., “The Netflix prize,” in Proceedings of KDD Cup and Workshop, vol. 2007. New York, NY, USA, 2007, p. 35.</li>
<li>A. Narayanan and V. Shmatikov, “Robust de-anonymization of large sparse datasets,” in Security and Privacy, 2008. SP 2008. IEEE Symposium on. IEEE, 2008, pp. 111–125.</li>
<li>A. Narayanan and V. Shmatikov, “De-anonymizing social networks,” in Security and Privacy, 2009 30th IEEE Symposium on. IEEE, 2009, pp. 173–187.</li>
<li>C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” in Proc. Theory of Cryptography Conference. Springer, Mar. 2006, pp. 265–284.</li>
<li>C. Huang, P. Kairouz, X. Chen, L. Sankar, and R. Rajagopal, “Context-aware generative adversarial privacy,” arXiv preprint arXiv:1710.09549, Dec. 2017.</li>
</ul>
<h2 id="references">REFERENCES:</h2>
<p><small><a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8622089&tag=1" target="_blank">Learning Informative and Private Representations via Generative Adversarial Networks</a></small></p>
Mon, 14 Oct 2019 00:00:00 +0000
https://notes.pairml.com/2019/10/14/learning-informative-and-private-representations/
GAN, machine-learning, papers, privacy-gans
Social Learning Networks
<h3 id="active-recall">Active Recall</h3>
<ol>
<li><a href="#introduction">What is an SLN?</a></li>
<li><a href="#characteristics-of-sln">What do the nodes, connections and weights represent in an SLN graph?</a></li>
<li><a href="#types-of-sln-graphs">What are the types of graphs used to represent information in SLN?</a></li>
</ol>
<h3 id="introduction">Introduction</h3>
<ul>
<li>An SLN is a type of <strong>social network between students, instructors and modules of learning</strong>. The network captures the dynamics of learning behaviour over a variety of <strong>graphs that represent the relationships</strong> among the people and processes involved in learning.</li>
</ul>
<h3 id="characteristics-of-sln">Characteristics of SLN</h3>
<ul>
<li>People in network <strong>learn through interaction</strong>.</li>
<li>Consists of functionality models and <strong>graph-theoretic models</strong>.</li>
<li><strong>Nodes</strong> represent learners, learning concepts, or both learners and concepts.</li>
<li><strong>Links</strong> represent the connection between the nodes.</li>
<li>Links can be <strong>undirected to denote similarity</strong> or <strong>directed to denote flow of information</strong>.</li>
<li><strong>Link weights</strong> give magnitude to the connections, extracted through some network measurements.</li>
</ul>
<h3 id="applications-for-sln">Applications for SLN</h3>
<ul>
<li>The massive increase in popularity of <strong>MOOCs</strong> over the past decades has given access to an unprecedented amount of data that can readily be used in the context of SLNs. Among these data, the discussion forums are of utmost importance, as they are the primary means of interaction among students or between students and instructors, and they help extract SLNs.</li>
<li>The blended learning offered by <strong>FLIP</strong> that combines the elements of online and traditional instruction is also gaining popularity. In this system, watching the lectures online becomes homework and class time is used for discussions instead.</li>
<li>In the last decade, there has also been a steady rise in the popularity of Q&A sites, which have emerged alongside complementary search engines that allow users to enter questions in natural-language format.</li>
</ul>
<p><strong>1. MOOCs Discussion Forums</strong></p>
<ul>
<li>Similar to Q&A sites but have a different set of terminologies.</li>
<li>Each course has a <strong>discussion forum</strong> (hierarchy 1). The discussion forum comprises <strong>threads</strong> (hierarchy 2): the first <strong>post</strong> is by the creator of the thread, followed by <strong>comments</strong> (hierarchy 3), hence leading to a 3-level hierarchy.</li>
<li>A user interacting with these forums has one of the following options:
<ul>
<li>Create a thread</li>
<li>Create a post</li>
<li>Comment on an existing post</li>
</ul>
</li>
<li>When threads are created, each forum allows annotation of these threads with a number of categories. These categories are often inconsistent, as there is no effective mechanism to force learners to abide by them consistently.</li>
<li>Posts and comments can also <strong>receive up votes and down votes</strong> from users and staff.</li>
</ul>
<p><strong>2. FLIP</strong></p>
<ul>
<li>The interaction in these classes can be used to extract the SLN as the structure of posts and comments here would resemble that of MOOCs but would be different on the following accounts:
<ul>
<li><strong>Size and sparsity</strong>: smaller number of learners and denser connectivity among them.</li>
<li><strong>Informational vs conversational discussions</strong>: While MOOC discussions are more conversational in nature, SLN from FLIP would typically not include conversational discussions because students can talk informally outside the class and also because the discussions are presided over by an instructor.</li>
</ul>
</li>
</ul>
<p><strong>3. Q&A Sites</strong></p>
<ul>
<li>Q&A sites offer the following functions to their users:
<ul>
<li>post question</li>
<li>answer question</li>
<li>comment on answer</li>
<li>up/down vote post</li>
<li>allow asker to choose the best or acceptable answer</li>
<li>for quality assurance there are points associated with receiving up or down votes.</li>
</ul>
</li>
<li>Major differences can be seen between SLNs from such sites and SLNs from educational settings:
<ul>
<li><strong>Incentive structure</strong>: Q&A sites have a well-defined, automated set of incentives to encourage participation. These may or may not be present in MOOCs or FLIP, largely because the concept does not scale for these forums: participation in MOOCs and FLIP cannot be made part of a student’s academic assessment.</li>
<li><strong>Broader concept list</strong>: information propagated in an SLN for a course will be limited to materials associated with the subject. Also, each course has its own forum where only the enrolled students can participate. Q&A sites have some focal specificity, but typically one can expect the number of concepts emerging in these SLNs to be much broader.</li>
<li><strong>Single learning modality</strong>: In a Q&A site, the SLN is the only means of learning, whereas in an educational setting it is one of four modalities, alongside lecture videos, assessments and text resources.</li>
</ul>
</li>
</ul>
<h3 id="types-of-sln-graphs">Types of SLN Graphs</h3>
<p>These are the most commonly used graphs; the list is not comprehensive.</p>
<p><strong>1. Undirected graph among learners</strong></p>
<ul>
<li>Nodes represent learners</li>
<li>Undirected links indicate whether some shared characteristics hold between two learners. Properties could be age, geographic location, education level, whether or not they have interacted etc.</li>
<li>For nodes \(i\) and \(j\), one can say for example,
<ul>
<li>\(prop_k(i,j)\) is a binary variable that is 1 iff \(i\) and \(j\) satisfy property \(k\) in the set of properties \(K\),</li>
<li>\(P \leq |K|\) is a threshold constant, i.e. nodes \(i\) and \(j\) are connected if and only if they satisfy at least \(P\) of the criteria specified in \(K\).</li>
</ul>
</li>
</ul>
<script type="math/tex; mode=display">(i, j) \in G \Leftrightarrow \sum_{k \in K} prop_k(i, j) \geq P \tag{1} \label{1}</script>
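<p>The thresholding rule in equation (1) can be sketched as follows. The learner attributes, the property functions, and the function name are hypothetical examples chosen only to illustrate the rule.</p>

```python
from itertools import combinations

def build_undirected_sln(learners, properties, threshold):
    """Return the edge set {(i, j)} where at least `threshold` properties hold.

    learners: dict mapping learner id to an attribute dict.
    properties: list of functions prop_k(a, b) returning True or False.
    """
    edges = set()
    for i, j in combinations(sorted(learners), 2):
        satisfied = sum(1 for prop in properties if prop(learners[i], learners[j]))
        if satisfied >= threshold:
            edges.add((i, j))
    return edges

learners = {
    "ann": {"country": "IN", "level": "grad"},
    "bob": {"country": "IN", "level": "grad"},
    "cat": {"country": "US", "level": "grad"},
}
properties = [
    lambda a, b: a["country"] == b["country"],
    lambda a, b: a["level"] == b["level"],
]
print(build_undirected_sln(learners, properties, threshold=2))
```

<p>Lowering <code>threshold</code> makes the graph denser, since fewer shared properties are needed for a link.</p>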
<p><strong>2. Directed graph among learners</strong></p>
<ul>
<li>Used to note flow of information in the SLN.</li>
<li>A directional link from \(i\) to \(j\) represents an answer by \(j\) to a question posted by \(i\). Several restrictions can be added to this directional flow, e.g. include only the “best answers” given.</li>
<li>It could be a multi-graph where there is more than one link from \(i\) to \(j\) since learners can ask and answer more than one question each.</li>
</ul>
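<p>A directed multigraph of this kind can be sketched with a counter over (asker, answerer) pairs, where a count above one stands for parallel links. The tuple layout and the <code>best_only</code> flag are assumptions made for this illustration.</p>

```python
from collections import Counter

def build_directed_sln(answers, best_only=False):
    """Count links from asker to answerer; a count above 1 means parallel links.

    answers: iterable of (asker, answerer, is_best_answer) tuples.
    """
    links = Counter()
    for asker, answerer, is_best in answers:
        if best_only and not is_best:
            continue
        links[(asker, answerer)] += 1
    return links

answers = [
    ("ann", "bob", True),
    ("ann", "bob", False),
    ("cat", "ann", True),
]
print(build_directed_sln(answers))
print(build_directed_sln(answers, best_only=True))
```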
<p><strong>3. Undirected graph among learners and concepts</strong></p>
<ul>
<li>Nodes are used to represent both the learners as well as the concepts.</li>
<li>Key concepts are extracted in a number of ways:
<ul>
<li>running textual analysis to find keywords in discussions</li>
<li>using syllabus specified by the instructors</li>
</ul>
</li>
<li>Such graphs are generally bipartite graphs between learners and concepts.</li>
<li>A setting similar to \eqref{1} represents this graph, but each property \(k\) represents a condition on the participation of user \(i\) in concept \(j\).</li>
</ul>
<p><strong>4. Directed graph among learners and concepts</strong></p>
<ul>
<li>To depict structure of interactions in more details.</li>
<li>Concept nodes are used and each question or post by a learner is handled separately, which allows two sets of links for each post: \((i_0, j) \in G\) for learner \(i_0\) who makes the initial post, and \((j, i_l) \in G\) for each learner \(i_l\) who commented on \(j\).</li>
<li>In a forum where up/down votes are allowed, these links can be weighted to match the net votes obtained.</li>
</ul>
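<p>The two sets of links per post, weighted by net votes, could be assembled roughly as below. The dictionary keys (<code>node</code>, <code>author</code>, <code>net_votes</code>, <code>comments</code>) are hypothetical field names, and treating each post as its own node \(j\) is one reading of the description above.</p>

```python
def build_post_graph(posts):
    """Build weighted directed links for the learner/concept graph.

    posts: list of dicts with hypothetical keys 'node' (the node j for the
    post), 'author', 'net_votes', and 'comments' (a list of
    (commenter, net_votes) pairs).
    Returns a dict mapping (source, target) to accumulated net votes.
    """
    weights = {}
    for post in posts:
        j = post["node"]
        # Link (i_0, j): from the learner who made the initial post to the node.
        key = (post["author"], j)
        weights[key] = weights.get(key, 0) + post["net_votes"]
        # Links (j, i_l): from the node to each learner who commented on it.
        for commenter, votes in post["comments"]:
            key = (j, commenter)
            weights[key] = weights.get(key, 0) + votes
    return weights

posts = [
    {"node": "q1", "author": "ann", "net_votes": 3,
     "comments": [("bob", 2), ("cat", -1), ("bob", 1)]},
]
print(build_post_graph(posts))
```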
<h3 id="research-objectives">Research Objectives</h3>
<p><strong>1. Predictions</strong></p>
<ul>
<li><strong>Performance</strong>: ability to predict performance on assessments - homework, quiz, or exam questions - a student has not taken.</li>
<li><strong>Drop-off rate</strong>: predict drop-off rates of a course. This could be for an individual student or for the volume of participation in the course as a whole. Metrics of interest in such a case would be the completion rate of assignments and lecture videos, or the rate of involvement in discussion forums.</li>
</ul>
<p><strong>2. Recommendations</strong></p>
<ul>
<li><strong>Courses and topics</strong>: to help students locate courses of interest based on their interactions with the enrolled courses. This presents an opportunity to improve the learning experience by recommending new courses and redirecting to relevant discussion forums.</li>
<li><strong>Study buddies</strong>: to help students locate study partners over MOOCs where the number of students participating can vary in terms of engagement, interests, demographics, geography etc.</li>
</ul>
<p>Recommendation algorithms could focus on similarities or, more importantly, dissimilarities between users. For example, a learner who actively engages on discussion forums can be paired with one who is struggling in those topics.</p>
<p><strong>3. Peer-grading</strong></p>
<ul>
<li>In MOOCs, the teacher-to-student ratio is generally very small. As a result, it is infeasible for the teaching staff to manually grade each submission.</li>
<li>This is generally tackled by only assigning machine-gradable homework or exams, such as MCQs. But this limits the variety and quality of questions that can be posed.</li>
<li>A different approach is to have students score each other’s work. This method lacks efficacy so far, because:
<ul>
<li>different students have different grading quality</li>
<li>time commitment is required for grading</li>
</ul>
</li>
<li>Structure of SLN might help in locating quality-graders for each assignment related to a specific topic.</li>
</ul>
<p><strong>4. Personalization</strong></p>
<ul>
<li>Online education poses the question of a trade-off between efficacy and scale in learning. Statistics suggest that only 10% of students enrolling in a MOOC ever complete the course.</li>
<li>The ineffectiveness of MOOCs can be because of the following reasons:
<ul>
<li>teacher-to-student ratio is very low</li>
<li>learning is asynchronous</li>
<li>student population is very diverse and hard to personalize</li>
</ul>
</li>
<li>Advanced technology is required for course individualization, to lift the tradeoff curve and enable an effective learning environment at massive scale, rather than having a one-size-fits-all online course.</li>
<li>The information stored in SLNs can play a key role in making such adaptation a part of the learning experience.</li>
</ul>
<p><img src="/assets/2019-06-02-social-learning-networks/fig-1-flowchart-of-individualization.png?raw=true" alt="Fig-1: Flowchart of individualization" /></p>
<ul>
<li>The key components of such an individualization effort can be seen in the flowchart in Fig-1.
<ul>
<li><strong>Behavioural measurements</strong>: measurement of user behaviour while engaging with the course material, e.g. the video-watching trajectory (pauses and jumps) can be captured, and the information that a user enters in a discussion forum can also be collected.</li>
<li><strong>Data analytics</strong>: use machine learning techniques to generate a low-dimensional model of the high-dimensional process of learning. The latent space can be:
<ul>
<li>discovered through data mining</li>
<li>defined in advance in terms of author-specified learning features</li>
</ul>
</li>
<li><strong>Content/presentation adaptation</strong>: based on the analysis, the user’s updated profile dictates what content is presented next and how it is presented, e.g. different versions of text and video may be shown.</li>
</ul>
</li>
</ul>
<h3 id="methodologies">Methodologies</h3>
<p><strong>1. Data Collection</strong></p>
<ul>
<li>There are two basic modes of data collection, both with pros and cons:
<ul>
<li><strong>Use existing data</strong>: various open online course offerings from past years remain open even after the sessions end, and give access to the discussion forums etc. Similarly, the SLN data from Q&A portals is accessible through the respective websites. This data can easily be crawled and scraped by writing scripts that extract the information from the pages. The major drawbacks of this methodology are:
<ul>
<li>no opportunity to excite the state of SLN formation for subsequent data analysis</li>
<li>only data on open courses is available</li>
<li>public data is only accessible up to a certain measurement granularity</li>
</ul>
</li>
<li><strong>Generate new data</strong>: To overcome the drawbacks of the former method, one can collaborate with educators. Alternatively, a team could invest resources in creating a brand new online education platform that hosts courses for a number of instructors.</li>
</ul>
</li>
</ul>
<p><strong>2. Analysis</strong></p>
<ul>
<li>The approach varies widely with the research objective being tackled and generally involves large-scale machine learning methods.</li>
<li>Linear regression models can be used to determine which course properties are correlated with learner participation in the forums, which can be quantified using the number of posts that appeared each day for each course in the dataset.</li>
<li>Using a matrix of user-quiz pairs, predictors can be trained:
<ul>
<li>a baseline predictor, which solves a least squares optimization to minimize the prediction error in terms of student and quiz biases</li>
<li>a neighbourhood predictor, which extends the baseline to leverage student-student and quiz-quiz similarities</li>
</ul>
</li>
</ul>
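<p>The baseline predictor described above can be sketched as follows. This is only a minimal illustration, assuming the prediction takes the form overall mean + student bias + quiz bias and that no regularization is applied; the function names <code>fit_baseline</code> and <code>predict</code> and the toy score matrix are made up for this example and do not come from the survey.</p>

```python
import numpy as np


def fit_baseline(scores):
    """Fit score ~ mu + b_student + b_quiz by ordinary least squares.

    `scores` is a student-by-quiz matrix with np.nan marking missing entries.
    """
    n_students, n_quizzes = scores.shape
    rows, cols = np.nonzero(~np.isnan(scores))
    observed = scores[rows, cols]
    mu = observed.mean()

    # One-hot design matrix: one column per student bias, one per quiz bias.
    design = np.zeros((observed.size, n_students + n_quizzes))
    design[np.arange(observed.size), rows] = 1.0
    design[np.arange(observed.size), n_students + cols] = 1.0

    # lstsq returns the minimum-norm solution, which settles the ambiguity
    # of shifting a constant between the student and the quiz biases.
    biases, *_ = np.linalg.lstsq(design, observed - mu, rcond=None)
    return mu, biases[:n_students], biases[n_students:]


def predict(mu, b_student, b_quiz):
    """Predicted score for every student-quiz pair."""
    return mu + b_student[:, None] + b_quiz[None, :]


# Toy data (hypothetical): purely additive scores with one missing entry.
scores = np.array([[85.0, 75.0],
                   [75.0, 65.0],
                   [65.0, np.nan]])
mu, b_student, b_quiz = fit_baseline(scores)
predicted = predict(mu, b_student, b_quiz)
print(round(predicted[2, 1], 2))  # estimate for the missing entry
```

<p>A neighbourhood predictor would start from these residuals and add a correction based on similar students or quizzes; that step is left out here.</p>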
<h3 id="conclusions">Conclusions</h3>
<ul>
<li>SLN data from online interactions makes it possible to track information in the network, which can assist various objectives that lead to better learning outcomes.</li>
</ul>
<h2 id="references">REFERENCES:</h2>
<p><small><a href="https://ieeexplore.ieee.org/document/6814139" target="_blank">Social Learning Networks: A brief survey</a></small><br /></p>
Sun, 02 Jun 2019 00:00:00 +0000
https://notes.pairml.com/2019/06/02/social-learning-networks/
machine-learning, papers, sln
Breaking down Tesseract OCR
<h3 id="introduction">Introduction</h3>
<ul>
<li>Tesseract was originally an HP research project developed between 1984 and 1994. It was presented at the 1995 UNLV Annual Test of OCR Accuracy, where it performed beyond expectations.</li>
<li>The purpose of Tesseract was integration with HP’s flatbed scanners, with objectives such as compression, which was not possible with the commercial OCR solutions of the time, as they were struggling with accuracy.</li>
<li>During one phase of development, work concentrated on improving rejection efficiency rather than base-level accuracy.</li>
<li>Finally, in 2005, HP released Tesseract as an open-source project. It was hosted on Google Code until it moved to <a href="https://github.com/tesseract-ocr/tesseract" target="\_blank">GitHub</a> for open-source contribution.</li>
</ul>
<h3 id="architecture">Architecture</h3>
<ul>
<li>
<p>Because HP had its own proprietary layout analysis technology, Tesseract did not have a dedicated layout analyser of its own. As a result, Tesseract assumes its input to be a <strong>binary image with optional polygonal text regions defined</strong>.</p>
</li>
<li>
<p><strong>Connected Component Analysis</strong> is the first step in which the outlines of the components are stored. Outlines are gathered together, purely by nesting, into <strong>Blobs</strong>.</p>
</li>
<li>
<p>Blobs are organized into text lines, and the lines and regions are analyzed for <strong>fixed pitch</strong> or <strong>proportional text</strong>. The lines are broken into words differently depending on the kind of character spacing. Fixed pitch text is chopped immediately by character cells. Proportional text is broken into words using <strong>definite spaces or fuzzy spaces</strong>.</p>
</li>
<li>
<p><strong>Recognition</strong> proceeds as a <strong>two-pass process</strong>. During the <strong>first pass</strong>, an attempt is made to recognize each word. The words that are satisfactorily identified are passed to an <strong>adaptive classifier</strong> as training data. As a result, the adaptive classifier gets a chance to improve the results for text lower down on the page. Since the adaptive classifier may have learned something useful too late to help with the text near the top of the page, a <strong>second pass</strong> is performed, during which words that were not recognized well enough are classified again.</p>
</li>
<li>
<p><strong>Final Phase</strong> resolves the <strong>fuzzy spaces</strong>, and checks <strong>alternative hypotheses</strong> for the x-height to locate small-cap text.</p>
</li>
</ul>
<h3 id="line-and-word-finding">Line and Word Finding</h3>
<p><strong>1. Line Finding</strong></p>
<ul>
<li>
<p>The algorithm is designed so that a skewed page can be recognized without having to deskew it, thus preventing any loss of image quality.</p>
</li>
<li>
<p><strong>Blob filtering</strong> and <strong>line construction</strong> are key parts of this process.</p>
</li>
<li>
<p>Under the assumption that most blobs have uniform text size, a simple <strong>percentile height filter</strong> removes drop-caps and vertically touching characters and <strong>median height</strong> approximates the text size in the region.</p>
</li>
<li>
<p>Blobs smaller than a certain fraction of the median height are filtered out, being most likely punctuation, diacritical marks and noise.</p>
</li>
<li>
<p>The filtered blobs are more likely to fit a model of <strong>non-overlapping, parallel, but sloping lines</strong>. Sorting and processing the blobs by x-coordinates makes it possible to assign blobs to a unique text line, while tracking the slope across the page.</p>
</li>
<li>
<p>Once the lines are assigned, a <strong>least median of squares fit</strong> is used to estimate the baselines, and filtered-out blobs are fitted back into appropriate lines.</p>
</li>
<li>
<p>Final step merges blobs that overlap by at least half horizontally, putting diacritical marks together with the correct base and correctly associating parts of some broken characters.</p>
</li>
</ul>
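<p>The blob filtering step can be illustrated with a short sketch. It assumes that each blob is represented only by its bounding-box height; the function name <code>filter_blobs</code> and the two thresholds (the 95th percentile and half of the median height) are hypothetical values chosen for illustration, not the ones Tesseract uses.</p>

```python
import numpy as np


def filter_blobs(heights, small_fraction=0.5, large_percentile=95):
    """Return a keep-mask for blobs of typical text height, plus the median height.

    Both thresholds are illustrative assumptions.
    """
    heights = np.asarray(heights, dtype=float)

    # Percentile filter: discard unusually tall blobs such as drop-caps
    # or vertically touching characters.
    upper = np.percentile(heights, large_percentile)
    body = heights[heights <= upper]

    # The median of the remaining blobs approximates the text size.
    median_height = np.median(body)

    # Blobs much smaller than the median are likely punctuation,
    # diacritical marks or noise.
    keep = (heights <= upper) & (heights >= small_fraction * median_height)
    return keep, median_height


# Hypothetical blob heights in pixels: mostly body text, one drop-cap, three specks.
heights = [20, 21, 19, 20, 22, 60, 3, 20, 21, 2]
keep, median_height = filter_blobs(heights)
print(keep.tolist(), median_height)
```

<p>The blobs rejected here are not thrown away: as described above, they are fitted back into the lines once the baselines have been estimated.</p>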
<p><strong>2. Baseline Fitting</strong></p>
<ul>
<li>Using the text lines, baselines are fitted precisely using a <strong>quadratic spline</strong>, which allows Tesseract to handle pages with curved baselines.</li>
</ul>
<p><img src="/assets/2019-01-15-breaking-down-tesseract-ocr/fig-1-curved-fitted-baseline.png?raw=true" alt="Fig-1: Curved Fitted Baseline" /></p>
<ul>
<li>Baseline fitting is done by partitioning the blobs into groups with a reasonably continuous displacement from the original straight baseline. A quadratic spline is fitted to the most populous partition by a least squares fit.</li>
</ul>
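<p>As a simplified illustration of the least squares step, the sketch below fits a single quadratic polynomial to the bottom coordinates of the blobs on one line. This is a stand-in for the quadratic spline, which is made of several such segments; the helper name <code>fit_baseline_curve</code> and the sample coordinates are assumptions made for this example.</p>

```python
import numpy as np


def fit_baseline_curve(x, y):
    """Least squares fit of y = a*x^2 + b*x + c to blob bottom coordinates."""
    coefficients = np.polyfit(x, y, deg=2)
    return np.poly1d(coefficients)


# Hypothetical blob positions along a gently curved text line.
x = np.arange(0.0, 100.0, 10.0)
y = 0.001 * x**2 - 0.05 * x + 30.0

curve = fit_baseline_curve(x, y)
print(curve(45.0))  # baseline height at an x position between two blobs
```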
<p><strong>3. Fixed Pitch Detection and Chopping</strong></p>
<ul>
<li>Lines are tested to determine whether they are fixed pitch. Where it finds fixed pitch text, Tesseract chops the words into characters using pitch, and disables the <strong>chopper</strong> and <strong>associator</strong> on these words for the <strong>word recognition step</strong>.</li>
</ul>
<p><img src="/assets/2019-01-15-breaking-down-tesseract-ocr/fig-2-fixed-pitch-chopped-word.png?raw=true" alt="Fig-2: Fixed Pitch Chopped Word" /></p>
<p><strong>4. Proportional Word Finding</strong></p>
<ul>
<li>Detecting word boundaries in non-fixed-pitch (proportional) text is a highly non-trivial task.</li>
</ul>
<p><img src="/assets/2019-01-15-breaking-down-tesseract-ocr/fig-3-difficult-word-spacing.png?raw=true" alt="Fig-3: Difficult Word Spacing" /></p>
<ul>
<li>
<p>For example, the gap between the tens and units of ‘11.9%’ is similar in size to a general space, and is certainly larger than the kerned space between ‘erated’ and ‘junk’. Note also that there is no horizontal gap at all between the bounding boxes of ‘of’ and ‘financial’.</p>
</li>
<li>
<p>Tesseract solves most of these problems by measuring <strong>gaps in a limited vertical range between baseline and mean line</strong>. Spaces close to a threshold are made <strong>fuzzy</strong>, where the decisions are made after <strong>word recognition</strong>.</p>
</li>
</ul>
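<p>The idea of deferring borderline decisions can be sketched as a three-way classification of gaps. The function <code>classify_gap</code>, the threshold and the width of the fuzzy band are all hypothetical; the source only states that spaces close to a threshold are made fuzzy.</p>

```python
def classify_gap(gap, threshold, fuzz):
    """Label a gap as a definite space, a definite non-space, or fuzzy.

    `gap` is the width measured between baseline and mean line;
    `threshold` and `fuzz` are illustrative parameters.
    """
    if abs(gap - threshold) <= fuzz:
        # Too close to call: decide after word recognition.
        return "fuzzy"
    return "space" if gap > threshold else "no-space"


labels = [classify_gap(g, threshold=8.0, fuzz=1.5) for g in (12.0, 3.0, 8.5)]
print(labels)
```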
<h3 id="word-recognition">Word Recognition</h3>
<ul>
<li>
<p>A major part of any word recognition algorithm is to identify how a word should be segmented into characters.</p>
</li>
<li>
<p>The initial segmentation output from line finding is classified first. The remaining word recognition steps apply only to non-fixed-pitch text.</p>
</li>
</ul>
<p><strong>1. Chopping Joined Characters</strong></p>
<ul>
<li>
<p>When the result for a word is unsatisfactory, Tesseract attempts to improve it by chopping the blob with the worst confidence from the character classifier.</p>
</li>
<li>
<p><strong>Chop points</strong> are found from concave vertices of a polygonal approximation of the outline, and may have either another concave vertex or a line segment opposite them. It may take up to 3 pairs of chop points to successfully separate joined characters from the ASCII set.</p>
</li>
</ul>
<p><img src="/assets/2019-01-15-breaking-down-tesseract-ocr/fig-4-candidate-chop-points-and-chop.png?raw=true" alt="Fig-4: Candidate Chop Points and Chop" /></p>
<ul>
<li>Chops are executed in priority order. Any chop that fails to improve the confidence of the result is undone, but not completely discarded so that it can be re-used by the <strong>associator</strong> if needed.</li>
</ul>
<p><strong>2. Associating Broken Characters</strong></p>
<ul>
<li>
<p>After the potential chops have been exhausted, if the word is still not good enough, it is given to the <strong>associator</strong>, which makes a <strong>best-first search</strong> of the segmentation graph of the possible combinations of the maximally chopped blobs into candidate characters.</p>
</li>
<li>
<p>The search pulls candidate new states from a priority queue and evaluates them by classifying unclassified combinations of fragments.</p>
</li>
<li>
<p>The chop-then-associate method is arguably inefficient, but it has the benefit of avoiding the more complex data structures that would be required to maintain the full segmentation graph.</p>
</li>
</ul>
<p><img src="/assets/2019-01-15-breaking-down-tesseract-ocr/fig-5-broken-characters.png?raw=true" alt="Fig-5: Broken Characters recognized by Tesseract" /></p>
<ul>
<li>This ability of Tesseract to successfully classify broken characters gave it an edge over its contemporaries.</li>
</ul>
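<p>The associator’s search can be sketched as a best-first search over ways of merging adjacent fragments, driven by a priority queue. This is a minimal sketch, not Tesseract’s implementation: the function <code>best_first_segmentation</code>, the <code>rate</code> callback standing in for the character classifier, the <code>max_merge</code> limit and the toy ratings are all assumptions made for illustration. Lower ratings are better, and ratings are assumed to be non-negative.</p>

```python
import heapq


def best_first_segmentation(fragments, rate, max_merge=3):
    """Find the lowest-rated grouping of adjacent fragments into characters.

    `rate(group)` returns a non-negative distance rating for treating the
    tuple of fragments `group` as a single character.
    """
    n = len(fragments)
    # Each state: (rating so far, index of the next unused fragment, groups chosen).
    queue = [(0.0, 0, ())]
    expanded = set()
    while queue:
        total, start, groups = heapq.heappop(queue)
        if start == n:
            return total, list(groups)
        if start in expanded:
            continue  # a cheaper path to this position was already expanded
        expanded.add(start)
        for end in range(start + 1, min(start + max_merge, n) + 1):
            candidate = tuple(fragments[start:end])
            heapq.heappush(
                queue, (total + rate(candidate), end, groups + (candidate,))
            )
    return None


# Toy ratings (hypothetical): the fragments "r" and "n" together look like a clean "m".
toy_ratings = {("r", "n"): 0.2, ("r",): 0.5, ("n",): 0.5, ("i",): 0.1}


def rate(group):
    return toy_ratings.get(group, 1.0)


total, groups = best_first_segmentation(["r", "n", "i"], rate)
print(groups)
```

<p>In this sketch every candidate is rated as soon as it is generated; a real engine would cache classifier results so that the same combination of fragments is never classified twice.</p>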
<h3 id="static-character-classifier">Static Character Classifier</h3>
<p><strong>1. Features</strong></p>
<ul>
<li>
<p>Early versions of Tesseract used <strong>topological features</strong>, which are independent of font and size but are not robust to the issues found in real-life images.</p>
</li>
<li>
<p>Another idea for classification involved use of segments of the polygonal approximation as features, but this method is also not robust to damaged characters.</p>
</li>
</ul>
<p><img src="/assets/2019-01-15-breaking-down-tesseract-ocr/fig-6-poligonal-approximation-features.png?raw=true" alt="Fig-6: Differences in Polygonal Approximation for same character" /></p>
<ul>
<li>
<p>The solution to these problems lies in the fact that the features in the unknown need not be the same as the features in the training data. During training, <strong>segments of a polygonal approximation</strong> are used as features, but during recognition, features of a small, fixed length are extracted from the outline and matched many-to-one against the clustered prototype features of the training data.</p>
</li>
<li>
<p><strong>The process of matching small features against large prototypes copes easily with the recognition of damaged characters.</strong> Its main problem is that the computational cost of computing the distance between an unknown and a prototype is very high.</p>
</li>
</ul>
<p><strong>2. Classification</strong></p>
<ul>
<li>
<p>This stage proceeds as a <strong>two-step process</strong>. First step involves a <strong>class pruner</strong> that creates a shortlist of character classes that the unknown might match.</p>
</li>
<li>
<p>The classes shortlisted in step one are taken further to the next step, where the actual similarity is calculated from the feature bit vectors. <strong>Each prototype character class is represented by a logical sum-of-product expression with each term called a configuration</strong>.</p>
</li>
</ul>
<p><strong>3. Training Data</strong></p>
<p>The classifier is trained on a mere 20 samples of each of 94 characters from 8 fonts in a single size, but with 4 attributes (normal, bold, italic, bold italic), making a total of 20 × 94 × 8 × 4 = 60160 training samples.</p>
<h3 id="linguistic-analysis">Linguistic Analysis</h3>
<ul>
<li>
<p>Whenever the word recognition module considers a new segmentation, the linguistic model (called the <strong>permuter</strong>) chooses the best available word string in each of the following categories: <strong>Top frequent word, Top dictionary word, Top numeric word, Top UPPER case word, Top lower case word (with optional initial upper)</strong>. The final decision for a given segmentation is simply the word with the lowest total distance rating.</p>
</li>
<li>
<p>Since words from different segmentations may have different number of characters in them, it would be hard to compare these words directly (even if a classifier claims to produce probabilities, which Tesseract does not).</p>
</li>
<li>
<p>Tesseract instead produces two numbers to solve this issue, namely,</p>
<ul>
<li><strong>Confidence</strong> is minus the normalized distance from the prototype. It is a confidence in the sense that a greater number is better.</li>
<li><strong>Rating</strong> is the product of the normalized distance from the prototype and the total outline length of the unknown character.</li>
</ul>
</li>
</ul>
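<p>The two numbers can be written down directly from their definitions. The sketch below also shows why the rating is the useful one for comparing segmentations: ratings of the characters in a word can be summed, so words with different numbers of characters become comparable. The function names and the distances and outline lengths in the example are hypothetical.</p>

```python
def confidence(normalized_distance):
    """Minus the normalized distance from the prototype: greater is better."""
    return -normalized_distance


def rating(normalized_distance, outline_length):
    """Normalized distance multiplied by the outline length of the unknown."""
    return normalized_distance * outline_length


def word_rating(characters):
    """Total rating of a word given (normalized_distance, outline_length) pairs."""
    return sum(rating(distance, length) for distance, length in characters)


# Two hypothetical segmentations of the same ink, as "m" or as "rn".
as_one_character = [(0.30, 40)]
as_two_characters = [(0.25, 18), (0.35, 22)]

best = min((as_one_character, as_two_characters), key=word_rating)
print(len(best))  # number of characters in the lower-rated segmentation
```

<p>Because both segmentations cover the same outline, the total outline length is the same (40 in this toy case), which is what makes the summed ratings meaningful to compare.</p>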
<h3 id="adaptive-classifier">Adaptive Classifier</h3>
<ul>
<li>
<p>OCR engines benefit from the use of an adaptive classifier: because the static classifier has to be good at generalizing to any kind of font, its ability to discriminate between different characters, or between characters and non-characters, is weakened.</p>
</li>
<li>
<p>Tesseract has a more font-sensitive adaptive classifier that is trained on the output of the static classifier. Such a classifier is commonly used to obtain greater discrimination within each document, where the number of fonts is limited.</p>
</li>
<li>
<p>It uses the same features and classifier as the static classifier to train the adaptive classifier. <strong>The only significant difference between the two classifiers apart from the training data is that the adaptive classifier uses the isotropic baseline/x-height normalization, whereas the static classifier normalizes the characters by the centroid (first moment) for position, and second moments for anisotropic size normalization.</strong></p>
</li>
<li>
<p>The baseline normalization helps distinguish the upper and lower case characters and also improves immunity to noise specks.</p>
</li>
</ul>
<p><img src="/assets/2019-01-15-breaking-down-tesseract-ocr/fig-7-baseline-and-moment-normalized.png?raw=true" alt="Fig-7: Baseline and Moment Normalized letters" /></p>
<ul>
<li>The main benefit of character moment normalization is the <strong>removal of font aspect ratio</strong> and, to some degree, of <strong>font stroke width</strong>. It also makes recognition of subscripts and superscripts easier, but requires an additional feature to distinguish uppercase and lowercase characters.</li>
</ul>
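<p>The difference between the two normalizations can be sketched on a set of outline points. This is an illustrative approximation only: it treats the character as a plain array of (x, y) points, uses the mean for the first moment and the per-axis standard deviation for the second moments, and centres x on the mean in the baseline variant. The function names and the sample shapes are assumptions.</p>

```python
import numpy as np


def moment_normalize(points):
    """Anisotropic normalization: centre on the centroid, scale each axis separately."""
    points = np.asarray(points, dtype=float)
    centered = points - points.mean(axis=0)   # first moment fixes the position
    return centered / centered.std(axis=0)    # second moments fix the size per axis


def baseline_normalize(points, baseline_y, x_height):
    """Isotropic normalization: shift to the baseline, scale both axes by the x-height."""
    points = np.asarray(points, dtype=float)
    # Centring x on the mean is an assumption made for this sketch.
    origin = np.array([points[:, 0].mean(), baseline_y])
    return (points - origin) / x_height


# The same hypothetical shape in a narrow font and in a font three times wider.
narrow = np.array([[0, 0], [2, 0], [2, 6], [0, 6]], dtype=float)
wide = narrow * np.array([3.0, 1.0])

same_after_moments = np.allclose(moment_normalize(narrow), moment_normalize(wide))
same_after_baseline = np.allclose(
    baseline_normalize(narrow, baseline_y=0.0, x_height=3.0),
    baseline_normalize(wide, baseline_y=0.0, x_height=3.0),
)
print(same_after_moments, same_after_baseline)
```

<p>Moment normalization maps both shapes to the same points, which is the removal of aspect ratio described above, while the baseline/x-height normalization keeps them distinct and preserves the vertical position relative to the baseline.</p>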
<h1 id="references">REFERENCES</h1>
<p><small><a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/33418.pdf" target="\_blank">An Overview of the Tesseract OCR Engine</a></small></p>
Tue, 15 Jan 2019 00:00:00 +0000
https://notes.pairml.com/2019/01/15/breaking-down-tesseract-ocr/
vision, ocr, machine-learning, papers
Social Bias in Machine Learning
<h3 id="introduction">Introduction</h3>
<p>Discrimination, injustice and oppression are some of the dark words that have been an integral part of human history. While there is an active effort to make the world a fair place in every sphere of life, it is almost impossible to make the data recorded over the years fair to all castes, creeds, races and religions, because history is written in ink. Since the world was more biased the further back we go in time, a historical record of data will often reflect these biases in terms of minority and majority classes. This is the very same data that is continuously used to train most of our machine learning models, without any conscious thought being given to the fairness of the algorithm, i.e. whether or not the algorithm reflects the biases that prevailed back then. Recently, machine learning has been put to use in a lot of important decision-making pipelines, such as predicting the time of recidivism, college acceptance, loan approvals etc., and hence it becomes increasingly important to question the machine learning models being developed in terms of the implicit bias they might inherit from the data they train on. In order to do away with such biases in a machine learning algorithm, one needs to understand how exactly bias creeps in, which metrics can measure it, and which methods can remove such unfairness. This post is an attempt to summarize these issues and possible remedies.</p>
<h3 id="background">Background</h3>
<p>Since machine learning is now being used to make a lot of policy decisions that affect people’s lives on an everyday basis, care must be taken that unfairness does not become a part of such decision making. Training machine learning algorithms with the standard utility maximization and loss minimization objectives sometimes results in algorithms that behave in a way that a fair human observer would deem biased. A very recent example of such a case was <a href="https://www.ml.cmu.edu/news/news-archive/2018/october/amazon-scraps-secret-artificial-intelligence-recruiting-engine-that-showed-biases-against-women.html" target="\_blank">reported</a> at Amazon, which noticed a gender bias in its recruiting engine algorithms.</p>
<h3 id="its-all-in-the-data">It’s all in the Data</h3>
<p>One of the potential reasons for such biases in these algorithms is the training data itself. Since the algorithms are big numerical puzzles that are trained to recognize and mimic the statistical patterns of history, it is only natural for such a trained system to display biased characteristics. Even some of the state-of-the-art solutions in the fields of NLP and machine learning are not free from bias and unfairness. For example, it has been shown that word2vec embeddings learnt from huge corpora of text often show gender bias: the Euclidean distance between word vectors, which signifies the correlation between words, suggests a strong association of words like homemaker and nanny with she, and of maestro and boss with he. Any system built on top of such a word embedding is very likely to propagate this bias on a daily basis at some level.</p>
<p>One of the contested ways of dealing with this issue is to retrain the models continuously with new data, which relies on the assumption that historical bias is on a process of correcting itself.</p>
<p>Another major question that continuously arises is based on the fact that these machine learning algorithms work well when the amount of data they train on is huge. While this is true in an overall sense, breaking down the number of data points available for a minority class makes it apparent that the algorithm does not have enough supporting instances to learn as good a representation of the minority classes as it would of the majority, and hence could make unfair judgements because of the lack of data.</p>
<blockquote>
<p>There is general tendency for automated decisions to favor those who belong to statistically dominant groups.</p>
</blockquote>
<p>Statistical patterns that apply to the majority population might be invalid for the minority group. It can also happen that a variable that is positively correlated with the target in the general population is negatively correlated with the target in the minority group. For example, a real name might be a short common name in one culture and a long unique name in another. Hence the same rules for detecting fake names would not work across such groups.</p>
<p><img src="/assets/2018-10-09-algorithmic-fairness/fig-2-survival-distribution.png?raw=true" alt="Fig-1: Survival Distribution" /></p>
<p>Consider a very simple dataset from Kaggle called <a href="https://www.kaggle.com/c/titanic" target="\_blank">titanic</a>. This is a basic dataset where, based on a handful of given features, one has to <strong>predict the survival probability of an individual who was on the Titanic</strong>. The survival distribution on the training data shows that <strong>during the Titanic incident a female passenger had a much higher chance of surviving than a male passenger</strong>. It would be rather obvious <strong>to an algorithm trained on this data that being female is a strong indicator of survival</strong>. If the same algorithm were used to predict survival in an impending sinking incident, where candidates with a higher survival probability would be boarded on rescue boats first, it would be bound to make biased decisions.</p>
<p>It can also be seen that being male is negatively correlated with surviving while being female is positively correlated, because graph 2 in fig-1 shows that more males died than survived and, by contrast, more females survived than died. So <strong>if the algorithm were to learn only from the majority of the data, which belongs to males, it would predict badly for the female population</strong>.</p>
<h3 id="undeniable-complexities">Undeniable Complexities</h3>
<p>One way to counter the sample size disparity might be to learn different classifiers for different sub-groups. But this is not as simple as it sounds, because learning and testing for individual sub-groups might require acting on the protected attributes, which might in itself be objectionable. Also, the definition of a minority is fuzzy, as there could be many different overlapping minorities and no straightforward way of determining group membership.</p>
<h3 id="noise-vs-modeling-error">Noise vs Modeling Error</h3>
<p>Say a classifier achieves 95 percent accuracy. In a real-world scenario this 5 percent error rate would point to a really well-trained classifier. But what is often overlooked is that there might be two different kinds of underlying reasons behind the error rate. One could be the general case of noise that the classifier was not able to model and hence was not able to predict and account for. The other possible reason could be that while the model is 100 percent accurate on the majority class, it is only 50 percent accurate on the minority class. This systematic error on the minority class would be a clear case of algorithmic unfairness.</p>
<p>The bigger issue here is that there is no principled or textbook methodology for distinguishing noise from modeling errors. Such questions can only be answered with a great deal of domain knowledge and experience.</p>
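<p>The arithmetic behind the 95 percent example is worth spelling out. With only two groups, 100 percent accuracy on the majority and 50 percent on the minority, an overall accuracy of 95 percent corresponds to a minority share of 10 percent. The helper name and the group sizes below are illustrative.</p>

```python
def overall_accuracy(group_sizes, group_accuracies):
    """Overall accuracy as the size-weighted average of per-group accuracies."""
    total = sum(group_sizes)
    correct = sum(n * acc for n, acc in zip(group_sizes, group_accuracies))
    return correct / total


# 900 majority samples classified perfectly, 100 minority samples at coin-flip level.
accuracy = overall_accuracy([900, 100], [1.0, 0.5])
print(accuracy)
```

<p>The headline number alone cannot distinguish this situation from one where errors are spread evenly over both groups, which is why per-group metrics are needed.</p>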
<h3 id="edge-cases-always-exist">Edge Cases always exist</h3>
<p>It is also fair to assume that bias can creep into the algorithms in very unexpected ways, even if the training data is labelled correctly and is free of any issues that could be pointed out as biased. A recent <a href="https://www.theverge.com/2015/7/1/8880363/google-apologizes-photos-app-tags-two-black-people-gorillas" target="\_blank">example</a> of this is when Google Photos mistakenly labeled two black people as gorillas. Obviously, the machine was never trained with any data that should lead to such inferences, but because the number of trained parameters is so high, it often becomes intractable and unimaginably hard to understand why a system behaves haphazardly in certain conditions. This uncertainty of outcomes can also be a cause of bias in situations that could not be predicted in advance.</p>
<h3 id="what-is-fairness">What is Fairness?</h3>
<p>Fairness in classification involves studying algorithms not only from a perspective of accuracy, but also from a perspective of fairness.</p>
<blockquote>
<p>The most difficult part of this is to define what is fairness.</p>
</blockquote>
<p>Consideration for fairness often leads to a compromise on accuracy, but it is a necessary evil that is not going anywhere in the near future. What is often more surprising is that many of the fairness metrics have trade-offs among themselves.</p>
<h3 id="fairness-of-process-vs-fairness-of-outcome">Fairness of Process vs Fairness of Outcome</h3>
<ul>
<li>
<p>An <strong>aware</strong> algorithm is one that uses the information regarding the protected attribute (such as gender, ethnicity etc.) in the process of learning. An <strong>unaware</strong> algorithm will not.</p>
</li>
<li>
<p>While the motivation behind unaware algorithms is that being fair means disregarding the protected attribute, simply removing the protected attribute often does not work. Sometimes there is a strong correlation between the protected attribute and some other feature. So in order to train a truly unaware algorithm, one needs to remove the correlated feature group as well.</p>
</li>
<li>
<p>This process of manually engineering a feature list that conveys no information about the protected attribute can also be automated using machine learning techniques discussed in following sections.</p>
</li>
</ul>
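<p>A crude way to build such a feature list is to drop every feature whose correlation with the protected attribute exceeds a threshold. The sketch below does this with pandas on a tiny made-up table; the function name <code>unaware_features</code>, the threshold of 0.5 and the toy values are all assumptions, and Pearson correlation only catches linear dependence, so this is a starting point rather than a guarantee.</p>

```python
import pandas as pd


def unaware_features(df, protected, threshold=0.5):
    """Drop the protected column and every numeric column strongly correlated with it."""
    correlation = df.corr()[protected].abs()
    # The protected attribute correlates perfectly with itself, so it is dropped too.
    to_drop = correlation[correlation >= threshold].index.tolist()
    return df.drop(columns=to_drop)


# Made-up toy data: height acts as a proxy for sex, age does not.
toy = pd.DataFrame({
    "sex":    [0, 1, 0, 1, 0, 1],
    "height": [160, 180, 162, 178, 158, 182],
    "age":    [30, 31, 45, 44, 25, 26],
})

remaining = unaware_features(toy, protected="sex")
print(list(remaining.columns))
```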
<h3 id="are-unaware-algorithms-the-solution">Are Unaware Algorithms the Solution?</h3>
<ul>
<li>
<p>There could be inherent differences between the populations defined by these masked protected attributes, which would only render this process undesirable.</p>
</li>
<li>
<p>The aware approaches use these protected attributes and have a better chance of understanding the dependence of the outcome on them.</p>
</li>
<li>
<p>This can be seen as a distinction between <strong>fairness of process</strong> vs <strong>fairness of outcomes</strong>. The unaware algorithms ensure a fairness of process, because under such a scheme the algorithm does not use any of the protected attributes for decision making. However, such fairness in process does not guarantee a fair outcome towards the protected and un-protected sub-groups.</p>
</li>
<li>
<p>The aware approaches, on the contrary, use these protected attributes and hence are not a fair process, but they can reach an outcome that is more fair towards the minorities.</p>
</li>
</ul>
<h3 id="mathematical-fairness-statistical-parity">Mathematical Fairness: Statistical Parity</h3>
<p>A mathematical version of absolute fairness can be a statistical condition where the chances of success or failure are the same for both the majority and minority classes (or for more classes in multi-class scenarios). This can be written as,</p>
<script type="math/tex; mode=display">Pr[h(x) = 1 \vert x \in P^C] = Pr[h(x) = 1 \vert x \in P] \tag{1} \label{1}</script>
<p>The main drawback of such a criterion is captured by the question: <strong>does one really want to equalize the outcomes across all sub-groups?</strong> For example, predicting the success chances of a basketball player irrespective of height does not make for a very strong model, because discrimination in various domains does not fall in a black-or-white region but may lie somewhere in the gray region in between. Similarly, predicting the chances of childbirth without using features such as gender and age would make for a really poor algorithm. So, <strong>enforcing statistical parity is not always the solution</strong>.</p>
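<p>Equation (1) can be checked empirically by comparing the rate of positive predictions in the two groups. The sketch below computes the gap between the two sides of the equation; a gap of zero means statistical parity holds on the sample. The function name and the toy predictions are made up for illustration.</p>

```python
import numpy as np


def statistical_parity_difference(predictions, protected):
    """Pr[h(x) = 1 | x not in P] minus Pr[h(x) = 1 | x in P], estimated on a sample."""
    predictions = np.asarray(predictions)
    protected = np.asarray(protected, dtype=bool)
    rate_protected = predictions[protected].mean()
    rate_other = predictions[~protected].mean()
    return rate_other - rate_protected


# Toy predictions: the first four samples are outside P, the last four inside.
predictions = [1, 1, 0, 1, 0, 0, 1, 0]
protected = [0, 0, 0, 0, 1, 1, 1, 1]

gap = statistical_parity_difference(predictions, protected)
print(gap)
```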
<h3 id="cross-group-calibration">Cross-group Calibration</h3>
<ul>
<li>Instead of equalizing the outcomes themselves, one can look to equalize some other statistics of the algorithm’s performance, for example <strong>error rates across groups</strong>.</li>
</ul>
<blockquote>
<p>A fair algorithm would make as many mistakes on a minority group as it does on the majority group.</p>
</blockquote>
<p>A useful tool for such an analysis is the confusion matrix as shown below</p>
<p><img src="/assets/2018-10-09-algorithmic-fairness/fig-1-confusion-matrix.png?raw=true" alt="Fig-2: Confusion Matrix" /></p>
<p>Some of the metrics based on the confusion matrix are:</p>
<ul>
<li>
<p><strong>Treatment equality</strong> is achieved by a classifier that yields the same ratio of false negatives to false positives (c/b or b/c in the table) for both protected group categories.</p>
</li>
<li>
<p><strong>Conditional procedure accuracy equality</strong> is achieved when, conditioning on the known outcome, the classifier is equally accurate across protected group categories. This is equivalent to the false negative rate and the false positive rate being the same for all protected categories.</p>
</li>
</ul>
<p>Since the cells of a confusion matrix must add up to the total number of observations, many of these fairness metrics have a trade-off relationship. This is essentially a <strong>zero-sum game</strong>: one metric improves at the cost of another and there is no win-win situation. Depending on the use case, one has to decide which metrics to optimize for, as there is no blanket solution.</p>
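<p>Metrics of this kind are straightforward to compute once the predictions are split by group. The sketch below derives the false positive rate and false negative rate per group, which is what conditional procedure accuracy equality compares. The function name, the group labels and the toy labels are hypothetical, and each group is assumed to contain both outcomes so that no division by zero occurs.</p>

```python
import numpy as np


def group_error_rates(y_true, y_pred, group):
    """False positive and false negative rates for each value of `group`."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    rates = {}
    for g in np.unique(group):
        mask = group == g
        truth, predicted = y_true[mask], y_pred[mask]
        false_positives = np.sum((truth == 0) & (predicted == 1))
        false_negatives = np.sum((truth == 1) & (predicted == 0))
        rates[g.item()] = {
            # Assumes every group contains both outcomes.
            "fpr": float(false_positives / np.sum(truth == 0)),
            "fnr": float(false_negatives / np.sum(truth == 1)),
        }
    return rates


# Toy labels: the classifier is perfect on group "m" and poor on group "f".
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
group = ["m", "m", "m", "m", "f", "f", "f", "f"]

rates = group_error_rates(y_true, y_pred, group)
print(rates)
```

<p>Equal rates across groups would indicate conditional procedure accuracy equality; in the toy data the two groups differ on both rates.</p>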
<h3 id="example-titanic">Example: Titanic</h3>
<p><a href="https://www.kaggle.com/shamssam/algorithmic-fairness-in-ml" target="\_blank"><strong>Kaggle Notebook</strong></a></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="c1">### libraries
</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sn</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">accuracy_score</span><span class="p">,</span> <span class="n">classification_report</span><span class="p">,</span> <span class="n">confusion_matrix</span>
<span class="kn">from</span> <span class="nn">xgboost</span> <span class="kn">import</span> <span class="n">XGBClassifier</span>
<span class="n">df_train</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'train.csv'</span><span class="p">,</span> <span class="n">index_col</span><span class="o">=</span><span class="s">'PassengerId'</span><span class="p">)</span>
<span class="n">df_train</span><span class="o">.</span><span class="n">Sex</span> <span class="o">=</span> <span class="n">df_train</span><span class="o">.</span><span class="n">Sex</span> <span class="o">==</span> <span class="s">'female'</span>
<span class="n">df_train</span><span class="o">.</span><span class="n">drop</span><span class="p">([</span><span class="s">'Name'</span><span class="p">,</span> <span class="s">'Ticket'</span><span class="p">,</span> <span class="s">'Cabin'</span><span class="p">,</span> <span class="s">'Embarked'</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_valid</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">df_train</span><span class="p">,</span> <span class="n">test_size</span><span class="o">=</span><span class="mf">0.3</span><span class="p">,</span> <span class="n">stratify</span><span class="o">=</span><span class="n">df_train</span><span class="o">.</span><span class="n">Survived</span><span class="o">.</span><span class="n">tolist</span><span class="p">())</span>
<span class="c1"># aware classification
</span>
<span class="n">clf</span> <span class="o">=</span> <span class="n">XGBClassifier</span><span class="p">()</span>
<span class="n">clf</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="o">.</span><span class="n">drop</span><span class="p">([</span><span class="s">'Survived'</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span> <span class="n">X_train</span><span class="o">.</span><span class="n">Survived</span><span class="o">.</span><span class="n">tolist</span><span class="p">())</span>
<span class="k">print</span><span class="p">(</span><span class="s">"="</span><span class="o">*</span><span class="mi">40</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"OVERALL"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"="</span><span class="o">*</span><span class="mi">40</span><span class="p">)</span>
<span class="n">y_valid_hat</span> <span class="o">=</span> <span class="n">clf</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_valid</span><span class="o">.</span><span class="n">drop</span><span class="p">([</span><span class="s">'Survived'</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="n">classification_report</span><span class="p">(</span><span class="n">X_valid</span><span class="o">.</span><span class="n">Survived</span><span class="o">.</span><span class="n">tolist</span><span class="p">(),</span> <span class="n">y_valid_hat</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Accuracy: {}"</span><span class="o">.</span><span class="nb">format</span><span class="p">(</span><span class="n">accuracy_score</span><span class="p">(</span><span class="n">X_valid</span><span class="o">.</span><span class="n">Survived</span><span class="o">.</span><span class="n">tolist</span><span class="p">(),</span> <span class="n">y_valid_hat</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"="</span><span class="o">*</span><span class="mi">40</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"FEMALE"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"="</span><span class="o">*</span><span class="mi">40</span><span class="p">)</span>
<span class="n">y_valid_hat_female</span> <span class="o">=</span> <span class="n">clf</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_valid</span><span class="p">[</span><span class="n">X_valid</span><span class="o">.</span><span class="n">Sex</span> <span class="o">==</span> <span class="bp">True</span><span class="p">]</span><span class="o">.</span><span class="n">drop</span><span class="p">([</span><span class="s">'Survived'</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="n">classification_report</span><span class="p">(</span><span class="n">X_valid</span><span class="p">[</span><span class="n">X_valid</span><span class="o">.</span><span class="n">Sex</span> <span class="o">==</span> <span class="bp">True</span><span class="p">]</span><span class="o">.</span><span class="n">Survived</span><span class="o">.</span><span class="n">tolist</span><span class="p">(),</span> <span class="n">y_valid_hat_female</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Accuracy: {}"</span><span class="o">.</span><span class="nb">format</span><span class="p">(</span><span class="n">accuracy_score</span><span class="p">(</span><span class="n">X_valid</span><span class="p">[</span><span class="n">X_valid</span><span class="o">.</span><span class="n">Sex</span> <span class="o">==</span> <span class="bp">True</span><span class="p">]</span><span class="o">.</span><span class="n">Survived</span><span class="o">.</span><span class="n">tolist</span><span class="p">(),</span> <span class="n">y_valid_hat_female</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"="</span><span class="o">*</span><span class="mi">40</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"MALE"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"="</span><span class="o">*</span><span class="mi">40</span><span class="p">)</span>
<span class="n">y_valid_hat_male</span> <span class="o">=</span> <span class="n">clf</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_valid</span><span class="p">[</span><span class="n">X_valid</span><span class="o">.</span><span class="n">Sex</span> <span class="o">==</span> <span class="bp">False</span><span class="p">]</span><span class="o">.</span><span class="n">drop</span><span class="p">([</span><span class="s">'Survived'</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="n">classification_report</span><span class="p">(</span><span class="n">X_valid</span><span class="p">[</span><span class="n">X_valid</span><span class="o">.</span><span class="n">Sex</span> <span class="o">==</span> <span class="bp">False</span><span class="p">]</span><span class="o">.</span><span class="n">Survived</span><span class="o">.</span><span class="n">tolist</span><span class="p">(),</span> <span class="n">y_valid_hat_male</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Accuracy: {}"</span><span class="o">.</span><span class="nb">format</span><span class="p">(</span><span class="n">accuracy_score</span><span class="p">(</span><span class="n">X_valid</span><span class="p">[</span><span class="n">X_valid</span><span class="o">.</span><span class="n">Sex</span> <span class="o">==</span> <span class="bp">False</span><span class="p">]</span><span class="o">.</span><span class="n">Survived</span><span class="o">.</span><span class="n">tolist</span><span class="p">(),</span> <span class="n">y_valid_hat_male</span><span class="p">)))</span>
<span class="c1"># output
</span>
<span class="c1"># ========================================
# OVERALL
# ========================================
# precision recall f1-score support
</span>
<span class="c1"># 0 0.85 0.90 0.87 165
# 1 0.82 0.75 0.78 103
</span>
<span class="c1"># avg / total 0.84 0.84 0.84 268
</span>
<span class="c1"># Accuracy: 0.8395522388059702
# ========================================
# FEMALE
# ========================================
# precision recall f1-score support
</span>
<span class="c1"># 0 0.45 0.41 0.43 22
# 1 0.83 0.86 0.84 76
</span>
<span class="c1"># avg / total 0.75 0.76 0.75 98
</span>
<span class="c1"># Accuracy: 0.7551020408163265
# ========================================
# MALE
# ========================================
# precision recall f1-score support
</span>
<span class="c1"># 0 0.90 0.97 0.94 143
# 1 0.75 0.44 0.56 27
</span>
<span class="c1"># avg / total 0.88 0.89 0.88 170
</span>
<span class="c1"># Accuracy: 0.888235294117647
</span>
<span class="c1"># unaware classification
</span>
<span class="n">clf</span> <span class="o">=</span> <span class="n">XGBClassifier</span><span class="p">()</span>
<span class="n">clf</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="o">.</span><span class="n">drop</span><span class="p">([</span><span class="s">'Survived'</span><span class="p">,</span> <span class="s">'Sex'</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span> <span class="n">X_train</span><span class="o">.</span><span class="n">Survived</span><span class="o">.</span><span class="n">tolist</span><span class="p">())</span>
<span class="k">print</span><span class="p">(</span><span class="s">"="</span><span class="o">*</span><span class="mi">40</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"OVERALL"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"="</span><span class="o">*</span><span class="mi">40</span><span class="p">)</span>
<span class="n">y_valid_hat</span> <span class="o">=</span> <span class="n">clf</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_valid</span><span class="o">.</span><span class="n">drop</span><span class="p">([</span><span class="s">'Survived'</span><span class="p">,</span> <span class="s">'Sex'</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="n">classification_report</span><span class="p">(</span><span class="n">X_valid</span><span class="o">.</span><span class="n">Survived</span><span class="o">.</span><span class="n">tolist</span><span class="p">(),</span> <span class="n">y_valid_hat</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Accuracy: {}"</span><span class="o">.</span><span class="nb">format</span><span class="p">(</span><span class="n">accuracy_score</span><span class="p">(</span><span class="n">X_valid</span><span class="o">.</span><span class="n">Survived</span><span class="o">.</span><span class="n">tolist</span><span class="p">(),</span> <span class="n">y_valid_hat</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"="</span><span class="o">*</span><span class="mi">40</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"FEMALE"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"="</span><span class="o">*</span><span class="mi">40</span><span class="p">)</span>
<span class="n">y_valid_hat_female</span> <span class="o">=</span> <span class="n">clf</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_valid</span><span class="p">[</span><span class="n">X_valid</span><span class="o">.</span><span class="n">Sex</span> <span class="o">==</span> <span class="bp">True</span><span class="p">]</span><span class="o">.</span><span class="n">drop</span><span class="p">([</span><span class="s">'Survived'</span><span class="p">,</span> <span class="s">'Sex'</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="n">classification_report</span><span class="p">(</span><span class="n">X_valid</span><span class="p">[</span><span class="n">X_valid</span><span class="o">.</span><span class="n">Sex</span> <span class="o">==</span> <span class="bp">True</span><span class="p">]</span><span class="o">.</span><span class="n">Survived</span><span class="o">.</span><span class="n">tolist</span><span class="p">(),</span> <span class="n">y_valid_hat_female</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Accuracy: {}"</span><span class="o">.</span><span class="nb">format</span><span class="p">(</span><span class="n">accuracy_score</span><span class="p">(</span><span class="n">X_valid</span><span class="p">[</span><span class="n">X_valid</span><span class="o">.</span><span class="n">Sex</span> <span class="o">==</span> <span class="bp">True</span><span class="p">]</span><span class="o">.</span><span class="n">Survived</span><span class="o">.</span><span class="n">tolist</span><span class="p">(),</span> <span class="n">y_valid_hat_female</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"="</span><span class="o">*</span><span class="mi">40</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"MALE"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"="</span><span class="o">*</span><span class="mi">40</span><span class="p">)</span>
<span class="n">y_valid_hat_male</span> <span class="o">=</span> <span class="n">clf</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_valid</span><span class="p">[</span><span class="n">X_valid</span><span class="o">.</span><span class="n">Sex</span> <span class="o">==</span> <span class="bp">False</span><span class="p">]</span><span class="o">.</span><span class="n">drop</span><span class="p">([</span><span class="s">'Survived'</span><span class="p">,</span> <span class="s">'Sex'</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="n">classification_report</span><span class="p">(</span><span class="n">X_valid</span><span class="p">[</span><span class="n">X_valid</span><span class="o">.</span><span class="n">Sex</span> <span class="o">==</span> <span class="bp">False</span><span class="p">]</span><span class="o">.</span><span class="n">Survived</span><span class="o">.</span><span class="n">tolist</span><span class="p">(),</span> <span class="n">y_valid_hat_male</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Accuracy: {}"</span><span class="o">.</span><span class="nb">format</span><span class="p">(</span><span class="n">accuracy_score</span><span class="p">(</span><span class="n">X_valid</span><span class="p">[</span><span class="n">X_valid</span><span class="o">.</span><span class="n">Sex</span> <span class="o">==</span> <span class="bp">False</span><span class="p">]</span><span class="o">.</span><span class="n">Survived</span><span class="o">.</span><span class="n">tolist</span><span class="p">(),</span> <span class="n">y_valid_hat_male</span><span class="p">)))</span>
<span class="c1"># output
</span>
<span class="c1"># ========================================
# OVERALL
# ========================================
# precision recall f1-score support
</span>
<span class="c1"># 0 0.73 0.84 0.78 165
# 1 0.66 0.51 0.58 103
</span>
<span class="c1"># avg / total 0.71 0.71 0.70 268
</span>
<span class="c1"># Accuracy: 0.7126865671641791
# ========================================
# FEMALE
# ========================================
# precision recall f1-score support
</span>
<span class="c1"># 0 0.32 0.82 0.46 22
# 1 0.90 0.50 0.64 76
</span>
<span class="c1"># avg / total 0.77 0.57 0.60 98
</span>
<span class="c1"># Accuracy: 0.5714285714285714
# ========================================
# MALE
# ========================================
# precision recall f1-score support
</span>
<span class="c1"># 0 0.91 0.84 0.87 143
# 1 0.39 0.56 0.46 27
</span>
<span class="c1"># avg / total 0.83 0.79 0.81 170
</span>
<span class="c1"># Accuracy: 0.7941176470588235
</span></code></pre></div></div>
<p>Confusion matrices for the cases with and without awareness of the protected attribute (sex in this case) are shown below.</p>
<p><img src="/assets/2018-10-09-algorithmic-fairness/fig-3-aware-confusion-matrix.png?raw=true" alt="Fig-3: Aware Confusion Matrix" /></p>
<p><img src="/assets/2018-10-09-algorithmic-fairness/fig-4-unaware-confusion-matrix.png?raw=true" alt="Fig-4: Unaware Confusion Matrix" /></p>
<p>Note:</p>
<ul>
<li>Conditional accuracy in the code output shows that the system is very biased in both the aware and the unaware scenarios.</li>
<li>Treatment equality is more divergent in the aware case than in the unaware case.</li>
</ul>
<h2 id="references">REFERENCES:</h2>
<p><small><a href="https://towardsdatascience.com/a-gentle-introduction-to-the-discussion-on-algorithmic-fairness-740bbb469b6" target="_blank">A Gentle Introduction to the Discussion on Algorithmic Fairness
</a></small><br />
<small><a href="https://medium.com/@mrtz/how-big-data-is-unfair-9aa544d739de" target="_blank">How big data is unfair</a></small><br />
<small><a href="http://fairness-measures.org/" target="_blank">Fairness Measures</a></small><br /></p>
Tue, 09 Oct 2018 00:00:00 +0000
https://notes.pairml.com/2018/10/09/algorithmic-fairness/
https://notes.pairml.com/2018/10/09/algorithmic-fairness/machine-learningalgorithmic-fairnessIntroduction to Computer Architecture<h3 id="what-is-a-computer">What is a Computer?</h3>
<p>A computer is a general-purpose device that can be programmed to process information and yield meaningful results.</p>
<p>The three important takeaways are:</p>
<ul>
<li>programmable device</li>
<li>process information</li>
<li>yield meaningful results</li>
</ul>
<p>So the important parts for the working of a computer are:</p>
<ul>
<li>Program: a list of instructions given to computer</li>
<li>Information Store: the data it has to process</li>
<li>Computer: processes information into meaningful results.</li>
</ul>
<p>A fully functional computer includes at the very least:</p>
<ul>
<li>Processing Unit (CPU)</li>
<li>Memory</li>
<li>Hard disk</li>
</ul>
<p>Other than these, some input/output (I/O) devices can also be part of the system, such as:</p>
<ul>
<li>Keyboard: Input</li>
<li>Mouse: Input</li>
<li>Monitor: Output</li>
<li>Printer: Output</li>
</ul>
<h3 id="memory-vs-hard-disk">Memory vs Hard Disk</h3>
<ul>
<li>Storage Capacity: more on a hard disk, less in memory.</li>
<li>Volatility: data on a hard disk is non-volatile, while data in memory is volatile.</li>
<li>Speed: access and other operations are slower on a hard disk than in memory.</li>
</ul>
<h3 id="brain-vs-computer">Brain vs Computer</h3>
<ul>
<li>The brain is capable of doing a lot of abstract work that computers cannot be programmed to do.</li>
<li>The speed of basic calculations is much higher in a computer, which is its primary advantage.</li>
<li>Computers do not get tired or bored or disinterested.</li>
<li>Humans can understand complicated instructions in a variety of semantics and languages.</li>
</ul>
<h3 id="program">Program</h3>
<ul>
<li>Write the instructions in a high-level language like C, C++, Java etc. (done by a human)</li>
<li>Compile them into an executable (binary), i.e. convert them into machine code, the language computers understand. (done by compilers)</li>
<li>Execute the binary. (done by processor)</li>
</ul>
<h3 id="instruction-set-architecture-isa">Instruction Set Architecture (ISA)</h3>
<p>The semantics of all the instructions supported by a processor is known as instruction set architecture (ISA). This includes the semantics of the instructions themselves along with their operands and interfaces with the peripherals.</p>
<blockquote>
<p>ISA is an interface between software and hardware.</p>
</blockquote>
<p>Examples of instruction types in an ISA:</p>
<ul>
<li>arithmetic instructions</li>
<li>logical instructions</li>
<li>data transfer/movement instructions</li>
</ul>
<p>Features of ISA:</p>
<ul>
<li>Complete: it should be able to execute the programs a user wants to write</li>
<li>Concise: smaller set of instructions, currently they fall in the range 32-1000</li>
<li>Generic: instructions should not be too specialized for a given user or a given system.</li>
<li>Simple: instructions should not be complicated</li>
</ul>
<p>There are two different paradigms of designing an ISA:</p>
<ul>
<li>RISC: a Reduced Instruction Set Computer has a small set of simple and regular instructions, typically in the range 64 to 128, e.g. ARM, IBM PowerPC. Found in mobiles, tablets etc.</li>
<li>CISC: a Complex Instruction Set Computer implements complex instructions which are highly irregular and take multiple operands. The number of instructions is also large, typically 500+, e.g. Intel x86, VAX. Used in desktops and bigger computers.</li>
</ul>
<h3 id="completeness-of-isa">Completeness of ISA</h3>
<p><strong>How do we ensure the completeness of an ISA?</strong> Say there are two instructions, addition and subtraction. While it is possible to implement addition using subtraction (a + b = a - (0 - b)), the reverse is not possible. This basically means that <strong>in order to complete an ISA one needs a set of instructions such that no other instruction is more powerful than the set</strong>.</p>
<p><strong>How do we ensure that one has a complete instruction set such that one can write any program?</strong> The answer to this lies in finding a <strong>Universal ISA</strong>, which would in turn constitute a <strong>Universal Machine</strong> that can be used to write any program known to mankind (a Universal Machine has a set of basic actions, where each such action can be interpreted as an instruction).</p>
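<p>The identity a + b = a - (0 - b) used above can be checked with a tiny sketch in which subtraction is the only arithmetic operation allowed. The function names are made up for this illustration.</p>

```python
def sub(x, y):
    """The only arithmetic instruction this toy machine offers."""
    return x - y


def add_via_sub(a, b):
    """Addition built from subtraction alone: a + b = a - (0 - b)."""
    negated_b = sub(0, b)
    return sub(a, negated_b)


total = add_via_sub(3, 4)
print(total)
```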
<h3 id="turing-machine">Turing Machine</h3>
<p>Alan Turing, the father of computer science, proposed a theoretical device called the <strong>Turing machine</strong>, which is the most powerful machine known because, theoretically, it can compute the results of all the programs one can be interested in.</p>
<p>A Turing machine is a hypothetical machine which consists of an <strong>infinite tape consisting of cells</strong> extending in either direction, a <strong>tape head to maintain a pointer on the tape that can move left or right</strong>, a <strong>state cell that saves the current state</strong> of the machine, and an <strong>action table to write down the set of instructions</strong>. It is posed as a thesis (the <strong>Church-Turing Thesis</strong>, not a theorem), which has not been countered in the past 60 years, that</p>
<blockquote>
<p>Any real-world computation can be translated into an equivalent computation involving a Turing machine.</p>
</blockquote>
<p>Also,</p>
<blockquote>
<p>Any computer that is equivalent to a Turing machine is said to be Turing Complete.</p>
</blockquote>
<p>So the answer to <strong>Can we build a complete ISA</strong> lies in the question <strong>can we design a Universal Turing Machine (UTM) that can simulate any Turing machine</strong>, i.e. all one needs to do is to build a Turing machine (a seemingly simple architecture) that can implement other Turing machines (manage their tape, tape head, state cell and action table).</p>
<p>So, analogously speaking, current computers are an attempt to implement this universal Turing machine (UTM), where the <strong>generic action table of the UTM is implemented as the CPU</strong>, the <strong>simulated action table of the Turing machine to be implemented is the instruction memory</strong>, the <strong>working area of the UTM on the tape is the data memory</strong>, and the <strong>simulated state register of the implemented Turing machine is the program counter (PC)</strong>.</p>
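<p>To make the four parts (tape, tape head, state cell and action table) concrete, here is a minimal sketch of a Turing machine simulator. The function name, the <code>"_"</code> blank symbol, the <code>"halt"</code> state and the bit-inverting action table are all assumptions chosen for this example. The tape is only conceptually infinite: cells are created lazily as the head visits them.</p>

```python
from collections import defaultdict


def run_turing_machine(action_table, tape, state="start", blank="_", max_steps=1000):
    """Simulate a single-tape Turing machine and return the final tape contents.

    action_table maps (state, symbol) -> (symbol_to_write, move, next_state),
    where move is "L" or "R". The machine stops in the state "halt".
    """
    cells = defaultdict(lambda: blank, enumerate(tape))  # unbounded tape
    head = 0                                             # tape head position
    steps = 0
    while state != "halt":
        if steps >= max_steps:
            raise RuntimeError("machine did not halt within max_steps")
        write, move, state = action_table[(state, cells[head])]
        cells[head] = write
        head += 1 if move == "R" else -1
        steps += 1
    if not cells:
        return ""
    lo, hi = min(cells), max(cells)
    return "".join(cells[i] for i in range(lo, hi + 1)).strip(blank)


# Action table of a machine that inverts every bit and halts on the first blank.
INVERT_BITS = {
    ("start", "0"): ("1", "R", "start"),
    ("start", "1"): ("0", "R", "start"),
    ("start", "_"): ("_", "R", "halt"),
}

result = run_turing_machine(INVERT_BITS, "1011")
print(result)
```

<p>In the analogy above, <code>run_turing_machine</code> plays the role of the generic machinery (the CPU), while <code>INVERT_BITS</code> is the simulated action table (the instruction memory).</p>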
<h3 id="elements-of-computers">Elements of Computers</h3>
<ul>
<li>Memory (array of bytes), contains
<ul>
<li>program, which is a set of instructions</li>
<li>program data, i.e. variables, constants etc.</li>
</ul>
</li>
<li>Program Counter (PC)
<ul>
<li>points to an instruction in the program</li>
<li>after execution of one instruction it points to the next one</li>
<li>branch instructions make PC jump to another instruction (not in sequence)</li>
</ul>
</li>
<li>CPU contains
<ul>
<li>program counter</li>
<li>instruction execution unit</li>
</ul>
</li>
</ul>
<h3 id="single-instruction-isa">Single Instruction ISA</h3>
<ul>
<li>sbn - subtract and branch if negative</li>
</ul>
<p>This basically leads to the following pseudocode:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sbn(a, b, line_no):
    a = a - b
    if (a < 0):
        goto line_no
    else:
        goto next_statement
</code></pre></div></div>
<ul>
<li>Addition using SBN</li>
</ul>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>initialize
temp = 0
1: sbn temp, b, 2 \\ sets temp = -b, continues at 2 either way
2: sbn a, temp, exit \\ sets a = a - (-b) = a + b
exit: exit
</code></pre></div></div>
<ul>
<li>Add 1-10 using SBN</li>
</ul>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>initialize
one = 1
index = 10
sum = 0
1: sbn temp, temp, 2 \\ sets temp = 0
2: sbn temp, index, 3 \\ sets temp = -index
3: sbn sum, temp, 4 \\ sets sum += index
4: sbn index, one, exit \\ sets index -= 1
5: sbn temp, temp, 6 \\ sets temp = 0
6: sbn temp, one, 1 \\ the for loop, since 0 - 1 < 0
exit: exit
</code></pre></div></div>
<p>This is similar to writing <strong>assembly level programs</strong>, which are low level programs.</p>
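<p>The single-instruction programs above can be tried out with a small interpreter. This is only a sketch: the program is encoded as a list of <code>(a, b, target)</code> tuples, line labels 1 to 6 become list indices 0 to 5, and an index equal to the program length stands for <code>exit</code>. The names <code>run_sbn</code> and <code>EXIT</code> are assumptions for this example.</p>

```python
def run_sbn(program, memory, max_steps=10_000):
    """Run a subtract-and-branch-if-negative program.

    program: list of (a, b, target) tuples; each step does
             memory[a] -= memory[b] and jumps to target if the result is negative.
    memory:  dict mapping variable names to integers (modified in place).
    """
    pc = 0  # program counter: index of the instruction to execute next
    steps = 0
    while pc < len(program):
        if steps >= max_steps:
            raise RuntimeError("program did not reach exit within max_steps")
        a, b, target = program[pc]
        memory[a] -= memory[b]
        pc = target if memory[a] < 0 else pc + 1
        steps += 1
    return memory


# "Add 1-10 using SBN": labels 1..6 of the listing become indices 0..5.
EXIT = 6
sum_program = [
    ("temp", "temp", 1),    # 1: temp = 0
    ("temp", "index", 2),   # 2: temp = -index
    ("sum", "temp", 3),     # 3: sum += index
    ("index", "one", EXIT), # 4: index -= 1, exit once index is negative
    ("temp", "temp", 5),    # 5: temp = 0
    ("temp", "one", 0),     # 6: temp = -1, so always jump back to label 1
]
sum_memory = run_sbn(sum_program, {"one": 1, "index": 10, "sum": 0, "temp": 0})
print(sum_memory["sum"])

# "Addition using SBN": a = a + b in two instructions.
add_program = [("temp", "b", 1), ("a", "temp", 2)]
add_memory = run_sbn(add_program, {"a": 7, "b": 5, "temp": 0})
print(add_memory["a"])
```

<p>The loop body of <code>run_sbn</code> mirrors the description in the previous section: the program counter points to one instruction, moves to the next one after execution, and jumps out of sequence only when the branch condition holds.</p>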
<h3 id="mutliple-instruction-isas">Multiple Instruction ISAs</h3>
<p>They typically have:</p>
<ul>
<li>Arithmetic Instructions: Add, Subtract, Multiply, Divide</li>
<li>Logical Instructions: And, Or, Not</li>
<li>Move Instructions: Transfer between memory locations</li>
<li>Branch Instructions: Jump to new memory locations based on program instructions</li>
</ul>
<h3 id="design-of-practical-machines">Design of Practical Machines</h3>
<ul>
<li>While a Harvard Machine has separate data and instruction memories, a Von-Neumann Machine has a single memory that serves both purposes.</li>
<li>The problem with these machines is that they assume memory to be one large array of bytes. In practice such a memory is slow, because the speed of access decreases as the size of the structure increases. A possible solution lies in having a small set of named locations called <strong>registers</strong> that can be used by instructions. Because this set is small, it is much faster to access.</li>
</ul>
<p>So,</p>
<ul>
<li>CPU contains a set of registers which are named storage locations.</li>
<li>values are loaded from memory to registers.</li>
<li>arithmetic and logical instructions use registers for input</li>
<li>finally, data is stored back in the memory.</li>
</ul>
<p>Example program in machine language,</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>r1 = mem[b] \\ load b
r2 = mem[c] \\ load c
r3 = r1 + r2
mem[a] = r3
</code></pre></div></div>
<p>where</p>
<ul>
<li>r1, r2, r3 are registers</li>
<li>mem is the array of bytes representing memory</li>
</ul>
<p>As a result, modern-day computers are similar to Von-Neumann Machines with the addition of registers in the CPU.</p>
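<p>The load, compute and store pattern of the example program can be mimicked with a short sketch. The instruction encoding (<code>"load"</code>, <code>"add"</code>, <code>"store"</code> tuples) and the <code>execute</code> function are assumptions made for this illustration, not a real instruction set.</p>

```python
def execute(instructions, mem):
    """Run a tiny load/add/store program against a memory dict."""
    regs = {}  # named storage locations inside the CPU
    for op, *args in instructions:
        if op == "load":       # register = mem[address]
            reg, addr = args
            regs[reg] = mem[addr]
        elif op == "add":      # destination = source1 + source2 (registers only)
            dst, src1, src2 = args
            regs[dst] = regs[src1] + regs[src2]
        elif op == "store":    # mem[address] = register
            addr, reg = args
            mem[addr] = regs[reg]
        else:
            raise ValueError(f"unknown instruction: {op}")
    return mem


# Same steps as the example: a = b + c, going through registers r1, r2, r3.
program = [
    ("load", "r1", "b"),
    ("load", "r2", "c"),
    ("add", "r3", "r1", "r2"),
    ("store", "a", "r3"),
]
memory = execute(program, {"a": 0, "b": 2, "c": 3})
print(memory["a"])
```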
<h2 id="references">REFERENCES:</h2>
<p><small><a href="https://onlinecourses.nptel.ac.in/noc18_cs29/unit?unit=6&lesson=8" target="_blank">NPTEL: Introduction to Computer Architecture</a></small><br /></p>
Wed, 25 Jul 2018 00:00:00 +0000
https://notes.pairml.com/2018/07/25/introduction-to-computer-architecture/
https://notes.pairml.com/2018/07/25/introduction-to-computer-architecture/computer-sciencenptelnptel-computer-architectureIntroduction to Survival Analysis<h3 id="introduction">Introduction</h3>
<p>Survival analysis refers to the set of statistical methods used to analyze the length of time until an event of interest occurs. These methods have traditionally been used to analyse the survival times of patients, hence the name. But they are also useful in many other applications, including but not limited to the analysis of the time to recidivism, the failure of equipment etc. Hence, simply put, the phrase <strong>survival time</strong> refers to the type of variable of interest. It is often also referred to by names such as <strong>failure time</strong> and <strong>waiting time</strong>.</p>
<p>Such studies generally work with <strong>data leading up to an event of interest</strong> along with several other characteristics of individual data points that may be used to explain the survival times statistically.</p>
<p>The statistical problem (survival analysis) is to construct and estimate an appropriate model of the time of event occurrence. A survival model fulfills the following expectations:</p>
<ul>
<li>yield predictions of number of individuals who will fail (undergo the event of interest) at any length of time since the beginning of observation (or other decided point in time).</li>
<li>estimate the effect of observable individual characteristics on the survival time (to check the relevance of one variable holding constant all others).</li>
</ul>
<p>It is often observed that survival models such as the proportional hazards model are capable of <strong>explaining the survival times in terms of observed characteristics</strong>, which is better than straightforward statistical inferences such as <strong>rates of event occurrence without considering characteristic features</strong> of the data.</p>
<h3 id="basics">Basics</h3>
<p>Assume <strong>survival time T is a random variable</strong> following some distribution <strong>characterized by cumulative distribution function \(F(t, \theta)\)</strong>, where</p>
<ul>
<li>\(\theta\) is the set of <strong>parameters to be estimated</strong></li>
<li>\(F(t, \theta) = P(T \leq t) = \) <strong>probability that there is a failure</strong> at or before time \(t\), for any \(t \geq 0\)</li>
<li>\(F(t, \theta) \to 1\) as \(t \to \infty\), since \(F(t, \theta)\) is a <strong>cumulative distribution function</strong></li>
<li>The above tendency leads to an <strong>implicit assumption that all candidates would eventually fail</strong>. This assumption holds only in some settings (true for patient survival times, not true for time of repayment of loans) and hence needs to be relaxed where it does not hold.</li>
</ul>
<p><strong>Survival times are non-negative</strong> by definition and hence the distributions (like exponential, Weibull, gamma, lognormal etc.) characterising it are defined for value of time \(t\) from \(0\) to \(\infty\).</p>
<p>Let \(f(t, \theta)\) be the <strong>density function</strong> corresponding to the distribution function \(F(t, \theta)\), then the <strong>survival function</strong> is given by,</p>
<script type="math/tex; mode=display">S(t, \theta) = 1 - F(t, \theta) = P(T \gt t) \tag{1} \label{1}</script>
<p>which gives the <strong>probability of survival</strong> until time \(t\) (\(S(t, \theta) \to 0\) as \(t \to \infty\) because, \(F(t, \theta) \to 1\) as \(t \to \infty\)).</p>
<p>Another useful concept in survival analysis is called <strong>hazard rate</strong>, defined by,</p>
<script type="math/tex; mode=display">h(t, \theta) = \frac{f(t, \theta)} {1-F(t, \theta)} = \frac{f(t, \theta)} {S(t, \theta)} \tag{2} \label{2}</script>
<blockquote>
<p>Hazard rate represents the density of a failure at time \(t\), conditional on no failure prior to time \(t\), i.e., it roughly indicates the probability of failure in the next unit of time, given that no failure has occurred yet.</p>
</blockquote>
<p><strong>While \(f(t, \theta)\) roughly represents the proportion of original cohort that should be expected to fail between time \(t\) and \(t+1\), hazard rate \(h(t, \theta)\) represents the proportion of survivors until time \(t\) that should be expected to fail in the same time window, \(t\) to \(t+1\).</strong></p>
<p>The relationship between the cumulative distribution function and the hazard rate is given by,</p>
<script type="math/tex; mode=display">F(t, \theta) = 1 - \exp \left[ - \int_0^t h(x, \theta) dx \right] \tag{3} \label{3}</script>
<p>and</p>
<script type="math/tex; mode=display">h(t, \theta) = - \frac {d\,\ln\,[1 - F(t, \theta)]} {dt} \tag{4} \label{4}</script>
<p>The fact that \(F(t, \theta)\) is a cdf puts some restrictions on the hazard rate,</p>
<ul>
<li>the hazard rate is a non-negative function</li>
</ul>
<script type="math/tex; mode=display">H(t, \theta) = \int_0^t h(x, \theta) dx \tag{5} \label{5}</script>
<ul>
<li>the integrated hazard in \eqref{5} is finite for finite \(t\) and tends to \(\infty\) as \(t\) approaches \(\infty\).</li>
</ul>
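<p>The relationships in \eqref{2}, \eqref{3} and \eqref{5} can be checked numerically. The sketch below is only an illustration: it borrows the Weibull form that appears later in these notes as a convenient example distribution, and the parameter values and function names are made up for the demonstration.</p>

```python
import math

def weibull_density(t, theta, tau):
    """Weibull density, used here only as a convenient example distribution."""
    return tau * theta**tau * t ** (tau - 1) * math.exp(-((theta * t) ** tau))

def weibull_survival(t, theta, tau):
    """Survival function S(t) = 1 - F(t), as in equation (1)."""
    return math.exp(-((theta * t) ** tau))

def hazard(t, theta, tau):
    """Hazard rate h(t) = f(t) / S(t), as in equation (2)."""
    return weibull_density(t, theta, tau) / weibull_survival(t, theta, tau)

def integrated_hazard(t, theta, tau, steps=20000):
    """Midpoint-rule approximation of H(t), the integral in equation (5)."""
    width = t / steps
    return sum(hazard((k + 0.5) * width, theta, tau) for k in range(steps)) * width

# Illustrative parameter values (assumptions, not taken from the notes).
theta, tau, t = 0.5, 1.5, 2.0

cdf_direct = 1.0 - weibull_survival(t, theta, tau)
# Equation (3): F(t) = 1 - exp(-H(t)).
cdf_from_hazard = 1.0 - math.exp(-integrated_hazard(t, theta, tau))
```

<p>Up to the numerical integration error, the two values of the cumulative distribution function agree, which is what \eqref{3} states.</p>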
<h3 id="state-dependence">State Dependence</h3>
<ul>
<li>Positive state dependence or an increasing hazard rate \(dh(t)/dt \gt 0 \) indicates that the <strong>probability of failure during the next time unit increases</strong> as the length of time at risk increases.</li>
<li>Negative state dependence or a decreasing hazard rate \(dh(t)/dt \lt 0 \) indicates that the <strong>probability of failure in the next time unit decreases</strong> as the length of time at risk increases.</li>
<li>No state dependence indicates a <strong>constant hazard rate</strong>.</li>
</ul>
<blockquote>
<p>Only exponential distribution displays no state dependence.</p>
</blockquote>
<h3 id="censoring-and-truncation">Censoring and Truncation</h3>
<p>A common feature of data on survival times is that they are censored or truncated. Censoring and truncation are statistical terms that refer to the <strong>inability to observe the variable of interest for the entire population</strong>.</p>
<ul>
<li>A standard example is an individual shooting at a round target with a rifle, where the variable of interest is the distance by which the bullet misses the center of the target.</li>
<li>If all shots hit the target, this distance can be measured for all the shots and there is no problem of censoring or truncation.</li>
<li>If some shots miss the target, but we know the number of shots fired, <strong>the sample is censored</strong>. In this case either the distance of the shot from the center is known, or it is known that it was at least as large as the radius of the target.</li>
<li>Similarly, if one does not know how many shots were fired but only has information about the distance for shots that hit the target, <strong>the sample is truncated</strong>.</li>
</ul>
<blockquote>
<p>Censored sample has more information than a truncated sample.</p>
</blockquote>
<p>Survival times are often censored because not all candidates fail by the end of the period during which the data was collected. This <strong>censoring of data must be taken into account</strong> when estimating, because it is <strong>not legitimate to drop such observations</strong> with unobserved survival times <strong>or to set survival times for these observations equal to the length of the follow-up period</strong> (when the data was collected).</p>
<ul>
<li>Less frequently, a follow-up collection may yield information about a candidate who was not a part of the original population. In such cases the <strong>survival time is truncated</strong> because there is no prior information on the candidate or their survival time.</li>
</ul>
<h3 id="problem-of-estimation">Problem of Estimation</h3>
<p>The initial assumption specifies a cumulative distribution function \(F(t, \theta)\), or equivalently a density \(f(t, \theta)\) or hazard \(h(t, \theta)\), that is of a known form except that it depends on an unknown parameter \(\theta\). Estimating this parameter is the first step before the model can make any meaningful prediction about the survival time of a new candidate.</p>
<p>Consider a case of estimation of parameter for a censored sample which is defined as follows,</p>
<ul>
<li>sample has \(N\) individuals with follow-up periods \(T_1, T_2, \cdots, T_N\). These follow-ups may be all equal, but they usually are not.</li>
<li>\(n\) is number of individuals who fail, numbered \(1, 2, \cdots, n\) and individuals numbered \(n+1, n+2, \cdots, N\) are the non-failures.</li>
<li>for the candidates who fail, there exists a survival time \(t_i \leq T_i, \, i \in [1, n]\)</li>
<li>for the non-failures, survival time \(t_i\) is not observed but it is known that it is greater than the length of the follow-up period \(T_i\), \(i \in [n+1, N]\).</li>
</ul>
<p>If it is assumed that <strong>all the outcomes are independent</strong> of each other, the likelihood function of the sample is,</p>
<script type="math/tex; mode=display">L = \prod_{i=1}^n f(t_i, \theta) \prod_{i=n+1}^N S(T_i, \theta) \tag{6} \label{6}</script>
<blockquote>
<p>The likelihood function is a general statistical tool that expresses the probability of the observed outcomes in terms of the unknown parameters that are to be estimated, i.e., it is a function of the parameters to be estimated, which serves as a measure of how likely it is that the statistical model, with a given parameter value, would generate the given data.</p>
</blockquote>
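<p>As a minimal sketch, the logarithm of \eqref{6} can be written as a function that takes the density and survival function as arguments. The function name, the made-up data, and the choice of an exponential density \(\theta e^{-\theta t}\) for the example call are assumptions for illustration only.</p>

```python
import math

def censored_log_likelihood(failure_times, censoring_times, density, survival):
    """Log of equation (6): failures contribute f(t_i), non-failures S(T_i)."""
    log_lik = sum(math.log(density(t)) for t in failure_times)
    log_lik += sum(math.log(survival(T)) for T in censoring_times)
    return log_lik

# Made-up sample: three observed failures and two censored individuals.
theta = 0.2
failure_times = [1.0, 2.5, 4.0]
censoring_times = [5.0, 5.0]

value = censored_log_likelihood(
    failure_times,
    censoring_times,
    density=lambda t: theta * math.exp(-theta * t),   # assumed example density
    survival=lambda t: math.exp(-theta * t),          # matching survival function
)
```

<p>Working with the logarithm turns the products in \eqref{6} into sums, which is numerically more stable and does not change the maximizing value of \(\theta\).</p>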
<p>A commonly used estimator of \(\theta\) is the <strong>Maximum Likelihood Estimator (MLE)</strong>, which is defined as the value of \(\theta\) that maximizes the likelihood function.</p>
<p>MLEs have been shown to display the following desirable properties in large samples (<strong>as the sample size approaches infinity</strong>),</p>
<ul>
<li>Unbiased</li>
<li>Efficient</li>
<li>Normally Distributed</li>
</ul>
<p>As mentioned, the properties of MLE are only <strong>relevant when the sample size is large</strong>. It is often <strong>observed that the sample sizes in these studies are much smaller</strong> and hence reliance on large sample properties of estimator is more tenuous.</p>
<p>The likelihood in \eqref{6} uses the observed survival times \(t_i\). An alternative is to analyze only <strong>the fact of failure or non-failure, ignoring the specific timing of the observed failures</strong>. Such an analysis would properly be based on the likelihood function</p>
<script type="math/tex; mode=display">L = \prod_{i=1}^n F(T_i, \theta) \prod_{i=n+1}^N S(T_i, \theta) \tag{7} \label{7}</script>
<p>Estimation using \eqref{7} is a legitimate procedure and does not cause any bias or inconsistency, but the estimates are inefficient relative to the MLEs from \eqref{6}.</p>
<p><strong>The estimates of \(\theta\) obtained by maximizing \eqref{7} will be less efficient (have larger variance) than the estimates of \(\theta\) obtained by maximizing \eqref{6}, at least for large sample sizes. Hence if information on the time of failure is available, it should be used.</strong></p>
<p>If the <strong>truncated</strong> case of sample is considered, then there is <strong>no information on all the individuals who do not fail</strong>. Formally, one starts with a cohort of \(N\) candidates, where <strong>\(N\) is unknown</strong>, and the only <strong>observations available are the survival times \(t_i\) for the \(n\) individuals who fail</strong> before the end of follow-up period. The \(n\) individuals appear in sample because \(t_i \leq T_i\), and the appropriate density is therefore</p>
<script type="math/tex; mode=display">f(t_i, \theta \mid t_i \leq T_i) = \frac{f(t_i, \theta)}{P(t_i \leq T_i)} = \frac{f(t_i, \theta)}{F(T_i, \theta)} \tag{8} \label{8}</script>
<p>And the corresponding <strong>likelihood function</strong>, which can be maximized to obtain the MLEs, is given by,</p>
<script type="math/tex; mode=display">L = \prod_{i=1}^n f(t_i, \theta \mid t_i \leq T_i) = \prod_{i=1}^n \frac{f(t_i, \theta)}{F(T_i, \theta)} \tag{9} \label{9}</script>
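<p>To make \eqref{9} concrete, the sketch below assumes an exponential density \(f(t, \theta) = \theta e^{-\theta t}\), so that \(F(T, \theta) = 1 - e^{-\theta T}\), and maximizes the truncated log-likelihood with a simple golden-section search. The data, the search interval, and the helper names are hypothetical.</p>

```python
import math

def truncated_exponential_log_likelihood(theta, failure_times, follow_ups):
    """Log of equation (9) with the assumed density f(t) = theta * exp(-theta * t)."""
    log_lik = 0.0
    for t, T in zip(failure_times, follow_ups):
        # log F(T) = log(1 - exp(-theta * T)), computed stably with expm1.
        log_lik += math.log(theta) - theta * t - math.log(-math.expm1(-theta * T))
    return log_lik

def golden_section_maximize(func, low, high, tol=1e-8):
    """Maximize a function of one variable, assumed unimodal on [low, high]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = low, high
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if func(c) > func(d):
            b = d   # the maximum lies in [a, d]
        else:
            a = c   # the maximum lies in [c, b]
    return (a + b) / 2

# Made-up truncated sample: only failures are observed, all with follow-up 10.
failure_times = [0.5, 1.0, 1.5, 2.0, 3.0]
follow_ups = [10.0] * len(failure_times)

theta_hat = golden_section_maximize(
    lambda theta: truncated_exponential_log_likelihood(theta, failure_times, follow_ups),
    low=1e-6,
    high=5.0,
)
# For comparison: the estimate that wrongly treats the sample as complete.
naive_theta = len(failure_times) / sum(failure_times)
```

<p>In this example the truncation-aware estimate comes out slightly below the naive one, because the naive estimate ignores that slow failures could never enter the sample.</p>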
<h3 id="explanatory-variables">Explanatory Variables</h3>
<p>Information on <strong>explanatory variables may or may not be used</strong> in estimating survival time models. Some models that are based on the <strong>implicit assumption that distribution of survival time is the same for all individuals</strong>, do not use explanatory variables.</p>
<p><strong>In practice, however, some individuals are more prone to failing than others, and hence if information on individual characteristics and environmental variables is available, it should be used.</strong></p>
<p>This information can be incorporated in survival models by letting the parameter \(\theta\) depend on these individual characteristics and a new set of parameters. E.g. the exponential model depends on a single parameter, say \(\theta\), and <strong>\(\theta\) can be assumed to depend on the individual characteristics</strong> as in linear regression.</p>
<h3 id="non-parametric-hazard-rate-and-kaplan-meier">Non-Parametric Hazard Rate and Kaplan Meier</h3>
<p>Before beginning any formal analysis of the data, it is often instructive to check the hazard rate. For this purpose, the <strong>times until failure are rounded to the nearest quantized time unit</strong> (month, week, day etc.). It is then easy to count the <strong>number of candidates at risk at the beginning of each time period</strong> (i.e. the number of individuals who have not yet failed or been censored at the beginning of the time unit) and the <strong>number of individuals who fail during the time period</strong>.</p>
<p>Then, the <strong>non-parametric hazard rate</strong> can be estimated as the ratio of the number of failures during the time period to the number of individuals at risk at the beginning of the time period, i.e., if the number of individuals at risk at the beginning of time \(t \, (t = 1, 2, \cdots)\) is denoted by \(r_t\), and the number of individuals who fail during this time \(t\) is denoted by \(n_t\), then the estimated hazard for time \(t\), \(\hat{h}(t)\), is given by,</p>
<script type="math/tex; mode=display">\hat{h}(t) = \frac{n_t}{r_t} \tag{10} \label{10}</script>
<p>Such estimated hazard rates are prone to high variability. This high variability makes the purely non-parametric estimates unattractive, as they are less likely to give an accurate prediction on a new dataset. Parametric models such as the exponential, Weibull or lognormal smooth out this variability and make the model more tractable.</p>
<p><strong>But plots of the non-parametric estimates of the hazard rate provide a good initial guide as to which probability distribution may work well for a given use case.</strong></p>
<p>As noted earlier, the hazard function, density function, and distribution function are alternative but equivalent ways of characterizing the distribution of the time until failure. Hence, once the hazard rate is estimated, then implicitly so are the density and the distribution function. It is possible to solve explicitly for the estimated density or distribution function in terms of the estimated hazard function. The resulting estimator of the distribution function (called the <strong>Kaplan Meier</strong> or <strong>product limit</strong> estimator in statistical literature, and itself a non-parametric estimate) is given by,</p>
<script type="math/tex; mode=display">\hat{F}(t) = 1 - \prod_{j=1}^t [1 - \hat{h}(j)] \tag{11} \label{11}</script>
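<p>The counting procedure behind \eqref{10} and the product in \eqref{11} can be sketched as follows. The sample is made up, the function names are hypothetical, and the sketch assumes that an individual censored in period \(t\) still counts as at risk at the beginning of period \(t\).</p>

```python
def nonparametric_hazard(failure_periods, censoring_periods, horizon):
    """Return [h(1), ..., h(horizon)]: failures in period t over those at risk."""
    hazards = []
    for t in range(1, horizon + 1):
        # At risk at the start of period t: not yet failed and not yet censored.
        at_risk = sum(1 for s in failure_periods if s >= t)
        at_risk += sum(1 for c in censoring_periods if c >= t)
        failed = sum(1 for s in failure_periods if s == t)
        hazards.append(failed / at_risk if at_risk else 0.0)
    return hazards

def kaplan_meier_cdf(hazards):
    """Return [F(1), ..., F(horizon)] using the product in equation (11)."""
    survival = 1.0
    cdf = []
    for h in hazards:
        survival *= 1.0 - h
        cdf.append(1.0 - survival)
    return cdf

# Made-up sample: periods (already rounded) in which individuals failed
# or in which their follow-up ended without a failure.
failure_periods = [1, 2, 2, 4]
censoring_periods = [3, 5, 5]

hazards = nonparametric_hazard(failure_periods, censoring_periods, horizon=5)
cdf = kaplan_meier_cdf(hazards)
```

<p>With only seven individuals the estimated hazards jump between zero and one third from one period to the next, which illustrates the high variability mentioned above.</p>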
<h3 id="models-without-explanatory-variables">Models without Explanatory Variables</h3>
<p>There are various models that do not consider the explanatory variables, and instead <strong>assume some specific distribution</strong> such as exponential, Weibull, or lognormal for the length of time until failure. Essentially, the distribution of time until failure is known, except for some <strong>unknown parameters that have to be estimated</strong>. Hence, models of this type are called parametric models, which are different from the models discussed before as the latter have no associated parameters or distribution.</p>
<p>The unknown parameters are <strong>estimated by maximizing the likelihood function</strong> of the form \eqref{6}.</p>
<blockquote>
<p>Except in the case of the exponential distribution, MLEs cannot be written in closed form (i.e. expressed algebraically), and so the maximization of the likelihood function is done numerically.</p>
</blockquote>
<p>Once the characteristic parameters have been estimated, one can determine the following (which cannot be determined in case of non-parametric estimates like Kaplan Meier):</p>
<ul>
<li><strong>mean time</strong> until failures</li>
<li><strong>proportion of population that should be expected to fail</strong> within any arbitrary period of time.</li>
</ul>
<p>While the <strong>advantage of such models lies in the smoothness of predictions</strong>, the <strong>disadvantage is that the assumed distribution can be wrong and in turn lead to statements that are systematically misleading</strong>.</p>
<p><strong>Exponential Distribution</strong></p>
<p>The exponential distribution has density,</p>
<script type="math/tex; mode=display">f(t) = \theta \, e^{-\theta t} \tag{12} \label{12}</script>
<p>and <strong>survivor function</strong>,</p>
<script type="math/tex; mode=display">S(t) = e^{-\theta t} \tag{13} \label{13}</script>
<p>where</p>
<ul>
<li>the parameter is constrained, \(\theta \gt 0\)</li>
<li>mean: \(1 / \theta\) and variance: \(1 / \theta^2\)</li>
<li>only distribution with a <strong>constant hazard rate</strong>, specifically \(h(t) = \theta\) for all \(t \geq 0\)</li>
<li>such hazard rates are generally seen in some physical processes such as radioactive decay.</li>
<li>it is often not the most reasonable distribution for survival models.</li>
<li>exponential distribution requires estimation of single parameter \(\theta\).</li>
</ul>
<p>Consider a sample of \(N\) individuals, of which \(n\) have failed before the end of the follow-up period. Let the observed failure times be denoted by \(t_i\, (i=1, 2, \cdots, n)\) and the censoring times (length of follow-up) for the non-failures be denoted by \(T_i\, (i = n+1, \cdots, N)\). Then the likelihood function \eqref{6} can be written as</p>
<script type="math/tex; mode=display">L = \prod_{i=1}^n \theta\,e^{-\theta t_i} \prod_{i=n+1}^N e^{-\theta T_i} \tag{14} \label{14}</script>
<p>Maximizing \eqref{14} w.r.t. \(\theta\) yields MLE in closed form:</p>
<script type="math/tex; mode=display">\hat{\theta} = \frac {n} {\sum_{i=1}^n t_i + \sum_{i=n+1}^N T_i} \tag{15} \label{15}</script>
<p>For large samples \(\hat{\theta}\) is normal with mean \(\theta\) and variance</p>
<script type="math/tex; mode=display">\frac{\theta^2}{\sum_{i=1}^N [1 - exp(-\theta T_i)]} \tag{16} \label{16}</script>
<p>which for large \(N\) is adequately approximated by \(\theta^2/n\).</p>
<ul>
<li>Exponential distribution is highly skewed.</li>
<li>Mean may not be a good measure of central tendency for exponential distribution.</li>
<li>The median may be a preferable indicator in most cases.</li>
</ul>
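<p>The closed-form estimate in \eqref{15} is simply the number of failures divided by the total time at risk. The sketch below uses made-up data and hypothetical function names; it also computes the large-sample variance approximation \(\theta^2/n\) and the median survival time \(\ln 2 / \theta\) of the fitted exponential distribution.</p>

```python
import math

def exponential_mle(failure_times, censoring_times):
    """Closed-form MLE of equation (15): failures divided by total time at risk."""
    return len(failure_times) / (sum(failure_times) + sum(censoring_times))

def exponential_log_likelihood(theta, failure_times, censoring_times):
    """Log of equation (14)."""
    total_time = sum(failure_times) + sum(censoring_times)
    return len(failure_times) * math.log(theta) - theta * total_time

# Made-up sample: three failures and two individuals censored at time 10.
failure_times = [2.0, 3.0, 5.0]
censoring_times = [10.0, 10.0]

theta_hat = exponential_mle(failure_times, censoring_times)    # 3 / 30
approx_variance = theta_hat**2 / len(failure_times)            # theta^2 / n
mean_survival = 1.0 / theta_hat
median_survival = math.log(2) / theta_hat
```

<p>Because of the skewness noted above, the median survival time of the fitted distribution is noticeably smaller than its mean.</p>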
<blockquote>
<p>The logarithm of the likelihood, or log-likelihood, is used as a measure of goodness of fit. A higher value (more positive or less negative) indicates that the model fits the data better.</p>
</blockquote>
<p><strong>Weibull Distribution</strong></p>
<p>In statistical literature, a very common alternative to the exponential distribution is the Weibull distribution. It is a generalization of the exponential distribution. By using Weibull distribution one can test to check if a simpler exponential model is more appropriate.</p>
<ul>
<li>A variable \(T\) has Weibull distribution if \(T^{\tau}\) has an exponential distribution for some value of \(\tau\).</li>
<li>increasing hazard rate if \(\tau \gt 1\) and decreasing hazard rate if \(\tau \lt 1\). Also, if \(\tau = 1\) the hazard rate is constant and the Weibull distribution reduces to the exponential.</li>
<li><strong>Weibull distribution has a monotonic hazard rate</strong>, i.e it can be increasing, constant or decreasing but it cannot be increasing at first and then decreasing after some point.</li>
</ul>
<p>The density of Weibull distribution is given by,</p>
<script type="math/tex; mode=display">f(t) = \tau \theta^{\tau} \, t^{\tau -1} e^{-(\theta t)^\tau} \tag{17} \label{17}</script>
<p>and the survivor function is,</p>
<script type="math/tex; mode=display">S(t) = e^{-(\theta t)^\tau} \tag{18} \label{18}</script>
<p>The likelihood function for the Weibull distribution can be derived by substituting \eqref{17} and \eqref{18} in \eqref{6}.</p>
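<p>A minimal sketch of that substitution, in logarithmic form, is shown below together with the Weibull hazard \(h(t) = f(t)/S(t) = \tau \theta^{\tau} t^{\tau - 1}\) that follows from \eqref{17} and \eqref{18}. The function names and the example values are illustrative assumptions.</p>

```python
import math

def weibull_hazard(t, theta, tau):
    """h(t) = f(t) / S(t) = tau * theta**tau * t**(tau - 1)."""
    return tau * theta**tau * t ** (tau - 1)

def weibull_log_likelihood(theta, tau, failure_times, censoring_times):
    """Log of equation (6) with the density (17) and survivor function (18)."""
    log_lik = 0.0
    for t in failure_times:
        # log f(t) = log(tau) + tau*log(theta) + (tau - 1)*log(t) - (theta*t)**tau
        log_lik += math.log(tau) + tau * math.log(theta) + (tau - 1) * math.log(t)
        log_lik -= (theta * t) ** tau
    for T in censoring_times:
        # log S(T) = -(theta*T)**tau
        log_lik -= (theta * T) ** tau
    return log_lik

# Made-up sample used only to exercise the functions.
failure_times = [1.0, 2.0]
censoring_times = [3.0]
exponential_case = weibull_log_likelihood(0.5, 1.0, failure_times, censoring_times)
```

<p>Setting \(\tau = 1\) makes the extra terms vanish, so the value coincides with the exponential log-likelihood, and the hazard becomes the constant \(\theta\).</p>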
<p><strong>Lognormal Distribution</strong></p>
<p>If \(z\) is distributed as \(N(\mu, \sigma^2)\), then \(y = e^z\) has a lognormal distribution with mean</p>
<script type="math/tex; mode=display">\phi = exp(\mu + {1 \over 2} \sigma^2) \tag{19} \label{19}</script>
<p>and variance,</p>
<script type="math/tex; mode=display">\tau^2 = exp(2 \mu + \sigma^2) [exp(\sigma^2) -1] = \phi^2 \psi^2 \tag{20} \label{20}</script>
<p>where</p>
<script type="math/tex; mode=display">\psi^2 = exp(\sigma^2) - 1 \tag{21} \label{21}</script>
<p>The <strong>density</strong> of \(z = ln \, y\) is the density of \(N(\mu, \sigma^2)\) given by,</p>
<script type="math/tex; mode=display">f(\ln y) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp \left[ -\frac{1}{2 \sigma^2} (\ln y - \mu)^2 \right] \tag{22} \label{22}</script>
<p>Generally there is <strong>no advantage to working with the density of \(y\) itself, rather than \(\ln y\)</strong>. Thus, one can simply assume that the log of survival time is distributed normally, and hence the logarithm of the likelihood function \eqref{6} becomes</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\ln L = &- {n \over 2} \ln(2\pi) - {n \over 2} \ln(\sigma^2) - {1 \over 2\sigma^2} \sum_{i=1}^n (\ln t_i - \mu)^2 \\
&+ \sum_{i=n+1}^N \ln F \left[ \frac {\mu - \ln T_i} {\sigma} \right]
\end{align}
\tag{23} \label{23} %]]></script>
<ul>
<li>where <strong>\(F\) is the cumulative distribution function</strong> for \(N(0, 1)\) distribution.</li>
<li><strong>No analytical solution</strong> exists for the maximization of \eqref{23} w.r.t. \(\mu\), and \(\sigma^2\), so it <strong>must be maximized numerically</strong>.</li>
<li>the hazard function for lognormal distribution is complicated; it <strong>increases first and then decreases</strong>.</li>
</ul>
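<p>The expression in \eqref{23} can be evaluated with the standard library alone, since the standard normal cumulative distribution function can be written in terms of the error function. The sketch below is an illustration with made-up inputs and hypothetical function names; it also evaluates the lognormal hazard at a few points to show the rise-then-fall shape.</p>

```python
import math

def standard_normal_cdf(z):
    """Cumulative distribution function of N(0, 1), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def lognormal_log_likelihood(mu, sigma, failure_times, censoring_times):
    """The expression in equation (23)."""
    n = len(failure_times)
    log_lik = -0.5 * n * math.log(2 * math.pi) - 0.5 * n * math.log(sigma**2)
    log_lik -= sum((math.log(t) - mu) ** 2 for t in failure_times) / (2 * sigma**2)
    log_lik += sum(
        math.log(standard_normal_cdf((mu - math.log(T)) / sigma))
        for T in censoring_times
    )
    return log_lik

def lognormal_hazard(t, mu, sigma):
    """Hazard f(t) / S(t) of a lognormal survival time t."""
    z = (math.log(t) - mu) / sigma
    density = math.exp(-0.5 * z * z) / (math.sqrt(2 * math.pi) * sigma * t)
    return density / standard_normal_cdf(-z)

# Made-up inputs: log failure times 0, 1, 2 and one censoring time with log 1.
failure_times = [1.0, math.e, math.e**2]
censoring_times = [math.e]
value = lognormal_log_likelihood(1.0, 1.0, failure_times, censoring_times)

early, middle, late = (lognormal_hazard(t, 0.0, 1.0) for t in (0.1, 1.0, 10.0))
```

<p>In practice the function would be handed to a numerical optimizer over \(\mu\) and \(\sigma\), since no closed-form maximizer exists.</p>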
<p><strong>Other distributions</strong></p>
<p>Although the exponential, Weibull and lognormal are the three most commonly used distributions, there are various other well-known probability distributions that can be used, such as</p>
<ul>
<li>log-logistic</li>
<li>Laguerre</li>
<li>distributions based on Box-Cox power transformation of the normal</li>
</ul>
<p>There are various ways of measuring how well models fit the data:</p>
<ul>
<li>value of likelihood (or log-likelihood) function</li>
<li>maximum difference between the fitted value and actual cumulative distribution function</li>
<li>standard Kolmogorov-Smirnov test of goodness of fit</li>
<li>chi-square goodness-of-fit statistic based on predicted and actual failure times.</li>
</ul>
<p>Over time it has been observed that even though some of these parametric distributions <strong>might fit the data</strong> better than others and excel on various metrics of goodness of fit, they <strong>do not give any explanation of the reasons governing the distribution</strong> or any <strong>insight into the factors</strong> that lead to the different survival times in a population. Hence, these parametric models without explanatory variables are not considered to be an effective tool for analysis.</p>
<h3 id="models-with-explanatory-variables">Models with Explanatory Variables</h3>
<ul>
<li>
<p>Explanatory variables are in general added to survival models in an attempt to make more accurate predictions: practical experience over time corroborates the fact that individual characteristics, previous experiences and environmental setup help predict whether or not a person will fail.</p>
</li>
<li>
<p>An analysis of survival time without using the explanatory variables amounts to an analysis of its <strong>marginal distribution</strong>, whereas an analysis using explanatory variable amounts to an analysis of the <strong>distribution of survival time conditional on these variables</strong>.</p>
</li>
</ul>
<blockquote>
<p>The variance of the conditional distribution is, on average, no larger than the variance of the marginal distribution, i.e. one can expect more precise predictions from the former.</p>
</blockquote>
<ul>
<li>
<p>Another more fundamental reason may include the interest of understanding the effect of explanatory variables on the survival time.</p>
</li>
<li>
<p>More generally, these variables might be the demographics or environmental characteristics.</p>
</li>
</ul>
<h3 id="proportional-hazards-model">Proportional Hazards Model</h3>
<ul>
<li>
<p>The proportional hazards model allows one to estimate the effects of individual characteristics on survival time without having to assume a specific parametric form for the distribution of time until failure.</p>
</li>
<li>
<p>For an individual with the vector of characteristics, \(x\), the proportional hazards model assumes a hazard rate of the form,</p>
</li>
</ul>
<script type="math/tex; mode=display">h(t \mid x) = h_0(t) e^{x^\prime \beta} \tag{24} \label{24}</script>
<p>where \(h_0(t)\) is a completely arbitrary and unspecified baseline hazard function. <strong>Thus, the model assumes that the hazard functions of all individuals differ only by a factor of proportionality,</strong> i.e. if an individual’s hazard rate is 10 times higher than another’s at a given point of time, then it must be 10 times higher at all points in time. <strong>Each hazard function follows the same pattern over time.</strong></p>
<p>However, there is no restriction on what this pattern can be, i.e. it puts no restriction on the \(h_0(t)\) curve, which determines the shape of \(h(t \vert x)\) curve. <strong>\(\beta\) can be estimated without specifying \(h_0(t)\), and \(h_0(t)\) can be estimated non-parametrically and thus with flexibility.</strong></p>
<p>Consider a sample of \(N\) individuals, \(n\) of whom fail before the end of their follow-up period. Let the observations be ordered such that individual 1 has the shortest failure time, individual 2 has the second shortest failure time, and so forth. Thus, for individual \(i\), failure time \(t_i\) is observed, with,</p>
<script type="math/tex; mode=display">t_1 \lt t_2 \lt \cdots \lt t_n \tag{25} \label{25}</script>
<p>A vector \(x_i\) represents individual characteristics for each individual \(i = 1, 2, \cdots, N\), irrespective of whether they failed.</p>
<p>For each observed failure time \(t_i\), \(R(t_i)\) is defined as the set of all individuals who were at risk just prior to time \(t_i\), i.e., it includes the individuals with failure times greater than or equal to \(t_i\), as well as the non-failures whose follow-up is at least of length \(t_i\).</p>
<p>Using these definitions, the <strong>partial-likelihood</strong> function proposed by Cox can be built up as follows. For any failure time \(t_i\), the probability that it is individual \(i\) who fails, given that exactly one individual from the set \(R(t_i)\) fails, is given by,</p>
<script type="math/tex; mode=display">\frac {h(t_i \vert x_i)} {\sum_{j \in R(t_i)} h(t_i \vert x_j)} = \frac {exp(x_i^\prime \beta)} {\sum_{j \in R(t_i)} exp(x_j^\prime \beta)} \tag{26} \label{26}</script>
<p>The partial-likelihood function is formed by multiplying \eqref{26} over all \(n\) failure times,</p>
<script type="math/tex; mode=display">L = \prod_{i=1}^n \frac {exp(x_i^\prime \beta)} {\sum_{j \in R(t_i)} exp(x_j^\prime \beta)} \tag{27} \label{27}</script>
<p>The estimate of \(\beta\) obtained by maximizing \eqref{27} numerically w.r.t. \(\beta\) is the <strong>partial maximum-likelihood estimate</strong>. The word <strong>partial</strong> in partial likelihood refers to the fact that not all available information is used in estimating \(\beta\), i.e., it only depends on knowing which individuals were at risk when each observed failure occurred. The exact numerical values of the failure times \(t_i\) or of the censoring times for the non-failures are not needed; only their <strong>order matters</strong>.</p>
<p>Once \(\beta\) is estimated, \(h_0(t)\), the baseline hazard function can be estimated non-parametrically. The estimated baseline hazard function is constant over the intervals between failure times. One can also calculate <strong>survivor function</strong> \(S_0(t)\) or equivalently the baseline cumulative distribution function \(F_0(t)\), that corresponds to the estimated baseline hazard function.</p>
<p><strong>The estimated survivor function is a step function that falls at each time at which there is a failure.</strong></p>
<p>The point of the proportional hazards model is that the survivor function is estimated non-parametrically (i.e. not imposing any structure on its pattern over time, except that it must decrease as \(t\) increases) and estimation of \(\beta\) can proceed separately from estimation of the survivor function.</p>
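<p>The logarithm of the partial likelihood in \eqref{27} can be sketched in a few lines. The function name and the data are hypothetical, and the sketch assumes there are no tied failure times, as in \eqref{25}. The last lines illustrate that rescaling all times leaves the value unchanged, since only the ordering enters.</p>

```python
import math

def cox_partial_log_likelihood(beta, times, failed, covariates):
    """Log of equation (27).

    times[i] is the failure time if failed[i] is True, otherwise the length
    of follow-up; covariates[i] is the list of characteristics x_i.
    """
    scores = [sum(b * x for b, x in zip(beta, row)) for row in covariates]
    log_lik = 0.0
    for i, t_i in enumerate(times):
        if not failed[i]:
            continue  # non-failures only appear inside risk sets
        # Risk set R(t_i): everyone whose failure or follow-up time is >= t_i.
        risk_set = [j for j, t_j in enumerate(times) if t_j >= t_i]
        log_lik += scores[i] - math.log(sum(math.exp(scores[j]) for j in risk_set))
    return log_lik

# Made-up sample: three failures and one individual censored at time 4.
times = [2.0, 3.0, 5.0, 4.0]
failed = [True, True, True, False]
covariates = [[1.0], [0.0], [1.0], [0.0]]

at_zero = cox_partial_log_likelihood([0.0], times, failed, covariates)
original = cox_partial_log_likelihood([0.7], times, failed, covariates)
rescaled = cox_partial_log_likelihood([0.7], [10 * t for t in times], failed, covariates)
```

<p>At \(\beta = 0\) every individual in a risk set is equally likely to fail, so each factor is one over the size of the risk set.</p>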
<h3 id="split-population-models">Split Population Models</h3>
<p>The models considered so far assume some cumulative distribution function, \(F(t)\), for the survival time, that gives the probability of a failure up to and including time \(t\), and that approaches one as \(t\) approaches infinity. This basically means that every individual must eventually fail, if observed for a long enough time. This assumption is not true in all cases.</p>
<p><strong>Split Population Models</strong> (or split models) do not imply that every individual would eventually fail. Rather the population is divided into two groups, one of which would never fail.</p>
<p>Mathematically, let \(Y\) be an unobservable indicator that takes the value one for individuals who would ultimately fail and zero for individuals who would never fail. Then,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
P(Y=1) &= \delta \\
P(Y=0) &= 1 - \delta
\end{align}
\tag{28} \label{28} %]]></script>
<p>where \(\delta\) is the proportion of the population that would eventually fail, and \(1 - \delta\) is the proportion that would never fail.</p>
<p>Let \(g(t \vert Y=1)\) be density of survival times for the ultimate failures, and \(G(t \vert Y=1)\) be the corresponding cumulative distribution function. If one considers exponential model to represent them, then</p>
<script type="math/tex; mode=display">\begin{align}
g(t \vert Y=1) = \theta e^{-\theta t} \\
G(t \vert Y=1) = 1 - e^{-\theta t}
\end{align}
\tag{29} \label{29}</script>
<p>It can also be noted that \(g (t \vert Y = 0)\) and \(G(t \vert Y=0)\) are not defined.</p>
<p>Let \(T\) be the length of the follow up period and let \(R\) be an observable indicator equal to one if there is failure by time \(T\) and zero if there is not. The probability for individuals who do not fail during the follow up period, i.e, the event of \(R = 0\) is given by,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
P(R=0) &= P(Y=0) + P(Y=1)P(t \gt T \vert Y=1) \\
&= 1 - \delta + \delta e^{-\theta T}
\end{align}
\tag{30} \label{30} %]]></script>
<p>Similarly, probability density for people who fail with survival time \(t\) is given by,</p>
<script type="math/tex; mode=display">P(Y=1)P(t \lt T \vert Y=1) g(t \vert t \lt T, Y=1) = P(Y=1) g(t \vert Y=1) = \delta \theta e^{-\theta t} \tag{31} \label{31}</script>
<p>So the likelihood function is made up of \eqref{30} for those who do not fail and \eqref{31} for those who do. It is given by,</p>
<script type="math/tex; mode=display">L = \prod_{i=1}^n \delta \theta exp(-\theta t_i) \prod_{i = n+1}^N (1 - \delta + \delta exp(-\theta T_i)) \tag{32} \label{32}</script>
<p>The maximum likelihood estimates of both \(\theta\) and \(\delta\) can be obtained by maximizing \eqref{32} numerically. It can be noted that when \(\delta = 1\), \eqref{32} reduces to \eqref{14}, the original exponential survival time model.</p>
<p>The split population model can be seen as a model of two separate subpopulations, one with hazard rate \(\theta\) and the other with zero. A more general model exists where the subpopulations have two non-zero hazard rates, namely \(\theta_1\) and \(\theta_2\). Such models help to account for a population that is heterogeneous in nature.</p>
<p>Split models can also be based on other distributions such as the lognormal. It is also possible to include explanatory variables in a split model. In such cases, the explanatory variables may be taken to affect the probability of failure, \(\delta\), or the distribution of time until failure.</p>
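<p>As a rough sketch of this numerical maximization, the code below evaluates the logarithm of \eqref{32} and searches a coarse grid over \(\delta\) and \(\theta\). The data, the grid range for \(\theta\), and the function names are assumptions; a real analysis would use a proper optimizer rather than a grid.</p>

```python
import math

def split_exponential_log_likelihood(delta, theta, failure_times, censoring_times):
    """Log of equation (32)."""
    log_lik = sum(math.log(delta) + math.log(theta) - theta * t for t in failure_times)
    log_lik += sum(
        math.log(1.0 - delta + delta * math.exp(-theta * T)) for T in censoring_times
    )
    return log_lik

def grid_search_mle(failure_times, censoring_times, steps=200, theta_max=2.0):
    """Crude maximization over a grid; returns (log-likelihood, delta, theta)."""
    best = None
    for i in range(1, steps + 1):
        delta = i / steps                      # delta in (0, 1]
        for j in range(1, steps + 1):
            theta = theta_max * j / steps      # theta in (0, theta_max]
            value = split_exponential_log_likelihood(
                delta, theta, failure_times, censoring_times
            )
            if best is None or value > best[0]:
                best = (value, delta, theta)
    return best

# Made-up sample: four early failures and six individuals who have still
# not failed after a long follow-up of 20 time units.
failure_times = [0.5, 1.0, 1.5, 2.0]
censoring_times = [20.0] * 6

best_value, delta_hat, theta_hat = grid_search_mle(failure_times, censoring_times)
plain_exponential = split_exponential_log_likelihood(
    1.0, 4 / 125, failure_times, censoring_times
)
```

<p>In this example the estimated share of eventual failures is well below one, and the split model attains a higher log-likelihood than the plain exponential model evaluated at its own closed-form estimate.</p>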
<p>For example, for a given feature vector \(x_i\) of explanatory variables, using <strong>logit/individual lognormal model</strong>, \(\delta\) is modeled using,</p>
<script type="math/tex; mode=display">\delta_i = 1/(1+exp(x_i^\prime \alpha)) \tag{33} \label{33}</script>
<p>and parameter \(\mu\) of the lognormal distribution is given by,</p>
<script type="math/tex; mode=display">\mu_i = x_i^\prime \beta \tag{34} \label{34}</script>
<p>Here, the parameter \(\alpha\) gives the effect of \(x_i\) on the probability of failure, and \(\beta\) gives the effect of \(x_i\) on the time until failure.</p>
<p>Such models are important because they let one distinguish the effects of explanatory variables on the probability of eventual failure from their effects on the time until failure for those who eventually do fail.</p>
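<p>The two links in \eqref{33} and \eqref{34} are straightforward to compute. The sketch below uses made-up characteristics and coefficients and hypothetical function names. Note that with the sign convention of \eqref{33} as written, a larger value of \(x_i^\prime \alpha\) lowers the probability of eventual failure.</p>

```python
import math

def eventual_failure_probability(x, alpha):
    """delta_i from equation (33): 1 / (1 + exp(x' alpha))."""
    return 1.0 / (1.0 + math.exp(sum(a * v for a, v in zip(alpha, x))))

def lognormal_location(x, beta):
    """mu_i from equation (34): x' beta."""
    return sum(b * v for b, v in zip(beta, x))

# Made-up values; the first entry of x_i plays the role of an intercept.
x_i = [1.0, 2.0]
alpha = [0.5, -0.25]
beta = [1.0, 0.3]

delta_i = eventual_failure_probability(x_i, alpha)   # x' alpha = 0 here
mu_i = lognormal_location(x_i, beta)
```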
<h3 id="heterogeneity-and-state-dependence">Heterogeneity and State Dependence</h3>
<p>The two major causes of observed declining hazard rates are:</p>
<ul>
<li>state dependence</li>
<li>heterogeneity</li>
</ul>
<p>The phenomenon of an actually decreasing hazard rate over time due to an actual change in behavior over time at individual level is referred to as <strong>state dependence</strong>.</p>
<p>The second possible reason is <strong>heterogeneity</strong>. This basically means that the hazard rates are different across individuals, i.e., some individuals are more prone to failure than others. Naturally, individuals with higher hazard rates tend to fail earlier, on average, than individuals with lower hazard rates. As a result the average hazard rate of the surviving group will decrease with length of time simply because the most failure prone individuals have been removed already. This is true even without state dependence, i.e, each individual has a constant hazard rate but hazard rate varies across individuals. Even such a group would display decreasing hazard rate.</p>
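<p>This effect can be illustrated with a population made of two groups, each with its own constant hazard rate. The sketch below computes the hazard of the mixed population as density over survival; the group shares, the rates, and the function name are made up for the illustration.</p>

```python
import math

def population_hazard(t, share_high, theta_high, theta_low):
    """Hazard of a mixture of two groups with constant individual hazards."""
    density = (
        share_high * theta_high * math.exp(-theta_high * t)
        + (1 - share_high) * theta_low * math.exp(-theta_low * t)
    )
    survival = (
        share_high * math.exp(-theta_high * t)
        + (1 - share_high) * math.exp(-theta_low * t)
    )
    return density / survival

# Half the population has hazard 1.0, the other half has hazard 0.1.
hazards = [population_hazard(t, 0.5, 1.0, 0.1) for t in range(0, 11)]
```

<p>The population hazard starts at the average of the two rates and declines towards the lower rate as the failure-prone group is depleted, even though no individual's hazard ever changes.</p>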
<p>It is important to understand the difference because a decrease in the hazard rate due to state dependence indicates success of the underlying program, while a decrease due to heterogeneity does not imply that the program is effective in preventing failure, because it arises purely from the composition of the data at hand.</p>
<h3 id="time-varying-covariates">Time Varying Covariates</h3>
<p>So far, the explanatory variables affecting the time until failure have been treated as fixed over time, but in practice their values may change.</p>
<p>Explanatory variables can be categorized as follows:</p>
<ul>
<li>variables that do not change over time, e.g. race, sex etc.</li>
<li>variables that change over time but not within a single follow-up period, e.g. number of times followed up etc.</li>
<li>variables that change continuously over time, such as age, education etc.</li>
</ul>
<p>The last type of variables make it reasonable to use a statistical model that allows covariates to vary over time. Such incorporation is relatively straightforward in hazard-based models such as proportional hazard models. At each point in time, hazard rate is determined by the values of explanatory variables at that time.</p>
<p>However, it is much more difficult to introduce time-varying components into parametric models, because these models are parameterized in terms of the density and the cumulative distribution function, and the density or distribution function at time \(t\) depends on the whole history of the explanatory variables up to time \(t\). <strong>In the presence of time varying covariates, a parameterization of the hazard rate would be much more convenient.</strong></p>
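<p>To illustrate why a hazard parameterization is convenient, here is a minimal sketch that assumes a proportional hazards form \(h(t) = h_0 \exp(x(t)^\prime \beta)\) with a constant baseline hazard \(h_0\) and covariates that are piecewise constant over time. The function name, the constant baseline and the numbers are illustrative assumptions. The survival probability only requires accumulating the hazard interval by interval.</p>

```python
import math

def survival_probability(intervals, beta, baseline_hazard):
    """S(t) under h(t) = baseline_hazard * exp(x(t) . beta), piecewise-constant x(t).

    intervals: list of (duration, covariate_vector) pairs covering [0, t].
    """
    cumulative_hazard = 0.0
    for duration, x in intervals:
        linear_term = sum(b * xi for b, xi in zip(beta, x))
        # The hazard in this interval depends only on the covariates at that time
        cumulative_hazard += baseline_hazard * math.exp(linear_term) * duration
    return math.exp(-cumulative_hazard)

# Hypothetical history: covariate is 0 for 1 time unit, then 1 for 2 time units
beta = [0.5]
history = [(1.0, [0.0]), (2.0, [1.0])]
s = survival_probability(history, beta, baseline_hazard=0.1)
print(f"S(3) = {s:.4f}")
```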
<p><strong>Panel or Longitudinal Data:</strong> data on individuals over time without reference to just a single follow-up. Such datasets include a large number of time-varying explanatory variables.</p>
<h2 id="references">REFERENCES:</h2>
<p><small><a href="https://link.springer.com/article/10.1007/BF01083132#" target="_blank">Survival Analysis: A Survey</a></small><br /></p>
Mon, 23 Jul 2018 00:00:00 +0000
https://notes.pairml.com/2018/07/23/survival-analysis/
https://notes.pairml.com/2018/07/23/survival-analysis/machine-learningpapersfeaturedGoogle Smart Reply<h3 id="introduction">Introduction</h3>
<p>Smart reply is an end-to-end method for automatically generating <strong>short yet semantically diverse</strong> email responses. The feature also depends on some novel methods for <strong>semantic clustering of user-generated content</strong> that require a minimal amount of explicitly labeled data.</p>
<p>Google reveals that around 25% of email responses are 20 tokens or less in length. The high frequency of short replies was the major motivation behind developing an automated reply assist feature. The system uses machine learning techniques such as fully-connected neural networks and LSTMs.</p>
<p>The major challenges addressed in building this feature include the following:</p>
<ul>
<li>High <strong>response quality</strong> in terms of language and content.</li>
<li><strong>Utility</strong> maintained by presenting a variety of responses.</li>
<li><strong>Scalable architecture</strong> to serve the millions of emails Google handles without significant latency.</li>
<li>Maintaining <strong>privacy</strong> by ensuring that no personal data is leaked while generating training data. <strong>Only aggregate statistics are inspected.</strong></li>
</ul>
<h3 id="smart-reply">Smart Reply</h3>
<p>Smart reply consists of the following components:</p>
<ul>
<li>Response Selection: An LSTM network processes the incoming messages and produces the most likely responses. <strong>To improve scalability and increase speed of processing, only approximate best responses are found.</strong></li>
<li>Response Set Generation: In order to maintain high quality, the responses are selected from a response set <strong>generated offline using a semi-supervised graph learning approach</strong>.</li>
</ul>
<p><img src="/assets/2018-07-22-google-smart-reply/fig-1-smart-reply.png?raw=true" alt="Fig-1: Lifecycle of a message" width="50%" /></p>
<ul>
<li>Diversity: After generating the most likely responses, a smaller set of responses are chosen among them to <strong>maximize the utility which requires enforcing diverse semantic intents</strong> among the presented options.</li>
<li>Triggering Model: A feedforward neural network decides whether or not to suggest responses, which further improves the utility by not showing suggestions when they are unlikely to be used.</li>
</ul>
<h3 id="background">Background</h3>
<p>The entire application of smart reply can be basically broken down into two core tasks:</p>
<ul>
<li>predicting responses</li>
<li>identifying a target response space</li>
</ul>
<p>While the task of finding an apt response has been attempted before, it has never been applied to a production environment at such a scale. It is this widespread use of the application that requires it to deliver high quality responses in all instances. This is achieved by choosing the responses from a pre-identified response space.</p>
<p>This leads to the second core task of identifying the target response space. This is achieved using an algorithm called the <strong>Expander Graph Learning Approach</strong>, chosen because it scales well to very large datasets and large output sizes. The approach is generally used for knowledge expansion and classification tasks; smart reply is the first attempt to use it for semantic intent clustering.</p>
<h3 id="selecting-responses">Selecting Responses</h3>
<p>The fundamental aim of smart reply is to find the most likely response given an original message text, i.e. given an original message \(o\) and the set of all possible responses \(R\), find</p>
<script type="math/tex; mode=display">r^* = \underset{r \in R}{\operatorname{argmax}} \, P(r|o) \tag{1} \label{1}</script>
<p>To achieve this, a model is built to score the responses, and the response with the highest score is picked.</p>
<p><strong>LSTM Model</strong></p>
<ul>
<li>Since a sequence of tokens \(r\) is being scored conditional on another sequence of tokens \(o\), the task is a natural fit for <strong>sequence to sequence learning</strong>.</li>
<li>Input to the model is the original message \(\{o_1, o_2, \cdots o_n\}\)</li>
<li>The output is the conditional probability distribution of sequence of response tokens given the input:</li>
</ul>
<script type="math/tex; mode=display">P(r_1, r_2, \cdots, r_m | o_1, o_2, \cdots, o_n) \tag{2} \label{2}</script>
<p>The distribution in \eqref{2} can be further factorized as,</p>
<script type="math/tex; mode=display">P(r_1, \cdots, r_m | o_1, \cdots, o_n) = \prod_{i=1}^m P(r_i|o_1, \cdots, o_n, r_1, \cdots, r_{i-1}) \tag{3} \label{3}</script>
<p>In practice, the sequence of tokens of the original message is fed to the LSTM, which then encodes the entire message in a vector representation. Given this state, a softmax output is computed, which is interpreted as \(P(r_1|o_1, \cdots, o_n)\) (the probability distribution of the first response token).</p>
<p>Similarly, as the response tokens are fed in, the softmax at each timestep \(t\) is interpreted as \(P(r_t|o_1, \cdots, o_n, r_1, \cdots, r_{t-1})\).</p>
<p>Using the factorization in \eqref{3}, these softmax scores can be used to compute \(P(r_1, r_2, \cdots, r_m | o_1, o_2, \cdots, o_n)\).</p>
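<p>As a minimal sketch of this computation, the snippet below multiplies per-timestep probabilities (by summing their logarithms) to score one response. The softmax outputs are hard-coded, hypothetical values over a toy vocabulary; in the real model each distribution would be produced by the LSTM after consuming the original message and the preceding response tokens.</p>

```python
import math

def sequence_log_prob(step_distributions, response_tokens):
    """Sum of log P(r_i | o, r_1..r_{i-1}) over the response tokens."""
    total = 0.0
    for dist, token in zip(step_distributions, response_tokens):
        total += math.log(dist[token])
    return total

# Hypothetical softmax outputs, one dict per timestep (toy vocabulary)
step_distributions = [
    {"sounds": 0.6, "no": 0.3, "EOS": 0.1},
    {"good": 0.7, "bad": 0.2, "EOS": 0.1},
    {"EOS": 0.9, "good": 0.1},
]
response = ["sounds", "good", "EOS"]
log_p = sequence_log_prob(step_distributions, response)
print(f"P(response | message) = {math.exp(log_p):.3f}")
```

<p>Working in log space avoids numerical underflow when responses are long, and the same quantity is what the training objective sums over all message-response pairs.</p>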
<p>Training involves the following points:</p>
<ul>
<li>maximize the log probability of observed responses, given their respective original messages, i.e.</li>
</ul>
<script type="math/tex; mode=display">\sum_{(o, r)} \log P(r_1, \cdots, r_m | o_1, \cdots, o_n) \tag{4} \label{4}</script>
<ul>
<li>train with stochastic gradient descent using AdaGrad.</li>
<li>training is done on a distributed system because of the size of the dataset.</li>
<li>a <strong>recurrent projection layer</strong> helped improve quality and time to convergence.</li>
<li><strong>gradient clipping</strong> helps stabilize training.</li>
</ul>
<p><strong>Inference</strong>: At the time of inference one can feed in the original message and then use the output of the softmaxes to get a probability distribution over the vocabulary at each timestep. These can be used in a variety of ways:</p>
<ul>
<li>to draw a random sample from the response distribution. This is done by sampling one token at each timestep to feed it back into the model.</li>
<li>to approximate the most likely response given the original message. This can be done greedily by taking most likely token at each timestep and feeding it back in. A less greedy strategy is to use <strong>beam search</strong>, i.e. take the top \(b\) tokens and feed them in, then retain the best \(b\) response prefixes and repeat.</li>
<li>to determine the likelihood of a specific response candidate. This is done by feeding in each token of the candidate and using the softmax output to get the likelihood of the next candidate token.</li>
</ul>
<h3 id="challenges">Challenges</h3>
<p><strong>Response Quality</strong></p>
<ul>
<li>In order to surface responses to users, the responses must always be of high quality in terms of style, tone, diction, and content. Since the models are trained on real-world data, one has to account for the possibility that the most probable response is not necessarily a high quality response. Even the most frequent responses might not be appropriate to suggest to users, because they could contain poor grammar, spelling or mechanics (like <em>your the best!</em>) or convey a sense of familiarity that is likely to be offensive (like <em>thanks hon!</em>).</li>
<li>While restricting the vocabulary can take care of issues such as profanity or spelling errors, it would not be sufficient to avert a politically incorrect statement, which can be formed in a wide variety of ways.</li>
<li>Hence, smart reply uses semi-supervised learning to build a target response space \(R\) comprising only high quality responses.</li>
<li>The model described above is then used to choose the best response in \(R\), instead of the best response among all sequences of words in the vocabulary.</li>
</ul>
<p><strong>Utility</strong></p>
<ul>
<li>Suggestions are most useful when they are highly specific to the original message and express a diverse intent.</li>
<li>In general, the outputs of the LSTM are observed to (1) favor common but unspecific responses and (2) have little diversity.</li>
<li>Specificity of the responses is increased by penalizing the responses that are applicable to a broad range of incoming messages.</li>
<li>In order to increase the breadth of options presented to users, diversity is enforced by exploiting the semantic structure of \(R\).</li>
<li>Utility of responses is also boosted by passing the incoming message first through a triggering model which decides whether or not it is appropriate for suggestions to pop up.</li>
</ul>
<p><strong>Scalability</strong></p>
<ul>
<li>Scoring every candidate \(r \in R\) would require \(O(|R | l)\) LSTM steps where \(l\) is the length of the longest response.</li>
<li>This would mean a growing response time as the number of responses in \(R\) increases over time.</li>
<li>In general, the running time of an efficient algorithm for this purpose should not be a function of \(|R|\).</li>
<li>In order to achieve this, the responses in \(R\) are organized as a trie, and a left-to-right beam search is then run that retains only the hypotheses that appear in the trie.</li>
<li>This search process has a complexity of \(O(bl)\) where both \(b\) and \(l\) are in a range of 10-30, which greatly reduces the time it would take to generate the responses.</li>
<li>Although the search only approximates the best responses in \(R\), its results are very similar to what one would get by scoring and ranking all \(r \in R\), even for a small \(b\).</li>
<li>Also, the first pass through the triggering model reduces the average time a message spends in LSTM computations.</li>
</ul>
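<p>The trie-restricted beam search can be sketched as follows. This is an illustrative toy under stated assumptions, not the production implementation: <code>toy_model</code> is a hypothetical stand-in for the LSTM softmax, the trie is a plain nested dictionary, and the probability of an end-of-response token is ignored for brevity.</p>

```python
import math

def build_trie(responses):
    """Nested-dict trie over token sequences; the key None marks the end of a response."""
    root = {}
    for tokens in responses:
        node = root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node[None] = True
    return root

def trie_beam_search(trie, next_token_log_probs, beam_width):
    """Left-to-right beam search that only keeps prefixes present in the trie."""
    beams = [(0.0, (), trie)]  # (log-probability, prefix, trie node)
    finished = []
    while beams:
        candidates = []
        for score, prefix, node in beams:
            log_probs = next_token_log_probs(prefix)  # one model call per beam
            for tok, child in node.items():
                if tok is None:
                    finished.append((score, prefix))  # prefix is a complete response
                elif tok in log_probs:
                    candidates.append((score + log_probs[tok], prefix + (tok,), child))
        # Retain only the best beam_width prefixes
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return sorted(finished, key=lambda f: f[0], reverse=True)

def toy_model(prefix):
    # Hypothetical stand-in for the LSTM softmax; it ignores the original message
    table = {
        (): {"yes": 0.5, "no": 0.3, "thanks": 0.2},
        ("yes",): {"please": 0.8, "sure": 0.2},
        ("no",): {"thanks": 1.0},
    }
    return {tok: math.log(p) for tok, p in table.get(prefix, {}).items()}

R = [("yes", "please"), ("no", "thanks"), ("thanks",)]
ranked = trie_beam_search(build_trie(R), toy_model, beam_width=2)
print([" ".join(prefix) for _, prefix in ranked])
```

<p>The number of model calls depends on the beam width and the response length rather than on the size of \(R\). With a beam of width 2 the lowest scoring first token is pruned, which shows why the search is only approximate.</p>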
<h3 id="response-set-generation">Response Set Generation</h3>
<ul>
<li>The goal of this step is to generate a structured response set that effectively captures various intents conveyed by people in natural language conversations.</li>
<li>The target response space is required to capture the variability in both language and intents.</li>
<li>The results are used in two ways - (1) define a response space and (2) promote diversity among chosen suggestions.</li>
<li>Response set is constructed by aggregating the most frequently used sentences among the preprocessed data.</li>
</ul>
<p><strong>Canonicalizing Email Responses</strong></p>
<ul>
<li>Involves generating a set of canonicalized responses that capture the variability in language.</li>
<li>This is done by performing a dependency parse on all the sentences and then using the syntactic structure to generate a canonicalized representation.</li>
<li>Words and phrases that are modifiers or are not attached to the head words are ignored.</li>
</ul>
<p><strong>Semantic Intent Clustering</strong></p>
<ul>
<li>partition the responses into semantic clusters where each cluster represents a meaningful response intent.</li>
<li>all the messages within a cluster share the same semantic meaning but may appear different in structure.</li>
<li>this helps digest the entire information present in frequent responses into a coherent set of semantic clusters.</li>
<li>because of the lack of data available to train a classifier, a supervised model cannot be trained to predict the semantic cluster of a candidate response.</li>
<li>another hindrance in performing supervised learning is that the semantic space classes cannot be all defined a priori.</li>
<li>hence the semi-supervised technique is used for achieving this.</li>
</ul>
<p><strong>Graph Construction</strong></p>
<ul>
<li>Start by manually defining the clusters sampled from top frequent responses.</li>
<li>A small number of responses are added as seed for the clustering.</li>
<li>This leads to a base graph, where <strong>frequent responses are represented by nodes, \(V_R\)</strong>. Lexical features (n-grams and skip-grams up to a length of 3) are extracted for the responses and added to the graph as the <strong>feature nodes, \(V_F\)</strong>. Edges are created between pairs of nodes \((u,v)\) where \(u \in V_R\) and \(v \in V_F\). Similarly, nodes are created for the manually labelled examples, \(V_L\).</li>
</ul>
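<p>A minimal sketch of this construction is shown below. The exact feature definitions are assumptions made for illustration: contiguous n-grams up to length 3, plus skip-grams formed from the first and last token of each 3-token window. The function names are hypothetical.</p>

```python
from collections import defaultdict

def lexical_features(tokens, max_len=3):
    """Contiguous n-grams up to max_len plus skip-grams with one skipped token."""
    feats = set()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            feats.add(" ".join(tokens[i:i + n]))
    # Skip-grams: first and last token of each 3-token window, with a gap marker
    for i in range(len(tokens) - 2):
        feats.add(f"{tokens[i]} _ {tokens[i + 2]}")
    return feats

def build_graph(responses):
    """Edges between response nodes (V_R) and feature nodes (V_F)."""
    response_to_features = {r: lexical_features(r.split()) for r in responses}
    feature_to_responses = defaultdict(set)
    for r, feats in response_to_features.items():
        for f in feats:
            feature_to_responses[f].add(r)
    return response_to_features, feature_to_responses

v_r, v_f = build_graph(["thanks a lot", "thanks so much"])
print(sorted(v_r["thanks a lot"]))
print(sorted(v_f["thanks"]))
```

<p>Two responses become connected through the graph whenever they share a feature node, which is what later lets label information flow between them.</p>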
<p><strong>Semi-supervised Learning</strong></p>
<ul>
<li>The constructed graph captures the relationship between the canonicalized responses via feature nodes.</li>
<li>The semantic intent for each response node is learnt by propagating intent information from the manually labelled examples through the graph.</li>
</ul>
<p>The algorithm works to minimize the following objective function for the response nodes:</p>
<script type="math/tex; mode=display">s_i \lVert \hat{C_i} - C_i \rVert^2 + \mu_{pp} \lVert \hat{C_i} - U \rVert^2 + \mu_{np} \left( \sum_{j \in \mathcal{N}_{\mathcal{F}} (i)} w_{ij} \lVert \hat{C_i} - \hat{C_j} \rVert^2 + \sum_{k \in \mathcal{N}_{\mathcal{R}} (i)} w_{ik} \lVert \hat{C_i} - \hat{C_k} \rVert^2\right) \tag{5} \label{5}</script>
<p>where</p>
<ul>
<li>\(s_i\) is an <strong>indicator function</strong> equal to 1 if node \(i\) is a seed else 0.</li>
<li>\(\hat{C_i}\) is the <strong>learnt semantic cluster distribution</strong> for response node \(i\).</li>
<li>\(C_i\) is the <strong>true label distribution</strong> (i.e. for the manually provided examples)</li>
<li>\(\mathcal{N}_{\mathcal{F}} (i)\) and \(\mathcal{N}_{\mathcal{R}} (i)\) represent the feature and response neighbourhood of node \(i\).</li>
<li>\(\mu_{np}\) is the predefined penalty for neighbouring nodes with divergent label distributions.</li>
<li>\(\hat{C_j}\) is the learnt label distribution for feature neighbour \(j\).</li>
<li>\(w_{ij}\) is the weight of feature \(j\) in response \(i\).</li>
<li>\(\mu_{pp}\) is the penalty for label distribution deviating from prior, Uniform Distribution \(U\).</li>
</ul>
<p>Similarly, the following objective function is minimized for the feature nodes:</p>
<script type="math/tex; mode=display">\mu_{pp} \lVert \hat{C_i} - U \rVert^2 + \mu_{np} \left( \sum_{j \in \mathcal{N}_{\mathcal{F}} (i)} w_{ij} \lVert \hat{C_i} - \hat{C_j} \rVert^2 + \sum_{k \in \mathcal{N}_{\mathcal{R}} (i)} w_{ik} \lVert \hat{C_i} - \hat{C_k} \rVert^2\right) \tag{6} \label{6}</script>
<p>\eqref{5} and \eqref{6} are alike except that \eqref{6} does not have the first term as there are no seed labels for the feature nodes.</p>
<p>The objective functions \eqref{5} and \eqref{6} are jointly optimized over all the nodes. In order to discover new clusters, the algorithm is run in phases: in each phase, 100 new responses are sampled at random from the unlabeled nodes. These are treated as potential new clusters and labeled with their canonicalized representations, after which the algorithm is rerun and the process is repeated for the remaining unlabeled nodes.</p>
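<p>To make the terms of the objective concrete, the following sketch evaluates the per-node value for a single node. It is only an evaluation of the expression, not the optimization procedure, and the function names and numbers are illustrative assumptions. Feature and response neighbours are passed in one combined list, and a feature node is handled by setting <code>is_seed=False</code>, which drops the first term.</p>

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def node_objective(c_hat, c_true, is_seed, neighbours, mu_pp, mu_np):
    """Per-node objective: seed term + prior term + neighbour smoothness term.

    neighbours: list of (weight, neighbour_distribution) pairs covering both the
    feature and the response neighbourhood of the node.
    """
    uniform = [1.0 / len(c_hat)] * len(c_hat)
    seed_term = squared_distance(c_hat, c_true) if is_seed else 0.0
    prior_term = mu_pp * squared_distance(c_hat, uniform)
    smooth_term = mu_np * sum(w * squared_distance(c_hat, c_n) for w, c_n in neighbours)
    return seed_term + prior_term + smooth_term

# Hypothetical seed node with two clusters and one undecided neighbour
value = node_objective(
    c_hat=[1.0, 0.0], c_true=[1.0, 0.0], is_seed=True,
    neighbours=[(1.0, [0.5, 0.5])], mu_pp=0.1, mu_np=1.0,
)
print(f"objective = {value:.2f}")
```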
<p><strong>Cluster Validation</strong></p>
<ul>
<li>Finally, the top \(k\) members from each semantic cluster are extracted and sorted by their label scores.</li>
<li>The set of (response, cluster label) pairs are then validated by human raters.</li>
</ul>
<h3 id="suggestion-diversity">Suggestion Diversity</h3>
<ul>
<li>The LSTM model is trained to return the approximate best responses among the target response set.</li>
<li>The responses are <strong>penalized if they are too general</strong> to be valuable to any user.</li>
<li>The next <strong>challenge lies in choosing a small number of responses</strong> to display to the user which maximizes the utility.</li>
<li>A straightforward way of doing this is to <strong>choose the top \(N\) responses</strong> and present them to the user. But in practice it is observed that such responses tend to be very similar. The likelihood of one of the responses being useful is greatest when none of the responses presented to the user are redundant, i.e. it would be wasteful to present a user with three responses that are variations of the same sentence.</li>
<li>The second and more optimal approach to suggest responses to users would <strong>include enforcing diversity</strong>. This is achieved by:
<ul>
<li>omitting redundant responses.</li>
<li>enforcing negative or positive responses.</li>
</ul>
</li>
</ul>
<p><strong>Omitting Redundant Responses</strong></p>
<ul>
<li>The strategy states that a user should <strong>never see two responses with the same intent</strong>.</li>
</ul>
<blockquote>
<p>Intent can be thought of as a cluster of responses that have a common communication purpose.</p>
</blockquote>
<ul>
<li>In smart reply, every suggested response is associated with exactly one intent. These intents are learnt using the semi-supervised learning algorithm explained <a href="#response-set-generation">above</a>.</li>
<li>The actual diversity strategy is simple: the top responses are iterated over in order of decreasing score. Each response is added to the suggestion list unless its intent is already covered by a response in the suggestion list.</li>
</ul>
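<p>The strategy can be sketched in a few lines. The responses, scores and intent labels below are made up for illustration, and the limit of three suggestions is passed as a parameter rather than fixed.</p>

```python
def diversify(scored_responses, intent_of, max_suggestions=3):
    """Pick the top responses such that no two of them share an intent."""
    suggestions, covered = [], set()
    ordered = sorted(scored_responses, key=lambda rs: rs[1], reverse=True)
    for response, _score in ordered:
        intent = intent_of[response]
        if intent in covered:
            continue  # an earlier, higher scoring response already covers this intent
        suggestions.append(response)
        covered.add(intent)
        if len(suggestions) == max_suggestions:
            break
    return suggestions

# Hypothetical scored candidates and their intent clusters
scored = [("Sounds good!", 0.9), ("Sounds great!", 0.8),
          ("Sorry, I can't.", 0.5), ("What time?", 0.4)]
intent_of = {"Sounds good!": "agree", "Sounds great!": "agree",
             "Sorry, I can't.": "decline", "What time?": "ask_time"}
print(diversify(scored, intent_of))
```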
<p><strong>Enforcing Negatives and Positives</strong></p>
<ul>
<li>It is observed that the LSTM trained has a strong tendency towards positive responses, whereas negative responses generally get a low score.</li>
<li>It might be reflective of the style of email conversations: positive replies are more common and when the replies are negative people prefer more indirect wording.</li>
<li>Since it is important to offer an option for responding negatively, the following strategy is followed:</li>
</ul>
<blockquote>
<p>If the top two responses (chosen from different intents) contain at least one positive response and none of the top three responses are negative, the third response is replaced with a negative one.</p>
</blockquote>
<ul>
<li>
<p>A positive response is one that is clearly affirmative. In order to find the negative response to be included as the third option, a second LSTM pass is performed, in which the search is restricted to the negative responses in the target set.</p>
</li>
<li>
<p>It might also be the case that an incoming message triggers exclusively negative responses, in which case an analogous strategy for enforcing positives is employed.</p>
</li>
</ul>
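<p>A minimal sketch of the replacement rule follows. The sentiment labels and the <code>best_negative</code> argument are assumptions for illustration: the latter stands for the result of the second, negative-only LSTM pass, which is not modelled here.</p>

```python
def enforce_negative(suggestions, sentiment_of, best_negative):
    """Replace the third suggestion with a negative one when the rule applies."""
    top_two_has_positive = any(sentiment_of[r] == "positive" for r in suggestions[:2])
    has_negative = any(sentiment_of[r] == "negative" for r in suggestions[:3])
    if top_two_has_positive and not has_negative and best_negative is not None:
        return suggestions[:2] + [best_negative]
    return suggestions

# Hypothetical suggestions and sentiment labels
suggestions = ["Yes, I can.", "Sounds good.", "Let me check."]
sentiment_of = {"Yes, I can.": "positive", "Sounds good.": "positive",
                "Let me check.": "neutral", "Sorry, I can't.": "negative"}
final = enforce_negative(suggestions, sentiment_of, best_negative="Sorry, I can't.")
print(final)
```

<p>The analogous rule for enforcing positives would swap the roles of the two labels.</p>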
<h3 id="triggering">Triggering</h3>
<ul>
<li>This is a second model (in this case a fully-connected feed-forward neural network that produces a probability score) that is responsible for filtering out messages that are bad candidates for suggesting responses. These might include emails that require longer responses, or emails that do not require a response at all.</li>
<li>On average, this system decides that only 11% of the incoming messages should be processed for smart reply. This selectivity speeds up the processing of incoming emails and decreases the time spent in the LSTM, which in turn reduces infrastructure costs.</li>
<li>The two main objectives that this system should fulfil are:
<ul>
<li>it should be accurate enough to decide when a smart reply should not be generated</li>
<li>it should be fast.</li>
</ul>
</li>
<li>This model was chosen because it has been repeatedly observed that such ANNs outperform linear models such as SVMs or linear regression on NLP tasks.</li>
</ul>
<p><strong>Data and Features</strong></p>
<ul>
<li>The data consists of pairs \((o, y)\), where \(o\) is an incoming message and \(y\) is a boolean indicating whether or not the email was replied to. For the positive class, only the messages that were replied to from a mobile device are considered.</li>
<li>Since emails that are not replied to are found to be more numerous, the negative class examples are downsampled to match the number of positive class examples.</li>
<li><strong>Features</strong> (unigrams and bigrams) are extracted from the message body, subject and headers. Other <strong>social signals</strong>, such as whether or not the sender is in the recipient’s address book, are also used.</li>
</ul>
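<p>The class balancing step can be sketched as below. The random sampling strategy, the fixed seed and the toy data are assumptions made for illustration; the source only states that the negative class is downsampled to match the positive class.</p>

```python
import random

def downsample_negatives(examples, seed=0):
    """Balance (message, replied) pairs by randomly dropping unreplied messages."""
    positives = [e for e in examples if e[1]]
    negatives = [e for e in examples if not e[1]]
    rng = random.Random(seed)
    if len(negatives) > len(positives):
        negatives = rng.sample(negatives, len(positives))
    balanced = positives + negatives
    rng.shuffle(balanced)
    return balanced

# Hypothetical data: 5 replied messages and 15 unreplied ones
examples = [(f"message {i}", i % 4 == 0) for i in range(20)]
balanced = downsample_negatives(examples)
print(len(balanced), sum(1 for _, y in balanced if y))
```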
<p><strong>Network Architecture and Training</strong></p>
<ul>
<li>Feed forward neural network with embedding layer and three fully connected hidden layers</li>
<li>Feature hashing is used to bucket rare words that are not present in the vocabulary.</li>
<li>Embeddings are aggregated by summation within a feature type (e.g. bigrams).</li>
<li>ReLU activation functions and dropout layers are used.</li>
<li>Trained using AdaGrad optimization technique.</li>
</ul>
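<p>Feature hashing and embedding summation can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: every feature is hashed (rather than only rare words), CRC32 is used as the hash function, and the embedding table holds made-up values instead of learnt ones.</p>

```python
import zlib

def hashed_index(feature, num_buckets):
    """Map any string feature, including unseen words, to a fixed bucket."""
    return zlib.crc32(feature.encode("utf-8")) % num_buckets

def sum_embeddings(features, embedding_table):
    """Aggregate the embeddings of all features of one type by summation."""
    dim = len(embedding_table[0])
    total = [0.0] * dim
    for f in features:
        row = embedding_table[hashed_index(f, len(embedding_table))]
        total = [t + r for t, r in zip(total, row)]
    return total

# Hypothetical table with 8 buckets and 2-dimensional embeddings
embedding_table = [[float(i), 1.0] for i in range(8)]
bigrams = ["see you", "you tomorrow"]
vector = sum_embeddings(bigrams, embedding_table)
print(vector)
```

<p>Summation yields a fixed-size input for the hidden layers regardless of how many features a message contains.</p>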
<h3 id="evaluation-and-results">Evaluation and Results</h3>
<p><strong>Data</strong></p>
<ul>
<li>For the LSTM model, the data consists of incoming messages and the users’ responses to them.</li>
<li>
<p>For the triggering model, messages are used with the label describing whether or not they were replied to from a mobile device.</p>
</li>
<li>The following <strong>preprocessing</strong> techniques are used:
<ul>
<li>Language detection: non-English messages are discarded.</li>
<li>Tokenization: messages and subjects are broken down into words and punctuation marks.</li>
<li>Sentence segmentation: sentence boundaries are detected in the message body.</li>
<li>Normalization: infrequent words and entities such as personal information are replaced by special tokens.</li>
<li>Quotation removal: Quoted original messages and forwarded messages are removed.</li>
<li>Salutation/close removal: salutations and closing notes are removed.</li>
</ul>
</li>
<li>After preprocessing the size of the training data is <strong>238 million</strong> messages, which includes 153 million messages that have no response.</li>
</ul>
<h3 id="conclusions">Conclusions</h3>
<ul>
<li>Standard binary classification metrics are reported for the triggering model: precision, recall and area under the ROC curve.</li>
<li>The AUC of the triggering model is 0.854.</li>
<li>For the LSTM model, perplexity, Mean Reciprocal Rank and Precision@K are reported.</li>
<li>A model with lower perplexity assigns a higher likelihood to the test responses, and hence should be better at predicting responses. The perplexity of smart reply is 17.0 (by comparison, an n-gram model with Katz backoff and a maximum order of 5 has a perplexity of 31.4).</li>
</ul>
<blockquote>
<p>A perplexity equal to \(k\) means that when the model predicts the next word, there are on average \(k\) likely candidates.</p>
</blockquote>
<ul>
<li>In an ideal scenario the perplexity of the system would be 1, i.e. one knows exactly what should be the next word. The perplexity on a set of \(N\) test samples is computed using the following formula:</li>
</ul>
<script type="math/tex; mode=display">P_r = \exp\left( - {1 \over W} \sum_{i=1}^N \ln \left(\hat{P} (r_1^i, \cdots, r_m^i| o_1^i, \cdots, o_n^i)\right) \right) \tag{7} \label{7}</script>
<p>where</p>
<ul>
<li>\(W\) is the total number of words in the \(N\) samples.</li>
<li>\(\hat{P}\) is the learnt distribution</li>
<li>
<p>\(r^i\) and \(o^i\) are the \(i\)-th response and original message.</p>
</li>
<li>The model is also evaluated on response ranking. Simply put, the rank of the actual response with respect to the other responses in \(R\) is evaluated. Using this, the <strong>mean reciprocal rank</strong> (MRR) is calculated using:</li>
</ul>
<script type="math/tex; mode=display">MRR = {1 \over N} \sum_{i=1}^N {1 \over rank_i} \tag{8} \label{8}</script>
<ul>
<li>
<p>Additionally, Precision@K (for a given value of K, the number of cases for which the target response \(r\) is within the top K responses ranked by the model) is also computed.</p>
</li>
<li>
<p>On a daily basis, the smart reply system generates 12.9k unique suggestions that belong to 376 unique semantic clusters, of which users utilized 31.9% of the suggestions and 83.2% of the unique clusters.</p>
</li>
<li>Among the selected responses, 45% are the 1st responses, 35% 2nd responses, and 20% 3rd responses.</li>
<li>If using only the straight-forward approach instead of enforcing diversity, the click through rates drop by roughly 7.5%.</li>
</ul>
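<p>The three metrics discussed above can be sketched as follows. The inputs are made-up values for illustration, and Precision@K is normalized here to a fraction of the test cases, which is an assumption on top of the count described in the text.</p>

```python
import math

def perplexity(log_probs, num_words):
    """exp(-(1/W) * sum of ln P(r^i | o^i)) over the test responses."""
    return math.exp(-sum(log_probs) / num_words)

def mean_reciprocal_rank(ranks):
    """Average of 1 / rank of the true response over all test cases."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def precision_at_k(ranks, k):
    """Fraction of cases whose true response is ranked within the top k."""
    return sum(1 for r in ranks if k >= r) / len(ranks)

# Hypothetical test set: one two-word response whose words each had probability 0.5
ppl = perplexity([math.log(0.25)], num_words=2)
# Hypothetical ranks of the true responses in three test cases
ranks = [1, 2, 4]
mrr = mean_reciprocal_rank(ranks)
p_at_2 = precision_at_k(ranks, k=2)
print(f"perplexity={ppl:.2f}, MRR={mrr:.3f}, P@2={p_at_2:.3f}")
```

<p>The perplexity of 2 in this toy case matches the intuition from the quote above: on average the model was choosing between two equally likely candidates per word.</p>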
<h2 id="references">REFERENCES:</h2>
<p><small><a href="https://ai.google/research/pubs/pub45189" target="_blank">Smart Reply: Automated Response Suggestion for Email</a></small><br />
<small><a href="https://www.blog.google/products/gmail/save-time-with-smart-reply-in-gmail/" target="_blank">Save time with Smart Reply in Gmail</a></small></p>
Sun, 22 Jul 2018 00:00:00 +0000
https://notes.pairml.com/2018/07/22/google-smart-reply/
https://notes.pairml.com/2018/07/22/google-smart-reply/NLPmachine-learningpapersLarge Scale Learning<h3 id="introduction">Introduction</h3>
<p>The popularity of machine learning techniques has increased in the recent past. One of the reasons for this trend is the exponential growth in the data available to learn from. Large datasets coupled with a high variance model have the potential to perform well. But as the size of datasets increases, it poses various problems in terms of the space and time complexity of the algorithms.</p>
<blockquote>
<p>It’s not who has the best algorithm that wins. It’s who has the most data.</p>
</blockquote>
<p>For example, consider the update rule for parameter optimization using gradient descent from (3) and (4) in the <a href="/2017/08/23/multivariate-linear-regression/" target="\_blank">multivariate linear regression post</a>,</p>
<script type="math/tex; mode=display">\theta_j := \theta_j - \alpha {1 \over m} \sum_{i=1}^m \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \tag{1} \label{1}</script>
<blockquote>
<p><a href="https://www.kaggle.com/shamssam/gradient-descent-for-regression" target="\_blank">Kaggle Kernel Implementation</a></p>
</blockquote>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">batch_update_vectorized</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="n">m</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">X_train</span><span class="o">.</span><span class="n">size</span><span class="p">()</span>
<span class="k">return</span> <span class="n">torch</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_add_bias</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">X_train</span><span class="p">)</span><span class="o">.</span><span class="n">transpose</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
<span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">forward</span><span class="p">()</span> <span class="o">-</span> <span class="bp">self</span><span class="o">.</span><span class="n">y_train</span><span class="p">)</span>
<span class="p">)</span> <span class="o">/</span> <span class="n">m</span>
<span class="k">def</span> <span class="nf">batch_update_iterative</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="n">m</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">X_train</span><span class="o">.</span><span class="n">size</span><span class="p">()</span>
<span class="n">update_theta</span> <span class="o">=</span> <span class="bp">None</span>
<span class="n">X</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_add_bias</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">X_train</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">m</span><span class="p">):</span>
<span class="k">if</span> <span class="nb">type</span><span class="p">(</span><span class="n">update_theta</span><span class="p">)</span> <span class="o">==</span> <span class="n">torch</span><span class="o">.</span><span class="n">DoubleTensor</span><span class="p">:</span>
<span class="n">update_theta</span> <span class="o">+=</span> <span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_forward</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">X_train</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">.</span><span class="n">view</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">))</span> <span class="o">-</span> <span class="bp">self</span><span class="o">.</span><span class="n">y_train</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="o">*</span> <span class="n">X</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">update_theta</span> <span class="o">=</span> <span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_forward</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">X_train</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">.</span><span class="n">view</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">))</span> <span class="o">-</span> <span class="bp">self</span><span class="o">.</span><span class="n">y_train</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="o">*</span> <span class="n">X</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
<span class="k">return</span> <span class="n">update_theta</span><span class="o">/</span><span class="n">m</span>
<span class="k">def</span> <span class="nf">batch_train</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">tolerance</span><span class="o">=</span><span class="mf">0.01</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.01</span><span class="p">):</span>
<span class="n">converged</span> <span class="o">=</span> <span class="bp">False</span>
<span class="n">prev_cost</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">cost</span><span class="p">()</span>
<span class="n">init_cost</span> <span class="o">=</span> <span class="n">prev_cost</span>
<span class="n">num_epochs</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">while</span> <span class="ow">not</span> <span class="n">converged</span><span class="p">:</span>
<span class="bp">self</span><span class="o">.</span><span class="n">Theta</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">Theta</span> <span class="o">-</span> <span class="n">alpha</span> <span class="o">*</span> <span class="bp">self</span><span class="o">.</span><span class="n">batch_update_vectorized</span><span class="p">()</span>
<span class="n">cost</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">cost</span><span class="p">()</span>
<span class="k">if</span> <span class="p">(</span><span class="n">prev_cost</span> <span class="o">-</span> <span class="n">cost</span><span class="p">)</span> <span class="o">&lt;</span> <span class="n">tolerance</span><span class="p">:</span>
<span class="n">converged</span> <span class="o">=</span> <span class="bp">True</span>
<span class="n">prev_cost</span> <span class="o">=</span> <span class="n">cost</span>
<span class="n">num_epochs</span> <span class="o">+=</span> <span class="mi">1</span>
</code></pre></div></div>
<p>From \eqref{1} above, it can be seen that each step of gradient descent requires a summation over the entire dataset of \(m\) examples. For small datasets this cost is negligible, but as the dataset grows it has a large impact on training time.</p>
<p>In such cases, it helps to first plot <a href="/2018/04/02/evaluation-of-learning-algorithm/#learning-curves">learning curves</a> to check whether training the model on such a large number of samples is actually useful. If the model has high bias, a similar result could be achieved with a smaller dataset, and it would be more helpful to increase the variance (i.e. the flexibility) of the model instead.</p>
<p>On the other hand, if the learning curves show that using the larger dataset is indeed helpful, it would be more productive to use more computationally efficient algorithms to train the model such as the ones mentioned in the following sections.</p>
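<p>The check described above could be sketched roughly as follows. This is only an illustration on synthetic data: the helper names <code>fit_line</code>, <code>mse</code> and <code>learning_curve</code>, the quadratic dataset and the subset sizes are all assumptions made for the example, and a closed-form one-feature fit stands in for the actual model.</p>

```python
import random

def fit_line(xs, ys):
    """Closed-form least squares for y = a*x + b on a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x if var_x > 0 else 0.0
    return a, mean_y - a * mean_x

def mse(params, xs, ys):
    """Halved mean squared error of the line on the given points."""
    a, b = params
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))

def learning_curve(train, val, sizes):
    """Return (size, train_error, val_error) for each training subset size."""
    rows = []
    for m in sizes:
        xs, ys = zip(*train[:m])
        params = fit_line(xs, ys)
        rows.append((m, mse(params, xs, ys), mse(params, *zip(*val))))
    return rows

rng = random.Random(0)
# Synthetic quadratic data: a straight line underfits it (high bias).
data = [(x, x * x + rng.gauss(0, 0.05))
        for x in (rng.uniform(-1, 1) for _ in range(2000))]
train, val = data[:1500], data[1500:]
curve = learning_curve(train, val, [10, 100, 1000, 1500])
for m, train_err, val_err in curve:
    print(m, round(train_err, 4), round(val_err, 4))
```

<p>If the validation error barely moves between the small and the large subsets, as one would expect for this underfitting model, adding more data is unlikely to help.</p>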
<h3 id="stochastic-gradient-descent">Stochastic Gradient Descent</h3>
<p>The gradient descent rule presented in \eqref{1}, also known as <strong>batch gradient descent</strong>, has the disadvantage that for each update the summation of update term has to be performed over all the training data.</p>
<p>Stochastic gradient descent is an approximation of batch gradient descent. Each epoch begins with a random shuffle of the data, after which the following update rule is applied to one training example at a time, for \(i = 1, \dots, m\) and for every \(j\),</p>
<script type="math/tex; mode=display">\theta_j := \theta_j - \alpha \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \tag{2} \label{2}</script>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">stochastic_train</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">tolerance</span><span class="o">=</span><span class="mf">0.01</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.01</span><span class="p">):</span>
<span class="n">converged</span> <span class="o">=</span> <span class="bp">False</span>
<span class="n">m</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">X_train</span><span class="o">.</span><span class="n">size</span><span class="p">()</span>
<span class="n">X</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_add_bias</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">X_train</span><span class="p">)</span>
<span class="n">init_cost</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">cost</span><span class="p">()</span>
<span class="n">num_epochs</span><span class="o">=</span><span class="mi">0</span>
<span class="k">while</span> <span class="ow">not</span> <span class="n">converged</span><span class="p">:</span>
<span class="n">prev_cost</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">cost</span><span class="p">()</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">m</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">Theta</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">Theta</span> <span class="o">-</span> <span class="n">alpha</span> <span class="o">*</span> <span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_forward</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">X_train</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">.</span><span class="n">view</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">))</span> <span class="o">-</span> <span class="bp">self</span><span class="o">.</span><span class="n">y_train</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="o">*</span> <span class="n">X</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
<span class="n">cost</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">cost</span><span class="p">()</span>
<span class="k">if</span> <span class="n">prev_cost</span><span class="o">-</span><span class="n">cost</span> <span class="o">&lt;</span> <span class="n">tolerance</span><span class="p">:</span>
<span class="n">converged</span><span class="o">=</span><span class="bp">True</span>
<span class="n">num_epochs</span> <span class="o">+=</span> <span class="mi">1</span>
</code></pre></div></div>
<p>i.e. for each training example, as soon as the error corresponding to that example is calculated, it is used to make an approximate update to the parameters instead of waiting for the summation over the whole dataset to finish. This is not as accurate as batch gradient descent in reaching the global minimum: with a constant learning rate the parameters generally do not settle exactly at the minimum, but keep oscillating in a region close to it.</p>
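<p>A minimal, framework-free sketch of one such epoch, including the per-epoch shuffle, might look like the following. It uses plain Python lists instead of tensors; the function name <code>sgd_epoch</code>, the synthetic data and the hyperparameters are assumptions made only for this illustration.</p>

```python
import random

def sgd_epoch(theta, X, y, alpha, rng):
    """One epoch of stochastic gradient descent for linear regression.

    Rows of X are assumed to already contain the bias term (x_0 = 1).
    """
    order = list(range(len(X)))
    rng.shuffle(order)  # visit the examples in a new random order each epoch
    for i in order:
        error = sum(t * x for t, x in zip(theta, X[i])) - y[i]
        # Update every theta_j from this single example, as in equation (2).
        theta = [t - alpha * error * x for t, x in zip(theta, X[i])]
    return theta

rng = random.Random(0)
# Noise-free data from y = 1 + 2x, with the bias column in front.
X = [[1.0, i / 50.0] for i in range(100)]
y = [1.0 + 2.0 * row[1] for row in X]

theta = [0.0, 0.0]
for _ in range(200):
    theta = sgd_epoch(theta, X, y, alpha=0.05, rng=rng)
print(theta)  # close to [1.0, 2.0] on this noise-free data
```

<p>Because the data here is noise-free, the parameters settle on the generating values; with noisy data and a constant learning rate they would instead hover around the minimum.</p>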
<blockquote>
<p>In practice, stochastic gradient descent speeds up the process of convergence over the traditional batch gradient descent.</p>
</blockquote>
<p>While the learning rate is kept constant in most implementations of stochastic gradient descent, it is observed in practice that it helps to taper off the learning rate as the iterations proceed. This can be done as follows,</p>
<script type="math/tex; mode=display">\alpha = \frac {\text{constant}_1} {\text{iteration\_number} + \text{constant}_2} \tag{3} \label{3}</script>
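<p>As a small sketch, the schedule in \eqref{3} can be written as a function of the iteration number. The name <code>decayed_alpha</code> and the default constants are assumptions chosen for illustration; in practice both constants have to be tuned.</p>

```python
def decayed_alpha(iteration, const1=1.0, const2=100.0):
    """Learning rate from equation (3): const1 / (iteration + const2)."""
    return const1 / (iteration + const2)

for it in (0, 100, 900):
    # With these constants the rate falls from 0.01 to 0.005 to 0.001.
    print(it, decayed_alpha(it))
```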
<h3 id="mini-batch-gradient-descent">Mini-Batch Gradient Descent</h3>
<p>While batch gradient descent sums over all the data for a single update of the parameters, stochastic gradient descent updates them from individual training examples as and when they are encountered. <strong>Mini-batch gradient descent</strong> takes the middle path and uses the summation over only <strong>b training examples (i.e. the batch size)</strong> for every update. If \(i\) is the index of the first example in the current batch, it can be written as follows,</p>
<script type="math/tex; mode=display">\theta_j := \theta_j - \alpha {1 \over b} \sum_{k=i}^{i+b-1} \left( h_{\theta}(x^{(k)}) - y^{(k)} \right) x_j^{(k)} \tag{4} \label{4}</script>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">mini_batch_train</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">tolerance</span><span class="o">=</span><span class="mf">0.01</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.01</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">8</span><span class="p">):</span>
<span class="n">converged</span> <span class="o">=</span> <span class="bp">False</span>
<span class="n">m</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">X_train</span><span class="o">.</span><span class="n">size</span><span class="p">()</span>
<span class="n">X</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_add_bias</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">X_train</span><span class="p">)</span>
<span class="n">init_cost</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">cost</span><span class="p">()</span>
<span class="n">num_epochs</span><span class="o">=</span><span class="mi">0</span>
<span class="k">while</span> <span class="ow">not</span> <span class="n">converged</span><span class="p">:</span>
<span class="n">prev_cost</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">cost</span><span class="p">()</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">m</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">Theta</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">Theta</span> <span class="o">-</span> <span class="n">alpha</span> <span class="o">/</span> <span class="n">batch_size</span> <span class="o">*</span> <span class="n">torch</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span>
<span class="n">X</span><span class="p">[</span><span class="n">i</span><span class="p">:</span><span class="n">i</span><span class="o">+</span><span class="n">batch_size</span><span class="p">]</span><span class="o">.</span><span class="n">transpose</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_forward</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">X_train</span><span class="p">[</span><span class="n">i</span><span class="p">:</span> <span class="n">i</span><span class="o">+</span><span class="n">batch_size</span><span class="p">])</span> <span class="o">-</span> <span class="bp">self</span><span class="o">.</span><span class="n">y_train</span><span class="p">[</span><span class="n">i</span><span class="p">:</span> <span class="n">i</span><span class="o">+</span><span class="n">batch_size</span><span class="p">]</span>
<span class="p">)</span>
<span class="n">cost</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">cost</span><span class="p">()</span>
<span class="k">if</span> <span class="n">prev_cost</span><span class="o">-</span><span class="n">cost</span> <span class="o">&lt;</span> <span class="n">tolerance</span><span class="p">:</span>
<span class="n">converged</span><span class="o">=</span><span class="bp">True</span>
<span class="n">num_epochs</span> <span class="o">+=</span> <span class="mi">1</span>
</code></pre></div></div>
<ul>
<li>
<p>Compared to stochastic gradient descent, mini-batch gradient descent will be faster only if a vectorized implementation is used for the updates.</p>
</li>
<li>
<p>Compared to batch gradient descent, mini-batch gradient descent is faster because far fewer terms have to be summed for a single update. Also, if both implementations are vectorized, mini-batch gradient descent will have lower memory usage. The speed of operations depends on the trade-off between the matrix operation complexities and memory usage.</p>
</li>
<li>
<p>Generally it is observed that mini-batch gradient descent converges faster than both stochastic and batch gradient descent.</p>
</li>
</ul>
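<p>The batching itself can be sketched without any tensor library, which makes the index handling in \eqref{4} explicit. The helper names <code>mini_batches</code> and <code>mini_batch_step</code> and the toy data are assumptions for this example. Unlike the method above, this sketch divides by the actual length of the last, possibly shorter, batch rather than by the nominal batch size; that is a design choice, not a requirement.</p>

```python
def mini_batches(m, batch_size):
    """Yield (start, stop) index pairs covering 0..m-1; the last batch may be short."""
    for start in range(0, m, batch_size):
        yield start, min(start + batch_size, m)

def mini_batch_step(theta, X, y, start, stop, alpha):
    """Apply equation (4) to rows start..stop-1, averaging over the batch length."""
    b = stop - start
    grad = [0.0] * len(theta)
    for i in range(start, stop):
        error = sum(t * x for t, x in zip(theta, X[i])) - y[i]
        for j, x in enumerate(X[i]):
            grad[j] += error * x
    return [t - alpha * g / b for t, g in zip(theta, grad)]

# Noise-free data from y = 3 - x, with the bias column in front.
X = [[1.0, i / 10.0] for i in range(20)]
y = [3.0 - row[1] for row in X]

theta = [0.0, 0.0]
for _ in range(2000):
    for start, stop in mini_batches(len(X), batch_size=8):
        theta = mini_batch_step(theta, X, y, start, stop, alpha=0.1)

print(list(mini_batches(20, 8)))  # [(0, 8), (8, 16), (16, 20)]
print(theta)  # close to [3.0, -1.0]
```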
<h3 id="online-learning">Online Learning</h3>
<p>Online learning is a form of learning used when the system receives a continuous stream of training data. It runs stochastic gradient descent indefinitely on the incoming stream, discarding each example once it has been used to update the parameters.</p>
<p>It is observed that such an online learning setting is <strong>capable of learning the changing trends</strong> of data streams.</p>
<p>Typical domains where online learning can be applied successfully include search engines (predicting the click-through rate, i.e. CTR), recommendation websites, etc.</p>
<p>Many of the listed problems can be modeled as a standard learning problem with a fixed dataset, but such data streams are often available in such abundance that there is little value in storing the data rather than implementing an online training system.</p>
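<p>The update-and-discard loop can be sketched as below. The generator <code>data_stream</code> is a hypothetical stand-in for a real source of examples, and a linear model with the update from \eqref{2} is assumed purely for illustration; a real CTR predictor would more likely use a classifier.</p>

```python
import random

def data_stream(n, rng):
    """Hypothetical data stream: yields (features, label) pairs one at a time."""
    for _ in range(n):
        x = rng.uniform(0.0, 1.0)
        yield [1.0, x], 0.5 + 1.5 * x

def online_update(theta, features, label, alpha):
    """One stochastic gradient step on a single example, as in equation (2)."""
    error = sum(t * f for t, f in zip(theta, features)) - label
    return [t - alpha * error * f for t, f in zip(theta, features)]

rng = random.Random(0)
theta = [0.0, 0.0]
for features, label in data_stream(20000, rng):
    theta = online_update(theta, features, label, alpha=0.1)
    # The example is not stored anywhere; it is dropped after this update.
print(theta)  # close to [0.5, 1.5]
```

<p>Because the parameters keep being updated, a change in how the stream is generated would gradually pull them towards the new relationship.</p>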
<h3 id="map-reduce-and-parallelism">Map Reduce and Parallelism</h3>
<p>Map-Reduce is a technique used in large scale learning when a single system is not enough to train the models required. Under this training paradigm, all the <strong>summation operations are parallelized over a set of slave systems by splitting the training data</strong> (a batch or the entire set) across the systems, each of which computes on its smaller share of the data and feeds its result to the <strong>master system that aggregates the results</strong> from all the slaves and combines them. This parallelized implementation speeds up the algorithm.</p>
<p>If the network latencies are not high, one can expect a speed-up of up to \(n\) times by using a pool of \(n\) systems. In practice, because the systems communicate over a network, the speed-up is somewhat less than \(n\) times.</p>
<blockquote>
<p>Algorithms that can be expressed as a summation over the training sets can be parallelized using map-reduce.</p>
</blockquote>
<p>Besides a pool of computers, parallelization also works on multi-core machines, with the added benefit of near-zero network latency, which makes it faster.</p>
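<p>The map and reduce steps for the batch gradient can be sketched on a single machine as follows. The names <code>partial_gradient</code> and <code>map_reduce_gradient</code> are assumptions for this example, and a thread pool from the standard library stands in for the pool of systems. It only illustrates the structure: for pure-Python arithmetic, threads will not give a real speed-up, and a process pool or separate machines would be needed for that.</p>

```python
from concurrent.futures import ThreadPoolExecutor

def partial_gradient(theta, X_part, y_part):
    """Map step: sum of (h(x) - y) * x over one shard of the data."""
    grad = [0.0] * len(theta)
    for row, target in zip(X_part, y_part):
        error = sum(t * x for t, x in zip(theta, row)) - target
        for j, x in enumerate(row):
            grad[j] += error * x
    return grad

def map_reduce_gradient(theta, X, y, n_workers):
    """Split the data into n_workers shards, map partial sums, then add them up."""
    m = len(X)
    bounds = [(k * m // n_workers, (k + 1) * m // n_workers)
              for k in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(
            lambda b: partial_gradient(theta, X[b[0]:b[1]], y[b[0]:b[1]]),
            bounds))
    # Reduce step: add the partial sums and divide by m, as in the batch rule.
    return [sum(p[j] for p in parts) / m for j in range(len(theta))]

X = [[1.0, float(i)] for i in range(10)]
y = [2.0 * row[1] for row in X]
theta = [0.0, 0.0]

parallel = map_reduce_gradient(theta, X, y, n_workers=4)
serial = [g / len(X) for g in partial_gradient(theta, X, y)]
print(parallel, serial)  # both are [-9.0, -57.0]
```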
<h2 id="references">REFERENCES:</h2>
<p><small><a href="https://www.coursera.org/learn/machine-learning/lecture/CipHf/learning-with-large-datasets" target="_blank">Machine Learning: Coursera - Learning with Large Dataset</a></small><br />
<small><a href="https://www.coursera.org/learn/machine-learning/lecture/DoRHJ/stochastic-gradient-descent" target="_blank">Machine Learning: Coursera - Stochastic Gradient Descent</a></small><br />
<small><a href="https://www.coursera.org/learn/machine-learning/lecture/9zJUs/mini-batch-gradient-descent" target="_blank">Machine Learning: Coursera - Mini-Batch Gradient Descent</a></small><br />
<small><a href="https://www.coursera.org/learn/machine-learning/lecture/fKi0M/stochastic-gradient-descent-convergence" target="_blank">Machine Learning: Coursera - Convergence of Stochastic Gradient Descent</a></small><br />
<small><a href="https://www.coursera.org/learn/machine-learning/lecture/ABO2q/online-learning" target="_blank">Machine Learning: Coursera - Online Learning</a></small><br />
<small><a href="https://www.coursera.org/learn/machine-learning/lecture/10sqI/map-reduce-and-data-parallelism" target="_blank">Machine Learning: Coursera - Map Reduce and Data Parallelism</a></small></p>
Fri, 22 Jun 2018 00:00:00 +0000
https://notes.pairml.com/2018/06/22/large-scale-learning/
https://notes.pairml.com/2018/06/22/large-scale-learning/machine-learningandrew-ngbasics-of-machine-learning