Machine Learning Up to Date #24

Image based on Clip Art by Vector Toons, [CC BY-SA 4.0], via Wikimedia Commons [Source]

Here's ML UTD #24 from the LifeWithData blog! We help you separate the signal from the noise on today's hectic front lines of software engineering and machine learning.

LifeWithData strives to deliver curated machine learning & software engineering updates that point the reader to key developments without superfluous details. This enables frequent, concise updates across the industry without information overload.



Navigating ML Deployment

We often think of ‘deployment’ as packaging software into an artifact and moving it to an environment to run on. For machine learning, it can be better to think of deployment as “the action of bringing resources into effective action” (one of Oxford’s definitions of ‘deployment’). There is a range of patterns for using ML models to make business decisions. Deploying machine learning models can mean different things, depending on the context. Understanding the key prediction-making patterns can help you decide which tools apply to your use case.
... keep reading
The Rundown

Upskilling Analysts

Erika's tweet [Source]
A few weeks ago, I got curious about how organizations can intentionally retrain analysts for data science roles. This post is the result of a few conversations with data leaders who have been there, done that, and my own research on the topic.
... keep reading
The Rundown

How should our company structure our data team?

[Source]
Three years ago, the data team at Snaptravel started with a software-engineer convert interested in helping the company make more data-driven decisions. At the time, the company’s mantra was to move fast, so refined decision-making with data took a back seat to shipping new features. The structure of our data team (one software engineer who occasionally had to build front-end applications) was perfectly aligned with the size and needs of a young seed-funded startup. [...] As we’ve grown, we’ve optimized our team’s organizational structure to reduce communication overhead while maximizing context in two areas: between various skill sets on the data team, and between our team members and the rest of the organization. We’ve adopted numerous frameworks along the way, which will be described below. We’ve also made lots of mistakes while we tried to scale: pervasive meetings, too many decision makers in those meetings, and different people coding the same metric in different ways.
... keep reading
The Rundown

Algorithms for Causal Reasoning in Probability Trees

The counterfactual probability tree generated by imposing Y ← 1, given the factual premise Z = 1 [Source]
Probability trees are one of the simplest models of causal generative processes. They possess clean semantics and, unlike causal Bayesian networks, can represent context-specific causal dependencies, which are necessary for, e.g., causal induction. Yet they have received little attention from the AI and ML community. Here we present concrete algorithms for causal reasoning in discrete probability trees that cover the entire causal hierarchy (association, intervention, and counterfactuals), and operate on arbitrary propositional and causal events. Our work expands the domain of causal reasoning to a very general class of discrete stochastic processes.
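
To make the first two rungs of that hierarchy concrete, here is a minimal Python sketch of a discrete probability tree. It is not the paper's algorithm: the `Node` layout, the function names, and the two-variable example tree (X is drawn first, then Y depends on X) are all assumptions made for illustration, and counterfactuals are left out. Conditioning reweights root-to-leaf paths, while an intervention prunes every branching on the target variable and renormalizes what is left.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a discrete probability tree (hypothetical minimal layout)."""
    assign: dict = field(default_factory=dict)    # variables set on entering this node
    children: list = field(default_factory=list)  # list of (probability, Node) pairs


def leaves(node, prob=1.0, state=None):
    """Yield (full assignment, path probability) for every root-to-leaf path."""
    state = {**(state or {}), **node.assign}
    if not node.children:
        yield state, prob
        return
    for p, child in node.children:
        yield from leaves(child, prob * p, state)


def probability(root, event):
    """P(event): total probability of the paths on which the event holds."""
    return sum(p for s, p in leaves(root) if event(s))


def conditional(root, event, given):
    """Association: P(event | given) = P(event and given) / P(given)."""
    joint = probability(root, lambda s: event(s) and given(s))
    return joint / probability(root, given)


def intervene(node, var, value):
    """Intervention var <- value: return a new tree in which every branching
    on `var` keeps only the children with var == value, renormalized."""
    children = node.children
    if any(var in child.assign for _, child in children):
        children = [(p, c) for p, c in children if c.assign.get(var) == value]
        total = sum(p for p, _ in children)
        if total == 0:
            raise ValueError(f"no branch sets {var} = {value}")
        children = [(p / total, c) for p, c in children]
    return Node(dict(node.assign),
                [(p, intervene(c, var, value)) for p, c in children])


# Illustrative tree (made-up numbers): X is drawn first, then Y depends on X.
tree = Node(children=[
    (0.5, Node({"X": 0}, [(0.9, Node({"Y": 0})), (0.1, Node({"Y": 1}))])),
    (0.5, Node({"X": 1}, [(0.2, Node({"Y": 0})), (0.8, Node({"Y": 1}))])),
])


def x_is_1(s):
    return s["X"] == 1


def y_is_1(s):
    return s["Y"] == 1


print("P(X=1 | Y=1)   =", conditional(tree, x_is_1, y_is_1))
print("P(X=1 | Y<-1)  =", probability(intervene(tree, "Y", 1), x_is_1))
```

Observing Y = 1 raises the probability of X = 1 well above its prior of 0.5, because Y = 1 is more likely on the X = 1 branch. Forcing Y to 1 leaves the probability of X = 1 at 0.5, because the intervention only changes the branchings that set Y, which lie downstream of X in this tree.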
... keep reading
The Rundown

Rethinking Attention with Performers

Standard attention compared to a low-rank decomposition of the attention matrix [Source]
Transformer models have achieved state-of-the-art results across a diverse range of domains, including natural language, conversation, images, and even music. The core block of every Transformer architecture is the attention module, which computes similarity scores for all pairs of positions in an input sequence. This, however, scales poorly with the length of the input sequence, requiring quadratic computation time to produce all similarity scores, as well as quadratic memory size to construct a matrix to store these scores. [...] To resolve these issues, we introduce the Performer, a Transformer architecture with attention mechanisms that scale linearly, thus enabling faster training while allowing the model to process longer lengths, as required for certain image datasets such as ImageNet64 and text datasets such as PG-19. The Performer uses an efficient (linear) generalized attention framework, which allows a broad class of attention mechanisms based on different similarity measures (kernels). The framework is implemented by our novel Fast Attention Via Positive Orthogonal Random Features (FAVOR+) algorithm, which provides scalable low-variance and unbiased estimation of attention mechanisms that can be expressed by random feature map decompositions (in particular, regular softmax-attention). We obtain strong accuracy guarantees for this method while preserving linear space and time complexity, which can also be applied to standalone softmax operations.
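
The idea of replacing the quadratic attention matrix with a random feature map can be sketched in a few lines of plain Python. This is a simplified illustration, not the FAVOR+ implementation: it draws i.i.d. Gaussian projections instead of orthogonal ones, omits the usual 1/sqrt(d) scaling and any numerical stabilization, and all function names and toy inputs are assumptions. It relies on the identity E[phi(q) . phi(k)] = exp(q . k) for phi(x)_i = exp(w_i . x - |x|^2 / 2) / sqrt(m) with w_i drawn from a standard normal.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax_attention(Q, K, V):
    """Exact attention with kernel exp(q . k): builds all L x L scores."""
    out = []
    for q in Q:
        weights = [math.exp(dot(q, k)) for k in K]
        z = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / z
                    for j in range(len(V[0]))])
    return out


def positive_features(x, omegas):
    """Positive random features: phi(x)_i = exp(w_i . x - |x|^2 / 2) / sqrt(m)."""
    half_sq = 0.5 * dot(x, x)
    scale = 1.0 / math.sqrt(len(omegas))
    return [scale * math.exp(dot(w, x) - half_sq) for w in omegas]


def linear_attention(Q, K, V, omegas):
    """Approximate attention without ever forming the L x L matrix.

    Keys and values are summarized once in an (m x d_v) matrix and an
    m-vector; each query then costs O(m * d_v), so total work is linear in L.
    """
    m, dv = len(omegas), len(V[0])
    kv = [[0.0] * dv for _ in range(m)]  # sum over positions of phi(k) v^T
    ksum = [0.0] * m                     # sum over positions of phi(k)
    for k, v in zip(K, V):
        for i, f in enumerate(positive_features(k, omegas)):
            ksum[i] += f
            for j in range(dv):
                kv[i][j] += f * v[j]
    out = []
    for q in Q:
        fq = positive_features(q, omegas)
        norm = dot(fq, ksum)             # approximates the softmax denominator
        out.append([sum(fq[i] * kv[i][j] for i in range(m)) / norm
                    for j in range(dv)])
    return out


# Toy inputs (made up): 3 positions, 2-dimensional queries, keys and values.
rng = random.Random(0)
omegas = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(2000)]
Q = [[0.1, -0.2], [0.3, 0.1], [-0.2, 0.2]]
K = [[0.2, 0.1], [-0.1, 0.3], [0.0, -0.3]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

exact = softmax_attention(Q, K, V)
approx = linear_attention(Q, K, V, omegas)
for e_row, a_row in zip(exact, approx):
    print([round(x, 3) for x in e_row], [round(x, 3) for x in a_row])
```

Because the features are strictly positive, every approximate output row is still a convex combination of the value rows, which is one reason positive features behave better than trigonometric ones when attention weights are small.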
... keep reading

A Bayesian Perspective on Q-Learning

Interactively visualizing Q's distribution across learning parameterizations [Source]
The purpose of this article is to clearly explain Q-Learning from the perspective of a Bayesian. As such, we use a small grid world and a simple extension of tabular Q-Learning to illustrate the fundamentals. Specifically, we show how to extend the deterministic Q-Learning algorithm to model the variance of Q-values with Bayes' rule. We focus on a sub-class of problems where it is reasonable to assume that Q-values are normally distributed and derive insights when this assumption holds true. Lastly, we demonstrate that applying Bayes' rule to update Q-values comes with a challenge: it is vulnerable to early exploitation of suboptimal policies. This article is largely based on the seminal work from Dearden et al. Specifically, we expand on the assumption that Q-values are normally distributed and evaluate various Bayesian exploration policies. One key distinction is that we model μ and σ², while the authors of the original Bayesian Q-Learning paper model a distribution over these parameters. This allows them to quantify uncertainty in their parameters as well as the expected return; we only focus on the latter.
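
As a rough illustration of tracking μ and σ² per state-action pair, the sketch below applies the standard conjugate update for a normal prior and a normal observation with known variance. It is an assumption-laden sketch rather than the article's exact update rule: the `NormalQ` class, the function names, the independence assumption in `td_target`, and the numbers in the demo are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class NormalQ:
    """Belief about one Q-value, modeled as a normal N(mu, var)."""
    mu: float = 0.0
    var: float = 1.0


def td_target(reward, next_q, gamma, reward_var=0.0):
    """Mean and variance of r + gamma * Q(s', a'), assuming the reward noise
    and the next Q-value are independent normals."""
    return reward + gamma * next_q.mu, reward_var + gamma ** 2 * next_q.var


def bayes_update(prior, target_mu, target_var):
    """Normal-normal Bayes update: precisions add, and the posterior mean is
    the precision-weighted average of prior mean and target mean."""
    post_var = 1.0 / (1.0 / prior.var + 1.0 / target_var)
    post_mu = post_var * (prior.mu / prior.var + target_mu / target_var)
    return NormalQ(post_mu, post_var)


def sample_action(q_row, rng):
    """Q-value sampling: draw one value per action from its belief and act
    greedily on the draws, so uncertain actions still get tried."""
    draws = [rng.gauss(q.mu, q.var ** 0.5) for q in q_row]
    return max(range(len(draws)), key=draws.__getitem__)


# Demo with made-up numbers: repeatedly update one belief toward the same target.
belief = NormalQ(mu=0.0, var=1.0)
next_belief = NormalQ(mu=2.0, var=4.0)
history = [belief]
for _ in range(5):
    mu, var = td_target(reward=1.0, next_q=next_belief, gamma=0.5)
    belief = bayes_update(belief, mu, var)
    history.append(belief)

for step, b in enumerate(history):
    print(step, round(b.mu, 3), round(b.var, 3))

rng = random.Random(0)
print("sampled action:", sample_action([NormalQ(0.0, 1.0), belief], rng))
```

Under this update the variance shrinks on every step regardless of how good the target is, which gives some intuition for the early-exploitation problem the article describes: an action can look confidently good before its alternatives have been explored.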
... keep reading