If you are interested in increasing your arm size as efficiently as possible, four tips from scientific studies point the way to maximizing your gains!

Photo by Karsten Winegeart on Unsplash

1 — Train your arms first.
While conventional workouts typically train the larger muscles first and the smaller muscles last, multiple studies have shown that the muscles worked first get worked hardest.
In particular, a study in Brazil compared men training arms first (biceps, triceps) against training arms last (bench press, etc. first).

The results were quite clear. …


Short summary of how the US-approved mRNA Covid vaccines work (Pfizer, Moderna)

Through my work in Covid-19-related AI, I frequently get asked about the various Covid vaccines, both approved and pending. Thus, I’m writing this article to provide a useful summary of the current mRNA vaccines: how they work, and how and why they have been shown to be safe. I will likely add another article on what’s pending and what’s potentially coming for vaccines in 2021.

US-approved vaccines — Pfizer and Moderna — mRNA platform:

The basic path from mRNA to immune response (original image by JMarchn, Wikimedia) with my crude path additions. The mRNA is introduced via injection (purple circle at top) and picked up by dendritic cells, which read the message and produce Covid-19 spike proteins on their surface. Immune cells then learn about the spike (light blue boxes), generating a prepared memory for future (actual) Covid-19 challenges.

The only two approved vaccines at the time of this writing…


AdaMod is a new deep learning optimizer that builds on Adam but adds an automatic warmup heuristic and long-term learning rate buffering. From initial testing, AdaMod is a top-5 optimizer that readily matches or exceeds vanilla Adam, while being much less sensitive to the learning rate hyperparameter, producing a smoother training curve, and requiring no warmup mode.

AdaMod converges to the same point even with learning rates differing by up to two orders of magnitude, whereas SGDM and Adam end up with different results.

AdaMod is proposed in the paper “An Adaptive and Momental Bound Method for Stochastic Learning” by Ding, Ren, Lou and Sun.

How AdaMod works: AdaMod maintains an exponential, long term average of the adaptive learning rates themselves, and uses that to clip any…
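
To make that concrete, here is a minimal NumPy sketch of the mechanism as I read it from the paper (the function name, variable names, and the third beta value are my own illustration, not the authors’ code): Adam’s per-parameter step sizes are themselves smoothed with an extra exponential average, and the current step size is clipped so it never exceeds that average.

import numpy as np

def adamod_step(param, grad, state, lr=1e-3, betas=(0.9, 0.999, 0.999), eps=1e-8):
    # One AdaMod-style update (illustrative sketch, not the official implementation).
    beta1, beta2, beta3 = betas  # beta3 value here is illustrative
    state['t'] += 1
    t = state['t']
    # Standard Adam first and second moments, with bias correction.
    state['m'] = beta1 * state['m'] + (1 - beta1) * grad
    state['v'] = beta2 * state['v'] + (1 - beta2) * grad ** 2
    m_hat = state['m'] / (1 - beta1 ** t)
    v_hat = state['v'] / (1 - beta2 ** t)
    # Per-parameter Adam step size.
    step_size = lr / (np.sqrt(v_hat) + eps)
    # Long-term exponential average of the step sizes themselves.
    state['s'] = beta3 * state['s'] + (1 - beta3) * step_size
    # Clip the current step size by that average, then apply the update.
    step_size = np.minimum(step_size, state['s'])
    return param - step_size * m_hat

# state is initialized once per parameter, e.g. {'t': 0, 'm': 0.0, 'v': 0.0, 's': 0.0};
# since 's' starts at zero, early steps are automatically kept small (the built-in warmup effect).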


Example of short term gradient changes on the way to the global optimum (center). Image from paper.

DiffGrad, a new optimizer introduced in the paper “diffGrad: An optimizer for CNNs” by Dubey et al., builds on the proven Adam optimizer by adding an adaptive ‘friction clamp’ that monitors the local change in gradients in order to automatically lock in optimal parameter values that Adam can skip over.
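
As a rough illustration (my own NumPy sketch of my reading of the paper, not the authors’ code): the Adam step is scaled by a sigmoid of the absolute change in the gradient since the previous step, so the update shrinks where gradients have stopped changing and stays near full size where they are still moving.

import numpy as np

def diffgrad_step(param, grad, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    # One diffGrad-style update (illustrative sketch).
    beta1, beta2 = betas
    state['t'] += 1
    t = state['t']
    state['m'] = beta1 * state['m'] + (1 - beta1) * grad
    state['v'] = beta2 * state['v'] + (1 - beta2) * grad ** 2
    m_hat = state['m'] / (1 - beta1 ** t)
    v_hat = state['v'] / (1 - beta2 ** t)
    # Friction clamp: sigmoid of the absolute local gradient change.
    # Small change (flat region, near an optimum) -> ~0.5, damping the step;
    # large change -> ~1.0, leaving the Adam step essentially untouched.
    friction = 1.0 / (1.0 + np.exp(-np.abs(state['prev_grad'] - grad)))
    state['prev_grad'] = grad
    return param - lr * friction * m_hat / (np.sqrt(v_hat) + eps)

# state is initialized once per parameter, e.g. {'t': 0, 'm': 0.0, 'v': 0.0, 'prev_grad': 0.0}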


CPC 2.0 in action — with only 1% of labeled data, achieves 74% accuracy (from the paper)

Current Deep Learning for vision, audio, etc. requires vast amounts of human-labeled data, with many examples of each category, to properly train a classifier to acceptable accuracy.

By contrast, humans only need to see a few examples of a class to begin accurately recognizing and classifying future examples of that class.

The difference is that humans are able to quickly generate accurate mental ‘representations’ of things, and then use those representations to flexibly account for future variations. After seeing a few images of bluejays, for example, we can make a mental model or representation of a bluejay…
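
CPC-style methods learn such representations with a contrastive objective: given a batch, each predicted representation must identify its own matching target among all the others. A toy PyTorch sketch of an InfoNCE-style loss (my illustration of the general idea, not the paper’s exact objective):

import torch
import torch.nn.functional as F

def info_nce_loss(predicted, targets, temperature=0.1):
    # Each predicted representation should match its own target (the diagonal)
    # better than any other target in the batch.
    predicted = F.normalize(predicted, dim=-1)   # (batch, dim)
    targets = F.normalize(targets, dim=-1)       # (batch, dim)
    logits = predicted @ targets.t() / temperature
    labels = torch.arange(predicted.size(0), device=predicted.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)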


Summary: By replacing single convolutional kernels with a mixed grouping of 3x3–9x9 kernels, plus a neural-architecture-searched ‘MixNet’ architecture, a new state-of-the-art 78.9% ImageNet top-1 accuracy is achieved under standard mobile metrics. MixNet-L outperforms ResNet-152 with 8x fewer params, and MixNet-M matches it exactly but with 12x fewer params and 31x fewer FLOPS.

Tan and Le of Google Brain recently showcased a new depthwise convolutional kernel arrangement (MixConv) and a new NN architecture optimized for efficiency and accuracy using MixConvs in their paper: Mixed depthwise convolutional kernels.

This article will summarize the architecture of MixConvs…
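
For intuition, here is a small PyTorch sketch of the core MixConv idea (my own simplified module, not the official implementation): channels are split into groups, each group gets its own depthwise kernel size, and the outputs are concatenated back together.

import torch
import torch.nn as nn

class MixConvSketch(nn.Module):
    # Simplified MixConv-style layer: split channels into groups, give each
    # group its own depthwise kernel size, then concatenate the outputs.
    def __init__(self, channels, kernel_sizes=(3, 5, 7, 9)):
        super().__init__()
        splits = [channels // len(kernel_sizes)] * len(kernel_sizes)
        splits[0] += channels - sum(splits)  # fold any remainder into the first group
        self.splits = splits
        self.convs = nn.ModuleList(
            nn.Conv2d(c, c, k, padding=k // 2, groups=c)  # depthwise conv per group
            for c, k in zip(splits, kernel_sizes)
        )

    def forward(self, x):
        chunks = torch.split(x, self.splits, dim=1)
        return torch.cat([conv(chunk) for conv, chunk in zip(self.convs, chunks)], dim=1)

# MixConvSketch(64)(torch.randn(1, 64, 32, 32)).shape  ->  torch.Size([1, 64, 32, 32])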


Hinton, Müller, and Kornblith from Google Brain released a new paper titled “When does label smoothing help?” that dives deep into how label smoothing affects the final activation layer of deep neural networks. They built a new visualization method to clarify the internal effects of label smoothing and provide new insight into how it works internally. While label smoothing is often used, this paper explains why and how label smoothing affects NNs, and offers valuable insight into when, and when not, to use it.

From the paper — label smoothing providing improvements in a wide range of deep learning models.

This article is a summary of the paper’s insights to help…
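
For reference, the mechanics of label smoothing itself are simple: the one-hot target is mixed with a uniform distribution over the classes. A minimal PyTorch sketch of that standard formulation (my own code, not taken from the paper):

import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, targets, alpha=0.1):
    # Cross entropy against smoothed targets: the true class gets (1 - alpha)
    # of the probability mass plus its share of the uniform alpha / K term.
    num_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    smooth = torch.full_like(log_probs, alpha / num_classes)
    smooth.scatter_(-1, targets.unsqueeze(-1), 1.0 - alpha + alpha / num_classes)
    return -(smooth * log_probs).sum(dim=-1).mean()

# loss = label_smoothing_loss(model(images), class_indices, alpha=0.1)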


An executive summary of some timeless best practices for designing your GraphQL APIs, excerpted from GraphQL co-founder Lee Byron’s presentation at GraphQL Summit — Lessons from 4 years of GraphQL.

(The full presentation is excellent and available on YouTube here: https://www.youtube.com/watch?v=zVNrqo9XGOs).

In this presentation, Lee outlined some best practices, based on four years of experience, for the key questions to work through when designing your GraphQL interface or APIs.

Even with the additional time that has passed since that presentation, these remain ‘best practices’ imo, with valuable insights. Thus I felt it might be helpful to summarize them and make them available…


TL;DR = your previous NLP models are parameter inefficient and kind of obsolete. Have a great day.

[*Updated November 6 with Albert 2.0 and official source code release]

Google Research and Toyota Technological Institute jointly released a new paper that introduces the world to what is arguably BERT’s successor, a much smaller/smarter Lite BERT called ALBERT (“ALBERT: A Lite BERT for Self-supervised Learning of Language Representations”).

ALBERT setting new SOTA for SQuAD and RACE testing, and beating BERT by +14.5% on RACE….but wait until you compare parameter sizes below. (1M and 1.5M refer to training steps used).

ALBERT’s results are impressive in themselves (setting a new state of the art for GLUE, RACE, and SQuAD), but… the real surprise is the dramatic reduction in model/parameter size.

A combination…
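
One technique ALBERT is known to use for that reduction is a factorized embedding parameterization: rather than one huge vocabulary-by-hidden embedding matrix, it learns a small embedding plus a projection up to the hidden size. A rough PyTorch sketch with toy sizes (my own numbers, not the paper’s exact configurations):

import torch.nn as nn

vocab_size, hidden_size, embed_size = 30000, 768, 128  # toy sizes for illustration

# BERT-style: one large vocabulary-to-hidden embedding matrix.
bert_style_embedding = nn.Embedding(vocab_size, hidden_size)        # ~23.0M parameters

# ALBERT-style: a small embedding factorized with a projection to hidden size.
albert_embedding = nn.Embedding(vocab_size, embed_size)             # ~3.8M parameters
albert_projection = nn.Linear(embed_size, hidden_size, bias=False)  # ~0.1M parameters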


As Google Brain’s EfficientNet paper showed, there are rapidly diminishing returns on investment for scaling up various aspects of CNN architectures (width, depth, resolution).

A new paper by Gao, Cheng, Zhao et al. (Res2Net: a new multi-scale backbone architecture), however, shows that multi-scale processing within a given block, rather than the usual layer-by-layer scaling, is an unexplored domain with additional payoffs, especially for object recognition and segmentation.

Most architectures leverage scale on a layer-by-layer basis. …
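
As a rough sketch of what ‘scale within a block’ looks like (my own simplified PyTorch module, not the authors’ full bottleneck block): the feature map is split into groups, and each 3x3 conv also receives the previous group’s output, so later groups see progressively larger receptive fields inside a single block.

import torch
import torch.nn as nn

class Res2NetSplitSketch(nn.Module):
    # Simplified Res2Net-style multi-scale unit: split the channels, pass the
    # first group through untouched, and let each later 3x3 conv also see the
    # previous group's output before everything is concatenated back together.
    def __init__(self, channels, scales=4):
        super().__init__()
        assert channels % scales == 0
        self.width = channels // scales
        self.convs = nn.ModuleList(
            nn.Conv2d(self.width, self.width, 3, padding=1) for _ in range(scales - 1)
        )

    def forward(self, x):
        chunks = torch.split(x, self.width, dim=1)
        outputs, previous = [chunks[0]], None
        for conv, chunk in zip(self.convs, chunks[1:]):
            previous = conv(chunk if previous is None else chunk + previous)
            outputs.append(previous)
        return torch.cat(outputs, dim=1)

# Res2NetSplitSketch(64)(torch.randn(1, 64, 32, 32)).shape  ->  torch.Size([1, 64, 32, 32])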

Less Wright

PyTorch, Deep Learning, Object detection, Stock Index investing and long term compounding.
