Research Triangle Analysts Analytics> Forward 2019

According to their website, Research Triangle Analysts (RTA) is a "math user" group for data enthusiasts in the Research Triangle. Their annual bar-camp-style conference, Analytics> Forward 2019, was great. I didn't find out about it soon enough to prepare a talk, but I learned some useful things from the presentations I attended. There was minimal organization: no set schedule, no abstract reviews, no published proceedings, no website, etc. However, for only $12 and a short drive, it's hard to argue with the efficiency of this type of conference. And since the attendees voted directly on the 1-minute pitches, it's a more democratic form of review than an abstract committee, albeit perhaps less effective at identifying the best or worst talks, since a pitch didn't necessarily reflect the contents of the full 50-minute presentation.

Federated Machine Learning

Ian Cook, employed by Cloudera, presented a neat idea for federated ML models, as applied to IoT devices, industrial equipment maintenance, and medical devices. By sending only the models from a centralized server out to many devices, less bandwidth is required, and less personal data needs to be sent back (only model parameters, not the raw data). He presented a prototype developed by his coworkers, TurboFan Tycoon, which compared four maintenance strategies: corrective (post-failure), preventative (fixed schedule), predictive (local modeling), and federated (shared global modeling).
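The federated pattern he described can be sketched as a toy FedAvg-style loop (all names and numbers here are illustrative, not taken from the TurboFan Tycoon prototype): each device trains a small model on its own private data, and only the fitted weights, never the data, go back to the server, which averages them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth each device's data is generated from.
true_w = np.array([2.0, -1.0])

def local_update(w, n=200, lr=0.1, steps=20):
    """One round of local gradient descent on a device's private data.

    The raw (X, y) never leaves this function -- only the weights do.
    """
    X = rng.normal(size=(n, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=n)
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / n  # gradient of mean squared error
        w = w - lr * grad
    return w

w_global = np.zeros(2)
for _round in range(5):
    # Each of 10 devices trains locally; the server averages the models.
    client_weights = [local_update(w_global.copy()) for _ in range(10)]
    w_global = np.mean(client_weights, axis=0)

print(np.round(w_global, 2))
```

Averaging weights instead of pooling data is what saves bandwidth; my skepticism in the next paragraph is about whether the weights themselves leak information.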

I previously worked on a torque-tool project that compared fixed-schedule maintenance against predictive maintenance, so I found this idea appealing. However, I heavily discounted the privacy claims, because the potential for re-identifying patients in (supposedly anonymized) medical trials has been demonstrated repeatedly. There is too much information left in a trained model, especially in high-entropy models like CNNs, for me to believe that much privacy is gained by sending coefficients instead of raw data. Another attendee suggested looking into Dobra and Feinberg's 2010 Generalized Shuffle Algorithm for examples of how hard anonymity is to establish in ML models.

Introduction to Autoencoders & IoT Analytics

Scott N. Gerard presented an elder-care application of autoencoders: compressing the data from 30 sensors in an assisted living facility down to a 10-bit code. It was a decent review of autoencoders, which I've used before.

Statistical inference with Buzz and Doris

A famous study from the 1960s explored whether two dolphins (Doris and Buzz) could communicate abstract ideas. The talk was essentially an intro-to-statistics lecture recycled from a university class, but it was still nice to review the basics and hear how p-hacking can be addressed: pre-registration of clinical trials, a commitment to publish regardless of outcome, and reporting more statistics than just a p-value.
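As a refresher on the kind of inference involved, here is an exact binomial test (the trial counts are made up for illustration, not taken from the actual study): if Buzz picks the correct lever far more often than chance, we can quantify how surprising that is under a guessing null.

```python
from math import comb

# Hypothetical numbers: suppose Buzz pressed the correct lever on 15 of
# 16 trials. Under the null hypothesis that he is guessing (p = 0.5),
# the one-sided p-value is P(X >= 15) for X ~ Binomial(16, 0.5).
n, k = 16, 15
p_value = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
print(p_value)  # 17 / 65536, about 0.00026
```

Of course, a small p-value on one statistic is exactly the sort of result the talk's p-hacking caveats apply to, hence the emphasis on pre-registration and reporting more than a single number.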

Keynote: Kaggle in the Real World

The winner of a Zillow mortgage contest explained some of his best ideas for how to win at Kaggle:

- Use Docker containers to make your work reusable and reproducible.
- Throw away outliers at any stage, including pre-processing.
- Use cross-validation aggressively.
- Discard unnecessary features: feed bad data into a feature and check whether the model suffers; if not, drop it.
- Use mean encoding for categorical data (instead of one-hot or label encoding) to avoid curse-of-dimensionality problems.
- Blend several models to create an ensemble of specialists (see "Stacking Made Easy").
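Mean (target) encoding in particular is easy to sketch: each category is replaced by the mean of the target over the rows carrying that category, yielding a single numeric column instead of one column per category. The data below is a toy example; in practice the means should be computed on training folds only, to avoid target leakage.

```python
from collections import defaultdict

# Toy categorical feature and binary target.
categories = ["a", "b", "a", "c", "b", "a"]
target = [1, 0, 0, 1, 1, 1]

# Accumulate per-category target sums and counts.
sums = defaultdict(float)
counts = defaultdict(int)
for cat, y in zip(categories, target):
    sums[cat] += y
    counts[cat] += 1
means = {cat: sums[cat] / counts[cat] for cat in sums}

# Each row's category becomes that category's mean target value.
encoded = [means[cat] for cat in categories]
print(means)  # {'a': 0.666..., 'b': 0.5, 'c': 1.0}
```

One column regardless of cardinality is the payoff over one-hot encoding; the leakage risk is the price.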

Flood data from Hurricane Florence

James McManus showed some overlays of flood data on real estate parcels.

Note to myself: look up GISCafe newsletter.

NCDOT Drone response for emergencies

Note to myself: look up NC Drone Summit & Flight Expo (annual conference).
