Presentation idea – Using Columnstore Indexes to Store and Analyze Billions of Torques

This is an abstract I wrote for a local area users group, only to find out that they had stopped meeting. Toledo doesn’t seem to be able to keep many computer-based user interest groups alive. I’m posting it here for my own records.
Torque tools (impact wrenches, electronic motors, rotary torque sensors) produce lots of data as a vehicle is being built. Torque data is used to control the movement of the production lines and to do quality control analysis. During production, traditional row-based storage makes the most sense, using the VIN and torque tool ID as the primary key in a table optimized for insertion speed and VIN-based lookups. However, post-production it makes more sense to switch to column-based storage for quality control analysis because aggregation functions (mean, std. dev.) need to be computed over large ranges of data (regardless of VIN), and many torque values are similar. Columnstore indexes are a good solution to the latter storage and analysis problem, and they have gotten much better in SQL Server 2016. I’ll describe how they’re useful for this problem for analyzing how well the torque tools are working using simple statistical techniques.

Thoughts on Project Euler in Python

I’ve been doing Project Euler problems to learn more about Python 3.5+ (as opposed to 2.7). The Project Euler website says, “Real learning is an active process and seeing how it is done is a long way from experiencing that epiphany of discovery. Please do not deny others what you have so richly valued yourself.” So I’m not supposed to post full solutions or even hints. However, I can make general comments that are not specific to any Project Euler problems.

I find it very helpful to have the following tools ready to go before doing any Project Euler problems: a list of primes up to 10**9 or so, a list of primitive Pythagorean triplets up to 10**6, and the latest version of a big integer library. Much of Project Euler is based on prime numbers, so I used primegen to create a text file with lots of primes in it. Reimplementing the Sieve of Eratosthenes gets really boring after a while, and the Sieve of Atkin is much better anyway.

Pythagorean triplets show up occasionally, so I have a static text file with primitive triplets up to 10**6. I used an existing pythagorean triplets generator, but the matrix formula isn’t too terrible to implement.

Python 3 has much improved support arbitrary length integers over Python 2. The int datatype is gone, replaced by long, then unified back to int again. The only real problem is remembering the change in the integer division operator, but I’ve used “from future import __division__” practically everywhere.

Generators and comprehensions have big improvements in Python 3+. I really like them for using a functional programming style, and they can improve program speed and reduce memory usage a bit.

Finally, regardless of language, I always try to review some common strategies from dynamic programming and graph search algorithms. It’s surprising how often these ideas are part of the best solutions to Project Euler problems.

Linear algebra review

It’s been a while since I tried to solve a system of equations without
using a numerical library, so I figured it was time to do a linear
algebra review.

1 & 2\\
3 & 4
c_{1} & c_{2}\end{array}\right]

This should be the easiest matrix to work with: small size, nice integer values, non-colinear columns and rows, non-zero determinant, full rank, etc. Let’s go through the basic definitions, just because it’s been a while. $A$ is 2×2 (MxN), which is small and square. Let’s check the column space of $A$ for colinearity by reducing to row echelon form by adding $-3r_{1}^{T}$ to row 2:

1 & 2\\
0 & -2

Then adding $r_{2}^{T}$ to row 1:

1 & 0\\
0 & -2

So the good news is that the column space of $A$ has linearly independent columns, $\left[\begin{array}{c}
\end{array}\right]$ and $\left[\begin{array}{c}
\end{array}\right]$, which form a basis for $\mathbb{R^{\mathrm{2}}}$, and rank($A$) is 2. So the null space of $A$ is empty. That’s the best possible outcome for a matrix because it means that an inverse exists.

But before exploring that, let’s think about the row space of $A$. Using the same row reductions as above, we can conclude that $A$ has linearly independent rows, $\left[\begin{array}{cc} 1 & 2\end{array}\right]$ and $\left[\begin{array}{cc} 3 & 4\end{array}\right]$, which form a basis for $\mathbb{R^{\mathrm{2}}}$, and rank($A^{T}$) is 2. So the left null space of $A$ is also empty. The row space of $A$ is isomorphic to the column space of $A^{T}$ by definition, and $A$ happens to have full rank, so an inverse exists. Let’s use Gauss-Jordan elimination to find it:

1 & 2 & | & 1 & 0\\
3 & 4 & | & 0 & 1

Starting with an augmented matrix $\left[\begin{array}{ccc}
A & | & I\end{array}\right]$, we can use row operations to find $\left[\begin{array}{ccc}
I & | & A^{-1}\end{array}\right]$. Add $-3r_{1}^{T}$ to row 2:

1 & 2 & | & 1 & 0\\
0 & -2 & | & -3 & 1

Then add $r_{2}^{T}$ to row 1:

1 & 0 & | & -2 & 1\\
0 & 2 & | & -3 & 1

Finally rescale row 2:

1 & 0 & | & -2 & 1\\
0 & 1 & | & -\frac{3}{2} & \frac{1}{2}


-2 & 1\\
-\frac{3}{2} & \frac{1}{2}

Now we need to check that $A^{-1}A=I$:
2 & 1\\
-\frac{3}{2} & \frac{1}{2}
1 & 2\\
3 & 4
\end{array}\right] & = & \left[\begin{array}{cc}
1 & 0\\
0 & 1

Of course, there is an explicit formula for the inverse of a 2×2 matrix:

a & b\\
c & d
\end{array}\right]^{-1}=\frac{1}{\det A}\left[\begin{array}{cc}
d & -b\\
-c & a

For our $A$, $\det A=ad-bc=-2$. Unfortunately, analytical inverses don’t exist for larger matrices, or they are so long and complex as to be of limited utility. But there is at least one important idea to take away from the inverse: it can only exist if the determinant is non-zero. This becomes a very important fact for the eigenvalue problem.

One last thing I wanted to write about for now: the L2 norm (or Euclidean norm) of a vector $x=\left[\begin{array}{cccc}
x_{1} & x_{2} & \cdots & x_{n}\end{array}\right]^{T}$ is defined as:

\left\Vert x\right\Vert _{2}^{2}=\sum_{i=1}^{n}x_{i}^{2}

Suppose that we’re fitting data $\left(a_{ij},y_{i}\right)$ to a known linear model$A$ and we want to determine the unknown coefficients $x$ that best fit the data using ordinary least squares:

\end{array}\right] & = & \left[\begin{array}{cccc}
a_{1,1} & a_{1,2} & \cdots & a_{1,n}\\
a_{2,1} & a_{2,2} & \cdots & a_{2,n}\\
\vdots & \vdots & \ddots & \vdots\\
a_{m,1} & a_{m,2} & \cdots & a_{m,n}
y & = & Ax

And we want to choose the $x$ which minimizes the L2 norm of the residuals because we assume them to be Gaussian:

J\left(x\right)=\left\Vert Ax-y\right\Vert _{2}^{2}

Then $J\left(x\right)$ can be expanded:

\left\Vert Ax-y\right\Vert _{2}^{2} & = & \left(Ax-y\right)^{T}\left(Ax-y\right)\\
& = & \left(x^{T}A^{T}-y^{T}\right)\left(Ax-y\right)\\
& = & x^{T}A^{T}Ax-x^{T}A^{T}y-y^{T}Ax+y^{T}y

However, $x^{T}A^{T}y$ is a scalar that can be computed as $y^{T}Ax$ by reversing the order of the multiplications, so the last expression can be further simplied:

x^{T}A^{T}Ax-x^{T}A^{T}y-y^{T}Ax+y^{T}y & = & x^{T}A^{T}Ax-2y^{T}Ax+y^{T}y

This might look ugly, but we can now minimize $J\left(x\right)$ by
taking the derivative with respect to $x$:

\frac{dJ\left(x\right)}{dx} & = & 2x^{T}A^{T}A-2y^{T}A\\
& = & 2A^{T}Ax-2A^{T}y

Where we used the fact that $x^{T}A^{T}A=A^{T}Ax$ and $y^{T}A=A^{T}y$ because it’s just changing the order of the multiplications again.

Now we can derive the celebrated pseudo-inverse of $A$ by setting the derivative to zero:

2A^{T}Ax-2A^{T}y & = & 0\\
A^{T}Ax & = & A^{T}y\\
x & = & \left(A^{T}A\right)^{-1}A^{T}y

However, the inverse of $A^{T}A$ may not exist, or it may be very hard to compute (due to numerical instability). An alternative solution is to use gradient descent. Since we already have $\frac{dJ\left(x\right)}{dx}$ in a nice form:

\frac{dJ\left(x\right)}{dx} & = & 2A^{T}\left(Ax-y\right)

We can start at any point $x=x_{0}$ and take a step along the direction given by the derivative, $x_{1}=x_{0}-\gamma\frac{dJ\left(x\right)}{dx}$. The problem is how big of a step to take. Despite the existence of a global minimum and an analytical form for the derivative, the steps could either be too small (taking forever to converge) or too large (diverging even when starting near the global minimum). This is where I wrote a paper about bracketing the minimum using a priori constraints (i.e., Golden section search and Brent’s method), but variations on line search minimization are also possible.

I’ve read that the conjugate gradient method is the more popular solution to this problem now. Gradient descent searches strictly along the derivative, whereas the conjugate gradient method chooses a different search direction every time. The Grahm-Schmidt procedure is used to orthogonalize the gradient vectors, and then the conjugate gradient method moves along that new basis. This can be much faster than gradient descent, but it can become slow if the condition number of $A$ is too large. But it’s still a good choice because it doesn’t require the Hessian matrix to be calculated or inverted (as per Newton’s method).

If you’ve got lots of memory and $M,N$ are small-ish, the Levenberg-Marquardt algorithm can converge even faster because it approximates the Hessian with the Jacobian matrix and chooses directions either along the derivative or the Hessian, whichever is better. Unfortunately, it doesn’t work with regularization, and it has a few more internal parameters, and it usually runs out of memory when computing $\left(J^{T}J+\lambda I\right)^{-1}$. So I usually end up using the conjugate gradient method anyway because it can be regularized and doesn’t require crazy amounts of memory.

At CodeMash 2017 I heard a presentation about artificial neural networks where the presenter complained bitterly about how his L2 minimization (“backtracking”) in the neural network was converging very slowly. I thought to suggest an improved algorithm (conjugate gradient), but his talk was focused on a high level introduction with no math, so it didn’t seem appropriate at the time. That inspired me to write this post. Then I saw that someone else already did conjugate gradients with artificial neural networks in 1992. 😛

CodeMash 2017 Fri

Horizontally Scaling Node.js and WebSockets by James Simpson. Prepare for success with scalability. Automatic failover increases uptime. Load balancers: HAProxy, NGINX, ELB, http-proxy. HAProxy is recommended for not having vendor lock-in and for ease of deployment. Application: nodejs, running on multiple servers. Messaging: Redis, RabbitMQ, ZeroMQ. Redis can do key/value store in addition to pub/sub queues. Approaches: single core vs multi-core servers. (smaller, single core servers can scale up and down better than larger, multi-core servers. Single server failures are easier to manage when the number of servers automatically scales.)

Abusing C# More by Jon Skeet. Silly and amusing edge cases of C# 7. I’m pretty sure I will never use the mongolian vowel separator, but at least now I know that it exists.

Tune SQL Server Like a Guru – The Big Three! by Kevin Boles. The big three are File IO Stall differential analysis, Wait Stats differential analysis, and sp_whoisactive. Not a surprise to me, but it’s still good to review them so that I don’t feel like I’m missing out on some new method of figuring out where SQL Server is having problems. Mr. Boles is an excellent speaker, and it was helpful to hear his explanations of some of the wait types.

CXPACKET – excessive means inefficient parallel processing. See common TSQL Mistakes presentation (mismatched datatypes and function calls in WHERE clause). Adjust MAXDOP and cost threshold for paralelism. The default SQL Server configuration is bad. VMs can be especially bad: thin provisioned disks, underpowered CPU.

ASYNC_NETWORKIO – server-side cursors are really bad. MS Entity Framework often does this.

PAGELATCHIO – slow r/w to disk.

PAGELATCH – not related to disk. Just allocation/deallocation of pages. Only 1 thread can hold the 8 KB page at a time. See especially TempDB.

sp_WhoIsActive – best possible analysis tool. See also Ola Halagren’s maintenance tool.

SSMS tools pack – costs about $50, but automatically puts begin tran/rollback in every new query window in SSMS. Could save your job.

An Applied Introduction to R by Gary Short. Difference between trading and gambling: you can play both sides in trading, but not in gambling. Also, you can’t bet against a horse in gambling. I found some of the high frequency trading terms in this presentation to be confusing: Dutching (betting on multiple stocks simultaneously), Scalping (not ticket scalping; buy at Key prices, aka crossover points, are points at which the first derivative changes??), “Back” vs “Lay”. I didn’t get much R out of this talk.

Machine Learning for Predictive Hiring aka Recruiting Isn’t Just Rainbows and Kittens by Carmen Fontana. Speaker used to be a dev, now a recruiter at a business consulting firm. 60% FT, 40% contractors. Many short term contracts. Machine Learning with Textio – optimizes job ads for length, sentence length, wording choices, gender tone, verb usage. This was a pretty interesting talk because I didn’t know that some HR departments are getting better about not using keyword filtering for every job ad.

Machines Learning Human Biases: How Does It Happen? Can We Unteach Them? by Devney Hamilton. Google Word2Vec (2013). MIT team used Google News data to produce vectors representing the similarities between words (i.e., a word embedding). The ANN produced male/female-aligned patterns that reflect the existing stereotypes. De-bias the vector space?

FoxCon Toledo 2017 Presentation – Using Machine Learning to Automatically Predict and Identify Defects in Automotive Assembly Processes

Operators use torque tools to build a car with nuts and bolts, but what happens when the tools themselves have problems? In this presentation Dave will examine a case study of how preventative maintenance and quality inspection can be improved by using machine learning to model the failures observed on the assembly line.  Learn how Bayesian prediction (frequently used in email spam filters) can be applied to predict and identify these failures before they occur.

Presentation download: [ machinelearningdefects PDF ]


FoxCon Toledo 2017 Conference Presentation Schedule

Saturday January 28

Time             Speaker                  Topic

08:00-09:00 Bob Ruple           Opening Comments

1 09:00-10:15 Bob Pierce          Controlling Factory Automation with VFP – Behind the Scenes

10:15-10:30                 Break

2 10:30-11:45 Doug Hennig       Windows Powershell – Batch Files on Steroids

11:45-12:45 pm             Lunch at Park Inn Hotel

3 12:45-02:00 Mike Levy           Building Hypermedia APIs Using .NET Core

02:00-02:15                 Break

4 02:15-03:30 David Johnson    Using Machine Learning to Automatically Predict and Identify Defects in Automotive Assembly Processes

03:30-03:45                Break

5 03:45-05:00 Dave Bernard     The Death of Privacy

Sunday January 29

Time            Speaker                   Topic

09:00-09:15  Bob Ruple         Opening Comments

6 09:15-10:15  Ondrej Balas     Identity Management in ASP.NET Core

10:15-10:45                Break

7 10:45-noon   Mike Feltman    Introducing F1 MD – Just What the Doctor Ordered

Noon-1:15 pm               Lunch at Park Inn Hotel

8 01:15-02:30  Mark Beno        CLEARPATH (CLEveland Area Research Platform for Advancing Translational Healthcare)

02:30-03:00                Closing Comments

3:00 pm                Conference Dismisses

FoxCon Toledo 2017 Software Developers Conference – Abstracts

*** Begin Abstracts ***

Bob Pierce               Allentown

Controlling Factory Automation With VFP – Behind The Scenes

Smooth-On, a manufacturer of liquid rubbers and plastics, recently moved their operation into a much larger facility near Allentown PA.  Last year Bob showed a case study of automating a large chemical manufacturing plant using Microsoft Visual FoxPro.  This year he pulls back the curtain and delves  into the structure and code that drives this process and how it is integrated into Smooth-On’s custom ERP/MRP (also created in VFP).

Dave Bernard           Atlanta

The Death of Privacy

Companies today can learn almost everything about everyone, especially where they are at any particular time, or where they’ve been over a period of time.  That’s because people holding smartphones carry transmitters of their geolocation data. The phone is the enabler.  I know Google, my Internet service provider, my credit card company, and my phone company all mine my personal data. One day I’ll know as much about me as Google and Facebook do! Facebook has your friends, Google has your email, Flickr has your pictures–everyone has the data that you created, but you don’t actually have control over it yourself.

David Johnson Toledo

Using Machine Learning to Automatically Predict and Identify Defects in Automotive Assembly Processes

Operators use torque tools to build a car with nuts and bolts, but what happens when the tools themselves have problems? In this presentation we will examine a case study of how preventative maintenance and quality inspection can be improved by using machine learning to model the failures observed on the assembly line.  Learn how Bayesian prediction (frequently used in email spam filters) can be applied to predict and identify these failures before they occur.

Doug Hennig Winnipeg

Windows PowerShell: Batch Files on Steroids

Windows PowerShell has been included with the operating system since Windows 7 and is available for download for Windows XP and Vista. What is PowerShell? It’s Microsoft’s task automation scripting framework. PowerShell isn’t just a replacement for batch files; it can do a lot more than batch files ever could. This session looks at PowerShell, including why you should start using it and how to create PowerShell scripts.

You will learn:

– Why PowerShell is useful

– How to create PowerShell scripts

– The basics of PowerShell programming

Mark Beno Cleveland

CLEARPATH (CLEveland Area Research Platform for Advancing Translational Healthcare)

As an academic collaboration between Case Western Reserve University, University Hospitals Cleveland Medical Center, and the Cleveland Clinic, the Institute for Computational Biology (ICB) uses the combined resources and expertise of these institutions to achieve its mission, to advance knowledge of human biology through computational methods on big and diverse datasets, and promote the translation of this knowledge into better diagnosis, prognosis, treatment, prevention, and delivery.

For this purpose, CLEARPATH (CLEveland Area Research Platform for Advancing Translational Healthcare) is being proposed to facilitate discovery and research with tools for generating a person’s single synthetic electronic medical record, cohort discovery, and population health analytics.   CLEARPATH aspires to provide a multi-institutional, broad clinical data exchange research collaborative focused on the Cleveland area (including data from, but not limited to, EHRs, disease registries, biorepositories, clinical trials databases, etc.)

Mike Feltman Toledo

Introducing F1 MD – just what the doctor ordered.

After developing several websites based on Angular and Angular Material I have cobbled together a seed project for new web sites. Mike is tentatively calling it F1 MD (the MD stands for Material Design).  F1 MD makes heavy use of Angular components as building blocks for typical, modern, responsive mobile-friendly websites. It also favors convention over configuration which makes getting a new site up and running with it a snap.  Although any back-end is easily supported, Mike makes heavy use of free storage and Internet services such as Google Drive, Google Sheets, YouTube and Google Calendars.  F1 MD includes components that make it easy to plug in items from any of the aforementioned services.

Mike Levy Cincinnati

Building Hypermedia APIs using .NET core

For the last couple of years, Mike has dedicated a part of his “saw sharpening” to the area of Hypermedia APIs. So when Microsoft released ASP.NET Core, it was only natural to jump into the platform and explore what the product team had delivered. This presentation is a report of his learnings.

Ondrej Balas Detroit

Identity Management in ASP.NET Core

Injecting custom code into authentication and authorization in ASP.NET has always been a chore. ASP.NET Identity is a library built to replace both ASP.NET Membership and Simple Membership, making it much easier to implement custom authentication and authorization without the need to rewrite core components. In this session Ondrej will go deep into the abstractions that ASP.NET Identity builds atop of, and show how to take advantage of these hook points to implement a custom membership system.

CodeMash 2017 Thu

How to Win Friends and Influence People With Data Visualization by Amanda Cinnamon.
This talk was pretty good because it covered the essentials about making good presentations. Focus on the action you want the audience to take. Don’t overwhelm with data. The default Excel charts and tables have too many useless lines. Try using tables for their layouts, not emphasizing or using their lines. Pie charts use a lot of space but make area comparisons hard (nonlinear).

Having said that, this presentation used a lot of xkcd comics, which is cheating. Also, the presenter seemed to be unaware of the wealth of existing books about making good presentations, in particular, How to Write and Publish a Scientific Paper by Robert A. Day. I emailed it to her, and she said she’d look into it. That’s a good outcome.

A math-free introduction to Neural Networks by Randall Koutnik.
UI/UX Netflix employee. Enthusiast, not expert. No math, high level talk. Mostly pictures and poorly hand-drawn diagrams. Recommendations: Use Python tensor flow learning, not JS. Try those ANN tutorials again using Tensor Flow. This wasn’t a good talk because I didn’t learn much. At least the presenter was entertaining with the “Calculords” and “Warror JS” projects he worked on, but I wasn’t convinced that he understood ANN very well.

Why Clock Speeds Won’t Be Getting Much Faster, or: EEs Are Amazing by Dan Wiebe. This was one of my favorite talks this year, despite it not being news for years. I really liked learning about the main factors currently limiting clock speeds in silicon: propagation delays, capacitative distortion, heat dissipation, speed of light, diode drop, and feature size (manufacturing contaminants). This presenter wasn’t an EE, and he didn’t seem to know much about the D latch (Flip flop) circuit he showed, but at least he got the facts right, and he gave a solid presentation.

A Synchronicity – Asynchronous Programming in Android by John Pendexter. This talk was pretty good because the presenter emphasized the two main rules of Android threading and gave specific examples of what can go wrong if you don’t. 1. Do not block the UI thread. 2. Do not access the UI outside the UI thread. (These are the same rules in every UI I’ve ever written for, i.e., Win32 and Android, but their exact details vary widely.) He covered a lot of libraries in a short time, but I was very satisfied with this presentation because I learned some new tricks for Loaders.

Digital Payments in a Mobile World by Parag Joshi. I enjoyed learning about the different payment methods (Square readers, Stripe, Android Pay, Apple Wallet). The variations in fees are somewhat surprising, but I guess everyone needs to make a profit somehow. It was valuable to learn that there is no “debug” mode for payment APIs. All calls are real! Refund fees are about 30%. Also: charities are treated very differently by some APIs.

Square readers for credit cards are used to build your own POS system. You can do NFC tap, chip reader, or mag stripe read. Square already has an app that businesses can use (no programming required). Even for 3rd party POS, Square App must be used. Fee for manually entering CC number is higher (due to fraud risk).

Python 3: It’s Time by Charles Yost. I’m a casual Python user, mostly for Scikit-learn, but I haven’t been happy to see the fragmentation of Python 2 vs 3. This talk convinced me that it’s OK to just ignore Python 2 at this point. Python 3.0.0 (2008) was a complete rewrite with sluggish performance and few packages. The point of the rewrite was to fix Unicode strings and byte encoding. 3.2.0 was the first good one. Integer division is now float by default, but floor is available. map and filter return iterators. range now behaves like xrange, and xrange is gone. pyenv – easily switch between versions of python.

Stranger Streams: How to RxAndroid by Michael Yotive. RxAndroid – subscribe to Android UI events as Observables. creates a pool of background threads, but they aren’t used unless you observe on them. Warning: Observables are immutable, so calling functions on them will create new Observables. So always use method chaining! Warning: Back Pressure – when consumers can’t consume events fast enough. Rx will throw “missing back pressure handler” exception; app dies.

CodeMash 2017 Plans

It’s time for CodeMash again. I’m looking forward to learning new things and getting outside my comfort zone, with respect to programming languages and new frameworks. Last year I tried to get familiar with Node.js and microservices. This year, I’m trying to go to more of the functional programming and machine learning talks. I’m somewhat sad that Stephen Cleary doesn’t have a presentation this year, and neither do I. I guess the reviewers want a wider variety of presenters, as the acceptance rate was under 20% and there are a lot of new presenters.

Anyway, we’ll see how it goes. Time to mash some code!


TWP Presentation: Real-time Messaging to Webapps from a Production Database

I am pleased to present a new talk, “Real-time Messaging to Webapps from a Production Database”. Click here to launch the non-interactive presentation. If you want to host by a local server to show the interactive version of the slides, follow the instructions here. Here is a link to the PDF version which should work even after all the javascript code has become too old to run in any browser.


WebSockets are a fast and efficient way to push data from servers to browsers on both desktop computers and mobile devices. Well-known examples include stock tickers, chat rooms (e.g. Slack), social media updates, and online games. Learn how WebSockets are used at the Jeep Wrangler plant for messaging and notifications for critical production activities.

This talk will demonstrate Java servlets with CometD, front-end jQuery code, and webapps currently being used to support production of 250,000+ Jeep Wranglers per year. By attending this talk, you will gain a better understanding of the constraints of an automotive assembly line and real-time messaging.

Bio – David Johnson is a software developer at the Toledo South Assembly Plant, which produces the Jeep Wrangler. David splits his time between writing webapps, and automation. He uses Java, C#, C++, and Javascript. Ask him about barcodes!