Event Log Workflow

Why event logs are used by data systems and accountants

Mature data systems use an event log mechanism, which means they track every change.

Let’s consider the canonical example of a bank with a database of its customers’ financial records. A simple implementation of this database would have a row for each customer account and columns for the account id and the account balance. Whenever a transaction occurs, the account balance is adjusted. The details of each transaction, once accounted for, are promptly discarded to save space.

This implementation is problematic for several reasons:

  • What if the customer wants to verify the accuracy of the account balance? The simplest way to do this would be to provide a record of each transaction since the inception of the account.

  • What if a credit card was stolen three days ago and the bank wants to undo the transactions made since then? It can’t identify them, since none of the transaction information was stored permanently.

  • How do you handle concurrent updates to the same account? If two threads both see $100 as the existing balance and they each try to update the balance to a new number, will you end up losing one of the updates?

An event log is a simple but powerful concept: you store each event permanently and later generate views based on those events. It works much like accounting: an accountant never “erases” a record when a debt is paid off, but instead adds a new entry to the general ledger.
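Here's a minimal sketch of the idea in Typescript. The AccountEvent shape, the EventLog class and the in-memory array are assumptions for illustration (a real system would use durable storage). The balance is never stored; it's a view computed from the events, and undoing a transaction means appending a reversing entry rather than deleting anything.

```typescript
// Hypothetical event shape; amounts are in cents to avoid floating-point rounding.
interface AccountEvent {
  id: number;
  accountId: string;
  amountCents: number;     // positive = deposit, negative = withdrawal
  timestamp: number;       // e.g. milliseconds since epoch
  reverses: number | null; // id of the event this entry cancels out, if any
}

class EventLog {
  private readonly events: AccountEvent[] = [];
  private nextId = 1;

  // Append-only: there is deliberately no update or delete method.
  append(accountId: string, amountCents: number, timestamp: number,
         reverses: number | null = null): AccountEvent {
    const event: AccountEvent = { id: this.nextId++, accountId, amountCents, timestamp, reverses };
    this.events.push(event);
    return event;
  }

  // A view: the full history for one account (returned as a copy).
  history(accountId: string): AccountEvent[] {
    return this.events.filter(e => e.accountId === accountId);
  }

  // Another view: the balance, derived by tallying every event.
  balance(accountId: string): number {
    return this.history(accountId).reduce((sum, e) => sum + e.amountCents, 0);
  }

  // Undo every transaction made at or after `since` by appending reversing
  // entries. Reversal entries and already-reversed events are skipped.
  reverseSince(accountId: string, since: number, now: number): void {
    const history = this.history(accountId);
    const alreadyReversed = new Set(history.map(e => e.reverses));
    for (const e of history) {
      const isReversal = e.reverses !== null;
      if (e.timestamp >= since && !isReversal && !alreadyReversed.has(e.id)) {
        this.append(accountId, -e.amountCents, now, e.id);
      }
    }
  }
}
```

Because the history is kept, the customer can verify the balance by reading every entry, and the stolen-card case becomes a call like reverseSince("acct-1", threeDaysAgo, now).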

The two main downsides are the storage cost of keeping every event and the computation cost of generating each view. In real-world systems, you may end up compacting old events into a snapshot. Likewise, an accountant might calculate the current balance by looking at last year’s audited account balance and then tallying all the transactions in the current year. In theory, one could generate it by tallying all the transactions since the inception of the audited firm, but that would be excessive work with negligible benefit.
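Compaction can be sketched in a few lines of Typescript. The LedgerEntry and Snapshot types and the function names are hypothetical, and the sketch assumes the new snapshot date is never earlier than the old one. The snapshot plays the role of last year's audited balance: the current balance is the snapshot plus whatever happened after it.

```typescript
interface LedgerEntry {
  amountCents: number;
  timestamp: number;
}

// A snapshot summarizes every entry with timestamp <= asOf.
interface Snapshot {
  balanceCents: number;
  asOf: number;
}

// Current balance = snapshot balance + entries newer than the snapshot.
function balanceFrom(snapshot: Snapshot, entries: LedgerEntry[]): number {
  return entries
    .filter(e => e.timestamp > snapshot.asOf)
    .reduce((sum, e) => sum + e.amountCents, snapshot.balanceCents);
}

// Fold all entries up to `asOf` into a new snapshot and keep only the rest.
// Assumes asOf >= snapshot.asOf.
function compact(snapshot: Snapshot, entries: LedgerEntry[], asOf: number):
    { snapshot: Snapshot; remaining: LedgerEntry[] } {
  const older = entries.filter(e => e.timestamp <= asOf);
  const folded: Snapshot = { balanceCents: balanceFrom(snapshot, older), asOf };
  return { snapshot: folded, remaining: entries.filter(e => e.timestamp > asOf) };
}
```

The trade-off is that once old entries are folded into a snapshot you can no longer audit them individually, so it's worth archiving them somewhere cheap rather than throwing them away.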

Using an event log to “Get Things Done” (GTD)

We can apply the same event log methodology to keep track of everyday activities, to maximize productivity and ensure correctness.

I sort my tasks into three groups. The actual group names aren't important, but I'll tell you mine to make this description more concrete.

The first group is called in progress and these are the tasks that I’m working on right now. Typically I'm only going to have one task that I’m working on at any given point. Sometimes, I might be waiting for someone else before I finish a task (e.g. getting a code review) and in that case it's okay to have two tasks in progress. The goal is to limit the cognitive load by focusing your attention on one thing and doing one thing well: you might recognize this as the Unix philosophy.

The second group is called upcoming - this is the work I haven't done yet. Each of these tasks should be discretely defined, with a clear definition of what “done” means. If I can't precisely define the task yet, defining it is the first sub-task that I will do for that task. The goal is to minimize the number of large hairy tasks that I might be reluctant to start, and instead have many small tasks that I can each accomplish in a day or so.

The last group is called completed and these are all the tasks that I’ve finished previously.  

What's important to note here is that I never delete a task. The lifecycle of a task is that it starts in the upcoming group and then it goes to the in progress group and then finally it lands in the completed group.

If one of the tasks that I'm working on isn't needed anymore, I won’t actually delete it. Instead I’ll move it to the completed group and mark it with the keyword “skip”. If I find out a month later that I actually do need to do the task, I haven’t lost any information.
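To make the parallel with the event log concrete, here is a small Typescript sketch of the same workflow as data. I actually do this in a document, not in code, so the TaskEvent shape and the currentGroups function are purely hypothetical. Every move is appended to a list, and the three groups are just a view over that list.

```typescript
type Group = "upcoming" | "in progress" | "completed";

// One entry per move; entries are only ever appended, never edited or removed.
interface TaskEvent {
  task: string;
  movedTo: Group;
  note?: "skip"; // set when a task is closed without being done
}

// Derive the current contents of each group from the full list of moves.
function currentGroups(events: TaskEvent[]): Record<Group, string[]> {
  // The latest event for a task decides which group it is in.
  const latest = new Map<string, TaskEvent>();
  for (const event of events) {
    latest.set(event.task, event);
  }
  const groups: Record<Group, string[]> = { "upcoming": [], "in progress": [], "completed": [] };
  latest.forEach((event, task) => {
    const label = event.note === "skip" ? `${task} (skip)` : task;
    groups[event.movedTo].push(label);
  });
  return groups;
}
```

Reviving a skipped task is just one more entry that moves it back to upcoming; the earlier entries, including the skip, are still there.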

Lastly, I like doing this in a Google doc. It keeps things really simple because there's no fancy UI to distract me. Furthermore, everything is tracked in the revision history.


Front-end Architecture

Outline:

  • Import "concepts" not "implementations" - encourage a pattern where people import a generic Component interface
  • Encourage compositions through decorator and mix-in patterns
  • Type safety as a first-class concern
  • Prefer fractal architecture (e.g. a big component is composed of smaller components)
  • Web standards over proprietary standards (e.g. use the normal DOM interface, make it compatible with web components)
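
A rough Typescript sketch of how the first few points could fit together. Everything here is hypothetical (the Component interface, Label, Panel and withBorder are made-up names), and it renders to strings purely to keep the example short; a real version would target the normal DOM interface as the last bullet suggests.

```typescript
// Callers depend on this "concept", not on any concrete implementation.
interface Component {
  render(): string;
}

// A leaf component. Note: no HTML escaping, which a real component would need.
class Label implements Component {
  constructor(private readonly text: string) {}
  render(): string {
    return `<span>${this.text}</span>`;
  }
}

// Fractal composition: a Panel is itself a Component made of Components,
// so panels can be nested inside panels to any depth.
class Panel implements Component {
  constructor(private readonly children: Component[]) {}
  render(): string {
    return `<div>${this.children.map(child => child.render()).join("")}</div>`;
  }
}

// Decorator: wraps any Component and adds behavior without modifying it.
function withBorder(inner: Component): Component {
  return {
    render: () => `<div class="border">${inner.render()}</div>`,
  };
}

const page: Component = withBorder(
  new Panel([new Label("Title"), new Panel([new Label("Body")])])
);
```

Since everything is typed against Component, the compiler checks that whatever gets passed to Panel or withBorder really has a render method.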

Review of "Large-scale Automated Visual Testing"

Just watched a video from Google's 2015 testing conference called Large-scale Automated Visual Testing. Incredibly insightful talk by a cofounder of Applitools, a SaaS provider for visual diff testing.

I had heard of Applitools before, when I was researching various visual diff tools for my team at work, and I was initially wary that the talk would be an extended infomercial for Applitools' product. My concern was quickly proven wrong. It's an incredibly informative talk, filled with numerous examples and demos of his tips for doing visual testing in an efficient and effective manner. I was actually blown away by the demos of Applitools and how effective they were at identifying "structural changes", that is substantive changes to a website / app, while ignoring minor differences between browsers or dynamic content that changes (e.g. article blurbs that change each day).

I'm looking forward to trying out the free plan and seeing if we can incorporate Applitools into our team's continuous delivery workflow.

Data normalization

Data normalization is one of those terms that I've been intimidated by for a while. My initial reaction was that it's about making the data "normal", i.e. standardized, so you can't have some rows where the date is a timestamp (e.g. 14141231) and others where it's a string (e.g. "January 23, 2015"). I think that initial intuition was on the right track, but data normalization is more focused on making sure no particular piece of data is stored in more than one place. Essentially, if I can boil it down, data normalization is about having a "single source of truth" for any given piece of information (e.g. Bill Clinton's date of birth).

There are three commonly cited normal forms that build on each other, with the second being stricter than the first, and so on (stricter forms exist beyond the third). The examples on the Wikipedia pages were actually very easy to understand and I highly recommend skimming through the pages and reading through the examples:

https://en.wikipedia.org/wiki/Database_normalization

https://en.wikipedia.org/wiki/First_normal_form

https://en.wikipedia.org/wiki/Second_normal_form

https://en.wikipedia.org/wiki/Third_normal_form

I initially got interested in what "normalization" meant when Dan Abramov mentioned his library normalizr, which normalizes nested JSON data.
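
To show what normalizing nested JSON looks like, here is a hand-rolled Typescript sketch. It does not use normalizr's API, and the article / author shapes are made up for illustration. The nested input repeats the author inside every article; the normalized output stores each author once and has articles refer to it by id.

```typescript
// Nested input: the same author object is duplicated in each article.
interface NestedArticle {
  id: number;
  title: string;
  author: { id: number; name: string };
}

// Normalized output: one table per entity, keyed by id.
interface NormalizedData {
  articles: Record<number, { id: number; title: string; authorId: number }>;
  authors: Record<number, { id: number; name: string }>;
}

function normalizeArticles(nested: NestedArticle[]): NormalizedData {
  const result: NormalizedData = { articles: {}, authors: {} };
  for (const article of nested) {
    // If the same author appears twice, the later copy overwrites the earlier
    // one, so there is exactly one record per author id.
    result.authors[article.author.id] = { id: article.author.id, name: article.author.name };
    result.articles[article.id] = {
      id: article.id,
      title: article.title,
      authorId: article.author.id,
    };
  }
  return result;
}

const normalized = normalizeArticles([
  { id: 1, title: "First post", author: { id: 7, name: "Ann" } },
  { id: 2, title: "Second post", author: { id: 7, name: "Ann" } },
]);
```

After this, renaming the author means changing normalized.authors[7].name in one place, and both articles pick it up through their authorId.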

When I was a business analyst in my last job, this notion of de-duplicating data was second nature: storing the same piece of information in multiple places is the bane of any data analyst managing a complex Excel workbook. For example, sometimes we had to build an Excel model really quickly and take some shortcuts. Later, when our boss would ask us "what would be the impact if factors A and B were adjusted by 5%?", it wouldn't be as simple as changing a single cell in one place. The difficulty would be in remembering all the places where you would need to manually update the data. Of course, as you get better at Excel modeling, you utilize cell references as much as possible, and try to consolidate all the various inputs ("levers" in consulting-speak) in one area, ideally the first worksheet of an Excel file.

Reading lists

I just purchased several ebooks from Prag Prog because they have a Cyber Monday discount, and I've decided to have a moratorium on purchasing new ebooks until I finish my "to read" list.

Recently finished reading:

  • Leading the transformation: Applying agile and devops principles at scale
  • The Halo effect
  • Debugging teams
  • Hooked: How to build habit-forming products
  • Learning Agile

Currently reading:

  • Innovator's Dilemma
  • Lean Product Playbook

On break from reading:

  • Crossing the Chasm 
  • Service design

Currently on my reading list:

  • NoSQL Distilled (read ~half)
  • Designing data-intensive applications (still being written)
  • How to solve it
  • Reactive Programming JS
  • Predicting the unpredictable
  • Release it!
  • Beyond legacy code
  • The Go programming language
  • How Linux works
  • Functional programming through lambda calculus

DevOps Tools

Takipi is a production monitoring / debugging tool for the JVM and they've written up a very detailed multi-part guide on various production tools. I really like how they separate out the tools into various categories so you can see which tools are in the same "solution space" (i.e. you probably don't need to use more than one of them, unless there's a really good reason).

http://blog.takipi.com/the-definitive-guide-for-production-tools-24-ways-to-see-through-your-application/

Logging stack

This is just a quick post to summarize my thoughts on logging and monitoring. I have spent a bit of time now on logging and monitoring for my particular product at OpenTable, and I've quickly realized that it's a pretty deep topic.

I've included some resources below, mostly things that I've found helpful or seem interesting:

Source: Digital Ocean on ELK stack

Paid solutions for using StatsD:

  • There is a hosted Graphite / StatsD service that seems to have an affordable entry plan ($19/mo): https://www.hostedgraphite.com/hosted-statsd
  • Scout: https://scoutapp.com/signup

Paid solution for the "ELK" stack:

  • elastic, the company behind the three open-source projects of the ELK stack, has a SaaS offering for Elasticsearch (which can be tricky to operate especially as you scale): https://www.elastic.co/found/features

Feedback driven development workflow

Today I spent some time working on two toy projects to get a better understanding of Typescript. My primary goal was to make a Slack bot that could answer common questions on git. I decided to first make a blackjack app in Typescript because I had only briefly used it before and I wanted to have a quick refresher on the major concepts of Typescript in a domain that's very familiar (i.e. blackjack / playing cards).

For me, learning the actual Typescript type syntax hasn't been bad since I dabbled briefly in Go and I've been reading a bit on Typescript and Flow Type. There was a bit of a learning curve just figuring out the dev workflow for using Typescript, since it means you need to transcompile before you can run your app. I used Visual Studio Code since it offers a really nice balance of the core benefits of an IDE (e.g. intellisense + debugging) with the speed and ease of use that lightweight text editors such as Sublime offer.
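
To give a flavor of the type syntax in the blackjack domain, here is a short sketch. It is written for this post and is not taken from the blackjack repo; the Rank, Suit and Card types and the handValue function are illustrative names.

```typescript
// String literal unions: the compiler rejects any rank or suit not listed here.
type Rank = "A" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" | "10" | "J" | "Q" | "K";
type Suit = "clubs" | "diamonds" | "hearts" | "spades";

interface Card {
  rank: Rank;
  suit: Suit;
}

// Best blackjack value of a hand: aces count as 11 unless that would bust,
// in which case they are downgraded to 1, one ace at a time.
function handValue(hand: Card[]): number {
  let total = 0;
  let acesCountedAsEleven = 0;
  for (const card of hand) {
    if (card.rank === "A") {
      total += 11;
      acesCountedAsEleven++;
    } else if (card.rank === "J" || card.rank === "Q" || card.rank === "K") {
      total += 10;
    } else {
      total += Number(card.rank); // "2" through "10"
    }
  }
  while (total > 21 && acesCountedAsEleven > 0) {
    total -= 10;
    acesCountedAsEleven--;
  }
  return total;
}

// A typo such as { rank: "1", suit: "heart" } fails to compile instead of
// silently producing a wrong total at runtime.
```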

Setting up Visual Studio Code

If you use Visual Studio Code to compile your typescript files, you need to create two files in your project:

  1. tsconfig.json (https://github.com/willchen90/typescript-blackjack/blob/master/tsconfig.json)
  2. .vscode/tasks.json (https://github.com/willchen90/typescript-blackjack/blob/master/.vscode/tasks.json)

Then you can run your build task within VS Code and it automatically watches for changes - you can see the results in the bottom left corner. The two issues that I ran into are: 1) you need to re-run the watch task when you add a new file (it looks like this will be solved in Typescript 1.7.X - https://github.com/Microsoft/TypeScript/pull/5127) and 2) you don't know when it's "done" compiling since the watch task never ends, although the Typescript compiler seems to run very fast so that wasn't really an issue.

TSD - Typescript Definition Manager

The other thing I discovered is this tool called TSD (Typescript Definition manager) which is basically a package manager like Bower for typescript definitions (it seems to be a flat dependency structure, although I didn't dig in too deeply on this today). This makes it much easier to add typescript definitions as you essentially only have to manage one Typescript definition file from your application code (typings/tsd.d.ts). The main commands are "tsd init" and "tsd install lodash --save". 

Note: there seems to be a bug where if you include the flag before the package name, the command isn't executed properly. By contrast, tools like npm don't care whether you do "npm install --save lodash" or "npm install lodash --save".

Starting with the Slack chatbot client

Initially I was hoping to just run the chatbot client against the actual Slack API using Slack's somewhat supported node.js client (https://github.com/slackhq/node-slack-client). However, I quickly ran into the rate-limiting issue (HTTP 429 - Too Many Requests). It seems like Slack has a pretty conservative rate-limiting policy of one message per second. I'm not sure whether there's a "pay to play" way to raise the limit or whether Slack just really dislikes automated messages.

Making a mock "chat client"

As a workaround I used a "mock" chat client built on node.js's standard input and standard output, using the readline module. The key to doing it was to isolate the Slack client-specific code (which I had essentially copied and pasted from Slack's example file) from the rest of the code I was developing.
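
Here is a stripped-down sketch of that separation, not the code from my repo. The ChatClient interface, attachBot and createConsoleClient are hypothetical names, and the canned answer is a placeholder. The bot logic only talks to the interface, so a Slack-backed client and a console-backed client can be swapped freely.

```typescript
import * as readline from "readline";

// The seam: anything that can receive and send text can host the bot.
interface ChatClient {
  onMessage(handler: (text: string) => void): void;
  send(text: string): void;
}

// Bot logic with no knowledge of Slack or the terminal.
function attachBot(client: ChatClient): void {
  client.onMessage(text => {
    if (text.toLowerCase().indexOf("undo") !== -1) {
      client.send("Try `git revert <commit>` to undo a commit that is already pushed.");
    } else {
      client.send("Sorry, I don't know that one yet.");
    }
  });
}

// Mock client: each line typed on stdin is treated as an incoming message,
// and replies are printed to stdout.
function createConsoleClient(): ChatClient {
  const rl = readline.createInterface({ input: process.stdin, output: process.stdout });
  return {
    onMessage: handler => { rl.on("line", handler); },
    send: text => { console.log(`bot> ${text}`); },
  };
}

// To chat from the terminal:
//   attachBot(createConsoleClient());
```

A nice side effect is that the same interface makes the bot easy to test: a fake client that records what was sent is only a few lines.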

It was actually a really simple implementation and could be reused for a variety of apps. The next issue I wanted to solve was not having to manually restart the node.js app every time I made an update. Of course it's not that much work, but it's annoying to have to remember to do every time, so I used an npm module called nodemon. It's very popular for local development because it restarts your node app whenever it detects a file change. If you're creating a REPL-like app (e.g. a chat client), you want to make sure you set the "restartable" flag to false, otherwise nodemon will listen to stdin and you will get undesirable behavior like your input being repeated. It wasn't really clear from the Readme what caused this, but I figured it out by looking at a similar GitHub issue.

Debugging!

For some reason, using the debugger seems to be pretty uncommon in node.js land. I think it's a combination of most JS developers using text editors (e.g. Sublime, Atom) without debugger support and the fact that debugging transcompiled code (e.g. Coffeescript, Typescript) is oftentimes a pain. Luckily, with source maps and new tools like Visual Studio Code, it seems like debugging is now a lot easier and actually fun to do. My recommendation for using the debugger in Visual Studio Code is to rely on the "Attach" setting, which essentially hooks into / debugs a node process that you've already started. I've included my example VS Code configuration. This is usually more straightforward than trying to launch a new node process through an IDE.

Unit testing

Eventually it just got too tedious to manually check outputs, even with the mock client. I created a small suite of unit tests using Mocha and Chai, which I was familiar with. Mocha is a very popular framework with a helpful, easy to look at website. The two tips that I have are: 1) use the watch flag (it's like nodemon for testing) and 2) use the "source-map-support" npm module so your error stack traces point to your original source files, not the transcompiled .js files. For an example of these two tips in action, look at my simple one-line "npm test" script.

I even launched the debugger in Mocha. Use the "--debug-brk" and not the "--debug" flag, otherwise you won't be able to attach your VS Code debugger to the Mocha test process. 

Wallaby.js - unit testing on steroids

Lastly, I want to mention Wallaby.js, which displays unit test results in your editor. It has looked very promising for a while, but I had some trouble using it in WebStorm a while ago (it seemed like the test results didn't update properly). I decided to give it another go since they just launched a Beta for Visual Studio Code (which recently open sourced its codebase and has developed an extension system). I only briefly played around with it, but it seemed quite reliable and I really enjoyed having the console.log information display at the bottom bar, and the code coverage so prominently displayed while you're coding. In essence, Wallaby.js seems like a next-generation testing tool. I'm going to spend some more time with it, and try to use it regularly at work.

To conclude: get more and faster feedback

Even though these were two really short projects (and they're incomplete), I've learned a tremendous amount just from exploring Typescript and all the other tooling that plays well with it. I was initially worried that using Typescript would slow me down because 1) I wasn't too familiar with it and 2) it would be time-consuming to write type annotations. In the end, I think those concerns were unfounded, as I was able to be very productive with Typescript in a short amount of time. Getting errors from Typescript within VS Code was a huge help, and I was able to catch silly mistakes (e.g. typos, logic errors, etc.) in a very quick cycle. In the future, I will always consider using Typescript if I'm starting on a new Javascript project. The main downside of Typescript is that it takes a bit of setup to get a smooth workflow, and some of the tooling (e.g. linting, style-checking) lags behind what's available for Javascript (ES6). However, I think those downsides are far outweighed by the benefits you get from it, and it feels like the Typescript ecosystem is alive and well, from the open and active development by Microsoft on their Typescript GitHub repo to the new Angular 2 framework that is being developed almost entirely in Typescript.


Link summary: 

Tools that I used today:
  • Typescript - (to install the compiler: npm install -g typescript)
  • Visual Studio Code - https://code.visualstudio.com/
  • Mocha & Chai - https://mochajs.org/
  • Nodemon - https://github.com/remy/nodemon
  • Wallaby.js - https://wallabyjs.com/
  • Source map support - https://github.com/evanw/node-source-map-support

Toy projects:

  • Blackjack - https://github.com/willchen90/typescript-blackjack
  • Git chatbot - https://github.com/willchen90/typescript-gitbot