Wednesday, July 20, 2016

WebAssembly Hello World!

Photo by isipeoria (license)
Having done so much browser development in the past I have been watching WebAssembly closely. This past weekend I finally got some time to read through the current documents and mess around with various projects including WAVM. In the process I discovered that no one seems to have bothered to create a Hello World! program, so I did.

WebAssembly currently has two formats, a binary format and an editable AST form.  (They are not final and there is even talk of going with something other than the AST form for the official text mode.)  Given that the wasm AST is readable, one thing I have watched for is someone posting a Hello World! program, as that provides a rich entry point: it doesn't require that you figure out how to compile any C code into wasm just to see what it would look like.  (I have seen examples that show off adding two numbers, but Hello World! being such a classic I figured someone would eventually make it.)  Given the intertwined history of Emscripten and WebAssembly it wasn't a surprise that, while reading through the WAVM source, I discovered it exposes Emscripten intrinsics so that programs compiled with Emscripten can be run on WAVM.

A short while after finding the Emscripten intrinsics I had a working WebAssembly Hello World! program.  The wasm AST isn't exactly made for writing by hand (typically you will compile to it) and no doubt this example will be improved upon and might even break as the WebAssembly spec evolves, but it brings the barrier for playing with WebAssembly down a little bit.  Enjoy!


To run the program using WAVM save it locally to a file called helloworld.wast and execute it with the following command:
$ ./bin/Run -text helloworld.wast main

And for the inevitable comment that points out how much smaller it could be, here it is without comments, spaces, a local variable, a return value, and all of the other bits that make the program more useful as example code, but which is "so much better" at only 6 lines.



And while you can't run this hello world on it, if you are just looking to explore what can be coded in wast you can play around with WebAssembly on the WebAssembly playground, which was recently created by Jan Wolski.  If you want to try going from a C source file all the way to the browser, check out this WebAssembly end-to-end how-to document.

Monday, March 21, 2016

Migrating to Git and dealing with large repositories

When migrating source code to Git from a legacy revision control system the default conversion often results in a single repository that is enormous.  Faced with this large, messy repository many argue that splitting up the old repository is the only solution, and yet there is a disconnect because companies like Google and Facebook seem to have success with a single large repository.  Whether or not the repository is split, cleaning it up and removing the interdependencies is required.  Some companies keep this slimmed-down repository that contains all projects and use partial checkouts because it provides other benefits around versioning.  It is worth evaluating those benefits before jumping straight into splitting a repository up.

Default behaviors


The default behavior and example usages of legacy systems such as Subversion and Perforce both show creating a single server and repository where all projects live.  By default users check out everything and over time it is understandable how everything can become required to build.  Users have to actively police or put guards in place to keep things separated, which rarely occurs.  Also, because users only have the most recent copy of the repository they don't pay any penalties for checking in binaries or large files.  It further doesn't help that the systems don't provide any incentives to keep those binaries in specific locations.

The above two behaviors result in a disaster when trying to migrate to Git because all files and their history need to get pulled into a single repository.  The repository can easily be tens of gigabytes and commands like git status can take many seconds to run.  This is why a common reaction is to say that to make Git usable the solution is to split the one large Git repository into many smaller repositories.

Bloat and lack of organization


Messy bookstore.  Photo by Iván Santiesteban (license)
The tree structure of legacy repositories is rarely clean.  Build scripts that as a last step check the binaries into the repository inside the src/Debug directory.  ISO images of compilers, editors, and other software that might be used by the developer dumped anywhere and everywhere.  All sorts of generated files sprinkled about.  While the organization might be divided into four sub-organizations there might be a single src directory and only one build script.  This results in an overly difficult task for anyone who wants to export a repository to Git.

Like a small town bookstore owner who didn't intend to create stacks of books everywhere, this is a state that is never strived for.  Cleaning this up and removing the bloat is simply something that should be done whether you are moving to a single Git repository, going to many smaller repositories, or even staying on Subversion or Perforce.

Interdependencies


While a legacy repository might require everything to build, that is usually historical and not desired.  Just a few of the problems that arise from interdependencies:
  • An increased "code, build, test" cycle, going from seconds to minutes, hours, or even days.
  • Projects reaching into each other in ways they really shouldn't, causing significant long-term technical debt.
  • Bad APIs that resulted from the idea that because the API between projects could be changed at any time the developers didn't put any effort into making decent APIs to begin with, which results in higher API churn and more build breakage.
  • Common shared libraries that are unowned and in horrible shape from everyone pushing in their required features and changing behavior without understanding the consequences.
  • With everyone committing to the main repository, no build bots, and everything required to build, developers might rarely update, resulting in more build breakage.  When the repository rarely builds the overall health of a project is hard to determine and release planning becomes more difficult.
  • The entire repository and all of the projects together can only be released together, at a pace that is slower and more complex than any smaller project would desire.
  • Anything that is shared, from libraries, scripts, and tools to even the release itself, might not get the care it needs due to lack of stewardship.
Interdependencies are something to avoid, but once you have the problem there is no silver bullet to fix it.  It just takes a lot of code reading and manpower working on the problem.  The end result is that the projects are split up and independent.

Announcing that the repository will be split up is one way to force the projects apart because they can't depend on what isn't there.  Alternatively all of the projects can still live in one repository, but developers start using partial checkouts with only the projects they need to work on.  This forces the projects apart without requiring multiple repositories.  Subversion, Perforce and Git all support partial checkouts.  Besides having one repository what other value does this provide over splitting up the repository?

Build dependency and versioning


When a project is being built it needs to resolve its dependencies.  With multiple repositories it is hard to get by without specifying versions.  With a single repository where all of the dependencies are checked out with the project, no version needs to be supplied.  This ability is both powerful and limiting.

First off what is limiting about it?
  • If a library wants to change its API it must update everyone that depends upon it or their builds would immediately be broken.
  • A project cannot depend upon old versions of its dependencies.
What is powerful about it?
  • An API change can be made atomically with the change that migrates all clients to the new API.
  • A project can only ever be built against one version of its dependencies, so problems such as a project having two dependencies that each depend upon different versions of a third dependency disappear entirely.
  • When you have a problem you only have to report one revision, not one per project / dependency.
  • You can bisect the source across projects and dependencies without worrying about which versions of the dependencies were built and deployed together.
  • When a bug is fixed in a static library it is easy to find all deployed binaries older than the fix so they can be rebuilt.
  • All tools dealing with history (like the already mentioned bisect and commit) such as log work for the project and its dependencies together.
  • Projects by default immediately get improvements (and regressions) from upstream without having to pull them manually.
  • Regressions introduced by dependencies are caught much more quickly.
The single repository using partial checkouts and no versioning has some appealing traits.  It isn't perfect and won't work for everyone, but it is almost the best of both worlds and worth thinking about before splitting up the repository.

Conclusion


When migrating to Git from a system like Subversion or Perforce it is easy to conclude that the path forward is to create many repositories that each contain only part of the legacy repository.  The behaviors of the legacy systems have often resulted in repositories that are bloated, lacking in organization, and full of intertwined code.  Solving these problems doesn't require that the repository be split up, and it is worth exploring the benefits that a single repository has to offer.  Whether you go with many small repositories or one large repository with partial checkouts, the first step is to clean the legacy repository up, because that will have to be done before either outcome can be reached.

Sunday, March 29, 2015

A connection machine in your pocket

In the 90's I read the paper "Evolution as a Theme in Artificial Life: The Genesys System", in which the researchers evolved an ant program that would allow an ant to traverse a winding, broken trail they dubbed the John Muir Trail.  As the ant walked the trail it had to first figure out how to turn and then step over gaps in the trail that grew wider the longer the ant walked.

Execution of a single run took 30 seconds and the program typically had a population size of 65,536 and ran for 100 generations, resulting in a program that took about an hour to run on their Connection Machine (CM2) with 16K processors.  For those not familiar with the CM2, it is an amazing machine: a cube 1.5 meters on a side with up to 65,536 processors, and through the eyes of a teenage boy the image below would have been the equivalent of the Ferrari poster hung on the wall.  The machines were so cool that the CM-5, with its red light design, was the featured supercomputer seen in Jurassic Park.  For a teenager it was an insanely awesome and powerful machine, though if I had had access to one I wouldn't have had a clue what to do with it.


The story of these ants was one of several things that sparked my interest in genetic algorithms and my continued interest in programming in general. At the time I had a Pentium 133 and could only dream about the computational power that the CM2 had. I would sketch out ideas for things I would do, but of course it never left the paper, which was okay because I was still figuring out the basics of programming.

Last weekend I was cleaning out my bookshelf and came across a book that mentioned the Genesys/Tracker system, which made me look up the paper again.  Reading the paper in bed I discovered that, unlike when I was a teenager, I actually understood the mechanics of what the researchers were doing.  Looking around the next evening I didn't find any source code anywhere and thought it would be fun to try to reproduce their program and see how long it took to run.  I coded up a simple little C program with the finite-state machine design. It isn't multi-threaded, it is wasteful of memory, and the little bit I did optimize was clearly overkill considering the runtime.  I tried not to pre-optimize (other than replacing rand()) and implemented a simple design.  After fixing two annoying bugs I watched it rapidly find better and better ants on my screen.  On my five-year-old laptop running inside a VM it took around a minute to run the same experiment with a population of 65K for 100 generations.
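The actual program is written in C (linked at the end of this post); purely to illustrate the finite-state machine design, here is a minimal JavaScript sketch of how an ant's fitness might be evaluated, with a hypothetical genome encoding that is not taken from the real code:

// Hypothetical encoding: fsm[state][input] = { action, next }, where input is
// 1 if there is food on the cell directly ahead of the ant and 0 otherwise.
var ACTIONS = { NONE: 0, TURN_LEFT: 1, TURN_RIGHT: 2, MOVE: 3 };

function evaluate(fsm, trail, maxSteps) {
  var grid = trail.map(function (row) { return row.slice(); }); // copy; 1 = food
  var x = 0, y = 0, dir = 0;                     // start in a corner facing "east"
  var dx = [1, 0, -1, 0], dy = [0, 1, 0, -1];
  var state = 0, eaten = 0;
  for (var step = 0; step < maxSteps; step++) {
    var ax = (x + dx[dir] + grid[0].length) % grid[0].length;   // toroidal grid
    var ay = (y + dy[dir] + grid.length) % grid.length;
    var gene = fsm[state][grid[ay][ax] ? 1 : 0];
    if (gene.action === ACTIONS.TURN_LEFT)  dir = (dir + 3) % 4;
    if (gene.action === ACTIONS.TURN_RIGHT) dir = (dir + 1) % 4;
    if (gene.action === ACTIONS.MOVE) {
      x = ax; y = ay;
      if (grid[y][x]) { grid[y][x] = 0; eaten++; }              // eat the food
    }
    state = gene.next;
  }
  return eaten;  // the genetic algorithm maximizes this score
}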

The phone I carry around in my pocket isn't that much slower than the laptop and it is weird to think that a machine I will soon think of as obsolete is already faster than a CM2.  It is one thing to have heard of Moore's law, but every once in a while you get reminded about what it really means.

A quick profile shows that 50% of the time is spent not executing the fitness functions, but simply in the crossover function used to generate the next generation.  I even started re-writing it to not be so brute force, but then I stopped myself because at the end of the day 100 generations only took a minute or two, and when I let it go crazy and run for 500 generations it didn't take longer than going to make a cup of coffee.  The value of making it faster is more of a challenge of how fast you could make it rather than any actual need to make it faster.  Moore's law has enabled me to write simpler code and get results faster than if I needed to go through a week of making it fast enough first.  If anyone is up for nothing more than a coding challenge, grab the source and see how much faster you can make the execution time.
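For anyone taking up that challenge, the crossover itself is conceptually simple; here is a minimal sketch of a single-point crossover, assuming the genome is a flat array (which may not match the encoding the real program uses):

function crossover(parentA, parentB) {
  // Pick a cut point and swap the tails of the two parent genomes.
  var point = Math.floor(Math.random() * parentA.length);
  var childA = parentA.slice(0, point).concat(parentB.slice(point));
  var childB = parentB.slice(0, point).concat(parentA.slice(point));
  return [childA, childB];
}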

A nice aspect of having such fast hardware today is that I can reproduce the results published in the paper if I want.  While not every run found a solution within 100 generations it was often close, so at a minimum I can confirm those results.  The score distribution and results shown in figure 7 and figure 8 wouldn't be too much effort to reproduce either.  I don't know if it would be worth a letter to the journal, but putting these notes and source up online where they can be searched is probably just as good.

In the paper they mention that they didn't select for the minimum number of steps or the minimum number of states, only for a solution that could walk the whole trail.  The best solution presented in the paper required 12 states and took all 200 steps to complete.  Adding a quick tweak to the code, I let it run so that after first accomplishing walking the full trail, bonus points were awarded for using the fewest states.  After ten minutes or so it had found a state machine that only required 9 states.  I let the application run overnight before stopping it at 28340 rounds, which would have been the equivalent of around 12 straight days on a CM2 machine and, given the cost of the machine, was probably an unthinkably long period of time to have it running a single program.  But it was unable to find a smaller program that would run in fewer than 9 states.  I then tweaked it to find a fast solution.  Very quickly it found one, highly evolved to the specific trail, that could run in only 159 steps.

Last but not least I made a basic genesys simulator webpage where you can see various ants walk their solutions live.  I included two 9 state solutions, the paper's 12 state solution*, and the "fast" 159 step solution, and when you first load the page it shows the basic hand-generated state machine mentioned in the paper, which is unable to solve the trail in less than 200 steps.

If you're curious about running the genesys/tracker program I wrote, or are up to the challenge of making it faster, the source code can be found at https://github.com/icefox/genesys

* In Figure 11 of the paper, the Champ-100 12 state solution shows 0/S for State 11, which is an invalid action and a typo; it should be 0/N.

Thursday, October 02, 2014

So you want to build a Git server?

So I heard you are thinking about creating a Git server.  Trying to cash in on the massive rise of Git, perhaps?  Maybe it will be online only like GitHub, or perhaps it will be for customers to install like Gitorious, or perhaps it will be mobile only!  You of course have a reason why your Git server is going to be better and take the market by storm.  Having been messing around with Git servers for almost seven years I have installed many different types many times and even started working on one myself called GitHaven many years ago that I had to abandon.  But my loss is your gain!  Below is the essential wisdom I have gained that hopefully you can use to make your Git server better.

Beyond that one killer feature that started you down the road of making a Git server you will need to implement a fair number of other things, which fall mostly under three categories: backend, frontend, and bonus.

The backend is the part that accepts connections for the git fetch/push commands and provides an API into the data and system.  It does authentication and authorization.  It is of course completely separate from the frontend for many good architectural reasons.  After the basics you can talk about multiple different authentication schemes, hooking into LDAP, scaling SSH, and providing many different rich commands that users can use.

The frontend these days includes at least a web frontend.  At the core this is all about either browsing files from a branch/tag/etc or viewing a log of commits.  Beyond basic viewing the sky is the limit with the addition of search, viewing the log filtered by any number of criteria such as author, diff tools, etc.  After that is basic web administration: user accounts, user settings, SSH key management, Gravatar, password reset, etc., and the same goes for the various repository settings.  Then you hit upon richer version control features like forking, pull requests, and comments.  Note that even forking and pull requests impose a workflow upon your customers, and as you create richer tools they will be used by a smaller audience, so choose wisely.

Bonus covers everything not related to Git, typically much more project oriented.  This includes stuff like wikis, issue trackers, release files, and my personal favorite, the equivalent of GitHub Pages.

I have mentioned just a few of the features that a Git server could have and there are countless more, but what is really, truly important for a Git server to have?  What makes a Git server?  Is it a built-in wiki?  Are customers buying your Git server because of the wiki and the fact that they get repository management as a bonus?  Okay, maybe it isn't a wiki.  What about issue trackers?  It is probably bolted on and nowhere near as powerful as something like Jira, so perhaps not.  Again it might get used, but they won't be buying the product because of it.  So maybe picking on everything in Bonus is easy.

Maybe a better question to ask would be: what feature, if it were missing, would have the customer seriously considering switching to another Git server because it stopped them from doing their job?  What if you provided no web frontend at all and users could only interact with the server through API calls or SSH, like Gitolite?  Sure, many customers might be annoyed, but they could get the job done knowing they could clone, work, and share changes.  One reason the frontend is separated from the backend is that even if the frontend goes down for some reason developers can still "do work".  So what is the backend providing that is so central to a Git server?

  Mentioned above are authentication and authorization.  If the Git server didn't have those built into its core would that matter?  For authentication, for example, if a customer requires you to hook into LDAP and your server doesn't support it they will look elsewhere; on the web maybe this means authenticating through a 3rd party.  At the bare minimum, if anyone could do anything almost no one would want to use it.

  What about authorization?  Authorization is the ability to say yes or no to what a user wants to do.  The bare minimum would be just a check to see if they are authorized, and if it returns true they can do whatever they want.  The average depth of permissions most Git servers have is that each repository can have a list of users that can push to it.  A pretty basic safeguard is preventing force pushes.  A good example is Atlassian's Git server Stash, which out of the box doesn't provide you any way to prevent users from force pushing to a branch.  In the age of automated deployments an accidental push could result in downtime, maybe data loss or worse, if the wrong branch is force pushed to.  Stash leaves it to the 3rd party market (only a few clicks away) to provide a plugin which prevents force pushing (on every branch, no configuration), so it isn't as bad as it first sounds.  On the other end of the spectrum is Gitolite, which lets you create user groups, project permissions, repository permissions, branch permissions, cascading permissions, read, write, force push, create, and more.  They even let you have permissions based upon what file you are modifying; it is very rich and powerful.  What sound like edge use cases, such as a build server that should only be allowed to create tags that start with release-* or an encrypted password file that should only be modified by one single user, are very common.

  Very closely related to authentication and authorization is having an audit trail.  At some point something is going to be configured incorrectly and someone is going to do something bad and the user will demand the logs so they can find out what went wrong (by whom), undo the behavior, and prevent it from happening again (such as someone force pushing to master).  If you don't have logs and can't prevent force pushing to master they might look at you funny and then start looking for a new place to host their code.

  The third thing that the core of any Git server does is provide the ability for users to add new commands.  This is central to Git itself with its hooks system.  The classic server side hook is email, where users can configure the hook to send out email to some mailing list whenever new commits are added to a branch.  Providing a way to add new hooks cleanly into the system is definitely something you want to do, but there is one hook that should be included from day one: the WebHook.  The web hook is a very simple hook that users can configure that says: when something changes in a repository, send a POST to some URL.  This URL points to something that the customer owns and is something they can get working in no time flat with no access to the Git server.  They don't need to learn your API, create a hook in your language of choice, or do anything that is a hassle (like putting in a change request with an IT admin to install a plugin).  They pull up their favorite web framework and make their own tool that runs on their box.  The best part is that because it doesn't run on the server, security isn't an issue and it can't take down the server no matter how many web hooks are enabled.
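To give a sense of how little the customer needs on their end, here is a minimal sketch of a webhook receiver in Node.js; the payload shape is hypothetical and would be whatever JSON your server decides to POST:

var http = require("http");

// Listen for the Git server's POST and react to it -- kick off a build,
// send an email, update a dashboard, whatever the customer wants.
http.createServer(function (req, res) {
  if (req.method !== "POST") { res.statusCode = 405; return res.end(); }
  var body = "";
  req.on("data", function (chunk) { body += chunk; });
  req.on("end", function () {
    var event = JSON.parse(body);           // payload shape depends on the server
    console.log("repository changed:", event);
    res.statusCode = 200;
    res.end("ok");
  });
}).listen(8000);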

So what is at the heart of a Git server?  Authentication, authorization, and extensibility.  Whatever type of product you create you need to absolutely own those because they are central to the success of your product.  You not only need to own the code and not leave it to a 3rd party, they should be the fundamental building blocks of the product.  Maybe the rich extension system you have planned isn't ready yet, but the WebHook should be there.  Maybe the permissions model is in place but the frontend doesn't expose it yet; a check box for blocking force pushes to master should be there on day one, proving it works.  Out of the box, a lack of features in these three areas is the biggest reason customers will turn away, not because they disagree with some design choice, but because they flat out can't use your product; on the flip side, if you fully support these three you will find customers migrating to your product.

Extra: Git users love to use hard drive space, especially enterprise customers.  They will migrate an old CVS repository into a single 10GB Git repository without thinking much about it.  The disk usage will only grow and the system must proactively monitor and support this.  Using authorization to limit disk space, notifying users when a push isn't normal (accidentally committed object files), and letting users see their own disk usage is one method, but it only slows the growth; supporting scaling of the number and total size of repositories is an absolute must.  Even with a small number of users they will find a way to use all of the available space.

Monday, June 30, 2014

Evil Hangman and functional helper functions

Evil hangman is what looks to be an ordinary game of hangman, but even if you cheat by knowing all of the possible words available it can still be a challenge. Try out your skill right now at http://icefox.github.io/evilhangman/

The reason the game is evil is that the game never picks a word; every time the user guesses a letter it finds the largest subset of the current words that satisfies the guesses and makes that the new list of current words. For example, if there are only the following six words and the user guesses 'e', the program will divide the words into the following three answer groups and then pick _ee_ because it is the largest group.  This continues until the user is out of guesses or the group size is 1, in which case the user has won.

_ee_ : beer been teen
__e_ : then area
____ : rats

A few months ago I saw a version of this written in C, taking up hundreds of lines of code; while it was efficient it was difficult to read and modify.  With only 127,141 words in the entire dictionary file many of the complex optimizations for memory, data structures, and algorithms were silly when running on any modern hardware (including a smartphone).  The code instead should concentrate on correctness, ease of development, and maintainability.  Using JavaScript primitives combined with the underscorejs library the main meat of the program fits neatly in just 24 lines, including blank lines.  Using map, groupBy, max, and other similar functional functions I replaced dozens of lines of code with just a handful of very concise lines.
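The embedded source isn't reproduced here, but to give a flavor of the approach, here is a minimal sketch of the "evil" step using underscorejs; the names are illustrative rather than taken from the actual game:

var _ = require("underscore");

// Given the current candidate words, the guessed letter and the currently
// revealed pattern (e.g. "____"), keep the largest group of words that share
// the same reveal pattern for that guess.
function evilStep(words, guess, pattern) {
  var groups = _.groupBy(words, function (word) {
    return word.split("").map(function (c, i) {
      return c === guess ? c : pattern[i];     // e.g. "beer" + "e" -> "_ee_"
    }).join("");
  });
  var best = _.max(_.pairs(groups), function (pair) { return pair[1].length; });
  return { pattern: best[0], words: best[1] };
}

// evilStep(["beer", "been", "teen", "then", "area", "rats"], "e", "____")
// -> { pattern: "_ee_", words: ["beer", "been", "teen"] }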


For a long time most of my projects were coded in C++ using the STL (or similar libraries) for my collections.  I had a growing sense of unhappiness with how I was writing code.  Between for loops sprinkled everywhere and the way that the STL doesn't include convenience functions such as append(), my code might be familiar to another STL developer, but the intent of the code was always harder to determine.  As I played around with Clojure I understood the value of map/filter/reduce, but didn't make the connection to how I could use it in C++.  It wasn't until I started working on a project that was written in C# and learned about LINQ that it all came together.  So many of the for loops I had written in the past were doing a map/filter/reduce operation, but in many lines compared to the one or two lines of C#.

When codewars.com was launched I tried to solve as many problems as I could using JavaScript's built-in map, filter, and reduce capabilities.  I discovered that I could solve the problems faster and the resulting code was easier to read.  Even limiting yourself to just map, filter, and reduce, and ignoring other functions like range, some, last, and pluck, dramatically changes the ease with which others can read your code.  The intent of your code is much more visible.  Given the problem of "encrypting" a paragraph of text in pig latin, here are two solutions:
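The original embedded solutions aren't reproduced here; as a stand-in, here are two versions of a simplified pig latin "encryption" (move the first letter of each word to the end and append "ay"; the exact rules in the original may differ) that illustrate the same contrast:

// Loop version: the intent is buried in index bookkeeping, and i, word, and
// encrypted all leak out of the loop scope.
function pigLatinLoop(paragraph) {
  var words = paragraph.split(" ");
  var result = "";
  for (var i = 0; i < words.length; i++) {
    var word = words[i];
    var encrypted = word.slice(1) + word.charAt(0) + "ay";
    if (i > 0) result += " ";
    result += encrypted;
  }
  return result;
}

// Chained version: split the paragraph into words, transform each word,
// and join them back together -- the shape of the computation is visible
// at a glance.
function pigLatinChained(paragraph) {
  return paragraph
    .split(" ")
    .map(function (word) { return word.slice(1) + word.charAt(0) + "ay"; })
    .join(" ");
}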


Using chaining and map it is clear that the second solution does three things: splitting the paragraph into words, doing something with each word, and combining them back together.  A user doesn't need to understand how each word is being manipulated to understand what the function is doing.  The first solution is more difficult to reason about, leaks variables outside of the for loop scope, and is much easier to have a bug in.  Even if you only think of map, filter, and reduce as specialized for loops, they increase a developer's vocabulary, and by seeing a filter() you instantly know what the code will be doing, whereas with a for loop you must parse the entire thing to be sure.  Using these functions removes a whole class of issues where the intent is easily hidden by a for loop that goes from 0 - n, 1 - n or n - 0 rather than the common case of 0 - (n-1), not to mention bugs stemming from the same variables being used in multiple loops.

Functional-style helper functions in non-functional languages are not new, but historically they haven't been the easiest to use and most developers were taught procedural-style for loops.  It could just be the Baader-Meinhof phenomenon, but it does seem to be a pattern that has been growing over the last decade: new languages supporting anonymous functions out of the box, JavaScript getting built-in helper functions, and even C++ gaining anonymous functions in C++11.  Given the rise of projects like underscorejs and the fact that Dollar.swift was created so shortly after Swift was announced, I fully expect that code following this style will continue to grow in the future.

Thursday, March 06, 2014

How to stop leaking your internal sites to Gravatar, while still using them.

Gravatar provides the ability for users to link an avatar to one or more email addresses, and any website that wants to display user avatars can use Gravatar. This includes not just public websites, but internal corporate websites and other private websites. When viewing a private website, even when using SSL, the browser will send a request to Gravatar that includes a Referer header, which can leak information to Gravatar.

When you view the commits of a repository on GitHub, such as this one https://github.com/icefox/git-hooks/commits/master, you will see a Gravatar image next to each commit.  In Chrome if you open up the inspector and view the Network headers for the images you will see, along with other things, that it is sending the header:
  Referer: https://github.com/icefox/git-hooks

Over the past decade URLs have, for the better, gained more meaning, but this can result in insider information leaking through the Referer header to a 3rd party. What if you were working for Apple and running a copy of GitHub internally? It might not be so good to be sending https://git.apple.com/icefox/iwatch/browse out to Gravatar. Even private repositories on GitHub.com are leaking information: if your repository is private, but you have ever browsed the files in your repository on GitHub, you have leaked the directory structure to Gravatar.

While it seems common knowledge that you don't use 3rd party tools like Google Analytics on an internal corporate website, Gravatar images seem to slip by. Besides outright blocking, one simple solution (of many, no doubt) I have found is to make a simple proxy that strips the Referer header and then point Gravatar traffic at this machine. For Apache that would look like the following:

# Requires mod_headers, mod_proxy and mod_proxy_http. Point internal avatar
# URLs at this host instead of www.gravatar.com; the proxy drops the headers
# that would identify the internal site before passing the request on.
<VirtualHost *:8080>
    RequestHeader unset Referer
    RequestHeader unset User-Agent
    ProxyPass /avatar http://www.gravatar.com/avatar
</VirtualHost>

Edit: This post has spawned a number of emails and so I want to clarify my setup:

Using Chrome version 33 I browsed to a server running Apache set up with SSL (the URL looks like https://example.com/) and on that page it had a single image tag like so:

<img src="https://www.gravatar.com/avatar/205e460b479e2e5b48aec07710c08d50">

When fetching the image Chrome will send the referer header of https://example.com/ to gravatar.com.

While Chrome's inspector says it sends the header, just to be sure it wasn't stripped right before the fetch I set up a fake Gravatar server with SSL that dumped the headers it received, pointed the page at it, and found that, as expected, the Referer header was indeed being sent.

For all of those who told me to go look at the spec, I would recommend that you read it too (rfc2616#section-15.1.3), where it only talks about secure to insecure connections, which is not the case we have here:

Clients SHOULD NOT include a Referer header field in a (non-secure) HTTP request if the referring page was transferred with a secure protocol.

Thursday, January 23, 2014

Ben's Laws

#1)  When every developer is committing to the same branch, the odds that a commit will break the build increase as more developers contribute to the project.

#2)  If the release cadence for software is slower than the time it takes to create a minimal competitor, the software will have competition that is stronger than it would like.
