Sunday, April 24, 2011

The end of an experiment

You don't exist.

Well, I guess you do, the individual you. Tautologically (and forgoing solipsism) you must.

But the collective you - the "audience" - you're essentially not there.

There are a few reasons for this, chief among them my unwillingness to promote my site in any way. I'm not talking about buying ads or anything silly, simply about "networking" - pestering people I know to read it, asking them to pester others, tweeting, Facebooking, and doing whatever else it takes to grab eyeballs.

Also, it's relatively clear from perusing my entries that, to put it one way, "my heart isn't in it." I find these topics interesting of course, and I'd like to think that my writing is not bad and that the links and sources are interesting too. But I'm barely up for updating weekly, much less the daily posting (or more) that a truly successful site requires.

So, I'm going to stop. There's the off chance I'll post again if and when I release more open source projects. But I obviously don't have time to do that constantly, and even in my free time I'm finding other priorities more important. Maybe those too will eventually be things I choose to share with the world, but likely through another venue (or at least a different site).

I was never writing with the intention of building an audience anyway, but it does seem rather silly to continue without one. I'm not going to take anything down - the blog entries will stand, if nothing else, for my own reference, as I truly find the links and such interesting. But now when I find new links, rather than write short contrived essays, I can just copy the URLs into my notes.

Thanks for not reading, and enjoy the silence.

Sunday, April 17, 2011

"C is the desert island language"

A fun read. It makes the case that C is the "only sane choice" for much development, but that Go is perhaps its "true heir."

Sunday, April 10, 2011

Diversifying your technology portfolio

You probably use a lot of technology.

Computers. Email. Videogames. Search. Blog. Microblog. Social network. Instant messaging. Browser. Cellphone.

The list goes on and on - and in many cases, wide swathes of these products are made by the same company. Microsoft, Google, Apple, Facebook, Nokia, Sony - you can end up largely dependent on a single company for the bulk of your technology "stack."

And it's commonly marketed that this is in fact a good thing - products from the same company are expected to work better with each other, giving you more convenience and less hassle. This can be true, and in some ways I have followed it myself.

But there's a problem with this approach - you're much more sensitive to any changes. It may seem great to be deeply embedded in a particular ecosystem, but if that one company decides to change something in a manner you dislike it can have broader ramifications for you than for a typical user.

Further, it is actually a fallacy that sticking with one corporate "stack" is the ideal path towards interoperability. Any high quality technology, even if it is proprietary, should adhere to openly established standards that have been widely adopted. Good examples of this are email, XMPP, CD/DVD, RSS, and so forth.

So, I just wanted to take a moment to encourage you to diversify your technology portfolio. That is to say, take stock of what you depend on, and consider alternatives that may make you more robust to shifts in corporate strategy.

As my own rhetoric suggests, you can think of this as analogous to a good investment strategy. You don't dump all your money into one company, so why dump all your time? It's an investment too, and I would argue a more important one.

I'll close by saying that this is not intended to be a particular criticism or message, but rather a general realization I reached recently that I decided to share. Of the companies I listed above, some arguably make significantly better products than others in most domains (I'll let you decide which is which there). But even if the product is better, there is something to be said for diversification - which I suppose is why I just said it.

PS - For bonus points, include as much open source software as possible in your personal technology "stack." This has the added advantage of transparency, and it's safe to say there is not and almost certainly will never be a single corporate overlord for all OSS. If something happens that people don't like, development will be forked and the process will continue.

PPS - This approach may lead to more accounts/passwords, which is problematic. OpenID is one potential way to mitigate this problem.

Sunday, April 3, 2011

Music Theory in JavaScript

VexFlow is a pretty cool looking HTML5/JavaScript music notation renderer. And even cooler, you can directly access the underlying music theory API.
Most of my work last week consisted of writing music theory code. VexFlow now has a neat little music theory API that gives you answers to questions like the following:

  • What note is a minor 3rd above a B?
  • What are the scale tones of a Gb Harmonic Minor?
  • What relation is the C# note to an A Major scale? (Major 3rd)
  • What accidentals should be displayed for the perfect 4th note of a G Major scale?
  • etc.
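
To give a flavor of the arithmetic behind questions like these, here is a tiny hypothetical Python sketch - not the VexFlow API itself - answering the first question with pitch classes (and ignoring enharmonic spelling, which VexFlow handles properly):

NOTES = ['C', 'C#', 'D', 'D#', 'E', 'F',
         'F#', 'G', 'G#', 'A', 'A#', 'B']
INTERVALS = {'minor 3rd': 3, 'major 3rd': 4,
             'perfect 4th': 5, 'perfect 5th': 7}

def note_above(note, interval):
    # Step up by the interval's semitones, wrapping around the octave.
    return NOTES[(NOTES.index(note) + INTERVALS[interval]) % 12]

print(note_above('B', 'minor 3rd'))  # D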

The API is part of VexFlow, and can be used independently of the rendering API. Take a look at music.js in the VexFlow GitHub repository for the complete reference. There's also a handy key management library for building scores in keymanager.js.
Definitely a neat piece of software, and the rendering is pretty too (check out the tests page and scroll down to see numerous examples).

Saturday, March 26, 2011

We are all Samoan (or, why to beware of personalization)

Search engines used to be just that - tools which accepted a query and then tried to find websites related to that query. This is still true, but only partially - many "signals" beyond your query are used to choose and deliver "relevant" search results.

Here's the official word from market leader Google:
As Larry said long ago, we want to give you back “exactly what you want.” When Google was founded, one key innovation was PageRank, a technology that determined the “importance” of a webpage by looking at what other pages link to it, as well as other data. Today we use more than 200 signals, including PageRank, to order websites, and we update these algorithms on a weekly basis. For example, we offer personalized search results based on your web history and location.
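
PageRank itself is simple enough to sketch: a page's importance is, roughly, the chance that a "random surfer" following links ends up on it. Here's a toy power-iteration version in Python (a sketch of the idea only - the tiny graph and parameters are hypothetical, and production PageRank is far more involved):

def pagerank(links, damping=0.85, iters=50):
    # links maps each page to the list of pages it links to.
    pages = list(links)
    rank = dict((p, 1.0 / len(pages)) for p in pages)
    for _ in range(iters):
        new = dict((p, (1 - damping) / len(pages)) for p in pages)
        for p, outs in links.items():
            targets = outs if outs else pages  # dangling pages spread evenly
            for q in targets:
                new[q] += damping * rank[p] / len(targets)
        rank = new
    return rank

print(pagerank({'a': ['b'], 'b': ['a', 'c'], 'c': ['a']}))
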
As for rival Bing, one need look no further than their aggressive marketing campaign to see similar logic at work. Search is no longer a functional lookup tool a la grep; it is a personalized experience where piles of clever algorithms sift through data - not just about the internet, but about the searcher - in order to deliver custom results.

You may not see anything wrong with this right away - indeed, it is statistically designed to work in the vast majority of cases. The personalization after all isn't truly personal, but rather an attempt to classify people in buckets and show them things that are known to be interesting to most people in that bucket. Searches will suggest and complete in a way that most people want - for example, the next word after "music" is apparently most often "videos."

I often find this behavior irritating, as I suppose my true preferences are still quite a bit different from those of the folks in my "bucket." But it turns out that the dangers of personalization go far beyond occasional poor search results. Eli Pariser gave a compelling 10-minute talk at TED 2011 where he argues that excessive personalization results in "filter bubbles" in which we end up consuming only "information junk food."

Silly labels aside, the video is well worth the watch and the argument seems sound. By always showing the most "relevant" (i.e. most likely to be clicked/consumed) results to a given demographic, we can end up systematically ignoring whole topics that are often of political or cultural importance. I don't think things are that awful quite yet, but they are getting there.

For a non-political example, think about searching for the word "weather." It used to be that this would give you encyclopedic information about weather - links to NOAA, Wikipedia, and other well-linked resources on the topic. If you wanted to find the current weather conditions in your area then you would add a city name or zip code.

Now if you search for "weather", search engines will assume you want the latter - current weather conditions where you are. You'll get the immediate conditions, as well as numerous results for other weather services in your area. Most users are satisfied most of the time.

But think of a 5th grader who's trying to write a paper about weather. It may seem like a contrived example, but the point is that "general research" and "canonical results" no longer exist. Search engines (and other websites, such as social networks) make assumptions and tailor results based on them, and generally speaking the user has no recourse if the assumptions are in error.

Eli's talk focuses on political ramifications, and ends with the fanciful notion that algorithms must be programmed to be concerned with "ethics" and "importance" in addition to relevance. Speaking as a computer programmer, I find this rather silly - computers can have no more "ethics" than their programmers, and attempts to codify such things will inevitably have biases and oversights.

I would much rather have the simpler solution of allowing users to disable all the extra signals at their discretion. If I want to search for something and I don't care about where I am or what else I've searched for recently or anything else, I should have that option. If somebody halfway across the world does the same thing for the same search terms, they should get the same results as me.

And this takes me to the title of this entry - while this is admittedly a workaround, most personalization and other signals currently seem not to be as heavily enabled on some of the more exotic localizations of Google (and perhaps other sites, though I haven't checked). Of particular note is www.google.as - Google American Samoa.

The site is still in English, but if you search for "weather" you'll find NOAA and Wikipedia in the top 5 results (Wikipedia doesn't show up until page 3 for me on regular Google). "Pizza" similarly tells you about pizza, rather than assuming that you really want to buy a pizza near you. I haven't tested politically oriented queries, but I imagine they fare better as well (as long as you're logged out/incognito).

And so I at least have a temporary search tool for these use cases, and I hope others find it helpful as well. I imagine this workaround will go away in time, and I can only hope that search providers allow users to opt out of any and all signals, giving a "pure" search where the only signal is the query. This means if you screw up your search you'll get screwy results, same as grep - and in some cases, that's a really good thing.

Sunday, March 20, 2011

Think Stats - a free statistics and probability book for programmers

Check out Think Stats - it's free as in speech (Creative Commons license), which alone is reason enough to point it out. The code examples are Python, which is a plus in my book, and the table of contents covers most any basic case I can imagine most programmers would have to deal with. Highlights:
  • Probability distributions
  • Monty Hall (a.k.a. getting started with Bayes)
  • Central Limit Theorem, and "Why normal?"
  • Hypothesis testing
  • Estimation
  • Correlation (e.g. fits and regression)
The whole thing is sprinkled with light math but nothing intimidating, quick runnable code snippets, and nice pictures. Plus as I said above, it's truly free and open, as this sort of material really ought to be.

Why does this matter? If you're a computer programmer in this day and age, you deal with data. But a lot of programmers just handle data without understanding anything about analyzing it - that's somebody else's job. Or even worse, the programmer assumes they can intuit whatever they need about analysis - yet more than a few statistical cases are notoriously counterintuitive (see the Monty Hall example, simulated below).
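
Monty Hall is a good example precisely because a few lines of simulation settle the argument. Here's a quick sketch in Python, in the spirit of the book's runnable snippets (not code from the book itself):

import random

def monty_hall(trials=100000):
    stay = switch = 0
    for _ in range(trials):
        car = random.randrange(3)   # door hiding the car
        pick = random.randrange(3)  # contestant's initial pick
        # The host opens a door that is neither the pick nor the car.
        opened = next(d for d in range(3) if d != pick and d != car)
        # Switching means taking the one remaining closed door.
        other = next(d for d in range(3) if d != pick and d != opened)
        stay += (pick == car)
        switch += (other == car)
    return stay / float(trials), switch / float(trials)

print(monty_hall())  # roughly (0.333, 0.667) - switching wins twice as often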

So boning up a bit on your statistics is important, even if you're not an "analyst." In fact, many people with that title are more about fitting data to models than models to data - that is, they know the theory they want to support, so they "massage" the data until it looks like that. This is entirely unscientific, and a better understanding of stats will help you spot when folks try to do it.

Anyway, I hope this resource is helpful to folks, and thanks for reading!

Sunday, March 13, 2011

Music as data

It's MAD.
Music as Data (MAD) is a live programming language / environment based on Processing.org written in Clojure. It's something like SuperCollider or Chuck but aims to be easier to hack / experiment live.
Seems pretty cool - Clojure is a recent Lisp dialect, so it should be expressive (if parenthetical). Check out the GitHub repository, and I'll update with my thoughts if I get a chance to play around with it.

Sunday, March 6, 2011

Spacewar! Now in HTML5

Play the first digitally programmed "action" videogame ever created. (Note - it's two-player, no AI. Controls from the readme: the "a", "d", "s", "w" keys control one of the spaceships; the "j", "l", "k", "i" keys control the other. The controls are spin one way, spin the other, thrust, and fire.)

Check out the Wikipedia article for the full backstory, but the short version is this game was created for the PDP-1 in 1961-2. In 1997 it was ported to a Java applet (which was actually a complete PDP-1 emulator), and now it's been updated to be purely JavaScript and HTML5.

In the words of the authors, "This should see the game through Spacewar!'s 50th (and hopefully 60th) birthday. Expect another update around 2025."

Sunday, February 27, 2011

New open source R IDE

Check out RStudio - with typical IDE niceties like autocomplete and windows for documentation/object browsing/history, this might be one of the first serious competitors to Emacs Speaks Statistics. It's also nice to see that it really is open - project on GitHub and all that.

It's also cross-platform - Windows, OS X, Linux, and even the web itself. Very cool. Check out the other screenshots for a more complete feature overview; some things that catch my eye:
  • Support for Sweave/LaTeX (aka fancy typesetting and literate programming)
  • Easy plot manipulation/printing/exporting
  • Customizable layout/interface
I very much suggest checking it out, as it looks like a quite promising piece of software. It's definitely a nice fit for people who still want to write real code but with a gentler learning curve than Emacs, and it may have enough substance to even tease away a few of the Emacs gurus.

Sunday, February 20, 2011

FreedomBox (and PirateBox)

A project worth checking out.
Because social networking and digital communications technologies are now critical to people fighting to make freedom in their societies or simply trying to preserve their privacy where the Web and other parts of the Net are intensively surveilled by profit-seekers and government agencies. Because smartphones, mobile tablets, and other common forms of consumer electronics are being built as "platforms" to control their users and monitor their activity. 
Freedom Box exists to counter these unfree "platform" technologies that threaten political freedom. Freedom Box exists to provide people with privacy-respecting technology alternatives in normal times, and to offer ways to collaborate safely and securely with others in building social networks of protest, demonstration, and mobilization for political change in the not-so-normal times.
This project is still in its early stages but is definitely worth following. This New York Times article gives a bit more context, but the short version is that they're building a Debian derivative to run on "plug computers" - embedded machines housed entirely within power-plug-sized enclosures. These are already available at ~$100 and are capable of serving as simple home servers, and as the software matures and users increase, the hope is that the price will go down and the functionality will go up.

For contrast, there is a similar interesting project called PirateBox. While the FreedomBox will focus on traditional "cloud" applications like mail and calendar, PirateBox is focused on peer-to-peer connections and media sharing. Issues of legality and intellectual property aside, it's a pretty neat piece of technology, and you can build it yourself quite cheaply from off-the-shelf components.
You can think of this as a cross between the Pogoplug Multimedia Sharing Device - itself a plug computer - and the rather rebellious Dead Drops project. While piracy has a negative connotation (legally at least), the ability to freely share media is a critical aspect of any modern open society - and even within the existing legal restrictions, there is plenty of "intellectual property" that is either licensed permissively (for example, music from Jamendo) or in the public domain (for example, books from Project Gutenberg).

But back to the FreedomBox - why should you care? It's pretty simple - even if you really trust corporations and governments not to abuse the easy access the "cloud" grants to your private information, having a cheap and simple "appliance" that lets you mirror your data locally just makes sense. It serves as a backup, and as a fallback if your internet connection fails or the service you depend on simply goes down (companies do fail, and even large ones like Yahoo have cut services on relatively short notice).

For now these projects are all at early enough stages that only "enthusiasts" and "hackers" will likely be comfortable setting them up. But there is promise for the future, and if this ever reaches the "appliance" level it will hopefully even be built into routers, automatically guaranteeing all users a basic level of privacy, security, and control.

Sunday, February 13, 2011

Web development for software developers

A very handy list of hints. I'll pick out a few of my favorites:
  • Add the attribute rel="nofollow" to user-generated links to avoid spam.
  • Use SSL/HTTPS for login and any pages where sensitive data is entered (like credit card info).
  • Optimize images - don't use a 20 KB image for a repeating background.
  • Use "search engine friendly" URLs, i.e. use example.com/pages/45-article-title instead ofexample.com/index.php?page=45
  • Be aware that JavaScript can and will be disabled, and that AJAX is therefore an extension, not a baseline. Even if most normal users leave it on now, remember that NoScript is becoming more popular, mobile devices may not work as expected, and Google won't run most of your JavaScript when indexing the site.
  • Understand you'll spend 20% of your time coding and 80% of it maintaining, so code accordingly.
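
As a small illustration of the "friendly URL" tip above, here's a minimal slug generator in Python (just a sketch - most web frameworks ship a more thorough slugify of their own):

import re

def slugify(title):
    # Lowercase, turn runs of non-alphanumerics into hyphens, trim the ends.
    slug = re.sub(r'[^a-z0-9]+', '-', title.lower())
    return slug.strip('-')

print(slugify('An Article Title!'))  # an-article-title
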
There are many more good tips, so check it out if you're thinking of making a website. The JavaScript/AJAX hint in particular is a good one to remember, as a lot of hotshot designs seem to violate that rule quite blatantly (see: the recent Gawker sites redesign).

Sunday, February 6, 2011

Evolutionary music

Short entry this week, but hopefully a precursor for future entries - Triumph of the Cyborg Composer is a fascinating article well worth the read. It details the work of "Emily Howell", a computer program that composes quite remarkably (samples included in the article).

I highlight this because it's a direction similar to a project I stumbled on called Evolution 9, which is an attempt to "evolve" music to mimic the Beatles. All of this could potentially play nicely with my RNM, so look for more on this in the future.

For now, enjoy the works of Emily Howell (her newly released album is Emily Howell: From Darkness, Light), and ponder the philosophical ramifications of a composing computer (Douglas Hofstadter talks about this nicely in the latter half of Gödel, Escher, Bach: An Eternal Golden Braid).

Sunday, January 30, 2011

JavaScript in Ten Minutes

JavaScript is a maligned language with a disadvantageous lineage: hacked together for a limited purpose, rebranded because Java was cool (even though the two languages are essentially unrelated), and then stretched and used everywhere whether you want it or not. A good way to sum it up is with JavaScript in a Single Picture.

But despite these flaws, JavaScript is arguably the most commonly used programming language, at least in certain regards. Java may win in terms of official usage, but JavaScript is omnipresent and is commonly the first language someone tries "hacking around" with. And modern browsers have changed the rules - optimizing JavaScript has become a priority, and as a result it's actually possible to do some real things with it.

Adding client-side power and expressiveness to a webapp is very valuable, so knowing at least enough JavaScript not to hurt anything is helpful. The trouble is that nondestructive coding is particularly hard in a language where variables are dynamically typed and global by default (unless you use var), and where "==" comparisons coerce type (you want "===" for true equivalence). It's incredibly easy to write JavaScript code - it's incredibly hard to write good JavaScript code.

A good place to start is JavaScript in Ten Minutes, which is in some ways almost a hyper-abbreviated version of JavaScript: The Good Parts from the picture linked earlier (the other book there is JavaScript: The Definitive Guide, which is more a reference tome than an autodidactic tool). These are all fine resources, but since the first one is quick and free it's a nice place to start.

JavaScript in Ten Minutes gives a very quick overview of types and syntax, then gives some nice attention to objects and prototypes. This includes features that actually distinguish JavaScript in a good way (the inspiration behind "The Good Parts" title), such as using prototypes to easily maintain a common property across many objects.

I won't go into much more detail as the whole point is the source itself is a quick read. If you have any interest in JavaScript I'd say it's worth your 10 minutes. And if you find yourself wanting more, I'd recommend either of the O'Reilly books I linked to above.

Sunday, January 23, 2011

More Scala - is it worth migrating from Java?

After recently playing with Scala, I discovered somebody who did so more methodically and tried it out for a year. He breaks his use of Scala down into a number of categories and rates each from 1-5. His conclusion:
To summarise Scala evaluation, this is a scores matrix for all categories (1 - poor, 2 - fair, 3 - good, 4 - very good, 5 - excellent).

  • Programming language - very good (4)
  • Testing - very good (4)
  • Performance - very good (4)
  • Tools - fair (2)
  • Language extensions - good (3)
  • Interoperability - excellent (5)
  • Monitoring and maintenance - good (3)
  • Support - good (3)
  • Scala skills - poor (1)
A good read, especially if your curiosity was at all piqued by Scala from my entry last week. I very much agree that, while pushing Scala for "enterprise" is a bit risky (though at least moderately feasible due to the interoperability), it is definitely a worthy tool for personal use.

Sunday, January 16, 2011

Scala - concise Java?

As evidenced by my projects on Google Code, Python is my tool of choice for general programming. Python has a strong ecosystem and good technical underpinnings, but is also genuinely intuitive and even expressive. My pseudocode is closer to Python than any other language.

But I recognize that, in the "real world", Java is king. With top popularity ratings, Java is simply what people build "industry" applications with. I have used Java, but exclusively in the classroom - I have never pursued it recreationally or professionally. I can tell you why in one elongated word: FactoryFactoryFactoryFactory...

Python is concise. In the words of The Nuts and Bolts of College Writing, "Concision is intimately connected to clarity." Java is verbose - it can be argued that the verbosity is for sound structural and theoretical reasons, but it is verbosity nonetheless.

And so I was intrigued when I ran across the assertion that Scala == Effective Java. Scala has all the power and resources of Java, but with minimal additional requirements (basically one library) it adds a quality I would have to describe as almost Pythonic.

Java and Python both have well-defined philosophies, but Java tends to enforce itself in an almost exaggerated manner while Python more elegantly encourages good style. Scala tries to give this quality to Java, and from my initial perusing it meets with a fair amount of success.

As an example, here is a simple extension of a "Hello World!" program. The idea is to write a "HelloArgs" program that says "Hello " followed by any command line arguments passed to it. Since these arguments normally arrive as a list/vector/array of strings/chars/whatever, this requires a wee bit of parsing to compress them into one line, separated by spaces.

Here it is in Python (3.1):
def HelloArgs(args):
    print('Hello ' + ' '.join([str(arg) for arg in args]))
A one-liner (at 80 columns), reasonably sensible, though the idea of taking a blank space and applying join to it might be a bit unintuitive at first. The list comprehension syntax is reasonable though, and better than this esoteric alternative:
def HelloArgs(args):
    print('Hello ' + ' '.join(map(''.join, args)))
This works and is fewer characters, but this isn't a game of Perl golf. Brevity is connected to clarity, but you still need enough body to communicate your message.

Now, here is an equivalent method in Scala:
object HelloArgs {
  def main(args : Array[String]) : Unit = {
    Console.println("Hello " + args.mkString(" "))
  }
}
Admittedly there are a few more lines, but this is actually a complete, valid Scala application (the Python version above is just a function, and would need a bit of scaffolding to run from the command line - see below). It's simple, doesn't take too long to figure out ("mkString" builds a single string from the collection, joined by the given separator), and plays nicely in JVM land.
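
For reference, here's that scaffolding - the standard Python idiom for making a module runnable as a script:

import sys

def HelloArgs(args):
    print('Hello ' + ' '.join([str(arg) for arg in args]))

if __name__ == '__main__':
    HelloArgs(sys.argv[1:])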

Now really this barely scratches the surface, and that's because I don't know much about Scala yet. I am looking forward to learning more, and for those interested I highly suggest the Eclipse Scala IDE. I'm using the Helios version and it went perfectly smoothly, and I have to admit I do see a place for "robust" (if large) IDEs. I like doing Python in Vim of course, but that's another matter...

Sunday, January 9, 2011

IPv6 - Now with Nethack

Oil. Uranium. Coal.

There are many limited resources in the world, but with all our gadgetry we have now created a new one - IPv4 addresses.

Our numeric friends that help all our technology communicate are in fact a rapidly disappearing commodity - less than 3% remain to be allocated, with depletion estimated between June and December of this year.

Of course you may already have heard that - the cries of IPv4 end-times have rung for some time. People have been content to ignore them as long as the status quo works, even though we have a perfectly suitable alternative that we really ought to start using. That is, IPv6.

IPv4 was invented many years ago, before the potential of the modern internet was realized. IPv6 accordingly offers a number of enhancements - better mobile support, mandatory security support (IPsec), larger packets, and so forth. But the most obvious difference, and its raison d'être, is that it supports a much larger address space. That is to say, you can have more addresses.

This of course makes the addresses longer, and since they're written in hexadecimal, letters appear alongside the digits. Here's an IPv6 address: 2a00:1018:801:1000::4d3f:a9c8

Note that this address is in fact abbreviated. The rules let you omit leading 0's within each group of an IPv6 address and collapse a single run of all-zero groups into "::". The full address being represented above is: 2a00:1018:0801:1000:0000:0000:4d3f:a9c8
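
If you want to play with these abbreviation rules yourself, Python's standard ipaddress module (added in Python 3.3, so after this post was written) does the expanding and compressing for you:

import ipaddress

addr = ipaddress.IPv6Address('2a00:1018:801:1000::4d3f:a9c8')
print(addr.exploded)    # 2a00:1018:0801:1000:0000:0000:4d3f:a9c8
print(addr.compressed)  # 2a00:1018:801:1000::4d3f:a9c8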

An IPv6 address consists of 8 groups of four hexadecimal digits (16 bits each), totaling 128 bits and allowing for approximately 3.4 x 10^38 addresses. IPv4 addresses are 32 bits and allow for approximately 4.3 x 10^9 addresses.

In other words, there are already more human beings than IPv4 addresses, while there are more IPv6 addresses than there are stars in the observable universe. As an aside, according to current estimates there are actually still more atoms in the universe than IPv6 addresses.

Of course, if you read the title of this post you're probably wondering "what about Nethack?" Well, the example IPv6 address above actually points at a Debian VPS that I fought long and hard with, and which now serves Nethack (much like nethack.alt.org). Just telnet to it and you'll be greeted with Nethack, and you can be one of the first to play this classic game in a brave new internet realm.

But those long addresses are a pain to bandy about, so thankfully DNS works just peachy over IPv6 as well. I assigned nethack.proles.net to the above IPv6 address, so if you have IPv6 access you can telnet right to that and be on your way.

Now you probably don't actually have IPv6 access. If you're on *nix you're in luck - Miredo is easy to install and configure and gets you IPv6 access without even needing to set up an account. Otherwise, I'd suggest Tunnelbroker. It'll take some doing, but you'll be on the forefront of where the internet is going (or at least needs to go).

And that's that - I'll close by just saying that the layers of geekery inherent in Nethack over IPv6 bring me great joy. Thanks for reading!

Addendum - nethack.fi also supports IPv6 (and IPv4, whereas my offering is IPv6 only). Oh well, still cool to add something to the IPv6 world.

Sunday, January 2, 2011

A year in R

Tis the season for retrospectives, and I happened across a particularly appropriate one for statistical programming - R in 2010.

In a nutshell, R is an open-source, object-oriented, functional programming language intended for statistical applications, based on S (a Bell Labs product from the 70s). Over the past decade it has become the de facto standard for statisticians (i.e. people actually in stats departments), and it has enjoyed increasing use in biostatistics, computer science, and the social sciences.

The most interesting part of the linked post is the first section, listing the top 14 R-related posts of 2010. This gives a good idea of the range of tasks R can be used for, from publication-quality graphics to data mining to game theory (the Prisoner's Dilemma).

Personally, I've mostly used R in combination with Sweave/LaTeX ("literate programming") to simultaneously perform statistical analysis and typeset the report (potentially even automating the whole process). This guide looks interesting, though it adds Eclipse to the mix (I find Emacs Speaks Statistics to be the best way to use R).
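
For the curious, a Sweave document is just LaTeX with R code chunks embedded; when processed, the chunks run and their output (including plots) lands in the typeset report. A minimal sketch (the document itself is hypothetical, but the <<>>= chunk and \Sexpr syntax are the real Sweave mechanics):

\documentclass{article}
\begin{document}
<<sample, fig=TRUE>>=
x <- rnorm(100)  # draw 100 standard normals
hist(x)          # this plot is typeset as a figure
@
The sample mean was \Sexpr{round(mean(x), 3)}.
\end{document}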

Of course, if all of the above sounds like gibberish to you and you just want to get started, there are plenty of resources for that too. R is a powerful, well-supported, and surprisingly expressive language, and I would advocate it for anybody doing statistics beyond basic summaries (e.g. means and the like). It has a steep learning curve (it's driven from an interactive prompt, i.e. a "command line"), but that can be made gentler by installing GUI front-ends such as R Commander.

All in all, a good year for a good software package.