__lucas

npm version

Found a nice little helper in npm for my module development workflow that I haven’t heard anyone bring up before and thought I would share: npm version.

My most frequent screw-ups when publishing module changes are:

  • forgetting to increment the version number in package.json
  • making a promise to add git tags for each version, and then promptly forgetting to add them 4 times out of 10

Fortunately, npm version is here to help.

With no args, you’ll get the versions of npm, node and node’s dependencies. If you’re in a module with a package.json, you’ll get its version number as well.

{
    http_parser: '1.0',
    node: '0.10.11',
    v8: '3.14.5.9',
    ares: '1.9.0-DEV',
    uv: '0.10.11',
    zlib: '1.2.3',
    modules: '11',
    openssl: '1.0.1e',
    npm: '1.2.30',
    'imlucas.github.com': '0.0.1'
}

npm version also has a setter form. The most basic is to just set the version in package.json and create a tag.

[master] imlucas.github.com/ npm version 0.0.2
v0.0.2
[master] imlucas.github.com/ cat package.json
{
  "name": "imlucas.github.com",
  "version": "0.0.2",
  "dependencies": {}
}
[master] imlucas.github.com/ git tag -l
v0.0.1
v0.0.2

That’s handy, but prone to fat-fingering. Instead, you can pass any argument that’s valid for semver.inc: major, minor, patch or build. This makes for a pretty nice day-to-day workflow.

# ... write code
# ... run tests
# ... commit

[master] imlucas.github.com/ npm version patch
v0.0.3
[master] imlucas.github.com/ cat package.json
{
  "name": "imlucas.github.com",
  "version": "0.0.3",
  "dependencies": {}
}
[master] imlucas.github.com/ git tag -l
v0.0.1
v0.0.2
v0.0.3

# ... push
# ... npm publish

Please do this. People using your modules for their projects will be extremely grateful and not send you hate mail. Now that I’ve been using this for a bit, next steps are:

  • is it helpful to generate a CHANGELOG.md with git log <previous-version>..<new-version> and add that as the tag commit message? (a rough sketch of this follows below)
  • how does it feel to increment the package build number (with the branch name) after running through CI?
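
A rough sketch in Python of that first idea (the tag names are just examples, and none of this is part of npm version itself):

import subprocess

def changelog(previous_version, new_version):
    """Return a one-line-per-commit summary between two version tags."""
    return subprocess.check_output(
        ['git', 'log', '--oneline', '{0}..{1}'.format(previous_version, new_version)])

# e.g. after `npm version patch` bumps v0.0.2 to v0.0.3
with open('CHANGELOG.md', 'wb') as fh:
    fh.write(changelog('v0.0.2', 'v0.0.3'))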

Waiting for Instances to Start with Python and Boto

Again, this is something I do almost every day and finally got around to scripting it. At some point there was a helper in boto to do this, but it seems to have been factored out.

import boto
import copy
import time

def wait_for_instances_to_start(conn, instance_ids, pending_ids):
    """Loop through all pending instace ids waiting for them to start.
        If an instance is running, remove it from pending_ids.
        If there are still pending requests, sleep and check again in 10 seconds.
        Only return when all instances are running."""
    reservations = conn.get_all_instances(instance_ids=pending_ids)
    for reservation in reservations:
        for instance in reservation.instances:
            if instance.state == 'running':
                print "instance `{}` running!".format(instance.id)
                pending_ids.remove(instance.id)
            else:
                print "instance `{}` starting...".format(instance.id)
    if len(pending_ids) == 0:
        print "all instances started!"
    else:
        time.sleep(10)
        wait_for_instances_to_start(conn, instance_ids, pending_ids)

# Connect to EC2
conn = boto.connect_ec2()

# Run an instance
reservations = [conn.run_instances('<ami-image-id>')]

instance_ids = [instance.id for reservation in reservations
                for instance in reservation.instances]

# Wait for our instances to start
wait_for_instances_to_start(conn, instance_ids, copy.deepcopy(instance_ids))

Waiting for Spot Instances to Be Fulfilled with Python and Boto

Something I do constantly is start spot instances, wait for them to be fulfilled and then work with them. I finally got around to just scripting this yesterday.

import boto
import copy
import time

def wait_for_fulfillment(conn, request_ids, pending_request_ids):
    """Loop through all pending request ids waiting for them to be fulfilled.
    If a request is fulfilled, remove it from pending_request_ids.
    If there are still pending requests, sleep and check again in 10 seconds.
    Only return when all spot requests have been fulfilled."""
    results = conn.get_all_spot_instance_requests(request_ids=pending_request_ids)
    for result in results:
        if result.status.code == 'fulfilled':
            pending_request_ids.remove(result.id)
            print "spot request `{}` fulfilled!".format(result.id)
        else:
            print "waiting on `{}`".format(result.id)

    if len(pending_request_ids) == 0:
        print "all spots fulfilled!"
    else:
        time.sleep(10)
        wait_for_fulfillment(conn, request_ids, pending_request_ids)

# Connect to EC2
conn = boto.connect_ec2()

# Request a spot instance
# Requests is a list because in the real world, you'll probably
# want to make these requests in multiple availability zones
requests = [conn.request_spot_instances(price, image_id, count=1,
        type='one-time', instance_type='t1.micro')]

# Figure out what our actual spot request ids are
request_ids = [req.id for reqs in requests for req in reqs]


# Wait for our spots to fulfill
wait_for_fulfillment(conn, request_ids, copy.deepcopy(request_ids))

idancohen:

My friend Omri created a Genome Compiler making synthetic biology more accessible and making projects like these glow in the dark plants a reality.

These tools are revolutionary.

More of this.  See the main DIYBio mailing list.

Sorted Sets in DynamoDB

With Local Secondary Indexes now available in Dynamo, lots of new possibilities open up. The one that came to mind right after the announcement was sorted sets. If you’re familiar with Redis, you’ll already know how handy these can be. On ex.fm, this is where all of the data for trending comes from. As users interact with a song, points are added to its daily total. In Redis, this looks something like:

ZINCRBY trending:20130419 1 song_id:1
ZINCRBY trending:20130419 0.1 song_id:1
ZINCRBY trending:20130419 0.3 song_id:2
ZINCRBY trending:20130419 0.7 song_id:3
ZINCRBY trending:20130419 0.9 song_id:4

Then to grab song ids to show for trending we just grab the top 20 by points:

ZREVRANGE trending:20130419 0 19

Now with secondary indexes, we can implement the same basic functionality in Dynamo and eliminate another moving part.

If we have a Trending table with a hash on day, a range on song_id and a secondary index on points, our full schema looks something like this.

{
    "KeySchema": [
        {
            "AttributeName": "day",
            "AttributeType": "HASH"
        },
        {
            "AttributeName": "song_id",
            "AttributeType": "RANGE"
        }
    ],
    "AttributeDefinitions": [
        {
            "AttributeName": "day",
            "AttributeType": "S"
        },
        {
            "AttributeName": "song_id",
            "AttributeType": "S"
        },
        {
            "AttributeName": "points",
            "AttributeType": "N"
        }
    ],
    "LocalSecondaryIndexes": [
        {
            "IndexName": "points-index",
            "KeySchema": [
                {
                    "AttributeName": "day",
                    "KeyType": "HASH"
                },
                {
                    "AttributeName": "points",
                    "KeyType": "RANGE"
                }
            ],
            "Projection": {
                "NonKeyAttributes": [
                    "song_id"
                ],
                "ProjectionType": "INCLUDE"
            }
        }
    ]
}

Coding this up with mambo is a little easier to chew on:

var mambo = require('mambo');

var model = new mambo.Model(new mambo.Schema(
    'Trending', 'trending', ['day', 'song_id'],
    {
        'day': mambo.StringField,
        'song_id': mambo.StringField,
        'points': mambo.NumberField,
        'points-index': new mambo.IndexField('points').project(['song_id'])
    }
));

Then instead of calling ZINCRBY, we just make an atomic update:

var items = [
        ['song_id:1', 1],
        ['song_id:1', 0.1],
        ['song_id:2', 0.3],
        ['song_id:3', 0.7],
        ['song_id:4', 0.9]
    ],
    day = '20130419';

items.map(function(item){
    return model.update('trending', day, item[0])
        .inc('points', item[1])
        .commit();
});

Instead of calling ZREVRANGE to get our most popular song ids, we query with our secondary index:

model.objects('trending', day)
    .index('points-index')
    .reverse() // Return DESC by points
    .fetch()
    .then(function(items){
        var ids = items.map(function(item){
            return item.song_id;
        });
    });

After a jetlag-fueled morning of reading up on how to actually use secondary indexes and getting this initial version working, I added the SortedSet helper right into mambo. I’m really excited to get this into production and see how it actually performs.

Lucas:
its like promises you know? you already use'em and love'em, but you don't know that's what they actually are
jm:
just like t-pain

24C3: Programming DNA: A 2-bit language for engineering biology

Lucas:
so just for the curious...
Tallinn - St Petersburg Depart Mon 18 March 2013 at 1900 - Duration 16hr 30min
St Petersburg - Tallinn Depart Tue 19 March 2013 at 1900 - Duration 63hr 30min
by ferry
YOUR TICKET PRICE IS
680.37 GBP
!!!!
Jess:
nooooooooo
don't do eet!
Lucas:
63 HOURS!
Jess:
hehehe
Lucas:
its like a hundred miles between tallinn and st petersburg
Jess:
they have to break the ice
cut them a break

Why PhantomJS is Awesome

We started working on the sites redesign a while back. The extension has always kept a history of the sites with music you’ve visited and exposed the usual iTunes-style grid UI to browse that history and drill down to listen to everything from a single site. All of this history was kept in Local Storage, client side only. This had two major drawbacks we wanted to address immediately on this project.

Kill Local Storage

Local Storage has a hard limit of 5 MB of data. There is no way in the API to see how close you are to that limit. Our initial tactic for addressing this issue was to limit the history to 1000 songs. Even with this change, there were still issues storing a lot of information because reads and writes always go to disk. On an SSD, you probably didn’t notice the hiccups. On a spinning disk though, this could mean the extension locking up the browser for 15 seconds or more. Unacceptable.

Device Portability

Storing all of this information only in Local Storage of course meant it was limited to a single device. It’s a bad experience having different data on your machine at work vs. your laptop at home, not to mention it’s just weird not being able to access your information from your phone at all.

Enter PhantomJS

A couple of weeks into the sites redesign, we had all of our new APIs in place and things were looking good, but something was really missing. We could easily grab all of the meta info for a site (title, favicon, description). For SoundCloud pages and Bandcamp albums, we could easily grab the album art and show that as a thumbnail for the site, but we really needed screenshots for all those blogs posting really amazing music. We played around with the usual GTK/Qt based screenshot scripts, but they all had their own downsides and just didn’t work.

A few days into this experimenting, we were starting to lose hope. PhantomJS looked great, but involved a lot of hacking to get it running on Ubuntu EC2 instances. Miraculously, just after we discovered Phantom, 1.5 was released and it was huge. 1.5 ran perfectly on the instances with no hacking!

Neat! We can take screenshots now!

Once we found Phantom could solve our screenshot problem, we started to dig deeper. The extension code that finds what’s playable on a page was constantly in flux. APIs and regexes needed constant updating, and pushing out extension updates takes several days to filter down to all users, meaning our core functionality could be broken for several days. With PhantomJS, we were able to move almost all of this discovery code to the backend. When an API or regex changes, we can now immediately push an update to the server, and almost no one notices. To make this happen, we prototyped a small wrapper script to load the page in PhantomJS and inject the extension’s playable discovery code. It was a burning bush type moment.

Getting it to production

Our initial design was to use celery workers to call the PhantomJS script via subprocess. Predictably, this completely bombed all of our servers. After a quick re-think, we whipped together a little express app that would run on a single instance as a REST service and memcache discovery results for a few minutes. Celery workers could then just make an HTTP call, and this has worked great. (Side note: service-ify everything.)
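
The worker side then ends up being nothing more than an HTTP call. A minimal sketch of what that can look like (the host, endpoint and response shape here are made up for illustration, not our actual API):

import requests

DISCOVERY_SERVICE = 'http://discovery.internal.example.com'  # hypothetical host

def discover_playables(page_url):
    """Ask the PhantomJS-backed REST service what's playable on a page.

    The service loads the page in PhantomJS, runs the injected discovery
    code and memcaches the result for a few minutes, so repeated calls
    for the same URL are cheap."""
    res = requests.get(DISCOVERY_SERVICE + '/discover',
                       params={'url': page_url}, timeout=30)
    res.raise_for_status()
    return res.json()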

Other Improvements

For any PhantomJS script longer than just a few lines, make your life easier and use phantomjs-nodify or hack in your own require function override (PhantomJS’s built-in require can only access a subset of modules like fs, although hopefully this will change in the future). Without a good require function, longer scripts will get out of control really quickly. Modularize as much as possible, as early as possible.

Set a timeout for child_process.exec and another loading timeout in your page loader script to avoid thousands of processes hanging out.

Use cluster on the express app so you can swamp the whole box.

Handle errors in your page loader script. Scripts you are injecting will sometimes be broken because of unexpected failures, broken scripts on the pages you’re trying to load, or unexpected native object prototype overrides. Make sure to at least log failures somewhere, as most of these problems are easily fixed.

Securing MongoDB on AWS

Security groups are one of the many great features of AWS. To give you a quick primer: security groups let you define inbound traffic rules based on IP address, much like iptables, but without having to edit the rules on every machine when you change them. They also let you apply rules based on security group, so that your application instances can talk to mongod, but nothing outside of your application can.

The following assumes that you, the reader, already have some working knowledge of how security groups work and are using them. If not, take a look at Building three-tier architectures with security groups on the AWS blog.

Mongo’s built-in security is admittedly weak at best, but getting better. In fact, mongo’s default seems to be to have you not use its built-in security at all, instead putting the onus on you to create a "trusted environment". The most common way to set up this trusted environment on AWS (other than not having one at all) is to modify the security group under which your instances are currently running and allow access to mongo only from specific IP addresses. This is also a great way to guarantee heartburn when you forget to allow a new instance or want to make your infrastructure more liquid. A more effective way is to allow only traffic from specific security groups — that is, only from your instances. An important note: in order to use security group based rules, you must connect to the instance via the private IP address. Internally, EC2 will route the public DNS ec2-100-00-000-000.compute-1.amazonaws.com to the correct private IP address of 10.100.00.00 for you.

We found the automatic routing to be a bit funky when actually trying to get this all set up and into production. Instead, we created another domain in Route 53, say supersecretex.am, and added A records pointing directly to the private IPs. This also made it easier to tell which parts of the application we knew to be locked down, and which were intentionally more open. We also added A records pointing to the public DNS to make SSH, health checks and ops involve less typing. Needless to say, you should also have your mongo instances associated with an elastic IP address. There’s nothing worse than the DNS changing out from under you and being woken up in the middle of the night because your app is now completely down.
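
Creating those internal records is easy to script as well. A rough sketch using boto’s Route 53 support (the zone id, hostnames and private IPs below are placeholders):

import boto
from boto.route53.record import ResourceRecordSets

conn = boto.connect_route53()
zone_id = 'Z123EXAMPLE'  # hosted zone id for supersecretex.am (placeholder)

# Map friendly internal names straight to the private IPs
internal_hosts = {
    'mongo-bert.supersecretex.am.': '10.100.0.10',
    'mongo-ernie.supersecretex.am.': '10.100.0.11',
    'mongo-arbiter.supersecretex.am.': '10.100.0.12',
}

changes = ResourceRecordSets(conn, zone_id)
for name, private_ip in internal_hosts.items():
    record = changes.add_change('CREATE', name, 'A', ttl=300)
    record.add_value(private_ip)
changes.commit()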

Lock It Down and Simplify

Say we have five instances: Two web apps (oscar and elmo), a mongo primary (bert), a mongo secondary (ernie) and an arbiter (arbiter). The five instances are spread out between two security groups, and we’re using the “add IP addresses as we add new instances” approach.

Web - sg1

Port - Source
22 - (SSH) 0.0.0.0/0
80 - (HTTP) 0.0.0.0/0
443 - (HTTPS) 0.0.0.0/0

Mongo - sg2

Port - Source
22 - (SSH) 0.0.0.0/0
27017 - arbiter's ip address
27017 - bert's ip address
27017 - ernie's ip address
27017 - oscar's ip address
27017 - elmo's ip address
30000 - bert's ip address
30000 - ernie's ip address

Now this is all well and good, but it’s a headache to remember to add a new web instance’s IP to the Mongo security group every time you add one, not to mention when using autoscaling or CloudFormation. So let’s simplify this by adding a few new rules to the mongo group to allow inbound traffic from the web and mongo security groups. This is easy to do in the AWS console (it will autocomplete the correct security groups for you when you start typing sg into the source input box), or with boto, as sketched after the table below. Our mongo security group will now look like this:

Port - Source
22 - (SSH) 0.0.0.0/0
27017 - sg1
27017 - sg2
27017 - arbiter's ip address
27017 - bert's ip address
27017 - ernie's ip address
27017 - oscar's ip address
27017 - elmo's ip address
30000 - sg2
30000 - bert's ip address
30000 - ernie's ip address
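
If you’d rather script those rules than click around the console, the same group-to-group rules can be added with boto. A minimal sketch, assuming the groups really are named sg1 and sg2:

import boto

conn = boto.connect_ec2()
groups = dict((g.name, g) for g in conn.get_all_security_groups())
web, mongo = groups['sg1'], groups['sg2']

# mongod: reachable from the web tier and from the other mongo boxes
mongo.authorize(ip_protocol='tcp', from_port=27017, to_port=27017, src_group=web)
mongo.authorize(ip_protocol='tcp', from_port=27017, to_port=27017, src_group=mongo)

# arbiter port: only the replica set members need to reach it
mongo.authorize(ip_protocol='tcp', from_port=30000, to_port=30000, src_group=mongo)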

Don’t go deleting those old rules just yet! We have a bit more to do to make sure things go smoothly.

Changing replica set configuration

After you add your new DNS records and before you restrict access to the security group, you’ll want to update your replica set configuration to point to the new DNS entries. Don’t worry. It’s not as scary as it sounds.

PRIMARY> rs.conf()
{
    "_id" : "mydb",
    "version" : 1,
    "members" : [
        {
            "_id" : 0,
            "host" : "mongo-bert.company.com:27017",
            "self": true
        },
        {
            "_id" : 1,
            "host" : "mongo-arbiter.company.com:30000",
            "arbiterOnly" : true
        },
        {
            "_id" : 2,
            "host" : "mongo-ernie.company.com:27017"
        }
    ]
}

The first thing we’ll want to do is update the host entry for each member (bert, the arbiter and ernie). So jump into the mongo shell on your primary and run the following:

PRIMARY> conf = rs.conf()
PRIMARY> conf.members[0].host = "mongo-bert.int-company.com:27017"
PRIMARY> conf.members[1].host = "mongo-arbiter.int-company.com:30000"
PRIMARY> conf.members[2].host = "mongo-ernie.int-company.com:27017"
PRIMARY> rs.reconfig(conf)

Running this on the primary will automatically update the configuration on your secondary and arbiter, so you just have to run it once. More information on replica set reconfiguration can be found in the docs. You’ll probably also want to update the hostnames on your mongo instances to use the new DNS entries.

MMS

If you’re using MMS, which you absolutely should be, you’ll need to make some changes so it doesn’t freak out and start sending you alerts. You can edit /etc/hosts on the instance running MMS (usually the arbiter) to point the public DNS names at the private addresses. In our case, we just deleted our old hosts in MMS that were using the public DNS records and added the new internal DNS records for that oh-so-fresh feeling, and didn’t really mind losing all of the old, bad data.

Let’s Do This!

We’re now ready to remove those old IP address specific rules. By now you’ve:

  • Updated your application configuration to point to the internal DNS records
  • Deployed your application to production with the new configuration changes
  • Updated MMS so it will be happy after this change

Go ahead and delete the IP address based rules in the Mongo security group and hit apply. The Mongo security group will now look like this:

Port - Source
22 - (SSH) 0.0.0.0/0
27017 - sg1
27017 - sg2
30000 - sg2

Much nicer, no? Your mongo cluster is now running in a trusted environment and you can add and remove machines as needed.