My friend Omri created Genome Compiler, which makes synthetic biology more accessible and makes projects like these glow-in-the-dark plants a reality.
These tools are revolutionary.
More of this. See the main DIYBio mailing list.
With Local Secondary Indexes now available in Dynamo, lots of new possibilities open up. The one that came to mind right after the announcement was sorted sets. If you’re familiar with Redis, you’ll already know how handy these can be. On ex.fm, this is where all of the data for trending comes from. As users start interacting with songs, points are added to each song’s daily total. In Redis, this looks something like:
ZINCRBY trending:20130419 1 song_id:1
ZINCRBY trending:20130419 0.1 song_id:1
ZINCRBY trending:20130419 0.3 song_id:2
ZINCRBY trending:20130419 0.7 song_id:3
ZINCRBY trending:20130419 0.9 song_id:4
Then to grab song ids to show for trending we just grab the top 20 by points:
ZREVRANGE trending:20130419 0 19
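To make the behaviour we want concrete, here is a rough pure-Python model of a sorted set: add points to a member, then read members back highest score first. It is only an illustration, not a Redis client; the `SortedSet` class and its method names are made up for this sketch, and ties are not ordered the way Redis orders them.

```python
from collections import defaultdict


class SortedSet:
    """Toy in-memory stand-in for a single sorted set (hypothetical helper)."""

    def __init__(self):
        self.scores = defaultdict(float)

    def incr(self, member, amount):
        # Similar in spirit to Redis's ZINCRBY: add to the member's score,
        # starting from 0 if the member is new.
        self.scores[member] += amount
        return self.scores[member]

    def top(self, n):
        # Members ordered by score, highest first, limited to n entries.
        ranked = sorted(self.scores.items(), key=lambda kv: kv[1], reverse=True)
        return [member for member, _ in ranked[:n]]


trending = SortedSet()
events = [
    ("song_id:1", 1),
    ("song_id:1", 0.1),
    ("song_id:2", 0.3),
    ("song_id:3", 0.7),
    ("song_id:4", 0.9),
]
for song_id, points in events:
    trending.incr(song_id, points)

top_ids = trending.top(20)
```

The two events for song_id:1 accumulate into one score, which is the property the rest of this post tries to keep when moving to Dynamo.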
Now with secondary indexes, we can implement the same basic functionality in Dynamo and eliminate another moving part.
If we have a Trending table with a hash on day, a range on song_id and a secondary index on points, our full schema looks something like this.
{
"KeySchema": [
{
"AttributeName": "day",
"KeyType": "HASH"
},
{
"AttributeName": "song_id",
"KeyType": "RANGE"
}
],
"AttributeDefinitions": [
{
"AttributeName": "day",
"AttributeType": "S"
},
{
"AttributeName": "song_id",
"AttributeType": "S"
},
{
"AttributeName": "points",
"AttributeType": "N"
}
],
"LocalSecondaryIndexes": [
{
"IndexName": "points-index",
"KeySchema": [
{
"AttributeName": "day",
"KeyType": "HASH"
},
{
"AttributeName": "points",
"KeyType": "RANGE"
}
],
"Projection": {
"NonKeyAttributes": [
"song_id"
],
"ProjectionType": "INCLUDE"
}
}
]
}
Coding this up with mambo is a little easier to chew on:
var mambo = require('mambo');
var model = new mambo.Model(new mambo.Schema(
'Trending', 'trending', ['day', 'song_id'],
{
'day': mambo.StringField,
'song_id': mambo.StringField,
'points': mambo.NumberField,
'points-index': new mambo.IndexField('points').project(['song_id'])
}
));
Then instead of calling ZADD, we just make an atomic update:
var items = [
        ['song_id:1', 1],
        ['song_id:1', 0.1],
        ['song_id:2', 0.3],
        ['song_id:3', 0.7],
        ['song_id:4', 0.9]
    ],
    day = '20130419';
items.map(function(item){
    // item[0] is the song id, item[1] is the number of points to add
    return model.update('trending', day, item[0])
        .inc('points', item[1])
        .commit();
});
Instead of calling ZRANGE to get our most popular song ids, we query with our secondary index:
model.objects('trending', day)
.index('points-index')
.reverse() // Return DESC by points
.fetch()
.then(function(items){
var ids = items.map(function(item){
return item.song_id;
});
});
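For readers not using mambo, the same two operations can be sketched as plain request parameters. This is a sketch under assumptions, not how mambo works internally: it assumes the table is named `trending`, that you call DynamoDB through boto3's low-level client (`update_item` and `query`), and the two builder function names are invented here. Nothing below talks to AWS; it only builds the dictionaries.

```python
def increment_points_request(day, song_id, points):
    """Build kwargs for an UpdateItem call that atomically adds points."""
    return {
        "TableName": "trending",  # assumed table name
        "Key": {"day": {"S": day}, "song_id": {"S": song_id}},
        # ADD creates the attribute if missing, otherwise increments it.
        "UpdateExpression": "ADD #p :inc",
        "ExpressionAttributeNames": {"#p": "points"},
        "ExpressionAttributeValues": {":inc": {"N": str(points)}},
    }


def top_songs_request(day, limit=20):
    """Build kwargs for a Query on points-index, highest points first."""
    return {
        "TableName": "trending",  # assumed table name
        "IndexName": "points-index",
        # Attribute names are aliased to avoid clashing with reserved words.
        "KeyConditionExpression": "#d = :day",
        "ExpressionAttributeNames": {"#d": "day"},
        "ExpressionAttributeValues": {":day": {"S": day}},
        "ScanIndexForward": False,  # descending by the index range key
        "Limit": limit,
    }


update_kwargs = increment_points_request("20130419", "song_id:1", 0.1)
query_kwargs = top_songs_request("20130419")
```

With a boto3 client in hand, these would be passed along as `client.update_item(**update_kwargs)` and `client.query(**query_kwargs)`.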
After a jetlag-fueled morning of reading up on how to actually use secondary indexes and getting this initial version working, I added the SortedSet helper right into mambo. I’m really excited to get this into production and see how it actually performs.
We started working on the sites redesign a while back. The extension has always kept a history of which sites with music you’ve visited and then exposed the usual iTunes grid UI to browse your history and drill down to listen to everything from just a single site. All of this history was kept in local storage and only on the client side. This had two major drawbacks we wanted to solve immediately on this project.
Local Storage has a hard limit of 5 MB of data. There is no way in the API to see how close you are to that limit. Our initial tactic for addressing this issue was to limit the history to 1000 songs. Even with this change, there were still issues storing a lot of information because reads and writes always go to disk. On an SSD, you probably didn’t notice the hiccups. On a spinning disk though, this could mean the extension locking up the browser for 15 seconds or more. Unacceptable.
Storing all of this information only in Local Storage of course meant it was limited to one device. It’s a bad experience having different data on your work machine versus your laptop at home, not to mention it’s just weird not being able to access your information from your phone in any fashion.
A couple of weeks into the sites redesign, we had all of our new APIs in place, things were looking good, but something was really missing. We could easily grab all of the meta info for a site (title, favicon, description). For SoundCloud pages and Bandcamp albums, we could easily grab the album art and show that as a thumbnail for the site, but it really needed screenshots for all those blogs posting really amazing music. We played around with the usual GTK/Qt based screenshot scripts, but they all had their own downsides and just didn’t work.
A few days into this experimenting process, we were starting to lose hope. PhantomJS looked great, but involved a lot of hacking to get it running on Ubuntu EC2 instances. Miraculously, just after we discovered Phantom, 1.5 was released and it was huge. 1.5 ran perfectly on the instances with no hacking!
Once we found Phantom could solve our screenshot problem, we started to dig deeper. The extension code that finds what things are playable on a page was constantly in flux. APIs and regexes would need constant updating, and pushing out extension updates takes several days to filter down to all users, meaning our core functionality could be broken for users for several days. With PhantomJS, we were able to move almost all of this discovery code to the backend. When an API or regex changes, we can now immediately push an update to the server, and almost no one notices. In order to make this actually happen, we prototyped a small wrapper script to load the page in PhantomJS and inject the extension’s playable discovery code. It was a burning-bush-type moment.
Our initial design was to use Celery workers to call the PhantomJS script via subprocess. All too predictably, this completely bombed all of our servers. After a quick re-think, we whipped together a little Express app that would run on a single instance as a REST service and memcache discovery results for a few minutes. Celery workers could then just make an HTTP call, and this has worked great. (Side note: service-ify everything.)
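The "cache discovery results for a few minutes" part is the piece that keeps repeated requests for the same page from each spawning a PhantomJS process. A minimal sketch of the idea, using an in-process dictionary as a stand-in for memcached; the `TTLCache` class and the `discover` function are hypothetical and only illustrate the pattern, not our actual service.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for memcached with per-entry expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be exercised in tests
        self.entries = {}

    def get_or_compute(self, key, compute):
        now = self.clock()
        hit = self.entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh, skip the expensive call
        value = compute(key)
        self.entries[key] = (now + self.ttl, value)
        return value


calls = []


def discover(url):
    # Placeholder for the expensive step (loading the page in PhantomJS).
    calls.append(url)
    return {"url": url, "playable": []}


cache = TTLCache(ttl_seconds=300)
first = cache.get_or_compute("http://example.com/blog", discover)
second = cache.get_or_compute("http://example.com/blog", discover)
```

The second lookup is served from the cache, so `discover` only runs once for that URL within the five minute window.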
For any PhantomJS script longer than just a few lines, make your life easier and use phantomjs-nodify or hack in your own require function override (PhantomJS’s built-in require can only access a subset of modules like fs, although hopefully this will change in the future). Without a good require function, longer scripts will get out of control really quickly. Modularize as much as possible, as early as possible.
Set a timeout for child_process.exec and another loading timeout in your page loader script to avoid thousands of processes hanging out.
Use cluster on the express app so you can swamp the whole box.
Handle errors in your page loader script. Scripts you are injecting will be broken in some cases because of unexpected failures or broken scripts on the pages you’re trying to load or unexpected native object prototype overloads. Make sure to at least log failures somewhere as most of these problems are easily fixed.
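The timeout and error-logging tips apply to any process that shells out, not just Node. As a rough illustration in Python (our service used child_process.exec in Node, so this is a translation of the idea, and `run_page_loader` is a name invented for this sketch), a wrapper might look like this:

```python
import logging
import subprocess
import sys

log = logging.getLogger("page_loader")


def run_page_loader(cmd, timeout_seconds):
    """Run a child process; return its stdout, or None on timeout or failure."""
    try:
        # subprocess.run kills the child if the timeout expires.
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        log.warning("timed out after %ss: %s", timeout_seconds, cmd)
        return None
    if result.returncode != 0:
        # Log the failure so broken pages and scripts can be tracked down.
        log.warning("exit code %s: %s", result.returncode, result.stderr.strip())
        return None
    return result.stdout


# The Python interpreter stands in for the real page loader command here.
output = run_page_loader([sys.executable, "-c", "print('ok')"], timeout_seconds=10)
```

In a real setup the command would be the PhantomJS invocation for your loader script; either way, a hung page ends up as a log line instead of a process that never exits.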
Security groups are one of the many great features of AWS. To give you a quick primer: security groups allow you to define inbound traffic rules based on IP address or security group, much like iptables, but without having to edit the rules on every machine when you change them. They also let you apply rules based on security group, so that your application instances can talk to mongod, but nothing outside of your application can.
The following assumes that you, the reader, already have some working knowledge of how security groups work and are using them. If not, take a look at “Building three-tier architectures with security groups” on the AWS blog.
Mongo’s built-in security is admittedly weak at best, but getting better. In fact, mongo’s default seems to be to have you not use its built-in security at all, instead putting the onus on you to create a “trusted environment”. The most common way to set up this trusted environment on AWS (other than not having it at all) is to modify the security group under which your instances are currently running and allow access to mongo only from specific IP addresses. This is also a great way to guarantee heartburn when you forget to allow a new instance or want to make your infrastructure more liquid. A more effective way would be to allow only traffic from specific security groups, that is, only from your instances. An important note: In order to use security group based rules, you must connect to the instance via the private IP address. Internally, EC2 will resolve the public DNS name ec2-100-00-000-000.compute-1.amazonaws.com to the correct private IP address of 10.100.00.00 for you.
We found the automatic routing to be a bit funky when actually trying to get this all set up and into production. Instead, we created another domain in Route 53, say supersecretex.am, and added A records pointing directly to the private IPs. This also made it easier to tell which things in the application we knew to be locked down, and which were intentionally more open. We also added A records pointing to the public DNS to make SSH, health checks and ops involve less typing. Needless to say, you should also have your mongo instances associated with an Elastic IP address. There’s nothing worse than the DNS changing out from under you and being woken up in the middle of the night because your app is now completely down.
Say we have five instances: Two web apps (oscar and elmo), a mongo primary (bert), a mongo secondary (ernie) and an arbiter (arbiter). The five instances are spread out between two security groups, and we’re using the “add IP addresses as we add new instances” approach.
Web security group (sg1):
Port - Source
22 - (SSH) 0.0.0.0/0
80 - (HTTP) 0.0.0.0/0
443 - (HTTPS) 0.0.0.0/0
Mongo security group (sg2):
Port - Source
22 - (SSH) 0.0.0.0/0
27017 - arbiter's ip address
27017 - bert's ip address
27017 - ernie's ip address
27017 - oscar's ip address
27017 - elmo's ip address
30000 - bert's ip address
30000 - ernie's ip address
Now this is all well and good, but it’s a headache to remember to add a new web instance’s IP to the Mongo security group every time you launch one, not to mention when you’re using autoscaling or CloudFormation. So, let’s simplify this by adding new rules to the mongo group to allow inbound traffic from the web and mongo security groups. This is easy to do in the AWS console, and it will autocomplete the correct security groups for you when you start typing sg into the source input box. Our mongo security group will now look like this:
Port - Source
22 - (SSH) 0.0.0.0/0
27017 - sg1
27017 - sg2
27017 - arbiter's ip address
27017 - bert's ip address
27017 - ernie's ip address
27017 - oscar's ip address
27017 - elmo's ip address
30000 - sg2
30000 - bert's ip address
30000 - ernie's ip address
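If you would rather script the group-based rules than click through the console, they can be expressed as parameters for the EC2 AuthorizeSecurityGroupIngress call. The sketch below only builds those parameters. It assumes you would pass them to boto3's `authorize_security_group_ingress`; the helper names and the group ids are placeholders, not values from our account.

```python
def group_rule(port, source_group_id):
    """One inbound TCP rule that allows traffic from another security group."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "UserIdGroupPairs": [{"GroupId": source_group_id}],
    }


def mongo_group_ingress(mongo_group_id, web_group_id):
    """Build kwargs for the three group-based rules on the mongo group."""
    return {
        "GroupId": mongo_group_id,
        "IpPermissions": [
            group_rule(27017, web_group_id),    # web apps -> mongod
            group_rule(27017, mongo_group_id),  # replica set members -> mongod
            group_rule(30000, mongo_group_id),  # replica set members -> arbiter
        ],
    }


# Placeholder ids standing in for sg1 (web) and sg2 (mongo).
ingress_kwargs = mongo_group_ingress("sg-mongo-placeholder", "sg-web-placeholder")
```

With an EC2 client, the call would be `ec2.authorize_security_group_ingress(**ingress_kwargs)`.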
Don’t go deleting those old rules just yet! We have a bit more to do to make sure things go smoothly.
After you add your new DNS records and before you restrict access to the security group, you’ll want to update your replica set configuration to point to the new DNS entries. Don’t worry. It’s not as scary as it sounds.
PRIMARY> rs.conf()
{
"_id" : "mydb",
"version" : 1,
"members" : [
{
"_id" : 0,
"host" : "mongo-bert.company.com:27017",
"self": true
},
{
"_id" : 1,
"host" : "mongo-arbiter.company.com:30000",
"arbiterOnly" : true
},
{
"_id" : 2,
"host" : "mongo-ernie.company.com:27017"
}
]
}
The first thing we’ll want to do is update the host for each member of the replica set. So jump into the mongo shell on your primary and run the following:
PRIMARY> conf = rs.conf()
PRIMARY> conf.members[0].host = "mongo-bert.int-company.com:27017"
PRIMARY> conf.members[1].host = "mongo-arbiter.int-company.com:30000"
PRIMARY> conf.members[2].host = "mongo-ernie.int-company.com:27017"
PRIMARY> rs.reconfig(conf)
Running this on the primary will automatically update the configuration on your secondary and arbiter, so you just have to run it once. More information on replica set reconfiguration can be found in the docs. You’ll probably also want to update the hostnames on your mongo instances to use the new DNS entries.
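If you have more than a handful of members, or want to review the change before applying it, the host rewrite can be done as a plain data transformation first. This is a sketch with a hypothetical helper name, `rewrite_hosts`; it only produces the new config document. It also bumps the version, because sending the raw replSetReconfig command from a driver (for example pymongo's `client.admin.command("replSetReconfig", new_conf)`) requires that, whereas the shell's rs.reconfig() helper does it for you.

```python
import copy


def rewrite_hosts(conf, old_domain, new_domain):
    """Return a copy of a replica set config with member hosts moved to new_domain."""
    new_conf = copy.deepcopy(conf)
    for member in new_conf["members"]:
        name, sep, port = member["host"].partition(":")
        if name.endswith("." + old_domain):
            name = name[: -len(old_domain)] + new_domain
        member["host"] = name + sep + port
    # The raw replSetReconfig command expects a higher version number.
    new_conf["version"] = conf["version"] + 1
    return new_conf


current = {
    "_id": "mydb",
    "version": 1,
    "members": [
        {"_id": 0, "host": "mongo-bert.company.com:27017"},
        {"_id": 1, "host": "mongo-arbiter.company.com:30000", "arbiterOnly": True},
        {"_id": 2, "host": "mongo-ernie.company.com:27017"},
    ],
}
updated = rewrite_hosts(current, "company.com", "int-company.com")
```

The original document is left untouched, so the two can be diffed before anything is sent to the primary.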
If you’re using MMS, which you absolutely should be, you’ll need to make some changes so it doesn’t freak out and start sending you alerts. You can make changes in /etc/hosts on the instance running MMS (usually the arbiter) to point the public DNS records to the internal DNS records. In our case, we just deleted our old hosts in MMS that were using the public DNS records and added the new internal DNS records for that oh-so-fresh feeling, and didn’t really mind losing all of the old, bad data.
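We’re now ready to remove those old IP address specific rules. By now you’ve added the security group based rules, created the internal DNS records, reconfigured the replica set to use them and updated MMS.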
Go ahead and delete the ip address based rules in the Mongo security group and hit apply. The Mongo security group will now look like:
Port - Source
22 - (SSH) 0.0.0.0/0
27017 - sg1
27017 - sg2
30000 - sg2
Much nicer, no? Your mongo cluster is now running in a trusted environment and you can add and remove machines as needed.
A while back, John Resig shared how they handle configuration info at Khan Academy and I thought I would share our solution at exfm as well.
We have two reasons for not keeping configuration data in the code. First, we don’t want to accidentally expose the data. This happens all the time: accidentally making a GitHub Gist public when it should have been private, or committing your local config.json file to a public repo. The second is that we don’t want to deploy just to enable a feature for A/B testing.
The solution we use is centralizing the configuration details behind a service. There are plenty of ways to skin this cat, for example Zookeeper, Netflix’s Archaius or Rackspace Service Registry. These configuration services have, more or less, two layers of security: one on the network level (iptables, AWS security groups, etc) and the second being simple username and password authentication. The problem you run into next is where do you store the credentials to access the configuration service? You don’t want these credentials to be littered throughout your different projects. You’ll just incur an extra expense when it’s time to update them (which should be done regularly) and inevitably, each project will end up implementing a slightly different API to actually retrieve the configuration data.
The trick we use is to exploit the fact that npm and pip can install requirements from private repositories. We have a private repository for each language we use that just contains the credentials for the configuration service and a consistent API for getting config values. Each project then just adds the repo as a requirement to package.json or requirements.txt.
In lieu of Archaius or Zookeeper, we have a little Express app running on one of our utility instances. It handles basic HTTP auth from the incoming clients and grabs JSON from DynamoDB by environment. Even this uses a local config.json on the instance, so no one ever really sees the AWS credentials. As a bonus, DynamoDB comes with a ready-made GUI in the AWS Console, so we can reuse all of the built-in AWS security, and updating or adding values is really simple.
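To give a feel for how small that service is, here is a sketch of its core logic in Python. It is not our Express app: the dictionary stands in for the DynamoDB table, the credentials, config keys and function names are placeholders, and the HTTP layer is left out so only the auth check and the per-environment lookup remain.

```python
import base64
import hmac
import json

# Stand-in for the DynamoDB table: one JSON document per environment.
CONFIG_BY_ENV = {
    "production": {"NEW_PLAYER_ENABLED": False},
    "development": {"NEW_PLAYER_ENABLED": True},
}
USERNAME = "config-user"  # placeholder credentials
PASSWORD = "change-me"


def check_basic_auth(header):
    """Validate an HTTP Basic Authorization header value."""
    if not header or not header.startswith("Basic "):
        return False
    try:
        decoded = base64.b64decode(header[len("Basic "):]).decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return False
    user, sep, password = decoded.partition(":")
    # compare_digest avoids leaking information through timing differences.
    user_ok = hmac.compare_digest(user.encode(), USERNAME.encode())
    pass_ok = hmac.compare_digest(password.encode(), PASSWORD.encode())
    return bool(sep) and user_ok and pass_ok


def handle_request(env, authorization_header):
    """Return (status, body) for a GET /<env> request."""
    if not check_basic_auth(authorization_header):
        return 401, json.dumps({"error": "unauthorized"})
    config = CONFIG_BY_ENV.get(env)
    if config is None:
        return 404, json.dumps({"error": "unknown environment"})
    return 200, json.dumps(config)


good_header = "Basic " + base64.b64encode(b"config-user:change-me").decode()
status, body = handle_request("production", good_header)
```

Flipping a feature flag then means editing one value in the backing store rather than shipping a deploy.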
So, how does this all fit together in code? Here are a few examples.
// Add our config client to dependencies in package.json
"config-client": "git+ssh://git@github.com/<username>/config-client.git"
// server.js
var nconf = require('nconf'),
getConfig = require('config-client'); // configuration client
// Configure nconf
nconf.argv().env().use('memory');
// Bundles making REST calls to confighost/NODE_ENV
getConfig(nconf.get('NODE_ENV'), function(err, config){
if(err) return console.error('ERROR ', err);
nconf.overrides(config);
// Start listening for incoming
});
Or for our Python services:
# Add to requirements.txt
-e git+ssh://git@github.com/<username>/config-client.git#egg=exfmconfigclient
# Put this in a module, say exfmconfigclient/__init__.py
import requests

config_username = '<some username>'
config_password = '<some password>'
config_host = 'https://someserver.com'

__all__ = ['get_config']


def get_config(env):
    # HTTP basic auth is passed to requests as an auth tuple
    response = requests.get(config_host + '/' + env,
                            auth=(config_username, config_password),
                            headers={'Accept': 'application/json'})
    return response.json()
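# app.py, __init__.py, etc
import os

from flask import Flask
from exfmconfigclient import get_config


def get_environment():
    # Load from a file or environment variable like APP_ENVIRONMENT=production etc
    # The 'development' fallback is an assumed default; pick your own
    return os.environ.get('APP_ENVIRONMENT', 'development')


def create_app():
    app = Flask(__name__)
    # get_config returns a dict, so merge it in rather than using from_object
    app.config.update(get_config(get_environment()))
    return app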
We’ve been using this for a few months now and it’s been working out pretty well. Not having to think about configuration at all feels great.