First, it uses Hadoop to support its associates program, in which affiliates post links to Amazon-based products on their websites and get a percentage of related revenue. Originally, Rauser explained, Amazon developers wrote three separate applications in C++ to process and analyze data associated with these transactions to determine how much to pay each affiliate. But the system quickly began to run up against scaling issues, particularly every fourth quarter (typically the busiest quarter of the year for Amazon). Each October through December, Amazon engineers scrambled to add memory to one application in particular, hoping it would hold on for “just one more fourth quarter.” Finally, in 2010, Amazon turned to Hadoop on AWS to process the data. Now, data streams daily into S3, and each morning a Hadoop cluster processes the last 60 days' worth of affiliate data, which is then shipped back to S3 and analyzed further to provide accurate payout data.
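The daily job described above amounts to re-aggregating a rolling 60-day window each morning. A minimal sketch of that idea in plain Python follows; it is not Amazon's code. The function name, the record layout, and the flat commission rate (in basis points) are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta


def payouts_for_window(transactions, run_date, window_days=60, rate_bps=500):
    """Sum affiliate revenue over a rolling window and apply a flat rate.

    transactions: iterable of (affiliate_id, order_date, revenue_cents).
    rate_bps: hypothetical commission in basis points (500 = 5%).
    Returns a dict mapping affiliate_id to payout in cents.
    """
    window_start = run_date - timedelta(days=window_days)
    revenue = defaultdict(int)
    for affiliate_id, order_date, revenue_cents in transactions:
        # Keep only orders inside the window ending on the run date.
        if window_start < order_date <= run_date:
            revenue[affiliate_id] += revenue_cents
    # Integer arithmetic avoids floating-point rounding on money.
    return {aid: cents * rate_bps // 10000 for aid, cents in revenue.items()}
```

In a Hadoop setting the filter would typically run in the map phase and the per-affiliate sum in the reduce phase; the single loop here only shows the logic.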
Second, Amazon uses Hadoop to classify high-risk, high-value items that must be stored in highly secured areas of its fulfillment centers. In other words, expensive items like Kindles and plasma TVs must be locked up tight until they’re shipped so nobody steals them. Until recently, the process involved running extremely dense SQL scripts against a large data warehouse, which generated reports that were sent to each of the fulfillment centers. From there, humans decided which items to secure. The problem with this approach, according to Rauser, is that it requires “highly skilled people” to make “sophisticated” judgment calls. That’s not a scalable approach. Amazon wanted to take advantage of machine learning algorithms so that the system, rather than Amazon workers, could decide which items to classify and secure. Today, Amazon stores its entire product catalog on S3 and uses a 2-node Hadoop cluster to process the small percentage of the catalog that changes each week. Machine learning takes it from there.
Hadoop Helping Backpackers Find Couches
Airbnb is an online marketplace that allows anybody to advertise and rent out a spare room to travellers. If you’re living in San Francisco, for example, and have a free couch, you could make a few bucks if you let a backpacker from France crash in your living room for a couple of nights. Like Amazon, Airbnb uses Hadoop to store all of its user data (that’s 10 million HTTP requests per day) to support multiple use cases.
One is to identify its most valuable users, according to Topher Lin, a Data Engineer at Airbnb. Previously, the company determined who its most valuable customers were based on the amount of revenue each one generated personally for Airbnb. But today the company loads all its social media-related data (the company maintains 178 “social connections,” according to Lin) into a Hadoop cluster, then runs proprietary algorithms against it to identify the customers who have the most influence on their social networks when it comes to doing business with Airbnb. So the most valuable customers in a given market are not necessarily the ones who generate the most revenue personally, but the ones who prompt the greatest number of their social connections to also do business with Airbnb. The company can then target these influencers to maximize their effect and drive more business.
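Airbnb's algorithms are proprietary, so the following is only a toy illustration of the general idea: rank users by how many of their connections booked after they did, rather than by their own revenue. The function name, the input layout, and the "booked later" heuristic are assumptions made for this sketch, not anything Lin described.

```python
def rank_influencers(connections, first_booking):
    """Rank users by the number of connections who booked after them.

    connections: dict mapping user -> set of connected users.
    first_booking: dict mapping user -> day number of the user's first
        booking (users who never booked are absent).
    Returns a list of (user, score), highest score first.
    """
    scores = {}
    for user, friends in connections.items():
        if user not in first_booking:
            continue  # a user who never booked cannot have prompted anyone
        booked_on = first_booking[user]
        # Count connections whose first booking came later (a crude proxy
        # for influence; it does not establish cause and effect).
        scores[user] = sum(
            1 for friend in friends
            if friend in first_booking and first_booking[friend] > booked_on
        )
    # Sort by score descending, then by user name for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Under this scoring, a user who spends little but whose connections follow them onto the site outranks a heavy spender with no followers, which is the shift in thinking the paragraph describes.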
Another use involves calculating demand for a given area. Until recently, Airbnb relied on occupancy rates to determine demand for a city or region. So if only 25% of available crash pads were occupied in Chicago, for example, Airbnb would assume demand for its services in that city was low and it wasn’t worth the money or time trying to find more potential business. But that’s a faulty assumption, according to Lin. Instead, because Hadoop allows it to store and access all the data related to searches on its site, Airbnb can determine if occupancy rates are low not because demand is low but because there aren’t enough suitable matches. It could be that there is huge demand for single rooms and larger spaces in the Chicago area, but the majority of spaces available are living room couches. Airbnb can then work to meet the actual demand.
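The comparison Lin describes can be sketched as a ratio of searches to listings for each kind of space in each city. This is a simplified illustration, not Airbnb's method; the function name and the (city, space type) record layout are hypothetical.

```python
from collections import Counter


def demand_supply_gap(searches, listings):
    """Compute searches per available listing for each (city, space_type).

    searches: iterable of (city, space_type) tuples taken from search logs.
    listings: iterable of (city, space_type) tuples for available spaces.
    Returns a dict mapping (city, space_type) to searches per listing;
    float("inf") marks demand with no matching supply at all.
    """
    demand = Counter(searches)
    supply = Counter(listings)
    gap = {}
    for key, search_count in demand.items():
        available = supply.get(key, 0)
        gap[key] = search_count / available if available else float("inf")
    return gap
```

A high ratio for single rooms alongside a low ratio for couches would point to mismatched supply rather than weak demand, even when the city's overall occupancy rate looks poor.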
Its Big Data approach appears to be working. Airbnb booked 1 million guest nights in 2011. It’s already topped 5 million in the first four months of this year. That Hadoop cluster looks like it’s going to get some more work.