“Is the technology holding the product back or is the product holding itself back? The question you really need to ask is does it matter” – Avi Zurel
About Trip.com
- 4,000,000 active users
- 15,000 requests / second
- 120 active instances over Amazon Web Services
- Major services used: Amazon Web Services, Redis, Memcached, ElastiCache, MySQL
Avi Zurel (Twitter – @kensodev | Github – @kensodev) is Senior Engineer at Trip.com based in Silicon Valley. Avi came to Silicon Valley two years ago with his family after working for four years remotely from Tel Aviv.
He says he joined Trip.com because the technology was new and the product was exciting and his specialty is in scaling companies up. Trip.com invited him to come aboard.
Avi has been coding since age 14. Before joining Gogobot, he ran his own consultancy in Israel. A family man, he and his wife have three children, aged 6, 4, and 1. He works part time in the office and part time from home in order to be with them more. He is also a competitive cyclist who enjoys weekend races in the Bay Area.
When asked about challenges he faced as an engineer, Avi talked of infrastructure, code deployment, and interpersonal skills. He readily owned up to his failures, most of which centered on people skills. His mantra for combating the pressure and stress to perform in coding and engineering is to ask “Does it matter?” It all comes back to that for Avi: separating what matters from what doesn’t, focusing on what affects the bottom line and the customer or client, and choosing your battles.
This carries over to his personal life with family as well. When working from home, one of his key weapons against distractions is a command-line tool called “GET SHIT DONE” that blocks distracting websites and keeps him focused.
We cover many technical topics including:
- Why he moved from Tel Aviv, Israel to Silicon Valley 0:48
- Motivation to join Trip.com and why getting trusted travel reviews was so hard 3:00
- Fires and “melting servers” 4:00
- First experience coding with HTML 6:00
- Why Trip.com chose to work on Amazon Web Services (AWS) from day one 8:00
- Managing the challenge between serving “producing” users and “consuming” users at scale 10:30
- Deploying edge servers to meet global demand 12:00
- Cache deployment via memcached over dedicated servers and membase 13:00
- Making the decision to use Amazon Web Services ElastiCache 13:15
- How to deploy code without affecting the end user experience 15:00
- Handling TechCrunch traffic without affecting the user experience 15:10
- How to scale up without making the database a bottleneck 15:30
- Changing the database architecture
- Negotiating with other stakeholders to halt new feature development to stabilize code 19:30
- Finding out what matters vs just being a hard person to work with 23:30
- Career advice on code reviews between engineers and managers 25:00
- Learning from failure: engineering as a consultant vs employee 26:00
- Learning from failure: direct communication as an Israeli vs conventional communication in the Valley 27:00
- How to improve communication skills as an engineer both online and off 29:00
- How to motivate people to get things done without being too aggressive 30:00
- When launching features that are not 100% perfect makes sense for learning 31:00
- Enforcing zero compromises on code if it would have any negative effect on revenue or partner value
- Bringing the team together to learn from mistakes, open a discussion, and to learn how to ensure that it won’t happen again 35:00
- “If you point a finger at someone, three fingers are pointing back at you” 37:00
- How to manage a 6.8 TB MongoDB cluster 38:00
- When failure means that your mobile battery dies and you’re abroad 38:00
- When managing a black ops dev project makes sense 47:00
- How Trip.com uses 250 AWS Lambda functions in production 52:00
- Why using AWS products makes more sense 58:00
- Need to see 50%+ in cost savings in order to even consider another cloud 1:01
- Open source tools to check out, including Riak and JavaScript-based tools 1:03
“Nothing matters to me more than the person I’m with… the product and the technicalities, they take second place” – Avi Zurel
A big supporter of AWS, Avi is working with AWS Lambda and looking towards the future when he and his family will return home to Israel. You can find him on Twitter and GitHub as @kensodev.
—
Full Transcript
Cameron Peron: Alright, Avi, welcome to the podcast.
Avi Zurel: Thank you for having me.
Cameron Peron: Alright, let’s start at the beginning. Why did you come to Silicon Valley to begin with?
Avi Zurel: So we came here in January almost two years ago, my wife and our two kids – since then we had another one. (Cameron Peron: Congratulations.) Thank you. My wife came here pregnant, and we had our third baby here. We came here, really, for not just one reason: we came for the experience, to live somewhere else, for the weekend travels. So basically instead of going to places we know, now we go to Tahoe or Yosemite or all these different places.
Cameron Peron: You don’t have to fly abroad, right?
Avi Zurel: We don’t have to fly abroad. And the thing is that California is so rich in wildlife and hiking and biking and stuff like that, it’s just ridiculous. Just by starting the car and driving two hours in each direction you can see so many things. So for us it’s really exciting and the kids can go to Disneyland and we don’t have to fly 18 hours to go to Disneyland.
Cameron Peron: When did you join Gogobot?
Avi Zurel: I joined in March I think six years ago.
Cameron Peron: So you relocated from Tel Aviv to Silicon Valley with Gogobot?
Avi Zurel: No, so I was working for Trip.com remotely the first four years, fully remote. I came here two to three times a year for a month at a time, just to meet the team, more for the social aspect than the professional aspect. We almost pulled the trigger on a relocation, I think it was four years ago, four to five years ago when my son was born, but it didn’t work out, visa issues and stuff like that, so we had to stay in Israel. So we pulled the trigger now because we said, if we don’t do it now, the kids will be too old to do it – they’ll have a tighter circle of friends and a school schedule and stuff like that, so we just decided to pull the trigger.
Cameron Peron: Got it. So what was the motivation to join Gogobot?
Avi Zurel: So the motivation to join Trip.com – you know – it’s hard to kind of go back because it seems like so many years ago, and I think it’s also unusual in Silicon Valley to see people stay in a company for four, five, or six years. But the reason I joined Trip.com was that the tech was exciting and the product was exciting and back then, travel was a real pain. Figuring out where to go and what to do and getting reviews that you could trust was a huge pain.
Cameron Peron: The business sense sounds very clear, right, but what about the engineering part of it? Did you see a severe challenge there that you really wanted to test yourself against, or did you have an experience before that you felt you learned a certain amount and you wanted to learn more?
Avi Zurel: I think one thing I was really known for as a consultant before I joined Gogobot, I had a consultancy in Israel. And the one thing I was really known for, that’s why Trip.com called and said, hey do you want to join in, was my ability to scale a company up in terms of technology.
And you know, back then, early stage startup – everything is a fire, everything is coming down, servers are melting and users are coming and TechCrunch is writing about you and everything seems to be on fire all the time and…
Cameron Peron: In those days it was a bit easier because it was just TechCrunch, but now there are a bunch of other sources that can create fires for you – good fires.
Avi Zurel: Yes, good fires, but it’s still an engineering issue. So early stage startup – everyone does what they need to do so basically everything you deliver, you deliver a set of problems. To the scaling issues, so you create a table or you create an index or something like that. So the scale becomes a real issue and I think that’s the reason Trip.com approached me, and that’s the reason I joined, because it’s super interesting to work on problems at that scale, and you know, coming from Israel, at least back then – five years ago – we did not have that scale in Israel. So to get a million users or 30k users a day or 50k users a minute, stuff like that when TechCrunch writes about you, we didn’t have that, so moving to Trip.com made a lot of sense to me.
Cameron Peron: Let’s go back, so what was your first experience coding like for you?
Avi Zurel: First time for Gogobot?
Cameron Peron: No, first time for yourself.
Avi Zurel: The first time I coded for myself, I remember that day really clearly. I think I was 13 or 14 years old and there’s a magazine in Israel called Maariv for Youth. They had this little pamphlet on learning HTML, like really simple HTML, and you can go to the browser and create stuff. I remember just picking it up and going to my computer and thinking hey this is interesting, why don’t I try it. And I remember it was so basic, you had to write all caps and everything was really sensitive but it was the first time that I really created something with my computer instead of consuming.
Cameron Peron: What year was this?
Avi Zurel: I was 14 so… 1996, 1995 maybe? So the start of the internet, basically. I created it, and it was super simple and probably had my name and my age and whatever, but I created it and double clicked and saw it in the browser and since then I’ve been doing the same thing.
Cameron Peron: OK, before Gogobot, you were a consultant. What were some of the major events that led up to where you are today?
Avi Zurel: So I think over my career, if you look back, I was very lucky to move technologies whenever technologies were turning obsolete. So before I was a Ruby engineer, and before I did web and scale and stuff like that, I did Flash. Flash and Flex, remember, the technology that Apple destroyed. So I did a lot of consulting for big companies like HP and SAP in Israel – back then they had a lot of infrastructure with Flex and Flash on testing software. So I was really lucky in my career and I wish I could take credit for that but it was really like a bunch of coincidences that I moved technologies. So I think Trip.com came right at the time when I really wanted to get versed in the cloud. So I think the first time I really did, quote unquote, the cloud was when I started working for Gogobot.
Cameron Peron: And that was in 2010 I think, right?
Avi Zurel: 2010, yes, I think so, 2010-2011.
Cameron Peron: And when they were looking at like a cloud solution, referring to what kind of infrastructure?
Avi Zurel: So we were, Trip.com was on AWS from day one.
Cameron Peron: Incredible, ok.
Avi Zurel: We were on Amazon Web Services (AWS) since day one; the first code I deployed was on AWS. We do have a lot of infrastructure on other clouds now, but we started with AWS and we’ve been there since then. At some point, I think it was 2011 or 2012, we scaled up really big and had a big check going to AWS every month, so we were thinking about going towards dedicated infrastructure and getting the servers, buying the servers, and managing the servers ourselves, but… it was quickly removed from conversations because back then we needed to scale up and scale down so many times a day, it was crazy, and so we just decided to ditch it. Looking back, I think it was a smart decision. We would definitely have saved money if we did that, but I think we would have created a lot of problems in the business.
Cameron Peron: Interesting – so many questions about that – today Trip.com is around 4 million users, right?
Avi Zurel: Yes.
Cameron Peron: 15k requests per second, and how many Amazon instances do you currently run today?
Avi Zurel: I’d say somewhere about 120 something total that includes worker instances, databases, cache, redis, everything.
Cameron Peron: What are some of the craziest back end engineering challenges you face so far today?
Avi Zurel: It really changed over time. Before we had all of the infrastructure that we have now, we had a lot of trouble balancing between the consuming users and the producing users. Every company has that issue, you know if you look at Facebook or twitter or anything else, you have producers and you have consumers, right, and you essentially need to grab content from the producers and deliver it to the consumers and managing scale is essentially saying I am going to manage the scale for producers this way and for consumers that way. And I think finding the balance between the two was the big challenge. And today, for example, our consumers, they don’t even hit our servers. It’s rare people hit a server, like a real server.
Cameron Peron: What did you set up exactly for that?
Avi Zurel: So what happens is we have a really broad content delivery network, all over the world, and what happens is you go and you type Gogobot.com/… for example, and you go to the nearest edge. So if you’re in like Dubai, you go to the nearest edge, and if that edge does not have the content, you go to the next edge. Going back all the way to our servers – and then even if it hits our servers, it will first hit a cache layer. So we have a cache which is a CDN and then another cache on the way to our servers. Only if those are cold then you hit our servers.
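The lookup order described here (nearest edge, next edge, origin-side cache, and only then the application servers) can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not Trip.com’s implementation: the `CacheTier` class, the `fetch` function, and the tier names are hypothetical, and each cache is modeled as a plain in-memory dict.

```python
from typing import Callable, Dict, List, Optional


class CacheTier:
    """One cache layer (a CDN edge or an origin-side cache), modeled as a dict."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.store: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self.store.get(key)

    def put(self, key: str, value: str) -> None:
        self.store[key] = value


def fetch(path: str, tiers: List[CacheTier], origin: Callable[[str], str]) -> str:
    """Walk the tiers nearest-first; call the origin only if every tier is cold."""
    missed: List[CacheTier] = []
    for tier in tiers:
        value = tier.get(path)
        if value is not None:
            break  # cache hit: stop walking toward the origin
        missed.append(tier)
    else:
        # Every tier missed (or there were none), so ask the real servers.
        value = origin(path)
    # Warm every tier that missed so the next request stops earlier.
    for tier in missed:
        tier.put(path, value)
    return value
```

With three tiers and a cold start, the first request reaches the origin once and warms all three; a repeat request for the same path is answered by the nearest tier without touching the servers.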
Cameron Peron: What do you use for cache?
Avi Zurel: We use a company called InStart for the CDN cache. They are incredible. We can only say good things about them. They do a lot of optimization and customization for us and we have full control over how the CDN is architected. Then for the cache on our side, we use the famous Memcached, so we have a huge Memcached cluster on the way to our servers.
Cameron Peron: Do you manage the cache yourselves or use an Amazon service?
Avi Zurel: We use an Amazon service. We did a lot of experiments with managing the cache ourselves, so we had Memcached on dedicated servers and then we had Membase, which is essentially backed up to disk so you can reheat the cache from the disk. Before we had all the Chef and DevOps setup, it was a pain to manage, so we decided to go with Amazon and have been with them since then. Now that we have the Chef infrastructure set up we could probably move it, but it’s working so flawlessly, why touch it?
Cameron Peron: Was there some crazy event or real problem you had to face and solve that changed what your stack looks like today? For instance, like too many people, consumers hitting the service at any given time or some kind of downtime?
Avi Zurel: So it’s really hard to remember one thing after so many years but we had a lot of problems with like, every time we deployed new code, we’d create a downtime on the servers. So as we’re updating code, some of the consumers see error pages and stuff like that so solving that was one big issue we had to solve.
Cameron Peron: So you’re constantly updating the pages?
Avi Zurel: Yes, we need to update the version, deploy new code, and we deploy new code many many times a day, sometimes 10, sometimes 15 times a day, right? So we need to deploy code without the current version being affected by it, so if you go and you hit the current version, you will get it with no problem, and if the new version has a problem, it will roll back to the old version.
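The transcript does not say how this was implemented. One common way to get the behavior described (the current version keeps serving, and a bad new version never takes traffic) is to start the new release alongside the old one and switch only after a health check passes. The sketch below assumes that pattern; `Router`, `deploy`, and the `healthy` callback are hypothetical names used only for illustration.

```python
from typing import Callable


class Router:
    """Sends traffic to whichever release is marked live (hypothetical model)."""

    def __init__(self, live: str) -> None:
        self.live = live


def deploy(router: Router, candidate: str, healthy: Callable[[str], bool]) -> bool:
    """Check the candidate beside the live release and switch only if it passes.

    Returns True if traffic was switched, False if the old release was kept.
    """
    # The live release keeps serving while the candidate is checked,
    # so users never land on a half-updated system.
    if healthy(candidate):
        router.live = candidate
        return True
    # Unhealthy candidate: the router is left untouched, which is the rollback.
    return False
```

Because the router is only ever changed after a successful check, a failed deploy needs no cleanup on the serving path; the old release was never taken out of service.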
Cameron Peron: So this has a huge impact on customer experience?
Avi Zurel: Yes, a huge impact, and you know, what happened back in the day, back in 2011 or 2012, is that TechCrunch is writing about you and somewhere you find a bug. Before we had all this cache and stuff, many, many users hit your pages, and then you do an update and you know that some of them will hit error pages and never come back again. So a technical issue becomes a product issue, becomes a PR issue, becomes all these different things.
It works up and down the chain. That was one challenge.
Avi Zurel: The other challenge was how do you scale up without making the database a bottleneck?
I think a lot of people are now realizing that whatever you do, Docker or this or that, you still have, almost in every application, one point of failure and one point of scale issues and that’s the database. I think we had three or four months where all I did was rearchitect the database, rearchitect tables, change index sizes and stuff like that, so I think that was another issue.
Cameron Peron: Let’s dig deeper into that. What kind of database were you using?
Avi Zurel: We’re using, right now, every one of them. But our main one has been MySQL since day one.
Cameron Peron: What did you specifically do to alleviate this bottleneck?
Avi Zurel: What happened was, there were a lot of things that we didn’t even realize, that were just recreating tables and we didn’t even realize we were using too many resources. So one of the things we had to rearchitect, for example, is how many columns we are selecting from a given table. What we found is that if you have strings that are too long and you’re selecting all the fields from one row, then the server will go to the disk to grab your record and that slows down the process a lot. When you slow down the reads a lot, the writes will slow down too. That’s one thing we had to do. We also had to change the architecture around indexes to use as few indexes as possible in the database and rearchitect how we store the data, stuff like that. A lot of little stuff.
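The point about selecting only the columns you need can be illustrated with a small, self-contained sketch. It uses Python’s built-in `sqlite3` module as a stand-in so it runs anywhere; the production database in the interview is MySQL, and the `places` table and its columns are hypothetical. The sketch only shows how much less data a narrow `SELECT` pulls back; it does not reproduce MySQL’s on-disk behavior.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE places (id INTEGER PRIMARY KEY, name TEXT, description TEXT)"
)
conn.execute(
    "INSERT INTO places (name, description) VALUES (?, ?)",
    ("Yosemite", "x" * 100_000),  # a very long text column, like a review body
)

# Pulls every column, including the large text nobody asked for.
wide = conn.execute("SELECT * FROM places WHERE id = 1").fetchone()

# Pulls only what a listing page needs.
narrow = conn.execute("SELECT id, name FROM places WHERE id = 1").fetchone()

# Rough size of each result, counted in characters.
wide_chars = sum(len(str(col)) for col in wide)
narrow_chars = sum(len(str(col)) for col in narrow)
print(f"SELECT * returned {wide_chars} characters, narrow SELECT returned {narrow_chars}")
```

The same habit applies in an ORM: ask for the specific fields a page renders rather than loading whole rows by default.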
Cameron Peron: Scalability was a major challenge, managing updates was a major challenge, anything else you want to discuss?
Avi Zurel: I think in terms of challenges, one of the hardest challenges we had to face, and I know I had to face, was how do you maintain quality when things are hard. How do you make sure that even though you’re basically burning your keyboard because you need to work so fast, because something is happening, how do you still maintain the quality? How do you still maintain the architecture of the code? How do you normalize how engineers code and how engineers deliver code? I think that’s a culture thing and that’s one of the hardest things we had to do.
When I came to Gogobot, we had zero tests, so everything was zero tests, people were working on master, people were running over other people’s code. When you needed to deploy, someone would say, hey you can’t deploy now, I’m still working on this. And then you had to wait five minutes and sometimes that five minutes looked like the end of the world. So those things are one of the hardest to solve because there are people issues sometimes and you’ve got to work through them one by one.
Cameron Peron: Can you give a specific example of how you solved that?
Avi Zurel: Yes, so, what we did was first of all, and it was hard, we halted all development of new features for two weeks.
That was hard, getting the CPO and CEO to agree while you’re kind of in the news and in the crunch – how do you halt development then – but we halted development on new features for two weeks and all we did was stabilize the system. So we’d find a hotspot and release a fix for it. If we found critical code paths that were failing constantly or that we thought were sensitive, then they were tested and retested and stuff like that.
Avi Zurel: We developed a way of working with branches, so master is no longer pushable – no one pushes to master. And that sounds – today that’s completely how people work. People hear me and they say hey, of course this is how you work, right, but five years ago that wasn’t the case.
Cameron Peron: Sounds like four or five years is an eternity away.
Avi Zurel: Yes, today the way we worked then is unthinkable. But I don’t know if it is unthinkable because I’ve matured here and if you go to another company like an early stage startup or a seed series if they still work that way. I’m pretty sure they do. They might have a more experienced CEO, they might have more experienced engineers, but still the way you need to drive the business fast in the seed round and the a round will dictate whether you want to grow or not.
Cameron Peron: Yeah, there’s so much pressure to get things done, right?
Avi Zurel: Yes there’s so much pressure that it’s hard to withstand as an engineer. I think it’s getting easier because a lot of people realize that you can’t work this way, move fast and break things all the time, but still there’s so much pressure. Contracts coming, sales are coming, all this adds up.
Cameron Peron: If this situation were to happen again, and it does happen to a lot of people, right, but let’s say you were to put yourself into a position either in this company or somewhere else, where you’re in between a seed round and an A round or an A round and a B round, it’s very similar and you see the same problem, but you have hugely constrained resources and you have a lot of pressure from the board and CEO and CTO. What would be the conditions you would look for, to say we have to stop and we need to stabilize first before we attempt anything else?
Avi Zurel: I have a very clear rule about that and I ask only one question: Does it matter?
Let me give you an example, we as engineers can obsess a lot about the technology but technology is not the bottleneck. Let’s say you have a product right now and it’s in production and it’s not in the code technology. Is the technology holding the product back or is the product holding itself back? Is it a sales issue, is it a technology issue? If you need to deliver something, say you have a contract coming – let’s take it to the travel business – so you have an airline and it’s requesting a product and you need a demo. Does it matter if the demo will not be as fully tested as you want it to be or if the code will not be as nice? No, it doesn’t, because in two weeks if the deal goes through, then you’ll rewrite it, you’ll have the time, you’ll have the resources. So the question you really need to ask is does it matter? Does it matter that I will put myself in the position of quote unquote fighting with management about something? It could be testing, could be delivery day, whatever, does it even matter, or am I just being hard here? Am I being hard because this is code or am I being hard because this matters?
Cameron Peron: This is awesome. So now that you’ve transitioned from the scene in Israel here, do you see that this level of thinking is encouraged, or it’s welcomed, or there’s reciprocity there, or is it something that you really have to challenge to push management to consider?
Avi Zurel: I think first of all, if I’m speaking about my company and myself specifically, I have 100% credit from management and they understand. If I’m saying hey this is going to take two weeks, there’s no way we can do it in less than two weeks, and what we deliver here matters, they understand and the reason they understand and give me that leeway is because I gave them the leeway somewhere else. So if you’re hard on something all the time, no one will believe you because you say hey it matters all the time. If you say it doesn’t matter all the time, it’s the same thing. But if they see a balance in you as a professional and you as an engineer, they’ll say you know what, this doesn’t matter. Let’s deliver it whatever we need to do to get the deal done. And in some places, other places, I say this is a critical path. We cannot wing it here. You want it in three days, it will not happen in three days. It will happen in 16 days. But when we do it, it will be good to be delivered and you’ll get the credit.
Cameron Peron: So it’s really about choosing your battles and understanding what matters and making a decision on what you really should fight for and what you should give leeway on.
Avi Zurel: Right, and it’s not only between a senior engineer like me and management. Sometimes it’s between a senior engineer and a junior engineer. If you hammer the junior engineer all the time, he will think, whatever I do, he will hammer me, this code sucks. Sometimes you have to swallow a frog and say you know, this code sucks but I’m going to let it slide. Because he will learn from it and it’s not a critical path for the company, not a critical path for the product, I’m not going to fight that battle. I’m going to fight it where it matters.
Cameron Peron: I really want to dig into that more. Before we get there; let’s talk about failure. Where you failed either as a leader or as an engineer and what you learned from it.
Avi Zurel: I failed in both (laughs). Many times I’ve failed in both. As a manager, not even as a manager, as a team member, the first couple of years I failed people skills miserably. So what happened was I came in as a consultant, and as a consultant, you’re expected to say shit, right? It’s like the position of the consultant. You come here, everybody knows everything is shit, and you say, this is shit, right.
But when you’re a member of the team and you’re a part of the team, you cannot really say this is shit, or this is a shitty job.
I also think the way Israelis communicate and the way we are to communicate here in the Valley is very different and that was a huge failure for me and I alienated team members and people were like, hey, he’s difficult to work with, he’s too direct, I can’t really deal with it. A lot of it was also that they didn’t really know me and all of our communications were in chat so sometimes, you know tone is not delivered in chat.
Cameron Peron: So true, also email, sometimes the worst way to have a discussion is over email, because you can interpret it the wrong way and goes nowhere.
Avi Zurel: Exactly, and we have emojis and stuff but it’s not the same (laughs). So that was my huge failure as a team member and as a manager was people skills and communication skills and also delegating.
Cameron Peron: Did you realize this when you were already here or when you were working remotely?
Avi Zurel: I realized it quickly, well not quickly because it took two years, but
Cameron Peron: I can definitely appreciate that when you were in Israel, especially, it’s very hard, you can be aware of it but it’s like on the low end of the priority, right? But when you were here and you were dealing with this, face to face, was it still a problem for you, that you realized you had to solve and you have to figure out, like this is a point I have to improve this in some way?
Avi Zurel: Sure, I feel like I’m still dealing with this in some ways, I’m not perfect in any way, and I think – the thing is this – my attitude was “I’m not going out of my way to hurt you, but I’m also not going out of my way to make sure you’re not hurt.” So that was a mistake because I do need to go out of my way to make sure you’re not hurt. Because if I don’t mean to hurt you, there’s absolutely no reason for you to get hurt.
Cameron Peron: So how do you justify this kind of dissonance? Because there is definitely this approach which I think is very important to bring which is if it’s not perfect, it’s not good, right? That’s something that definitely comes out of the Israeli startup mentality, even though, there’s a lot of, I don’t know, people will launch products that were half-baked and they suck, but there is a lot of intense motivation to get things done, right?
Avi Zurel: Yes, right.
Cameron Peron: Whereas here in the Bay Area there are many different cultures and many different ways of thinking, and it’s very difficult to bring that sense of getting things done together with people that have a different way of doing it. So how do you get things done with team members from different places with different ways of thinking without being too – the bad Israeli – demanding and saying that everything sucks in order to motivate people? How do you motivate people to get things done?
Avi Zurel: These days nothing matters to me more than the person I’m with. I mean the product and the technicalities, they take second place. What I saw from adopting that approach is that first your feedback is different because what you do is say if I’m giving you feedback, I’m giving you feedback to make you better. You can launch it, but I’m giving you feedback to make it better, not to put you down.
Cameron Peron: So you’re okay in launching things that you’re not 100% satisfied with, just for the sake of bringing the team member in order to learn that what they just launched was a mistake?
Avi Zurel: Yes, 100% of the time. As long as it does not affect the bottom line of the company, I’m 100% okay with that.
Cameron Peron: Can you give a real example of that, without saying names? Because almost anything you launch can have an effect on the bottom line…?
Avi Zurel: Not really, let me give you an example. One of the things we were struggling with a lot was balancing between tests and deliverables. So what I was constantly saying was, hey, if you go ahead and test something, you need to test the happy path and you need to test the sad path. What happens if my program gets input that I did not expect? What happens if something fails? And sometimes I would see a pull request without that and I would say hey, we’re going to launch it, that’s fine, but this can be a problem –
Cameron Peron: You would launch that into a staged environment or a live environment?
Avi Zurel: — launching it live. And I’m telling you this can be a problem and monitor this and see what happens.
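To make the happy path versus sad path distinction concrete, here is a minimal sketch. The `parse_rating` function and its 1 to 5 range are hypothetical and chosen only to have something to test; the point is that the second test feeds the function input it did not expect and checks that it fails cleanly.

```python
def parse_rating(raw: str) -> int:
    """Parse a 1-5 star rating from user input (hypothetical example function)."""
    try:
        value = int(raw.strip())
    except (ValueError, AttributeError):
        # Non-numeric text, empty strings, or a non-string such as None.
        raise ValueError(f"rating must be a whole number, got {raw!r}")
    if not 1 <= value <= 5:
        raise ValueError(f"rating must be between 1 and 5, got {value}")
    return value


def test_happy_path() -> None:
    # Well-formed input, with the surrounding whitespace a real form might send.
    assert parse_rating(" 4 ") == 4


def test_sad_paths() -> None:
    # Input the program did not expect: text, empty, out of range, wrong type.
    for bad in ["four", "", "0", "6", None]:
        try:
            parse_rating(bad)
        except ValueError:
            continue
        raise AssertionError(f"{bad!r} should have been rejected")
```

A pull request that ships only `test_happy_path` would pass review on the surface; the sad-path test is what catches the unexpected input before production does.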
Cameron Peron: You have a policy in your team that you meet a certain level of requirements and you’ll launch it anyway even if it doesn’t meet those?
Avi Zurel: So the thing with requirements and code review and stuff like that is it’s a really gray area. What happens is this: for mission-critical code, meaning code that deals with people’s money and code that deals with partners’ money, there’s no bullshit. No way this code is going live if it’s not scalable; if it even has one path with a small failure rate, this path does not go live. But what you have is like a big product, and in many parts of the system this product is being used by 3% of the users or 50% of the users or 20% of the users. So you can afford making mistakes where they matter less. What happens is you need to balance the education of the engineer in places where it matters less and educate him that look, it matters, and when he comes to deal with more mission-critical code, he will know because you let him fail in that first part. He saw that what you’re saying is actually true.
Cameron Peron: How do you perform that postmortem exactly?
Avi Zurel: First of all, if it’s a big failure, we have an official postmortem and we send an email and say what happened and what measures we are taking for it not to happen again.
Cameron Peron: This email is questions or directives?
Avi Zurel: The first email is just a report – this happened during this time, this was the failure, this code path, this is what we’ve done and this is how it will never happen again. And then what happens is it provokes a discussion. Not sure if provokes is right, triggers a discussion and then people ask questions and you give them answers.
Cameron Peron: Do you hold that in an organized fashion – everyone gets the report and everyone is kind of freaking out a little bit and it’s uncomfortable because it’s failure – how do you hold that feedback together from everybody so that everyone learns and then implement some kind of new way so that we’re sure we’re not going to repeat this again?
Avi Zurel: One of the things I as an Israeli have, other than my directness to your quote unquote fuck ups, I have a directness to my fuckups.
Cameron Peron: It goes both ways.
Avi Zurel: One of the things people learn about me quickly is that if I make a mistake I will never ever run away from it. First of all I will never hide it. I will never say it’s someone else’s fault. One of my army commanders told me – if you point a finger at someone, three fingers are pointing back at you. So I first want to grab on to my own mistakes. When people see that, and see that from someone who is, you know, “senior,” then they don’t run from their mistakes as well. Because they know, look we all make mistakes and if I go and I count my mistakes and we put them all in a paper, you say hey it’s a miracle this guy is here, right. But you are more than the sum of your mistakes and I think we’ve adopted that as a team so people realize that and when they fail they learn from it. It’s a process – it’s not something you learn in a day; it’s not something you learn in a week.
Cameron Peron: When you do this postmortem, do you hold it in an organized fashion or is it kind of like ad-hoc? Is it email, over slack?
Avi Zurel: It’s on slack, it’s on email, and if it is worth discussion, we do a meeting with the engineering team about it as well. To tell you the truth (knock on wood) we did not have a lot of these where we needed like postmortems and stuff like that but let me give you an example of one.
Avi Zurel: Up until a year or a year and a half ago we had a huge cluster of MongoDB. It was a pain, almost too much of a pain to manage.
Cameron Peron: Just curious, what was the scale, how many nodes?
Avi Zurel: We had 18 shards, 2 replicas of each shard, 6.8 terabytes of data.
Cameron Peron: Crazy.
Avi Zurel: Yeah, and that was queried in real time and we had about 60k inserts a minute to that cluster. So yes, a big cluster that was a pain to manage. So what happened was we had a single point of failure on that MongoDB. If one of the connections to the MongoDB failed or one of the replicas failed, or something like that, it would bring the whole site down. So if MongoDB failed, even though it’s not mission critical, the whole site would go down. So it was night here, day in Israel, so I was on call for it. The site went down and I lost battery in my phone so I didn’t get the alerts and I wasn’t available. And I was in a place with no internet. So I said hey this is what happened, this is why it happened, this is how we fixed it so we’re not dependent on the connection to MongoDB….
Cameron Peron: How did you fix it?
Avi Zurel: Without getting more into the technical details
Cameron Peron: By all means, please get into them.
Avi Zurel: What happened was we were holding a connection from our main application to MongoDB, even though what needs MongoDB is not the main application, it’s one other application. But that code was written into the main application. So one of the things we did was separate our concerns. If the application does not care about this connection, there’s no point in holding that connection. So we refactored the application completely to say MongoDB is only touched by this application and this is a realtime application coming from Facebook so it’s a completely separate application. So we went into the micro service realm and said, let’s refactor this into its own thing and if it fails it’s fine, but it will never fail the main site.
So what happened was we degraded gracefully. One of the things we read from MongoDB was – I don’t know if you remember, but three or so years ago, every one of the titles of Trip.com had who of your friends had been there (Facebook, Twitter, etc.). So imagine this social graph was stored in that MongoDB so we said hey this is not critical. So if MongoDB is down, this part of the system will be down, just for logged-in users, but the whole site will not be down. So that’s what we did – since then we don’t have MongoDB anymore, not because of this issue but because of other issues.
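The graceful degradation Avi describes can be sketched roughly as follows. This is a minimal illustration, not Trip.com's actual code: the `SocialStoreUnavailable` exception, the `fetch` callable, and the page fields are all hypothetical names chosen for the example. The idea is only that a failure in the non-critical social-graph lookup produces an empty result instead of an error for the whole page.

```python
import logging
from typing import Callable, Dict, List

logger = logging.getLogger("social_graph")


class SocialStoreUnavailable(Exception):
    """Hypothetical error raised when the social-graph store cannot be reached."""


def friends_who_visited(
    fetch: Callable[[str, str], List[str]],
    user_id: str,
    place_id: str,
) -> List[str]:
    """Return friends who visited a place, or an empty list if the store is down."""
    try:
        return fetch(user_id, place_id)
    except SocialStoreUnavailable:
        # Non-critical feature: log it and carry on without the data.
        logger.warning("social graph unavailable; rendering page without it")
        return []


def render_place_page(
    fetch: Callable[[str, str], List[str]],
    user_id: str,
    place_id: str,
) -> Dict[str, object]:
    """Build the page data; the friends section is optional."""
    page: Dict[str, object] = {"place_id": place_id, "title": f"Place {place_id}"}
    friends = friends_who_visited(fetch, user_id, place_id)
    if friends:
        page["friends_who_visited"] = friends
    return page
```

With a `fetch` that raises `SocialStoreUnavailable`, `render_place_page` still returns the core page data and simply omits the friends section.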
Cameron Peron: Are there any other ways that you empower your team to make a tactical decision without getting in the way of the day to day business?
Avi Zurel: That’s…making a decision is a challenge and I think a lot of times you just either have to put your foot down or just release.
And again, we go back to the question of does it matter. Does it matter – will I just – let’s contribute to the discussion here, but it doesn’t really matter, the decision, but if it does really matter, you’ll have to discuss it with your team. There’s really no rule of thumb I can give that is more powerful than does it matter. And when you go by that rule, suddenly you realize a lot of the things, a lot of the stress that you were carrying didn’t really matter.
Cameron Peron: You can apply that far outside of engineering.
Cameron Peron: How important do you rank failure in software engineering?
Avi Zurel: I think without…what makes a senior engineer senior is not successes, it’s failures. It’s scars. It’s what you did to screw up, how bad your screw-up was, how badly it affected the bottom line, and how you dealt with it. So without my failures, I wouldn’t be who I am, I wouldn’t be where I am. Without my failures, I wouldn’t know what to review in other people’s code. And they will still fail and that’s fine because that’s where they are on their career path, and you need to let them go through it. You cannot be the net below the guy walking the rope. Sometimes you need to let him fall – let him fall and break a leg, don’t let him fall and die. I think failure is a huge part of software engineering, I think failure is a huge part of life. It sucks, it stinks, every single time, but it’s one of the most powerful things if you discuss it openly and you take responsibility and you know not to put yourself in that situation again.
Cameron Peron: Hopefully you don’t, right, but sometimes you can. I guess you’re just better equipped to realize you’re going back into the same footsteps.
Avi Zurel: Well, you’re not going back, the thing is that your failures are now mistakes, how did I not notice this thing?
Cameron Peron: In order to not repeat your mistakes, what do you do? Do you see a pattern and then suddenly the results hit your face again and you back out or do you have any kind of process that you do so that you really truly learn from those mistakes and you don’t repeat them again?
Avi Zurel: So I don’t let myself get stressed. If I get stressed with either a deadline or a whole lot of different things, then I make mistakes, so I just don’t let myself get stressed. Let me give you an example. We launched a new product to production, it’s in an experiment right now so I cannot discuss it, but this product has been like my business development inside Gogobot. So for a year I’ve been saying look, we need to rewrite this part. If we rewrite it with new technology, we can bring more conversion to the site than our current conversion. And everyone was like, no, technology cannot do that, and that goes back to what we said, is technology the bottleneck? And I said, look, here technology is the bottleneck, I am telling you for sure. I am telling you this without a question and everybody was like, no, we don’t have time, resources, etc. And then what happened was someone released it, said go do that, but you have a week to give us a proof of concept. I said I don’t have a week. I have as much time as I need, I’m going to work on it in secret, take me out of the loop of the other tasks, I’m going to use any resources I need.
Cameron Peron: Black ops project, I love it.
Avi Zurel: Don’t ask me any questions, let me work. What happens is you get a lot of stress because people are like hey we need you on this, or that, or this here.
Cameron Peron: So you removed yourself from the day to day process for a certain amount of time or for the full time?
Avi Zurel: The full time
Cameron Peron: How long?
Avi Zurel: Four weeks.
Cameron Peron: Wow those are some very serious business development skills…
Avi Zurel: I did handle something if it was like my responsibility but if it wasn’t then I saw it as an opportunity to give someone else that responsibility. So I used the resources, so I used one of the junior engineers we had here. And he said while I was working on it I felt the pressure from management saying hey when is this going to be done, and when is that going to be done. And I asked him how does this affect you? And he said I think I could’ve done better. And I told him look, this is a new thing, there is absolutely no reason to launch it with problems you are aware of. If it takes another day, no one will die here, this is not an EKG program, so don’t stress about it. So we do stress and then four weeks later.
Cameron Peron: What do you do, I appreciate the way you negotiate, but how do you reduce the stress?
Avi Zurel: Let me give you an example of how it worked out, right. So the product that is currently in production has been in production for two years and has done about 70 iterations of A/B testing to get the conversion rate to x. This new product has done zero. Zero optimization, there is no A/B testing on it, it just launched to production and it’s underperforming the product we have by 1%. So 1%, with standard deviation, that’s like nothing. It turned out great. If it hadn’t I would need to do a lot of explaining, but I’m thankful that it worked out. To answer your question how do you manage the stress – it’s a constant battle. It’s like you want to help your teammates, you want to give the good answers to your bosses, you want to tell them hey this is coming in two days, but you have to keep an honest observation of what you can and can’t do and what people expect of you versus what you know is good for the business. And it’s a battle. And to tell you the truth, if I’d only been here for a year, I’d never be able to do that, but because I’ve worked under the same bosses for five or six years and I know them, they know me, I have that leeway. I’m not sure if I was working here for a year or I went to work for like Facebook and I had a more direct boss, I’m not sure I’d have that leeway but here I had it. So I said hey we’re not going to stress it. We’re going to do it as best as we can, there’s going to be zero issues, no bugs, it’s going to work on all the browsers, just launch it silently to production.
Cameron Peron: You focused on building the project completely by yourself or with other resources?
Avi Zurel: Not by myself, I built it with two other team members. The reason is because it is a new technology, we built this product with riak and we did not have riak before, so it’s not healthy to release a product to production without any of your team knowing about it. Then it’s your problem forever and ever and ever and no one else knows how to deal with it and no one knows how to do it. And also what I wanted to do was get people excited. And the thing I saw immediately was that after two weeks of me working on it, suddenly it was visible, and people were coming to me and saying do you need something done on this, I’m excited, I want to work on it. And then we distributed small tasks and people got excited and it was a whole collaboration with the team and it was beautiful and fun to work on.
Cameron Peron: So tell me more about the open source and the AWS related services that you love to use.
Avi Zurel: We use AWS extensively, almost every product – every usable product – that they’ve got. One of the things I’m really excited about, but it’s abused in a lot of ways, is AWS Lambda, so serverless (I really hate that definition) but kind of the way that you just deploy a function and you don’t need to worry about it – that gave us a lot of freedom to do like simple stuff. That’s one thing I’m excited about. The other thing I’m really excited about –
Cameron Peron: Are you using Lambda right now at a certain level of scale?
Avi Zurel: Yes
Cameron Peron: How would you define that scale right now?
Avi Zurel: We are using Lambda for everything that is one off, so every function….so we have partners and those partners need callbacks or they need something from us, we do not do that anymore in our main app, we just deploy a Lambda function and we give them the endpoint and we forget about it. We don’t need to manage it, we don’t need to monitor it, nothing. It’s live and it never fails. So we have Lambda in quite a bit of scale today.
Cameron Peron: How many invocations or …?
Avi Zurel: It’s about 250
Cameron Peron: per second or ?
Avi Zurel: No, 250 Lambda functions.
Cameron Peron: Were you offsetting some level of computes to Lambda?
Avi Zurel: No, it’s like APIs that are one-off, that are not part of the main business project.
Cameron Peron: Are there any challenges that you are trying to sort out, like monitor wise or certain level of restrictions?
Avi Zurel: We haven’t hit those limitations, and I think you hit those limitations if you start using the product for things that it wasn’t meant to be used for.
Cameron Peron: Like what?
Avi Zurel: Like if you build an app around it. If you don’t just have your app and have this as a one callback function that’s completely isolated from the main app or if you use it as your api or stuff like that, a lot of stupid things you can do. If you use it to return data from the database, not only save data to the database, stuff like that. You basically need those functions to work as fast as you can and as low … as you can. If the function takes a second, it should be a lambda function. So that’s one thing I’m excited about and to see how it develops, although I do see it being abused and don’t like the way that’s going.
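The kind of isolated, one-off partner callback described above could look roughly like the sketch below. It assumes the function sits behind an HTTP endpoint that passes the request body as a string in `event["body"]` and expects a `statusCode`/`body` dictionary back (the API Gateway proxy convention); the `booking_id` field and the response payload are hypothetical and only illustrate the shape of such a function.

```python
import json


def handler(event, context):
    """Acknowledge a single partner callback (hypothetical payload shape)."""
    try:
        payload = json.loads(event.get("body") or "{}")
    except json.JSONDecodeError:
        return {"statusCode": 400, "body": json.dumps({"error": "invalid JSON"})}

    if not isinstance(payload, dict):
        return {"statusCode": 400, "body": json.dumps({"error": "expected an object"})}

    booking_id = payload.get("booking_id")  # assumed field name, for illustration
    if not booking_id:
        return {"statusCode": 400, "body": json.dumps({"error": "booking_id required"})}

    # A real function would save or forward the callback here; this sketch
    # deliberately does one small thing and returns quickly.
    return {"statusCode": 200, "body": json.dumps({"received": booking_id})}
```

Keeping the function this small matches the point made in the interview: it does one thing, shares nothing with the main app, and can fail without affecting it.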
Avi Zurel: One other thing is Google’s Kubernetes. I think it’s the start of something really fantastic, but I’m still excited to see who develops a framework for devops in the same way someone develops a framework for web applications. Because there are a lot of these products and they’re all moving around and no one knows – there’s like no one thing that is a best practice, everyone is kind of winging it, so I’m excited to see what happens there.
Cameron Peron: What is a problem that you see around that, that isn’t being solved so well, why framework?
Avi Zurel: So I think Kubernetes is a start, but before Kubernetes started there were all the things around Mesos and Mesosphere and stuff like that. So for example it gives you a way to launch a Docker container into the cloud, but that’s only one thing. You still need discovery, you still need monitoring, you still need recovery and it was a lot of different solutions, how do I do it, and it was this connection with this proxy and that proxy and nothing really worked together. And I think Kubernetes is one thing that really starts to kind of work out, but you still see how do I manage global environment variables, how do I store this, how do I store that, how am I going to work with my DNS provider, not internal DNS, stuff like that still hanging in the air.
Cameron Peron: Earlier we talked about riak. What kinds of NoSQL databases and open source tools do you like?
Avi Zurel: We used almost every kind of NoSQL database that you can imagine. I was really excited about RethinkDB too, which shut down yesterday. I think it’s a beautiful product, and one of the comments on Hacker News, and I don’t necessarily agree with it, said it’s sad to see MongoDB thriving and RethinkDB shutting down because it was such a good product and such good people working on it. But definitely RethinkDB was something to get excited about. I think NoSQL got a lot of attention 2-3 years ago when people started to talk about web scale and stuff like that and suddenly no one talks about it anymore, people are now on Postgres because you can store JSON. I really believe in choosing the tools for the job.
For example, here at Gogobot, one of the databases we use the most is DynamoDB by Amazon and we use it because it just works and you can scale it up and scale it down and it’s amazing.
Cameron Peron: Question about that – you have all these services on AWS today – do you see it as an inconvenience or what are the criteria you look for in using a similar service that’s outside of AWS? Assuming it’s a part of all this stuff and it connects into Amazon and all that. There are many similar services which Amazon offers and there are many – a whole vendor ecosystem that provides those kind of services. I’m just curious when you look at those, what are you looking for? The convenience of just having everything in Amazon? Are you interested in looking at other APIs that would help you to say, DynamoDB is fantastic for a few things but actually what I really want to use is like something else even though it’s outside the nature of what Amazon is offering?
Avi Zurel: The way we decided things so far is that we chose the tool for the job and then if Amazon offered to work with it, then we worked with it. We had Redis for years and years and years but Amazon only released Redis like two years ago or something, they only had memcached and only recently in terms of the way I work with Amazon did they release Redis so we moved. We try to get the job done with the product that can do it the best and if Amazon offers a product for it then we will essentially go with Amazon. Because one of the things – we don’t have a dedicated ops team here to go by so you try to “outsource” everything you can and Amazon does a fantastic job.
I think one of the things I see all the time in discussions is, hey what’s the alternative to this and what’s the alternative to that, and the question is, why are you looking for an alternative?
And again, go back to the does it matter.
Let’s say I’m working with Amazon MySQL and memcached, and you say if you move cloud infrastructure then you won’t be able to use those – but why would I move, what’s the trigger for me to move? Let’s say we’re paying $1 to Amazon every month and this infrastructure provider comes to us and says hey you know what, we’re a cheap partner, we’ll give it to you for 10 or 20 cents, then it’s time to find an alternative, but until then you don’t need to.
Cameron Peron: What is the value you’re looking for? The threshold to make it interesting.
Avi Zurel: I think it would have to push us towards 50% less than what we’re paying today, then we would consider it. If we see that we’re paying Amazon for the services and something else would make more financial sense. But it’s a company – sometimes as a person you’ll go to the supermarket and see something for 20 cents less and say I’ll take it – but as a company sometimes 20 cents less will cost you more in the long run.
There’s also something that you need to understand: moving that way carries a cost of its own. And you have to think about the butterfly effect. If you are moving the cloud infrastructure now, and you’re a series A startup, you want series B, someone is looking at making an offer for round B and you decide you’re going to move the infrastructure to save 20 cents and you’re down for three days – you have to think about that butterfly effect.
And you have to think about everything when you do that and rarely can you say I’m going to set up my infrastructure there, and I’m going to pull the switch and everything just works. So you have to see the value in moving.
Let’s say we’re looking into Google Cloud right now, we’re not, but for example, what are we to gain from it? As long as we have nothing to gain from it, there’s nothing for us there so there’s no reason to look at an alternative to Amazon.
Cameron Peron: We talked about Lambda, Mongo, riak, are there any other open source or AWS services that you really love, enjoy using them, think are fascinating.
Avi Zurel: Not off the top of my head. I’m really excited about a lot of open source that’s going on these days but nothing off the top of my mind.
Cameron Peron: What are some of the new emerging open source projects that you like or that you think have potential?
Avi Zurel: I think again all the javascript world, even though it can have a lot of fatigue because there’s a lot of open source and seems like you make a decision today and tomorrow it’s completely opposite. So I started this new project with riak and started five weeks ago and it seems that some things I’ve made are already old, but I think riak is something to be excited about.
Cameron Peron: In your opinion, how do you think trending approaches like microservices or your favorite service architecture/Lambda and containers will impact the development life cycle?
Avi Zurel: It will impact the development life cycle – I think a lot of people are doing micro services now, they are like, hey this is cool, I can separate this function, I can separate this service. Again, the question is what is the gain? What are you gaining from making this service? I think if you’re looking at a company the size of a typical series A or B startup, 15-20 engineers, is there really a reason for micro services here? If you look at a company like Twitter, Uber, Facebook, stuff like that, then you have teams working on different problems and when you have teams working on different problems, you have to allow them to select the technology they will work with. And when you need to let teams select the technology they work with, you have to keep offering services. But I’m seeing a lot of people abusing it and saying hey micro service, hey micro service. The thing we adopted is that if it’s not main business logic, it doesn’t need micro service. So if we have an editorial application that the editorial team needs, we will build it with whatever and launch it into the main app and launch it as a micro service. So if it’s not really related to the pricing or to the main site, then it’s a micro service.