Backend development in simple terms. Part 2
The more our clients grow, the harder it becomes to keep the Onde system stable. Our backend team is working hard on this to give our ride-hailing platform the power of giants. Meet the people behind the latest system enhancements and learn about the solutions they’ve implemented.
New version of Cassandra = more opportunities, more challenges
Artem: Cassandra is our database. We used to work with an older major version, 2. Now we’ve moved to version 3, which was needed to provide the scalability the Onde system requires.
Igor: Migrating to the new version of Cassandra smoothly was impossible. The update was long and painful, because the new driver works completely differently from the previous one. During this migration, we had about 20 hours of downtime. We were working day and night to tune the driver. It really was a top priority: without it we couldn’t have upgraded Cassandra, we would have hit a serious bottleneck, and it would have cost us far more time and money in the future. Now that it’s done, we couldn’t be happier. Thanks to this painful process, we’ve enormously improved our understanding of the internal implementation of the Cassandra database and its pitfalls. We realized what we were doing wrong and began to improve some of our solutions. We wrote a framework for flexible and convenient work with Cassandra and built support for flexible data denormalization into it. This approach allows cloud solutions to scale to very large sizes, because when data is normalized, scaling is much more complicated and fails more often. We work with NoSQL now. The database is designed around the way you want to read the data, and there’s a lot of room to scale it. Magic!
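To make "designed around the way you want to read the data" concrete, here is a minimal in-memory sketch of denormalization. It is not our framework and does not use the Cassandra driver; the names Ride and DenormalizedRideStore are hypothetical and exist only for this illustration. Each dictionary plays the role of one table, keyed by the field its query filters on.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Ride:
    ride_id: str
    driver_id: str
    passenger_id: str
    fare: float


class DenormalizedRideStore:
    """Hypothetical store: one 'table' per read pattern, keyed by the query field."""

    def __init__(self) -> None:
        self.rides_by_driver: Dict[str, List[Ride]] = defaultdict(list)
        self.rides_by_passenger: Dict[str, List[Ride]] = defaultdict(list)

    def save(self, ride: Ride) -> None:
        # The same ride is written twice on purpose: duplication on write
        # buys a single-key lookup (no join) on read.
        self.rides_by_driver[ride.driver_id].append(ride)
        self.rides_by_passenger[ride.passenger_id].append(ride)

    def for_driver(self, driver_id: str) -> List[Ride]:
        # .get() avoids creating empty entries for unknown keys.
        return list(self.rides_by_driver.get(driver_id, []))

    def for_passenger(self, passenger_id: str) -> List[Ride]:
        return list(self.rides_by_passenger.get(passenger_id, []))


store = DenormalizedRideStore()
store.save(Ride("r1", "d1", "p1", 12.5))
store.save(Ride("r2", "d1", "p2", 8.0))

driver_rides = [r.ride_id for r in store.for_driver("d1")]
passenger_rides = [r.ride_id for r in store.for_passenger("p2")]
```

The trade-off is visible in save(): every new read pattern adds one more write, but each read touches exactly one key, which is what lets the data spread across many machines.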
New hardware, more power
Artem: We improved the hardware as well. For instance, we deployed a new Cassandra cluster on hardware more than twice as powerful as before. The Hetzner data center is very reliable, yet not immune to force majeure. For instance, sometime last winter there was a power outage there, and our server was down for two hours. We were not able to do anything about it. Very frustrating. Now we don’t put all our eggs in one basket. Thanks to the new hardware, we can switch from Hetzner to G-Core by pressing just one button. This makes the Onde system much more resilient than before.
Igor: We’ve also moved from HDDs to the latest generation of SSDs. Writing and reading data became 25-100 times faster. The data storage is bulletproof now.
ZooKeeper: making exclusive access possible
Igor: ZooKeeper is a technology that lets us lock drivers. This is necessary for operations that involve drivers. It makes sure that operations on the same entity don’t take place simultaneously and that everything works efficiently. All our servers have “agreed” that when they access any entity (a driver or a wallet, for example), they get exclusive access to it. They release it only after the operation is completed. ZooKeeper allows us to do that.
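The "get exclusive access, do the operation, release" agreement can be sketched in a few lines. This is an in-process stand-in built on Python threads, not ZooKeeper and not our production code: the EntityLocks class, the top_up function, and the "wallet:42" key are all hypothetical. In the real system the lock lives in ZooKeeper so that separate servers, not just threads, respect it.

```python
import threading
from collections import defaultdict
from contextlib import contextmanager


class EntityLocks:
    """In-process stand-in for a distributed lock service (NOT ZooKeeper)."""

    def __init__(self):
        self._guard = threading.Lock()            # protects the lock table
        self._locks = defaultdict(threading.Lock)  # one lock per entity id

    @contextmanager
    def exclusive(self, entity_id):
        with self._guard:
            lock = self._locks[entity_id]
        lock.acquire()        # wait until nobody else holds this entity
        try:
            yield
        finally:
            lock.release()    # released only after the operation is over


locks = EntityLocks()
wallet = {"balance": 0}


def top_up(amount, times):
    for _ in range(times):
        with locks.exclusive("wallet:42"):
            # Read-modify-write is safe because access is exclusive.
            current = wallet["balance"]
            wallet["balance"] = current + amount


threads = [threading.Thread(target=top_up, args=(1, 1000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Four workers each add 1 a thousand times; because every update runs under the entity's lock, no update is lost and the balance ends at 4000.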
Artem: ZooKeeper is a fault-tolerant cloud solution. It runs on three servers, and if one of them goes down, the others automatically carry on. ZooKeeper is an Apache solution, a proven and very reliable technology. Fortunately, we learned it really quickly. Before, the system was similar but very slow: all the children peed and got dressed in turn, so to speak. Now the processing of exclusive access to entities is much faster. However, we are still looking for an ideal solution. ZooKeeper has its limitations, for example a clear performance limit that might get in our way. We would need to install something like 100 servers, and this cannot be done with ZooKeeper. Looking for a better option is our priority now.
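Why three servers survive the loss of one comes down to majority voting. Assuming the usual rule that a ZooKeeper ensemble stays available as long as more than half of its servers are up, the arithmetic can be sketched like this (the function names are illustrative):

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can go down while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


three_server_quorum = quorum_size(3)
three_server_tolerance = tolerated_failures(3)
five_server_tolerance = tolerated_failures(5)
```

With three servers the majority is two, so one failure is tolerated; five servers tolerate two. Because a majority has to agree on every change, adding servers buys fault tolerance rather than write speed, which fits the performance limit mentioned above.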
Pulsar: stunning collaboration with the developer community
Apache Pulsar is a distributed data delivery system we’ve implemented to make our platform quicker. Pulsar allows us to deliver data between thousands of users in real time. As a result of the integration, the Onde system works 10-15 times faster.
Artem: Our applications and servers communicate with each other by sending messages. These messages must stay in the system because each of them has to reach the user’s smartphone and tell them, for example, that the driver has arrived. All these messages used to be stored in Cassandra. But the more our customers grew, the more messages we needed to store. Storing them in Cassandra became inefficient: the system was getting bulky and gradually slowed down. In March, we realized that we needed a new solution. It was time to get rid of this bottleneck. We were looking for an ideal match. The options were Kafka, Pulsar, and RabbitMQ. Apache Pulsar was originally created at Yahoo and is now part of the Apache Software Foundation. It’s open source and developed by a community of dedicated developers. It fits us ideologically and architecturally. Kafka couldn’t give us enough room to grow. Pulsar was a good match; however, it took us four months to bring it home.
Igor: It’s a cutting-edge technology, but it’s still developing. Because of this, bugs and memory leaks can occur. In fact, such a service is comparable to an independent messenger like WhatsApp, where there are millions of users and messages have to be delivered from any user to any other user. It’s quite complicated. We collaborate very closely with the Pulsar developer community. They like us because the Onde platform is an exciting use case for Pulsar and because we’re good at finding their bugs. So we tell them where the bugs are, and they make fixes especially for us. Oh yeah, we do have an excellent experience with remote developers who do cool things for us free of charge. 😂
Thanks to this collaboration, at 3 a.m. UTC on Friday, 5 October, we realized that our server can and will keep working without being rebooted. All this work is there to let our clients sleep well that early in the morning.
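The requirement that a message "must stay in the system" until it reaches the phone is the classic store-until-acknowledged pattern. The sketch below is a toy in-memory version, not the Pulsar client API; the Mailbox class and the "passenger-7" id are hypothetical names used only for illustration.

```python
from collections import defaultdict, deque


class Mailbox:
    """Toy per-user queue that keeps a message until it is acknowledged."""

    def __init__(self):
        self._pending = defaultdict(deque)

    def publish(self, user_id, text):
        self._pending[user_id].append(text)

    def receive(self, user_id):
        """Return the oldest undelivered message without removing it."""
        queue = self._pending.get(user_id)
        return queue[0] if queue else None

    def acknowledge(self, user_id):
        """Drop the oldest message once the phone confirms it arrived."""
        queue = self._pending.get(user_id)
        if queue:
            queue.popleft()


mailbox = Mailbox()
mailbox.publish("passenger-7", "Your driver has arrived")

first_try = mailbox.receive("passenger-7")    # phone offline, no ack sent
second_try = mailbox.receive("passenger-7")   # the same message is offered again
mailbox.acknowledge("passenger-7")            # phone confirms delivery
after_ack = mailbox.receive("passenger-7")    # nothing left to deliver
```

The message survives the failed first attempt and disappears only after the acknowledgment. Keeping every such pending message for a growing number of users is the storage load described above.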
Artem: Manual labor always results in a whole load of errors. Therefore, we automate. A simple example: sometimes you spend an hour automating a task that takes a minute to do by hand. Over two years, this automation saves you a thousand minutes.
Igor: If a procedure has already been done twice, it should be automated. We call it continuous integration and continuous deployment.
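The arithmetic behind that example is easy to check. Saving a thousand minutes on a one-minute task means the task runs about a thousand times in two years; that run count is inferred from the figures above, not measured, and the function names are illustrative.

```python
import math


def net_minutes_saved(automation_cost_min, manual_min_per_run, runs):
    """Manual time avoided minus the one-off cost of automating."""
    return manual_min_per_run * runs - automation_cost_min


def break_even_runs(automation_cost_min, manual_min_per_run):
    """Number of runs after which the automation has paid for itself."""
    return math.ceil(automation_cost_min / manual_min_per_run)


# One hour (60 minutes) to automate a 1-minute task, run 1000 times.
saved = net_minutes_saved(60, 1, 1000)
pays_off_after = break_even_runs(60, 1)
```

Under these assumptions the hour is paid back after 60 runs, and the net gain over two years is 940 minutes, before counting the errors that manual work would have introduced.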
Artem: Doing everything associated with the launch of a new version of the server by just pressing one button is beautiful and frees you from errors. So we’re proud of having automated the processes of launching new server versions, emergency server reboot, and many operations on Cassandra.
And a lot more is still going on
Igor: The work is never over, though. The main priority now is to invent and implement an approach for launching server updates without downtime. When we find one (and we hope this will be soon!), everything will go flawlessly.
Artem: We are currently busy further optimizing our work with Cassandra. We look for slow requests and make them faster, and we optimize the requests that can slow down a whole database cluster. We’ve increased the number of server instances working at the same time. From the very beginning, the Onde system has been developed this way: so that it can be scaled easily, beautifully, and with high quality.
Igor: I quit smoking. Two years ago, I promised to stop the exact moment our server would be completely stable. I thought it would happen in September 2017. Well, it’s happening now.
Artem: We always keep big goals and a shiny bright future in mind. And we believe the future can only be cooler! 😎
P.S: Don’t panic! 🛸