Square’s App Visibility Principles


Background

Over the past five years, Square has grown significantly. Our technology has evolved from a monolithic application into a microservices architecture. The growth and churn of our services have brought new challenges for application visibility. In this blog post, we lay out some guiding principles and show the technology we use to monitor and visualize our service ecosystem. Starting today, we are also open sourcing various parts of our service monitoring and visualization technology!

Principles

Some guiding principles are as follows:

  • Focus on usability as early as possible. With a microservice architecture, it is very easy to collect a large number of signals; a good user interface must be able to extract information from them.

  • Identify and present the most important aspects of a metric. Humans can only handle a few things at a time effectively, so any question about a metric should be answerable in a small number of steps. For example:

    • Top-N API metrics, broken down by underlying factor or by week-over-week change.

    • Automatic problem detection. The inspect tool can surface obvious system problems.

  • Applications should have good metrics by default, from the start. We build good monitoring into our standard application, and the default dashboard includes the following metrics:

    • Database

    • Hosts and containers

    • Performance metrics for HTTP/REST endpoints

    • JVM data for the virtual machines that run service components

  • Alerts should be concise and relevant. We monitor a large number of metrics, and when we alert, the goal is to keep alerts useful and to avoid alerts that have no corresponding action.

    • Alerts should be urgent and acted on immediately.

    • Alerts should be treated as exceptional events when they occur.

    • All alerts should require human judgment to handle.

    • All alerts should be reproducible.
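As an illustration of the Top-N idea in the principles above, here is a minimal sketch that ranks endpoints by week-over-week change in a metric. The function name, endpoint names, and numbers are all made up for the example; this is not how Square's dashboards are implemented.

```python
def top_n_by_weekly_change(last_week, this_week, n=3):
    """Return the n endpoints whose metric grew most week over week.

    Both arguments map endpoint name -> metric value (for example,
    p99 latency in ms). Endpoints that are missing from last week, or
    that have a zero baseline, are skipped because their relative
    change is undefined.
    """
    changes = []
    for endpoint, current in this_week.items():
        baseline = last_week.get(endpoint)
        if not baseline:
            continue
        changes.append((endpoint, (current - baseline) / baseline))
    # Largest relative increase first.
    changes.sort(key=lambda item: item[1], reverse=True)
    return changes[:n]


# Hypothetical data, for illustration only.
last_week = {"/pay": 120.0, "/refund": 80.0, "/login": 40.0, "/status": 5.0}
this_week = {"/pay": 180.0, "/refund": 84.0, "/login": 30.0, "/status": 5.0}

top = top_n_by_weekly_change(last_week, this_week, n=2)
for endpoint, change in top:
    print(f"{endpoint}: {change:+.0%}")
```

Showing only the few endpoints that changed the most keeps the view within what a person can take in at a glance, which is the point of the principle.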

Applications

Following the principles above, these are the applications we use at Square:

  • Appdash. Use this app to quickly get information about your application, including:

    • Operational information, such as which hosts it is running on and what has been released

    • The application's dependency graph

    • Events and exceptions from your application

    • Capacity modeling

  • MetricsDashboard. Use this application to view metrics for all platforms and applications. Below is an example of a database dashboard in the MetricsDashboard UI.

  • Presidio. A log search application built on Elasticsearch. It gives application developers an interface for finding patterns that may be causing errors and for tracing an event across multiple services.

  • Equilibrium. Our next-generation alerting system, which is rapidly replacing our Nagios infrastructure. Equilibrium is easy to use, more reliable, and better balanced. Its design draws on our experience with Nagios here and at other companies, and it follows current open source trends.

Today we are open sourcing a seemingly small but very important part of this system: inspect. Inspect is a collection of libraries that we use to collect Linux, MySQL, and PostgreSQL metrics. The project also provides Linux command-line tools that can perform basic problem detection.
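To give a feel for what basic problem detection on Linux metrics can look like, here is a small sketch that parses `/proc/meminfo`-style text and flags low available memory. It is not inspect's code or API; the function names and the 10% threshold are assumptions chosen for the example.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field -> integer.

    Most fields are reported in kB; the unit suffix is ignored here.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def detect_memory_problem(meminfo, min_available_fraction=0.10):
    """Return a warning string if available memory is low, else None.

    The 10% default threshold is an arbitrary value for illustration.
    """
    total = meminfo.get("MemTotal")
    available = meminfo.get("MemAvailable")
    if not total or available is None:
        return None  # not enough data to decide
    fraction = available / total
    if fraction < min_available_fraction:
        return f"low memory: only {fraction:.0%} of RAM available"
    return None


# Sample input with made-up numbers, in the format Linux uses.
SAMPLE = """MemTotal:       16000000 kB
MemFree:          400000 kB
MemAvailable:     800000 kB
"""

problem = detect_memory_problem(parse_meminfo(SAMPLE))
print(problem)
```

On a Linux host, the same check could be run against live data by passing the contents of `/proc/meminfo` to `parse_meminfo` instead of the sample string.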

Conclusion

We hope that you find inspect useful, and that this blog post gave you a good understanding of the monitoring and alerting systems we use at Square. We will go into more detail about each system in subsequent blog posts. As always, please check back at https://corner.squareup.com/*** for updates.
