
I like the concept of simple monitoring. Simple means it is simple to install, simple to maintain and simple to use. For me, this is netdata. Netdata could be much more, but I just install it on whatever machine and never think about it again. And when something is strange on that machine, I go to http://localhost:19999 and look around.


Not sure how I haven't run across it before, but this is the first time I've tried using Netdata. It looks very good for metrics, at least in the 10 minutes I have spent installing it on my local desktop and poking around the UI.

I'm not seeing anything in it for logs, though. I'm guessing it doesn't aggregate or do anything with logs? What do you use for log aggregation and analysis?

I'm very interested because I've been getting frustrated with the ELK Stack, and the Prometheus/Grafana/Loki stack has never worked for me. I'm really close to trying to reinvent the wheel...


If you want an easy to install, maintain, and use system for logs, then take a look at VictoriaLogs [1], which I'm working on. It is just a single, relatively small binary (around 10MB) without external dependencies. It supports both structured and unstructured logs. It provides an intuitive query language - LogsQL [2]. And it integrates well with good old command-line tools (such as grep, head, jq, wc, and sort) via unix pipes [3].

[1] https://docs.victoriametrics.com/VictoriaLogs/

[2] https://docs.victoriametrics.com/VictoriaLogs/LogsQL.html

[3] https://docs.victoriametrics.com/VictoriaLogs/querying/#comm...
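For a flavour of what that unix-pipe integration looks like, here is a sketch assuming a VictoriaLogs instance on its default port 9428 (the query and field name are illustrative; see [3] for the actual querying docs):

```
# stream matching log entries and post-process them with ordinary unix tools
curl -s http://localhost:9428/select/logsql/query -d 'query=error' | head -n 10

# count hits per log stream using jq over the JSON-lines response
curl -s http://localhost:9428/select/logsql/query -d 'query=error' \
  | jq -r '._stream' | sort | uniq -c
```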


Prometheus has become ubiquitous for a reason. Exporting metrics on a basic http endpoint for scraping is as simple as you can get.
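To illustrate how little the pull model demands of an application, here is a minimal sketch using only the Python standard library (in real code you'd use the official prometheus_client library; the metric name here is made up for illustration):

```python
# Expose plaintext metrics at /metrics for Prometheus to scrape.
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUEST_COUNT = 0  # incremented by the application being monitored

def render_metrics() -> str:
    # Prometheus text exposition format: HELP/TYPE comments plus samples.
    return (
        "# HELP app_requests_total Total requests served.\n"
        "# TYPE app_requests_total counter\n"
        f"app_requests_total {REQUEST_COUNT}\n"
    )

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

# To serve: HTTPServer(("", 8000), MetricsHandler).serve_forever()
```

Prometheus then does the rest: it polls that endpoint on its scrape interval and stores the samples itself.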

Service discovery adds some complexity, but if you’re operating with any amount of scale that involves dynamically scaling machines then it’s also the simplest model available so far.

What about it doesn’t work for you?

Edit: I didn’t touch on logging because the post is about metrics. Personally I’ve enjoyed using Loki better than ELK/EFK, but it does have tradeoffs. I’d still be interested to hear why it doesn’t work, so I can keep that in mind when recommending solutions in the future.


Last time I tried Prometheus was years ago, so I don't know how much might have changed... I gave it a good month or two of effort trying to get the stack to do what I needed and never really succeeded.

Just my opinion, but I honestly don't think the scraping model makes much sense. It requires you to expose extra ports and paths on your servers that the push model doesn't, and I'm not a fan of the extra effort required to keep those ports and paths secure.

Beyond that, PromQL is an extra learning curve that I didn't like. I still ran into disk space issues even when I used a proper data backend (TimescaleDB). And configuring all the scrapers, and making sure all the collectors were deployed with the configuration they needed, was overly complicated.

In comparison, deploying Filebeat and Metricbeat is super simple: just configure the YAML file via something like Ansible and you're done. Elastic Agent is annoying in that you can't do that when using Fleet, or at least I have yet to figure out how to automate it. But it's still way easier than the Prometheus stack.
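For a sense of how short that YAML is, here is a sketch of a metricbeat.yml (the Elasticsearch host is a placeholder, and module options vary by version):

```yaml
metricbeat.modules:
  - module: system
    metricsets: [cpu, memory, filesystem]
    period: 10s

output.elasticsearch:
  hosts: ["https://elastic.example.com:9200"]
```

Templating a file like that with Ansible and dropping it on each host is the whole deployment.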

I've tried to get Loki to work two or three times and never really succeeded. I think I was able to browse a few log lines during one attempt; I don't think I even got that far in the other attempts... The impression I came away with was that it was designed to be run by people with lots of experience with it. Either that, or it just wasn't ready to be used by anyone not actively developing it.

So, yeah, while I figure a lot of people do well with the Prometheus/Grafana/Loki stack, it just isn't for me.


The most basic setup, and the one typically used until you need something more advanced, is using Prometheus for scraping and as the TSDB backend. If you ever decide to revisit prometheus, you’ll likely have better luck starting with this approach, rather than implementing your own scraping or involving TimescaleDB at all (at least until you have a working monitoring stack).
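That most basic setup is a single binary plus one short config file. A minimal prometheus.yml might look roughly like this (the target host and port are placeholders):

```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['localhost:9100']   # e.g. a node_exporter instance
```

Run `prometheus --config.file=prometheus.yml` and the scraped samples go into Prometheus' own local TSDB by default, with no external database involved.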

There used to be a connector called Promscale that was for sending metrics data from Prometheus to Timescale (using Prometheus’ remote_write) but it was deprecated earlier this year.


Also important to add: using Prometheus as the TSDB is good for short-term use (on the order of days to months). For longer retention you can offload the data elsewhere, such as another Prometheus-compatible backend or something SQL-based.


Hey - I work on ML at Netdata (disclaimer).

We have a big PR open and under review at the moment that brings in a lot more logs capabilities: https://github.com/netdata/netdata/pull/13291

We also have some specific logs collectors. I think the link below might be the best place to look around at the moment; it should take you to the logs part of the integrations section in our demo space (no login needed; sorry for the long, horrible URL - we're adding this section to our docs soon, but at the moment it only lives in the app):

https://app.netdata.cloud/spaces/netdata-demo/rooms/all-node...


Nice to see that the log analysis is being worked on.

I'll see if I can figure out the integrations you pointed out. They look more like they are aimed at monitoring the metrics of the tools, not using the tools to aggregate logs. Right?

The way most ops systems treat logs and metrics as completely separate areas has always struck me as odd. The two are related, and having them in the same system should be the default. That's why I've put as much effort into the ELK Stack as I have: they've seemed to be the only ones who have really grasped that idea. (Though it's been a year or two since I've really surveyed the space...)

One question, not log related: is it required to sign up for a cloud account to get multiple nodes displaying on the same screen? From the docs on streaming, I think you can configure nodes to send data to a parent node without a cloud account. But either I haven't configured it properly yet or something else is in the way, since the node I'm trying to set up as a parent isn't showing anything from the child node.


FYI, you need to add the API-key config section to the stream.conf file on the parent node in order to enable the API key and allow child nodes to send data to it. I thought it went into the netdata.conf file... I also kind of wonder why it matters which file has which config, since the different config sections all have headings like `[stream]` or `[web]` anyway.
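For anyone else hitting this, here is a minimal sketch of that streaming setup (the API key is just a placeholder UUID; generate your own, e.g. with uuidgen, and the parent hostname is made up):

```
# stream.conf on the parent: accept children presenting this API key
[11111111-2222-3333-4444-555555555555]
    enabled = yes

# stream.conf on each child: where to send metrics
[stream]
    enabled = yes
    destination = parent.example.com:19999
    api key = 11111111-2222-3333-4444-555555555555
```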

So, the answer to my question is that you can get multiple nodes showing up without a cloud account. Just have to configure it correctly.


I have used https://github.com/openobserve/openobserve in several hobby projects and liked it. It's an all-in-one solution. It's likely less featureful than many others, but a single binary with everything in one place pulled me in, and it has worked for me so far.

Not affiliated, I just like the tool.


I'm not sure if the version in use at $workplace is out of date or incorrectly configured, but it is a dreadful Prometheus client in that it doesn't use labels; it just shovels all the metadata into the metric name like a 1935-style Graphite install, making most of the typical Prometheus goodness impossible to use.

The little dashboard thing is nice, though.


In my experience, there are no silver bullets. Let metrics software do metrics and log software do logs.

At the very least at the database level. Maybe we'll get a visualisation engine that merges both nicely, but at the database level the two types of data couldn't be more different.


Back in 2017, when I had a bunch of physical machines and unmanaged VMs, we ended up putting Netdata on the servers. The reason was that most of the team was used to manually logging onto servers and diagnosing issues by hand.

The reason I liked it was that it exposes a standard Prometheus endpoint I can scrape and then view using something like Grafana. There are only about 20,000 Grafana dashboard modules available for Netdata, but generally you can find one that works for you. Having that Prometheus endpoint lets you springboard into the cloud and get comparable metrics out of your cloud stuff as well, with a nice long historical data trail from your older machines.
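The scrape job for that endpoint is short. Assuming the default Netdata port and a placeholder hostname, a prometheus.yml fragment might look like:

```yaml
scrape_configs:
  - job_name: netdata
    metrics_path: /api/v1/allmetrics
    params:
      format: [prometheus]
    static_configs:
      - targets: ['node1.example.com:19999']
```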


You don't need to scrape anything if you use Netdata Cloud; see https://blog.netdata.cloud/introducing-netdata-source-plugin...


We are in the process of getting the plugin signed at the moment: https://github.com/netdata/netdata-grafana-datasource-plugin


I've been struggling with Grafana, and Netdata looks so much better.

Is this a tool where you can boot up the docker app and then connect a bunch of servers into a centralized dashboard? Or, is it better to think of netdata as a dashboard for a single server that permits monitoring of a bunch of processes only on that machine?

I'm not sure I understand whether agents can be configured to talk to a dashboard, or if you don't need to do that configuration because they expect to talk to localhost. I have a bunch of VMs running on a bunch of different random hardware and want a way to monitor those VMs (and perhaps the hosts as well).


If you connect your servers to Netdata Cloud, you can manage all of them there (put them into groups, etc.). As far as I know there is no self-hosted solution for this.

https://learn.netdata.cloud/docs/configuring/connect-agent-t...


Hey - I work on ML at Netdata.

We have recently created enterprise self-hosted options for bigger customers who can't use the cloud, etc. (probably not as relevant here).

For self-hosting at a smaller scale, you can have your own parent with multiple children streaming to it.

This is an example demo node which is also a parent for some other demo nodes. None of these need to be claimed by or signed in to Netdata Cloud:

https://sanfrancisco.my-netdata.io/

It uses the same dashboard as the cloud, so we only have one dashboard to maintain. You basically get the cloud dashboard locally, and the parent can then act like its own little Netdata Cloud.

A handful of features aren't available this way, since they depend on metadata stored in the cloud rather than on a parent node, but we're trying to bridge that gap where possible so that the metadata could actually live on a parent.


Drat. I'm only interested in things I can self host. Back to the drawing board. Thanks for the clarification!


You can self-host and centralize configuration with Netdata parents [1]. It's extremely lightweight and efficient for metrics collection, and the UI is very good as well. I recommend giving it a more in-depth look.

[1] https://community.netdata.cloud/t/advice-on-self-hosted-self...


Apparently this is possible. I didn't know. Didn't mean to mislead you. Sorry.


Maybe what I want is nachos?

https://www.nagios.org/


Or Zabbix. I’m assuming Nachos is a funny typo.

https://www.zabbix.com/


Zabbix was cool till 2015; now it's better to use https://gitlab.com/mikler/glaber/ or https://signoz.io/.


mmm nachos


They have a concept called "Parents":

> A “Parent” is a Netdata Agent, like the ones we install on all our systems, but is configured as a central node that receives, stores and processes metrics data from other Netdata “Child” nodes in our infrastructure...

https://learn.netdata.cloud/docs/streaming/


Hey - I work on ML at Netdata.

Just to mention, there is this doc too that tries to explain the various deployment strategies,

e.g. standalone: https://learn.netdata.cloud/docs/architecture/deployment-str...


Actually, sorry, in this case it's more like parent-child:

https://learn.netdata.cloud/docs/architecture/deployment-str...

and you just don't have to claim the nodes to Netdata Cloud if you don't want to.


Netdata deserves way more attention. It automatically configures itself with all relevant modules, runs very lean and has more information available than most people will ever need.


The article's complaints include the complexity of JS web interfaces and "eye candy", yet Netdata's UI requires JS, is quite laggy and jerky, and is very interactive. I think Munin fits better (it uses the same RRDtool graphs, too), though possibly its configuration is too lengthy for the requirements.



