Commit Graph

62 Commits

Author SHA1 Message Date
chrisbecke ef0b8367b5 Update minio-overview.json data source panel (#13730)
Add missing datasource in `Healing` panel.
2021-11-23 09:01:07 -08:00
Mani 7b82411e6f change the unit of measurement from TB to TiB (#13686) 2021-11-18 20:06:37 -08:00
Ashish Kumar Sinha 3d2bc15e9a Add grafana json file for replication metrics (#13678) 2021-11-17 14:49:46 -08:00
jandres - moscardo 1aa08f594d Update README.md prometheus (#13514)
Modify the doc to warn users about Prometheus sending `domain:port`
2021-11-02 12:27:30 -07:00
Harshavardhana 90e505e58f calculate API requests/error as increase() intervals not as rate() 2021-09-12 11:28:28 -07:00
Harshavardhana 1250312287 fail ready/liveness if etcd is unhealthy in gateway mode (#13146) 2021-09-03 17:05:41 -07:00
Nitish Tiwari 60394ddf83 Add support for changing job name in Grafana dashboard (#13050) 2021-08-24 09:51:09 -07:00
Krishnan Parthasarathi 30b77f59b1 doc: Add ilm prometheus metrics information (#12994) 2021-08-17 12:19:36 -07:00
Matt Sarrel 109c8acf4f fixed typo in metrics README.md (#12874) 2021-08-04 12:48:57 -07:00
Nitish Tiwari 32017454ee fix typo in Grafana dashboard json (#12471) 2021-06-09 08:04:12 -07:00
Nitish Tiwari 00c5d7e1b3 Add healing related metrics in official dashboard (#12456) 2021-06-07 12:46:54 -07:00
Poorna Krishnamoorthy 3690de0c6b Drop Pending size and count from replication metrics (#12378)
Real-time metrics calculated in-memory rely on the initial
replication metrics saved with data usage. However, this can
lag behind the actual state of the cluster at the time of server 
restart leading to inaccurate Pending size/counts reported to
Prometheus. Dropping the Pending metrics as this can be more 
reliably monitored by applications with replication notifications.

Signed-off-by: Poorna Krishnamoorthy <poorna@minio.io>
2021-05-31 20:26:52 -07:00
Nitish Tiwari a592d3be19 fix the dashboard to use $rate_interval (#12277)
refer https://grafana.com/blog/2020/09/28/new-in-grafana-7.2-__rate_interval-for-prometheus-rate-queries-that-just-work/
for further information
2021-05-12 08:06:47 -07:00
Harshavardhana 2fd9c13b50 rename minio-cluster to minio-job as per prometheus config 2021-05-06 12:39:58 -07:00
Nitish Tiwari ddc1e4b5b3 Update Grafana dashboard to use the new v2 cluster metrics (#12220)
Fixes #11543
2021-05-06 14:44:03 +05:30
Harshavardhana 8a9d15ace2 update prometheus metrics with failed_count 2021-04-04 09:52:37 -07:00
Poorna Krishnamoorthy 47c09a1e6f Various improvements in replication (#11949)
- collect real time replication metrics for prometheus.
- add pending_count, failed_count metric for total pending/failed replication operations.

- add API to get replication metrics

- add MRF worker to handle spill-over replication operations

- multiple issues found with replication
- fixes an issue when client sends a bucket
 name with `/` at the end from SetRemoteTarget
 API call make sure to trim the bucket name to 
 avoid any extra `/`.

- hold write locks in GetObjectNInfo during replication
  to ensure that object version stack is not overwritten
  while reading the content.

- add additional protection during WriteMetadata() to
  ensure that we always write a valid FileInfo{} and avoid
  ever writing empty FileInfo{} to the lowest layers.

Co-authored-by: Poorna Krishnamoorthy <poorna@minio.io>
Co-authored-by: Harshavardhana <harsha@minio.io>
2021-04-03 09:03:42 -07:00
Ritesh H Shukla 23b03dadb8 Add process uptime metric (#11844) 2021-03-20 21:23:27 -07:00
Harshavardhana 2c198ae7b6 fix: prometheus metrics disks_online count when disks are down (#11689)
prometheus metrics was using total disks instead
of online disk count, when disks were down, this
PR fixes this and also adds a new metric for
total_disk_count
2021-03-03 11:18:41 -08:00
Krishna Srinivas 876b79b8d8 read-health check endpoint returns success if cluster can serve read requests (#11310) 2021-02-09 01:00:44 -08:00
Ritesh H Shukla c4848f9b4f Add process start time to cluster metrics. (#11405) 2021-02-01 23:02:18 -08:00
Ritesh H Shukla 0bf2d84f96 update new metrics url docs (#11342) 2021-01-25 01:03:07 -08:00
Ritesh H Shukla 7575c24037 Add open FD and FD limit to cluster metrics (#11328) 2021-01-22 18:30:16 -08:00
Harshavardhana c080f04e66 fix: prometheus metrics link typo update to latest 2021-01-22 01:53:23 -08:00
Ritesh H Shukla b4add82bb6 Updated Prometheus metrics (#11141)
* Add metrics for nodes online and offline
* Add cluster capacity metrics
* Introduce v2 metrics
2021-01-18 20:35:38 -08:00
Harshavardhana 14792cdbc6 docs: fix the metrics formatting (#11081) 2020-12-10 18:15:47 -08:00
Harshavardhana 97856bfebf fix: grafana double counting for bucket usage, histrogram and objects (#11070) 2020-12-09 20:30:37 -08:00
Nitish Tiwari 54d243cd98 fix: grafana dashboard calculating online nodes (#11041)
Also use a generic name instead of diff names per revision
2020-12-09 00:26:42 -08:00
Ritesh H Shukla 04848dfa1c Add documentation for bucket replication related metrics (#11055) 2020-12-08 12:48:10 -08:00
Harshavardhana 4a564336fe Revert "Add metrics for nodes online and offline (#11050)"
This reverts commit f60bbdf86b.
2020-12-08 09:23:35 -08:00
Ritesh H Shukla f60bbdf86b Add metrics for nodes online and offline (#11050) 2020-12-08 01:06:27 -08:00
Poorna Krishnamoorthy f3beb1236a Add cache usage, total capacity to prometheus metrics (#11026) 2020-12-07 16:35:11 -08:00
Nitish Tiwari 6ff12f5f01 Add the dashboard json file (#11028)
This will allow users to contribute to the dashboard as needed.
2020-12-04 16:27:41 -08:00
Nitish Tiwari de9b64834e fix: update grafana dashboard docs (#11023)
Refer to the official Grafana dashboard
2020-12-03 15:56:15 -08:00
Anis Elleuch 8e8ddf7233 doc: Add definition of 1KB and 1MB in prometheus (#10857) 2020-11-09 10:05:01 -08:00
Harshavardhana 9c042a503b remove deprecate readiness from healthcheck docs (#10659) 2020-10-12 18:56:03 -07:00
Harshavardhana 8b74a72b21 fix: rename READY deadline to CLUSTER deadline ENV (#10535) 2020-09-23 09:14:33 -07:00
Derek Bender 3168e93730 fix typo in healthcheck README.md (#10518) 2020-09-18 09:52:37 -07:00
Harshavardhana ec06089eda fix: re-implement cluster healthcheck (#10101) 2020-07-20 18:31:22 -07:00
Harshavardhana 3520e946a2 fix: versioning docs add more examples 2020-07-11 00:57:46 -07:00
Harshavardhana d5ff1c8e3b fix docs image urls to be absolute path 2020-07-11 00:27:30 -07:00
Nitish Tiwari 30c251efd3 Add Grafana dashboard (#10000) 2020-07-09 12:01:58 -07:00
Harshavardhana c0ac25bfff fix: readiness needs to be like liveness (#9941)
Readiness as no reasoning to be cluster scope
because that is not how the k8s networking works
for pods, all the pods to a deployment are not
sharing the network in a singleton. Instead they
are run as local scopes to themselves, with
readiness failures the pod is potentially taken
out of the network to be resolvable - this
affects the distributed setup in myriad of
different ways.

Instead readiness should behave like liveness
with local scope alone, and should be a dummy
implementation.

This PR all the startup times and overal k8s
startup time dramatically improves.

Added another handler called as `/minio/health/cluster`
to understand the cluster scope health.
2020-06-30 11:28:27 -07:00
Harshavardhana f9aa239973 fix: export prometheus metrics for cache GC triggers (#9815)
Bonus change to use channel to serialize triggers,
instead of using atomic variables. More efficient
mechanism for synchronization.

Co-authored-by: Nitish Tiwari <nitish@minio.io>
2020-06-15 09:05:35 -07:00
Harshavardhana 5e529a1c96 simplify context timeout for readiness (#9772)
additionally also add CORS support to restrict
for specific origin, adds a new config and
updated the documentation as well
2020-06-04 14:58:34 -07:00
Harshavardhana 53aaa5d2a5 Export bucket usage counts as part of bucket metrics (#9710)
Bonus fixes in quota enforcement to use the
new datastructure and use timedValue to cache
a value/reload automatically avoids one less
global variable.
2020-05-27 06:45:43 -07:00
poornas 336460f67e fix: gateway_s3_bytes_sent metric for all API methods (#9242)
Co-authored-by: Harshavardhana <harsha@minio.io>
2020-04-01 12:52:31 -07:00
Nitish Tiwari 6b984410d5 Add support for self-healing related metrics in Prometheus (#9079)
Fixes #8988

Co-authored-by: Anis Elleuch <vadmeste@users.noreply.github.com>
Co-authored-by: Harshavardhana <harsha@minio.io>
2020-03-24 22:40:45 -07:00
Nitish Tiwari 63be4709b7 Add metrics support for Azure & GCS Gateway (#8954)
We added support for caching and S3 related metrics in #8591. As
a continuation, it would be helpful to add support for Azure & GCS
gateway related metrics as well.
2020-02-11 21:08:01 +05:30
Kevin Humphreys 656146b699 doc: Prometheus metrics name fix (#8774)
changed docs to reflect proper Prometheus metrics
2020-01-09 18:36:58 -08:00