General suggestion: wherever we use
🤔 : We can also get the fs protocol used wherever paths are used (e.g., to know whether people clone from GitHub to local, or build directly from/to S3). UPD: @aguschin updated this. We can think about more options, but these look good to me already.
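A minimal sketch of what "get the fs protocol" could look like, assuming only the URI scheme is reported and never the path itself. The helper name `fs_protocol` is hypothetical and not taken from this PR; it uses only the standard library.

```python
from urllib.parse import urlsplit


def fs_protocol(path: str) -> str:
    """Return only the scheme of a path or URI, defaulting to "file".

    Hypothetical helper: the rest of the path (bucket, repo, file name)
    is discarded so that nothing identifying is reported.
    """
    scheme = urlsplit(path).scheme
    # A one-letter "scheme" is almost certainly a Windows drive (C:\...).
    if len(scheme) <= 1:
        return "file"
    return scheme.lower()
```

With this, `s3://bucket/model` would be reported as `s3` and a local path as `file`, which is enough to tell local usage from remote usage.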
Hey, are there examples of libraries that call home? (Even a CLI raises some concerns; a library can give us even more trouble.) Second question: are we wrapping it in a separate thread?
Off the top of my head, I think boto3 logs every request. Not sure tbh, but I've seen something like this when I debugged something.
We even spawn a separate process.
Don't see any difference between API and CLI tbh. Most of the concerns I've heard are about PI and sending it to someone who sells the data (like GA). Since we use our own open source tools to collect telemetry, we should be good: anybody can see what fields we are collecting if they have doubts.
Codecov Report
Base: 87.99% // Head: 87.16% // Decreases project coverage by -0.84%
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #451      +/-   ##
==========================================
- Coverage   87.99%   87.16%   -0.84%
==========================================
  Files          96       96
  Lines        8282     8718     +436
==========================================
+ Hits         7288     7599     +311
- Misses        994     1119     +125
Some thoughts:
What is the reason for telemetry here? If it's to show usage/adoption, then I would treat API and CLI as competing entry points. Meaning: if someone ran a command via the CLI, don't register an API event, assuming we are tracking API events in order to learn about direct usage of the API.
Same, I would think about why we are logging this. If it's to show usage, then we probably shouldn't.
Suggest consulting with engineers from DVC and CML about what they chose to register in their telemetry. The usual rule of thumb is to whitelist specific stuff (model architecture, deployment technology), being very mindful to collect only a handful of needed bits of information and nothing that could accidentally be seen as personal/sensitive information, like file names, paths, env variables, etc.
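The whitelist rule of thumb could be enforced at the last step before sending, so that a field added by mistake never leaves the machine. A minimal sketch, assuming a flat dict payload; the field names in `ALLOWED_FIELDS` are made up for illustration.

```python
# Hypothetical allow-list; the real set of fields is up to the maintainers.
ALLOWED_FIELDS = frozenset({"event", "model_type", "deploy_type"})


def sanitize(payload: dict) -> dict:
    """Keep only explicitly allowed keys and drop everything else."""
    return {key: value for key, value in payload.items() if key in ALLOWED_FIELDS}


raw = {"event": "api/save", "model_type": "sklearn", "path": "/home/me/model"}
clean = sanitize(raw)  # the "path" key is dropped
```

An allow-list fails closed: a new key is dropped until someone deliberately adds it, which is the opposite of a block-list.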
👍
+1, it should not affect behavior, or have only a super minimal/negligible effect (functionality or performance).
This is very true. We should probably put out some kind of blog post explaining what we are logging and when, so we can point concerned users there. Also, if we're adding this to the API, it should be VERY explicit and easy to turn off (func signature). |
No IMO, don't see a reason for this
No, don't see a reason either
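A rough sketch of the "must not affect behavior" requirement: send in the background and swallow every error. It uses a daemon thread rather than the separate process mentioned earlier in the thread, purely to keep the example short; `send_in_background` and the `send` callable are hypothetical.

```python
import threading


def send_in_background(send, payload):
    """Run send(payload) in a daemon thread and never raise to the caller."""

    def _worker():
        try:
            send(payload)
        except Exception:
            # Telemetry failures (no network, air-gapped host) are ignored.
            pass

    # daemon=True: a pending send never keeps the interpreter alive at exit.
    thread = threading.Thread(target=_worker, daemon=True)
    thread.start()
    return thread
```

The trade-off is that events still in flight when the program exits may be lost, which seems acceptable for analytics.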
Will update your comment suggestion if you don't mind.
Yes!
Good idea!
Wouldn't turning it off in MLEM settings be enough? So it's either
Having it in each func signature could be inconvenient overkill IMO. |
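A global switch could be as small as one environment check consulted before any event is built. This is a sketch only: the variable name `EXAMPLE_NO_ANALYTICS` is a placeholder and is not claimed to be the setting MLEM actually uses.

```python
import os

# Placeholder name for illustration, not the real MLEM setting.
OPT_OUT_ENV = "EXAMPLE_NO_ANALYTICS"


def analytics_enabled(environ=None) -> bool:
    """Return False when the opt-out variable is set to a truthy value."""
    environ = os.environ if environ is None else environ
    value = environ.get(OPT_OUT_ENV, "")
    return value.strip().lower() not in {"1", "true", "yes"}
```

Passing `environ` explicitly keeps the check easy to test; in normal use it falls back to `os.environ`.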
Environment matters. It should do what's asked for and expected, definitely not phone home. Also, at this stage of |
I would expect some insights about:
I guess we can extract other things listed above from CLI usage as well. But at least these 2 can't be extracted without API analytics I assume. |
I find it kinda ironic that we implement tracking stuff in PR #451 🤣 |
# Conflicts: # mlem/api/commands.py
I've created iterative/telemetry-python#49 based on code in this PR. Once it's merged I will use it to update code in this PR |
Yeah, for the CLI it makes sense 👍. For code usage, config might not make sense, so your option 2 looks good.
I totally get @shcheklein's and @skshetry's concerns here, even if it's not that clear-cut. I think we should take them very seriously and come off as super friendly to the user, and do whatever we can. Having it in every API func signature might indeed be overdoing it a bit; it's dirty. But if you import a package and use the API, step into the functions, and don't see anything related to analytics, you won't think about adding a config and editing it, and it will still call home. It's not obvious, to me at least 🤔, so it may be perceived as sneaky/non-trivial to turn off.
A less dirty solution is maybe just adding a note about it, and how to turn it off, in the docstrings of the API entry points or at the top of the API files, guiding people on how to turn it off. Probably also output a friendly message to stderr with how to turn it off if it fails? Somewhere developers will notice and be aware of it.
So it's your choice folks, but yeah, let's be very careful to make it obvious and easy to turn off. It should be very safe and never harm performance or fail the execution (separate thread?), so as not to drive/scare away users. It should be safe to run in air-gapped systems, be well documented, etc. |
You can turn it off with
I think we should write a blog post that explains how, what and why we collect. Maybe even add a link to it in the api module docstring? Not sure about function docstrings: too much duplication. Also, when we reach 1.0 we can gradually remove some reports, or maybe disable API reporting by default. But it makes sense that we want as much feedback as possible while MLEM is not mature. |
# Conflicts: # mlem/api/commands.py
basically blocked by #472 |
# Conflicts: # mlem/cli/main.py # mlem/core/metadata.py
@@ -146,6 +172,7 @@ def load_meta(
...
@api_telemetry
Should we be precise and log def load(...) instead of load_meta? Related to the question about logging list_objects calls instead of many load_meta calls.
Discussed with @mike0sv, decided to skip it for now.
Adds telemetry event logging to mlem.api commands. Some Qs:
Should save, load, load_meta be added too? It can show interesting stuff like what types of models and datasets are used more frequently