This repository has been archived by the owner on Sep 13, 2023. It is now read-only.

Add telemetry to API #451

Merged: 13 commits into main from feature/api-telemetry, Nov 29, 2022

Conversation

@mike0sv (Contributor) commented Oct 26, 2022

Adds telemetry event logging to mlem.api commands. Some questions:

  1. Should we log an API event if it was used by a CLI command? (In this PR we do not.)
  2. What if one API command calls some nested API command? Should both events be logged, or only the topmost? (Kinda the same as 1) if you really think about it.)
  3. What parameters do we want to log for each command? (I will post suggestions as a separate comment.)
  4. Should save, load, load_meta be added too? It could show interesting stuff, like which types of models and datasets are used more frequently.
  5. We need a framework to enforce the same extra fields for similar API/CLI calls, but that depends on the previous questions.

@mike0sv mike0sv requested a review from a team October 26, 2022 17:44
@mike0sv mike0sv temporarily deployed to internal October 26, 2022 17:44 Inactive
@mike0sv (Contributor, Author) commented Oct 26, 2022

General suggestion: wherever we use MlemABC, log its alias (string type) and classpath (like mlem.contrib.smth.ClassName)

  • apply: model type, data type, whether batch is used (inference time + number of samples + number of features - ?)
  • apply_remote: client, data type
  • clone: from where to where (from S3 to local, or from GitHub to S3)
  • init: -
  • link: from where to where
  • build: builder, model type
  • serve: server, model type
  • import_object: import hook type
  • deploy: deployment type
  • save, load, load_meta: model framework, data type

🤔 : We can also get the fs protocol used wherever paths are used (e.g., to know that people clone from GitHub to local, or build directly from/to S3)

UPD: @aguschin updated this. We can think about more options, but these already look good to me.
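
To make the field suggestions above concrete, here is a rough sketch of what a single event payload could look like. Everything in it is illustrative: the field names, aliases, and classpath placeholder are hypothetical, not a final schema.

    # Illustrative only: a possible payload for an `apply` API event, following the
    # suggestions above. All field names and values are hypothetical placeholders.
    def protocol_of(path: str) -> str:
        # simplified fs-protocol extraction: part before "://", defaulting to local files
        return path.split("://", 1)[0] if "://" in path else "file"

    event = {
        "interface": "api",
        "command": "apply",
        # MlemABC objects reported as alias (string type) + classpath
        "model_type": {"alias": "sklearn", "classpath": "mlem.contrib.smth.ClassName"},
        "data_type": {"alias": "ndarray", "classpath": "mlem.contrib.smth.ClassName"},
        "batch": False,
        "output_protocol": protocol_of("s3://bucket/preds.csv"),  # -> "s3"
    }
    print(event)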

@mike0sv (Contributor, Author) commented Oct 26, 2022

cc: @shcheklein @omesser @aguschin

@shcheklein (Member) commented

Hey, are there examples of libraries that call home? (Even a CLI raises some concerns; a library can give us even more trouble.)

Second question - are we wrapping it into a separate thread?

@mike0sv (Contributor, Author) commented Oct 27, 2022

Off the top of my head, I think boto3 logs every request. Not sure TBH, but I've seen something like this while debugging.

We even spawn a separate process.
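
For context, a minimal sketch of that fire-and-forget idea (not the actual MLEM implementation; the child process here is a stub that only parses the payload, where a real sender would make an HTTP request):

    # Sketch: send a telemetry event from a detached child process so the parent
    # never waits on it and never fails because of it.
    import json
    import subprocess
    import sys

    def send_event_async(event: dict) -> None:
        payload = json.dumps(event)
        try:
            # stub sender: a real one would POST the payload to the telemetry endpoint
            subprocess.Popen(
                [sys.executable, "-c", "import sys, json; json.loads(sys.argv[1])", payload]
            )
        except Exception:
            pass  # telemetry must never break the user's call

    send_event_async({"interface": "api", "command": "apply"})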

@mike0sv (Contributor, Author) commented Oct 27, 2022

I don't see any difference between API and CLI, TBH. Most of the concerns I've heard are about personal information being sent to someone who sells the data (like GA). Since we use our own open-source tools to collect telemetry, we should be good; anybody who has doubts can see which fields we collect.

Base automatically changed from release/0.3.0 to main October 27, 2022 16:38
@mike0sv mike0sv temporarily deployed to internal October 30, 2022 03:43 Inactive
@codecov (bot) commented Oct 30, 2022

Codecov Report

Base: 87.99% // Head: 87.16% // Project coverage decreases by 0.83% ⚠️

Coverage data is based on head (d43fc76) compared to base (75e2074).
Patch coverage: 90.64% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #451      +/-   ##
==========================================
- Coverage   87.99%   87.16%   -0.84%     
==========================================
  Files          96       96              
  Lines        8282     8718     +436     
==========================================
+ Hits         7288     7599     +311     
- Misses        994     1119     +125     
Impacted Files Coverage Δ
mlem/cli/serve.py 90.00% <0.00%> (ø)
mlem/contrib/sagemaker/runtime.py 80.64% <71.42%> (+0.98%) ⬆️
mlem/contrib/fastapi.py 84.74% <76.19%> (-11.09%) ⬇️
mlem/runtime/client.py 84.84% <79.41%> (-5.95%) ⬇️
mlem/contrib/lightgbm.py 94.35% <81.81%> (+0.23%) ⬆️
mlem/cli/main.py 84.70% <82.85%> (ø)
mlem/runtime/server.py 88.31% <88.23%> (+3.01%) ⬆️
mlem/core/data_type.py 92.63% <89.61%> (-2.93%) ⬇️
mlem/runtime/interface.py 87.50% <93.18%> (+2.39%) ⬆️
mlem/contrib/torch.py 93.91% <95.00%> (+0.16%) ⬆️
... and 25 more


@omesser (Contributor) commented Oct 31, 2022

Some thoughts:

  1. Should we log an API event if it was used by a CLI command? (In this PR we do not.)

What is the reason for telemetry here? If it's to show usage/adoption, then I would treat API and CLI as competing entry points. Meaning: if someone ran a command via CLI, don't register an API event, assuming we track API events in order to learn about direct usage of the API.

  2. What if one API command calls some nested API command? Should both events be logged, or only the topmost? (Kinda the same as 1) if you really think about it.)

Same; I would ask why we are logging this. If it's to show usage, then we probably shouldn't.

  3. What parameters do we want to log for each command? (I will post suggestions as a separate comment.)

I suggest consulting the engineers from DVC and CML about what they chose to record in their telemetry. The usual rule of thumb is to whitelist specific things (model architecture, deployment technology), being very mindful to collect only a handful of needed bits of information and nothing that could accidentally be seen as personal/sensitive information, like file names, paths, env variables, etc. (A small illustrative sketch of such a whitelist follows this comment.)

  4. Should save, load, load_meta be added too? It could show interesting stuff, like which types of models and datasets are used more frequently.

👍

Second question - are we wrapping it into a separate thread?

+1, it should not affect behavior, or should have a super minimal/negligible effect (functionality or performance).

Even a CLI raises some concerns; a library can give us even more trouble.

This is very true. We should probably put out some kind of blog post explaining what we log and when, so we can point concerned users there. Also, if we're adding this to the API, it should be VERY explicit and easy to turn off (func signature).
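
As referenced in point 3 above, a tiny illustrative whitelist filter (field names are hypothetical): only explicitly allowed keys make it into an event, so paths, file names, and env variables are dropped by construction.

    # Hypothetical whitelist filter: anything not explicitly allowed never leaves the machine.
    ALLOWED_FIELDS = {"command", "interface", "model_type", "data_type", "batch"}

    def filter_event(raw: dict) -> dict:
        return {key: value for key, value in raw.items() if key in ALLOWED_FIELDS}

    raw = {"command": "apply", "batch": True, "model_path": "/home/alice/model"}
    print(filter_event(raw))  # {'command': 'apply', 'batch': True}; the local path is dropped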

@aguschin (Contributor) commented

@mike0sv

  1. Should we log an API event if it was used by a CLI command? (In this PR we do not.)

No IMO, I don't see a reason for this.

  2. What if one API command calls some nested API command? Should both events be logged, or only the topmost? (Kinda the same as 1) if you really think about it.)

No, I don't see a reason for this either.

  3. What parameters do we want to log for each command? (I will post suggestions as a separate comment.)

Will update your comment suggestion if you don't mind.

  4. Should save, load, load_meta be added too? It could show interesting stuff, like which types of models and datasets are used more frequently.

Yes!

@omesser

We should probably put out some kind of blog post explaining what we log and when, so we can point concerned users there.

Good idea!

Also, if we're adding this to the API, it should be VERY explicit and easy to turn off (func signature).

Wouldn't turning it off in MLEM settings be enough? So it's either

  1. editing .mlem.yaml manually or via mlem config set ...
  2. doing that on the fly in the Python runtime with something like:

     import mlem.config
     mlem.config.analytics = False

Having it in each func signature could be inconvenient overkill, IMO.

@skshetry (Member) commented Oct 31, 2022

I don't see any difference between API and CLI, TBH.

Environment matters. It should do what's asked for and expected, definitely not phone home.

Also, at this stage of MLEM (with ~800 downloads/month), what insights do we think this will provide us?

@aguschin (Contributor) commented Nov 1, 2022

what insights do we think this will provide us

I would expect some insights about:

  1. Which ML frameworks and data formats people use (and which they try to use, but MLEM fails with) - this is what we can get from mlem.api.save(). Here we can pay more attention to some frameworks or prioritize supporting new ones.
  2. Whether MLEM is used programmatically at all, and how frequently. Right now I don't pay attention to the API at all (except for save/load), TBH. I assume CLI commands are used a few times more widely. If I happen to be wrong, I'll take a closer look at the API to see how people use it, etc.

I guess we can extract other things listed above from CLI usage as well. But at least these 2 can't be extracted without API analytics I assume.

@mike0sv (Contributor, Author) commented Nov 2, 2022

I find it kinda ironic that we implement tracking stuff in PR #451 🤣

# Conflicts:
#	mlem/api/commands.py
@mike0sv (Contributor, Author) commented Nov 2, 2022

I've created iterative/telemetry-python#49 based on the code in this PR. Once it's merged, I will use it to update the code in this PR.

@mike0sv mike0sv temporarily deployed to internal November 2, 2022 12:26 Inactive
@omesser (Contributor) commented Nov 2, 2022

@aguschin

Wouldn't turning it off in MLEM settings be enough?

Yeah, for CLI it makes sense 👍. For code usage, a config might not make sense, so your option 2 looks good (mlem.config.analytics = False); something like a global switch makes sense.

I totally get @shcheklein's and @skshetry's concerns here, even if it's not that clear-cut. I think we should take them very seriously, come off as super friendly to the user, and do whatever we can.

Having it in every API func signature might indeed be overdoing it a bit; it's dirty. But if you import a package, use the API, and step into the functions without seeing anything related to analytics, you won't think about adding or editing a config, and the library calls home anyway; it's not obvious, to me at least 🤔, so it may be perceived as sneaky/non-trivial to turn off.

A less dirty solution is maybe just adding a note about it, and how to turn it off, in the docstrings of the API entry points or at the top of the API files, guiding people on how to turn it off; probably also outputting a friendly stderr message with how to turn it off if sending fails? Somewhere developers will notice and be aware.

So it's your choice, folks, but yeah, let's be very careful to make it obvious and easy to turn off. It should be very safe, never harm performance or fail the execution (separate thread?), so we don't drive/scare away users; it should be safe to run in air-gapped systems, be well documented, etc.
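
To illustrate the "never fail the execution" point, a sketch of a fail-safe wrapper (hypothetical helper, not this PR's code): any telemetry error is swallowed, and the failure surfaces only as a one-line stderr hint about opting out (the config options mentioned earlier in this thread).

    # Sketch: telemetry errors are caught, the user's call is unaffected, and the
    # failure shows up only as a friendly stderr hint on how to opt out.
    import sys

    def safe_send(send, event: dict) -> None:
        try:
            send(event)
        except Exception:
            print(
                "mlem: analytics event could not be sent; "
                "you can disable analytics in .mlem.yaml or via `mlem config set ...`",
                file=sys.stderr,
            )

    def failing_sender(event):  # stand-in for a sender that raises
        raise ConnectionError("no network")

    safe_send(failing_sender, {"command": "apply"})  # prints the hint, does not raise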

@mike0sv (Contributor, Author) commented Nov 3, 2022

You can turn it off with mlem.telemetry.telemetry.enabled = False

I think we should write a blog post that explains how, what, and why we collect.

Maybe even add a link to it in the api module docstring? Not sure about function docstrings - too much duplication.

Also, when we reach 1.0 we can gradually remove some reports, or maybe disable API reporting by default. But it makes sense that we want as much feedback as possible while MLEM is not mature.
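
A minimal usage sketch of the opt-out mentioned above, assuming mlem.telemetry is an importable module as the quoted attribute path suggests (the commented-out API call is just an example):

    import mlem.telemetry

    # disable event reporting for this process before using the API
    mlem.telemetry.telemetry.enabled = False

    # from mlem.api import load
    # model = load("path/to/model")  # runs with telemetry disabled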

@omesser omesser requested review from omesser and removed request for omesser November 6, 2022 13:22
@mike0sv mike0sv mentioned this pull request Nov 9, 2022
@mike0sv mike0sv temporarily deployed to internal November 11, 2022 13:53 Inactive
@mike0sv mike0sv temporarily deployed to internal November 11, 2022 15:20 Inactive
@mike0sv mike0sv changed the base branch from main to aguschin-patch-1 November 11, 2022 16:08
@mike0sv mike0sv changed the base branch from aguschin-patch-1 to main November 11, 2022 16:08
@mike0sv mike0sv temporarily deployed to internal November 11, 2022 16:09 Inactive
@mike0sv mike0sv temporarily deployed to internal November 11, 2022 16:57 Inactive
@mike0sv (Contributor, Author) commented Nov 12, 2022

basically blocked by #472

@mike0sv mike0sv temporarily deployed to internal November 19, 2022 14:36 Inactive
@aguschin aguschin added api MLEM Python API analytics MLEM usage analytics and telemetry labels Nov 21, 2022
# Conflicts:
#	mlem/cli/main.py
#	mlem/core/metadata.py
@mike0sv mike0sv temporarily deployed to internal November 24, 2022 08:33 Inactive
@mike0sv mike0sv temporarily deployed to internal November 25, 2022 10:10 Inactive
@mike0sv mike0sv temporarily deployed to internal November 25, 2022 11:41 Inactive
@@ -146,6 +172,7 @@ def load_meta(
...


@api_telemetry
A contributor commented on this diff:

Should we be precise and log def load(...) instead of load_meta?

Related to the question about logging list_objects calls instead of many load_meta calls.

@aguschin (Contributor) commented Nov 29, 2022

Discussed with @mike0sv, decided to skip it for now.
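
For readers of the diff above, a rough sketch of what an api_telemetry-style decorator could do (not this PR's implementation): record the wrapped API function's name as the event, and make sure telemetry errors never affect the wrapped call.

    import functools

    def send_event(interface: str, command: str) -> None:
        print(f"[telemetry] {interface}:{command}")  # stand-in for the real sender

    def api_telemetry(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                send_event("api", func.__name__)
            except Exception:
                pass  # telemetry failures must not break the API call
            return func(*args, **kwargs)
        return wrapper

    @api_telemetry
    def load_meta(path: str):
        return {"path": path}  # placeholder body for illustration

    load_meta("models/clf")  # reports "api:load_meta", then returns the meta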

@aguschin aguschin merged commit 74ca3ae into main Nov 29, 2022
@aguschin aguschin deleted the feature/api-telemetry branch November 29, 2022 13:28