Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OTS Statistics by kondra for TFS 1.6+ #4874

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

gesior
Copy link
Contributor

@gesior gesior commented Dec 14, 2024

It's well known 'OTS stats' system with LIMITED FUNCTIONALITY FOR TFS 1.6+:
It does not track C++ functions executed by dispatcher as it's not possible on 1.6+ ( #4306 (comment) ).
To track usage by C++ functions, you have to add AutoStat to each function you want to track. More about it at end of this PR.

It tracks all Lua function calls and SQL queries.
It generates files lua.log and sql.log in stats/ that looks like:

[14/12/2024 02:19:43] Players online: 0
 Time (ms)     Calls     Rel usage %    Real usage % Description
         0        70       41.44366%        0.00420% data/npc/scripts/default.lua:onThink
         0        30       19.20608%        0.00195% data/npc/scripts/runes.lua:onThink
         0        10       11.75076%        0.00119% data/npc/scripts/bank.lua:onThink
         0        10       10.38807%        0.00105% data/npc/scripts/The Oracle.lua:onThink
         0        10        9.37099%        0.00095% data/npc/scripts/promotion.lua:onThink

Report interval is configurable in config.lua.

It also reports slow/very slow function calls in files lua_slow.log and sql_slow.log. Ex. startup.lua onStartup event execution time is higher than 10 ms, it's reported as slow in lua_slow.log:

[14/12/2024 02:19:23] Execution time: 14 ms - data/globalevents/scripts/startup.lua:onStartup - 

SQL took over 10 ms, it's reported as slow:

[29/01/2023 16:06:07] Execution time: 24 ms - TRUNCATE TABLE `towns` - TRUNCATE TABLE `towns`

Slow/very slow time is configurable in config.lua

There are also special.log, special_slow.log and special_very_slow.log, which report your custom statistics. Ex. to report decay time of items, in game.cpp under:

void Game::internalDecayItem(Item* item)
{

add:

AutoStat stat("internalDecayItem", std::to_string(item->getID()));

String internalDecayItem will be used for total CPU usage report. Second parameter (item ID) will be reported in slow and very_slow logs. In this case, it would say which item ID decayed over 10 ms (single item decay time, not all items with this item ID).
Most of OTSes pass player name as second parameter. It let them track which player generated 'slow' actions - lagged OTS.

@Codinablack
Copy link
Contributor

There are plenty of profiling tools already available. Even if someone wanted both instrumentation and sampling, there are still plenty of FOSS softwares out there that have whole teams of people who study this field of programming working on them. I think adding this to TFS would be a mistake, it's not needed, and for many like myself, not wanted. If I wanted to profile a TFS server, I certainly wouldn't use this code, I would get a professional grade profiler to do the job.

Adding this system is adding un-needed, un-necessary, and un-asked for bloat to the code base, even with the #ifdef's, the core part of the code is added (stats.cpp/stats.h) whether or not it's enabled. Even if this system does an "ok" job, there are still plenty of FOSS profilers and paid profilers out there that would undoubtedly do a much better job, and still allows users the freedom to decide if they want to profile by sampling, or profile by instrumentation.

It's just my opinion on the matter.

@Tofame
Copy link
Contributor

Tofame commented Dec 15, 2024

This will be difference between having something or having nothing at all.

I cannot even imagine devs downloading or paying for 3rd party profilers (obviously first a person would have to learn how to use and attach them to the engine (like its gonna happen xd))... all while there is ready&&free commmit that can do it for them.

Simple, free, and all work already done by Gesior/Kondrah. 100% needs to be merged.

@ArturKnopik
Copy link
Contributor

I think it's a useful piece of code, it allows developer to catch weak points at the development stage. I agree with @Tofame about 3rd party profilers.

@Codinablack
Copy link
Contributor

Codinablack commented Dec 16, 2024

I cannot even imagine devs downloading or paying for 3rd party profilers (obviously first a person would have to learn how to use and attach them to the engine (like its gonna happen xd))...

Well, if the user doesn't have the ability or time to learn how to setup a profiler, then what are they actually gaining from even using one in the first place? I mean seriously, the people who can't figure out how to attach a profiler to a process, aren't going to know what to do with the data they are getting!

Besides they don't need to download or pay, both Windows and Linux come stock with profilers. Visual studio has profiler built into it , and linux has various tools at its disposal as stock (depends on which ones for which distro). Both windows and linux users can use Jetbrains Rider for free, which also has a profiler.

At any rate, I obviously highly disagree with the thought that this needs merged. This "limited" version isn't even worth its overhead. While I will agree it's beneficial to have detailed data about used resources, I think this "limited" instrumentation is just a bloat that would be in the way for a real instrumentation job.
Besides, sampling is highly effective, we should not be taking the instrumentation route automatically for the end users... that kind of decision for extreme measures should always be their own choice.

@gesior
Copy link
Contributor Author

gesior commented Dec 16, 2024

I mean seriously, the people who can't figure out how to attach a profiler to a process, aren't going to know what to do with the data they are getting!

I tuned a bit first post stats to some values you may get on real OTS with bugged Lua script:

[14/12/2024 02:19:43] Players online: 5
 Time (ms)     Calls     Rel usage %    Real usage % Description
         9517       10      90.37099%        90.00951% data/npc/scripts/promotion.lua:onThink
         423        70       4.144366%        4.20123% data/npc/scripts/default.lua:onThink
         192        30       1.920608%        1.95123% data/npc/scripts/runes.lua:onThink
         175        10       1.175076%        1.19123% data/npc/scripts/bank.lua:onThink
         123        10       1.038807%        1.05123% data/npc/scripts/The Oracle.lua:onThink

[14/12/2024 02:19:39] Execution time: 9011 ms - data/npc/scripts/promotion.lua:onThink - 

Do you need to know how to setup profiler to read which Lua script is a problem?

If someone still doesn't get it, there is a tutorial: https://otland.net/threads/how-to-read-ots-statistics-logs.283722/

@Codinablack
Copy link
Contributor

I mean seriously, the people who can't figure out how to attach a profiler to a process, aren't going to know what to do with the data they are getting!

I tuned a bit first post stats to some values you may get on real OTS with bugged Lua script:

[14/12/2024 02:19:43] Players online: 5
 Time (ms)     Calls     Rel usage %    Real usage % Description
         9517       10      90.37099%        90.00951% data/npc/scripts/promotion.lua:onThink
         423        70       4.144366%        4.20123% data/npc/scripts/default.lua:onThink
         192        30       1.920608%        1.95123% data/npc/scripts/runes.lua:onThink
         175        10       1.175076%        1.19123% data/npc/scripts/bank.lua:onThink
         123        10       1.038807%        1.05123% data/npc/scripts/The Oracle.lua:onThink

[14/12/2024 02:19:39] Execution time: 9011 ms - data/npc/scripts/promotion.lua:onThink - 

Do you need to know how to setup profiler to read which Lua script is a problem?

Yes, because the kondra "stats" system is a profiler.
The log that you are using to see which script is a problem, its generated from your profiler (in this case the kondra stats) through a method known as "instrumentation"... Your way of doing it requires intrusive code into the codebase to find the lua script.

I prefer to just simply push the play/run button in my favorite IDE (both Rider and VS come with this feature built in) and launch my server through the IDE rather than add code to my server, and still get the relevant information I need via "sampling" instead of "instrumentation"... both will indicate the high usage of CPU from the process "onThink", and so ultimately both will get the job done, but with my way, I don't have a choice forced upon me.

@kojimhojo
Copy link

kojimhojo commented Dec 17, 2024

I prefer to just simply push the play/run button in my favorite IDE (both Rider and VS come with this feature built in) and launch my server through the IDE rather than add code to my server,

Off-topic related to OTS stats-profiler (source of this discussion):
@Codinablack
I wanted to make this recommendation to you on Discord, but you disappeared from OTA. Maybe it would be a good idea to add a tutorial video (I remember seeing you make one about Premake compilation) to your wiki or devdocs on how to "just simply push the play/run button in the IDE" for your Blacktek project (not how to interpret it, just how to do it)

(of course only if you have time.)

That could also be useful to this repo and the entire community.

@ArturKnopik
Copy link
Contributor

I prefer to just simply push the play/run button in my favorite IDE (both Rider and VS come with this feature built in) and launch my server through the IDE rather than add code to my server,

Off-topic: @Codinablack I wanted to make this recommendation to you on Discord, but you disappeared from OTA. Maybe it would be a good idea to add a tutorial video (I remember seeing you make one about Premake compilation) to your wiki or devdocs on how to "just simply push the play/run button in the IDE" for your Blacktek project

please, use his repo to talk about his project... open issue or something

@Codinablack
Copy link
Contributor

Off-topic related to OTS stats-profiler (source of this discussion): @Codinablack I wanted to make this recommendation to you on Discord, but you disappeared from OTA. Maybe it would be a good idea to add a tutorial video (I remember seeing you make one about Premake compilation) to your wiki or devdocs on how to "just simply push the play/run button in the IDE" for your Blacktek project (not how to interpret it, just how to do it)

(of course only if you have time.)

That could also be useful to this repo and the entire community.

I have to agree with @ArturKnopik this is not the place for that discussion, just reach out to me on discord or otland if you need to talk with me.

When I said "my server" I was speaking in a general context of an end user for TFS. As a user of TFS, I would rather have my server ran through my choice of profiler, than for the devs to just add a "limited" instrumentation job.

@Tofame
Copy link
Contributor

Tofame commented Dec 17, 2024

The amount of shit, unoptimized code will only increase, because of GPT overuse (abuse). OTland resources tab is full of GPT generated scripts that are terrible.

A step has to be taken, a code that simply generates logs for you without doing much work is very good.

You can then do whatever, analyse by yourself or even send those logs.txt to someone knowledgable, pay.him/her and he/she will fix parts of code that will be needed without a need to send that person full sources/exe to run and profile.

You forget about users of TFS/Canary, that are not developers (majority). A lot of amateurs that dont know what they are doing, if explained by a bit in documentation will learn to make use of it and improve their OTs.

@ArturKnopik ArturKnopik added enhancement Increase or improvement in quality, value, or extent feature New feature or functionality labels Dec 20, 2024
@nekiro
Copy link
Member

nekiro commented Dec 28, 2024

This is obviously a good portion of working code that helps by profiling what's going on, but tbh I wouldn't want this to be merged into tfs. These features could be kept separately as a patch of some sort, just so you know there is a way to add this to tfs.
More codes like this is more bloatware which I don't like

@gesior
Copy link
Contributor Author

gesior commented Dec 29, 2024

These features could be kept separately as a patch of some sort, just so you know there is a way to add this to tfs.

Nobody cares about features that 'can be added'. As Tofame posted, most of TFS/canary users are not developers. They download it, run and that's all what they do/can do.

This is a thing that most of OTS developers forgot. It's FOSS as any other. You don't care who/how wrote nginx/php/gcc or any other FOSS you use daily. You just download it and use.
If there is no build-in tool to detect lag reason, 99% of community won't be able to detect lag reason.

@Codinablack
Copy link
Contributor

These features could be kept separately as a patch of some sort, just so you know there is a way to add this to tfs.

Nobody cares about features that 'can be added'. As Tofame posted, most of TFS/canary users are not developers. They download it, run and that's all what they do/can do.

This is a thing that most of OTS developers forgot. It's FOSS as any other. You don't care who/how wrote nginx/php/gcc or any other FOSS you use daily. You just download it and use. If there is no build-in tool to detect lag reason, 99% of community won't be able to detect lag reason.

Yours and ToFame's argument counters your point!!!!

You want this to be there because you say that "users are not smart enough" to do these kinds of things on their own... well if that user isn't smart enough to do this on their own, then they wouldn't be smart enough to know what to do with the data they collect from it!

Adding it there so that "other users" can help the ones who don't know what to do with the data is just adding a penalty to users who DO KNOW how to add such profiling techniques AND know what to do with the data it provides!

Just because some dunce has performance logs, DOES NOT, even in the slightest, mean that someone else will be able to, or is going to, help the dunce solve w/e problem they have... in fact, most people don't even get help when they have all the information that they could possibly need to get help.... so what makes performance logs so beneficial in helping someone troubleshoot some problem they are having?????

I think you would be hard pressed to find a viable situation in which this kind of data could actually be useful to a dev attempting to help someone else who has these kinds of logs... unless the code they are profiling is just an infinite loop or some other pure garbage code that wouldn't make it into TFS on its worst day.

I'm on the same side as Nekiro on this one, there is 0 reason to add this sort of bloat to TFS. This all just comes back to others trying to make money on those who can't code, and so it has no place here!

@gesior
Copy link
Contributor Author

gesior commented Dec 29, 2024

so what makes performance logs so beneficial in helping someone troubleshoot some problem they are having?????

I think you would be hard pressed to find a viable situation in which this kind of data could actually be useful to a dev attempting to help someone else who has these kinds of logs

It's only profiling tool you can use on production server to track rare and unexpected lags.

@Codinablack
In every thread related to OTS Stats, you argue that there are better tools.
I give you very simple real world example.
Example:

  • you have TFS 1.6 server with 500-700 online
  • you host it on 4 cores, 8 GB RAM VPS with 80 GB SSD
  • dispatcher thread uses 80% of single core
  • every few days players report that between 2-5 AM (when you sleep) there was a 3-5 seconds lag

Possible reasons:

  • maybe only few players have lag, not everyone - invalid report from players
  • network lag on VPS
  • CPU speed drop on VPS, as it's shared CPU/RAM
  • bug in some C++/Lua code

Reason detected with OTS Stats is function onSay executed by tasks.lua NPC.

What would you do?
Tell us what pros do with problem like that, so everyone can follow your instruction and I will close this PR.

@MillhioreBT
Copy link
Contributor

Quote to Gesior

so what makes performance logs so beneficial in helping someone troubleshoot some problem they are having?????
I think you would be hard pressed to find a viable situation in which this kind of data could actually be useful to a dev attempting to help someone else who has these kinds of logs

It's only profiling tool you can use on production server to track rare and unexpected lags.

@Codinablack In every thread related to OTS Stats, you argue that there are better tools. I give you very simple real world example. Example:

  • you have TFS 1.6 server with 500-700 online
  • you host it on 4 cores, 8 GB RAM VPS with 80 GB SSD
  • dispatcher thread uses 80% of single core
  • every few days players report that between 2-5 AM (when you sleep) there was a 3-5 seconds lag

Possible reasons:

  • maybe only few players have lag, not everyone - invalid report from players
  • network lag on VPS
  • CPU speed drop on VPS, as it's shared CPU/RAM
  • bug in some C++/Lua code

Reason detected with OTS Stats is function onSay executed by tasks.lua NPC.

What would you do? Tell us what pros do with problem like that, so everyone can follow your instruction and I will close this PR.

Currently there are problems with NPCs, the library is very old and contains many bad Lua practices
Furthermore, trusting data that can be manipulated by the players themselves is something very strange.

At this point, we all know that the NPC library needs to be replaced. I believe @EvilHero90 did some work on this, but at the moment that library is not being used, or at least I don't know anyone who is utilizing new NPCs with the new system. The classic NPCs via XML are still being used, and we are not considering that most have poorly optimized custom scripts.

You can't imagine the number of bad scripts I've seen that inexperienced people create by themselves or through a poor use of AI.

Those data you obtained with the Kondra system, I also got with the VS2022 profiling tool, even with more technical details.
For example, when there are many players and all these players are keeping many NPCs or monsters awake, checkCreatures has to spend more time executing, which means the CPU remains more active than when there are few players.
This is normal; what needs to be done is to optimize the onThink events in Lua.

Some of the best optimizations in Lua are:

  • Avoid building tables; instead, reuse them whenever possible.
  • Concatenate with tables.
  • Avoid creating anonymous functions; instead, reuse them.
  • Don’t overuse Game.getSpectators in ranges other than 11x11, as this is the only configuration currently with caching enabled.
  • Don’t overuse the Tile(pos) constructor; instead, you can cache it in a local table.
  • Etc...

There are many more things to optimize in Lua. We’re talking about something that consumes 100ms; with these optimizations, it would only cost 1ms of CPU.

Should the engine be improved?

Of course, there is still a long way to go.

But please, do not base your conclusions on data obtained from servers with 1000 players.
Generally, a server that has this number of players has a lot of custom Lua systems, or even features built into the sources.
A default TFS without any modifications would not reach the numbers you mentioned.

@gesior
Copy link
Contributor Author

gesior commented Dec 29, 2024

I post some real world example of a problem and everyone respond with 'I do X on my home Windows PC with 1 online to check how many nanoseconds given function execute'.

This is not a problem that you solve with production monitoring tools like OTS Stats. Things you monitor with tools like that are rare lags (1 time per X hours/days) and changes in C++/Lua CPU usage after OTS updates.

Currently there are problems with NPCs, the library is very old and contains many bad Lua practices

It was an example. Imagine that you downloaded TFS 1.6, then installed 10 task systems/quests from OTLand. Made some nice map. It starts lagging someday. Maybe you have 5 online, maybe 500. This is a real world scenario of average TFS user.

I've helped 50+ OTSes in last year. Most of them were small and there were only 2 things that blocked them from going 100+ online:

  • CPU went 100% or there were random visible lags in game, but ONLY on production server with XX+ online
  • DDoSes

90% of otservlist top 50 servers already use OTS Stats. That was my argument when I asked Kondra to release this system for free. You could hire someone on BR forum in 2022 for just 10$ to add it to your OTS engine.

A default TFS without any modifications would not reach the numbers you mentioned.

TFS BASED servers can easily reach 500+ online. Ex. Nekiro downgrade 8.6 with some basic 8.6 logic fixes and nice evo map - NO OPTIMIZATIONS, but it can also go 100% CPU with 50 online, if you add 1 poorly optimized quest/system from OTLand.
Maybe someday Nekiro or someone else will prepare a new 8.6 downgrade engine. It would be nice if it included OTS statistics by default.
As I benchmarked here (line 16): https://docs.google.com/spreadsheets/d/19uSEjxdpXQGrdheusutf7ndru5D2OS2c7XNCEJ6PFjg/edit?gid=0#gid=0
TFS 1.4 without any changes can process 400 players running around with 10 monsters on each player screen.

@Codinablack
Copy link
Contributor

@gesior

You ignore counter arguments and valid points, and just repeat yourself.
You think you are smarter than everyone else and continue to just talk down to people as if they don't understand what you are saying, or understand how to read the logs output by this system.

You have resorted to trying to bully me into solving your fake problem, that has no bearing or relevance to this proposal. I would solve the problem by investigating and isolating what is the root cause, systematically, and once I have found the cause, I would then resolve the problem... if I couldn't think of a resolution immediately, I may seek guidance or help from another... The exact same way anyone would solve such a problem, isolate, confirm, resolve. I don't know what you are wanting from me, or trying to prove by posing this fake scenario in which "onSay" is a problem with your "tasks"... sounds to me like you are copying an actual example of something that happened in which you used this system to find the problem... good for you! There is not only one way to find a problem, and usually not only one way to solve it either.

All that above is nonsense and off topic.

The facts is this :
This is a profiling system that uses the method of instrumentation to get its information.
I consider this method to be intrusive and un-necessary, period. That is my opinion and I'm just going to leave it at that before being goaded into more circular debates with you.

Others are allowed their opinions, clearly there is some support for this feature into TFS.
My opinion is valid, and so is theirs... ultimately it will be up to the TFS Devs if this is merged or not.
There is no more need in arguing, and even less need in trying to ridicule each other, tear each other down, or try to act like one is smarter than another or superior in anyway..

I have spoken my peace about this "limited" system and feel no need in re-engaging anymore.
I'm tired of beating my head against the wall.
None of it matters anyways, its the Dev's call.

@gesior
Copy link
Contributor Author

gesior commented Dec 30, 2024

@Codinablack
Sorry, I don't have time to read your answer, as you clearly did not answer my very basic question (post instruction point by point how to detect lags of rare lagging Lua script on production server).

I posted very simple example of problem that real OTS owners meet. If you can't answer it, I won't argue with you about anything related to profiling OTS code on production server.
It looks like you don't have any experience on real world server (OTS with 50+ online).

I've broken first rule of internet for you for a first time in over 10 years - because you argued about OTS Stats for over an year on multiple forums - and I won't waste time on you anymore @Codinablack

I've talked with @MillhioreBT on Discord and I learnt that it's hard to understand OTS Stats without real world experience on lagging OTSes with XX+ players online.
Of course VS2022 profiling tools are better for C++ single function/line profiling, but you can't run them on Linux server with 500+ online and that's a whole point. Production software needs production solutions, are they OTS Stats? Yes!

I give you very simple real world example.
Example:

If anyone can answer this question, I will argue for OTS Stats. It is real world 'my OTS is lagging' answer that I will answer with OTS Stats results.
@Codinablack Are you able to answer this simple question?

To be clear: I'm not an author of OTS Stats. Kondra - OTCv8 author - made them years ago - they are part of OTCv8 (CTRL+ALT+D in client to get C++ and Lua logs) - and sold to 100+ OTSes for 100-300$ each. I just fought to release them for free to everyone and it was after he quited OTSes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Increase or improvement in quality, value, or extent feature New feature or functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants