Fix incorrect "total" numbers in Security chapter (2024, 2022, ?) #3912

Merged
merged 14 commits into from
Dec 29, 2024

JannisBush
Contributor

@JannisBush JannisBush commented Nov 21, 2024

Some queries in the security chapter calculated an incorrect "total" number. For example, the total number of iframes only counted iframes that had either an allow or a sandbox attribute.

This pull request fixes all the queries in the security chapter with incorrect "total" numbers.
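
To illustrate the kind of error described above, here is a minimal Python sketch rather than the actual SQL from the chapter. The sample data and the names `iframes`, `pct`, `subset_total`, and `true_total` are hypothetical. It shows how taking the "total" from an already-filtered subset inflates the reported share.

```python
# Hypothetical sample: one dict per iframe, with its attributes (made-up data).
iframes = [
    {"allow": "fullscreen", "sandbox": None},
    {"allow": None, "sandbox": "allow-scripts"},
    {"allow": "camera", "sandbox": "allow-forms"},
    {"allow": None, "sandbox": None},
    {"allow": None, "sandbox": None},
]


def pct(part, whole):
    """Return part as a percentage of whole, rounded to one decimal."""
    return round(100 * part / whole, 1)


with_allow = sum(1 for f in iframes if f["allow"] is not None)

# Incorrect "total": only iframes that carry an allow or sandbox attribute.
subset_total = sum(
    1 for f in iframes if f["allow"] is not None or f["sandbox"] is not None
)

# Correct total: every iframe, with or without those attributes.
true_total = len(iframes)

wrong_pct = pct(with_allow, subset_total)  # share within the filtered subset
right_pct = pct(with_allow, true_total)    # share of all iframes

print(f"allow share with subset total: {wrong_pct}%")
print(f"allow share with true total:   {right_pct}%")
```

The numerator is the same in both cases; only the denominator differs, which is why the corrected percentages are lower while the underlying counts stay put.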

  • Identify all queries with incorrect total numbers (security chapter 2024)
  • Fix all queries
  • Identify where the incorrect numbers were used in the Almanac (security chapter 2024)
  • Run the updated queries and store the results in Google Sheets (<old-tab-name> (Fixed))
    • Only for the instances where the incorrect numbers were actually used for the chapter
  • Update the text in the Almanac (2024)
  • Optional: update the main queries/results/text for prior Almanac instances as well.

@JannisBush
Contributor Author

I went through all the queries and it seems like only iframe_attribute_usage.sql and meta_csp_disallowed_directives.sql were really "wrong". For a couple of others the word "total" was confusing as it referred to a subset, so I changed that as well, but we did not use the incorrect "total" in the text of the almanac.

I uploaded the new data for the two fixed queries on the Google Sheets. https://docs.google.com/spreadsheets/d/1b9IEGbfQjKCEaTBmcv_zyCyWEsq35StCa-dVOe6V1Cs/edit?gid=1587787684#gid=1587787684 and https://docs.google.com/spreadsheets/d/1b9IEGbfQjKCEaTBmcv_zyCyWEsq35StCa-dVOe6V1Cs/edit?gid=2132002234#gid=2132002234

We still have to adapt the text. Text passages to change:

  • Allow 2024
    • 21.4 million <iframe> -> 30.4 million <iframe>; probably we should also add from the desktop crawl
    • half included the allow -> 35.2%
    • only 21% of <iframe> elements had the allow attribute -> 14.4%
  • Sandbox 2024
    • Change 28.4% and 27.5% to 19.9% and 19.8%
    • Change 35.2% and 32% to 22.1% and 21.2%
  • Meta CSP 2024
    • A simple option would be to change "1.70% of pages" to "1.70% of pages that use CSP in a <meta> tag".
    • Another option would be to express the percentage relative to all pages, but then every value will be <0.01
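
The two Meta CSP options above differ only in the denominator. The following sketch uses made-up counts (`pages_total`, `pages_with_meta_csp`, and `pages_with_disallowed_directive` are hypothetical, not crawl results) to show why the second option yields very small values.

```python
def share(count, denominator):
    """Return count as a percentage of denominator."""
    if denominator == 0:
        raise ValueError("denominator must be non-zero")
    return 100 * count / denominator


# Hypothetical counts, for illustration only (not taken from the crawl).
pages_total = 1_000_000
pages_with_meta_csp = 5_000
pages_with_disallowed_directive = 85

# Option 1: percentage of pages that use CSP in a <meta> tag.
of_meta_csp_pages = share(pages_with_disallowed_directive, pages_with_meta_csp)

# Option 2: percentage of all pages.
of_all_pages = share(pages_with_disallowed_directive, pages_total)

print(f"of pages with meta CSP: {of_meta_csp_pages:.2f}%")
print(f"of all pages:           {of_all_pages:.4f}%")
```

Because only a small fraction of pages set CSP through a <meta> tag, dividing by all pages shrinks every value by that same fraction, so the first option keeps the numbers readable as long as the text names the denominator.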

About the pre-2024 versions:

  • The Meta CSP issue only exists for 2024.
  • The Allow/Sandbox also exists for 2022, 2021, and 2020. 2019 does not contain the incorrect query.
  • I updated the queries with a comment only for now.
  • We could use the newest query to also get the data for 2021 and 2020 (it already contains the data for 2022) and only update the text.

@GJFR
Member

GJFR commented Dec 6, 2024

Thanks for the detailed overview and changes @JannisBush. I finally got around to checking it out and updating the text; no figures needed to be changed.

Since these are relatively small changes without any impact on drawn conclusions, I think it doesn't need any disclaimer? (@tunetheweb, do you agree?)

@JannisBush, I think it's a good idea to update the 2022, 2021 and 2020 articles with a disclaimer. If you run the newest query for 2021 and 2020 as well, I will update the articles with the new results.

@JannisBush
Contributor Author

JannisBush commented Dec 6, 2024

I had to change the query to the new crawl.pages dataset to run it for 2020 and 2021.
The results are added here: https://docs.google.com/spreadsheets/d/1b9IEGbfQjKCEaTBmcv_zyCyWEsq35StCa-dVOe6V1Cs/edit?gid=2132002234#gid=2132002234

The numbers for 2022 (the only year which I ran both on crawl.pages and all.pages) are not exactly the same due to the use of the other table but the percentages do not change.

  • For 2022, the numbers change from "18.9% of 11.5 million frames in mobile" to 12.6% of 17.4 million frames in mobile, and from "desktop websites that embed an iframe, 35.2% also include the sandbox attribute" to 21.2%.
  • For 2021, the numbers change from "28.4% of 10.8 million mobile" to 18.3% of 16.8 million, and from 32.6% to 20.9% and 19.7%.
  • For 2020, the numbers change from "19.5% of the 8 million frames that were found on the desktop pages. On mobile pages, 16.4% of the 9.2 million frames contained the allow attribute." to 11.8% of 13.2 million and 10.8% of 13.8 million, and from "via the sandbox attribute: 30.29% of the iframes on desktop pages have a sandbox attribute while on mobile pages this is 33.16%." to 18.3% and 21.9%.
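
As a rough sanity check on the allow figures in the list above, the sketch below multiplies each percentage by its total to get the implied number of frames with the attribute. The pairing of old and new values (including reading the 2020 pairs as desktop then mobile) is an assumption about how the comment should be read, and `rows` and `implied_millions` are names introduced here for illustration.

```python
# (label, old_pct, old_total_millions, new_pct, new_total_millions)
# Values as quoted in the comment above; the pairing is an assumed reading.
rows = [
    ("2022", 18.9, 11.5, 12.6, 17.4),
    ("2021", 28.4, 10.8, 18.3, 16.8),
    ("2020 desktop", 19.5, 8.0, 11.8, 13.2),
    ("2020 mobile", 16.4, 9.2, 10.8, 13.8),
]


def implied_millions(pct, total_millions):
    """Number of frames (in millions) implied by a percentage of a total."""
    return pct / 100 * total_millions


implied = {
    label: (
        implied_millions(old_pct, old_total),
        implied_millions(new_pct, new_total),
    )
    for label, old_pct, old_total, new_pct, new_total in rows
}

for label, (old, new) in implied.items():
    print(f"{label}: old {old:.2f}M, new {new:.2f}M, diff {abs(old - new):.2f}M")
```

Under this reading, each old/new pair agrees to within a few hundredths of a million frames, which is consistent with only the denominator having changed. The sandbox figures are left out because the thread gives no totals for them.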

@GJFR
Member

GJFR commented Dec 6, 2024

Great, thanks a lot! Just updated the older articles, with a disclaimer.

@GJFR
Member

GJFR commented Dec 6, 2024

CI scripts identify issues with translations, and indeed, all updates to previous articles will have to be translated as well. Unfortunately, I'm no polyglot and I assume the help of GenAI will not suffice? So not sure how to handle this.

@tunetheweb
Member

> CI scripts identify issues with translations, and indeed, all updates to previous articles will have to be translated as well. Unfortunately, I'm no polyglot and I assume the help of GenAI will not suffice? So not sure how to handle this.

I think the numbers in the paragraph can easily enough be updated in the other languages in this PR? For the note, running it through translate.google.com will be good enough IMHO.

@GJFR
Member

GJFR commented Dec 10, 2024

> > CI scripts identify issues with translations, and indeed, all updates to previous articles will have to be translated as well. Unfortunately, I'm no polyglot and I assume the help of GenAI will not suffice? So not sure how to handle this.
>
> I think the numbers in the paragraph can easily enough be updated in the other languages in this PR? For the note, running it through translate.google.com will be good enough IMHO.

Alright, will be able to update translations this week.

@GJFR
Member

GJFR commented Dec 16, 2024

All values have been updated for the original chapter and its translations. Ready for merge.

@GJFR GJFR marked this pull request as ready for review December 16, 2024 15:51
@tunetheweb tunetheweb merged commit 4504d2d into HTTPArchive:main Dec 29, 2024
3 checks passed

3 participants