Merge branch 'staging' of https://github.com/dataesr/scanr-ui into staging
ahonestla committed Jan 13, 2025
2 parents 103e766 + 13783dd commit 384f8a1
Showing 10 changed files with 127 additions and 17 deletions.
24 changes: 22 additions & 2 deletions doc_network/bso.md
@@ -101,14 +101,34 @@ In practice, a PID is also stored (the wikidata for topics, for example) to disa

## 2.3 VOSviewer implementation

We use the open-source VOSviewer Online tool for network visualization [https://github.com/neesjanvaneck/VOSviewer-Online](https://github.com/neesjanvaneck/VOSviewer-Online). It is based on VOSviewer, a tool that is very popular for network analysis in bibliometric studies [@DBLP:journals/corr/abs-1006-1032].

In graph theory, a community is a set of nodes that are densely connected to each other and only sparsely connected to nodes outside the set. Identifying communities helps reveal the underlying structure and patterns of the graph, and supports the analysis of the relationships and interactions between the entities that make it up.
To identify communities, we use the Louvain method. This algorithm optimizes modularity, a measure of how strongly a graph divides into communities. More precisely, Louvain seeks to maximize modularity by iteratively moving the nodes of the graph between communities.
At each stage, it merges neighboring communities if this improves the overall modularity of the graph. The process stops when no further move increases modularity.
Clusters are computed with the open-source JavaScript library graphology-communities-louvain.
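
To make the optimized quantity concrete, here is a minimal sketch of how modularity can be computed for a given partition. It is an illustration only: scanR relies on the JavaScript library graphology-communities-louvain, whereas this Python snippet, its function name `modularity` and the toy graph are assumptions made for the example.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q of an undirected weighted graph for a given partition.

    edges: iterable of (u, v, weight) tuples with u != v (hypothetical format)
    community: dict mapping each node to its community label
    """
    total_weight = 0.0             # m: sum of all edge weights
    degree = defaultdict(float)    # weighted degree of each node
    internal = defaultdict(float)  # weight of the edges inside each community
    for u, v, w in edges:
        total_weight += w
        degree[u] += w
        degree[v] += w
        if community[u] == community[v]:
            internal[community[u]] += w

    # Sum of node degrees per community.
    community_degree = defaultdict(float)
    for node, k in degree.items():
        community_degree[community[node]] += k

    # Q = sum over communities of (internal / m) - (degree / 2m)^2
    q = 0.0
    for label, deg in community_degree.items():
        q += internal[label] / total_weight - (deg / (2 * total_weight)) ** 2
    return q


# Toy graph: two triangles joined by a single bridge edge (c, d).
EDGES = [
    ("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0),
    ("d", "e", 1.0), ("e", "f", 1.0), ("d", "f", 1.0),
    ("c", "d", 1.0),
]
TWO_TRIANGLES = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
ONE_BLOCK = {node: 0 for node in TWO_TRIANGLES}

print(modularity(EDGES, TWO_TRIANGLES))
print(modularity(EDGES, ONE_BLOCK))
```

On this toy graph, the partition into two triangles scores higher than the single-block partition, which is the kind of improvement Louvain searches for when it moves nodes and merges communities.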

## 2.4 LLM trick

# 3. Making insightful maps

This scanR feature is designed to help users better understand the underlying structures through thematic or co-publication maps. To help the user, each automatically identified community needs to be characterized, starting with a label.

## 3.1 LLM trick

To name the communities we use generative AI from Mistral AI ('open-mistral-nemo' model).
The names are obtained from the main themes of the publications collected for each community.
For the time being, we limit ourselves to the 2000 most relevant publications (in relation to the user's search) for each community. The following prompt is used:

> You have been tasked with naming distinct fields of study for several communities of research publications.
> Below are lists of topics and their weights representing each community.
> Your goal is to provide a unique and descriptive name for each field of study that best encapsulates the essence of the topics within that community.
> Each should be unique and as short as possible.
> If the list of topic is empty, output a empty string.
> Output as JSON object with the list number and the single unique generated name.

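
As a rough sketch of the surrounding plumbing, the snippet below shows one way the topic lists could be rendered before being appended to the prompt, and how the JSON answer could be mapped back to the communities. The helper names, the `list N = [...]` layout and the assumption that the reply is a JSON object keyed by list number are all hypothetical; the call to the Mistral AI model itself is deliberately left out.

```python
import json


def format_communities(communities):
    """Render each community's weighted topics as one numbered list line.

    communities: list of lists of (topic, weight) pairs (assumed shape).
    """
    lines = []
    for index, topics in enumerate(communities):
        rendered = ", ".join(f"{name} ({weight})" for name, weight in topics)
        lines.append(f"list {index} = [{rendered}]")
    return "\n".join(lines)


def parse_names(raw_reply, n_communities):
    """Map list numbers to names, using '' when a list number is missing.

    Assumes the model replied with a JSON object such as {"0": "Some name"}.
    """
    data = json.loads(raw_reply)
    return [str(data.get(str(i), "")) for i in range(n_communities)]


# Hypothetical input: two communities, the second one without topics.
COMMUNITIES = [[("graph theory", 12), ("community detection", 7)], []]
# Hypothetical model reply, hard-coded here instead of calling the API.
REPLY = '{"0": "Network science", "1": ""}'

print(format_communities(COMMUNITIES))
print(parse_names(REPLY, len(COMMUNITIES)))
```

Falling back to an empty string for a missing list number mirrors the instruction given in the prompt for empty topic lists, so an incomplete reply does not break the labeling of the other communities.
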
## 3.1 Citation / hot topics

A citation score is estimated for each cluster. This score relates the number of recent citations (over the last two years) to the total number of publications in the cluster. It is intended to help detect hotspots in the communities identified in the corpus.
We use citation data from OpenAlex, which is currently one of the best open data sources. However, OpenAlex citation metadata remains incomplete and must therefore be interpreted with caution [@alperin2024analysissuitabilityopenalexbibliometric].
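
The ratio described above can be sketched as follows. This is an illustration under stated assumptions, not the scanR implementation: the `citations_by_year` field, the function name and the reading of "last two years" as the current and previous calendar years are all hypothetical.

```python
def citation_score(publications, current_year, window=2):
    """Recent citations per publication for one cluster.

    publications: list of dicts, each optionally holding a
    'citations_by_year' mapping {year: count} (assumed data shape).
    """
    if not publications:
        return 0.0
    # Assumption: the window covers the current year and the one before it.
    recent_years = range(current_year - window + 1, current_year + 1)
    recent_citations = sum(
        publication.get("citations_by_year", {}).get(year, 0)
        for publication in publications
        for year in recent_years
    )
    return recent_citations / len(publications)


# Hypothetical cluster of three publications, one without citation data.
CLUSTER = [
    {"citations_by_year": {2020: 10, 2023: 4, 2024: 6}},
    {"citations_by_year": {2024: 2}},
    {},
]

print(citation_score(CLUSTER, current_year=2024))
```

Publications without citation data count in the denominator but add nothing to the numerator, so incomplete metadata lowers the score, which is one reason to read it with caution.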

## 3.2 Custom perimeter
Binary file modified doc_network/mapping_at_scale.pdf
Binary file not shown.
54 changes: 48 additions & 6 deletions doc_network/mapping_at_scale.tex
@@ -340,19 +340,61 @@ \subsection{2.3 VOSviewer
the VOSviewer tool which is very popular for network analysis in
bibliometric studies (Waltman, Eck, and Noyons 2010).

\hypertarget{llm-trick}{%
\subsection{2.4 LLM trick}\label{llm-trick}}
In graph theory, a community is a set of nodes that are densely
connected to each other and only sparsely connected to nodes outside the
set. Identifying communities helps reveal the underlying structure and
patterns of the graph, and supports the analysis of the relationships
and interactions between the entities that make it up. To identify
communities, we use the Louvain method. This algorithm optimizes
modularity, a measure of how strongly a graph divides into communities.
More precisely, Louvain seeks to maximize modularity by iteratively
moving the nodes of the graph between communities. At each stage, it
merges neighboring communities if this improves the overall modularity
of the graph. The process stops when no further move increases
modularity. Clusters are computed with the open-source JavaScript
library graphology-communities-louvain.

\hypertarget{making-insightful-maps}{%
\section{3. Making insightful maps}\label{making-insightful-maps}}

This scanR feature is designed to help users better understand the
underlying structures through thematic or co-publication maps. To help
the user, each automatically identified community needs to be
characterized, starting with a label.

\hypertarget{llm-trick}{%
\subsection{3.1 LLM trick}\label{llm-trick}}

To name the communities we use generative AI from Mistral AI
(`open-mistral-nemo' model). The names are obtained from the main themes
of the publications collected for each community. For the time being, we
limit ourselves to the 2000 most relevant publications (in relation to
the user's search) for each community. The following prompt is used:

\begin{quote}
You have been tasked with naming distinct fields of study for several
communities of research publications. Below are lists of topics and
their weights representing each community. Your goal is to provide a
unique and descriptive name for each field of study that best
encapsulates the essence of the topics within that community. Each
should be unique and as short as possible. If the list of topic is
empty, output a empty string. Output as JSON object with the list number
and the single unique generated name.
\end{quote}
\hypertarget{citation-hot-topics}{%
\subsection{3.1 Citation / hot topics}\label{citation-hot-topics}}
We use citation data from OpenAlex, which is currently one of the best
open data sources. However, OpenAlex citation metadata remains
incomplete and must therefore be interpreted with caution
(Alperin et al. 2024).
A citation score is estimated for each cluster. This score relates the
number of recent citations (over the last two years) to the total number
of publications in the cluster. It is intended to help detect hotspots
in the communities identified in the corpus. We use citation data from
OpenAlex, which is currently one of the best open data sources. However,
OpenAlex citation metadata remains incomplete and must therefore be
interpreted with caution (Alperin et al. 2024).
\hypertarget{custom-perimeter}{%
\subsection{3.2 Custom perimeter}\label{custom-perimeter}}
Binary file modified doc_network/out.docx
Binary file not shown.
2 changes: 1 addition & 1 deletion doc_network/out.enriched.json

Large diffs are not rendered by default.

Binary file modified doc_network/out.epub
Binary file not shown.
10 changes: 8 additions & 2 deletions doc_network/out.html
@@ -560,10 +560,16 @@ <h2 id="elasticsearch-implementation">2.2 Elasticsearch implementation</h2>
<span id="cb1-32"><a href="#cb1-32" aria-hidden="true"></a> <span class="fu">}</span><span class="er">,</span></span></code></pre></div>
<h2 id="vosviewer-implementation">2.3 VOSviewer implementation</h2>
<p>We use the open source VOSviewer online tool for network visualization <a href="https://github.com/neesjanvaneck/VOSviewer-Online">https://github.com/neesjanvaneck/VOSviewer-Online</a>. It is based on the VOSviewer tool which is very popular for network analysis in bibliometric studies <span class="citation" data-cites="DBLP:journals/corr/abs-1006-1032">(Waltman, Eck, and Noyons 2010)</span>.</p>
<h2 id="llm-trick">2.4 LLM trick</h2>
<p>In graph theory, a community is a set of nodes that are densely connected to each other and only sparsely connected to nodes outside the set. Identifying communities helps reveal the underlying structure and patterns of the graph, and supports the analysis of the relationships and interactions between the entities that make it up. To identify communities, we use the Louvain method. This algorithm optimizes modularity, a measure of how strongly a graph divides into communities. More precisely, Louvain seeks to maximize modularity by iteratively moving the nodes of the graph between communities. At each stage, it merges neighboring communities if this improves the overall modularity of the graph. The process stops when no further move increases modularity. Clusters are computed with the open-source JavaScript library graphology-communities-louvain.</p>
<h1 id="making-insightful-maps">3. Making insightful maps</h1>
<p>This scanR feature is designed to help users better understand the underlying structures through thematic or co-publication maps. To help the user, each automatically identified community needs to be characterized, starting with a label.</p>
<h2 id="llm-trick">3.1 LLM trick</h2>
<p>To name the communities we use generative AI from Mistral AI (‘open-mistral-nemo’ model). The names are obtained from the main themes of the publications collected for each community. For the time being, we limit ourselves to the 2000 most relevant publications (in relation to the user’s search) for each community. The following prompt is used:</p>
<blockquote>
<p>You have been tasked with naming distinct fields of study for several communities of research publications. Below are lists of topics and their weights representing each community. Your goal is to provide a unique and descriptive name for each field of study that best encapsulates the essence of the topics within that community. Each should be unique and as short as possible. If the list of topic is empty, output a empty string. Output as JSON object with the list number and the single unique generated name.</p>
</blockquote>
<h2 id="citation-hot-topics">3.1 Citation / hot topics</h2>
<p>We use citation data from OpenAlex, which is currently one of the best open data sources. However, OpenAlex citation metadata remains incomplete and must therefore be interpreted with caution <span class="citation" data-cites="alperin2024analysissuitabilityopenalexbibliometric">(Alperin et al. 2024)</span>.</p>
<p>A citation score is estimated for each cluster. This score relates the number of recent citations (over the last two years) to the total number of publications in the cluster. It is intended to help detect hotspots in the communities identified in the corpus. We use citation data from OpenAlex, which is currently one of the best open data sources. However, OpenAlex citation metadata remains incomplete and must therefore be interpreted with caution <span class="citation" data-cites="alperin2024analysissuitabilityopenalexbibliometric">(Alperin et al. 2024)</span>.</p>
<h2 id="custom-perimeter">3.2 Custom perimeter</h2>
<p>scanR offers this mapping tool for the entire indexed corpus, but it is also possible to adapt the tool to a restricted perimeter, at the user’s discretion. For example, an institution or laboratory can define its own corpus (based on a list of publications) and a mapping tool dedicated to this perimeter is automatically created. Technically, elasticsearch queries are the same, with just an additional filter to query only the publications within the perimeter. The tool can be embedded in any website using an iframe. It’s the same principle as the local barometer. This approach eliminates the need for automatic alignment of affiliations, which remains a highly complex task. Automation is possible to a certain extent <span class="citation" data-cites="lhote_using_2021">(L’Hôte and Jeangirard 2021)</span>, but human curation remains necessary in the majority of cases <span class="citation" data-cites="jeangirard:hal-04598201">(Jeangirard, Bracco, and L’Hôte 2024)</span>. In this way, users retain control over the definition of their perimeter, and can, if they wish, have several distinct perimeters.</p>
<h1 id="code-availibility">4. Code availability</h1>
54 changes: 48 additions & 6 deletions doc_network/out.latex
@@ -340,19 +340,61 @@ We use the open source VOSviewer online tool for network visualization
the VOSviewer tool which is very popular for network analysis in
bibliometric studies (Waltman, Eck, and Noyons 2010).

\hypertarget{llm-trick}{%
\subsection{2.4 LLM trick}\label{llm-trick}}
In graph theory, a community is a set of nodes that are densely
connected to each other and only sparsely connected to nodes outside the
set. Identifying communities helps reveal the underlying structure and
patterns of the graph, and supports the analysis of the relationships
and interactions between the entities that make it up. To identify
communities, we use the Louvain method. This algorithm optimizes
modularity, a measure of how strongly a graph divides into communities.
More precisely, Louvain seeks to maximize modularity by iteratively
moving the nodes of the graph between communities. At each stage, it
merges neighboring communities if this improves the overall modularity
of the graph. The process stops when no further move increases
modularity. Clusters are computed with the open-source JavaScript
library graphology-communities-louvain.

\hypertarget{making-insightful-maps}{%
\section{3. Making insightful maps}\label{making-insightful-maps}}

This scanR feature is designed to help users better understand the
underlying structures through thematic or co-publication maps. To help
the user, each automatically identified community needs to be
characterized, starting with a label.

\hypertarget{llm-trick}{%
\subsection{3.1 LLM trick}\label{llm-trick}}

To name the communities we use generative AI from Mistral AI
(`open-mistral-nemo' model). The names are obtained from the main themes
of the publications collected for each community. For the time being, we
limit ourselves to the 2000 most relevant publications (in relation to
the user's search) for each community. The following prompt is used:

\begin{quote}
You have been tasked with naming distinct fields of study for several
communities of research publications. Below are lists of topics and
their weights representing each community. Your goal is to provide a
unique and descriptive name for each field of study that best
encapsulates the essence of the topics within that community. Each
should be unique and as short as possible. If the list of topic is
empty, output a empty string. Output as JSON object with the list number
and the single unique generated name.
\end{quote}

\hypertarget{citation-hot-topics}{%
\subsection{3.1 Citation / hot topics}\label{citation-hot-topics}}

We use citation data from OpenAlex, which is currently one of the best
open data sources. However, OpenAlex citation metadata remains
incomplete and must therefore be interpreted with caution
(Alperin et al. 2024).
A citation score is estimated for each cluster. This score relates the
number of recent citations (over the last two years) to the total number
of publications in the cluster. It is intended to help detect hotspots
in the communities identified in the corpus. We use citation data from
OpenAlex, which is currently one of the best open data sources. However,
OpenAlex citation metadata remains incomplete and must therefore be
interpreted with caution (Alperin et al. 2024).

\hypertarget{custom-perimeter}{%
\subsection{3.2 Custom perimeter}\label{custom-perimeter}}
Binary file modified doc_network/out.odt
Binary file not shown.
Binary file modified doc_network/out.pdf
Binary file not shown.
