Merge branch 'staging' of https://github.com/dataesr/scanr-ui into staging
dataesr · Jan 13, 2025 · 384f8a1
2 parents 103e766 + 13783dd

Showing 10 changed files with 127 additions and 17 deletions.
diff --git a/doc_network/bso.md b/doc_network/bso.md
@@ -101,14 +101,34 @@ In practice, a PID is also stored (the wikidata for topics, for example) to disa
 
 ## 2.3 VOSviewer implementation
 
-We use the open source VOSviewer online tool for network visualization [https://github.com/neesjanvaneck/VOSviewer-Online](https://github.com/neesjanvaneck/VOSviewer-Online). It is based on the VOSviewer tool which is very popular for network analysis in bibliometric studies [@DBLP:journals/corr/abs-1006-1032]. 
+We use the open source VOSviewer online tool for network visualization [https://github.com/neesjanvaneck/VOSviewer-Online](https://github.com/neesjanvaneck/VOSviewer-Online). It is based on the VOSviewer tool which is very popular for network analysis in bibliometric studies [@DBLP:journals/corr/abs-1006-1032].
+
+In graph theory, a community corresponds to a set of nodes in a graph that are strongly interconnected with each other, while being less connected with nodes outside this community. Communities can be identified in order to understand the underlying structure and patterns of the graph, as well as to analyze the relationships and interactions between the entities that make it up.
+To identify communities, we use the Louvain method. This algorithm works by optimizing a modularity measure that evaluates the strength of communities in a graph. More precisely, Louvain seeks to maximize modularity by iteratively moving the nodes of a graph into different communities.
+At each stage, it merges neighboring communities if this leads to an improvement in the overall modularity of the graph. This iterative process continues until no further moves can increase modularity.
+Clusters are computed with the Louvain implementation from the open source JavaScript library graphology-communities-louvain.
 
-## 2.4 LLM trick
 
 # 3. Making insightful maps
 
+This scanR feature is designed to help users gain a better understanding of the underlying structures via thematic or co-publication maps. To help the user, each automatically identified community needs to be characterized, starting with a label.
+
+## 3.1 LLM trick
+
+To name the communities we use generative AI from Mistral AI ('open-mistral-nemo' model).
+The names are obtained from the main themes of the publications collected for each community.
+For the time being, we limit ourselves to the 2000 most relevant publications (in relation to the user's search) for each community. The following prompt is used:
+
+> You have been tasked with naming distinct fields of study for several communities of research publications.
+> Below are lists of topics and their weights representing each community.
+> Your goal is to provide a unique and descriptive name for each field of study that best encapsulates the essence of the topics within that community.
+> Each should be unique and as short as possible.
+> If the list of topic is empty, output a empty string.
+> Output as JSON object with the list number and the single unique generated name.
+
 ## 3.1 Citation / hot topics
 
+A citation score is estimated for each cluster. It relates the number of recent citations (over the last two years) to the total number of publications in the cluster, and is intended to help detect hotspots among the communities identified in the corpus.
 We use citation data from OpenAlex, which is currently one of the best open data sources. However, citation metadata from OpenAlex remains incomplete and must therefore be interpreted with caution [@alperin2024analysissuitabilityopenalexbibliometric].
 
 ## 3.2 Custom perimeter
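
Note on the Louvain paragraph added in this hunk: the modularity score that the algorithm maximizes can be illustrated with a minimal Python sketch. This is not the graphology-communities-louvain implementation (which is JavaScript). The function name `modularity`, the edge-list format and the toy graph are assumptions made for illustration, and the graph is treated as undirected and unweighted.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community_of: dict mapping each node to its community label.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)  # edges with both ends in the community
    degree = defaultdict(int)    # sum of node degrees per community
    for u, v in edges:
        degree[community_of[u]] += 1
        degree[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    # Q = sum over communities of (internal / m - (degree / 2m)^2)
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy graph (hypothetical): two triangles joined by a single bridge edge.
EDGES = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
TWO_CLUSTERS = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
ONE_CLUSTER = {node: 0 for node in TWO_CLUSTERS}

q_split = modularity(EDGES, TWO_CLUSTERS)
q_single = modularity(EDGES, ONE_CLUSTER)
```

On this toy graph, splitting the nodes into the two triangles scores higher than putting every node in a single community, which is the kind of improvement Louvain looks for at each move.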

diff --git a/doc_network/mapping_at_scale.pdf b/doc_network/mapping_at_scale.pdf
diff --git a/doc_network/mapping_at_scale.tex b/doc_network/mapping_at_scale.tex
@@ -340,19 +340,61 @@ \subsection{2.3 VOSviewer
 the VOSviewer tool which is very popular for network analysis in
 bibliometric studies (Waltman, Eck, and Noyons 2010).
 
-\hypertarget{llm-trick}{%
-\subsection{2.4 LLM trick}\label{llm-trick}}
+In graph theory, a community corresponds to a set of nodes in a graph
+that are strongly interconnected with each other, while being less
+connected with nodes outside this community. Communities can be
+identified in order to understand the underlying structure and patterns
+of the graph, as well as to analyze the relationships and interactions
+between the entities that make it up. To identify communities, we use
+the Louvain method. This algorithm works by optimizing a modularity
+measure that evaluates the strength of communities in a graph. More
+precisely, Louvain seeks to maximize modularity by iteratively moving
+the nodes of a graph into different communities. At each stage, it
+merges neighboring communities if this leads to an improvement in the
+overall modularity of the graph. This iterative process continues until
+no further moves can increase modularity. Clusters are computed with the
+Louvain implementation from the open source JavaScript library
+graphology-communities-louvain.
 
 \hypertarget{making-insightful-maps}{%
 \section{3. Making insightful maps}\label{making-insightful-maps}}
 
+This scanR feature is designed to help users gain a better understanding
+of the underlying structures via thematic or co-publication maps. To
+help the user, each automatically identified community needs to be
+characterized, starting with a label.
+
+\hypertarget{llm-trick}{%
+\subsection{3.1 LLM trick}\label{llm-trick}}
+
+To name the communities we use generative AI from Mistral AI
+(`open-mistral-nemo' model). The names are obtained from the main themes
+of the publications collected for each community. For the time being, we
+limit ourselves to the 2000 most relevant publications (in relation to
+the user's search) for each community. The following prompt is used:
+
+\begin{quote}
+You have been tasked with naming distinct fields of study for several
+communities of research publications. Below are lists of topics and
+their weights representing each community. Your goal is to provide a
+unique and descriptive name for each field of study that best
+encapsulates the essence of the topics within that community. Each
+should be unique and as short as possible. If the list of topic is
+empty, output a empty string. Output as JSON object with the list number
+and the single unique generated name.
+\end{quote}
+
 \hypertarget{citation-hot-topics}{%
 \subsection{3.1 Citation / hot topics}\label{citation-hot-topics}}
 
-We use citations data from OpenAlex, which is as of today one of the
-best open source datasource. However, citations metadata from OpenAlex
-remains incomplete and must therefore be interpreted with caution
-(Alperin et al. 2024).
+A citation score is estimated for each cluster. It relates the number
+of recent citations (over the last two years) to the total number of
+publications in the cluster, and is intended to help detect hotspots
+among the communities identified in the corpus. We use citation data
+from OpenAlex, which is currently one of the best open data sources.
+However, citation metadata from OpenAlex remains incomplete and must
+therefore be interpreted with caution (Alperin et al. 2024).
 
 \hypertarget{custom-perimeter}{%
 \subsection{3.2 Custom perimeter}\label{custom-perimeter}}
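
Note on the citation score paragraph added in this hunk: the text describes a score relating recent citations to the number of publications in a cluster, without giving the exact formula. The sketch below assumes a plain ratio and a window covering the current year and the one before it. The function name `citation_score` and the `cited_by_year` field are hypothetical and do not come from the scanR code.

```python
def citation_score(publications, current_year, window=2):
    """Recent citations per publication for one cluster (assumed plain ratio).

    publications: list of dicts with a hypothetical 'cited_by_year' mapping
    {year: citation count}; publications without it count as uncited.
    """
    if not publications:
        return 0.0
    recent_years = range(current_year - window + 1, current_year + 1)
    recent = sum(
        pub.get("cited_by_year", {}).get(year, 0)
        for pub in publications
        for year in recent_years
    )
    return recent / len(publications)


# Hypothetical cluster of three publications.
CLUSTER = [
    {"id": "p1", "cited_by_year": {2022: 4, 2023: 3, 2024: 5}},
    {"id": "p2", "cited_by_year": {2024: 1}},
    {"id": "p3"},
]
score = citation_score(CLUSTER, current_year=2024)
```

Dividing by the number of publications keeps large clusters from dominating simply because they contain more papers. Since OpenAlex citation metadata is incomplete, such a score is best read as a relative indicator across clusters.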

diff --git a/doc_network/out.docx b/doc_network/out.docx
diff --git a/doc_network/out.enriched.json b/doc_network/out.enriched.json
diff --git a/doc_network/out.epub b/doc_network/out.epub
diff --git a/doc_network/out.html b/doc_network/out.html
@@ -560,10 +560,16 @@ <h2 id="elasticsearch-implementation">2.2 Elasticsearch implementation</h2>
 <span id="cb1-32"><a href="#cb1-32" aria-hidden="true"></a>                <span class="fu">}</span><span class="er">,</span></span></code></pre></div>
 <h2 id="vosviewer-implementation">2.3 VOSviewer implementation</h2>
 <p>We use the open source VOSviewer online tool for network visualization <a href="https://github.com/neesjanvaneck/VOSviewer-Online">https://github.com/neesjanvaneck/VOSviewer-Online</a>. It is based on the VOSviewer tool which is very popular for network analysis in bibliometric studies <span class="citation" data-cites="DBLP:journals/corr/abs-1006-1032">(Waltman, Eck, and Noyons 2010)</span>.</p>
-<h2 id="llm-trick">2.4 LLM trick</h2>
+<p>In graph theory, a community corresponds to a set of nodes in a graph that are strongly interconnected with each other, while being less connected with nodes outside this community. Communities can be identified in order to understand the underlying structure and patterns of the graph, as well as to analyze the relationships and interactions between the entities that make it up. To identify communities, we use the Louvain method. This algorithm works by optimizing a modularity measure that evaluates the strength of communities in a graph. More precisely, Louvain seeks to maximize modularity by iteratively moving the nodes of a graph into different communities. At each stage, it merges neighboring communities if this leads to an improvement in the overall modularity of the graph. This iterative process continues until no further moves can increase modularity. Clusters are computed with the Louvain implementation from the open source JavaScript library graphology-communities-louvain.</p>
 <h1 id="making-insightful-maps">3. Making insightful maps</h1>
+<p>This scanR feature is designed to help users gain a better understanding of the underlying structures via thematic or co-publication maps. To help the user, each automatically identified community needs to be characterized, starting with a label.</p>
+<h2 id="llm-trick">3.1 LLM trick</h2>
+<p>To name the communities we use generative AI from Mistral AI (‘open-mistral-nemo’ model). The names are obtained from the main themes of the publications collected for each community. For the time being, we limit ourselves to the 2000 most relevant publications (in relation to the user’s search) for each community. The following prompt is used:</p>
+<blockquote>
+<p>You have been tasked with naming distinct fields of study for several communities of research publications. Below are lists of topics and their weights representing each community. Your goal is to provide a unique and descriptive name for each field of study that best encapsulates the essence of the topics within that community. Each should be unique and as short as possible. If the list of topic is empty, output a empty string. Output as JSON object with the list number and the single unique generated name.</p>
+</blockquote>
 <h2 id="citation-hot-topics">3.1 Citation / hot topics</h2>
-<p>We use citations data from OpenAlex, which is as of today one of the best open source datasource. However, citations metadata from OpenAlex remains incomplete and must therefore be interpreted with caution <span class="citation" data-cites="alperin2024analysissuitabilityopenalexbibliometric">(Alperin et al. 2024)</span>.</p>
+<p>A citation score is estimated for each cluster. It relates the number of recent citations (over the last two years) to the total number of publications in the cluster, and is intended to help detect hotspots among the communities identified in the corpus. We use citation data from OpenAlex, which is currently one of the best open data sources. However, citation metadata from OpenAlex remains incomplete and must therefore be interpreted with caution <span class="citation" data-cites="alperin2024analysissuitabilityopenalexbibliometric">(Alperin et al. 2024)</span>.</p>
 <h2 id="custom-perimeter">3.2 Custom perimeter</h2>
 <p>scanR offers this mapping tool for the entire indexed corpus, but it is also possible to adapt the tool to a restricted perimeter, at the user’s discretion. For example, an institution or laboratory can define its own corpus (based on a list of publications) and a mapping tool dedicated to this perimeter is automatically created. Technically, elasticsearch queries are the same, with just an additional filter to query only the publications within the perimeter. The tool can be embedded in any website using an iframe. It’s the same principle as the local barometer. This approach eliminates the need for automatic alignment of affiliations, which remains a highly complex task. Automation is possible to a certain extent <span class="citation" data-cites="lhote_using_2021">(L’Hôte and Jeangirard 2021)</span>, but human curation remains necessary in the majority of cases <span class="citation" data-cites="jeangirard:hal-04598201">(Jeangirard, Bracco, and L’Hôte 2024)</span>. In this way, users retain control over the definition of their perimeter, and can, if they wish, have several distinct perimeters.</p>
 <h1 id="code-availibility">4. Code availibility</h1>

diff --git a/doc_network/out.latex b/doc_network/out.latex
@@ -340,19 +340,61 @@ We use the open source VOSviewer online tool for network visualization
 the VOSviewer tool which is very popular for network analysis in
 bibliometric studies (Waltman, Eck, and Noyons 2010).
 
-\hypertarget{llm-trick}{%
-\subsection{2.4 LLM trick}\label{llm-trick}}
+In graph theory, a community corresponds to a set of nodes in a graph
+that are strongly interconnected with each other, while being less
+connected with nodes outside this community. Communities can be
+identified in order to understand the underlying structure and patterns
+of the graph, as well as to analyze the relationships and interactions
+between the entities that make it up. To identify communities, we use
+the Louvain method. This algorithm works by optimizing a modularity
+measure that evaluates the strength of communities in a graph. More
+precisely, Louvain seeks to maximize modularity by iteratively moving
+the nodes of a graph into different communities. At each stage, it
+merges neighboring communities if this leads to an improvement in the
+overall modularity of the graph. This iterative process continues until
+no further moves can increase modularity. Clusters are computed with the
+Louvain implementation from the open source JavaScript library
+graphology-communities-louvain.
 
 \hypertarget{making-insightful-maps}{%
 \section{3. Making insightful maps}\label{making-insightful-maps}}
 
+This scanR feature is designed to help users gain a better understanding
+of the underlying structures via thematic or co-publication maps. To
+help the user, each automatically identified community needs to be
+characterized, starting with a label.
+
+\hypertarget{llm-trick}{%
+\subsection{3.1 LLM trick}\label{llm-trick}}
+
+To name the communities we use generative AI from Mistral AI
+(`open-mistral-nemo' model). The names are obtained from the main themes
+of the publications collected for each community. For the time being, we
+limit ourselves to the 2000 most relevant publications (in relation to
+the user's search) for each community. The following prompt is used:
+
+\begin{quote}
+You have been tasked with naming distinct fields of study for several
+communities of research publications. Below are lists of topics and
+their weights representing each community. Your goal is to provide a
+unique and descriptive name for each field of study that best
+encapsulates the essence of the topics within that community. Each
+should be unique and as short as possible. If the list of topic is
+empty, output a empty string. Output as JSON object with the list number
+and the single unique generated name.
+\end{quote}
+
 \hypertarget{citation-hot-topics}{%
 \subsection{3.1 Citation / hot topics}\label{citation-hot-topics}}
 
-We use citations data from OpenAlex, which is as of today one of the
-best open source datasource. However, citations metadata from OpenAlex
-remains incomplete and must therefore be interpreted with caution
-(Alperin et al. 2024).
+A citation score is estimated for each cluster. It relates the number
+of recent citations (over the last two years) to the total number of
+publications in the cluster, and is intended to help detect hotspots
+among the communities identified in the corpus. We use citation data
+from OpenAlex, which is currently one of the best open data sources.
+However, citation metadata from OpenAlex remains incomplete and must
+therefore be interpreted with caution (Alperin et al. 2024).
 
 \hypertarget{custom-perimeter}{%
 \subsection{3.2 Custom perimeter}\label{custom-perimeter}}
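
Note on the LLM naming paragraph added in this hunk: the prompt asks for a JSON object keyed by list number, but the commit does not show how the topic lists are laid out or how the reply is read back. The sketch below is one possible arrangement. The `list N = [...]` layout, the reply shape `{"1": "name"}` and the helper names `format_topic_lists` and `parse_names` are all assumptions, and no model is called.

```python
import json


def format_topic_lists(communities):
    """Render each community's weighted topics as one numbered line."""
    lines = []
    for number, topics in enumerate(communities, start=1):
        # Heaviest topics first, so the most representative ones lead.
        ranked = sorted(topics.items(), key=lambda item: item[1], reverse=True)
        rendered = ", ".join(f"{topic} ({weight})" for topic, weight in ranked)
        lines.append(f"list {number} = [{rendered}]")
    return "\n".join(lines)


def parse_names(reply, n_communities):
    """Map a JSON reply {list number: name} to a list of names.

    Missing entries fall back to an empty string, mirroring the prompt's
    rule for empty topic lists.
    """
    data = json.loads(reply)
    return [str(data.get(str(i), "")) for i in range(1, n_communities + 1)]


# Hypothetical input: one populated community and one empty community.
COMMUNITIES = [
    {"graph theory": 12, "community detection": 30},
    {},
]
user_message = format_topic_lists(COMMUNITIES)
# A reply of the assumed shape, written by hand here.
names = parse_names('{"1": "Network clustering", "2": ""}', len(COMMUNITIES))
```

Keying the reply by list number lets each generated name be matched back to its community even if the model reorders or omits entries.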

diff --git a/doc_network/out.odt b/doc_network/out.odt
diff --git a/doc_network/out.pdf b/doc_network/out.pdf