
Commit 936e353

updated example for reviews
1 parent effe8db commit 936e353

9 files changed: +163 -268 lines changed

Diff for: README.md

+28 -4
@@ -258,19 +258,19 @@ When using the "pprompt" or "prompt" function, TopicGPT can behave differently t
  TopicGPT is centrally built on top of text embeddings and the prompting mechanisms obtained via LLMs and provided by the OpenAI API. Please also see the section [References](#references) for more details on the models and ideas used in TopicGPT.

- ### Embeddings
+ #### Embeddings
  When no embeddings are provided, TopicGPT automatically computes the embeddings of the documents in the provided corpus and of the vocabulary extracted from it. This happens after the ```fit``` method is called.

  The class ```GetEmbeddingsOpenAI``` is used for this purpose.
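For reference, a minimal sketch of how document and vocabulary embeddings can be obtained from the OpenAI embeddings endpoint. This is not the package's ```GetEmbeddingsOpenAI``` implementation; the model name, the way the API key is set, and the lack of batching and error handling are assumptions.

```python
import numpy as np
import openai  # openai<1.0 style client

openai.api_key = "sk-..."  # assumption: the package may read the key from elsewhere

def embed_texts(texts, model="text-embedding-ada-002"):
    """Embed a list of strings and return a (n_texts, embedding_dim) array."""
    response = openai.Embedding.create(input=texts, model=model)
    return np.array([item["embedding"] for item in response["data"]])

corpus = ["The battery lasts two days.", "Shipping was slow but support was helpful."]
document_embeddings = embed_texts(corpus)                            # one vector per document
vocab_embeddings = embed_texts(["battery", "shipping", "support"])   # one vector per vocabulary term
```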

- ### Clustering
+ #### Clustering
  In order to identify topics among the documents, TopicGPT reduces the dimensionality of the document embeddings via UMAP and then uses HDBSCAN to identify the clusters. Dimensionality reduction is necessary because the document embeddings are very high-dimensional, and the curse of dimensionality would otherwise make it difficult, if not impossible, to identify the clusters.

  When the number of topics is not specified in the ```TopicGPT``` class, HDBSCAN determines the number of topics automatically. If the number of topics is specified, agglomerative clustering is applied on top of the clusters identified by HDBSCAN.

  The class ```Clustering``` is used for this purpose.
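A rough sketch of the reduce-then-cluster step described above, using the umap-learn and hdbscan libraries; the parameter values here are illustrative assumptions, not the package's defaults.

```python
import umap
import hdbscan

# Reduce the high-dimensional document embeddings (e.g. from the sketch above) to a few dimensions.
reducer = umap.UMAP(n_components=5, metric="cosine", random_state=42)
reduced_embeddings = reducer.fit_transform(document_embeddings)

# Let HDBSCAN determine the number of clusters (topics) itself; label -1 marks outlier documents.
clusterer = hdbscan.HDBSCAN(min_cluster_size=15)
labels = clusterer.fit_predict(reduced_embeddings)
```

If a fixed number of topics is requested, one way to realize the agglomerative step mentioned above would be to merge the HDBSCAN clusters with scikit-learn's ```AgglomerativeClustering``` on the cluster centroids; this is a guess at the mechanism rather than the package's exact procedure.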

- ### Extraction of Top-Words
+ #### Extraction of Top-Words

  After the clusters have been identified, TopicGPT extracts the top-words of each topic. This is done via two different methods:
  - **Tf-idf**: The tf-idf method is based on the idea that words that occur frequently in a topic but rarely in other topics are good indicators for the topic. The top-words are thus the words with the highest tf-idf scores.
@@ -280,7 +280,7 @@ Note that the Tf-idf heuristic was introduced for the BerTopic Model (Grootendor
  Topword extraction is performed with the help of the class ```ExtractTopWords```.
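As a loose illustration of the tf-idf heuristic above, operating on a word-topic count matrix. The scoring below is an assumed, simplified variant and not the exact formula used by ```ExtractTopWords```.

```python
import numpy as np

def topwords_tfidf(word_topic_mat, vocab, top_n=10):
    """word_topic_mat: (n_words, n_topics) array of word counts per topic."""
    tf = word_topic_mat / (word_topic_mat.sum(axis=0, keepdims=True) + 1e-12)  # term frequency within each topic
    df = (word_topic_mat > 0).sum(axis=1)                                      # number of topics containing each word
    idf = np.log(word_topic_mat.shape[1] / (df + 1e-12))                       # words rare across topics score higher
    scores = tf * idf[:, None]
    top_idx = np.argsort(-scores, axis=0)[:top_n]                              # indices of the best words per topic
    return [[vocab[i] for i in top_idx[:, t]] for t in range(word_topic_mat.shape[1])]
```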

- ### Describing and naming topics
+ #### Describing and naming topics

  In the next step, every topic is given a short name and a description. This is done by prompting an LLM provided by OpenAI with around 500 top-words of each topic; the LLM then generates a short name and a description for the topic.
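A hedged sketch of what such a naming prompt could look like with the OpenAI chat API (openai<1.0 style). The actual prompt wording used by ```TopwordEnhancement``` and the model choice are assumptions.

```python
import openai

def name_topic(topwords, model="gpt-3.5-turbo"):
    """Ask the LLM for a short name and a one-sentence description of a topic, given its top-words."""
    response = openai.ChatCompletion.create(
        model=model,
        messages=[
            {"role": "system", "content": "You name and describe topics of a topic model."},
            {"role": "user", "content": "Give a short name and a one-sentence description for a topic "
                                        "with the following top words: " + ", ".join(topwords)},
        ],
    )
    return response["choices"][0]["message"]["content"]
```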

@@ -289,6 +289,30 @@ The class ```TopwordEnhancement``` is used for this purpose.
  Note that the computation of embeddings, the extraction of top-words, and the describing and naming of topics are all performed when calling the ```fit``` method of the ```TopicGPT``` class.
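A minimal usage sketch of that flow. The import path and constructor arguments shown here are assumptions; check the package documentation for the exact signature.

```python
from topicgpt.TopicGPT import TopicGPT  # assumed import path

corpus = [
    "Great phone, terrible battery.",
    "Fast delivery, the item arrived as described.",
    "The headphones broke after a week.",
]

tm = TopicGPT(openai_api_key="sk-...")  # parameter name is an assumption
tm.fit(corpus)  # computes embeddings, clusters them, extracts top-words, and names the topics
```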

+ #### Prompting
+
+ The main way to interact with TopicGPT is via direct textual prompts. These prompts are augmented with basic information about the desired behavior and potentially useful context, together with descriptions of the available functions and their parameters. The augmented prompt is sent to an LLM via the OpenAI API. The LLM then decides whether to call one of the provided functions and, if so, with which parameters. The function call is executed, and part of its result is returned to the LLM, which combines the original prompt, the function call, and its result into the final response.
+
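The loop described above corresponds to OpenAI-style function calling (openai<1.0 API shown below). The ```knn_search``` schema here is a simplified assumption, not the exact schema the package registers.

```python
import json
import openai

functions = [{
    "name": "knn_search",
    "description": "Find documents related to a keyword.",
    "parameters": {
        "type": "object",
        "properties": {
            "keyword": {"type": "string"},
            "n_docs": {"type": "integer"},
        },
        "required": ["keyword"],
    },
}]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": "Which documents talk about battery life?"}],
    functions=functions,
    function_call="auto",  # let the model decide whether to call a function
)

message = response["choices"][0]["message"]
if message.get("function_call"):
    name = message["function_call"]["name"]
    args = json.loads(message["function_call"]["arguments"])
    # ...execute the matching function here, then send its result back to the model
    # in a follow-up message with role="function" to obtain the final answer.
```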
+ #### Functions available for prompting
+
+ The following functions are available for the LLM to use:
+ - ```knn_search```: This function is used to find documents that are related to a certain keyword. The LLM can specify the number of documents to be found and the number of keywords to be used. The result is retrieved by performing retrieval-augmented generation (RAG), where the query is embedded and the most similar documents are retrieved (a rough retrieval sketch follows after this list).
+ - ```identify_topic_idx```: This function is used to identify the topic that is most related to a certain keyword. This is simply done by providing all topic descriptions to the LLM and then asking for the index of the topic that is most related to the keyword.
+ - ```get_topic_information```: This function is used to obtain information on certain topics. This can be useful to compare similar topics.
+ - ```split_topic_kmeans```: This function is used to split a topic into subtopics. The LLM can specify the number of subtopics to be created. The result is obtained by performing k-means clustering on the document embeddings of the documents in the topic. Note that when splitting a topic, the top-words are not completely recomputed; rather, the top-words of the "super"-topic are distributed among the subtopics.
+ - ```split_topic_hdbscan```: Works analogously to ```split_topic_kmeans``` but uses HDBSCAN instead of k-means clustering. This means that the number of subtopics is not specified by the user but determined automatically by HDBSCAN.
+ - ```split_topic_keywords```: This function is used to split a topic into subtopics based on provided keywords. Each keyword is embedded, and the topic is split according to the cosine similarity between the document embeddings within the "super"-topic and the keyword embeddings. This means that documents in the "super"-topic that are most similar to a certain keyword are assigned to the corresponding subtopic.
+ - ```add_new_topic_keyword```: This function is used to add a new topic based on a keyword. The documents belonging to this new topic are those documents from all other topics that are more similar to the embedding of the new keyword than to the centroid of their original topic. All top-words and the topic descriptions are then recomputed.
+ - ```delete_topic```: This function is used to delete a topic. The LLM can specify the topic to be deleted. The topic is simply removed from the list of topics, and its documents are assigned to the topic with the most similar centroid. All top-words and the topic descriptions are then recomputed.
+ - ```combine_topics```: This function is used to combine two topics into a single topic. The LLM can specify the two topics to be combined. The documents of the two topics are merged, and the embeddings and top-words of the new topic are recomputed.
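To make the ```knn_search``` behavior above concrete, a rough retrieval sketch using cosine similarity over the document embeddings. This is an assumed implementation, not the package's code; ```embed_texts``` is the helper from the embedding sketch further up.

```python
import numpy as np

def knn_search_sketch(query, document_embeddings, corpus, top_k=5):
    """Embed the query and return the top_k most similar documents with their scores."""
    query_embedding = embed_texts([query])[0]
    doc_norms = document_embeddings / np.linalg.norm(document_embeddings, axis=1, keepdims=True)
    query_norm = query_embedding / np.linalg.norm(query_embedding)
    similarities = doc_norms @ query_norm                 # cosine similarity per document
    top_idx = np.argsort(-similarities)[:top_k]
    return [(corpus[i], float(similarities[i])) for i in top_idx]
```

In practice these functions are not called directly; one goes through the package's ```prompt``` or ```pprompt``` methods, e.g. ```tm.pprompt("Which topic deals with battery life?")```, and the LLM decides which of the functions listed above to invoke. The example phrasing and return format here are illustrative.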
  ## Limitations and Caveats

Diff for: examples/AmazonReviews.ipynb

+113 -175
Large diffs are not rendered by default.

Diff for: setup.py

+1 -1

@@ -6,7 +6,7 @@
  setup(
      name='topicgpt',
-     version='0.0.1',
+     version='0.0.2',
      packages=find_packages(where='src'),
      package_dir={'': 'src'},
      install_requires=[

Diff for: src/TopicGPT/TopicRepresentation.py

+2
@@ -403,6 +403,8 @@ def extract_topics_labels_vocab(corpus: list[str], document_embeddings_hd: np.nd

      word_topic_mat = extractor.compute_word_topic_mat(corpus, vocab, labels, consider_outliers = False)  # compute the word-topic matrix of the corpus

+     dim_red_centroid_dict = {label: centroid for label, centroid in zip(centroid_dict.keys(), dim_red_centroids)}  # map each topic label to its dimensionality-reduced centroid
+
      if "tfidf" in topword_extraction_methods:
          tfidf_topwords, tfidf_dict = extractor.extract_topwords_tfidf(word_topic_mat = word_topic_mat, vocab = vocab, labels = labels, top_n_words = n_topwords)  # extract the top-words according to tfidf
      if "cosine_similarity" in topword_extraction_methods:

Diff for: test/Test_Package/TestTopicGPT_init_and_fit.py renamed to test/TestPackage/TestTopicGPT_init_and_fit.py

+4
@@ -1,3 +1,7 @@
+ """
+ This class tests the init and fit functions of the TopicGPT module.
+ """
+
  import os
  import sys
  import inspect

Diff for: test/Test_Package/TestTopicGPT_prompting.py renamed to test/TestPackage/TestTopicGPT_prompting.py

+4
@@ -1,3 +1,7 @@
+ """
+ This class is mainly used to test the prompting functionality of the TopicGPT package.
+ """
+
  import os
  import sys
  import inspect

Diff for: test/TestTopicGPT_init_and_fit.py

+6
@@ -1,3 +1,9 @@
+ """
+ This class is used to test the init and fit functions of the TopicGPT class.
+ """
+
+
+
  import os
  import sys
  import inspect

Diff for: test/TestTopicGPT_prompting.py

+5
@@ -1,3 +1,8 @@
+ """
+ This class is used to test the prompting functionality of the TopicGPT class.
+ """
+
+
  import os
  import sys
  import inspect

Diff for: test/Test_Package/prem_test.ipynb

-88
This file was deleted.

0 commit comments
