Introduction

Endogenous retroviruses (ERVs), accounting for approximately 8% of the human genome, consist of the two major families, human endogenous retroviruses (HERVs) and mammalian apparent long terminal repeat (LTR) retrotransposons (MaLRs)1, substantially larger than the regions encoded by human genes. ERVs had often been regarded as junk DNA or remnants of ancient retroviruses with accumulated numerous inactivating mutations or deletions; however, a link between HERV activation and the pathogenesis of diverse human diseases, including neurological and psychiatric disorders, autoimmune diseases, and cancers has been reported2,3,4,5. In addition, an increasing body of evidence suggests that HERVs and retrotransposons contribute causally to disease progression and neurotoxicity in aging and age-related neurodegenerative disorders6. The LTRs of the MaLR family (Transposon-like human element 1B: THE1B) fusion transcripts were also observed in Hodgkin’s lymphoma1,7,8. Its aberrant expression has been noted in various pathological states, suggesting a complex role in human disease mechanisms.

Sarcoidosis presents a unique context to explore the impact of retrotransposons due to its idiopathic nature and characteristic granuloma formation, affecting multiple organs including muscle, skin, and lung9. The etiology of sarcoidosis, and particularly its muscular manifestation known as sarcoid myopathy (SM), remains poorly understood, with current theories pointing toward an interplay of genetic susceptibilities and environmental triggers10,11,12.

In this study, we aim to dissect the contribution of THE1B and its associated transcripts to the pathogenesis of SM. Utilizing a combination of short-read and long-read RNA sequencing, we analyzed muscle biopsies from individuals diagnosed with SM and compared them to samples from individuals with other muscle diseases (non-SM) serving as controls. Our approach integrated established bioinformatic tools—LeafCutter13,14, Telescope15, and Salmon16—with a novel method we developed, named “Comprehensive Fetching of Fusion transcripts between ERVs and Exons of human genes” (COFFEE). This strategy enabled the identification of 19 THE1B-fused transcripts uniquely expressed in SM samples. Subsequent validation efforts, including RT-qPCR and analysis of public RNA-seq datasets, reinforced the potential significance of these findings in the broader context of sarcoidosis pathogenesis.

Results

Identification of a THE1B element fused with CIR1 exon in the individuals with SM

An overview of this study is shown in Fig. 1. Our study commenced with an examination of aberrantly expressed and spliced transcripts in muscle disease, selecting SM (Supplementary Fig. 1) and antisynthetase syndrome (ASS) as the primary conditions for investigation during the discovery stage. Utilizing the Spliced Transcripts Alignment to a Reference (STAR) alignment algorithm17 followed by LeafCutterMD, we analyzed short-read RNA-seq data in 16 SM and 15 ASS samples (Supplementary Data 1).

Fig. 1: Overview of retrotrans-genomics and validation studies.
figure 1

Retrotrans-genomics has a double meaning: genome-wide retrotransposon analysis and reverse genomics from retro-transcriptome to whole genome analysis. a Schematic diagram of genomic structures of CIR1 locus. Orange rectangles are shown to the exon of the gene. The green arrow shows the direction of the transcription. b Schematic diagram of full-length cDNA of a THE1B fusion transcript by Nanopore long-read seq. c Expression analysis of THE1B/CIR1 by Salmon quantification. Error bar means the standard deviation value. The adjusted P values (Padj) were calculated by a two-sided Wald test and Benjamin-Hochberg correction. Source data are provided as a Source Data file. d Retrotrans-genomics of THE1B fusion transcripts in sarcoid myopathy by COFFEE. Red blots show highly expressed THE1B elements in sarcoid myopathy by using PhenoGram64. e Schematic diagram of THE1B/DNAJC5B.1, THE1B/DNAJC5B.2 and canonical DNAJC5B. THE1B element is located 42 kb upstream from exon 1 of DNAJC5B. Brown rectangles are shown to be exons of the genes. f, g Validation studies by (f) RNA-seq data on a public database, (g) RT-qPCR. h Single-cell and single-nucleus cell RNA-seq shows the cells expressing THE1B fusion transcripts.

Our analysis unexpectedly identified an aberrantly spliced transcript originating from THE1B, functioning as exon 1 within intron 7 of the CIR1 located on chromosome 2q31.1. This gene encodes the “corepressor interacting with RBPJ” protein18. The THE1B-fused CIR1 transcript, which we have named THE1B/CIR1, was consistently observed in all 16 SM samples but not in ASS samples (Fig. 2a). Notably, the THE1B/CIR1 shares 98% sequence identity with an uncharacterized 5’-EST (BP305191) found in macrophages from the Sugano cDNA library19, albeit a full-length cDNA sequence has not been available in public genomic databases such as UCSC genome browser20, Ensembl21 and ZENBU genome browser22. To address this gap, we conducted Nanopore long-read RNA-seq analysis on six SM samples, successfully delineating the full-length cDNA sequence of THE1B/CIR1.

Fig. 2: Discovery of THE1B fusion transcripts associated with SM.
figure 2

a Sashimi plot for THE1B/CIR1 that are alternatively spliced from THE1B element (is indicated in blue line) within intron 7 of CIR1 (in red) in SM, but not in non-SM (in light blue). Common exons of THE1B/CIR1 and CIR1 are indicated in purple. Genomic structures of THE1B/CIR and CIR1 are also shown below the Sashimi plots. b Each dot denotes the expression level (transcripts per million) of THE1B/CIR in individuals. Red dots denote the expression level in SM individuals (n = 16), and blue dots denote the expression level in non-SM (n = 400). The adjusted P values (Padj) were calculated by a two-sided Wald test and Benjamin-Hochberg correction. The gray bar indicates the mean value. Error bar means the standard deviation value. Source data are provided as a Source Data file. c Volcano plot showing each highly expressed THE1B (red) fused with neighbor exons of human genes (Padj < 0.05, log2FC > 2). The adjusted P values (Padj) were calculated by a two-sided LRT test and Benjamin-Hochberg correction. Source data are provided as a Source Data file. d Each dot denotes the expression level (transcripts per million) of THE1B/CIR in individuals. Blue dots denote the expression level before the tofacitinib treatment, and yellow dots denote the expression level after the tofacitinib treatment. “Pre” and “Post” are indicated measurements before and during treatment, respectively. “CR,” “PR,” and “HC” are indicated as complete responders to tofacitinib treatment (n = 3), partial responders to tofacitinib treatment (n = 3), and healthy controls (n = 2), respectively. The gray bar indicates the mean value. Error bar means the standard deviation value. The adjusted P-values (Padj) were calculated by two-sided LRT and Benjamin-Hochberg correction. Source data are provided as a Source Data file. e Measurement of discrimination ability of each THE1B fusion transcript by ROC-AUC curve.

To further explore THE1B/CIR1’s expression in SM relative to other conditions, we expanded our non-SM sample pool to 400 (Supplementary Data 2) and utilized the Salmon quantification tool on short-read RNA-seq data from both 16 SM and 400 non-SM samples, with long-read RNA-seq data serving as a reference. This comparative analysis revealed that THE1B/CIR1 expression was significantly elevated in SM, being 116-fold higher than non-SM samples (Padj = 5.72 × 10−7), as demonstrated in Fig. 2b.

Retrotrans-genomics of THE1B fusion transcripts associated with SM by COFFEE

To identify additional THE1B fusion transcript variants associated with SM throughout the human genome, we developed a novel method named COFFEE. This method facilitated a targeted retro-transcriptome analysis on aberrantly expressed THE1B fusion transcript variants (Supplementary Note). Specifically, we extracted 27,233 THE1B loci from the GRCh38_GENCODE_rmsk_TE.gtf file and conducted an intronome analysis in 16 SM samples. Subsequently, we compared the expression levels of THE1B loci in 16 SM samples with 400 non-SM controls, employing Telescope and DESeq223 tools. Following this, we identified highly expressed THE1B loci fused with neighbor exons (GTF file) and intersected these findings with the intronome data from the 16 SM individuals using BEDTools intersect24 (Supplementary Fig. 2). This analysis revealed 31 THE1B loci significantly overexpressed in the SM compared to non-SM (Padj < 0.05, log2FC > 1; Fig. 2c and Supplementary Data 3).

Using the Nanopore long-read RNA-seq data from SM samples, we identified 35 full-length THE1B fusion transcript variants from the 31 THE1B loci. Remarkably, 24 of these 35 THE1B fusion transcript variants had not been previously mapped in major genomic databases, including UCSC genome browser, Ensembl and ZENBU genome browser. Among these, one transcript, ENCT00000425851 was excluded from further analysis as it was identical to THE1B/DNAJC5B.1, differing only in the THE1B element.

We identified 19 THE1B fusion transcript variants that were highly expressed in individuals with SM (Padj < 0.05, log2FC > 2; Table 1 and Supplementary Fig. 3). Among these, specific variants such as LOC107984475, THE1B/ENSG00000259760.1 and THE1B/ENSG00000259760.2 demonstrated remarkably higher expression in SM samples—453-fold, 85-fold, and 14-fold, respectively, compared to non-SM controls (Padj = 8.39 × 10−15, 2.39 × 1051 and 6.80 × 1025). Furthermore, we successfully mapped 15 of these previously unmapped transcript variants to the human genome using Blat tool25,26 (Supplementary Figs. 45).

Table 1 Highly expressed THE1B fusion transcripts in 16 SM individuals compared with 400 non-SM

Reduced expression of THE1B fusion transcripts in cutaneous sarcoidosis by tofacitinib treatment

To further validate the discovery of 19 highly expressed THE1B fusion transcript variants in SM, we investigated their expression in the context of cutaneous sarcoidosis in short-read RNA-seq datasets obtained from a sarcoidosis study by Damsky et al.27, accessible via the Gene Expression Omnibus, NCBI28. Notably, the original dataset from Damsky et al. did not report any THE1B fusion transcript variants. In their study, ten individuals with cutaneous sarcoidosis were treated with tofacitinib, a JAK1/3 inhibitor, in a six-month open-label trial. We analyzed all available short-read RNA-seq data from their study, including three complete responders and three partial responders, using the Salmon tool for expression analysis, and found that eight of 19 THE1B fusion transcripts, (ENTHD1, LINC02154, THE1B/DNAJC5B.1, THE1B/ENSG00000259760.1, LOC107984475, THE1B/ENSG00000259760.2, SMFT4, and THE1B/CIR1) (briefly summarized in Supplementary Note) exhibited a significant reduction in expression in the complete responders, almost to zero, while no remarkable change was observed in the partial responders (Padj < 0.05; Fig. 2d and Supplementary Fig. 6a), indicating that the expression of these eight THE1B fusion transcript variants may well reflect disease activity in cutaneous sarcoidosis.

RT-qPCR validation of THE1B fusion transcripts in additional SM samples

To validate our findings on these eight highly expressed THE1B fusion transcripts in SM, we expanded our analysis to include 14 additional individuals diagnosed with SM. These were compared against 29 individuals with non-SM (the validation set 1; Supplementary Data 4). Using the TaqMan qPCR assay, we observed significantly higher expression levels of all eight THE1B fusion transcripts in the skeletal muscle tissues of SM individuals compared to the non-SM group (Padj < 0.01; Supplementary Fig. 6b). To assess the discriminative power of these eight THE1B fusion transcripts between SM and non-SM cases, we conducted a Receiver Operating Characteristic (ROC) analysis, which highlighted LINC02154 as the most potent diagnostic predictor for SM, with an Area Under the Curve (AUC) of 0.993 (95% Confidence Interval [CI]: 0.976-1). It was closely followed by THE1B/ENSG00000259760.1 with an AUC of 0.990 (95% CI: 0.969-1) and THE1B/ENSG00000259760.2 with an AUC of 0.964 (95% CI: 0.894-1). Other notable transcripts included THE1B/CIR1 with an AUC of 0.958 (95% CI: 0.880-1), LOC107984475 with an AUC of 0.936 (95% CI: 0.845-1), ENTHD1 with an AUC of 0.911 (95% CI: 0.824-0.999), THE1B/DNAJC5B.1 with an AUC of 0.857 (95% CI: 0.730-0.984) and SMFT4 with a similar AUC of 0.857 (95% CI: 0.734-0.980) (Fig. 2e).

Expression of THE1B fusion transcripts in macrophages in SM, cutaneous sarcoidosis and tuberculosis

To characterize the cellular expression of THE1B fusion transcripts in SM, we conducted analysis on 11,229 cells from single-nucleus RNA-seq (snRNA-seq) data from the skeletal muscle tissues of two individuals diagnosed with SM. The data was integrated using ComBat, followed by imputation missing values using MAGIC (v.3.0.0)29. We then applied uniform manifold approximation and projection (UMAP) for clustering, which identified thirteen major cell types (Fig. 3a).

Fig. 3: Identification of THE1B fusion transcripts expressing cells.
figure 3

a Cell annotation. UMAP projection of integrated snRNA-seq data from two individuals with SM. b Relative expression of CHIT1, a marker for GA macrophage in sarcoidosis. c Dot plot showing the expression level per cell in individuals with SM. d Relative expression of ZNF430. e MUSCLE alignment65 of putative ZNF binding sites in 8 THE1B elements. f and g Dot plots showing the expression level per cell in individuals with cutaneous sarcoidosis and tuberculosis, respectively.

In the study of cutaneous sarcoidosis, Krausgruber et al. found an upregulation of CHIT1 and DCSTAMP within granuloma-associated (GA) macrophages using single-cell RNA-seq (scRNA-seq) analysis30. Similarly, in our analysis of SM, we observed that cells expressing CHIT1 and DCSTAMP showed partial overlap with CD68-expressing macrophages (Fig. 3b and Supplementary Fig. 7a), which were also found to abundantly express IFNGR1 (Supplementary Fig. 7a).

To discern the difference between a canonical transcript and the corresponding THE1B fused transcripts, we utilized alevin-fry paired with a splici reference sequence31 for our analysis of snRNA-seq and scRNA-seq data. This is a method for splicing-isoforms quantification when analyzing 3’-tagged snRNA-seq and scRNA-seq data. We found that THE1B/DNAJC5B.1 was expressed not only in GA macrophages but also in monocytes and dendritic cells, in contrast to its canonical counterpart, DNAJC5B, which lacked expression in macrophages (Supplementary Fig. 7b), highlighting a unique expression profile for the fused variant. THE1B/CIR1 demonstrated main expression in GA macrophages with additional presence in 11 other cell types, the expression pattern of which closely mirrored that of the canonical CIR1 transcript (Supplementary Fig. 7b). ENTHD1, LINC02154, LOC107984475, THE1B/ENSG00000259760.1 and THE1B/ENSG00000259760.2 showed predominant expression within GA macrophages (Supplementary Fig. 7c). SMFT4 exhibited main expression in Mesenchymal stem cells (Supplementary Fig. 7c). We also confirmed genetic relationships between THE1B fusion transcripts and CHIT1 or DCSTAMP using snRNA-seq data from two individuals with SM by Pearson’s correlation coefficient. THE1B fusion transcripts showed strong positive correlation with CHIT1 (ENTHD1: r = 0.85; LINC02154: r = 0.99; LOC107984475: r = 0.98; THE1B/ENSG00000259760.1: r = 0.98 and THE1B/ENSG00000259760.2: r = 0.97) and DCSTAMP (ENTHD1: r = 0.85; LINC02154: r = 1.00; LOC107984475: r = 0.99; THE1B/ENSG00000259760.1: r = 0.99 and THE1B/ENSG00000259760.2: r = 0.98) (Supplementary Data 5).

ZNF430 is known to bind preferentially to THE1B elements, potentially acting as a silencer or repressor32. Notably, we observed strong expression of ZNF430 in GA macrophages in SM (Fig. 3d). Analysis of UCSC genome data (GRCh38/hg38) revealed that insertion/deletion or nucleotide substitution at putative ZNF binding sites within THE1B elements of eight fusion transcripts (Fig. 3e). This suggests that THE1B fusion transcripts might evade ZNF430’s regulatory control in GA macrophages, implicating a possible mechanism for their sustained expression in SM.

To extend our characterization of THE1B fusion transcript expression to individuals with cutaneous sarcoidosis, we analyzed integrated scRNA-seq data, leveraging the dataset deposited by Damsky et al. for three individuals (Supplementary Fig. 8a). We identified CHIT1 as a marker for GA macrophages (Supplementary Fig. 8b) and observed the expression of all the eight THE1B fusion transcripts in GA macrophages in cutaneous sarcoidosis, with the exception of SMFT4 (Fig. 3f and Supplementary Fig. 8c).

Tuberculosis (TB) and sarcoidosis are both characterized by the presence of granulomas, sharing not only histopathological but also certain clinical and radiological features33. We therefore examined the expressions of the eight THE1B fusion transcripts in TB by utilizing scRNA-seq data from TB lung tissues demonstrating high 18F-labeled fluorodeoxyglucose avidity34 (Supplementary Fig. 9a). IDO1 was used as a granuloma (GA) marker gene in individuals with TB35 (Supplementary Fig. 9b). We analyzed the scRNA-seq data by directly integrating the count data obtained from the Cell Ranger pipeline. THE1B/DNAJC5B.1 was expressed in macrophages, whereas DNAJC5B was found in both macrophages and monocytes (Supplementary Fig. 9c). Similarly, THE1B/CIR1 was primarily expressed in macrophages, but CIR1 was found in macrophages and 10 other cell types (Supplementary Fig. 9c). A distinctive feature observed in individuals with TB was the exclusive expression of five THE1B fusion transcripts—LINC02154, THE1B/ENSG00000259760.1, THE1B/ENSG00000259760.2, THE1B/DNAJC5B.1 and THE1B/CIR1—specifically in macrophages, but not in GA cells (Fig. 3g, Supplementary Fig. 9c, and Supplementary Data 7). In TB, ENTHD1 was predominantly expressed in dendritic cells, in contrast to its main expression in macrophages and T cells in SM (Fig. 3g and Supplementary Fig. 9d). LOC107984475 was primarily found in tissue stem cells in TB, but was absent in SM (Fig. 3g and Supplementary Fig. 9d). Furthermore, SMFT4 was mainly expressed in macrophages in TB, whereas in SM, it was detected in Mesenchymal stem cells (Fig. 3g and Supplementary Fig. 9d).

Genetic association of TREM2.1 expression with THE1B fusion transcripts in macrophages of individuals with SM

Accumulating evidence suggests that certain bacteria, especially Cutibacterium acnes and Mycobacterium tuberculosis, may play a role in the development of sarcoidosis36. Host-associated bacteria is known to trigger the expression of HERVs through Pattern Recognition Receptors (PRRs)37. We, therefore, investigated the genetic association between THE1B fusion transcripts and PRRs in snRNA-seq data from muscle tissues of individuals with SM by calculating Pearson’s correlation (Supplementary Data 5). We observed a strongly positive correlation between the expression of five THE1B fusion transcripts and the gene encoding triggering receptor (TREM2) expressed on myeloid cells 2 in SM individuals (raverage = 0.71, Fig. 4a and Supplementary Data 5). A similar correlation was noted with CLEC4E (raverage = 0.72, Fig. 4a and Supplementary Data 5). However, we focused on TREM2 over CLEC4E not only because TREM2 is involved in the formation of multinucleated giant cells, a hallmark of granulomatous inflammation, but also because TREM2-expressing macrophages are known to drive inflammation in acne lesions caused by Cutibacterium acnes38.

Fig. 4: Expression of TREM2 in individuals with SM.
figure 4

a Heatmap showing Pearson’s correlation coefficients between Pattern Recognition Receptors (PRRs) and 8 THE1B fusion transcripts. b Relative expression of TREM2 in individuals with SM. c 2D scatter plot shows relationship between TREM2 and THE1B/CIR1. df Each dot denotes the expression level (transcripts per million) of (d) TREM2.1, (e) TREM2.2, and (f) TREM2.3 in individuals, respectively. Red dots denote the expression level in SM individuals and blue dots denote the expression level in non-SM. The adjusted P-values (Padj) were calculated by two-sided LRT and Benjamin-Hochberg correction. Source data are provided as a Source Data file. g Each dot denotes expression level (transcripts per million) of TREM2.1 in the same way as in Fig. 2d. h RT-qPCR assay. Each dot denotes the relative expression level of TREM2.1 to TBP mRNA of individuals. Error bar means the standard deviation value. We compared the relative expression of target genes between SM (n = 14) and non-SM (n = 29), followed by two-sided permuted Brunner-Munzel test and Benjamini-Hochberg correction. The experiment was performed at least two times, and the similar results were confirmed. Source data are provided as a Source Data file. i Graphical visualization of Gene Ontology Enrichment analysis using ShinyGO 0.80. jm UMAP projection of snRNA-seq data from individuals with SM. Relative expression of (j) NFKB1, (k) RELA, (l) NFKB2 and (m) RELB.

The expression of TREM2 also corresponded with that of CHIT1 and DCSTAMP in macrophages, indicating that TREM2 is expressed mainly in GA macrophages (Fig. 4b and c). To further clarify the association between TREM2 expression and SM, we compared the expression between 16 SM and 400 non-SM individuals using short-read RNA-seq data. TREM2 has three transcript variants due to alternative splicing: one transmembrane form (ENST00000373113.8, hereinafter TREM2.1) and two soluble forms (ENST00000338469.3: TREM2.2 and ENST00000373122.8: TREM2.3)39. In SM, the expression level of TREM2.1 was 8.5-fold higher than that in 400 non-SM (Padj < 0.05; Fig. 4d), while TREM2.2 and TREM2.3 showed 5.0-fold and 14.0-fold increases, respectively (Padj < 0.05; Fig. 4e and f). Notably, when comparing expression levels between three TREM2 transcript variants within SM, expression levels of TREM2.1 was 17- and 38-fold higher than those of TREM2.2 and TREM2.3, respectively, highlighting TREM2.1 as the predominant transcript in SM. In the aforementioned study of cutaneous sarcoidosis, treatment with tofacitinib resulted in a more significant decrease in TREM2.1 expression among complete responders compared to partial responders (Padj < 0.05; Fig. 4g). This difference in expression was also confirmed by RT-qPCR in the validation set1 (Padj < 0.01; Fig. 4h and Supplementary Data 4).

To investigate the transcription factors regulating THE1B fusion transcripts in muscle tissues of SM, we analyzed snRNA-seq data. Gene ontology enrichment analysis40 revealed that the transcription factors associated with NF-κB signaling pathway were highly ranked (Fig. 4i). Notably, we discovered NF-κB binding sites within the regions of THE1B that form fusion transcripts, suggesting a direct regulatory mechanism (Supplementary Fig. 10). Expression analysis demonstrated that key NF-κB subunits—NFKB1, RELA, NFKB2 and RELB—were actively expressed in macrophages in SM individuals, highlighting their potential role in the regulation of THE1B fusion transcripts (Fig. 4j–m).

Novel read-through transcript, SIRPB1-SIRPD, expressed in GA macrophages of the individuals with cutaneous sarcoidosis

We also analyzed the genetic association between THE1B fusion transcripts and PRRs in scRNA-seq data obtained and integrated from three individuals with cutaneous sarcoidosis, employing Pearson’s correlation coefficient to assess the strength of these associations (Supplementary Data 6). Notably, the expression of signal regulatory protein delta (SIRPD), typically testis-specific, showed a positive correlation with five THE1B fusion transcripts (raverage = 0.82; Supplementary Data 6). Given the genomic proximity of SIRPD to SIRPB1, we hypothesized that SIRPD might be regulated by the SIRPB1 promoter in these individuals.

Further analysis with long-read RNA-seq data from muscle tissues of six SM individuals revealed a novel SIRPB1-SIRPD read-through transcript, denoted as SIRPB1-SIRPD (Fig. 5a). Amino acid sequence of putative SIRPB1-SIRPD protein includes segments from both SIRPB1 (N-terminal sequence) (O00241-1, amino acids 1-144) and SIRPD (Q5TFQ5, amino acids 25-198) (Fig. 5b), featuring two immunoglobulin-like V-type domains but lacking the transmembrane and cytoplasmic domains typical of SIRPB1 (Fig. 5b, c). The predicted 3D structure by AlphaFold (v.2.3)41 suggests that SIRPB1-SIRPD protein, lacking the SIRPB1 transmembrane domain (amino acids 372-392) essential for DAP12 dimer assembly42, might function as a secreted protein rather than a membrane-bound receptor (Fig. 5d).

Fig. 5: Gene structure and expression of SIRPB1-SIRPD.
figure 5

a Long-read RNA-seq of SIRPB1-SIRPD (above). Genomic structure of SIRPB1 and SIRPD (below). b Amino acid sequence of SIRPB1-SIRPD. Magenta literature denotes the part identical to SIRPB1 amino acid sequence (O00241-1, amino acids 1-144) and cyan literature denotes the part identical to SIRPD amino acid sequence (Q5TFQ5, amino acids 25-198). c Predicted domains of SIRPB1, SIRPB1-SIRPD and SIRPD using SMART66 (https://smart.embl.de/). The C-terminal region colored with dark blue in SIRPB1 indicates the transmembrane domain. d Alphafold2 prediction of 3D structure of SIRPB1-SIRPD. e UMAP projection of scRNA-seq data from individuals with cutaneous sarcoidosis. Relative expression of SIRPB1-SIRPD. f RT-qPCR assay. We compared the relative expression of target genes between SM (n = 14) and non-SM (n = 29), followed by two-sided permuted Brunner-Munzel test and Benjamini-Hochberg correction. The experiment was performed at least two times, and the similar results were confirmed. Each dot denotes the relative expression level of SIRPB1-SIRPD to TBP mRNA of individuals. Error bar means the standard deviation value. Source data are provided as a Source Data file. g Measurement of discrimination ability of SIRPB1-SIRPD by ROC-AUC curve.

Based on the observation of SIRPB1-SIRPD expression in GA macrophages of individuals with cutaneous sarcoidosis (Fig. 5e), we further evaluated its expression in the validation set 1 using TaqMan qPCR. Notably, SIRPB1-SIRPD exhibited significantly higher expression in SM individuals compared to those with the non-SM group (Padj < 0.05; Fig. 5f). Furthermore, ROC curve analysis demonstrated that SIRPB1-SIRPD serves as a potent diagnostic predictor for SM with an AUC of 0.985 (95% CI 0.960-1; Fig. 5g), underscoring its potential as a biomarker for this condition.

Discussion

In this study, we delineated the genomic landscape, highlighting newly identified THE1B fusion transcripts in sarcoidosis, through retrotrans-genomics.

Beyond mere identification, we further substantiated these associations through replication studies and in-depth sc- and snRNA sequencing, enriching the robustness of our findings. A pivotal discovery was the expression of SIRPB1-SIRPD and TREM2 in GA macrophages, which are known to contribute to the formation of multinucleated giant cells43, a hallmark of granuloma. Notably, SIRPB1-SIRPD, characterized by the two immunoglobulin-like V domains similar to soluble TREM244, was predominantly expressed in the macrophages of individuals with cutaneous sarcoidosis, as evidenced by scRNA-seq data. Our findings suggest a potential mechanism where SIRPB1-SIRPD may facilitate the transcription of THE1B fusion transcripts transcription through the activation of the NF-κB signaling pathway, offering new insights into the molecular dynamics of sarcoidosis. Although the exact role of these transcripts in sarcoidosis remains unclear, our findings lay the groundwork for future research to explore how they contribute to the development of the disease.

One of the key findings in this study was the notable reduction in THE1B fusion transcript variant expression, analyzed using gene expression data from Damsky’s study. This reduction was closely aligned with clinical amelioration in sarcoidosis patients treated with tofacitinib, underscoring the critical function of CD4+ T cell-derived IFN-γ in macrophage activation within the context of sarcoidosis. Damsky et al. highlighted IFN-γ as a central mediator, with additional type 1 cytokines such as IL-6, IL-12, IL-15, and GM-CSF also implicated in the disease’s pathogenesis. The suppression of these cytokines, particularly IFN-γ, was associated with clinical improvement, emphasizing the role of tofacitinib in modulating type 1 immune responses27.

The broad immunological functions of IFN-γ, including its critical role in immunological defense against viruses, bacteria, tumors, macrophage activation, granuloma formation, and protection against sarcoidosis and Mycobacterium tuberculosis45 are particularly pertinent given the hypothesized bacterial etiology of sarcoidosis. Our results showed that cells expressing THE1B fusion transcripts also concurrently expressed IFNGR1 (Supplementary Fig. 7a), which encodes the IFN-γ receptor, suggesting a direct interaction between IFN-γ signaling and these transcript variants.

Notably, in complete responders to Damsky’s study, a marked reduction to almost undetectable levels was observed in specific THE1B fusion transcripts—ENTHD1, LINC02154, THE1B/DNAJC5B.1, THE1B/ENSG00000259760.1, LOC107984475, THE1B/ENSG00000259760.2, SMFT4 and THE1B/CIR1—following tofacitinib treatment (Fig. 2d and Supplementary Fig. 6a). This suggests that IFN-γ may activate these THE1B fusion transcript variants in macrophages, potentially contributing to the pathogenesis of sarcoidosis through granuloma formation. Interestingly, partial responders exhibited THE1B fusion transcript levels similar to those in SM in our study, highlighting their potential as biomarkers for disease activity. To investigate these variants further, we developed RT-qPCR assays for the eight THE1B fusion transcript variants and confirmed their expression with TaqMan PCR, reinforcing their diagnostic relevance in sarcoidosis.

In our analysis, we observed differences in the expression of THE1B fusion transcripts between individuals with SM and pulmonary TB, based on our own snRNA-seq data and the scRNA-seq data reported by Wang et al.34. McCaffrey et al. reported consistent co-expression of CD274 and IDO1 in myeloid-rich regions of the granulomas in pulmonary TB due to Mycobacterium avium (M. avium)35. In line with this, IDO1 has been reported to be significantly upregulated in the macrophage-rich regions of TB granulomas in macaques46. In our re-analysis of TB scRNA-seq data, we observed strong expression of IDO1 in macrophages, while CD274 showed weak expression in IDO1-expressing cells, suggesting the results may be influenced by the anti-TB treatment received by individuals with TB analyzed by Wang et al.34. Notably, eight THE1B fusion transcripts were detected in macrophages around granulomas but were not expressed in IDO1 expressing cells (Fig. 3g and Supplementary Fig. 9d).

Granulomas are classified into two types: caseating and non-caseating. TB is characterized by granulomas with central caseous necrosis, while sarcoidosis is characterized by non-caseating granulomas composed of activated immune cells, such as macrophages and T-lymphocytes instead of necrotic tissues47. One of the THE1B fusion transcripts, LINC02154, has been identified as a macrophage-associated long noncoding RNA that elevates hypoxia-inducible factor 1 alpha (HIF1α) levels by interacting with the interleukin enhancer binding factor. This process has been documented in esophageal squamous cell carcinoma48. Interestingly, HIF1α upregulation has also been observed in GA macrophages in cutaneous sarcoidosis30. This finding suggests potential similarities between GA macrophages and the hypoxic conditions associated with the tumor microenvironment48, suggesting similarities with the hypoxic tumor microenvironment. In contrast, myeloid HIF1α conditional knockout mice displayed earlier granuloma necrosis and reduced resistance to M. avium49, underscoring HIF1α‘s significance. Our findings indicate that, unlike in TB, LINC02154 and the other THE1B fusion transcripts were co-expressed in GA macrophages in SM and cutaneous sarcoidosis, pointing to the potential role of these transcripts in the unique pathology of non-caseous granulomas in sarcoidosis.

In conclusion, retrotrans-genomics has revealed transcripts fused with a MaLR element, THE1B, and a novel read-through transcript in sarcoidosis. These discoveries not only substantiate the role of retrotransposons in the pathogenesis of sarcoidosis but also provide new mechanistic insights, opening potential avenues for innovative diagnostic and therapeutic strategies in granulomatous diseases. However, this study is limited by the modest sample sizes used in the validation phase and in assessing the impact of tofacitinib treatment. Furthermore, the precise functions of THE1B fusion transcripts, the SIRPB1-SIRPD read-through transcript, and their associated gene products in sarcoidosis and other granulomatous conditions remain to be elucidated. Future studies with larger cohorts for validation, including detection of the new transcripts in body fluids from the patients, detailed functional studies, and RNA-FISH, are needed to elucidate how these genetic factors contribute to sarcoidosis.

Methods

Participants

The National Center of Neurology and Psychiatry (NCNP) has been functioning as a nationwide referral center for muscle diseases since 1978 and is estimated to collect more than 70% of the muscle biopsies performed in Japan. After diagnosis, the samples are stored in the muscle repository for future re-analysis and research use. This study was approved by the institutional review board of NCNP (reference: A2021-016). We obtained written informed consent from all of the participants for diagnostic purposes and the use of the samples for research. A set of routine histochemical staining and immunohistochemical staining was performed for the diagnosis of the frozen muscle samples. Among the samples that were sent to NCNP for diagnostic purposes between 2007 and 2021 from all over Japan, we included a total of 30 individuals who were clinicopathologically diagnosed with SM. All individuals with SM had granulomas on muscle pathology upon re-evaluation. The average age of the 16 individuals in the discovery set and 14 individuals in the validation set 1 were 68.3 ± 5.7 years old (range 58–81) and 67.4 ± 10.0 years old (range 52–81), respectively (Supplementary Data 1 and Supplementary Data 2). The sex ratio (male: female) of SM in the discovery set and the validation set 1 was 1:15 and 5:9, respectively. Non-SM samples as controls in the discovery set consisted of 400 individuals diagnosed (or suspected) to have muscular diseases. Non-SM in the validation set 1 consisted of 29 individuals with muscular diseases that were already identified as the causative gene in our laboratory by either method as follows: genomic sequencing, repeat primed PCR or Southern blotting. The average age of 15 individuals with ASS was 68.4 years old (range 53–84). The sex ratio (male: female) of ASS was 1: 14. The age of the 385 non-SM individuals in the discovery set and the 29 individuals in the validation set 1 were 31.3 ± 26.1 years old (range 0–81) and 63.9 ± 8.8 years old (range 52–84), respectively. The sex ratio (male: female) of non-SM in the discovery set and the validation set 1 were 232: 153 and 11: 18, respectively.

Immunohistochemical evaluation

We performed hematoxylin and eosin (H&E) staining and immunohistochemical staining methods50: CD3 (polyclonal, Abcam, The dilution ratio used for this antibody was 1:50), CD4 (clone MT310, Santa Cruz Biotechnology, The dilution ratio was 1:100), CD8 (clone DK25, Dako, The dilution ratio was 1:100), CD20 (clone L26, eBioscience, The dilution ratio was 1:50) and CD68 (clone KP1, Dako, The dilution ratio was 1:1000).

RNA extraction

We extracted total RNA from unstained, 10 µm frozen muscle sections in individuals by Maxwell RSC simply RNA Tissue Kit (Promega), followed by Maxwell RSC automated DNA/RNA extraction machine (Promega) according to the manufacturer’s instruction. We assessed the total RNA of the samples using Agilent 2100 Bioanalyzer (Agilent Technologies) and synthesized cDNA from total RNA (RIN > 7 except for one control sample in the validation set 1 with an RIN of 6.2, 150 ng) using SuperScript IV reverse transcriptase (Thermo Fisher) for TaqMan PCR.

Short-read RNA-seq and long-read RNA-seq

We performed oligo dT based mRNA enrichment from total RNA (RIN > 7) and synthesized cDNA using a random N6 primer. Short-read RNA-seq were performed by MGISEQ-2000 (MGI Tech). Regarding long-read RNA-seq, we constructed libraries by 100 ng total RNA (RIN > 7) extracted from 6 SM samples using the cDNA-PCR Sequencing Kit (SQK-PCS111, Oxford Nanopore Technologies) according to the manufacturer’s manual. We performed sequencing using PromethION with the R9.4.1 flowcell (Oxford Nanopore Technologies) at our laboratory (MGC, NCNP). We performed the reference-based transcript assembly by the wf-variants reference-based transcript assembly workflow (https://github.com/epi2me-labs/wf-variants) using the human genome assembly (GRCh38.p13) as a reference. We assembled the mapped transcripts using StringTie (v.2.2.1)51, (-L, -g gencode.v39.chr_patch_hapl_scaff.annotation.gff3). We predicted coding regions by TransDecoder (v.5.5.0) (https://github.com/TransDecoder/TransDecoder/) and ORFfinder (https://www.ncbi.nlm.nih.gov/orffinder/).

Detection of aberrant splicing events in SM

We mapped sequencing reads to the human genome assembly (GRCh38.p13) by STAR (v.2.7.10). We compared the alignment result from each of 15 SM individuals and those from 15 ASS individuals, using LeafCutterMD (v.0.2.9), with default parameters. We identified outlier excised introns according to these three criteria, (i) the excised intron was spliced with adjusted P-value (Padj < 0.05, Benjamini-Hochberg correction), (ii) the excised intron made a pseudo-exon, (iii) the outlier splicing event occurred in more than 10 of 15 SM individuals.

COFFEE method

This method mainly consists of these four steps.

Step 1: Differentially expressed ERVome analysis in a locus-specific manner (THE1B in this study). We cleaned sequencing reads using fastp (v.0.23.2)52 and then mapped them to the human genome assembly (GRCh38.p13) by STAR (v.2.7.10), (--outFilterMultimapNmax 100 --winAnchorMultimapNmax 100). We extracted THE1B elements from GRCh38_GENCODE_rmsk_TE.gtf (https://labshare.cshl.edu/shares/mhammelllab/www-data/TEtranscripts/TE_GTF/). We calculated read quantification for individual ERV locus (THE1B elements in this study) using Telescope (v.1.0.3) across SM individuals and other muscular diseases, including multiple mapped reads. We identified differentially expressed THE1B elements by running the two-sided Likelihood ratio test (LRT) using R package DESeq2 (v.1.34.0) for the count matrices from TE quantification. We created a new BED file that contained differentially expressed ERV loci (named as “DE-ERV file” hereinafter).

Step 2: Intronome analysis in individuals with a particular disease (SM in this study). STAR (v.2.7.10) alignment followed by running coffee.py enabled us to obtain intronome in samples of interest from one of the output files (JUNC file).

Step 3: Extracting introns whose one end were within the differentially expressed ERV loci. We split the intronome obtained in Step 1 into a list of 5’-ends of introns and 3’-ends of introns and created a new BED file which contained both lists with information about directions of original introns taking their directions into account (named as “exon-intron boundary file” hereinafter). We intersected between the DE-ERV file and the exon-intron boundary file using BEDTools (v.2.27.1) intersect followed by extracting introns whose one side were overlapped with ERVs with information about directions of original introns. Next, we created a BED file which contained a list of one end of introns whose opposite ends are within ERV. We named this BED file as a “fused ERV file.”

Step 4: Extracting ERV loci fused with a neighbor exon of a human gene. We intersected between the fused ERV file and a GTF file which contains exome of human genes (Homo_sapiens.GRCh38.105.gtf in this study) using BEDTools (v.2.27.1) intersect taking their directions into account. This analysis resulted in a list of introns between ERV loci and exons of human genes.

Data curation from GEO, NCBI

We retrieved and cleaned publicly available bulk RNA-seq data on GEO, NCBI, using prefetch (v.3.0.0) (https://github.com/ncbi/sra-tools) and fastp (v.0.23.2). Some retrieved FASTQ files were broken and then repaired using repair.sh (BBMap v.38.98) (https://github.com/BioInfoTools/BBMap/blob/master/sh/bbmap.sh). Cutaneous sarcoidosis bulk RNA-seq and scRNA-seq data were deposited from Damsky et al.27 (GSE169146 and GSE169147). Tuberculosis scRNA-seq data were deposited by Wang et al.34 (GSE192483). We converted FASTQ files from SRA files using the fasterq-dump command (v.2.11.0).

Test for group-specific treatment effects

Cutaneous sarcoidosis bulk RNA-seq data27 (GSE169146) consisted of two groups (complete responders and partial responders). Each group has three individuals and their pre- and post-treatment data. We analyzed this data according to “Group-specific condition effects, individuals nested within groups” in DESeq2 vignettes (http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#group-specific-condition-effects-individuals-nested-within-groups).

Quantification of THE1B containing transcripts using Salmon

We quantified THE1B containing transcripts using Salmon (v.1.9.0) with selective alignment mode, (--numGibbsSamples 100). We built the “SAF” salmon index using the human genome assembly (GRCh38.p13) as a decoy. When building the salmon index, we used a reference transcriptome containing both transcripts from Ensembl release 107 and differentially expressed THE1B containing transcripts detected by Telescope screening (Table 1). The THE1B-containing transcripts were basically based on the de novo transcript assembled sequences, but small indels were repaired compared with the reference genome manually. We identified differentially expressed THE1B elements by running the two-sided Wald test using R package DESeq2 (v.1.34.0)53 for the count matrices from TE quantification.

RT-qPCR

We performed TaqMan qPCR assay with TaqMan Fast Advanced Master Mix (Applied biosystems) on Applied Biosystems 7500 Real-Time PCR system (Applied Biosystems). All primers and probes used in this study were listed in Supplementary Data 8. The quantitative expression of targeted genes was normalized to TBP (NM_003194). Each experiment was repeated at least two times. We performed a two-sided permuted Brunner-Munzel test for statistical analysis using brunnermunzel R package (v.2.0), followed by Benjamini-Hochberg correction using p.adjust() R function.

Single-cell RNA-seq (snRNA-seq) data analysis for cutaneous sarcoidosis samples

We first processed the data using Cell Ranger (v.7.0.0, 10x genomics) to estimate the number of cells. We established an hg38 custom reference containing some differentially expressed THE1B containing transcripts using the Cell Ranger makeref pipeline. We performed demultiplexing, and sequence alignment to the hg38 custom reference using the Cell Ranger count pipeline. In the Cell Ranger pipeline, multimapping reads will be discarded54. Therefore, we quantified THE1B containing transcripts using alevin-fry (v.0.8.1) in USA mode (--resolution parsimony-em). We transformed alevin-fry outputs into anndata objects55 using the pyroe loadfry() function. We annotated genes using clusterProfiler (v.4.2.2)56: and then integrated scRNA-seq data from three individuals with cutaneous sarcoidosis27 using ComBat (https://github.com/brentp/combat.py) and impute missing values using MAGIC29 (t = 6). In this imputation, we selected 25000 or more highly variable genes to avoid the risk of removing rare cells expressing THE1B-containing transcripts. We also calculated the Pearson’s correlation between two genes using SciPy squareform function and drew scatter plots using scprep (v.1.2.2) plot.scatter() function (https://github.com/KrishnaswamyLab/scprep). We illustrated the UMAP clusters using scanpy pl.umap() function57 and converted anndata h5ad file to h5Seurat file by SeuratDisk (https://github.com/mojaveazure/seurat-disk). For cell annotation, we identified highly variable features with an nfeatures of 2000 using the Seurat FindVariableFeatures() function and applied a linear transformation using Seurat ScaleData() function. We performed Principle Component Analysis using the Seurat RunPCA() function and clustered cells using the Seurat FindNeighbors() function and Seurat FindClusters() function. We performed UMAP dimensional reduction using the Seurat RunUMAP() function based on 10 principal components. We annotated scRNA-seq data using SingleR R package (v.1.8.1)58 and transformed the Seurat object to SingleCellExperiment object using as.SingleCellExperiment() function. We used HumanPrimaryCellAtlasData59 in the celldex R package (v.1.4.0) as a reference and annotated cells using SingleR() function with main labels.

Single-cell RNA-seq data analysis in tuberculosis

We performed this data analysis in the same way as scRNA-seq data analysis for cutaneous sarcoidosis samples except for the exclusion of the alevin-fry quantification and the direct integration of the count data obtained from the Cell Ranger pipeline. This is because Alevin was designed for 3’ tagged-end single-cell sequencing data (https://github.com/COMBINE-lab/salmon/blob/master/doc/source/alevin.rst), but the data from (GSE192483) were 5’ scRNA-seq library.

Nuclei extraction from fresh frozen tissue and sorting

The tissues stored at − 80 °C were sectioned to 50 µm /slice x 20-25 slices and placed in 4 ml of cold Lysis Buffer, which contains Tris-HCl pH 7.4, NaCl, MgCl2, NP 40, and nuclease-free ddH20; and 9 ml of cold bovine serum albumin (BSA) Wash Buffer, which contains BSA, RNase Inhibitor (TOYOBO, SIN-201), and phosphate-buffered saline (PBS). The tissues were homogenized using the Dounce homogenizer attached to a Wheaton Overhead Stirrer while on ice for 10 strokes. The homogenized tissues were filtered through a 100 µm cell strainer, followed by filtration into a 40 µm cell strainer. The nuclei were stained with 7AAD and incubated on ice for 15 min. The labeled samples were resuspended in a cold wash buffer and underwent sorting by flow cytometry (BD-FACS Aria Fusion instrument, BD, USA).

Single-nucleus RNA-seq

Library generation of the isolated nuclei was done using Chromium Next GEM Single Cell 3’ Reagent Kits v3.1 (Dual Index) (10x Genomics) as described in the integrated protocol of 10x Genomics containing the cDNA generation, amplification, and library construction pipeline. cDNA and post-library construction quality control were done using the Agilent High Sensitivity DNA Kit and Agilent 2100 Bioanalyzer (Agilent Technologies). The fragment length range was set from 200 to 9000 bp, and the peak of the size distribution of the indexed sample was in the range of the 450–500 bp length region. Paired-end, dual-index sequencing was performed using NovaSeq 6000 (Illumina). The target read depth was 50,000 to 100,000 reads per nuclei library. Downstream analysis, including quality control, integration, imputation, clustering, plotting, and calculation of Pearson’s correlation, were performed in the same way as the scRNA-seq data analysis described above. We manually performed cell type annotation referring to previously reported marker genes60,61.

Alphafold prediction

We used nf-core62 protein fold pipeline (https://nf-co.re/proteinfold) to run AlphaFold (v.2.3). We draw the predicted structures using PyMOL 2.3.0 (https://pymol.org/2/).

Statistics and reproducibility

No statistical method was used to predetermine the sample size. No data were excluded from the analyses. The experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment. Methods for statistical hypothesis testing and exact n numbers were stated in the corresponding figure legends and materials and methods.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.