Using Gephi Network Graphs to Analyze Unlinked Family Clusters

In the last post, I presented a strategy for expanding a genetic network of shared matches all within a DNA testing website. The strategy is called Viewed Match Switching. Now, I present a similar but more holistic strategy using Gephi network graphs. This process is more demanding of our mental and computational resources, but it is worth your consideration.

While you have likely seen Gephi network graphs presented elsewhere in genealogical research (see the image to the right), this post is different in that I concentrate more on interpretation of unknown clusters of shared matches we refer to as unlinked family clusters rather than how to create them[1]. Our community is fortunate in that two pioneering genetic genealogists have created tutorials to address data collection and the novel software used for analysis (Gephi), which has afforded me the opportunity to focus principally on interpretation. Nevertheless, I do add to their processes in meaningful ways to aid the analysis.

An image of a network graph of autosomal DNA matches using Gephi software.

Even if you don’t believe you would ever create a network graph, you can learn a great deal from this post about DNA match analysis from how its results are interpreted in the graphs.

In this post, I present the four items related to network graphs listed below. I use my cousin’s autosomal DNA matches from Ancestry, which I have used in the genetic network series and the viewed match switching post to learn more about our shared Hill line. I apologize for the article’s length, but I believe it’s worth your time.

  1. A brief overview of Gephi network graphs.
  2. Research objectives using network graphs.
  3. How to create network graphs.
  4. How to interpret network graphs to answer research objectives.

Gephi Network Graph Overview

Network graphs visually organize our autosomal DNA matches into clusters based on the shared DNA each match has with one another. If set up correctly within Gephi, clusters correlate to specific ancestral lines of the DNA tester. Clusters are color coded for easier identification and analysis. Each data point or node within the graph represents a DNA match with the size of the node corresponding to the amount of shared DNA in centiMorgans (cM) with the DNA tester. The lines connecting a match to other matches indicate that the respective nodes match each other in addition to the DNA tester.

Match nodes more densely grouped share a most recent common ancestor (MRCA) among themselves, e.g., maternal great grandparents. Clusters positioned closer to one another share more genetic connections to each other and suggest their ancestral lines are more closely related to one another. Spatially close clusters represent different MRCAs but perhaps on the same ancestral line of the DNA tester.
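To make the node and link description above concrete, here is a minimal sketch of the underlying data structure in plain Python. It is not how Gephi stores a graph internally; the match names, cM values, and shared-match pairs are invented sample data, and `build_graph` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical sample data: match name -> shared cM with the DNA tester.
matches = {"A": 82.0, "B": 41.0, "C": 23.0, "D": 17.0}

# Hypothetical in-common-with (ICW) pairs: two matches who also match each other.
icw_pairs = [("A", "B"), ("B", "C"), ("A", "C")]


def build_graph(matches, icw_pairs):
    """Return (node_sizes, adjacency) for an undirected match graph."""
    node_sizes = dict(matches)  # node size tracks shared cM with the tester
    adjacency = defaultdict(set)
    for left, right in icw_pairs:
        # Keep only links whose endpoints are both in the match list.
        if left in matches and right in matches and left != right:
            adjacency[left].add(right)
            adjacency[right].add(left)
    # Matches with no shared matches still appear, with an empty neighbor set.
    return node_sizes, {name: set(adjacency[name]) for name in matches}


node_sizes, adjacency = build_graph(matches, icw_pairs)
for name in sorted(adjacency):
    print(name, node_sizes[name], sorted(adjacency[name]))
```

In this toy data, A, B, and C all match one another and would be drawn as a tight group, while D has no shared matches and would sit apart.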

Network Graph Research Objectives

Appropriate objectives when using network graphs are generally two-fold: 1) to confirm previously known lineage as identified by clusters possessing the correct MRCA and 2) to discover unlinked family clusters, which may inform subsequent documentary research and DNA analysis to learn the identities of unknown ancestors.

To explicate the use of network graphs in this post, I use an inquiry into my Hill ancestry. As such, the objective here is to:

  1. Identify unlinked family clusters associated with my Hill line beyond what is currently known for the earliest known ancestor, William Hill (1775-1836), and
  2. Assess how the unlinked family clusters relate to one another to inform subsequent research efforts to construct generations of my Hill family further back in time.

As a brief introduction, William Hill (1775-1836) first appeared in Lycoming County, Pennsylvania in 1805 and removed to Northeastern Ohio by 1811. Through my initial genetic network analysis (see Part 5 and Part 7 of the genetic network series) and other analyses (see the Viewed Match Switching post), I have surmised that my Hill line likely originated from Dorchester County, Maryland near the Choptank and Blackwater River areas, arriving in Dorchester about 1669. The immigrant ancestors were William (d. 1701) and Agatha (d. 1730) Hill. My 4x great grandfather likely descends through William and Agatha’s son, John, who moved from Dorchester County to Kent County, Delaware in 1737. However, the generations in between my 4x great grandfather and the Hills of Dorchester are unknown.

Creating the Gephi Network Graph

Required Software. To create network graphs, you need to download your DNA matches and their shared matches using DNAGedcom, which charges a monthly ($5) or annual ($50) fee. DNAGedcom requires their local client software be downloaded for data collection to access Ancestry, 23andMe, or FamilyTreeDNA match details. You also need Gephi software to visualize the collected match data.

Ready Your Match Data. Before collecting data, it’s critical you have viewed and organized at least your larger matches (e.g., 75+ cM), especially if your research interests are further back in your tree. These larger matches help you identify clusters as belonging to specific ancestral lines. Using the Leeds Method, Ancestry’s Common Ancestor or ThruLines® hints, or manual review of your matches’ family trees, you should add notes within the DNA testing website to each of these larger matches indicating the presumed ancestral line. This data is carried forward from the DNA testing website to the DNAGedcom file used for analysis. For example, Ancestry may suggest the common ancestor between me and my match is William Hill and Elizabeth Winland with the match descending through their son William and me through their daughter Susan. Therefore, I write in the note field, “Hill/Winland via William” so I can later understand how this match is related to me using the MRCA.

Data Collection. The downloading process of match information using the DNAGedcom client can take several days depending on the number of matches requested for download, but it’s worth the wait. To facilitate it, I recommend disabling your screen saver or sleep mode so your computer will run all day or night.

Gephi Visualization. Once you have your data files and you are ready to use Gephi, I recommend using the process outlined by Nicole Dyer of Family Locket and Research Like A Pro®. Her step-by-step process is easy to follow with plenty of visuals. I also recommend watching Dave Vance’s YouTube video prior to using Nicole’s process as it is a good introduction to what you will ultimately do although he presents it a bit differently.

Here are a few items I do differently than recommended by Nicole and Dave.

  1. Limit Collected Data. During data collection using the DNAGedcom client, I collect only segment information and in common with (ICW, i.e., shared matches) data. The other data options are not needed and slow down data collection.
  2. Use Smaller Subset of Matches. After importing your match data into Gephi, I recommend using only matches between 15 and 100 cM. This will help you better visualize ancestral lines at the 2x great grandparent level and simplify the network’s presentation. While matches below 15 cM can offer insight into unknown ancestors, their presence in the network creates too much noise (too many nodes) making interpretation more difficult. Likewise, matches above 100 cM have too many connections to other nodes preventing unique clusters from forming.
  3. Choose Your Gravity Setting. How you wish to visually present your clusters is a matter of preference. The image shown below and after the last point (#4) presents three methods for using the Gravity settings within the Tuning component of the ForceAtlas 2 layout within Gephi. I prefer option A (no Stronger Gravity selection and Gravity set to 1.0). Here, I can better see how clusters relate to one another and the important matches that help make that determination. The downside is that it creates a larger and more spread out graph area. Options B and C provide a denser layout with varying degrees of space in between clusters that may hinder or aid how clusters are related. In the next step (removing maternal or paternal matches), you may need to again run ForceAtlas 2 with your preferred gravity settings.
  4. Remove Maternal or Paternal Matches. Once you create your clusters using ForceAtlas 2 and run the Modularity Class feature, I remove all the maternal or paternal matches depending on my research question. While this step is optional, I find it simplifies the graph’s presentation to only the side of the tree that is of interest.
    • Organize the Data Laboratory. I review the Data Laboratory tab organizing matches first by clicking on the sharedcm column header to organize entries by the amount of shared DNA and then clicking on the modularity class column header. This groups matches by modularity class (approximating ancestral lines) and then by shared cM.
    • Strategically Remove Modularity Class Groups. For each modularity class group, I randomly select three to five matches making sure I select matches across the spectrum of shared cM. I then go back into the DNA testing website for each selected match to determine which side of the family tree the match belongs to. I use the name and admin fields in the Data Laboratory tab to help me find the match in the DNA testing website. I assess whether most of the randomly selected matches within each modularity class are maternal or paternal, and I delete all matches in that modularity class within the Data Laboratory tab if I determine that the cluster is not on the side of the family tree being researched. Once all appropriate matches are deleted, I repeat the cluster formation (ForceAtlas2), modularity class, and coloration processes outlined in Nicole Dyer’s tutorial to reorient clusters in the graph.
Visual presentation of the gravity setting options for visualizing genetic network clusters in Gephi
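For readers who prefer to prepare or check their data outside Gephi, the 15 to 100 cM filter (item 2) and the per-class review (item 4) can be sketched in Python. This is only an illustration under assumptions: the rows are invented, `sharedcm` is the column name mentioned above, and the `name` and `modularity_class` keys are assumed spellings. Note also that `spread_sample` picks evenly spaced matches from the cM-sorted list rather than selecting randomly as I do by hand.

```python
from collections import defaultdict

# Hypothetical rows mimicking a Data Laboratory export.
rows = [
    {"name": "M1", "sharedcm": 12.0, "modularity_class": 0},
    {"name": "M2", "sharedcm": 95.0, "modularity_class": 0},
    {"name": "M3", "sharedcm": 60.0, "modularity_class": 0},
    {"name": "M4", "sharedcm": 33.0, "modularity_class": 0},
    {"name": "M5", "sharedcm": 18.0, "modularity_class": 0},
    {"name": "M6", "sharedcm": 120.0, "modularity_class": 0},
    {"name": "M7", "sharedcm": 27.0, "modularity_class": 1},
    {"name": "M8", "sharedcm": 16.0, "modularity_class": 1},
    {"name": "M9", "sharedcm": 45.0, "modularity_class": 0},
]


def filter_by_cm(rows, low=15.0, high=100.0):
    """Keep matches sharing between low and high cM, inclusive."""
    return [r for r in rows if low <= r["sharedcm"] <= high]


def group_by_class(rows):
    """Group rows by modularity class, largest shared cM first."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["modularity_class"]].append(r)
    for members in groups.values():
        members.sort(key=lambda r: r["sharedcm"], reverse=True)
    return dict(groups)


def spread_sample(members, k=3):
    """Pick up to k rows evenly spaced across a cM-sorted list."""
    if len(members) <= k:
        return list(members)
    if k == 1:
        return [members[0]]
    step = (len(members) - 1) / (k - 1)
    return [members[round(i * step)] for i in range(k)]


kept = filter_by_cm(rows)
picks = {cls: spread_sample(members) for cls, members in group_by_class(kept).items()}
for cls in sorted(picks):
    print(cls, [(r["name"], r["sharedcm"]) for r in picks[cls]])
```

The selected matches still have to be looked up on the DNA testing website to decide whether the class is maternal or paternal; the code only chooses which matches to review.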

Interpreting Gephi Network Graphs

Here, I use my previously stated research objective to explicate how to analyze the network graph to gain greater insight into the ancestral line of interest.

You first start with what you know. The image below presents the maternal ancestral lines for my cousin who shares my Hill ancestry. For anonymity and simplicity reasons, I only present his known ancestral lines at the 2x great grandparent level, which corresponds well to his network graph.

Maternal family tree at the second great grandparent level.

Using the DNAGedcom client, I downloaded my cousin’s Ancestry matches and used those having between 15 and 100 cM of shared DNA. As previously described, I removed all his paternal matches. Then, using the notes field in the Data Laboratory tab within Gephi as a guide, I labeled each cluster with the presumed ancestor or ancestral couple it represents. My notes came from my earlier analyses of my cousin’s matches (see Part 5 and Part 7 of the genetic network series) and Ancestry’s suggested common ancestor and ThruLines® hints. From this, I was able to classify every cluster as either affiliated with a confirmed MRCA or a newly discovered unlinked family cluster. Unlinked family clusters offer potential hints to unknown ancestral lines.

Maternal ancestral lines for known ancestors and unlinked family clusters are labeled on the Gephi network graph
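The labeling step can be approximated with a simple tally of notes per modularity class. The sketch below is an assumption-laden illustration, not my exact workflow: the rows are invented, the `note` and `modularity_class` keys are assumed spellings, and notes are assumed to follow the “Hill/Winland via William” pattern described earlier. A class with no notes is only flagged as a candidate; deciding whether it is a true unlinked family cluster still requires reviewing the matches’ trees.

```python
from collections import Counter, defaultdict

# Hypothetical rows; notes follow the "<ancestral line> via <child>" pattern.
rows = [
    {"name": "M1", "modularity_class": 0, "note": "Hill/Winland via William"},
    {"name": "M2", "modularity_class": 0, "note": "Hill/Winland via Mary"},
    {"name": "M3", "modularity_class": 0, "note": ""},
    {"name": "M4", "modularity_class": 1, "note": ""},
    {"name": "M5", "modularity_class": 1, "note": ""},
]

CANDIDATE = "no notes (candidate unlinked family cluster)"


def label_clusters(rows):
    """Label each modularity class with its most common ancestral-line note."""
    tallies = defaultdict(Counter)
    classes = set()
    for r in rows:
        classes.add(r["modularity_class"])
        note = r["note"].strip()
        if note:
            # Text before " via " names the ancestral line.
            tallies[r["modularity_class"]][note.split(" via ")[0]] += 1
    labels = {}
    for cls in sorted(classes):
        counts = tallies[cls]
        labels[cls] = counts.most_common(1)[0][0] if counts else CANDIDATE
    return labels


labels = label_clusters(rows)
for cls, label in labels.items():
    print(cls, label)
```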

General Orientation

Near the center of the graph is a red cluster representing the confluence of his Wilson, Boyd, and Hill lines. The blue and lime green clusters to the upper right represent his Pettit and Freed lines, respectively. Closely connected to the Freed cluster is an unlinked family cluster (fern green) largely associated with the surname Bixler from Pennsylvania, which I discovered in other analyses of my cousin’s DNA matches. Pennsylvania is where the Freed line originated, too. It is probable that Bixler is somehow connected to Mary, the wife of my cousin’s ancestor Benjamin Freed, whose maiden name is unknown. That is a research project for another day.

In the lower right-hand corner are the ancestral clusters associated with my cousin’s Irish lines from County Mayo (mauve and fuchsia pink: Duffy, Kelly, and Minnigan) and County Leitrim (burnt orange: Cullen and Gallagher). These lines are more recent immigrants (mid 1800s) compared to the other clusters who were pre-Revolutionary War and Colonial Pennsylvania immigrants.

The remaining clusters appear connected to my and my cousin’s Hill line because the dominant members of each cluster include descendants of William Hill (1775-1836) through multiple child lines in addition to members of the MRCA for the unlinked family cluster. The orange cluster represents Henry Winland and Dorthea Pence, the parents of William Hill’s (1775-1836) first wife, Elizabeth Winland (1777-1822). My cousin and I descend from Elizabeth Winland. The light purple cluster to the far left (Knortzer/Upp and Owens) and the seafoam green, pink, and army green clusters near the bottom center (Clark, Harris, Keel, Linn, Matkins, and Vickers) are unlinked family clusters, which are analyzed next as part of the research objective for this project.

Bridges, Hubs and Paths

I know the previously listed unlinked family clusters are related to the Hill line because there are several Hill-descendant matches found in strategic locations within the network graph (see the image below). All the highlighted matches descend through William Hill (1775-1836) and his wife Elizabeth Winland through three different child lines: Susan (my cousin’s line), Mary, and William. The matches identified with a solid line are bridge nodes in that the connections between my cousin’s most recent Wilson/Hill MRCA (red cluster) and the Knortzer/Upp and Owens unlinked family cluster (light purple) go through these matches. If these matches did not exist, the Knortzer/Upp and Owens cluster would be largely disconnected from the graph.

Bridge and hub matches linking clusters together

The matches identified with a dashed line provide an alternate path between the unlinked family clusters (light purple, seafoam green, and army green) without going through my cousin’s most recent Wilson/Hill MRCA (red cluster). These nodes act like a hub and tie together the disparate unlinked family clusters. The hub nodes are not simply random matches but matches having an MRCA within the DNA tester’s family tree that connects clusters along related ancestral lines. In my cousin’s case, it’s the Hill line. For these hub and bridge nodes to have ancestral meaning, they should represent several different child lines of a common MRCA.
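The bridge and hub ideas can be tested on a graph with a breadth-first search: remove the candidate matches and see whether one cluster can still reach another. The following is a toy sketch with invented node names, not my cousin’s data, and `cuts_off` is a hypothetical helper.

```python
from collections import deque

# Toy graph: a red cluster, a purple cluster, and a green cluster.
adjacency = {
    "red1": {"red2", "bridge"},
    "red2": {"red1"},
    "bridge": {"red1", "purple1"},
    "purple1": {"bridge", "purple2", "hub"},
    "purple2": {"purple1"},
    "hub": {"purple1", "green1"},
    "green1": {"hub"},
}


def reachable(adjacency, start, removed=frozenset()):
    """Return the nodes reachable from start when removed nodes are skipped."""
    if start in removed:
        return set()
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def cuts_off(adjacency, candidates, source, target_cluster):
    """True if removing candidates severs every path from source to the cluster."""
    before = reachable(adjacency, source)
    after = reachable(adjacency, source, removed=set(candidates))
    return bool(before & target_cluster) and not (after & target_cluster)


purple = {"purple1", "purple2"}
print(cuts_off(adjacency, {"bridge"}, "red1", purple))  # bridge node
print(cuts_off(adjacency, {"hub"}, "red1", purple))     # not on the red-purple path
# Hub check: purple still reaches green when the red cluster is ignored.
print("green1" in reachable(adjacency, "purple1", removed={"red1", "red2"}))
```

In a real graph, bridges are rarely this clean, which is why I describe the cluster as becoming "largely" rather than completely disconnected.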

Knortzer/Upp and Owens Cluster

The Knortzer/Upp and Owens cluster contains two unlinked family subclusters. The MRCA for one subcluster is George Christoph Knortzer (1723-1793) and Anna Upp from York County, Pennsylvania with descendants who migrated to North Carolina. The other subcluster is represented by David Owens (1759-1822) from Rockcastle County, Kentucky with prior origins in North Carolina. The connection between the two groups is unknown but may reside in North Carolina.

The cluster contains 237 matches, which are densely compacted indicating most matches in this cluster also match one another. While Ancestry does not provide segment data, I was able to find several of the unlinked family cluster matches on other DNA testing websites that provide chromosome data (see an earlier post for how cross-database identification can be accomplished). It appears that cluster matches are on chromosome 1 (approximate genomic position 102417093-164744793).
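One way to put a number on "densely compacted" is graph density: the share of all possible match-to-match pairs within a cluster that are actually linked. With 237 matches there are 237 × 236 / 2 = 27,966 possible pairs. I did not compute this figure for my cousin’s cluster; the sketch below uses invented data and a hypothetical `cluster_density` helper purely to show the calculation.

```python
def cluster_density(members, adjacency):
    """Fraction of possible pairs within members that are linked (0.0 to 1.0)."""
    members = set(members)
    n = len(members)
    if n < 2:
        return 0.0
    # Each undirected link is counted once from each endpoint.
    twice_links = sum(len(adjacency.get(m, set()) & members) for m in members)
    return twice_links / (n * (n - 1))


# Toy data: A, B, and C all match one another; D matches only A.
adjacency = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

tight = cluster_density({"A", "B", "C"}, adjacency)
loose = cluster_density({"A", "B", "C", "D"}, adjacency)
print(round(tight, 2), round(loose, 2))
```

A density near 1.0 means nearly every match in the cluster also matches every other member.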

The amount of shared cM for members of the cluster ranges from 16 cM to 42 cM,[2] which is generally consistent with the 6th to 8th cousin relationship. It is therefore possible that these matches relate to William Hill (1775-1836) through one of his grandparents, who may have carried the Knortzer, Upp, or Owens surname. In particular, the York County, Pennsylvania origins of the Knortzer/Upp MRCA in the mid-1700s are consistent with the presumed migration route and timing of the Hills from Delaware or Maryland to Lycoming County, Pennsylvania. In fact, the Pence family, which is connected to William Hill’s wife Elizabeth Winland, is reportedly from York County as well, which may hint at how William ended up in Lycoming.

Pile-up Region?

An alternate explanation for the dense appearance of the Knortzer/Upp and Owens cluster is that it is associated with a pile-up region on a chromosome where more matches are found than expected and may therefore represent older shared ancestry beyond a genealogically relevant timeframe. However, because I can trace a large majority of the matches back to an MRCA from a geographically relevant location for my cousin’s ancestry, it is probably not a pile-up region.[3] More importantly, I find other clues that it is not a pile-up region in that it is aligned with my Hill ancestry, i.e., the previously presented bridge matches between the Knortzer/Upp and Owens and Wilson/Hill clusters.

Confounded Ancestry?

If you review the previously presented image of the full network graph containing all clusters, you may notice seemingly stray links between the Knortzer/Upp and Winland/Pence clusters. This might be a cause for concern as the Pence line is believed to share Upp ancestry from York County, Pennsylvania,[4] but the exact connection is uncertain. Also, Knortzer, Upp, Winland, and Pence are all Pennsylvania German lines.

The potential confounds can be allayed with the observation that the links between the two clusters are few and nearly all links between clusters are to other descendants of William Hill (1775-1836) and Elizabeth Winland rather than Winland or Pence cousins. It’s also worth noting that the Knortzer/Upp cluster is more distanced from the other clusters, especially the central Wilson/Hill red cluster and the Winland/Pence orange cluster. My other research has confirmed that the Wilson and related Boyd lines are largely Scottish with a bit of English. So, it’s not unexpected that the German Knortzer/Upp cluster is more distanced from the other clusters. Indeed, my cousin’s Irish clusters are similarly distanced.

Clark, Harris, Keel, Linn, and Vickers Clusters

Earlier blog posts in the genetic network series (Part 5 and Part 7) and the viewed match switching post indicated the Clark, Harris, Keel, Linn, and Vickers unlinked family clusters were on my Hill side as well. The seafoam green cluster contains all these surnames. The pink cluster only contains Clark and Keel members while the army green cluster contains Clark and Linn as well as a smaller unlinked family group of Matkins identified in the viewed match switching post.

While it is unclear how these three unlinked family clusters connect, the Keel line, which was originally McKeel,[5] and Matkins line come from Dorchester County, Maryland from the early 1700s, which is where the Hill line is believed to have originated in the late 1600s. The Clark line is believed to come from Kent County, Delaware[6] in the mid 1700s where the Hill line resided in between Dorchester and Pennsylvania.[7] Linn is from North Carolina in the late 1700s, and Harris is from Arkansas in the mid-1800s. Vickers is from Tennessee from the early 1800s.

The seafoam green cluster contains 111 matches with unlinked family cluster members ranging from 15 to 43 cM; the army green cluster has 107 matches (15 to 37 cM range); and the pink cluster has 80 matches (15 to 37 cM range). There are fewer matches here vis-à-vis the Knortzer/Upp cluster due perhaps to the connection being further back in time or fewer descendants and/or testers. The cM range is generally consistent with the 6th to 8th cousin relationship. These clusters are intriguing due to the overlapping occurrences of MRCAs (Clark, Keel, and Linn) across the three clusters despite each cluster distinctly forming separately. Something appears to separate the groups despite common ancestry.

Multiple DNA Segments?

There are 478 links between my cousin’s most genetically recent Hill cluster (red) and the seafoam green (Clark, Keel, Harris, Linn, Vickers), army green (Clark, Linn), and pink (Clark, Keel) clusters and an additional 135 links among the three unlinked family clusters. It is interesting that only 22% (135) of the links are among the clusters with common MRCAs. I suspect the spatially distinct but common unlinked family cluster surnames might be related to inherited DNA on different chromosomes or different sections on the same chromosome. Indeed, I found several of the cluster member matches on other testing websites bearing this out:

  • Seafoam green cluster (Clark, Keel, Harris, Linn, Vickers), chromosome 10 (approximate genomic position 34397550-64662627)
  • Pink cluster (Clark, Keel), chromosome 5 (approximate genomic position 13599516-52636896)
  • Army green cluster (Clark, Linn, Matkins), chromosome 2 (approximate genomic position 201423387-216783829)
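The 22% figure comes from dividing the 135 links among the three unlinked family clusters by all 613 links counted (478 + 135). The sketch below shows how such counts could be tallied from a list of links and a cluster assignment for each match. The toy links and cluster names are invented for illustration; only the final calculation uses the figures reported above.

```python
from collections import Counter


def count_links_between(links, cluster_of):
    """Count links whose two endpoints sit in different clusters."""
    counts = Counter()
    for left, right in links:
        pair = frozenset((cluster_of[left], cluster_of[right]))
        if len(pair) == 2:  # skip links inside a single cluster
            counts[pair] += 1
    return counts


# Toy data: one match per cluster.
cluster_of = {"r1": "red", "s1": "seafoam", "p1": "pink", "a1": "army"}
links = [("r1", "s1"), ("r1", "p1"), ("r1", "a1"), ("s1", "p1")]

counts = count_links_between(links, cluster_of)
to_red = sum(n for pair, n in counts.items() if "red" in pair)
among_unlinked = sum(n for pair, n in counts.items() if "red" not in pair)
print(to_red, among_unlinked)

# Figures reported in this post.
reported_share = 135 / (478 + 135)
print(f"{reported_share:.0%}")
```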

Theoretical Assessment

Based on the above analysis of the network graph, I now have a theory about my Hill ancestry beyond my earliest known ancestor, William Hill (1775-1836). Assuming my prior research presented in earlier posts is correct (e.g., Genetic Network series Part 5 and Part 7 and the viewed match switching post), my working theory is as follows.

  • Dorchester County, Maryland. My Hill line originated in Dorchester County, Maryland, which is where the McKeels (i.e., Keel) and Matkins similarly resided. The Matkins matches are a result of Hannah Hill or Sarah Hill marrying a Matkins, which documentary evidence supports and is the same Hill line as the one being researched.[9] The Keel (McKeel) matches may be related to a direct ancestor (great grandmother) or a sibling of a direct Hill ancestor marrying a male McKeel. However, not enough evidence is available to draw any further conclusions.
  • Kent County, Delaware. In 1737, John Hill moved from Dorchester to Murderkill Hundred, Kent County, Delaware[10] where the Clarks are presumed to have lived before they too migrated to Pennsylvania. There are two facts that lead me to believe a son of John Hill married a Clark AND a male Clark child married a daughter of John Hill.
    • First, the Clark, Keel (McKeel), Matkins, and Hill descendants all match one another meaning they share a common ancestor. However, there is no evidence that the McKeels or Matkins spent time in Delaware, so it seems that a male Clark would have to have married a Hill for the Clarks to match both McKeels and Matkins from Maryland. Yet, there are at least two missing generations in between John Hill and my ancestor William Hill (1775-1836) so another possibility exists.
    • Second, there were three unlinked family clusters (seafoam green, army green, and pink) where Clark matches were found, and these matches were on at least three different chromosomes. With an MRCA between the Clarks and Hills being so far back in time (early 1700s), it seems unlikely that so many different unique segments would have survived this many generations. Two Hill-Clark unions are more probable.
  • York County, Pennsylvania. My Hill line ended up in Lycoming County, Pennsylvania by 1805. York County is a probable stop on the migration route up the Susquehanna River between southeastern Pennsylvania and north central Pennsylvania. York County is where the Knortzers and Upps resided in the 1700s. It is possible that the mother of my 4x great grandfather, William Hill (1775-1836), was a descendant of the Knortzer/Upp union.

The above figure visualizes my working theory. I cannot accept these hypotheses as proven until I find documentary evidence and other DNA evidence to confirm them. Direct evidence may not exist, and it may be necessary to construct a proof argument using indirect evidence.

The other unlinked family clusters of Owens (light purple), Linn (seafoam and army green), Harris (seafoam green), and Vickers (seafoam green) may offer additional insights into the constructed timeline presented above. Determining how these clusters fit into the larger Hill timeline will assist in the development of the proof argument.

In the next post, I attempt to replicate these findings with other Hill cousins to determine if these observations hold or if new evidence emerges. DNA inheritance is random, and so it is important to increase the coverage of an MRCA’s DNA, in this instance that of my Hill ancestor. Coverage refers to the amount of DNA represented in DNA testing website databases through the combined testing of multiple descendants of the target ancestor.[11]



Sources

[1] Unlinked family clusters are a relatively large group of shared matches who all descend from a single ancestor or ancestral couple but to whom you are unable to establish your genetic relationship. To be a proper unlinked family cluster, the subgroup of shared matches should descend through multiple child lines of their common ancestor within the cluster.

[2] While the lower end of the range is fixed at 15 cM because only matches sharing 15-100 cM were included in the network graph, the upper end of the range (42 cM) was not fixed.

[3] Bettinger, Blaine (2023, March 16). The Growing Phenomenon of the Unlinked Family Cluster. The Genetic Genealogist (blog), accessed 17 January 2025 from https://thegeneticgenealogist.com/2023/03/16/the-growing-phenomenon-of-the-unlinked-family-cluster/.

[4] Register of Wills, York County, Pennsylvania, Jacob Opp (1793), Volume IJ, p. 13-16, image 302-304 of 779; database with an image (www.familysearch.org), film 5534483.

[5] For example, Stark County, Ohio, Tax Records, Joseph McKeel (1834), Bethlehem Township; database with image, FamilySearch (www.familysearch.org, accessed 13 October 2024), Family History Library Film 4849277, image 319 of 417. And 1830 U.S. census, Stark County, Ohio, population schedule, Cadiz, Joseph Keil, p. 330, image 1 of 12; database with image, Ancestry (www.ancestry.com, accessed 15 October 2024); NARA series M19, roll 140.

[6] Inter-state Publishing Company (1886). Biographical and Historical Record of Wayne and Appanoose Counties, Iowa. Chicago, IL: Inter-state Publishing Company, p. 469.

[7] Dorchester County, Maryland, Recorder of Deeds, John Hill to Thomas Mackeel (1741), land deed, Old Book 10, p. 349. And Kent County, Delaware, Levy Court Commissioners, John Hill (1737), tax lists, Murderkill Hundred, image 254 of 940; database with an image (www.familysearch.org), film 7834262.

[8] Mowbray, Calvin W. and Mary I. Mowbray (1992). The Early Settlers of Dorchester County and their Lands, Volume II. Westminster, MD: Family Line Publications.

[9] Maryland, U.S., Wills and Probate Records, 1635-1777, Amy Hill (1739), Dorchester County, Vol. 22 (Book DD1), p. 125-126, image 609-610 of 823; database with image, Ancestry (www.ancestry.com, accessed 18 December 2024); citing Maryland County, District, and Probate Courts. And Dorchester County, Maryland, will, William Hill (1733), volume 20, p. 884-885, Prerogative Court, Cambridge; online database with images, Maryland State Archives (www.msa.maryland.gov, accessed 18 December 2024), will series S538-30, book CC3 (1732-1734), image 464-465 of 518.

[10] Ibid., note 7.

[11] Woodbury, Paul (2020). Covering Your Bases: Introduction to Autosomal DNA Coverage. Legacy Tree Genealogists (blog), accessed 20 January 2025 at https://www.legacytree.com/blog/introduction-autosomal-dna-coverage.
