I have been experimenting with Neo4j and indexes for the last week. I generated fake data with the faker library: 10k nodes labelled A with the attributes a1, a2, a3 and a4, and 10k nodes labelled B with the attributes b1, b2 and b3. I set it up so that a1, b1 and b2 each take one of 100 values, repeated across the 10k nodes. I created two sets of indexes, one covering all the attributes and one covering only a few, and I wanted to compare the speeds.
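    import pandas as pd
    import timeit

    # `graph` is assumed to be an already-connected graph object
    # (e.g. a py2neo Graph); the connection code is not shown here.

    # Composite indexes covering every attribute
    query_A_index_all = "CREATE INDEX IF NOT EXISTS FOR (a:A) ON (a.a1, a.a2, a.a3, a.a4)"
    query_B_index_all = "CREATE INDEX IF NOT EXISTS FOR (b:B) ON (b.b1, b.b2, b.b3)"

    # Indexes covering only a subset of the attributes
    query_A_index_partial = "CREATE INDEX IF NOT EXISTS FOR (a:A) ON (a.a1, a.a4)"
    # Note: this one targets :Address(address), not the :B label
    query_B_index_partial = "CREATE INDEX IF NOT EXISTS FOR (a:Address) ON (a.address)"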
I created a data frame with the node properties plus the internal node IDs (the A_index and B_index columns).
    def create_dataframe_with_indexes():
        # Fetch data from Neo4j and construct DataFrame
        query = """
        MATCH (a:A), (b:B)
        RETURN a.a1 AS A1, a.a2 AS A2, a.a3 AS A3, a.a4 AS A4,
               b.b1 AS B1, b.b2 AS B2, b.b3 AS B3,
               id(a) AS A_index, id(b) AS B_index
        LIMIT 10000"""
        result = graph.run(query).data()
        df = pd.DataFrame(result)
        return df
Next I created two functions: one to loop through the data frame and create the relationships, and another to time how long that took.
    def time_relationship_creation(create_relationships_func, df):
        # Run the relationship creation once and report the wall-clock time
        execution_time = timeit.timeit(lambda: create_relationships_func(df), number=1)
        print(f"Relationship creation took {execution_time} seconds")

    # Define function to create relationships
    def create_relationships(df):
        for _, row in df.iterrows():
            A_index = row['A_index']
            B_index = row['B_index']
            # Create relationship between A and B nodes
            # (the end node was written as an unbound `p`, which would create
            # a new empty node on every call; `b` is what was intended)
            query = f"MATCH (a:A), (b:B) WHERE id(a) = {A_index} AND id(b) = {B_index} CREATE (b)-[:HAS_RELATION]->(a)"
            graph.run(query)
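For comparison, here is a minimal sketch of the same loop using query parameters instead of formatting the IDs into the query string. It assumes the relationship is meant to run from the B node to the A node, as the code comment suggests. `LINK_QUERY` and `create_relationships_parameterized` are hypothetical names, and `run` stands for any callable that accepts a Cypher string and a parameter dict (I believe `graph.run` has that shape, but treat it as an assumption).

```python
from typing import Any, Callable, Iterable, Mapping

# Same statement as in create_relationships, but with $-parameters
# instead of values formatted into the string.
LINK_QUERY = (
    "MATCH (a:A), (b:B) "
    "WHERE id(a) = $a_id AND id(b) = $b_id "
    "CREATE (b)-[:HAS_RELATION]->(a)"
)


def create_relationships_parameterized(
    run: Callable[[str, Mapping[str, Any]], Any],
    rows: Iterable[Mapping[str, Any]],
) -> int:
    """Issue one parameterized query per row and return the number of queries sent."""
    count = 0
    for row in rows:
        # int() converts NumPy integers (as found in a DataFrame) to plain ints
        params = {"a_id": int(row["A_index"]), "b_id": int(row["B_index"])}
        run(LINK_QUERY, params)
        count += 1
    return count
```

Under those assumptions it could be called as `create_relationships_parameterized(graph.run, df.to_dict("records"))`. Like the original loop, it looks nodes up by internal ID and never touches the indexed properties.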
And finally, I ran both setups:
    # For all attributes indexed
    graph.run(query_A_index_all)
    graph.run(query_B_index_all)
    df_all = create_dataframe_with_indexes()  # Function to create DataFrame with indexes
    time_relationship_creation(create_relationships, df_all)

    # For a subset of attributes indexed
    graph.run(query_A_index_partial)
    graph.run(query_B_index_partial)
    df_partial = create_dataframe_with_indexes()  # Function to create DataFrame with indexes
    time_relationship_creation(create_relationships, df_partial)
The main point of this test with the fake data was to see whether there was any difference in speed between indexing more or fewer attributes. The timeit output was very similar in both cases. I expected a bigger difference.
    Relationship creation took 105.32913679201738 seconds
    Relationship creation took 105.4880417920067 seconds
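To put the two timings in perspective, here is a quick sketch that works out the gap between them. The two numbers are copied from the output above; `relative_difference` is just a hypothetical helper name.

```python
# Timings copied from the two runs above (seconds)
t_all = 105.32913679201738      # all attributes indexed
t_partial = 105.4880417920067   # subset of attributes indexed


def relative_difference(t_first: float, t_second: float) -> float:
    """Absolute gap between two timings as a fraction of the faster one."""
    return abs(t_second - t_first) / min(t_first, t_second)


diff_seconds = t_partial - t_all
diff_percent = 100 * relative_difference(t_all, t_partial)
print(f"Difference: {diff_seconds:.3f} s ({diff_percent:.2f} %)")
# prints: Difference: 0.159 s (0.15 %)
```

So the two runs differ by roughly 0.16 seconds out of about 105, and since each was timed only once, that could easily be run-to-run noise.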
I was just wondering whether this output makes sense, or whether I have overlooked something in my code. Maybe this is the expected output and I am just confused by it. Any ideas are appreciated. Thanks.