Midterm 52002 - 2024-2025¶
General instructions: The mid-term should be done in pairs. Submit your solutions in the course by uploading the solution files to the course Moodle page by 10.1.2025
.ipynb solution file: Fill-in the missing code blocks (for parts 1 and 3)and text blocks (for all three parts) in this ipynb notebook, change the name to MidTerm_52002_2025_25_<ID1>_<ID2>.ipynb where replace <ID1> and <ID2> by your ID nuymbers. Run the filled-notebook it in jupyter notebooks/google colab and upload to moodle the filled notebook with results (tables, graphs etc.).
In addition, submit a pdf/html export of the executed notebook with all the output, named MidTerm_52002_2025_25_<ID1>_<ID2>.html (or .pdf). Failing to submit both files as instructed will lead to a reduction in your midtem grade.
Good luck!
BigQuery & SQL - 40 points¶
There are 5 sub-questions in this part. Each sub-question is worth 8 points
BigQuery is Google's serverless data warehouse that enables scalable analysis
over petabytes of data. It is a Platform as a Service (PaaS) that supports querying
using SQL. In this part of the midterm you'll be asked to query BigQuery tables
using its python API. Please use the sandbox BigQuery environment and query
only public datasets.
Please note:
Your BigQuery's resources are limited to 1 TB per user per month - be mindful with how many queries you
execute and try to optimize your queries as much as possible.
We will utilize two datasets about New-York City:
The first one called new_york_311 is the 311 calls or service requests dataset, which contains resident complaints from 2010 to 2022.
The second dataset called new_york_trees is the "2015 Street Tree Census - Tree Data," which includes information from the 1995, 2005, and 2015 Street Tree Censuses. This dataset catalogs trees by address and provides details such as species, diameter, and condition.
More information on the two datasets can be found here and here.
Guidelines:
- Fill this notebook with python commands including SQL queries in the designated places and run it using a jupyter notebook environment (e.g. google colab)
- Write efficient SQL queries and code. Points may be taken off for inefficient queries and code.
- The python API enables us to write code that interacts with BigQuery and convert between the SQL tables to python objects, thus allowing analysis and plotting using python code. You should write your SQL commands within the API. It is recommended to first browse the dataset and run the SQL command manualy in the BigQuery web-browser environment, before copying it to the python notebool.
- Your plots should be clear, with titles, and with propoer x-y labels.
You may use the
matplotliblibrary, orpandas.DataFrame.plotfor your plots. - When reading BigQuery tables, used the `` symbols to around the table's name, and use as prefix
'Big Query'and the name of the dataset. For example, to extract a table present in thenew_york_treesdataset you should use the name'bigquery-public-data.new_york_trees.<table-name>'where<table-name>is replaced by the name of the table.
Necessary Libraries:
(Do not use any libraries that are not in this list without permission in writing form the course staff)
import pandas as pd
import matplotlib as plt
from google.cloud import bigquery
from sqlite3 import connect
Q1: BigQuery client & Initital Query¶
from google.colab import auth
auth.authenticate_user()
print('Authenticated')
Authenticated
Construct a BigQuery client object using the client method of the bigquery module. The project name you use here should match the name of the project you open in the BigQuery environment. You should use in both places the same non-huji google user name
client = bigquery.Client(project="part-one-mid-term")
Make a simple query displaying five columns of your choice from one of the tables of your choice in the two datasetsin order to check the connection. Limit the number of rows in the displayed output to 10
# the SQL query
initital_query = """
SELECT tree_id, spc_common,spc_latin, health, sidewalk
FROM `bigquery-public-data.new_york_trees.tree_census_2015`
WHERE tree_id IS NOT NULL AND spc_common IS NOT NULL AND health IS NOT NULL
AND spc_latin IS NOT NULL AND sidewalk IS NOT NULL
LIMIT 10
"""
# Execute it
initital_query_job = client.query(initital_query)
# Convert to a Pandas DataFrame
df = initital_query_job.to_dataframe()
# Display the results
df.head(10)
| tree_id | spc_common | spc_latin | health | sidewalk | |
|---|---|---|---|---|---|
| 0 | 80549 | pagoda dogwood | Cornus alternifolia | Fair | Damage |
| 1 | 451320 | northern red oak | Quercus rubra | Fair | NoDamage |
| 2 | 80547 | Callery pear | Pyrus calleryana | Good | Damage |
| 3 | 451259 | pin oak | Quercus palustris | Good | NoDamage |
| 4 | 451258 | pin oak | Quercus palustris | Good | NoDamage |
| 5 | 451316 | ash | Fraxinus | Good | Damage |
| 6 | 451321 | northern red oak | Quercus rubra | Poor | Damage |
| 7 | 451315 | honeylocust | Gleditsia triacanthos var. inermis | Good | NoDamage |
| 8 | 451317 | honeylocust | Gleditsia triacanthos var. inermis | Good | NoDamage |
| 9 | 451318 | American linden | Tilia americana | Good | NoDamage |
The query retrieves meaningful data from the dataset bigquery-public-data.new_york_trees.tree_census_2015 by selecting five columns: tree_id, spc_common, spc_latin, health, and sidewalk. To ensure the data is useful, a WHERE clause filters out rows where any of these columns contain NULL values. This guarantees that all rows in the result have complete information for the selected columns. The LIMIT 10 clause restricts the output to only 10 rows.
The resulting table provides a clear and informative view of the data, showing details of 10 individual trees. Each row represents a unique tree identified by its tree_id, along with its common name (spc_common), scientific name (spc_latin), health status (health), and sidewalk condition (sidewalk) which indicate whether the tree has caused damage to its surroundings or not.
Q2: Exploratory Data Analysis¶
Remark: unless specified otherwise, refer to the following dataset when considering data for trees:
'bigquery-public-data.new_york_trees.tree_census_2015'
2a. How many trees were there in New York in 2015, and how many of these were healthy (whose health status is not Poor or null)?
2b. What are the ten most common tree species in New York? Run a query that returns a table with the number of trees for each species, the number of healthy trees and the percentage of healthy trees within each species (relative to the total trees of that species). Display the output table with these details only for the ten species with the highest overall counts.
2c. Analyze the status distribution of New York City's trees by examining each ZIP code area. Your analysis should reveal the total number of trees per ZIP code along with the percentage of trees in each of the health status categories within each ZIP code. Exclude records with missing data (zipcode or status) . Show a table with the results for the five ZIP codes with the highest number of total trees.
Answer:
# the SQL query
q2a_query = """
SELECT
COUNT(DISTINCT tree_id) AS total_tree_count,
COUNT(DISTINCT CASE WHEN health IS NOT NULL AND health != 'Poor' THEN tree_id END) AS healthy_tree_count
FROM bigquery-public-data.new_york_trees.tree_census_2015;
"""
q2a_query_job = client.query(q2a_query)
df_q2a = q2a_query_job.to_dataframe()
df_q2a
| total_tree_count | healthy_tree_count | |
|---|---|---|
| 0 | 683788 | 625354 |
total_tree_count = df_q2a.loc[0, 'total_tree_count']
total_healthy_tree_count = df_q2a.loc[0, 'healthy_tree_count']
2a. There are 683788 trees in New York in 2015, and 625345 of them were healthy.
# 2b)
# Fill query here inside triple quotes and display results:
q2b_query = """
WITH tree_stats AS (
SELECT
spc_latin,
COUNT(*) AS total_trees,
SUM(CASE WHEN health = 'Good' THEN 1 ELSE 0 END) AS healthy_trees
FROM bigquery-public-data.new_york_trees.tree_census_2015
WHERE spc_latin IS NOT NULL AND health IS NOT NULL
GROUP BY spc_latin
)
SELECT
spc_latin,
total_trees,
healthy_trees,
ROUND(healthy_trees * 100.0 / total_trees, 2) AS percent_healthy
FROM tree_stats
ORDER BY total_trees DESC
LIMIT 10;
"""
q2b_query_job = client.query(q2b_query)
df_q2b = q2b_query_job.to_dataframe()
df_q2b
| spc_latin | total_trees | healthy_trees | percent_healthy | |
|---|---|---|---|---|
| 0 | Platanus x acerifolia | 87014 | 73311 | 84.25 |
| 1 | Gleditsia triacanthos var. inermis | 64263 | 54501 | 84.81 |
| 2 | Pyrus calleryana | 58931 | 48073 | 81.58 |
| 3 | Quercus palustris | 53185 | 45556 | 85.66 |
| 4 | Acer platanoides | 34189 | 21248 | 62.15 |
| 5 | Tilia cordata | 29742 | 23582 | 79.29 |
| 6 | Prunus | 29279 | 24526 | 83.77 |
| 7 | Zelkova serrata | 29258 | 25285 | 86.42 |
| 8 | Ginkgo biloba | 21024 | 17126 | 81.46 |
| 9 | Styphnolobium japonicum | 19338 | 15832 | 81.87 |
2b. These are the ten most common tree species in New York. With the number of trees for each species, the number of healthy trees and the percentage of healthy trees within each species. We excluded None species.
# 2c)
# Your answer code with queries here:
q2c_query = """
WITH zip_tree_stats AS (
SELECT
zipcode,
COUNT(*) AS total_trees,
SUM(CASE WHEN health = 'Good' THEN 1 ELSE 0 END) AS good_trees,
SUM(CASE WHEN health = 'Fair' THEN 1 ELSE 0 END) AS fair_trees,
SUM(CASE WHEN health = 'Poor' THEN 1 ELSE 0 END) AS poor_trees
FROM bigquery-public-data.new_york_trees.tree_census_2015
WHERE zipcode IS NOT NULL AND health IS NOT NULL
GROUP BY zipcode
)
SELECT
zipcode,
poor_trees,
fair_trees,
good_trees,
total_trees,
ROUND(good_trees * 100.0 / total_trees, 2) AS percent_good,
ROUND(fair_trees * 100.0 / total_trees, 2) AS percent_fair,
ROUND(poor_trees * 100.0 / total_trees, 2) AS percent_poor
FROM zip_tree_stats
ORDER BY total_trees DESC;
"""
q2c_query_job = client.query(q2c_query)
df_q2c = q2c_query_job.to_dataframe()
df_q2c.head()
| zipcode | poor_trees | fair_trees | good_trees | total_trees | percent_good | percent_fair | percent_poor | |
|---|---|---|---|---|---|---|---|---|
| 0 | 10312 | 1071 | 3594 | 16691 | 21356 | 78.16 | 16.83 | 5.01 |
| 1 | 10314 | 662 | 2340 | 13328 | 16330 | 81.62 | 14.33 | 4.05 |
| 2 | 10306 | 369 | 1801 | 10446 | 12616 | 82.80 | 14.28 | 2.92 |
| 3 | 10309 | 593 | 1623 | 9889 | 12105 | 81.69 | 13.41 | 4.90 |
| 4 | 11234 | 341 | 1270 | 9227 | 10838 | 85.14 | 11.72 | 3.15 |
2c) The table shows the results for the five ZIP codes with the highest number of total trees.
# Plot histograms for the health status percentages
plt.pyplot.figure(figsize=(6, 4))
# Histograms
plt.pyplot.hist(df_q2c['good_trees'], bins=10, alpha=0.7, label='Good', color='green', edgecolor='black')
plt.pyplot.hist(df_q2c['fair_trees'], bins=10, alpha=0.7, label='Fair', color='orange', edgecolor='black')
plt.pyplot.hist(df_q2c['poor_trees'], bins=10, alpha=0.7, label='Poor', color='red', edgecolor='black')
plt.pyplot.title('Distribution of Tree Health Percentages Across ZIP Codes', fontsize=16)
plt.pyplot.xlabel('Number of trees', fontsize=14)
plt.pyplot.ylabel('Frequency', fontsize=14)
plt.pyplot.legend(title='Health Status', fontsize=12)
plt.pyplot.grid(axis='y', linestyle='--', alpha=0.7)
plt.pyplot.tight_layout()
plt.pyplot.show()
2c. The histogram shows the distribution of the number of trees by health status (Good, Fair, and Poor) across various ZIP codes. Most ZIP codes have a relatively small number of trees, with the majority of these trees being in "Good" health, as indicated by the green bars. Trees in "Fair" and "Poor" health, represented by orange and red bars respectively, occur less frequently across ZIP codes. The distribution shows that only a few ZIP codes having a very large number of trees.
Q3: Changes Over Time¶
3a. Write a query that extracts for each tree species the change in the the number of trees in New York State from 1995 to 2015, using the dataset 'bigquery-public-data.new_york_trees.tree_census_1995' in addition to 'bigquery-public-data.new_york_trees.tree_census_2015'. Sort the resulting list by the absolute value of the change in the number of trees, and display the five species with the highest absolute change in the number of trees. Show the number of trees in 1995, in 2015 and the change for these species. Since the names sometimes differ between the years, use the uppercase version of spc_latin for each year. Then, merge the data using a CROSS JOIN, with the condition that the EDIT_DISTANCE between the names is less than or equal to 2.
3b. Compare the percentage trees of each fall foliage color for the years 1995 and 2015 separately using the bigquery-public-data.new_york_trees.tree_species dataset. If a tree species is not found in the dataset, assume its foliage color is "Unknown." Output a table showing for each fall foilage color the percent of trees in 1995, the percent of trees in 2015 and their difference, sorted by the aboslute value of the difference.
Use the same criteria for CROSS JOIN as in 3a. ('EDIT_DISTANCE' at most 2 for the uppercase names).
Remark: Consider each combination of colors (e.g. Red/Bronze) as a color of its own (different from the colors Red or Bronze in this example).
Answer:
q3a1_query = """
CREATE OR REPLACE TABLE `part-one-mid-term.dataset_part1midterm.tree_counts_1995` AS
SELECT
UPPER(spc_latin) AS species,
COUNT(*) AS tree_count_1995
FROM `bigquery-public-data.new_york_trees.tree_census_1995`
WHERE spc_latin IS NOT NULL
GROUP BY UPPER(spc_latin);
"""
client.query(q3a1_query).result()
q3a2_query = """
CREATE OR REPLACE TABLE `part-one-mid-term.dataset_part1midterm.tree_counts_2015` AS
SELECT
UPPER(spc_latin) AS species,
COUNT(*) AS tree_count_2015
FROM `bigquery-public-data.new_york_trees.tree_census_2015`
WHERE spc_latin IS NOT NULL
GROUP BY UPPER(spc_latin);
"""
client.query(q3a2_query).result()
q3a_query = """
-- Join the two tables and calculate the change
SELECT
t95.species AS species_from_1995,
t15.species AS species_from_2015,
t95.tree_count_1995,
t15.tree_count_2015,
(t15.tree_count_2015 - t95.tree_count_1995) AS change_in_tree_count
FROM `part-one-mid-term.dataset_part1midterm.tree_counts_1995` t95
CROSS JOIN `part-one-mid-term.dataset_part1midterm.tree_counts_2015` t15
WHERE EDIT_DISTANCE(t95.species, t15.species) <= 2
ORDER BY ABS(t15.tree_count_2015 - t95.tree_count_1995) DESC
LIMIT 5;
"""
q3a_query_job = client.query(q3a_query)
df_q3a = q3a_query_job.to_dataframe()
df_q3a.head()
| species_from_1995 | species_from_2015 | tree_count_1995 | tree_count_2015 | change_in_tree_count | |
|---|---|---|---|---|---|
| 0 | ACER PLATANOIDES | ACER PLATANOIDES | 109325 | 34189 | -75136 |
| 1 | PYRUS CALLERYANA | PYRUS CALLERYANA | 31295 | 58931 | 27636 |
| 2 | ZELKOVA SERRATA | ZELKOVA SERRATA | 5740 | 29258 | 23518 |
| 3 | ACER SACCHARINUM | ACER SACCHARUM | 22347 | 2844 | -19503 |
| 4 | QUERCUS PALUSTRIS | QUERCUS PALUSTRIS | 36553 | 53185 | 16632 |
This table shows the number of the 5 most species with the changes in number from 1995 till 2015.
q3b1_query = """
CREATE OR REPLACE TABLE `part-one-mid-term.dataset_part1midterm.tree_counts_1995` AS
SELECT
UPPER(c95.spc_latin) AS species_1995,
UPPER(ts.species_scientific_name) AS species_tree_species,
IFNULL(ts.fall_color, 'Unknown') AS fall_color,
COUNT(*) AS tree_count_1995
FROM `bigquery-public-data.new_york_trees.tree_census_1995` c95
CROSS JOIN `bigquery-public-data.new_york_trees.tree_species` ts
WHERE EDIT_DISTANCE(UPPER(c95.spc_latin), UPPER(ts.species_scientific_name)) <= 2
GROUP BY species_1995, species_tree_species, fall_color;
"""
q3b2_query = """
CREATE OR REPLACE TABLE `part-one-mid-term.dataset_part1midterm.tree_counts_2015` AS
SELECT
UPPER(c15.spc_latin) AS species_2015,
UPPER(ts.species_scientific_name) AS species_tree_species,
IFNULL(ts.fall_color, 'Unknown') AS fall_color,
COUNT(*) AS tree_count_2015
FROM `bigquery-public-data.new_york_trees.tree_census_2015` c15
CROSS JOIN `bigquery-public-data.new_york_trees.tree_species` ts
WHERE EDIT_DISTANCE(UPPER(c15.spc_latin), UPPER(ts.species_scientific_name)) <= 2
GROUP BY species_2015, species_tree_species, fall_color
;
"""
q3b3_query = """
CREATE OR REPLACE TABLE `part-one-mid-term.dataset_part1midterm.combined_tree_counts` AS
SELECT
t95.species_1995 As species_1995,
t15.species_2015 As species_2015,
t95.fall_color AS fall_color_1995,
t15.fall_color AS fall_color_2015,
t95.tree_count_1995 AS Number_of_trees_1995,
t15.tree_count_2015 AS Number_of_trees_2015
FROM `part-one-mid-term.dataset_part1midterm.tree_counts_1995` t95
CROSS JOIN `part-one-mid-term.dataset_part1midterm.tree_counts_2015` t15
WHERE EDIT_DISTANCE(t95.species_1995, t15.species_2015) <= 2 AND t95.fall_color = t15.fall_color;
"""
q3b4_query = """
CREATE OR REPLACE TABLE `part-one-mid-term.dataset_part1midterm.total_tree_counts` AS
SELECT
(SELECT SUM(tree_count_1995) FROM `part-one-mid-term.dataset_part1midterm.tree_counts_1995`) AS total_1995,
(SELECT SUM(tree_count_2015) FROM `part-one-mid-term.dataset_part1midterm.tree_counts_2015`) AS total_2015;
"""
q3b5_query = """
SELECT
COALESCE(com_tree.fall_color_1995, com_tree.fall_color_2015) AS fall_color,
ROUND(SAFE_DIVIDE(SUM(IFNULL(com_tree.Number_of_trees_1995, 0)), total_tree_counts.total_1995) * 100, 2) AS percent_1995,
ROUND(SAFE_DIVIDE(SUM(IFNULL(com_tree.Number_of_trees_2015, 0)), total_tree_counts.total_2015) * 100, 2) AS percent_2015,
ROUND(
ROUND(SAFE_DIVIDE(SUM(IFNULL(com_tree.Number_of_trees_2015, 0)), total_tree_counts.total_2015) * 100, 2)
- ROUND(SAFE_DIVIDE(SUM(IFNULL(com_tree.Number_of_trees_1995, 0)), total_tree_counts.total_1995) * 100, 2),2)
AS percent_difference
FROM `part-one-mid-term.dataset_part1midterm.combined_tree_counts` com_tree
CROSS JOIN `part-one-mid-term.dataset_part1midterm.total_tree_counts` total_tree_counts
GROUP BY fall_color, total_tree_counts.total_1995, total_tree_counts.total_2015
ORDER BY ABS(percent_difference) DESC;
"""
q3b1_query_job = client.query(q3b1_query).result()
q3b2_query_job = client.query(q3b2_query).result()
#df_q3b2 = q3b2_query_job.to_dataframe()
#df_q3b2.head()
q3b3_query_job = client.query(q3b3_query).result()
q3b4_query_job = client.query(q3b4_query).result()
q3b5_query_job = client.query(q3b5_query).result()
df_q3b5 = q3b5_query_job.to_dataframe()
df_q3b5.head()
| fall_color | percent_1995 | percent_2015 | percent_difference | |
|---|---|---|---|---|
| 0 | Yellow | 61.85 | 45.23 | -16.62 |
| 1 | Red/Bronze | 2.16 | 5.72 | 3.56 |
| 2 | Red | 7.10 | 3.83 | -3.27 |
| 3 | Maroon | 28.12 | 24.92 | -3.20 |
| 4 | Orange/Brown | 0.01 | 0.84 | 0.83 |
This table show the 5 highest absolute value of the difference between percent of trees in 1995, the percent of trees in 2015 in fall foliage color.
Q4: Broken Windows Theory¶
4a. The Broken Windows Theory suggests that poor condition of city areas may affect crime levels. In this section, we’ll look for a correlation between the proportion of dead trees (relative to total trees) and the number of NYPD service requests in different zip codes.
Before calculating the correlation, run a query showing the different categories of the service requests to the NYPD and their counts (number of requests for each category) and explain whether they can be used as an imperfect measure of crime trends.
Next, extract the proportion of dead trees in 2015
for each zip code that is not null and has at least 100 trees, and the number of NYPD servise request excluding Noise normalized by the total number of requests in each of these zip codes for that year using SQL commands within python.
Next, extract the results as a python object (e.g. a pandas data-frame), display a scatter-plot showing the percent of dead trees and the number of service requests across the zip codes and compute their Pearson correlation.
Analyze the results to determine if they support the hypothesis. If they do, provide an explanation; if they don't, propose a confounding variable that might lead to inaccuracies.
Hint: Consider using Having in your query
q4a1_query = """
SELECT
complaint_type,
COUNT(*) AS request_count
FROM `bigquery-public-data.new_york_311.311_service_requests`
WHERE agency = 'NYPD' -- Filter only for NYPD service requests
GROUP BY complaint_type
ORDER BY request_count DESC;
"""
q4a1_query_job = client.query(q4a1_query).result()
df_q4a1 = q4a1_query_job.to_dataframe()
df_q4a1.head()
| complaint_type | request_count | |
|---|---|---|
| 0 | Noise - Residential | 2540450 |
| 1 | Illegal Parking | 1376331 |
| 2 | Blocked Driveway | 1165274 |
| 3 | Noise - Street/Sidewalk | 854517 |
| 4 | Noise - Commercial | 440092 |
This table show the top 5 reasons for complains, the first one is noise complaints which may indicate disturbances or disorder, but it is not direct indicators of criminal activity. Second one is Illegal Parking which relates to traffic violations, but not crimes. The third one is Blocked Driveway which also does not indecate about crimes. The forth and fifth are Noise - Street/Sidewalk and Noise - Commercial and they are also not a strong indicators about crime. So we can use them as an indirect measure of crime trends.
# Define the query
q4a2_query = """
-- CREATE OR REPLACE TABLE `part-one-mid-term.dataset_part1midterm.dead_tree_proportion` AS
SELECT
zipcode,
COUNT(*) AS total_trees,
SUM(CASE WHEN status IN ('Standing Dead', 'Stump') THEN 1 ELSE 0 END) AS dead_trees,
ROUND(SUM(CASE WHEN status IN ('Standing Dead', 'Stump') THEN 1 ELSE 0 END) / COUNT(*) * 100, 2) AS dead_tree_proportion
FROM `bigquery-public-data.new_york_trees.tree_census_2015`
WHERE zipcode IS NOT NULL
GROUP BY zipcode
HAVING COUNT(*) >= 100
"""
q4a3_query = """
-- CREATE OR REPLACE TABLE `part-one-mid-term.dataset_part1midterm.nypd_requests_normalized` AS
SELECT
SAFE_CAST(incident_zip AS INT64) AS zipcode,
COUNT(CASE WHEN complaint_type NOT LIKE 'Noise%' THEN 1 END) AS non_noise_requests,
COUNT(*) AS total_requests,
ROUND(
SAFE_DIVIDE(COUNT(CASE WHEN complaint_type NOT LIKE 'Noise%' THEN 1 END),
COUNT(*)) * 100, 2) AS normalized_non_noise_percentage
FROM `bigquery-public-data.new_york_311.311_service_requests`
WHERE agency = 'NYPD' AND SAFE_CAST(incident_zip AS INT64) IS NOT NULL
GROUP BY zipcode;
"""
q4a2_query_job = client.query(q4a2_query)
q4a3_query_job = client.query(q4a3_query)
# Convert the result to a Pandas DataFrame
result_df_q4a2 = q4a2_query_job.to_dataframe()
result_df_q4a3 = q4a3_query_job.to_dataframe()
# Merge
combined_df = result_df_q4a2.merge(result_df_q4a3, on='zipcode')
combined_df = combined_df[['zipcode', 'dead_tree_proportion', 'non_noise_requests']]
plt.pyplot.figure(figsize=(6, 4))
plt.pyplot.scatter(combined_df['dead_tree_proportion'], combined_df['non_noise_requests'], alpha=0.7)
plt.pyplot.title('Proportion of dead trees vs. NYPD Non-Noise Requests')
plt.pyplot.xlabel('Proportion of dead trees %')
plt.pyplot.ylabel('Number of Non-Noise NYPD Requests')
plt.pyplot.grid(True)
plt.pyplot.show()
x = combined_df['dead_tree_proportion'].tolist()
y = combined_df['non_noise_requests'].tolist()
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)
# Pearson
numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
denominator_x = sum((xi - mean_x) ** 2 for xi in x)
denominator_y = sum((yi - mean_y) ** 2 for yi in y)
denominator = (denominator_x * denominator_y) ** 0.5
if denominator != 0:
correlation = round(numerator / denominator,3)
else:
correlation = 0
print(f"Pearson Correlation: {correlation}")
Pearson Correlation: 0.101
The low correlation of 0.101 suggests that the proportion of dead trees and the number of non-noise NYPD service requests are mostly independent.
Q5: Tree-Related Complaints¶
5.a Use the 311 dataset to count how many tree-related complaints each agency (for example NYPD or other agencies) received. Tree-related complaints are those that include the word "Tree" in the complaint_type field (case sensitive).
5.b Filter the tree-related complaints to include only such complaints to the agency with the highest number of tree-related complaints you have found in (a.). Find the three most common types of these tree-related complains and display them in a table with their counts. Finally, create a bar chart that shows the number of the tree-related complaints of these three types for each year, where each bar is divided into three sub-bars of different colors, each indicating the tree-related complaint of each type in this year.
Hint: consider using a window function and the pivot function in Python to simplify the plot.
q5a1_query = """
SELECT
agency,
COUNT(*) AS tree_related_complaints
FROM `bigquery-public-data.new_york_311.311_service_requests`
WHERE complaint_type LIKE '%Tree%'
GROUP BY agency
ORDER BY tree_related_complaints DESC;
"""
q5a1_query_job = client.query(q5a1_query).result()
df_q5a1 = q5a1_query_job.to_dataframe()
df_q5a1.head()
| agency | tree_related_complaints | |
|---|---|---|
| 0 | DPR | 899528 |
| 1 | DSNY | 1986 |
| 2 | NYPD | 2 |
| 3 | ACS | 1 |
| 4 | DOHMH | 1 |
This table shows the top 5 agencies with the most commplains which related to trees.
q5b1_query = """
SELECT
complaint_type,
COUNT(*) AS complaint_count
FROM `bigquery-public-data.new_york_311.311_service_requests`
WHERE complaint_type LIKE '%Tree%' AND agency = 'DPR'
GROUP BY complaint_type
ORDER BY complaint_count DESC
LIMIT 3;
"""
q5b1_query_job = client.query(q5b1_query).result()
df_q5b1 = q5b1_query_job.to_dataframe()
df_q5b1.head()
| complaint_type | complaint_count | |
|---|---|---|
| 0 | Damaged Tree | 387695 |
| 1 | New Tree Request | 183445 |
| 2 | Overgrown Tree/Branches | 178582 |
This table show the top 3 complaint types which related to trees from the database.
q5b1_query = """
SELECT EXTRACT(YEAR FROM created_date) AS year, complaint_type, COUNT(*) AS complaint_count
FROM `bigquery-public-data.new_york_311.311_service_requests`
WHERE complaint_type LIKE '%Tree%' AND agency = 'DPR' AND (complaint_type ='Damaged Tree' OR complaint_type = 'New Tree Request' OR complaint_type ='Overgrown Tree/Branches')
Group by year, complaint_type
ORDER BY year, complaint_count DESC
"""
q5b1_query_job = client.query(q5b1_query).result()
df_q5b1 = q5b1_query_job.to_dataframe()
df_q5b1.head()
| year | complaint_type | complaint_count | |
|---|---|---|---|
| 0 | 2010 | Damaged Tree | 39178 |
| 1 | 2010 | Overgrown Tree/Branches | 14968 |
| 2 | 2010 | New Tree Request | 9859 |
| 3 | 2011 | Damaged Tree | 32104 |
| 4 | 2011 | Overgrown Tree/Branches | 13482 |
pivot_df = df_q5b1.pivot(index="year", columns="complaint_type", values="complaint_count")
pivot_df.plot(kind="bar", stacked=False, figsize=(6, 5), width=0.8)
plt.pyplot.title("Tree-Related Complaints by Type and Year (DPR)", fontsize=14)
plt.pyplot.xlabel("Year", fontsize=12)
plt.pyplot.ylabel("Number of Complaints", fontsize=12)
plt.pyplot.legend(title="Complaint Type", fontsize=10)
plt.pyplot.xticks(rotation=0)
plt.pyplot.tight_layout()
plt.pyplot.show()
This chart displays the number of tree-related complaints received by the DPR agency. The top three complaint categories include damaged trees, new tree requests, and overgrown trees or branches. It provides a comparison of these three complaint types across the years from 2010 to 2021.
Unix - 32 points¶
There are 5 questions in this part. Every question is worth 6 points except question 5 that is worth 8 points. You should solve all the questions using unix commands and include them in your answer, together with additional code/analysis required and the requested results.
Q1: Simple Counting¶
Copy the review-Oregon.json.gz file from the moriah cluster, located at the path:
/sci/labs/orzuk/orzuk/teaching/big_data_mining_52002/midterm_2024_25 or at /sci/home/orzuk.
Extract the file's content and use the wc command to display the number of lines, words, and characters in the review-Oregon.json file. Print the first line of the file and briefly explain what the data represents.
Commands executed on the Moriah cluster:
# Log in to the Moriah cluster
ssh tala.alsayed@moriah-gw.cs.huji.ac.il
# Copy the file from the specified path
cp /sci/home/orzuk/review-Oregon.json.gz .
# Extract the file
gunzip review-Oregon.json.gz
# Verify the extraction
ls
Analyze the file to count lines, words, and characters
wc review-Oregon.json
output:
11012170 424536800 3590454764 review-Oregon.json
Print the first line of the file
head -n 1 review-Oregon.json
output:
{
"user_id": "108990823942597776781",
"name": "Richard Carroll",
"time": 1604639688837,
"rating": 5,
"text": "Fabulous food and amazing service. Will be back for more! The beef ribs are killer.",
"pics": null,
"resp": null,
"gmap_id": "0x80dce9997c8d25fd:0xc6c81c1983060cbc"
}
explain what the data represents:
The review-Oregon.json file is a large dataset containing detailed user-generated reviews, with over 11 million lines, 42 million words, and 3.59 billion characters. Each line represents a JSON object that includes fields such as a unique user ID, the user's name, a numerical rating, the textual content of the review, and additional metadata.
Q2: Food and Images¶
Count the number of comments that include the word "food" (not case sensitive). Out of those comments, determine how many included at least one image.
Next, perform the same query for the rest of the comments (that don't contain the word "food") and compare the proportions of comments with an image in the two categories. Is the difference in proportions statistically significant?
Describe a statistical test and test statistic and display the results. For the last calculation of the test statistic and p-value, manual computation is allowed, but all previous steps should be implemented in Unix commands. Explain the Unix command you wrote in simple language.
Count Comments Containing the Word "Food"¶
grep -i "food" review-Oregon.json | wc -l
Explanation:¶
This command counts how many comments in review-Oregon.json contain the word "food", ignoring case. grep -i searches for "food", and wc -l counts the number of matching comments.
output:¶
1183304
Count "Food" Comments That Include at Least One Image¶
grep -i "food" review-Oregon.json | grep -v '"pics": null' | wc -l
Explanation:¶
The grep -i command searches for comments containing the word "food" (ignoring case). The grep -v command filters out comments without images by excluding lines with "pics": null. Finally, wc -l counts the remaining comments that mention "food" and include at least one image.
output:¶
62252
Calculate the Proportion:¶
Out of 1,183,304 comments that mention "food", 62,252 include at least one image, resulting in a proportion of 0.053.
Count the number of comments without the word “food”:¶
grep -vi "food" review-Oregon.json | wc -l
Explanation:¶
This command counts the number of comments in review-Oregon.json that do not contain the word "food", ignoring case. The -v option excludes lines with "food", and wc -l counts the remaining lines.
output:¶
9828866
Out of the comments that did not include the word “food”, how many of them included at least one image:¶
grep -vi "food" review-Oregon.json | grep -v '"pics": null' | wc -l
Explanation:¶
This command counts how many comments without the word "food" have at least one image. The grep -vi "food" excludes comments with "food", grep -v '"pics": null' filters out comments without images, and wc -l counts the remaining comments.
output:¶
285605
Calculate the Proportion:¶
Out of 9,828,866 comments that do not mention the word "food", 285,605 include at least one image, resulting in a proportion of 0.029.
To determine if the difference in proportions is statistically significant, we used the Two-Proportion Z-Test. This test evaluates whether there is a significant difference between the proportions of two independent groups.
Our hypotheses:
Null Hypothesis $H_0$: There is no difference in proportions between the two groups.
Alternative Hypothesis $H_0$: There is a difference in proportions between the two groups.
$$z = \frac{P1-P2} {\sqrt {P \cdot (1-P) \cdot (\frac{1}{n_1} + \frac{1}{n2})}}$$
Where:
- $ p_1 $ = Proportion of comments with images that mention "food"
- $ p_2 $ = Proportion of comments with images that do not mention "food"
- $ p $ = Combined proportion of both groups
- $ n_1, n_2 $ = Sample sizes of both groups
After calculation, the Z-test statistic is:
$$z = 144.6$$ And the p-value is $P = 2 \cdot P(Z>|z|)$ $=0$
So we concluded that there is a strong evidence that comments mentioning "food" are significantly more likely to include images compared to comments that do not mention "food". Therefore, we reject the null hypothesis $H_0$ and conclude that there is a statistically significant difference in the proportions of comments with images between the two groups.
Q3: Top Users¶
From the people who mentioned "food" (not case sensitive) in their comments, find the three users with the highest number of comments about food and display their user-IDs together with the number of such comments for each one.
# Extract comments containing “food”:
grep -i "food" review-Oregon.json > food_comments.json
# Extract users ID:
grep -o '"user_id": "[^"]*"' food_comments.json | awk -F': ' '{print $2}' > user_ids.txt
# Count comments per user:
sort user_ids.txt | uniq -c | sort -nr > user_counts.txt
# Find the top 3 users:
head -n 3 user_counts.txt
Result of the last executed command:
299 "100150831663760014553"
165 "114651618825277458119"
145 "110494514107558004797"
Explanation:¶
This solution first searches for all comments in the review-Oregon.json file that contain the word "food" (ignoring case) and saves them to food_comments.json. Then, it extracts the user IDs from these comments and stores them in user_ids.txt. The user IDs are then sorted and counted to determine how many comments each user made, with the results saved in user_counts.txt. Finally, the top three users with the highest number of comments about "food" are displayed using the head command.
Q4: Splitting¶
Split the review-Oregon.json file into different files based on the rating value. Name each file as reviews_rating_i, where i represents the rating value, and count the number of comments in each of these files.
# Find the range of ratings
grep -o '"rating": [0-9]*' review-Oregon.json | awk '{print $2}' | sort -n | uniq
# Split the file by rating
for i in {1..5}; do
grep "\"rating\": $i" review-Oregon.json > /tmp/reviews_rating_$i
done
# Count the number of comments in each file
for i in {1..5}; do
echo "Rating $i: $(wc -l < /tmp/reviews_rating_$i) comments"
done
Result of the last executed command:
Rating 1: 0 comments
Rating 2: 339248 comments
Rating 3: 858467 comments
Rating 4: 2059032 comments
Rating 5: 7012178 comments
Explanation:¶
This solution first extracts all unique rating values from the review-Oregon.json file. Then, it splits the file into separate files based on each rating (from 1 to 5), saving them as reviews_rating_i where i is the rating value. Finally, it counts how many comments are in each file. The result shows that Rating 1 has 0 comments, Rating 2 has 339,248, Rating 3 has 858,467, Rating 4 has 2,059,032, and Rating 5 has 7,012,178 comments.
Q5: Count Frequent Words¶
Write a Python program (.py format) that takes a file name (as a string) and a natural number k as input. The program should read the file, extract the words from the reviews, split them into individual words, and print the top k most frequent words. Remove any stop words (common words) before computing the frequencies.
Next, run a shell script that runs the Python program you wrote on all the different rating files created earlier in Q4 and print the top 5 most frequent words for each rating.
the python program:¶
import sys
from collections import Counter
import re
import json
# Define a set of stop words to exclude
STOP_WORDS = {
"and", "the", "is", "in", "to", "of", "a", "an", "for", "on", "with",
"at", "by", "from", "or", "that", "this", "it"
}
def count_frequent_words(file_name, k):
try:
word_counts = Counter()
with open(file_name, "r", encoding="utf-8") as f:
for line in f:
try:
review = json.loads(line)
if "text" in review and isinstance(review["text"], str):
text = review["text"]
words = (
word for word in re.findall(r'\b\w+\b', text.lower())
if word not in STOP_WORDS
)
word_counts.update(words)
except json.JSONDecodeError:
continue
top_k_words = word_counts.most_common(k)
print(f"Top {k} words in {file_name}:")
for word, count in top_k_words:
print(f"{word}: {count}")
except FileNotFoundError:
print(f"Error: File {file_name} not found.")
except Exception as e:
print(f"An error occurred: {e}")
if __name__ == "__main__":
if len(sys.argv) != 3:
print("Usage: python program.py <file_name> <k>")
else:
try:
file_name = sys.argv[1]
k = int(sys.argv[2])
count_frequent_words(file_name, k)
except ValueError:
print("Error: <k> must be an integer.")
Explanation of the Python Program¶
This Python program takes a file name and a number k as inputs. It reads the JSON file line by line, extracts words from the "text" field in each review, removes common stop words, and counts how often each word appears. Finally, it prints the top k most frequent words in the file.
Note on Execution¶
I ran this program on my local computer. I downloaded the files generated in Question 4 from the Moriah cluster to my personal computer and then executed the program locally.
Running the Python Program on All Rating Files¶
Now, we want to run the Python program (count_words.py) on the five rating files: reviews_rating_1, reviews_rating_2, reviews_rating_3, reviews_rating_4, and reviews_rating_5.
# Change to the Project Directory
cd C:\Users\talaa\PycharmProjects\52002part2Q5
# Run the Python Program on reviews_rating_1
python count_words.py reviews_rating_1 5
# output:
Top 5 words in reviews_rating_1:
The output for the first file shows that it is empty.
# Run the Python Program on reviews_rating_2
python count_words.py reviews_rating_2 5
# output:
Top 5 words in reviews_rating_2:
i: 327434
was: 229975
they: 125507
not: 121077
my: 106994
# Run the Python Program on reviews_rating_3
python count_words.py reviews_rating_3 5
# output:
Top 5 words in reviews_rating_3:
i: 290500
was: 245968
but: 168129
not: 125866
they: 121791
# Run the Python Program on reviews_rating_4
python count_words.py reviews_rating_4 5
# output:
Top 5 words in reviews_rating_4:
i: 426018
good: 375311
was: 350124
great: 292092
food: 249772
# Run the Python Program on reviews_rating_5
python count_words.py reviews_rating_5 5
# output:
Top 5 words in reviews_rating_5:
i: 2521439
great: 1345740
was: 1329281
my: 1135555
they: 884331
Q1: Simple Counting¶
Copy the review-Oregon.json.gz file from the moriah cluster, located at the path:
/sci/labs/orzuk/orzuk/teaching/big_data_mining_52002/midterm_2024_25 or at /sci/home/orzuk.
Extract the file's content and use the wc command to display the number of lines, words, and characters in the review-Oregon.json file. Print the first line of the file and briefly explain what the data represents.
Commands executed on the Moriah cluster:
# Log in to the Moriah cluster
ssh tala.alsayed@moriah-gw.cs.huji.ac.il
# Copy the file from the specified path
cp /sci/home/orzuk/review-Oregon.json.gz .
# Extract the file
gunzip review-Oregon.json.gz
# Verify the extraction
ls
Analyze the file to count lines, words, and characters
wc review-Oregon.json
output:
11012170 424536800 3590454764 review-Oregon.json
Print the first line of the file
head -n 1 review-Oregon.json
output:
{
"user_id": "108990823942597776781",
"name": "Richard Carroll",
"time": 1604639688837,
"rating": 5,
"text": "Fabulous food and amazing service. Will be back for more! The beef ribs are killer.",
"pics": null,
"resp": null,
"gmap_id": "0x80dce9997c8d25fd:0xc6c81c1983060cbc"
}
explain what the data represents:
The review-Oregon.json file is a large dataset containing detailed user-generated reviews, with over 11 million lines, 42 million words, and 3.59 billion characters. Each line represents a JSON object that includes fields such as a unique user ID, the user's name, a numerical rating, the textual content of the review, and additional metadata.
Q2: Food and Images¶
Count the number of comments that include the word "food" (not case sensitive). Out of those comments, determine how many included at least one image.
Next, perform the same query for the rest of the comments (that don't contain the word "food") and compare the proportions of comments with an image in the two categories. Is the difference in proportions statistically significant?
Describe a statistical test and test statistic and display the results. For the last calculation of the test statistic and p-value, manual computation is allowed, but all previous steps should be implemented in Unix commands. Explain the Unix command you wrote in simple language.
Count Comments Containing the Word "Food"¶
grep -i "food" review-Oregon.json | wc -l
Explanation:¶
This command counts how many comments in review-Oregon.json contain the word "food", ignoring case. grep -i searches for "food", and wc -l counts the number of matching comments.
output:¶
1183304
Count "Food" Comments That Include at Least One Image¶
grep -i "food" review-Oregon.json | grep -v '"pics": null' | wc -l
Explanation:¶
The grep -i command searches for comments containing the word "food" (ignoring case). The grep -v command filters out comments without images by excluding lines with "pics": null. Finally, wc -l counts the remaining comments that mention "food" and include at least one image.
output:¶
62252
Calculate the Proportion:¶
Out of 1,183,304 comments that mention "food", 62,252 include at least one image, resulting in a proportion of 0.053.
Count the number of comments without the word “food”:¶
grep -vi "food" review-Oregon.json | wc -l
Explanation:¶
This command counts the number of comments in review-Oregon.json that do not contain the word "food", ignoring case. The -v option excludes lines with "food", and wc -l counts the remaining lines.
output:¶
9828866
Out of the comments that did not include the word “food”, how many of them included at least one image:¶
grep -vi "food" review-Oregon.json | grep -v '"pics": null' | wc -l
Explanation:¶
This command counts how many comments without the word "food" have at least one image. The grep -vi "food" excludes comments with "food", grep -v '"pics": null' filters out comments without images, and wc -l counts the remaining comments.
output:¶
285605
Calculate the Proportion:¶
Out of 9,828,866 comments that do not mention the word "food", 285,605 include at least one image, resulting in a proportion of 0.029.
Is the difference in proportions statistically significant?¶
We will be using the two proportion Z-Test, This statistical test is used to determine whether there is a significant difference between the proportions of two independent groups.
Hypotheses:
Null Hypothesis (H₀): There is no difference in proportions between the two groups.
Alternative Hypothesis (H₁): There is a difference in proportions between the two groups.
Q3: Top Users¶
From the people who mentioned "food" (not case sensitive) in their comments, find the three users with the highest number of comments about food and display their user-IDs together with the number of such comments for each one.
# Extract comments containing “food”:
grep -i "food" review-Oregon.json > food_comments.json
# Extract users ID:
grep -o '"user_id": "[^"]*"' food_comments.json | awk -F': ' '{print $2}' > user_ids.txt
# Count comments per user:
sort user_ids.txt | uniq -c | sort -nr > user_counts.txt
# Find the top 3 users:
head -n 3 user_counts.txt
Result of the last executed command:
299 "100150831663760014553"
165 "114651618825277458119"
145 "110494514107558004797"
Explanation:¶
This solution first searches for all comments in the review-Oregon.json file that contain the word "food" (ignoring case) and saves them to food_comments.json. Then, it extracts the user IDs from these comments and stores them in user_ids.txt. The user IDs are then sorted and counted to determine how many comments each user made, with the results saved in user_counts.txt. Finally, the top three users with the highest number of comments about "food" are displayed using the head command.
Q4: Splitting¶
Split the review-Oregon.json file into different files based on the rating value. Name each file as reviews_rating_i, where i represents the rating value, and count the number of comments in each of these files.
# Find the range of ratings
grep -o '"rating": [0-9]*' review-Oregon.json | awk '{print $2}' | sort -n | uniq
# Split the file by rating
for i in {1..5}; do
grep "\"rating\": $i" review-Oregon.json > /tmp/reviews_rating_$i
done
# Count the number of comments in each file
for i in {1..5}; do
echo "Rating $i: $(wc -l < /tmp/reviews_rating_$i) comments"
done
Result of the last executed command:
Rating 1: 0 comments
Rating 2: 339248 comments
Rating 3: 858467 comments
Rating 4: 2059032 comments
Rating 5: 7012178 comments
Explanation:¶
This solution first extracts all unique rating values from the review-Oregon.json file. Then, it splits the file into separate files based on each rating (from 1 to 5), saving them as reviews_rating_i where i is the rating value. Finally, it counts how many comments are in each file. The result shows that Rating 1 has 0 comments, Rating 2 has 339,248, Rating 3 has 858,467, Rating 4 has 2,059,032, and Rating 5 has 7,012,178 comments.
Q5: Count Frequent Words¶
Write a Python program (.py format) that takes a file name (as a string) and a natural number k as input. The program should read the file, extract the words from the reviews, split them into individual words, and print the top k most frequent words. Remove any stop words (common words) before computing the frequencies.
Next, run a shell script that runs the Python program you wrote on all the different rating files created earlier in Q4 and print the top 5 most frequent words for each rating.
the python program:¶
import sys
from collections import Counter
import re
import json
# Define a set of stop words to exclude
STOP_WORDS = {
"and", "the", "is", "in", "to", "of", "a", "an", "for", "on", "with",
"at", "by", "from", "or", "that", "this", "it"
}
def count_frequent_words(file_name, k):
try:
word_counts = Counter()
with open(file_name, "r", encoding="utf-8") as f:
for line in f:
try:
review = json.loads(line)
if "text" in review and isinstance(review["text"], str):
text = review["text"]
words = (
word for word in re.findall(r'\b\w+\b', text.lower())
if word not in STOP_WORDS
)
word_counts.update(words)
except json.JSONDecodeError:
continue
top_k_words = word_counts.most_common(k)
print(f"Top {k} words in {file_name}:")
for word, count in top_k_words:
print(f"{word}: {count}")
except FileNotFoundError:
print(f"Error: File {file_name} not found.")
except Exception as e:
print(f"An error occurred: {e}")
if __name__ == "__main__":
if len(sys.argv) != 3:
print("Usage: python program.py <file_name> <k>")
else:
try:
file_name = sys.argv[1]
k = int(sys.argv[2])
count_frequent_words(file_name, k)
except ValueError:
print("Error: <k> must be an integer.")
Explanation of the Python Program¶
This Python program takes a file name and a number k as inputs. It reads the JSON file line by line, extracts words from the "text" field in each review, removes common stop words, and counts how often each word appears. Finally, it prints the top k most frequent words in the file.
Note on Execution¶
I ran this program on my local computer. I downloaded the files generated in Question 4 from the Moriah cluster to my personal computer and then executed the program locally.
Running the Python Program on All Rating Files¶
Now, we want to run the Python program (count_words.py) on the five rating files: reviews_rating_1, reviews_rating_2, reviews_rating_3, reviews_rating_4, and reviews_rating_5.
# Change to the Project Directory
cd C:\Users\talaa\PycharmProjects\52002part2Q5
# Run the Python Program on reviews_rating_1
python count_words.py reviews_rating_1 5
# output:
Top 5 words in reviews_rating_1:
The output for the first file shows that it is empty.
# Run the Python Program on reviews_rating_2
python count_words.py reviews_rating_2 5
# output:
Top 5 words in reviews_rating_2:
i: 327434
was: 229975
they: 125507
not: 121077
my: 106994
# Run the Python Program on reviews_rating_3
python count_words.py reviews_rating_3 5
# output:
Top 5 words in reviews_rating_3:
i: 290500
was: 245968
but: 168129
not: 125866
they: 121791
# Run the Python Program on reviews_rating_4
python count_words.py reviews_rating_4 5
# output:
Top 5 words in reviews_rating_4:
i: 426018
good: 375311
was: 350124
great: 292092
food: 249772
# Run the Python Program on reviews_rating_5
python count_words.py reviews_rating_5 5
# output:
Top 5 words in reviews_rating_5:
i: 2521439
great: 1345740
was: 1329281
my: 1135555
they: 884331
Networks - 28 points¶
There are 4 questions in this part. Every question is worth 7 points
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
from geopy.distance import geodesic
import random
from scipy.sparse import csr_matrix, coo_matrix
import time
import copy
from scipy.io import mmread
Q1: Exploratory Data Analysis¶
Download the web-sk-2005 network dataset available here:
https://networkrepository.com/web.php.
Unzip the file and read it to python (Note: this can take a few minutes to upload)
Display the first five rows of the dataset and explain what is shown and what the data represents. Count the number of nodes and edges and display them to verify that your file is correct.
Finally, show two separate histograms, one for the distribution of in-degrees in the network, and one for the distribution of out-degrees. Descrbine your results.
# Mount your drive so you can upload the network file
from google.colab import drive
drive.mount('/content/drive/')
Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).
# Load the .mtx file
sparse_matrix = mmread('/content/drive/MyDrive/mid/web-sk-2005.mtx')
# Convert sparse matrix to a pandas DataFrame for inspection
df = pd.DataFrame(sparse_matrix.tocoo().toarray())
print("First five rows of the dataset:")
df.head(5)
First five rows of the dataset:
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 121412 | 121413 | 121414 | 121415 | 121416 | 121417 | 121418 | 121419 | 121420 | 121421 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 121422 columns
We has a sparse matrix which is an efficient representation of data in which most of the elements are zero. as shown in the first 5 rows of it. Where the rows and columns correspond to the nodes, and the nonzero entries indicate edges between nodes. If there is an edge from node $i$ to node $j$, the value at position $(i, j)$ in the matrix is $1$, while all other entries are $0$.
G_not_directed = nx.from_scipy_sparse_array(sparse_matrix, create_using=nx.Graph)
# Get the number of nodes and edges
num_nodes = G_not_directed.number_of_nodes()
num_edges = G_not_directed.number_of_edges()
# Print the results
print(f"Number of nodes: {num_nodes}")
print(f"Number of edges: {num_edges}")
# this is symatric graph so when looking at the directed graph we get dupple the number of the edges.
Number of nodes: 121422 Number of edges: 334419
sparse_matrix_coo = sparse_matrix.tocoo()
# extract the row, column, and value from sparse matrix
rows = sparse_matrix_coo.row
columns = sparse_matrix_coo.col
values = sparse_matrix_coo.data
print("This is the representation of sparse matrix as it is")
print("Row indices: ", rows)
print("Column indices: ", columns)
print("Values: ", values)
This is the representation of sparse matrix as it is Row indices: [101377 102890 112476 ... 121394 121406 121412] Column indices: [ 0 0 1 ... 121396 121407 121413] Values: [1. 1. 1. ... 1. 1. 1.]
# Get out-degree of all nodes
G = nx.from_scipy_sparse_array(sparse_matrix, create_using=nx.DiGraph)
all_out_degrees = dict(G.out_degree())
out_degree_values = list(all_out_degrees.values()) # out-degree values
plt.figure(figsize=(10, 5))
counts, bins, _ = plt.hist(out_degree_values, bins=range(0, max(out_degree_values) + 2),
color='skyblue', edgecolor='black', align='left')
# Find the last bin index where frequency > 10
last_nonzero_bin_index = (counts > 10).nonzero()[0][-1]
max_x = bins[last_nonzero_bin_index + 1]
plt.xlabel("Out-Degree", fontsize=15)
plt.ylabel("Frequency", fontsize=15)
plt.title("Histogram of Node Out-Degrees", fontsize=20)
plt.xticks(range(0, max(out_degree_values) + 1), rotation=60, fontsize=6)
plt.xlim(0, max_x + 1)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
all_in_degrees = dict(G.in_degree()) # Get in-degree of all nodes
in_degree_values = list(all_in_degrees.values()) # Extract in-degree values
plt.figure(figsize=(10, 5))
counts, bins, _ = plt.hist(in_degree_values, bins=range(0, max(in_degree_values) + 2),
color='skyblue', edgecolor='black', align='left')
# Find the last bin index where frequency > 10
last_nonzero_bin_index = (counts > 10).nonzero()[0][-1]
max_x_out = bins[last_nonzero_bin_index + 1]
plt.xlabel("In-Degree", fontsize=15)
plt.ylabel("Frequency", fontsize=15)
plt.title("Histogram of Node In-Degrees", fontsize=20)
plt.xticks(range(0, max(in_degree_values) + 1), rotation=60, fontsize=7)
plt.xlim(0, max_x_out + 1)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
The first histogram illustrates the distribution of out-degrees in the network, while the second represents the distribution of in-degrees. To make the histograms more visually clear, we focused on nodes with more than 10 edges (either incoming or outgoing). From the two histograms, we observe that the number of incoming edges to a node matches the number of outgoing edges from it.
Q2: Page Rank Analysis¶
Run the PageRank algorithm with $\beta=0.9$ (i.e. teleport probability of $0.1$) and with tolerance of $\epsilon = 10^{-6}$. Plot the tolerance at each iteraction. Record the number of iterations needed for convergence.
Plot the in-degree vs. PageRank of each node in a scatter plot. Next, plot the out-degree vs. PageRank of each node in a separate scatter-plot. Compute in each plot the correlation coefficient. Which one of them is more correlated to PageRank? are the results surprising? explain
def compute_pagerank(sparse_matrix, beta=0.9, tolerance=1e-6, max_iterations=1001):
sparse_matrix = sparse_matrix.tocoo()
# stochastic probability matrix
column_sums = np.array(sparse_matrix.sum(axis=0)).flatten()
column_sums[column_sums == 0] = 1 # Replace zeros with 1 to prevent division errors
normalized_data = sparse_matrix.data / column_sums[sparse_matrix.col]
normalized_matrix = coo_matrix((normalized_data, (sparse_matrix.row, sparse_matrix.col)), shape=sparse_matrix.shape)
num_rows = normalized_matrix.shape[0]
r_new = (1.0 + np.zeros([num_rows, 1])) / num_rows # Uniform initialization
c = (1 - beta) * r_new # Teleportation vector
r_prev = r_new
# Step 3: Power iteration loop
for i in range(1, max_iterations + 1):
# Ensure r_prev is a column vector
r_prev = r_prev.reshape(-1, 1)
# Compute new PageRank vector
r_new = beta * normalized_matrix.dot(r_prev) + c
# Check convergence
diff = np.sum(abs(r_new - r_prev))
if diff < tolerance:
print(f"Converged after {i} iterations with difference {diff:.6e}.")
break
# Update r_prev for the next iteration
r_prev = r_new
return r_new
compute_pagerank(sparse_matrix)
Converged after 104 iterations with difference 9.324176e-07.
array([[7.91947114e-06],
[2.46666341e-06],
[8.18048882e-06],
...,
[4.12617258e-06],
[3.65963302e-06],
[6.89133020e-06]])
# in and out vectors
sparse_matrix_csr = sparse_matrix.tocsr()
out_degree = np.diff(sparse_matrix_csr.indptr) # out-degree vector
sparse_matrix_csc = sparse_matrix.tocsc()
in_degree = np.diff(sparse_matrix_csc.indptr) # in-degree vector
page_rank = compute_pagerank(sparse_matrix).flatten() # the pagerank vector
Converged after 104 iterations with difference 9.324176e-07.
nodes = np.arange(len(page_rank)) # Node indices for reference
def compute_pearson_correlation(x, y, x_label="X", y_label="Y"):
if len(x) != len(y):
raise ValueError("Input lists must have the same length")
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)
numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
denominator_x = sum((xi - mean_x) ** 2 for xi in x) ** 0.5
denominator_y = sum((yi - mean_y) ** 2 for yi in y) ** 0.5
denominator = denominator_x * denominator_y
if denominator == 0:
raise ValueError("Division by zero in correlation calculation")
correlation = numerator / denominator
print(f"Correlation between {x_label} and {y_label}: {correlation:.4f}")
return correlation
def plot_and_correlate(x, y, x_label="X-axis", y_label="Y-axis", title="Scatter Plot", alpha=0.7):
# Scatter plot
plt.figure(figsize=(8, 6))
plt.scatter(x, y, alpha=alpha, label=f"{x_label} vs {y_label}")
plt.xlabel(x_label)
plt.ylabel(y_label)
plt.title(title)
plt.legend()
plt.grid()
plt.show()
# Compute Pearson correlation
compute_pearson_correlation(x, y, x_label="X", y_label="Y")
plot_and_correlate(in_degree, page_rank, "In-Degree", "PageRank", "In-Degree vs PageRank")
Correlation between X and Y: 0.7048
plot_and_correlate(out_degree, page_rank, "Out-Degree", "PageRank", "Out-Degree vs PageRank")
Correlation between X and Y: 0.7048
We are calculating the correlation between the number of incoming edges to a node and its rank in the algorithm, as well as the correlation between the number of outgoing edges from a node and its rank.
The results show that the correlation between the in-degree and the PageRank vector is identical to the correlation between the out-degree and the PageRank vector. This is not surprising, as the graph is symmetric.
Q3: PageRank Analysis of Spammers¶
Suppose that a spammer wants to inflater the PageRank of some pages. Choose random 100 pages and add edges between every pair of them (in both directions). Comptue the resulting PageRank for the new network. Next make two scatter plots: (i.) The page-rank of each page before and after the change to the network by the spammers (ii.) The in-degree vs. the pagerank after the change to the network by the spammer
Explain the changes observed after spamming.
Finally, repeat all steps above (adding spam edges, computing modified page-rank, plotting and explaining the results) for a different spammer strategy: choosing two distinct random sets of 100 nodes, and adding all possible links form the first set to the second set. Which spamming strategy is more effective?
Note: Highlight in different colors the different sets of nodes in your scatter plots
# Randomly select 100 pages
num_nodes = sparse_matrix.shape[0] # Total number of nodes in the network
num_spam_nodes = 100
np.random.seed(42)
spam_nodes = np.random.choice(num_nodes, size=num_spam_nodes, replace=False) # Select 100 random nodes
# Generate edges between every pair of the selected nodes
new_rows, new_cols = [], []
for i in spam_nodes:
for j in spam_nodes:
if i != j: # Avoid self-loops
new_rows.append(i)
new_cols.append(j)
# Add these new edges to the sparse matrix
new_data = np.ones(len(new_rows)) # Assign a weight of 1 to each new edge
spammer_sparse_matrix = coo_matrix(
(np.hstack([sparse_matrix.data, new_data]),
(np.hstack([sparse_matrix.row, new_rows]),
np.hstack([sparse_matrix.col, new_cols]))),
shape=sparse_matrix.shape
).tocsr()
spammer_page_rank = compute_pagerank(spammer_sparse_matrix)
print(spammer_page_rank)
Converged after 104 iterations with difference 9.027335e-07. [[7.91935564e-06] [2.46618905e-06] [8.17937773e-06] ... [4.12551149e-06] [3.65807292e-06] [6.89046689e-06]]
page_rank_befor = page_rank.flatten()
page_rank_after = spammer_page_rank.flatten()
plot_and_correlate(page_rank_befor, page_rank_after, "PageRank Before", "PageRank After", "PageRank Before vs After")
Correlation between X and Y: 0.9986
The correlation of 0.9986 between the PageRank vector before and after the spammer's changes indicates that the overall structure of the graph and the ranking of most nodes remain largely unchanged, despite the addition of 100 random edges.
This high correlation suggests that the random edges introduced by the spammer had a limited impact on the overall PageRank distribution. While local changes (such as the ranks of nodes directly involved in the new edges) may have occurred, the global ranking of nodes remained very similar. This resilience of the PageRank algorithm highlights its robustness to minor changes in the graph.
spammer_sparse_matrix_csr = spammer_sparse_matrix.tocsc()
in_degree_after = np.diff(spammer_sparse_matrix_csr.indptr) # Compute in-degree vector
plot_and_correlate(in_degree_after, page_rank_after, "In-Degree After", "PageRank After", "In-Degree After vs PageRank After")
Correlation between X and Y: 0.6859
When a spammer added 100 random edges, the correlation between the in-degree and the PageRank vector decreased to 0.6859. This indicates that the relationship between the number of edges and the PageRank has weakened.
The drop in correlation suggests that the random edges introduced noise into the graph's structure, disrupting the natural flow of rank distribution and reducing the predictive power of in-degree and out-degree for determining a node's PageRank. This is expected because the added edges create irregularities, making the PageRank algorithm less reflective of the original graph's connectivity.
#Randomly select two distinct sets of 100 nodes each
num_nodes = sparse_matrix.shape[0]
num_spam_nodes = 100
np.random.seed(42)
spam_nodes_1 = np.random.choice(num_nodes, size=num_spam_nodes, replace=False)
remaining_nodes = np.setdiff1d(np.arange(num_nodes), spam_nodes_1)
spam_nodes_2 = np.random.choice(remaining_nodes, size=num_spam_nodes, replace=False)
new_rows = np.repeat(spam_nodes_1, len(spam_nodes_2)) # Source nodes (spam_nodes_1)
new_cols = np.tile(spam_nodes_2, len(spam_nodes_1)) # Target nodes (spam_nodes_2)
new_data = np.ones(len(new_rows))
spammer_sparse_matrix_2 = coo_matrix(
(np.hstack([sparse_matrix.data, new_data]),
(np.hstack([sparse_matrix.row, new_rows]),
np.hstack([sparse_matrix.col, new_cols]))),
shape=sparse_matrix.shape
).tocsr()
spammer_page_rank_2 = compute_pagerank(spammer_sparse_matrix_2)
print(spammer_page_rank_2)
Converged after 104 iterations with difference 9.002983e-07. [[7.91953963e-06] [2.46436015e-06] [8.18184229e-06] ... [4.12697792e-06] [3.66174417e-06] [6.89008866e-06]]
page_rank_befor = page_rank.flatten()
page_rank_after_2 = spammer_page_rank_2.flatten()
plot_and_correlate(page_rank_befor, page_rank_after_2, "PageRank Before", "PageRank After", "PageRank Before vs After for the second strategy")
Correlation between X and Y: 0.9995
The correlation of 0.9995 between the PageRank vector before and after the second strategy (adding edges from one set of 100 random nodes to another distinct set of 100 nodes) indicates that this approach caused slightly more disruption compared to the previous strategy (where the correlation was 0.9986).
This result suggests that while the PageRank distribution remains largely stable, the directed edges between the two distinct sets have introduced more significant changes, likely because they create a directed flow of rank from one group to another. This strategy can amplify the PageRank of the target set (the second group of 100 nodes) while reducing the rank distributed to other parts of the network. However, the overall impact is still limited, as the correlation remains very high, demonstrating the robustness of the PageRank algorithm to such modifications.
spammer_sparse_matrix_csr_2 = spammer_sparse_matrix_2.tocsc()
in_degree_after_2 = np.diff(spammer_sparse_matrix_csr_2.indptr) # Compute in-degree vector
plot_and_correlate(in_degree_after_2, page_rank_after, "In-Degree After", "PageRank After", "In-Degree After vs PageRank After for second stratigy")
Correlation between X and Y: 0.6691
The correlation of 0.6691 between the in-degree and the PageRank vector after the second strategy indicates that the relationship between the number of incoming edges to a node and its PageRank has weakened significantly compared to the original graph.
This drop in correlation suggests that the new directed edges introduced by the strategy disrupted the natural structure of the graph. Such a decrease in correlation reflects the impact of spamming strategies that manipulate connectivity to alter the importance of specific nodes artificially. It highlights how directed, targeted changes can affect the predictive relationship between graph properties and PageRank.
When comparing the two spammer strategies, the second strategy —selecting two distinct sets of 100 nodes and adding directed edges from the first set to the second set— proved to be more effective at disrupting the PageRank distribution.
The correlation between the PageRank vectors before and after the second strategy dropped slightly compared to the first strategy, indicating a greater global impact. Additionally, the second strategy significantly weakened the correlation between in-degree and PageRank, reducing it to 0.6691, whereas the first strategy had less impact on this relationship. This suggests that the second strategy was more successful at disrupting the natural relationship between the graph structure and the PageRank algorithm.
Q4: PageRank Computational Complexity¶
Suppose that $A$ is an $n \times n$ matrix with distinct real eigenvalues and we implement the power method to find the leading eigenvector of $A$.
What is the computational complexity of each iteration of the Power method as a function of $n$ for a general (dense) matrix $A$ as a function of $n$ (use Big-O notation)?
Suppose that we know that $A = B+C$ where $B$ is a sparse matrix with $s$ non-zero values, and $C$ is a low-rank matrix with $rank(C)=k$. Write the algorithm of the Power method using $C$ and $B$ as input and utilizing their structure. What is the computational complexity as a function of $n$, $k$ and $s$? (use Big-O notation). You should represent $B$ and $C$ in a way that will minimize memory usage and computations.
Suppose that we know that the convergence rate of the Power method is quadratic. That is, after $t$ iterations the $L_2$ distance between the pagerank vector $r^t$ computed after $t$ iterations and the true pagerank vector $r$ satisfies $||r^t - r||_2 \leq \frac{a}{t^2}$ for some constant $a > 0$. \ We run the algorithm until reaching tolearance $\epsilon$ , i.e. our algorithm should return a vector $r'$ such that $||r'-r||_2 \leq \epsilon$. What will be the computational complexity of the entire algorithm for the above two cases as a function of $n,k,s$ and $\epsilon$? (you may assume that the constant $a$ is known).
a¶
What is the computational complexity of each iteration of the Power method as a function of n for a general (dense) matrix $A$ as a function of n (use Big-O notation)?
Power Method psoducode with matrix A :
Input: Matrix $A$ $ n \times n$, initial vector $v_0$ $n$-dimensional, tolerance ε, max_iter as numbers. Output: Approximation of the leading eigenvector $v$
Initialize: $v$ ← $v_0$ (Choose a random vector with non-zero entries if not provided) this takes $O(n)$
Repeat until convergence or max iterations:
Compute the new vector: $v_{new}$ ← $A ⋅ v$ this takes $O(n \cdot n)$
Check for convergence: If $||v_{new} - v||$ < ε exit the loop this takes $O(1)$
Update the vector: $v$ ← $v_{new}$ this takes $O(1)$
Increment the counter: $k$ ← $k$ + 1 this takes $O(1)$
If k ≥ max_iter, exit the loop (reached maximum iterations) this takes $O(1)$
Return the leading eigenvector: $v$ this takes $O(1)$
So in each iteration the algotithm takes $O(n^2) $.
b¶
Suppose that we know that $A = B + C$ where $B$ is a sparse matrix with s non-zero values, and $C$ is a low-rank matrix with $Rank( C ) = k$ . Write the algorithm of the Power method using $C$ and $B$ as input and utilizing their structure. What is the computational complexity as a function of $n , k$ and $s$? (use Big-O notation). You should represent $B$ and $C$ in a way that will minimize memory usage and computations.
The psudocode of PowerMethod(B, U, V, x0, num_iterations):¶
Inputs:
- B Sparse matrix with s non-zero values.
- U, V: Factorization of low-rank matrix C. where C = U * V^T, U and V are of size $n × k$.
- $v_0$: Initial guess vector, size n.
- num_iterations: Number of iterations to perform.
Output: $v$: Approximation of the leading eigenvector.
Initialize the vector $v$ with $v_0$. $v$ = $v_0$
for i from 1 to num_iterations:
Compute $B⋅v$ using sparse matrix-vector multiplication. CSR representation ensures memory-efficient storage and operations. $v_1$ = SparseMatrixVectorMultiply(B, v) Time complexity: O(s)
Compute $C ⋅ v$ using the low-rank factorization $C = U ⋅ V^T$. Decompose $C ⋅ v$ as $(U ⋅ (V^T ⋅ v))$.
temp = MatrixVectorMultiply(V^T, v) Time complexity: O(n*k)
$v_2$ = MatrixVectorMultiply(U, temp) Time complexity: O(n*k)
Combine results: $v = v_1 + v_2$. $v$ = VectorAddition($v_1$, $v_2$) Time complexity: O(n)
Normalize $v$ to prevent numerical overflow or underflow. v = NormalizeVector(v) Time complexity: O(n)
return $v$
In each iteration the time compexity of the algorithm is $O(n ⋅ k + s)$ and for $t$ iterations the time comlexity will be $O(t(n ⋅ k + s))$
c¶
Suppose that we know that the convergence rate of the Power method is quadratic. That is, after $t$ iterations the $L_2$ distance between the pagerank vector $r_t$ computed after $t$ iterations and the true pagerank vector $r$ satisfies $|| r_t − r ||^2 ≤ a/t^2$ for some constant $a > 0$. We run the algorithm until reaching tolearance ϵ , i.e. our algorithm should return a vector $r ′$ such that $|| r′ − r ||^2 ≤ ϵ$ . What will be the computational complexity of the entire algorithm for the above two cases as a function of $n , k , s$ and $ϵ$? (you may assume that the constant $a$ is known).
As we saw in the previos point b. the time complexity is $O(t(n ⋅ k + s))$ So we will derive the value of $t$ from the equation $|| r_t − r ||^2 ≤ a/t^2$. as in the algorithm this shold be less than $ϵ$.
$$\frac{a}{t^2} ≤ ϵ$$
$$t^2 ≥ \frac{a}{ϵ}$$
$$t ≥ \sqrt{\frac{a}{ϵ}}$$
So the time compexity of the algorithm is $O(\sqrt{\frac{a}{ϵ}} ⋅ (n ⋅ k + s))$ = $O(\sqrt{\frac{1}{ϵ}} ⋅ (n ⋅ k + s))$.
!jupyter nbconvert --to html --TemplateExporter.exclude_input_prompt=True --TemplateExporter.exclude_output_prompt=True MidTerm_52002_2025_25_212335723_211310578.ipynb