18/10/2021 Client: muhammad11 Deadline: 2 Day

Write A Python Code On The Anaconda Navigator

Resource Information
In this assignment, you will work with the books.csv file, which contains detailed information about books scraped from Goodreads. The dataset was downloaded from the Kaggle website: https://www.kaggle.com/jealousleopard/goodreadsbooks/downloads/goodreadsbooks.zip/6

Each row in the file includes ten columns. Detailed description for each column is provided in the following:

bookID: A unique Identification number for each book.
title: The name under which the book was published.
authors: Names of the authors of the book. Multiple authors are delimited with -.
average_rating: The average rating of the book received in total.
isbn: Another unique number to identify the book, the International Standard Book Number.
isbn13: A 13-digit ISBN to identify the book, instead of the standard 10-digit ISBN.
language_code: Helps understand what is the primary language of the book.
num_pages: Number of pages the book contains.
ratings_count: Total number of ratings the book received.
text_reviews_count: Total number of written text reviews the book received.
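A file with the ten columns described above could be loaded roughly as follows. This is a minimal sketch, not the assignment's solution: the two sample rows are invented for illustration, and reading the ISBN columns as strings and splitting authors on "-" are assumptions based only on the column descriptions (a hyphenated author name would be split incorrectly).

```python
import io

import pandas as pd

# Hypothetical two-row sample with the ten columns described above;
# the values are made up for illustration and do not come from books.csv.
sample_csv = io.StringIO(
    "bookID,title,authors,average_rating,isbn,isbn13,language_code,"
    "num_pages,ratings_count,text_reviews_count\n"
    "1,Example Title One,Author A-Author B,4.10,000000000X,9780000000001,eng,320,1500,120\n"
    "2,Example Title Two,Author C,3.75,0000000001,9780000000002,eng,210,800,45\n"
)

# Read the ISBN columns as strings so leading zeros and a trailing 'X'
# are preserved, and use bookID as the index.
books = pd.read_csv(
    sample_csv,
    index_col="bookID",
    dtype={"isbn": str, "isbn13": str},
)

# Split the '-'-delimited authors field into a list of names per book.
author_lists = books["authors"].str.split("-")

print(books.shape)
print(author_lists.loc[1])
```

With the real file, the io.StringIO object would be replaced by the path to books.csv.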
Task
Write code to do the following:
Use pandas to read the file into a DataFrame named books. The bookID column should be the index of the DataFrame.
Use books.head() to see the first 5 rows of the dataframe.
Use books.shape to find the number of rows and columns in the dataframe.
Use books.describe() to summarize the data.
Use books['authors'].describe() to find the number of unique authors in the dataset and the most frequent author.
Use OLS regression to test whether the average rating of a book depends on its number of pages, its number of ratings, and the total number of written text reviews it received.
Summarize your findings in a Word file.
Instructions
Please follow these directions carefully.
Please type your code in a Jupyter Notebook file and your summary in a Word document named as follows:
HW6YourFirstNameYourLastName.

Python, is one of the most important foundational packages for numerical computing in Python.\n", "\n", "One of the key features of NumPy is its N-dimensional array object, or ndarray, which is a fast, flexible container for large datasets in Python. Arrays enable you to perform mathematical operations on whole blocks of data using similar syntax to the\n", "equivalent operations between scalar elements." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "b = np.array([[ 0, 1, 2, 3, 4],\n", " [ 5, 6, 7, 8, 9],\n", " [10, 11, 12, 13, 14]])\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(b)\n", "type(b)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(b.sum(axis=0)) # sum of each column" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.ones( (5,4) )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Create an array of the given shape and populate it with random samples from a uniform distribution over [0, 1)\n", "np.random.rand(4,2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "---\n", "---\n", "# pandas (https://pandas.pydata.org/)\n", "\n", "- Developed by Wes McKinney.\n", "- pandas contains data structures and data manipulation tools designed to make data cleaning and analysis fast and easy in Python.\n", "- While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed for working with tabular or heterogeneous data. \n", "- NumPy, by contrast, is best suited for working with homogeneous numerical array data.\n", "- Can be used to collect data from different sources such as Yahoo Finance!" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_data = np.random.rand(4,2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(my_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### change the array to a pandas dataframe:\n", "A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_data_df = pd.DataFrame(my_data)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_data_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(my_data_df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_data_df.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#assign columns name\n", "my_data_df = pd.DataFrame(my_data,columns=[\"first column\", \"Second column\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_data_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#assign rows name\n", "my_data_df = pd.DataFrame(my_data,columns=[\"first column\", \"Second column\"],index=['a', 'b', 'c', 'd'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_data_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#There are many ways to 
construct a DataFrame, though one of the most common is\n", "# from a dict of equal-length lists or NumPy arrays:\n", "data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],\n", " 'year': [2000, 2001, 2002, 2001, 2002, 2003],\n", " 'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_t = pd.DataFrame(data)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_t" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#For large DataFrames, the head method selects only the first five rows:\n", "data_t.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_t.tail()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_t.columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#If you specify a sequence of columns, the DataFrame’s columns will be arranged in that order:\n", "pd.DataFrame(data, columns=['year', 'state', 'pop'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2 = pd.DataFrame(data, columns=['year', 'state', 'pop'])\n", "df2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2.set_index('year',inplace=True)\n", "df2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#If you pass a column that isn’t contained in the dict, it will appear with missing values in the result:\n", "data_t2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],\n", " index=['one', 'two', 'three', 'four', 'five', 'six'])" ] }, { "cell_type": "code", "execution_count": null, 
"metadata": {}, "outputs": [], "source": [ "data_t2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_t2.columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#retrieving a column by dict-like notation \n", "data_t2[\"state\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# or by attribute:\n", "data_t2.state" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Rows can be retrieved by position or name with the special loc attribute:\n", "data_t2.loc['three']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Columns can be modified by assignment. \n", "data_t2['debt'] = 16.5" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_t2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_t2['debt'] = np.arange(6.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_t2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "val = pd.DataFrame([2, 4, 5],index=['two', 'four', 'five'])\n", "val" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_t2['debt'] = val" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_t2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_t2['state'] == 'Ohio'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_t2['eastern'] = data_t2['state'] == 'Ohio'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_t2" ] }, { "cell_type": "code", "execution_count": null, 
"metadata": {}, "outputs": [], "source": [ "#The del method can then be used to remove this column:\n", "del data_t2['eastern']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_t2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Index objects are immutable and thus can’t be modified by the user:\n", "data_t2.index" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_t2.index[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_t2.index[0] = 0 #TypeError" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],\n", " 'year': [2000, 2001, 2002, 2001, 2002, 2003],\n", " 'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data3 = pd.DataFrame(data, index=data[\"year\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "del data3['year']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data3.columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "'state' in data3.columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "'state' in data3.index" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "2003 in data3.index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Reindexing:\n", "An important method on pandas objects 
is reindex, which means to create a new object with the data conformed to a new index. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "frame = pd.DataFrame(np.arange(9).reshape((3, 3)),\n", " index=['a', 'c', 'd'],\n", " columns=['Ohio', 'Texas', 'California'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "frame" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "frame2 = frame.reindex(['a', 'b', 'c','d'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "frame2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "frame2.drop('b') " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Indexing, Selection, and Filtering:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_t2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#column selection\n", "data_t2['year']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#row selection: using either axis labels (loc) \n", "data_t2.loc[\"two\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#row selection: using either axis integers (iloc) \n", "data_t2.iloc[1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_t2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_t2.iloc[0,0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_t2.loc[\"one\",\"year\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_t2.iloc[0,0:3]" ] }, { "cell_type": "code", "execution_count": null, 
"metadata": {}, "outputs": [], "source": [ "data_t2.iloc[0,0] = 2010" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_t2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_t2.at[\"two\", \"state\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_t2.at[\"two\", \"state\"] = \"Florida\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_t2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_t2.loc[\"two\", \"state\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Sorting and Ranking" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "frame = pd.DataFrame(np.arange(8).reshape((2, 4)),\n", " index=['three', 'one'],\n", " columns=['d', 'a', 'b', 'c'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "frame" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "frame.sort_index()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "frame.sort_index(axis=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "frame.sort_index().sort_index(axis=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "frame.sort_index(axis=1, ascending=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "frame = pd.DataFrame({'rating': [4.3, 5, 1, 2]}, index=['R1','R2','R3','R4'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "frame" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ 
"frame.rank(ascending=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "frame.sort_values(\"rating\", ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Summarizing and Computing Descriptive Statistics with pandas (good for handling missing data)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],\n", " [np.nan, np.nan], [0.75, -1.3]],\n", " index=['a', 'b', 'c', 'd'],\n", " columns=['one', 'two'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Calling DataFrame’s sum method returns column sums:\n", "df.sum()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.sum(axis='columns')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#note: NA values are excluded.\n", "df.mean(axis='columns')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#This can be disabled with the skipna option:\n", "df.mean(axis='columns', skipna=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### important: describe()\n", "describe provides multiple summary statistics:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### pandas-datareader (https://pandas-datareader.readthedocs.io/en/latest/)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "example: stock prices and volumes obtained from Yahoo! Finance using the add-on pandas-datareader package." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pip install pandas-datareader" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas_datareader.data as web" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "web.get_data_yahoo('AAPL')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "all_data = {ticker: web.get_data_yahoo(ticker) for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(all_data)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "price = pd.DataFrame({ticker: data['Adj Close'] for ticker, data in all_data.items()})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "price" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "volume = pd.DataFrame({ticker: data['Volume'] for ticker, data in all_data.items()})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "volume" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Compute percent changes of the prices\n", "returns = price.pct_change()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "returns" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "returns.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "returns['MSFT'].corr(returns['IBM'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "returns['MSFT'].cov(returns['IBM'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": 
{}, "outputs": [], "source": [ "returns.corr()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "returns.cov()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "returns.corrwith(volume)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example: can you make money with buying and selleing APPLE stock by buying at the opening and selling at the closing?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(all_data)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "apple_data = pd.DataFrame(all_data[\"AAPL\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "apple_data.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "apple_data['Close-Open'] = apple_data['Close'] - apple_data['Open']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "apple_data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "apple_data.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib notebook" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "apple_data[\"Close-Open\"].plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "apple_data.plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "apple_data[['High','Low']].plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "apple_data.plot(x='High',y='Low', kind='scatter')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "___\n", "___\n", "___\n", "___\n", "___\n", "### 
performing regression using statsmodels library" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import statsmodels.api as sm" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "returns.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y = returns['AAPL']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = returns[['IBM', 'MSFT', 'GOOG']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = sm.OLS(y,x)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = sm.OLS(y,x, missing='drop')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result = model.fit()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(result.summary())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "___\n", "___\n", "___\n", "___\n", "___\n", "### read_html\n", "reads tables in a html address as a list" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_list = pd.read_html('https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(my_list)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(my_list)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_list_df = pd.DataFrame(my_list)" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_list_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_list[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(my_list[0])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_list_df = pd.DataFrame(my_list[0])\n", "my_list_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_list_df.set_index('Rank',inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_list_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_list_df.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_list_df['Revenue(USD millions)']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_list_df['Revenue(USD millions)']= my_list_df['Revenue(USD millions)'].replace('[\\$,]', '', regex=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_list_df['Revenue(USD millions)']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_list_df['Revenue(USD millions)']= my_list_df['Revenue(USD millions)'].astype(float)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_list_df['Revenue(USD millions)']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_list_df.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_list_df['Country'].describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_list_df['Country'].unique()" ] 
}, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#filtering companies in United States:\n", "indices = my_list_df['Country']=='United States'\n", "US_companies = my_list_df.loc[indices,:]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "US_companies" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "US_companies.plot(kind='hist',bins=20,alpha=0.8)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "revenue_US = US_companies[\"Revenue(USD millions)\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "revenue_US.plot(kind='hist',bins=30,alpha=0.8)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(revenue_US)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_list_df.groupby(['Country']).count()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_list_df.groupby(['Country']).count().sort_values(\"Name\", ascending=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_list_df.groupby(['Industry']).count().sort_values(\"Name\", ascending=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_list_df.tail()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#export data\n", "my_list_df.to_csv(\"BestCompanies.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "### Another example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", 
"dealership_data = pd.read_csv(\"dealership.csv\", delimiter=\",\")\n", "dealership_data.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dealership_data.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dealership_data = pd.read_csv(\"dealership.csv\", delimiter=\",\", index_col=0)\n", "dealership_data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dealership_data = pd.read_csv(\"dealership.csv\", delimiter=\",\")\n", "dealership_data.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dealership_data.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dealership_data['Profit']= dealership_data['Profit'].replace('[\\$,]', '', regex=True).astype('float')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dealership_data.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dealership_data['Location'].describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dealership_data['Location'].unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Question: Are there any statistically significant differences between the means of profits earned in different locations?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#filtering profit based on location:\n", "indices = dealership_data['Location']=='Tionesta'\n", "Tionesta = dealership_data.loc[indices,:]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "indices = dealership_data['Location']=='Sheffield'\n", "Sheffield = dealership_data.loc[indices,:]\n", "indices = dealership_data['Location']=='Kane'\n", "Kane = dealership_data.loc[indices,:]\n", "indices = dealership_data['Location']=='Olean'\n", "Olean = dealership_data.loc[indices,:]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#plot them\n", "%matplotlib inline\n", "Tionesta['Profit'].plot(kind='hist',bins=15,alpha=0.8, color = 'red')\n", "Sheffield['Profit'].plot(kind='hist',bins=15,alpha=0.8, color = 'green')\n", "Kane['Profit'].plot(kind='hist',bins=15,alpha=0.8, color = 'blue')\n", "Olean['Profit'].plot(kind='hist',bins=15,alpha=0.8, color = 'yellow')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dealership_data.boxplot('Profit',by='Location')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import statsmodels.api as sm\n", "from statsmodels.formula.api import ols" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = ols('Profit ~ Location', data = dealership_data).fit()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ANOVA_table = sm.stats.anova_lm(model,typ=2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(ANOVA_table)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, 
"language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2

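The notebook's final cells fit `Profit ~ Location` with `ols` and print a type-2 ANOVA table. As a rough illustration of what the F statistic in that table measures when there is a single categorical factor, the sketch below computes a one-way ANOVA by hand with only the standard library. The profit figures and the helper name `one_way_anova` are made up for this example and are not taken from the dealership dataset.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F statistic, df_between, df_within) for a dict of label -> values."""
    all_values = [v for vals in groups.values() for v in vals]
    grand_mean = mean(all_values)

    # Variation of the group means around the grand mean, weighted by group size.
    ss_between = sum(
        len(vals) * (mean(vals) - grand_mean) ** 2 for vals in groups.values()
    )
    # Variation of each observation around its own group mean.
    ss_within = sum(
        (v - mean(vals)) ** 2 for vals in groups.values() for v in vals
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical profits per location (illustrative values only).
profits = {
    "Tionesta": [1.0, 2.0, 3.0],
    "Sheffield": [2.0, 3.0, 4.0],
    "Kane": [5.0, 6.0, 7.0],
}

f_stat, df_b, df_w = one_way_anova(profits)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F means the location means differ by more than the spread within each location would suggest. For a single-factor model like the one in the notebook, this ratio should match the F column that `anova_lm` reports.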